Encoding problems

A nice paper typography by Caro Wallis

Storing, managing and displaying characters should be easy. It seems like this is not the case yet.

Encodings are a way to store and recall a sequence of symbols, such as a string, in a binary format. For instance, the ASCII encoding maps the sequence “abc” to the binary sequence 0110 0001 0110 0010 0110 0011 so we can send it over a network link, store it in a file or keep it in memory.
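To make the “abc” example concrete, here is a minimal Python sketch. The helper name `to_bits` is my own invention for illustration; only `str.encode` comes from the language itself.

```python
def to_bits(text: str, encoding: str = "ascii") -> str:
    """Encode text, then render each byte as two groups of four bits."""
    data = text.encode(encoding)  # str -> bytes, using the given encoding
    return " ".join(f"{b >> 4:04b} {b & 0x0F:04b}" for b in data)


print(to_bits("abc"))  # 0110 0001 0110 0010 0110 0011

# Decoding with the same encoding gives the original text back.
print("abc".encode("ascii").decode("ascii"))  # abc
```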

Encodings are absolutely needed. We would not be able to work with text in a computerized way without them. They pose a problem because they are misunderstood and misused.

Encoding problems are not rare. They are not uncommon. They are endemic. They are so present in my daily life that I do not even pay attention to them anymore. They show up everywhere, from small programs to multimillion-dollar projects.

Here is the latest example of this problem.

One of my Amazon Canada bills. They cannot spell my name correctly.

One of my Amazon US bills. They know how to spell my name.

Even Amazon has encoding problems between its different systems. No wonder it is a plague in other software too. Having a character in your name that falls outside the English alphabet is a reliable way to spot this kind of problem.
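Why do only some names get hit? Here is a small sketch of one plausible failure: text written as UTF-8 but read back as ISO-8859-1. I do not know what actually happens inside Amazon's systems, so the mismatch, the helper `misread_utf8_as_latin1` and the sample names are all assumptions made for illustration.

```python
def misread_utf8_as_latin1(text: str) -> str:
    """Simulate one system writing UTF-8 and another reading ISO-8859-1."""
    raw = text.encode("utf-8")       # what the first system stores
    return raw.decode("iso-8859-1")  # what the second system believes it reads


# An ASCII-only name survives; an accented one comes out garbled.
for name in ("Amelie", "Amélie"):
    print(name, "->", misread_utf8_as_latin1(name))
```

The unaccented name comes through untouched, which is why this class of bug stays invisible to anyone who only tests with English text.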

In the Good Old Days, when computers had to run with quite limited memory, storage and bandwidth, there were many different encodings for different languages. In order to display a given binary representation, or even to find out which symbols it stood for, you had to know which encoding had been used to produce it.
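As an illustration, the very same byte stands for a different symbol in each of these legacy encodings. This is only a sketch: the helper `decode_with` is a made-up name, and the three encodings are just examples I picked among the many that existed.

```python
def decode_with(raw: bytes, encodings: tuple) -> dict:
    """Decode the same bytes under several encodings and collect the results."""
    return {encoding: raw.decode(encoding) for encoding in encodings}


# One byte, three interpretations: Western European, Greek and Cyrillic.
results = decode_with(b"\xe9", ("iso-8859-1", "iso-8859-7", "cp1251"))
for encoding, symbol in results.items():
    print(encoding, "->", symbol)
```

Nothing in the byte itself says which interpretation is the right one. That information has to travel alongside the data.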

Nowadays, we have Unicode. Unicode is an attempt at mapping every known character of every language currently in use. It even maps the symbols of ancient and mostly extinct languages.

Unicode has several ways to encode its big table in a binary format. The common ones are UTF-8 and UTF-16. UTF-8 maximizes compatibility with ASCII: a text containing strictly ASCII characters has the same binary representation whether it is encoded as ASCII or as UTF-8.
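The ASCII compatibility claim is easy to check. In this short sketch the sample strings are arbitrary, and `utf-16-le` is used so that no byte order mark gets prepended to the output.

```python
ascii_text = "Encoding"

# Pure ASCII text: the ASCII bytes and the UTF-8 bytes are identical.
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
print(same_bytes)  # True

# UTF-16 spends two bytes on each of these characters, so the bytes differ.
print(len(ascii_text.encode("utf-8")))      # 8
print(len(ascii_text.encode("utf-16-le")))  # 16

# Outside ASCII, UTF-8 switches to multi-byte sequences.
print("é".encode("utf-8"))  # b'\xc3\xa9'
```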

A common mistake is assuming that the world lives in English. My recommendation is to spend your next vacation week in a small town in Japan to find out whether they live in English. Another common mistake is just plain not knowing that characters are encoded to a binary format and decoded from it. I have known many programmers who did not know. I did not know either, until I ran into the same kind of problem that Amazon Canada currently has. Colleges and universities should teach this kind of stuff, but that is another discussion.

The most common mistake is going from one encoding to another without thinking about the possible loss of information. You cannot go from Unicode to ISO-8859-1 without possibly losing many characters in the process. Think about it. Unicode version 5.1 maps 100,713 symbols while ISO-8859-1 maps 191 symbols. There is a big gap between these two.
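Here is what that loss looks like in practice. This is a minimal sketch: the helper `to_latin1_lossy` and the sample text are my own, and `errors="replace"` is only one of the error-handling policies Python offers.

```python
def to_latin1_lossy(text: str) -> str:
    """Round-trip text through ISO-8859-1, replacing what does not fit with '?'."""
    raw = text.encode("iso-8859-1", errors="replace")
    return raw.decode("iso-8859-1")


text = "naïve 日本語"

# By default, Python refuses to drop characters silently.
try:
    text.encode("iso-8859-1")
except UnicodeEncodeError as err:
    print("cannot encode:", err.reason)

# Forcing the conversion keeps 'ï' (it exists in ISO-8859-1) but loses the kanji.
print(to_latin1_lossy(text))  # naïve ???
```

Once the question marks are stored, the original characters are gone for good. No later decoding step can bring them back.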

Unicode is no silver bullet, but when in doubt, you should use it. There are not many reasons not to. Using it in every case and in every system should be the norm.

Comments 1

  1. cbelisle wrote:

    Same problem here, but with my last name. So a long time ago, I took the decision that on the Internet, my name doesn’t contain accents.

    Unicode is not young (early 90’s) and it’s still not used in many cases. It’s sad…

    Posted 21 Jan 2009 at 3:58 pm
