Will the real revolution on the Internet please step forward? Welcome Unicode

Created 26th May, 2006 15:34 (UTC), last edited 17th June, 2006 10:02 (UTC)

Unicode has been with us for a while now. The first standard was published in 1991 CE with Unicode 1. Now we have Unicode 4.1.0 with Unicode 5 in beta. The IETF now specify that protocols must be able to use UTF-8 (RFC 2277) and it's already the standard encoding with XML and hence XHTML (barring some protocol hiccoughs with MIME types and default character sets).

Roll of honour

  • Technorati wouldn't know anything about Chinese bloggers if they didn't have their service working properly with Unicode.
  • Wikipedia has not only shown that anarchy can be useful, but just as importantly the English language site has shown that you can migrate a million pages of content from ISO 8859-1 to UTF-8.
  • IIS 6 defaults to UTF-8 URL encodings. To be fair of course it isn't just them, you're just more likely to run into it there.
  • All the major browsers for handling Unicode pretty well despite the encoding confusion.
  • All developers who know the difference between UCS-2 and UTF-16.

The list is incomplete of course. Let me know what else I should be adding.
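For anyone unsure about that last entry, the difference between UCS-2 and UTF-16 comes down to surrogate pairs: UCS-2 can only represent code points up to U+FFFF, while UTF-16 encodes anything above that as two 16 bit code units. Here is a minimal sketch of that arithmetic in Python. The function name utf16_code_units is my own invention for illustration, and it makes no attempt to reject lone surrogates.

```python
def utf16_code_units(ch):
    """Return the UTF-16 code units for a single character as a list of ints."""
    cp = ord(ch)
    if cp < 0x10000:
        # Inside the Basic Multilingual Plane: one 16 bit unit, same as UCS-2.
        return [cp]
    # Outside the BMP: split the 20 remaining bits into a surrogate pair.
    cp -= 0x10000
    high = 0xD800 + (cp >> 10)    # top 10 bits
    low = 0xDC00 + (cp & 0x3FF)   # bottom 10 bits
    return [high, low]


if __name__ == "__main__":
    for ch in ["A", "\u00e6", "\U0001D11E"]:  # 'A', 'æ', musical G clef
        units = utf16_code_units(ch)
        print("U+%04X -> %s" % (ord(ch), " ".join("%04X" % u for u in units)))
```

The G clef (U+1D11E) comes out as the two units D834 DD1E, which is exactly the case that software treating every 16 bit value as a whole character gets wrong.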

Roll of shame

And then there's the dark side.

  • Hotmail and Yahoo mail still seem to enjoy throwing spurious rubbish all over UTF-8 encoded mails.
  • The number of developers who still think that Unicode is 16 bits.
  • Many of the social networking web sites have a variety of interesting problems handling UTF-8 encoded URIs. (Many also have a very creative way of dealing with '+' signs. Yes, I mean Technorati, but that's a subject best explored at another time.)
  • Any system I see that spells my name SÃ¦lensminde (it should be Sælensminde), or decides that the page titles on my site end with â€” kirit.com.
  • Any programming API that returns a 16 bit value when fetching a single character from a Unicode string.
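The mangled name and page title complaint in the list above is what you get when UTF-8 bytes are decoded as if they were a single-byte encoding. A small Python sketch of the failure, assuming the wrong guess is Windows-1252 (the usual culprit, though I can't say which encoding any particular site actually used). The helper names misdecode and repair are mine, purely for illustration.

```python
def misdecode(text, wrong="cp1252"):
    """Encode text as UTF-8, then decode the bytes with the wrong encoding."""
    # Note: cp1252 leaves a few byte values undefined, so some input will
    # raise UnicodeDecodeError here instead of producing rubbish.
    return text.encode("utf-8").decode(wrong)


def repair(garbled, wrong="cp1252"):
    """Undo misdecode(), provided nothing was lost along the way."""
    return garbled.encode(wrong).decode("utf-8")


if __name__ == "__main__":
    for original in ["S\u00e6lensminde", "\u2014 kirit.com"]:
        garbled = misdecode(original)
        print(original, "->", garbled, "->", repair(garbled))
```

The two bytes of the UTF-8 encoded æ turn into two separate characters, and the three bytes of the dash turn into three. The round trip only works while the rubbish is still intact, which it rarely is once it has been stored and re-encoded a few times.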

Moving on

Site developers clearly still have a lot to learn about Unicode. Issues with PostgreSQL (now fixed) have also shown up problems with site design systems like PHP. These issues are just the tip of the iceberg. It is these continuing misunderstandings about encodings that are slowing the adoption of Unicode.

As for me, I'm trying to do my bit by using outrageous UTF-8 URIs to see what I can break. I've broken a lot so far, but as everybody I've contacted has either fixed and updated their systems or is waiting for their next roll-out, I felt it unfair to name and shame. If you want to test your systems then see how you get on with linking to pages like สงกรานต์ and drop me a line to tell me how you get on.
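If you want to see what such a link looks like on the wire, here is a rough Python sketch of how a UTF-8 encoded path segment gets percent-encoded and decoded again. It uses the standard urllib.parse functions. The helper names are only for illustration, and I am assuming the whole word is a single path segment.

```python
from urllib.parse import quote, unquote


def encode_segment(segment):
    """Percent-encode one path segment using its UTF-8 bytes."""
    # safe="" means even '/' and '+' are escaped inside the segment.
    return quote(segment, safe="", encoding="utf-8")


def decode_segment(encoded):
    """Turn a percent-encoded segment back into text, assuming UTF-8."""
    return unquote(encoded, encoding="utf-8", errors="strict")


if __name__ == "__main__":
    word = "สงกรานต์"
    encoded = encode_segment(word)
    print(encoded)
    print(len(word), "code points,", len(word.encode("utf-8")), "bytes")
    print(decode_segment(encoded) == word)
```

Each Thai character takes three bytes in UTF-8, so every one becomes three %XX escapes. A system that decodes those escapes with anything other than UTF-8, or decodes them twice, is the sort of thing these links tend to find.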