Internationalised URLs

Created 5th December, 2007 12:16 (UTC), last edited 5th December, 2007 13:57 (UTC)

I'm pleased that Phil Haack has written about URLs and hopefully raised awareness a bit more widely of the issues surrounding internationalising them.

He focusses on how to create URLs, which is only one half of the story. The other half is what happens to the URLs once they get out in the wild. Do they survive? I've had internationalised URLs on this site for a couple of years now. For example, the page on the Thai New Year, สงกรานต์, has been very popular for the last two years.

The truth of the matter is that although the URLs work well in browsers and with the major search engines, a lot of other web sites can't handle them at all. There are two big problems.

Problem 1 — What characters can be in URLs anyway?

Too many developers working with URLs don't know how they work — even those who should know better. All of the characters below are legal to use in the path specification of a URL without encoding and without special meaning:

$-_.+!*'(),;:@&=

The number of parsers that get this wrong is staggering. None of these characters should be treated as delimiters by parsers that try to autolink URLs. This of course causes a problem when people put punctuation around the URL. There are only two safe ways to delimit a URL: whitespace, or angle brackets (< and >). Of course, most end users won't think of this when writing.
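To make the delimiter rule concrete, here is a minimal sketch of an autolinker that stops only at whitespace and angle brackets. The name find_urls and the restriction to http and https are my own assumptions for illustration; a real autolinker would need more care.

```python
import re

# Assumption: a candidate URL starts with http:// or https:// and runs
# until whitespace or an angle bracket. Nothing else ends the match, so
# characters such as , ; ( ) ! $ + stay part of the URL.
URL_PATTERN = re.compile(r"https?://[^\s<>]+")

def find_urls(text):
    """Return every URL-like run in text, in order of appearance."""
    return URL_PATTERN.findall(text)

sample = "See <http://example.com/a,b;c=(d)!> and http://example.com/x$y+z now"
found = find_urls(sample)
print(found)
```

The trade-off is the one described above: a full stop or comma typed straight after a URL, with no space, ends up inside the link, because those characters are legal in a path.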

Because those of us who create URLs know that there isn't really a good solution for this, we tend to encode these characters anyway. Whatever you do when you process our URL, don't decode these characters and then decide that they really mean something else. (Especially don't turn %2B into a + and then turn that into a space! Yes, I'm looking at you, Technorati and StumbleUpon, and countless others. The + is a substitute for a space only in query strings, not in the path specification!) Which brings me on to…
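To make the %2B point concrete, here is a small sketch using Python's urllib.parse. The example URL is made up; the point is only that the path and the query string follow different rules for +.

```python
from urllib.parse import parse_qs, unquote, unquote_plus, urlsplit

# Hypothetical URL: the path contains an encoded "c++", the query a form-style space.
url = "http://example.com/c%2B%2B/notes?q=a+b"
parts = urlsplit(url)

# In the path, %2B decodes to a literal plus sign and nothing more.
path_decoded = unquote(parts.path)

# The mistake: decoding and then applying the query-string rule to the path.
path_mangled = unquote_plus(path_decoded)

# Only in the query string does + stand for a space.
query = parse_qs(parts.query)

print(path_decoded, repr(path_mangled), query)
```

Even the correct decoding is only safe for display. If the URL is going to be passed on, leave the path exactly as it arrived.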

Problem 2 — What encoding is used?

The second problem is that the character encoding used in a URL is not specified. Phil uses UTF-8 for his example, as do many people who know Unicode. Unfortunately many developers don't know Unicode and just use whatever encoding happens to be lying about as the default setting on their computer.
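As a rough illustration of why this matters, the sketch below percent-encodes the first letter of สงกรานต์ twice, once as UTF-8 and once as TIS-620, a legacy Thai encoding I have picked purely as an example of a non-Unicode default.

```python
from urllib.parse import quote, unquote

char = "ส"  # first letter of สงกรานต์

# The same character gives two different URLs depending on the encoding.
as_utf8 = quote(char, encoding="utf-8")      # '%E0%B8%AA'
as_tis620 = quote(char, encoding="tis-620")  # '%CA'

# Nothing in the URL says which encoding was used. Guessing wrong does not
# raise an error by default; it quietly produces a replacement character.
misread = unquote(as_tis620, encoding="utf-8")
roundtrip = unquote(as_tis620, encoding="tis-620")

print(as_utf8, as_tis620, repr(misread), roundtrip)
```

Only the server that issued the URL knows which of the two readings is right.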

The major browsers are all converging on UTF-8 encoded URLs (finally), but a lot of the rest of the world uses other encodings and there is no way to specify in the URL which encoding is being used. This wouldn't matter if so many web sites didn't try to read the URL. A URL is an opaque string with meaning only to the server it addresses. Apart from splitting the path specification at forward slashes (which are expressly reserved for providing hierarchy) you should never change the URL. You don't know what it means and you don't know what it is even when you think you do. Do not try to correct my URL — you don't know anything about it!
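If you really must look inside a path, something like the following sketch is about as far as it is safe to go. The helper name path_segments is hypothetical; it splits at forward slashes and leaves every segment exactly as it was encoded.

```python
from urllib.parse import unquote, urlsplit

def path_segments(url):
    """Split the path at forward slashes without decoding anything."""
    return urlsplit(url).path.split("/")

# Hypothetical URL whose first segment contains an encoded slash.
url = "http://example.com/a%2Fb/c%2B%2B"

safe = path_segments(url)

# Decoding before splitting invents a level of hierarchy that was never there.
broken = unquote(urlsplit(url).path).split("/")

print(safe)
print(broken)
```

The first result keeps the two segments the server intended; the second has three, and none of them can be turned back into the original URL with any certainty.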

Further reading

I'm also working on a full guide to handling URLs, but it isn't ready yet.

