Name Sorting on this Website

Polish, like most European languages including English, is based on the Latin alphabet. To represent additional sounds, some of the letters are modified to create new letters, each of which closely resembles its unmodified form. In Polish, these additional letters are: ąćęłńóśźż. How does one alphabetize the new letters? The Polish alphabetization rule (also called a collation) is to place each new letter immediately after its unmodified form. So the Polish alphabetical order is: aąbcćdeęfghijklłmnńoóprsśtuwyzźż.
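The rule above can be sketched in a few lines of Python. This is only an illustration: the names POLISH_ALPHABET and polish_key and the sample place names are made up for the example, and sorting any character outside the Polish alphabet (such as q, v, x) after all Polish letters is a simplifying assumption.

```python
# Polish alphabetical order, as given in the text above.
POLISH_ALPHABET = "aąbcćdeęfghijklłmnńoóprsśtuwyzźż"

# Map each letter to its position in the Polish alphabet.
RANK = {letter: position for position, letter in enumerate(POLISH_ALPHABET)}


def polish_key(word):
    """Return a sort key that orders words by the Polish alphabet.

    Assumption: characters not in the Polish alphabet sort after all
    Polish letters, ordered among themselves by code point.
    """
    return [RANK.get(ch, len(POLISH_ALPHABET) + ord(ch)) for ch in word.lower()]


words = ["Łódź", "Lublin", "Zamość", "Łęczyca", "Legnica", "Żary"]
polish_sorted = sorted(words, key=polish_key)
print(polish_sorted)
```

In this sketch the Ł names land after every L name and before the Z names, which is exactly where a reader unfamiliar with the Polish collation would not think to look.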

Now if you are like most Americans, you probably didn't know that. A Polish alphabetized list of words (or names) might cause problems for those who are unaware of the Polish collation. For example, someone looking for a place correctly spelled Łęczyca might conclude it doesn't exist if they only look among the L words. However, had they looked among the Ł words, which come later, they would have found it. In an international environment where people don't always know the collation order, those of us interested in data handling adopt a new collation where modified letters are treated as being the same as their unmodified counterparts. If I may use math symbols here, in the Polish collation L<Ł but in the new (international) collation L=Ł. Most of the indexes on this website use this new (international) collation.

The above paragraphs contain the main take-aways of this article. The rest of this article deals with some historical/technical issues that I chose to document but that are not of general interest.

In the early days of computers, there was the ASCII encoding of characters (the English alphabet, numbers, etc.). This encoding used only one byte per character. ASCII itself defined only 128 codes, and even a full byte allows only 256, so characters from all the other languages could not be accommodated. To represent characters of a particular language, the codes left unused by ASCII were assigned to them. One needed to specify which "code page" one was using to display the correct characters. Because there are many languages, there were many code pages. A browser that didn't use the correct code page displayed the wrong characters. This problem was largely eliminated with the universal coded character set, Unicode, and its most common encoding, UTF-8. In UTF-8 each character is represented by one to four bytes, which allows for more than a million different characters. So now most text is encoded using UTF-8. The encoding defines the code a character should have but does not specify how the character should be sorted (collated); that is the job of a separate collation.
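A small Python sketch may make the code page problem concrete. It relies on two codecs that ship with Python, cp1250 (a Central European code page) and latin-1 (a Western European one); the variable names are made up for illustration. UTF-8 uses between one and four bytes per character, which the second half of the sketch measures.

```python
# One byte, two meanings: the code page decides which character is shown.
polish_byte = "Ł".encode("cp1250")        # Ł stored as a single byte
misread = polish_byte.decode("latin-1")   # the same byte read with the wrong code page

# UTF-8 needs no code page; it simply spends more bytes on some characters.
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in ["L", "Ł", "€", "😀"]}

print(polish_byte, misread)
print(utf8_lengths)
```

Here the single byte that means Ł in one code page comes out as £ in the other, while in UTF-8 the four sample characters take 1, 2, 3 and 4 bytes respectively.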

This article started by talking about collation, which is used to sort words in a particular alphabetical order. I mentioned the new (international) collation and what it does. One such collation is utf8_general_ci (utf8 specifies the character encoding scheme, general means it sorts letters by the unmodified letter they look like, and ci means the sorting is case insensitive). The bad news is that this collation has an error in it. The Polish Ł should be treated the same as an L but instead is treated as though it occurs after Z. So does that mean you should have been looking for Ł words after Z? If I had actually sorted by that collation, the answer would have been yes. But when I realized the problem, I worked around it by creating another field in which the word consists solely of the equivalent unmodified letters. I then sorted on that field but displayed the field with the Polish spelling (so the Ł words are filed among the L words).
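The workaround can be sketched in Python as follows. This is not the code used on this website, only an illustration of the idea: the names FOLD, sort_key and places are hypothetical, and Python's default sort (by character code) stands in for a collation that files Ł after Z.

```python
# Map each Polish letter to the unmodified letter it looks like.
# An explicit table is used because Ł has no Unicode decomposition,
# so generic accent stripping would leave it untouched.
FOLD = str.maketrans("ąćęłńóśźżĄĆĘŁŃÓŚŹŻ", "acelnoszzACELNOSZZ")


def sort_key(name):
    """Build the extra 'sort field': unmodified letters, case ignored."""
    return name.translate(FOLD).lower()


places = ["Zamość", "Łęczyca", "Lublin", "Legnica", "Łódź"]

# Sorting by character code files the Ł names after Z.
naive = sorted(places)

# Sort on the folded field but keep the Polish spelling for display.
rows = sorted((sort_key(name), name) for name in places)
displayed = [name for _, name in rows]

print(naive)
print(displayed)
```

In the second list Łęczyca and Łódź are filed among the L names even though they are still displayed with their Polish spelling.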

Obviously Poles were not consulted about the general collation and its implementation. When Poles complained about the error, they were told it wasn't an error! Excuse me? I suppose if you are given bad information and act on it correctly then perhaps it's not an error in their eyes. What about a fix? The powers that be refused to fix the problem. Apparently they tried to fix some other error once and the fix caused all kinds of grief for those already using the broken collation. Instead, they now come up with new collations that fix the known errors. So why didn't I just use an updated collation and solve the problem? Web service providers decide what resources users are allowed access to, and a better collation was not one of them. Just as the "fixed" collation caused problems, upgrades can too, so web service providers are reluctant to upgrade their systems for fear of "breaking" someone's website.

In January 2021, when this website was moved to a new server, an updated collation (utf8_unicode_520_ci) became available. So far I've found only one instance on this website where this new collation has been necessary. Everywhere else, the workaround discussed above could have continued to be used.