Looking at SQLite Unicode behavior (posted 2023-09-05)
In this post, I want to have a look at how SQLite interacts with Unicode. (Also see my post The (dark) magic of Unicode.) As explained here, SQLite doesn't have full Unicode support unless that support is explicitly included when SQLite is compiled.
So what does this mean in practice?
On the (non-Windows) systems I have access to, SQLite uses UTF-8 to store text. UTF-16 is also supported. However, a limitation that's made very explicit is that changing from upper case to lower case or the other way around is only supported for 7-bit ASCII. So the 26 letters we know from the English alphabet.
Let's see for ourselves:
% sqlite3 test.db
sqlite> .mode qbox
sqlite> create table test (id integer primary key, text text);
sqlite> insert into test (text) values (format('SM%sRG%sSBORD', 'O', 'A'));
sqlite> insert into test (text) values (format('SM%sRG%sSBORD', char(0xD6), char(0xC5)));
sqlite> insert into test (text) values (format('SM%sRG%sSBORD', 'O'||char(0x308), 'A'||char(0x30A)));
sqlite> insert into test (text) values (format('SM%sRG%sSBORD', 'U', 'A'));
We've now added four variations of the word "SMORGASBORD" to our test database table. And that worked:
sqlite> select * from test; ┌────┬─────────────────┐ │ id │ text │ ├────┼─────────────────┤ │ 1 │ 'SMORGASBORD' │ │ 2 │ 'SMÖRGÅSBORD' │ │ 3 │ 'SMÖRGÅSBORD' │ │ 4 │ 'SMURGASBORD' │ └────┴─────────────────┘
The first version uses an unaccented O and A. The second version uses the "composed" version of the accented letters. That means that a single Unicode character / code point encodes the letter/accent combination. The third version uses the regular letters followed by a "combining diacritical mark", a special Unicode character that adds an accent to the previous character. And then finally I added "SMURGASBORD" which will come in handy later.
(The combining diacritical marks throw off the SQLite text length / box size calculations, hence the misplaced vertical line at the end of row 3.)
Those upper case letters are a bit overbearing, though, so let's convert them to lower case.
sqlite> update test set text = lower(text);
sqlite> select * from test; ┌────┬─────────────────┐ │ id │ text │ ├────┼─────────────────┤ │ 1 │ 'smorgasbord' │ │ 2 │ 'smÖrgÅsbord' │ │ 3 │ 'smörgåsbord' │ │ 4 │ 'smurgasbord' │ └────┴─────────────────┘
And here SQLite's lower()/upper() limitation pops up. The first and last rows are converted without trouble because those only use the letters out of the regular Latin script alphabet. The second one uses letter/accent combinations outside of the 7-bit ASCII range so those letters are not converted to lower case. Interestingly, the third row did get converted. That's because the letter part is encoded using the regular O and A, which SQLite can convert to lower case. The diacritical combing marks that add the accents then apply to the now lower case o and a without missing a beat.
So the take home message is: don't let SQLite convert between upper and lower case.
In any event, the rules for how to do this properly vary by language. For instance, an English speaker may think the following makes sense:
istanbul = Istanbul ijsbeer = Ijsbeer één = Één
But that would be wrong. In Turkish, there is a dotted i and an un-dotted i, the ı. Ok, that's not so bad. Until you realize that i becomes İ and ı becomes I. And in Dutch an "ijsbeer" (polar (ice) bear) needs to become "IJsbeer" at the start of a sentence because IJ is considered a single letter. And, unlike for instance the French, we don't put accents on capital letters, so "één" (one) at the beginning of a sentence becomes "Eén". So these are correct:
istanbul = İstanbul ijsbeer = IJsbeer één = Eén
Also, because the composed and decomposed variations of smörgåsbord use different sequences of Unicode characters, if you compare the same word with those different encodings, they won't match. But that part can be solved by normalizing your Unicode string to either composed or decomposed, and then comparisons will work.
But then there's sorting to worry about.
sqlite> select id, length(text) as chars, octet_length(text) bytes, text from test order by text;
┌────┬───────┬───────┬─────────────────┐ │ id │ chars │ bytes │ text │ ├────┼───────┼───────┼─────────────────┤ │ 1 │ 11 │ 11 │ 'smorgasbord' │ │ 3 │ 13 │ 15 │ 'smörgåsbord' │ │ 4 │ 11 │ 11 │ 'smurgasbord' │ │ 2 │ 11 │ 13 │ 'smÖrgÅsbord' │ └────┴───────┴───────┴─────────────────┘
(We can see here that "length" shows the number of Unicode code points. "octet_length" is the length in bytes, which is different as soon as a text contains anything that isn't 7-bit ASCII.)
Considering what we discussed before, this makes perfect sense: "smo" and "smo" come before "smu" (the fact that the second "smo" is followed by a diacritical combining mark is not relevant when looking at the first three characters) and then "smÖ" comes after "smu" because Ö is a "high-ASCII" character while "u" is just a regular 7-bit ASCII character.
Of course this is not really a correct sorting, as that again depends on the language/locale. In some languages, letters with accents are considered letters in their own right with a specific place in the sorting order, while in other languages, the accent is simply ignored and äbc comes between abb and abd. And that's just for "general" sorting.
Name/phonebook sorting often has its own rules. For instance, so many people here in the Netherlands have a name starting with "van" (of) or "van de" (of the) that these prefixes are generally ignored for name sorting. And many names have an "ij" and a "y" form. So "van Beijnum" is found under B, and is sorted together with "van Beynum".
So these caveats are something to be aware of. But there is an escape hatch: you can compile SQLite with SQLITE_ENABLE_ICU to gain full Unicode support. But this makes SQLite a good deal bigger and slower. Dilemmas...