iljitsch.com

topics: BGP / IPv6 / more · settings · b&w · my business: inet⁶ consult · Twitter · Mastodon · LinkedIn · email

MySQL Unicode weirdness     (posted 2023-09-21)

After looking at the SQLite Unicode behavior, it's now time to do the same for MySQL. Coincidentally, I'm currently migrating some old databases that were created in the very early 2000s to a more modern environment. I think those old databases were from the MySQL 3.x days, before MySQL gained any sort of Unicode support. Those old tables are thus still in the latin1 (ISO 8859-1) character set.

For a long time that worked fine for me as these tables contain data in English and Dutch, and Dutch only needs a few accents, which are all present in Latin 1. However... at some point it started becoming more convenient to use UTF-8 for the text in those databases. So I did. I just never told MySQL about the switch.

In hindsight, that was not a terrible decision, as it avoided complexities at a time when UTF-8 support was still immature. But these days, there's really no excuse to do anything other than make an entire workflow UTF-8 clean, barring any showstoppers.

So I migrated an old table to the test environment for the new system. And got some really weird results: on web pages "CO₂" showed up as "COâ‚‚", but in MySQL it showed up correct, be it that the number of characters is off (I only got 6 while I specified 8):

mysql> select substr(article, 1587, 8) from muart where id = 753;
+--------------------------+
| substr(article, 1587, 8) |
+--------------------------+
| CO₂ ui                 |
+--------------------------+

Further digging by looking at the raw data:

mysql> select hex(substr(article, 1587, 8)) from muart where id = 753;
+-------------------------------+
| hex(substr(article, 1587, 8)) |
+-------------------------------+
| 434FC3A2E2809AE2809A207569    |
+-------------------------------+

Fortunately, it's not necessary to decode this manually, there is an UTF-8 Decoder web tool for that. The decoder shows that COâ‚‚ is correct. So why is MySQL showing me CO₂? That's a problem I hadn't heard of before, but is apparently not uncommon:

Double-encoded UTF-8.

This happens when you take UTF-8, such as "CO₂", and then pretend it's another encoding (usually Latin 1) and convert that to UTF-8. So what had happened is that as I was importing my data in MySQL, the MySQL command line client would send UTF-8 to the MySQL server process, but the server would think that was Latin 1 and convert it to UTF-8. Then when I did a query, the server would convert the UTF-8 back to what it thought was Latin 1 for the convenience of the client, and the client then showed this as UTF-8 so everything looked good, but was actually stored in the database incorrectly.

Fortunately, the two related problems were both easy enough to fix. First, make sure the client and server agree on the character encoding. Let's first check what the server's original settings are:


mysql> SHOW SESSION VARIABLES LIKE 'character_set%';
+--------------------------+--------------------------------+
| Variable_name            | Value                          |
+--------------------------+--------------------------------+
| character_set_client     | latin1                         |
| character_set_connection | latin1                         |
| character_set_database   | utf8mb4                        |
| character_set_filesystem | binary                         |
| character_set_results    | latin1                         |
| character_set_server     | utf8mb4                        |
| character_set_system     | utf8mb3                        |
| character_sets_dir       | /usr/share/mysql-8.0/charsets/ |
+--------------------------+--------------------------------+

Wow. But fortunately we don't have to fix all of those individually. We can simply do:

mysql --default_character_set=utf8mb4

And then:

mysql> SHOW SESSION VARIABLES LIKE 'character_set%';
+--------------------------+--------------------------------+
| Variable_name            | Value                          |
+--------------------------+--------------------------------+
| character_set_client     | utf8mb4                        |
| character_set_connection | utf8mb4                        |
| character_set_database   | utf8mb4                        |
| character_set_filesystem | binary                         |
| character_set_results    | utf8mb4                        |
| character_set_server     | utf8mb4                        |
| character_set_system     | utf8mb3                        |
| character_sets_dir       | /usr/share/mysql-8.0/charsets/ |
+--------------------------+--------------------------------+

Much better.

And convert the double-encoded UTF-8 into something more readable:

mysql> UPDATE muart SET title=CONVERT(CAST(CONVERT(title USING latin1) AS binary) USING utf8mb4);

Keen-eyed observers may have noted that MySQL has two versions of UTF-8: utf8mb3 and utf8mb4. This goes back to the early days of Unicode, where the idea was that all characters would fit into 16 bits. That results in a maximum of 3 bytes of UTF-8. But it soon became clear that 16 bits wasn't enough. So now it's 21 bits. UTF-8 can handle that just fine, but those characters that need 17 - 21 bits result in 4-byte UTF-8 sequences. So when dealing with MySQL, when you think "UTF-8", type "utf8mb4".

by .


Archives: 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023, 2024