Finally out of MySQL encoding hell [en]

[fr] Description de comment je me suis sortie des problèmes d'encodage qui résultaient en l'affichage de hiéroglyphes sur tous les sites hébergés sur mon serveur.

It took weeks, mainly because I was busy with a car accident and the end of school, but it also took about two real whole days of head-banging on the desk to get it fixed.

Here’s what happened: remember, a long time ago, I had trouble with stuff in my database which was [supposed to be UTF-8 but seemed to be ISO-8859-1](http://climbtothestars.org/archives/2004/07/18/converting-mysql-database-contents-to-utf-8/)? And then, sometime later, I had a [weird mixture of UTF-8 and ISO-8859-1 in the same database](http://climbtothestars.org/archives/2005/02/19/problemes-dencodage-mysql/)?

Well, somewhere along the line this is what I guess happened: my database installation must have been serving UTF-8 content as ISO-8859-1, leading me to believe it was ISO-8859-1 when it was in fact UTF-8. That led *me* to try to convert it to UTF-8 — meaning I took UTF-8 strings and ran them through a converter supposed to turn ISO-8859-1 into UTF-8. The result? Let’s call it “double-UTF-8” (doubly encoded UTF-8), for want of a better name.

Anyway, that’s what I had in my database. When we upgraded MySQL and PHP on the server, I suddenly started seeing a load of junk instead of my accented characters:

encoding-problem-2

What I was seeing looked furiously like UTF-8 looks when your server setup is messed up and serves it as ISO-8859-1 instead. But, as you can see on the picture above, this page was being served as UTF-8 by the server. How did I know it wasn’t ISO-8859-1 in my database instead of this hypothetical “double-UTF-8”? Well, for one, I knew the page was served as UTF-8, and I also know that ISO-8859-1 (latin-1) served as UTF-8 makes accented characters look like question marks. Then, if I wanted to be sure, I could just change the page encoding in Firefox to ISO-8859-1 (that should make it look right if it was ISO-8859-1, shouldn’t it?) Well, it made it [look worse](http://flickr.com/photos/bunny/179791233/).

Another indication was that when the MySQL connection encoding (in my.cnf) was set back to latin-1 (ISO-8859-1), the pages seemed to display correctly, but WordPress broke.

The first post on the picture I’m showing here looks “OK”, because it was posted after the setup was changed. It really is UTF-8.

Now how did we solve this? My initial idea was to take the “double-UTF-8” content of the database (and don’t forget it was mixed with the more recent UTF-8 content) and convert it “from UTF-8 to ISO-8859-1”. I had [a python script we had used to fix the last MySQL disaster](http://sungnyemun.org/code/fixEncodings.py.gz) which converted everything to UTF-8 — I figured I could reverse it. So I rounded up a bunch of smart people ([dda_](http://sungnyemun.org/), [sbp](http://inamidst.com/sbp/), [bonsaikitten](http://gentooexperimental.org/~patrick/weblog/) and [Blackb|rd](http://www.oydt.de/) — and countless others, sorry if I forgot you!) and got to work.

It proved a hairier problem than expected. What also proved hairy was explaining the problem to people who wanted to help and insisted in misunderstanding the situation. In the end, we produced a script (well, “they” rather than “we”) which looked like it should work, only… it did nothing. If you’re really interested in looking at it, [here it is](/code/fixDoubleEncodings.py) — but be warned, don’t try it.

We tried recode. We tried iconv. We tried changing my.cnf settings, dumping the databases, changing them back, and importing the dumps. Finally, the problem was solved manually.

1. Made a text file listing the databases which needed to be cured (dblist.txt).
2. Dumped them all: for db in $(cat dblist.txt); do mysqldump --opt -u user -ppassword ${db} > ${db}-20060712.sql; done
3. Sent them over to Blackb|rd who did some search and replace magic in vim, starting with [this list of characters](http://climbtothestars.org/play/characters.txt) (just change the browser encoding to latin-1 to see what they look like when mangled)
4. Imported the corrected dumps back in: for db in $(cat dblist.txt); do mysql -u user -ppassword ${db} < ${db}-20060712.sql; done

Blackb|rd produced [a shell script for vim](http://www.schwarzvogel.de/software-misc.shtml) (?) which I’ll link to as soon as I lay my hands on the URL again. The list of characters to convert was produced by trial and error, knowing that corrupted characters appeared in the text file as A tilde or A circonflexe followed by something else. I’d then change the my.cnf setting back to latin-1 to view the character strings in context and allow Blackbr|d to see what they needed to be replaced with.

Thanks. Not looking forward to the next MySQL encoding problem. They just seem to get worse and worse. (And yes, I *do* use UTF-8 all over the place.)

Similar Posts: