Finally out of MySQL encoding hell

A description of how I got out of the encoding problems that resulted in hieroglyphs being displayed on every site hosted on my server.

It took weeks, mainly because I was busy with a car accident and the end of school, but it also took about two real whole days of head-banging on the desk to get it fixed.

Here’s what happened: remember, a long time ago, I had trouble with stuff in my database which was supposed to be UTF-8 but seemed to be ISO-8859-1? And then, sometime later, I had a weird mixture of UTF-8 and ISO-8859-1 in the same database?

Well, somewhere along the line this is what I guess happened: my database installation must have been serving UTF-8 content as ISO-8859-1, leading me to believe it was ISO-8859-1 when it was in fact UTF-8. That led me to try to convert it to UTF-8 — meaning I took UTF-8 strings and ran them through a converter supposed to turn ISO-8859-1 into UTF-8. The result? Let’s call it “double-UTF-8” (doubly encoded UTF-8), for want of a better name.
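To make the "double-UTF-8" idea concrete, here is a minimal Python 3 sketch of how a single accented character ends up doubly encoded, and how one round of the damage can be undone. It is only an illustration under the assumption that the wrong conversion was a plain ISO-8859-1 to UTF-8 pass; the variable names are made up for the example, and this is not the script mentioned later in the post.

```python
# How "double-UTF-8" can arise, assuming the mistaken conversion
# treated correct UTF-8 bytes as ISO-8859-1.
original = "é"                                 # what was typed
utf8_bytes = original.encode("utf-8")          # b'\xc3\xa9', correct UTF-8

# The bytes are wrongly believed to be ISO-8859-1...
misread = utf8_bytes.decode("iso-8859-1")      # 'Ã©' (two characters)

# ...and "converted" to UTF-8 a second time.
double_utf8 = misread.encode("utf-8")          # b'\xc3\x83\xc2\xa9'

# Undoing one round: decode as UTF-8, re-encode as ISO-8859-1 to get the
# original bytes back, then decode those as the UTF-8 they always were.
repaired = double_utf8.decode("utf-8").encode("iso-8859-1").decode("utf-8")

print(double_utf8)   # b'\xc3\x83\xc2\xa9'
print(repaired)      # é
```

Note that this round trip only works on text that is doubly encoded throughout. Applied to content that is already correct UTF-8, it would either fail or mangle the text, which is part of why a mixed database is so awkward to fix.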

Anyway, that’s what I had in my database. When we upgraded MySQL and PHP on the server, I suddenly started seeing a load of junk instead of my accented characters:

What I was seeing looked an awful lot like what UTF-8 looks like when your server setup is messed up and serves it as ISO-8859-1 instead. But, as you can see in the picture above, this page was being served as UTF-8 by the server. How did I know it wasn’t ISO-8859-1 in my database instead of this hypothetical “double-UTF-8”? Well, for one, I knew the page was served as UTF-8, and I also know that ISO-8859-1 (latin-1) served as UTF-8 makes accented characters look like question marks. Then, if I wanted to be sure, I could just change the page encoding in Firefox to ISO-8859-1 (that should make it look right if it was ISO-8859-1, shouldn’t it?). Well, it made it look worse.
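The reasoning above can be checked with a small Python 3 sketch. The helper `shown_as` is a hypothetical name introduced for this example; it only approximates what a browser does (real browsers treat ISO-8859-1 as windows-1252 and have their own error handling), but it shows the three cases being distinguished.

```python
def shown_as(stored: bytes, page_encoding: str) -> str:
    """Roughly what a browser would display for these bytes under the
    given page encoding. Undecodable bytes become U+FFFD, the
    'question mark' replacement character."""
    return stored.decode(page_encoding, errors="replace")

word = "été"
latin1 = word.encode("iso-8859-1")                    # real ISO-8859-1 data
utf8 = word.encode("utf-8")                           # real UTF-8 data
double = utf8.decode("iso-8859-1").encode("utf-8")    # "double-UTF-8" data

# Real ISO-8859-1 served as UTF-8: accents turn into replacement characters.
print(repr(shown_as(latin1, "utf-8")))        # '\ufffdt\ufffd'

# Double-UTF-8 served as UTF-8: the classic A-tilde junk.
print(shown_as(double, "utf-8"))              # Ã©tÃ©

# Double-UTF-8 viewed as ISO-8859-1: one character per byte, so even worse.
print(len(shown_as(double, "iso-8859-1")))    # 9

# Correct UTF-8 served as UTF-8 is fine.
print(shown_as(utf8, "utf-8"))                # été
```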

Another indication was that when the MySQL connection encoding (in my.cnf) was set back to latin-1 (ISO-8859-1), the pages seemed to display correctly, but WordPress broke.

The first post in the picture I’m showing here looks “OK”, because it was posted after the setup was changed. It really is UTF-8.

Now how did we solve this? My initial idea was to take the “double-UTF-8” content of the database (and don’t forget it was mixed with the more recent UTF-8 content) and convert it “from UTF-8 to ISO-8859-1”. I had a Python script we had used to fix the last MySQL disaster, which converted everything to UTF-8, and I figured I could reverse it. So I rounded up a bunch of smart people (dda_, sbp, bonsaikitten and Blackb|rd, plus countless others; sorry if I forgot you!) and got to work.

It proved a hairier problem than expected. What also proved hairy was explaining the problem to people who wanted to help and insisted on misunderstanding the situation. In the end, we produced a script (well, “they” rather than “we”) which looked like it should work, only… it did nothing. If you’re really interested in looking at it, here it is, but be warned: don’t try it.

We tried recode. We tried iconv. We tried changing my.cnf settings, dumping the databases, changing them back, and importing the dumps. Finally, the problem was solved manually.

1. Made a text file listing the databases which needed to be cured (dblist.txt).
2. Dumped them all: for db in $(cat dblist.txt); do mysqldump --opt -u user -ppassword ${db} > ${db}-20060712.sql; done
3. Sent them over to Blackb|rd, who did some search and replace magic in vim, starting with this list of characters (just change the browser encoding to latin-1 to see what they look like when mangled).
4. Imported the corrected dumps back in: for db in $(cat dblist.txt); do mysql -u user -ppassword ${db} < ${db}-20060712.sql; done

Blackb|rd produced a shell script ~~for vim~~ ~~(?) which I’ll link to as soon as I lay my hands on the URL again~~. The list of characters to convert was produced by trial and error, knowing that corrupted characters appeared in the text file as an A with a tilde (Ã) or an A with a circumflex (Â) followed by something else. I’d then change the my.cnf setting back to latin-1 to view the character strings in context and allow Blackb|rd to see what they needed to be replaced with.
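For illustration, the same kind of targeted search and replace can be sketched in Python 3 instead of building the character list by hand. This is not Blackb|rd's script: `MANGLED` and `repair` are names invented for the example, and it assumes the damage is exactly one round of ISO-8859-1 misreading on characters that take two bytes in UTF-8 (which covers the accented Latin-1 letters, but not things like curly quotes).

```python
import re

# Assumption: a mangled character is "Â" (U+00C2) or "Ã" (U+00C3)
# followed by one character in U+0080..U+00BF, i.e. a two-byte UTF-8
# sequence that was misread as two ISO-8859-1 characters.
MANGLED = re.compile("[\u00c2\u00c3][\u0080-\u00bf]")

def repair(text: str) -> str:
    """Undo one round of double encoding in a single pass, leaving
    characters that are already correct untouched."""
    return MANGLED.sub(
        lambda m: m.group(0).encode("iso-8859-1").decode("utf-8"),
        text,
    )

# Mixed content: the first "déjà" is doubly encoded, the second is fine.
mixed = "d\u00c3\u00a9j\u00c3\u00a0 vu, déjà vu"
print(repair(mixed))   # déjà vu, déjà vu
```

Doing the substitution in one regex pass, rather than as a series of separate replacements, avoids the case where one fix produces a new "Ã" that a later replacement then mangles again. The trade-off is that any text which legitimately contains "Ã" or "Â" followed by one of those characters would be altered too, so a dump should be checked by eye before being imported back.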

Thanks. Not looking forward to the next MySQL encoding problem. They just seem to get worse and worse. (And yes, I do use UTF-8 all over the place.)

6 thoughts on “Finally out of MySQL encoding hell [en]”

Blackb|rd says:

12.07.2006 at 11:25

It’s just a plain shell script, no need for vim. I used vim only as a test bed for the s///-expressions that I used with another fabulous tool: sed.
Pingback: Climb to the Stars (Stephanie Booth)
Climb to the Stars (Stephanie Booth) says:

03.02.2007 at 22:53

Invalid argument supplied for foreach() in wp-capabilities.php: Case Cracked!…

Finally. At last. Endlich. Enfin.

Once more, while trying to transfer a WordPress installation from one server to another, I found myself facing the dreaded problem which locks me out of my WordPress install with a rather cryptic message:

Warning: In…
Ana says:

24.09.2008 at 10:07

Hi

I googled encoding hell and found this –very illustrative. I am having this problem too. I hate encoding.
jaimehall says:

07.10.2012 at 16:29

there is a related thingie that i have noted even some web frameworks get wrong—the ambiguity in encoding of URLs as sent by the browser. firefox2, for one, likes to send out URLs in the system encoding (or so i guess) where possible and to switch to UTF-8 when not possible.

this is why typing http://example.com/ä will give me http://example.com/%E4 in the browser address bar, but typing http://example.com/äЖ will give me http://example.com/%C3%A4%D0%96 where ä is %C3%A4. neat, eyh? means you have to have at least one fallback encoding scheme ready for when UTF-8 fails.

the nasty thing here is that (1) there is to my knowledge no way for the client to include information about URL encoding used; (2) you can only have one fallback encoding scheme that is activated by UTF-8 decoding failure—schemes like Latin-1 do not have a big chance to throw an exception on string.decode; (3) what fallback scheme to use will depend on the geolocation of the majority of your clients; (4) since URLs are typically rather short pieces of text, a heuristic encoding detection (such as the ones browser employ for web pages) should be difficult to impossible to implement; (5) if the secondary URL encoding is really governed by system settings it may prove exceedingly difficult to testdrive whatever setup you choose.
Paul says:

19.03.2014 at 21:14

I cannot possibly express my gratitude enough for this! I’ve been beating my head against a wall forever on this issue. I tried the same things you had without success. It seemed to me that iconv-ish attempts (along with “saving as” or “opening as” UTF8 file types in SublimeText and Notepad++) only convert the file type. They don’t convert the characters themselves. Some instructions I found on WordPress (http://codex.wordpress.org/Converting_Database_Character_Sets) did convert the characters – but had the nasty side effect of deleting half the content of random posts and pages. Thanks again! You guys rawk!
