Fixing Broken UTF-8

25 Aug 2015

When working on the i18n bits of Learning PHP 7, I had a problem. My example showing how plain string functions such as strtolower() and strtoupper() mangle multibyte UTF-8 characters was making the book formatting/rendering pipeline barf. The processing tools expect nicely formatted, valid, UTF-8 encoded HTMLBook files. They didn’t like the mangled, invalid UTF-8 sequences in my example output.

To fix this, I wrote the following function to replace invalid UTF-8 sequences with the Unicode Replacement Character (U+FFFD):
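The listing itself isn't reproduced here, so what follows is only a minimal sketch of how such a function could look, not necessarily the one used for the book. The name replace_invalid_utf8() and the regex-based approach are assumptions: the pattern works on raw bytes (no /u modifier), passes runs of well-formed UTF-8 through untouched, and swaps every other byte for the three-byte encoding of U+FFFD.

```php
<?php
// Hypothetical helper: replace each byte that is not part of a well-formed
// UTF-8 sequence with U+FFFD (encoded as the bytes EF BF BD).
function replace_invalid_utf8(string $bytes): string
{
    $replacement = "\xEF\xBF\xBD";

    // Group 1 matches a run of well-formed sequences (overlong forms,
    // surrogates, and values above U+10FFFF are excluded). The final "."
    // matches any single leftover byte.
    $pattern = '/((?:
          [\x00-\x7F]
        | [\xC2-\xDF][\x80-\xBF]
        | \xE0[\xA0-\xBF][\x80-\xBF]
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
        | \xED[\x80-\x9F][\x80-\xBF]
        | \xF0[\x90-\xBF][\x80-\xBF]{2}
        | [\xF1-\xF3][\x80-\xBF]{3}
        | \xF4[\x80-\x8F][\x80-\xBF]{2}
    )++)|./xs';

    $result = preg_replace_callback(
        $pattern,
        function (array $m) use ($replacement) {
            // $m[1] is set only when the valid-run branch matched.
            return $m[1] ?? $replacement;
        },
        $bytes
    );

    if ($result === null) {
        throw new RuntimeException('PCRE error code ' . preg_last_error());
    }
    return $result;
}
```

With this sketch, a truncated sequence such as "caf\xC3" comes out as "caf" followed by a single U+FFFD, and well-formed text is returned unchanged. Emitting one replacement per stray byte keeps the logic simple; other tools may collapse a whole broken sequence into a single U+FFFD instead.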

Now I can keep the real invalid byte sequences in my raw book source code (which makes my automatic “does the output of this code example match what it’s supposed to?” checker happy) but end up with a nice � (constructed from three valid bytes) in the formatted output.

Tagged with php
