Fixing Broken UTF-8
25 Aug 2015
While working on the i18n bits of Learning PHP 7, I ran into a problem. My example showing how plain string functions such as strtolower() and strtoupper() mangle multibyte UTF-8 characters was making the book formatting/rendering pipeline barf. The processing tools expect nicely formatted, valid, UTF-8 encoded HTMLBook files, and they didn’t like the mangled, invalid UTF-8 sequences in my example output.
To fix this, I wrote the following function to replace invalid UTF-8 sequences with the Unicode Replacement Character (U+FFFD):
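A minimal sketch of what such a function can look like, not necessarily the exact code used for the book. The name replace_invalid_utf8() is hypothetical, and the approach (a byte-level regular expression that accepts well-formed sequences as laid out in RFC 3629 and swaps any other byte for U+FFFD) is one assumption among several workable ones:

```php
<?php
// Hypothetical helper: replace every byte that is not part of a
// well-formed UTF-8 sequence with U+FFFD (encoded as EF BF BD).
function replace_invalid_utf8(string $bytes): string
{
    // No /u modifier: the pattern works on raw bytes.
    // Alternatives 1-8 match well-formed sequences (RFC 3629 ranges,
    // which exclude overlong forms and surrogates). The final (.)
    // catches any single byte that none of them accepted.
    $pattern = '/
          [\x00-\x7F]                        # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # 2-byte sequence
        | \xE0[\xA0-\xBF][\x80-\xBF]         # 3-byte, no overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # 3-byte, general
        | \xED[\x80-\x9F][\x80-\xBF]         # 3-byte, no surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # 4-byte, no overlongs
        | [\xF1-\xF3][\x80-\xBF]{3}          # 4-byte, general
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # 4-byte, up to U+10FFFF
        | (.)                                # anything else: one stray byte
    /xs';

    return preg_replace_callback(
        $pattern,
        function (array $m): string {
            // Group 1 is only set when the catch-all branch matched.
            return isset($m[1]) ? "\xEF\xBF\xBD" : $m[0];
        },
        $bytes
    );
}

// A lone lead byte (C3 with no continuation byte) becomes U+FFFD;
// the valid text around it is left alone.
echo replace_invalid_utf8("caf\xC3 au lait"), "\n";
```

This sketch emits one replacement character per invalid byte. Other tools may collapse a run of bad bytes into a single U+FFFD, so the count of replacement characters can differ between implementations.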
Now I can keep the real invalid byte sequences in my raw book source code (which makes my automatic “does the output of this code example match what it’s supposed to?” checker happy) but end up with a nice � (constructed from three valid bytes) in the formatted output.
Tagged with php