Visiting each character in a string
25 Apr 2007
So I’ve got this string (in PHP) and I need to scan through it character by character. I can’t scan byte by byte because it’s 2007, our users write in all sorts of languages, and the string is UTF-8.
The PHP 5 solution uses mb_strlen() to find the length and then mb_substr() to grab each character:
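// Pass the encoding explicitly so the result does not depend on
// the mbstring internal-encoding setting.
$j = mb_strlen($theString, 'UTF-8');
for ($k = 0; $k < $j; $k++) {
    $char = mb_substr($theString, $k, 1, 'UTF-8');
    // do stuff with $char
}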
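Each mb_substr() call in that loop has to find character $k by walking the UTF-8 bytes from the start of the string, so the cost grows quickly with length. One alternative that also works in PHP 5 is to split the string into characters once, using PCRE's /u modifier. This is only a sketch and was not part of the benchmarks in this post; utf8_chars() is a made-up helper name, and it splits on code points, so combining sequences come out as separate pieces.

```php
<?php
// Hypothetical helper (not from the original post): split a UTF-8 string
// into an array of characters in one pass using PCRE's /u modifier.
function utf8_chars($theString) {
    // The empty pattern matches between every pair of code points;
    // PREG_SPLIT_NO_EMPTY drops the empty pieces at both ends.
    $chars = preg_split('//u', $theString, -1, PREG_SPLIT_NO_EMPTY);
    // preg_split() returns false if the input is not valid UTF-8.
    return $chars === false ? array() : $chars;
}

foreach (utf8_chars("héllo <b>") as $char) {
    if ($char === '<') {
        // do stuff with $char
    }
}
```

The trade-off is memory: the whole string is expanded into an array of one-character strings before the loop starts.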
In PHP 6, one would do:
foreach (new TextIterator($theString, TextIterator::CHARACTER) as $char) {
    // do stuff with $char
}
Some rough benchmarks on a 1500-character (2900-byte) string give me about 61 scans/sec with PHP 5.2.1, where a “scan” is one pass through the mb_substr() loop above with a single if() test comparing each character to ‘<’. (This is on Linux with whatever processor is inside this Thinkpad T43; your mileage may vary, etc.)
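For anyone who wants to try something similar, a scans/sec measurement could be put together roughly like this. It is a sketch under assumptions, not the script used for the numbers here: scan_once() and scans_per_second() are hypothetical names, and the sample string is invented (about 1500 characters of mixed ASCII and accented text, not the original test data).

```php
<?php
// Hypothetical harness (names are illustrative, not from the original post).
// One "scan" walks the whole string with mb_substr() and tests each character.
function scan_once($theString) {
    $hits = 0;
    $j = mb_strlen($theString, 'UTF-8');
    for ($k = 0; $k < $j; $k++) {
        $char = mb_substr($theString, $k, 1, 'UTF-8');
        if ($char === '<') {
            $hits++;
        }
    }
    return $hits;
}

// Run as many scans as fit in $seconds and return the scans/sec rate.
function scans_per_second($theString, $seconds = 1.0) {
    $scans = 0;
    $start = microtime(true);
    do {
        scan_once($theString);
        $scans++;
        $elapsed = microtime(true) - $start;
    } while ($elapsed < $seconds);
    return $scans / $elapsed;
}

// Assumed test input: an 18-character snippet repeated 84 times.
$sample = str_repeat("naïve <b>café</b> ", 84);
printf("%.1f scans/sec\n", scans_per_second($sample));
```

Timing a fixed wall-clock window rather than a fixed number of iterations keeps the run short even when one variant is several times slower than another.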
Under PHP 6.0.0-dev with unicode.semantics=on, switching from mb_strlen() and mb_substr() to regular strlen() and substr() gives about the same speed. Indexing with $theString[$k] is also the same speed as substr().
However, the TextIterator case is much faster, about 450 scans/sec!
Nicely done!