sklar.com

© 1994-2014. David Sklar. All rights reserved.

Visiting each character in a string

So I’ve got this string (in PHP) and I need to scan through it character by character. I can’t scan byte by byte because it’s 2007, our users write in all sorts of languages, and the string is UTF-8.
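To make the byte-versus-character problem concrete, here is a small sketch. It assumes the mbstring extension is loaded, and the sample string is made up for illustration. The "é" is written as its two UTF-8 bytes so the example does not depend on how the file is saved.

```php
<?php
// "café" with the é spelled out as its two UTF-8 bytes (0xC3 0xA9).
$s = "caf\xC3\xA9";

echo strlen($s), "\n";                              // 5 (bytes)
echo mb_strlen($s, 'UTF-8'), "\n";                  // 4 (characters)

// Byte-wise substr() splits the é in half; mb_substr() returns it whole.
echo bin2hex(substr($s, 3, 1)), "\n";               // c3
echo bin2hex(mb_substr($s, 3, 1, 'UTF-8')), "\n";   // c3a9
```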

The PHP 5 solution uses mb_strlen() to find the length and then mb_substr() to grab each character:

$j = mb_strlen($theString);
for ($k = 0; $k < $j; $k++) {
    $char = mb_substr($theString, $k, 1);
    // do stuff with $char
}
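For something you can paste and run, here is a minimal sketch of the same loop wrapped in a function. The name countChar() and the sample input are hypothetical, and it passes 'UTF-8' explicitly instead of relying on mbstring's internal encoding setting, which is an assumption about your configuration rather than part of the snippet above.

```php
<?php
// Hypothetical helper: counts the characters in a UTF-8 string equal to $needle.
function countChar($theString, $needle)
{
    $count = 0;
    $j = mb_strlen($theString, 'UTF-8');
    for ($k = 0; $k < $j; $k++) {
        // Pull out exactly one character (not one byte) at position $k.
        $char = mb_substr($theString, $k, 1, 'UTF-8');
        if ($char === $needle) {
            $count++;
        }
    }
    return $count;
}

// Assumes this file is saved as UTF-8.
echo countChar("<p>naïve café</p>", '<'), "\n"; // 2
```

Calling mb_strlen() once before the loop, as in the original snippet, keeps the length calculation out of the loop condition.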

In PHP 6, one would do:

foreach (new TextIterator($theString, TextIterator::CHARACTER) as $char) {
    // do stuff with $char
}

Some rough benchmarks on a 1500-character (2900-byte) string give me about 61 scans/sec with PHP 5.2.1, where a “scan” is one pass through the mb_substr() loop above with a single if() test comparing each character to ‘<’. (This is on Linux, with whatever processor is inside this Thinkpad T43 here; your mileage may vary, etc.)
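A scans/sec figure like that can be measured with a rough harness along these lines. This is a sketch, not the script behind the numbers in this post: scanOnce(), scansPerSecond(), the sample text, and the iteration count are all made up for illustration, and it assumes mbstring is loaded.

```php
<?php
// Hypothetical: one "scan" walks the string with mb_substr() and tests each char against '<'.
function scanOnce($theString)
{
    $hits = 0;
    $j = mb_strlen($theString, 'UTF-8');
    for ($k = 0; $k < $j; $k++) {
        if (mb_substr($theString, $k, 1, 'UTF-8') === '<') {
            $hits++;
        }
    }
    return $hits;
}

// Hypothetical: runs $iterations scans and returns scans per second.
function scansPerSecond($theString, $iterations)
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        scanOnce($theString);
    }
    $elapsed = microtime(true) - $start;
    // Guard against a zero reading from a coarse clock.
    return $elapsed > 0 ? $iterations / $elapsed : INF;
}

// Made-up sample with some multibyte characters (assumes the file is saved as UTF-8).
$sample = str_repeat("<b>héllo wörld</b> ", 100);
printf("%.1f scans/sec\n", scansPerSecond($sample, 50));
```

The number it prints depends entirely on the machine, the PHP version, and the string, so treat it as a relative measure only.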

Under PHP 6.0.0-dev with unicode.semantics=on, switching from mb_strlen() and mb_substr() to regular strlen() and substr() produces about the same result. And indexing with $theString[$k] is the same speed as substr().

However, the TextIterator case is much faster, about 450 scans/sec!

Nicely done!
