Visiting each character in a string

25 Apr 2007

So I’ve got this string (in PHP) and I need to scan through it character by character. I can’t scan byte by byte because it’s 2007, our users write in all sorts of languages, and the string is UTF-8.
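To make the byte-versus-character problem concrete, here is a minimal sketch, assuming PHP 5 with the mbstring extension loaded and no function overloading. The sample word is my own, not taken from any real data, and the ï is spelled out as its two UTF-8 bytes so the example does not depend on how this file is saved.

```php
<?php
// "naïve" with the ï written as its two UTF-8 bytes (0xC3 0xAF).
$theString = "na\xC3\xAFve";

$byteCount = strlen($theString);              // counts bytes
$charCount = mb_strlen($theString, 'UTF-8');  // counts characters

// Indexing by byte lands in the middle of the two-byte character,
// while mb_substr() returns the whole character.
$thirdByte = $theString[2];
$thirdChar = mb_substr($theString, 2, 1, 'UTF-8');

echo $byteCount, " bytes, ", $charCount, " characters\n"; // 6 bytes, 5 characters
echo bin2hex($thirdByte), " vs ", bin2hex($thirdChar), "\n"; // c3 vs c3af
```

A byte-by-byte scan would hand the loop body half a character at position 2, which is exactly what needs avoiding.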

The PHP 5 solution uses mb_strlen() to find the length and then mb_substr() to grab each character:

$j = mb_strlen($theString);
for ($k = 0; $k < $j; $k++) {
    $char = mb_substr($theString, $k, 1);
    // do stuff with $char
}
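One likely cost of this loop is that every mb_substr() call has to find character $k by counting from the start of the string. As a possible alternative in PHP 5 (not something benchmarked in this post), the string can be split into characters once with preg_split() and the /u modifier. The helper name utf8Chars() and the sample string are made up for illustration.

```php
<?php
// Hypothetical helper: split a UTF-8 string into characters in one pass.
function utf8Chars($theString)
{
    // The /u modifier makes PCRE treat the subject as UTF-8, so the empty
    // pattern matches between characters rather than between bytes.
    $chars = preg_split('//u', $theString, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split() returns false when the input is not valid UTF-8.
    return $chars === false ? array() : $chars;
}

$count = 0;
foreach (utf8Chars("a<b\xC3\xA9<") as $char) {
    if ($char === '<') {
        $count++;
    }
}
echo $count, "\n"; // 2
```

The trade-off is memory: this builds an array with one element per character, which may or may not matter for strings of a few thousand bytes.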

In PHP 6, one would do:

foreach (new TextIterator($theString, TextIterator::CHARACTER) as $char) {
    // do stuff with $char
}

Some rough benchmarks on a 1500-character (2900-byte) string give me about 61 scans/sec with PHP 5.2.1, where a “scan” is one pass through the mb_substr() loop above with a single if() test comparing each character to ‘<’. (This is on Linux, with whatever processor is inside this Thinkpad T43 here; your mileage may vary, etc.)
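For anyone who wants to repeat the measurement, a harness along these lines would produce a scans/sec figure. This is a rough sketch rather than the exact script behind the numbers here: the function name scansPerSecond(), the time budget, and the test string are all assumptions of mine.

```php
<?php
// Hypothetical harness: count how many full scans of $theString
// complete within a time budget, then report scans per second.
function scansPerSecond($theString, $seconds = 1.0)
{
    $scans = 0;
    $start = microtime(true);
    do {
        // One "scan": visit every character and compare it to '<'.
        $j = mb_strlen($theString, 'UTF-8');
        for ($k = 0; $k < $j; $k++) {
            $char = mb_substr($theString, $k, 1, 'UTF-8');
            if ($char === '<') {
                // a real scan would do something here
            }
        }
        $scans++;
        $elapsed = microtime(true) - $start;
    } while ($elapsed < $seconds);

    return $scans / $elapsed;
}

// Made-up test data with some multibyte characters, not the original string.
$theString = str_repeat("abc <\xC3\xA9> ", 200);
printf("%.1f scans/sec\n", scansPerSecond($theString, 0.5));
```

Running for a fixed time budget instead of a fixed number of scans keeps the run short on slow machines and still gives a usable rate on fast ones.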

Under PHP 6.0.0-dev with unicode.semantics=on, switching from mb_strlen() and mb_substr() to regular strlen() and substr() produces about the same result. And indexing with $theString[$k] is the same speed as substr().

However, the TextIterator case is much faster, about 450 scans/sec!

Nicely done!

Tagged with php, ning
