Even faster UTF-8 character counting
I recently came across two articles, "Counting characters in UTF-8 strings is fast" by Kragen Sitaker, and "Counting characters in UTF-8 strings is fast(er)" by George Pollard, which provide a series of successively faster ways of (as the article names suggest) counting the number of UTF-8 characters in a NUL-terminated string. We can do better.

Kragen takes the approach of examining each byte in sequence and asking (a) whether it is the terminating NUL, and (b) whether it is the first byte of a UTF-8 character. The second test is quite easy: bytes 0x01 through 0x7F in UTF-8 represent the corresponding ASCII characters and bytes 0xC0 through 0xFF are the first byte of a multi-byte character, while bytes 0x80 through 0xBF are trailing ("continuation") bytes. This results in the following inner loop (modulo some style changes to make it easier to compare this against later versions):
while (s[i]) {
    if ((s[i] & 0xC0) != 0x80)
        j++;
    i++;
}
return (j);
Kragen continues by comparing this to an optimized version written in x86 assembly language by Aristotle Pagaltzis; Aristotle's version cleverly takes advantage of the shl instruction setting the sign, carry, and zero flags, but otherwise applies exactly the same algorithm:
loopa:  dec %ecx
loopb:  lodsb
        shl $1, %al
        js loopa
        jc loopb
        jnz loopa

However, this assembly language version, like Kragen's C version, inspects each of the bytes one by one, which inherently limits the performance.
George Pollard makes the assumption that the input string is valid UTF-8, and notices that the first byte of a multibyte character determines the character's length: if the first byte is between 0xC0 and 0xDF, the UTF-8 character is two bytes long; if it is between 0xE0 and 0xEF, the character is three bytes long; and if it is between 0xF0 and 0xFF, the character is four bytes long. After reading the first byte of a multibyte character, George skips over the trailing bytes. He also fast-paths the handling of ASCII characters, treating characters as signed bytes in order to distinguish between ASCII and non-ASCII characters, while giving a wonderful example of using a goto to jump from the middle of one loop into the middle of another:
while (s[i] > 0) {
ascii:
    i++;
}
count += i - iBefore;

while (s[i]) {
    if (s[i] > 0) {
        iBefore = i;
        goto ascii;
    } else {
        switch (0xF0 & s[i]) {
        case 0xE0:
            i += 3;
            break;
        case 0xF0:
            i += 4;
            break;
        default:
            i += 2;
            break;
        }
    }
    count++;
}

While this code is considerably faster than both Kragen's C code and Aristotle's assembly code, it suffers from two performance-limiting factors: first, it uses conditional branches, which will only be consistently predicted correctly if all of the characters encountered have the same length; and second, it still inspects characters one by one.
This can be improved in three ways:
- Instead of using conditional branches, identify the initial bytes of UTF-8 characters using logical operations only (a per-byte sketch of this test follows the list).
- Instead of handling one character at a time, vectorize: handle many bytes in parallel.
- In order to reduce the cost of waiting for memory, prefetch data if possible.
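To make the first point concrete, here is a minimal standalone sketch of the branchless test (the expression is the same one used in the full function below; the is_continuation wrapper and the sample bytes are my own illustration):

#include <stdio.h>

/*
 * Evaluates to 1 if b is a UTF-8 continuation byte (10xxxxxx) and to 0
 * otherwise, using only shifts and a bitwise AND -- no conditional branches.
 */
static unsigned int
is_continuation(unsigned char b)
{
    return ((b >> 7) & ((~b) >> 6));
}

int
main(void)
{
    printf("%u %u %u\n",
        is_continuation(0x41),   /* 'A' (ASCII): 0 */
        is_continuation(0xC3),   /* first byte of a two-byte character: 0 */
        is_continuation(0xA9));  /* continuation byte: 1 */
    return (0);
}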
#define ONEMASK ((size_t)(-1) / 0xFF)

static size_t
cp_strlen_utf8(const char * _s)
{
    const char * s;
    size_t count = 0;
    size_t u;
    unsigned char b;

    /* Handle any initial misaligned bytes. */
    for (s = _s; (uintptr_t)(s) & (sizeof(size_t) - 1); s++) {
        b = *s;

        /* Exit if we hit a zero byte. */
        if (b == '\0')
            goto done;

        /* Is this byte NOT the first byte of a character? */
        count += (b >> 7) & ((~b) >> 6);
    }

    /* Handle complete blocks. */
    for (; ; s += sizeof(size_t)) {
        /* Prefetch 256 bytes ahead. */
        __builtin_prefetch(&s[256], 0, 0);

        /* Grab 4 or 8 bytes of UTF-8 data. */
        u = *(size_t *)(s);

        /* Exit the loop if there are any zero bytes. */
        if ((u - ONEMASK) & (~u) & (ONEMASK * 0x80))
            break;

        /* Count bytes which are NOT the first byte of a character. */
        u = ((u & (ONEMASK * 0x80)) >> 7) & ((~u) >> 6);
        count += (u * ONEMASK) >> ((sizeof(size_t) - 1) * 8);
    }

    /* Take care of any left-over bytes. */
    for (; ; s++) {
        b = *s;

        /* Exit if we hit a zero byte. */
        if (b == '\0')
            break;

        /* Is this byte NOT the first byte of a character? */
        count += (b >> 7) & ((~b) >> 6);
    }

done:
    return ((s - _s) - count);
}
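As a quick sanity check, a minimal test harness along these lines (my own addition, not part of the original code) can be dropped into the same file as cp_strlen_utf8; it assumes <stdint.h> has been included above for uintptr_t, and that both the source file and the compiler use UTF-8 string literals:

#include <stdio.h>

int
main(void)
{
    /* Expected output: 12, 5, and 5 characters respectively. */
    printf("%zu\n", cp_strlen_utf8("hello, world"));
    printf("%zu\n", cp_strlen_utf8("naïve"));
    printf("%zu\n", cp_strlen_utf8("こんにちは"));
    return (0);
}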
How much faster is this? I put together a slightly improved version of Kragen's benchmark code, using a buffer filled with valid UTF-8 text instead of his more artificial test cases, and ran it on an Opteron 848 @ 2.2 GHz running FreeBSD 7.0-RELEASE-p1 after compiling with gcc 4.2.1 with the -O3 flag set. Some notes to help decipher the output:
- The function names and their meanings are:
  - gcc_strlen = the strlen() function as compiled by gcc;
  - kjs_strlen = Kragen's non-UTF-8 strlen function;
  - cp_strlen = my non-UTF-8 strlen function (not shown here, but see the source code if you're interested);
  - kjs_strlen_utf8 = Kragen's UTF-8 character counter;
  - gp_strlen_utf8 = George's UTF-8 character counter; and
  - cp_strlen_utf8 = my UTF-8 character counter.
- The test strings are "hello, world", "naïve", and "こんにちは".
- The values printed on each line before the colon are the computed result -- the number of bytes for the strlen functions, and the number of UTF-8 characters for the strlen_utf8 functions; the values after the colon are the mean and standard deviation of the time taken, in seconds.
The improvement is striking:
testing 33554424 bytes of repeated "hello, world":
gcc_strlen = 33554424: 0.034169 +/- 0.000090
kjs_strlen = 33554424: 0.049529 +/- 0.000280
cp_strlen = 33554424: 0.011357 +/- 0.000030
kjs_strlen_utf8 = 33554424: 0.060930 +/- 0.000031
gp_strlen_utf8 = 33554424: 0.049675 +/- 0.000294
cp_strlen_utf8 = 33554424: 0.014049 +/- 0.000047

testing 33554430 bytes of repeated "naïve":
gcc_strlen = 33554430: 0.034168 +/- 0.000069
kjs_strlen = 33554430: 0.049544 +/- 0.000287
cp_strlen = 33554430: 0.011348 +/- 0.000021
kjs_strlen_utf8 = 27962025: 0.061020 +/- 0.000291
gp_strlen_utf8 = 27962025: 0.059726 +/- 0.000029
cp_strlen_utf8 = 27962025: 0.014041 +/- 0.000043

testing 33554430 bytes of repeated "こんにちは":
gcc_strlen = 33554430: 0.034157 +/- 0.000088
kjs_strlen = 33554430: 0.049437 +/- 0.000018
cp_strlen = 33554430: 0.011438 +/- 0.000286
kjs_strlen_utf8 = 11184810: 0.060919 +/- 0.000032
gp_strlen_utf8 = 11184810: 0.027454 +/- 0.000031
cp_strlen_utf8 = 11184810: 0.014133 +/- 0.000287

Not only is vectorized character counting faster than the "look at a byte, skip a few" approach, it isn't even close: even when the characters are 3 bytes each (as in the case of "こんにちは"), the vectorized approach wins by a factor of 2; and its lead is larger when the skipping approach can't skip as many bytes. Moreover, vectorized character counting is only 30% slower than a vectorized strlen and more than twice as fast as a non-vectorized strlen -- although given that the character counting runs at slightly faster than one byte per clock cycle (33.5 MB in about 0.014 seconds at 2.2 GHz is roughly 1.1 bytes per cycle), it's not surprising that non-vectorized code can't keep up!
Can we do better? I think so. My code uses 64-bit integer registers to manipulate 8 bytes at once; this is the same size as MMX registers, so those probably won't be very useful, but with SSE2, 16 bytes can be manipulated at once, which could provide another doubling of the performance.
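For what it's worth, here is a rough sketch of what the inner loop of such an SSE2 version might look like (entirely my own speculation, not code from this post; it uses GCC's SSE2 intrinsics and __builtin_popcount, and for simplicity it counts characters in a buffer of known length that is a multiple of 16 rather than in a NUL-terminated string):

#include <emmintrin.h>  /* SSE2 intrinsics; compile with -msse2 on 32-bit x86. */
#include <stddef.h>

static size_t
sse2_count_utf8(const unsigned char * buf, size_t len)
{
    const __m128i mask_c0 = _mm_set1_epi8((char)0xC0);
    const __m128i mask_80 = _mm_set1_epi8((char)0x80);
    size_t count = 0;
    size_t i;

    /* len is assumed to be a multiple of 16. */
    for (i = 0; i < len; i += 16) {
        /* Grab 16 bytes of UTF-8 data. */
        __m128i v = _mm_loadu_si128((const __m128i *)(buf + i));

        /* 0xFF in each lane holding a continuation byte (10xxxxxx). */
        __m128i cont = _mm_cmpeq_epi8(_mm_and_si128(v, mask_c0), mask_80);

        /* Count the bytes which ARE the first byte of a character. */
        count += 16 - __builtin_popcount(_mm_movemask_epi8(cont));
    }

    return (count);
}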
Beyond a doubling? Well, the first rule of optimization is to start by finding a good algorithm -- and any algorithm in which the critical path involves counting UTF-8 characters in a 32 megabyte NUL-terminated string is doing something wrong. This is very much a toy problem; but the lesson it teaches is worth remembering: Vectorization is good!