TL;DR
- CJK characters usually need 3 bytes in UTF-8.
- preg_match_all with PREG_OFFSET_CAPTURE returns a “byte offset” instead of a “character offset”.
- ereg is obsolete; it is recommended to use PCRE preg_match instead.
Background
Recently, I have been using the text entity extraction feature of Google Vertex AI. I have a lot of strings, produced by OCR from a thousand pictures and stored in a database, and I need to label the text that matches specific patterns. Many of the strings mix English and Traditional Chinese characters.
Dealing with multi-byte strings is not too hard. Since the database is accessed from a Laravel application, I have to do the task the PHP way.
While-loop solution
We can perform an mb_strpos search. If the text is found, use the returned value as the start index. Then use mb_strlen to get the length of the target text and add it to the start index to get the end index. Store the end index as the offset and pass it to mb_strpos again, repeating until nothing is found. Easy, right?
$somePatterns = [
    'Foo',
    'Bar',
    'Baz',
];
$someLongText = "Foo Bar 測試 Baz";
$textLength = mb_strlen($someLongText);
$result = [];
foreach ($somePatterns as $pattern) {
    $offset = 0;
    while ($textLength > $offset) {
        // mb_strpos counts in characters, not bytes.
        $foundIndex = mb_strpos($someLongText, $pattern, $offset);
        if ($foundIndex === false) {
            // Nothing more to find: move the offset to the end to exit the loop.
            $offset = $textLength;
            continue;
        }
        $result[] = [
            'pattern' => $pattern,
            'start' => $foundIndex,
            'end' => $foundIndex + mb_strlen($pattern),
        ];
        // Resume searching right after the current match.
        $offset = $foundIndex + mb_strlen($pattern);
    }
}
echo json_encode($result);
[
{"pattern":"Foo","start":0,"end":3},
{"pattern":"Bar","start":4,"end":7},
{"pattern":"Baz","start":11,"end":14}
]
Issues
But what if performance is a concern? Some people are also wary of dealing with while-loops. We can use a regular expression (RegEx) to find all the text that matches the pattern.
$thatString = "Foo Bar 測試 Baz";
preg_match_all("/Baz/iu", $thatString, $matches, PREG_OFFSET_CAPTURE);
echo json_encode($matches);
[
[
[
"Baz",
15
]
]
]
But why is the offset 15 instead of 11?
We all know UTF-8 is a multi-byte encoding and CJK characters usually need 3 bytes each. The mbstring module counts positions in characters, but preg_match_all reports offsets in bytes, even with the u modifier.
The computer sees this instead.
46 6F 6F // Foo
20 // <Space>
42 61 72 // Bar
20 // <Space>
E6 B8 AC // 測
E8 A9 A6 // 試
20 // <Space>
42 61 7A // Baz
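The dump above can be reproduced from PHP itself. This is only a small sketch, assuming the source file is saved as UTF-8 and the mbstring extension is loaded; the variable names are made up for illustration.

```php
$text = "Foo Bar 測試 Baz";

$hex = bin2hex("測試");                          // raw bytes as hex digits
$byteLength = strlen($text);                     // strlen counts bytes
$charLength = mb_strlen($text, 'UTF-8');         // mb_strlen counts characters
$bytePos = strpos($text, 'Baz');                 // byte offset
$charPos = mb_strpos($text, 'Baz', 0, 'UTF-8');  // character offset

echo "$hex\n";                                                  // e6b8ace8a9a6
echo "$byteLength bytes, $charLength characters\n";             // 18 bytes, 14 characters
echo "Baz: byte offset $bytePos, character offset $charPos\n";  // 15 and 11
```

The two CJK characters account for 6 bytes but only 2 characters, which is exactly the gap of 4 between the two offsets.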
That’s why Baz starts at offset 15. We can call it a “byte offset”. So how can we convert it into an offset that works with multi-byte characters?
First, we can grab the string from the beginning to the “byte offset”:
$someLongText = "Foo Bar 測試 Baz";
$byteOffset = 15;
$stringBeforeByteOffset = substr($someLongText, 0, $byteOffset);
// This grabs "Foo Bar 測試 "
Then we can get its length in characters by using the mb_strlen function.
$characterStartOffset = mb_strlen($stringBeforeByteOffset); // 11
And finally, get the end offset by adding the length of the matched text.
$fullMatchString = "Baz";
$characterEndOffset = $characterStartOffset + mb_strlen($fullMatchString);
// 11 + 3 = 14, we got the correct offset now 🎉
Profit!
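Putting the steps together, the conversion could be wrapped in a small helper. This is a minimal sketch rather than a definitive implementation: the function name findAllWithCharOffsets and the 'text' key are hypothetical, and it assumes UTF-8 input and PHP 7.1 or later (for array destructuring in foreach).

```php
// Hypothetical helper, not part of PHP or Laravel.
function findAllWithCharOffsets(string $pattern, string $subject): array
{
    $results = [];
    // PREG_OFFSET_CAPTURE reports byte offsets, even with the u modifier.
    if (!preg_match_all($pattern, $subject, $matches, PREG_OFFSET_CAPTURE)) {
        return $results; // no match (0) or error (false)
    }
    foreach ($matches[0] as [$text, $byteOffset]) {
        // Count the characters in the bytes that precede the match.
        $start = mb_strlen(substr($subject, 0, $byteOffset), 'UTF-8');
        $results[] = [
            'text' => $text,
            'start' => $start,
            'end' => $start + mb_strlen($text, 'UTF-8'),
        ];
    }
    return $results;
}

$found = findAllWithCharOffsets('/Foo|Bar|Baz/u', "Foo Bar 測試 Baz");
echo json_encode($found);
// [{"text":"Foo","start":0,"end":3},{"text":"Bar","start":4,"end":7},{"text":"Baz","start":11,"end":14}]
```

A single alternation pattern finds every target in one pass, and the offsets agree with the while-loop version.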
Last but not least
Don’t use any of the ereg functions from the mbstring module. The original ereg was removed in PHP 7.0. It is recommended to use PCRE preg_match instead.
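As a quick illustration of PCRE coping with multi-byte text, here is a small sketch that assumes a UTF-8 source file; the \p{Han} pattern is just an example choice and is not taken from the original task.

```php
$text = "Foo Bar 測試 Baz";

// The u modifier makes PCRE treat the pattern and the subject as UTF-8,
// so \p{Han} matches whole Han characters rather than single bytes.
if (preg_match('/\p{Han}+/u', $text, $m)) {
    echo $m[0], "\n"; // 測試
}
```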