PHP RegEx Multi-byte String and Offset

TL;DR

CJK characters usually need 3 bytes in UTF-8.
preg_match_all with PREG_OFFSET_CAPTURE returns a “byte offset”, not a “character offset”.
ereg is obsolete; use the PCRE function preg_match instead.

Background

Recently I have been using the text entity extraction feature of Google Vertex AI. I have a lot of strings stored in a database, produced by OCR from a thousand pictures, and I need to label the text that matches specific patterns. Many of the strings mix English and Traditional Chinese characters.

Dealing with multi-byte strings is not too hard. Since the database is accessed from a Laravel application, I have to do the task the PHP way.

While-loop solution

We can search with mb_strpos. If the text is found, use the returned value as the start index. Then use mb_strlen to get the length of the target text and add it to the start index to get the end index. Store the end index as the offset and pass it to mb_strpos again until nothing is found. Easy, right?

$somePatterns = [
    'Foo',
    'Bar',
    'Baz',
];

$someLongText = "Foo Bar 測試 Baz";
$textLength = mb_strlen($someLongText);
$result = [];

foreach ($somePatterns as $pattern) {
    $offset = 0;

    while ($textLength > $offset) {
        $foundIndex = mb_strpos($someLongText, $pattern, $offset);
        if ($foundIndex === false) {
            // No further occurrence of this pattern.
            break;
        }

        $result[] = [
            'pattern' => $pattern,
            'start' => $foundIndex,
            'end' => $foundIndex + mb_strlen($pattern),
        ];

        $offset = $foundIndex + mb_strlen($pattern);
    }
}

echo json_encode($result);

[
    {"pattern":"Foo","start":0,"end":3},
    {"pattern":"Bar","start":4,"end":7},
    {"pattern":"Baz","start":11,"end":14}
]

Issues

But what if time is a concern? Some people are also wary of while-loops. We can use a regular expression (RegEx) to find all the text that matches the pattern.

$thatString = "Foo Bar 測試 Baz";

preg_match_all("/Baz/iu", $thatString, $matches, PREG_OFFSET_CAPTURE);

echo json_encode($matches);

This prints [[["Baz",15]]]. But why is the offset 15 instead of 11?

We all know UTF-8 is a multi-byte encoding and CJK characters usually need 3 bytes. The mbstring functions count in characters, but preg_match_all always reports offsets in bytes, even with the u modifier.

The computer sees this instead.

46 6F 6F // Foo
20       // <Space>
42 61 72 // Bar
20       // <Space>
E6 B8 AC // 測
E8 A9 A6 // 試
20       // <Space>
42 61 7A // Baz
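To reproduce the dump above and see the two ways of counting side by side, here is a minimal sketch. It assumes the source file is saved as UTF-8 and PHP 7.4 or later (for mb_str_split); the variable names are only for illustration.

```php
<?php
$text = "Foo Bar 測試 Baz";

// strlen counts bytes; mb_strlen counts characters.
$byteLength = strlen($text);             // 18
$charLength = mb_strlen($text, 'UTF-8'); // 14

// The same difference shows up when searching.
$bytePos = strpos($text, 'Baz');                // 15
$charPos = mb_strpos($text, 'Baz', 0, 'UTF-8'); // 11

// Collect the UTF-8 bytes of each character as upper-case hex.
$dump = [];
foreach (mb_str_split($text, 1, 'UTF-8') as $char) {
    $dump[] = strtoupper(implode(' ', str_split(bin2hex($char), 2)));
}

echo implode(' | ', $dump), PHP_EOL;
```

Each CJK character shows up as three hex bytes, which is exactly the gap between 11 and 15.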

That’s why Baz starts at offset 15: there are 15 bytes before it. We can call it a “byte offset”. So how can we turn it into a character offset?

First, we can grab the string from the beginning to the “byte offset”:

$someLongText = "Foo Bar 測試 Baz";
$byteOffset = 15;

$stringBeforeByteOffset = substr($someLongText, 0, $byteOffset);
// This grabs "Foo Bar 測試 "

Then we count the characters in that prefix with the mb_strlen function. The result is the character offset of the match.

$characterStartOffset = mb_strlen($stringBeforeByteOffset); // 11

Finally, we get the end offset by adding the length of the matched text.

$fullMatchString = "Baz";
$characterEndOffset = $characterStartOffset + mb_strlen($fullMatchString);
// 11 + 3 = 14, we got the correct offset now 🎉

Profit!
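Putting the three steps together, the conversion can be wrapped in one function. This is only a sketch: findCharacterOffsets is a hypothetical name, not something from PHP or Laravel, and it assumes the subject is valid UTF-8 and PHP 7.1 or later.

```php
<?php
/**
 * Hypothetical helper: run preg_match_all and report character offsets
 * instead of byte offsets for every full match.
 */
function findCharacterOffsets(string $pattern, string $subject): array
{
    $result = [];

    // preg_match_all returns 0 when nothing matches and false on error.
    if (!preg_match_all($pattern, $subject, $matches, PREG_OFFSET_CAPTURE)) {
        return $result;
    }

    // With PREG_OFFSET_CAPTURE each match is [matchedText, byteOffset].
    foreach ($matches[0] as [$matchedText, $byteOffset]) {
        // Count the characters that come before the byte offset.
        $start = mb_strlen(substr($subject, 0, $byteOffset), 'UTF-8');

        $result[] = [
            'match' => $matchedText,
            'start' => $start,
            'end' => $start + mb_strlen($matchedText, 'UTF-8'),
        ];
    }

    return $result;
}

echo json_encode(findCharacterOffsets('/ba[rz]/iu', "Foo Bar 測試 Baz")), PHP_EOL;
// [{"match":"Bar","start":4,"end":7},{"match":"Baz","start":11,"end":14}]
```

Each match re-reads the string from the beginning, which is fine for short OCR strings. For very long text with many matches, it may be worth counting characters only between consecutive matches instead.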

Last but not least

Don’t reach for the mb_ereg functions from the mbstring module either. The original ereg extension was deprecated in PHP 5.3 and removed in PHP 7.0. Use the PCRE functions such as preg_match instead.
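As a small illustration of the PCRE route, the sketch below matches the Chinese part of the sample string with preg_match. It assumes the PCRE build supports Unicode properties such as \p{Han}, and the offset it reports is still a byte offset.

```php
<?php
$text = "Foo Bar 測試 Baz";

// The u modifier treats pattern and subject as UTF-8;
// \p{Han} matches Han (Chinese) characters.
$found = preg_match('/\p{Han}+/u', $text, $match, PREG_OFFSET_CAPTURE);

if ($found === 1) {
    [$matchedText, $byteOffset] = $match[0];
    echo $matchedText, ' at byte offset ', $byteOffset, PHP_EOL;
    // 測試 at byte offset 8
}
```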