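You can use this PCRE regular expression to check a string for invalid UTF-8. If the regex matches, the string contains invalid byte sequences. It is 100% portable because it does not rely on PCRE_UTF8 being compiled in.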
$regex = '/(
    [\xC0-\xC1] # Invalid UTF-8 Bytes
    | [\xF5-\xFF] # Invalid UTF-8 Bytes
    | \xE0[\x80-\x9F] # Overlong encoding of prior code point
    | \xF0[\x80-\x8F] # Overlong encoding of prior code point
    | [\xC2-\xDF](?![\x80-\xBF]) # Invalid UTF-8 Sequence Start
    | [\xE0-\xEF](?![\x80-\xBF]{2}) # Invalid UTF-8 Sequence Start
    | [\xF0-\xF4](?![\x80-\xBF]{3}) # Invalid UTF-8 Sequence Start
    | (?<=[\x00-\x7F\xF5-\xFF])[\x80-\xBF] # Invalid UTF-8 Sequence Middle
    | (?<![\xC2-\xDF]|[\xE0-\xEF]|[\xE0-\xEF][\x80-\xBF]|[\xF0-\xF4]|[\xF0-\xF4][\x80-\xBF]|[\xF0-\xF4][\x80-\xBF]{2})[\x80-\xBF] # Overlong Sequence
    | (?<=[\xE0-\xEF])[\x80-\xBF](?![\x80-\xBF]) # Short 3 byte sequence
    | (?<=[\xF0-\xF4])[\x80-\xBF](?![\x80-\xBF]{2}) # Short 4 byte sequence
    | (?<=[\xF0-\xF4][\x80-\xBF])[\x80-\xBF](?![\x80-\xBF]) # Short 4 byte sequence (2)
)/x';

We can test it by creating a few variations of text:
// Overlong encoding of code point 0
$text = chr(0xC0) . chr(0x80);
var_dump(preg_match($regex, $text)); // int(1)

// Overlong encoding of 5 byte encoding
$text = chr(0xF8) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80);
var_dump(preg_match($regex, $text)); // int(1)

// Overlong encoding of 6 byte encoding
$text = chr(0xFC) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80);
var_dump(preg_match($regex, $text)); // int(1)

// High code point without trailing characters
$text = chr(0xD0) . chr(0x01);
var_dump(preg_match($regex, $text)); // int(1)
And so on…
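The cases above are all invalid input, so it may also help to confirm that well-formed strings do not match. The following is a minimal sketch under that assumption; the helper name `has_invalid_utf8` is hypothetical and not part of the original answer, and the pattern is repeated only so the snippet runs on its own.

```php
<?php
// Same pattern as above, repeated so this snippet is self-contained.
$regex = '/(
    [\xC0-\xC1] # Invalid UTF-8 Bytes
    | [\xF5-\xFF] # Invalid UTF-8 Bytes
    | \xE0[\x80-\x9F] # Overlong encoding of prior code point
    | \xF0[\x80-\x8F] # Overlong encoding of prior code point
    | [\xC2-\xDF](?![\x80-\xBF]) # Invalid UTF-8 Sequence Start
    | [\xE0-\xEF](?![\x80-\xBF]{2}) # Invalid UTF-8 Sequence Start
    | [\xF0-\xF4](?![\x80-\xBF]{3}) # Invalid UTF-8 Sequence Start
    | (?<=[\x00-\x7F\xF5-\xFF])[\x80-\xBF] # Invalid UTF-8 Sequence Middle
    | (?<![\xC2-\xDF]|[\xE0-\xEF]|[\xE0-\xEF][\x80-\xBF]|[\xF0-\xF4]|[\xF0-\xF4][\x80-\xBF]|[\xF0-\xF4][\x80-\xBF]{2})[\x80-\xBF] # Overlong Sequence
    | (?<=[\xE0-\xEF])[\x80-\xBF](?![\x80-\xBF]) # Short 3 byte sequence
    | (?<=[\xF0-\xF4])[\x80-\xBF](?![\x80-\xBF]{2}) # Short 4 byte sequence
    | (?<=[\xF0-\xF4][\x80-\xBF])[\x80-\xBF](?![\x80-\xBF]) # Short 4 byte sequence (2)
)/x';

// Hypothetical helper: true when $bytes contains an invalid byte sequence.
function has_invalid_utf8(string $bytes, string $regex): bool
{
    return preg_match($regex, $bytes) === 1;
}

// Well-formed input of each sequence length should not match.
var_dump(has_invalid_utf8('plain ASCII', $regex));      // bool(false)
var_dump(has_invalid_utf8("\xC3\xA9", $regex));         // U+00E9, 2 bytes: bool(false)
var_dump(has_invalid_utf8("\xE2\x82\xAC", $regex));     // U+20AC, 3 bytes: bool(false)
var_dump(has_invalid_utf8("\xF0\x9F\x98\x80", $regex)); // U+1F600, 4 bytes: bool(false)

// A 3 byte sequence cut off after its second byte should match.
var_dump(has_invalid_utf8("\xE2\x82", $regex));         // bool(true)
```

Comparing against `=== 1` rather than relying on truthiness keeps a `false` return from `preg_match` (a pattern error) from being mistaken for "no invalid bytes found" in either direction.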
In fact, because this pattern matches the invalid bytes themselves, you can use it with preg_replace to strip them out:
$text = preg_replace($regex, '', $text); // Remove all invalid UTF-8 code points
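As a rough sketch of how that call might be wrapped for reuse, the snippet below defines a hypothetical `strip_invalid_utf8` helper (the name and the loop are assumptions, not part of the original answer). The loop reflects an observation from tracing the pattern by hand: for input such as the bytes E0 80 80, a single pass appears to remove E0 80 but leave the final 80 behind, so the helper repeats the replacement until nothing more is removed.

```php
<?php
// Same pattern as above, repeated so this snippet is self-contained.
$regex = '/(
    [\xC0-\xC1] # Invalid UTF-8 Bytes
    | [\xF5-\xFF] # Invalid UTF-8 Bytes
    | \xE0[\x80-\x9F] # Overlong encoding of prior code point
    | \xF0[\x80-\x8F] # Overlong encoding of prior code point
    | [\xC2-\xDF](?![\x80-\xBF]) # Invalid UTF-8 Sequence Start
    | [\xE0-\xEF](?![\x80-\xBF]{2}) # Invalid UTF-8 Sequence Start
    | [\xF0-\xF4](?![\x80-\xBF]{3}) # Invalid UTF-8 Sequence Start
    | (?<=[\x00-\x7F\xF5-\xFF])[\x80-\xBF] # Invalid UTF-8 Sequence Middle
    | (?<![\xC2-\xDF]|[\xE0-\xEF]|[\xE0-\xEF][\x80-\xBF]|[\xF0-\xF4]|[\xF0-\xF4][\x80-\xBF]|[\xF0-\xF4][\x80-\xBF]{2})[\x80-\xBF] # Overlong Sequence
    | (?<=[\xE0-\xEF])[\x80-\xBF](?![\x80-\xBF]) # Short 3 byte sequence
    | (?<=[\xF0-\xF4])[\x80-\xBF](?![\x80-\xBF]{2}) # Short 4 byte sequence
    | (?<=[\xF0-\xF4][\x80-\xBF])[\x80-\xBF](?![\x80-\xBF]) # Short 4 byte sequence (2)
)/x';

// Hypothetical helper: removes matches repeatedly until a pass removes nothing.
function strip_invalid_utf8(string $bytes, string $regex): string
{
    do {
        $result = preg_replace($regex, '', $bytes, -1, $count);
        if ($result === null) {
            // preg_replace returns null on a PCRE error.
            throw new RuntimeException('preg_replace failed, code ' . preg_last_error());
        }
        $bytes = $result;
    } while ($count > 0); // every match consumes at least one byte, so this terminates

    return $bytes;
}

var_dump(strip_invalid_utf8("a\xC0b", $regex));               // string(2) "ab"
var_dump(strip_invalid_utf8("\xE0\x80\x80", $regex));         // string(0) ""
var_dump(bin2hex(strip_invalid_utf8("caf\xC3\xA9", $regex))); // string(10) "636166c3a9" (unchanged)
```

Whether silently dropping bytes is acceptable depends on the application; substituting a replacement character instead of `''` is an alternative worth considering.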



