栏目分类:
子分类:
返回
名师互学网用户登录
快速导航关闭
当前搜索
当前分类
子分类
实用工具
热门搜索
名师互学网 > IT > 面试经验 > 面试问答

正则表达式检测无效的UTF

面试问答 更新时间: 发布时间: IT归档 最新发布 模块sitemap 名妆网 法律咨询 聚返吧 英语巴士网 伯小乐 网商动力

正则表达式检测无效的UTF

您可以使用此PCRE正则表达式来检查字符串中的有效UTF-8。如果正则表达式匹配,则该字符串包含无效的字节序列。它是100%可移植的,因为它不依赖于PCRE_UTF8进行编译。

$regex = '/(    [xC0-xC1] # Invalid UTF-8 Bytes    | [xF5-xFF] # Invalid UTF-8 Bytes    | xE0[x80-x9F] # Overlong encoding of prior pre point    | xF0[x80-x8F] # Overlong encoding of prior pre point    | [xC2-xDF](?![x80-xBF]) # Invalid UTF-8 Sequence Start    | [xE0-xEF](?![x80-xBF]{2}) # Invalid UTF-8 Sequence Start    | [xF0-xF4](?![x80-xBF]{3}) # Invalid UTF-8 Sequence Start    | (?<=[x00-x7FxF5-xFF])[x80-xBF] # Invalid UTF-8 Sequence Middle    | (?<![xC2-xDF]|[xE0-xEF]|[xE0-xEF][x80-xBF]|[xF0-xF4]|[xF0-xF4][x80-xBF]|[xF0-xF4][x80-xBF]{2})[x80-xBF] # Overlong Sequence    | (?<=[xE0-xEF])[x80-xBF](?![x80-xBF]) # Short 3 byte sequence    | (?<=[xF0-xF4])[x80-xBF](?![x80-xBF]{2}) # Short 4 byte sequence    | (?<=[xF0-xF4][x80-xBF])[x80-xBF](?![x80-xBF]) # Short 4 byte sequence (2))/x';

我们可以通过创建一些文本变体来对其进行测试:

// Overlong encoding of pre point 0$text = chr(0xC0) . chr(0x80);var_dump(preg_match($regex, $text)); // int(1)// Overlong encoding of 5 byte encoding$text = chr(0xF8) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80);var_dump(preg_match($regex, $text)); // int(1)// Overlong encoding of 6 byte encoding$text = chr(0xFC) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80);        var_dump(preg_match($regex, $text)); // int(1)// High pre-point without trailing characters$text = chr(0xD0) . chr(0x01);var_dump(preg_match($regex, $text)); // int(1)

等等…

实际上,由于此匹配无效字节,因此可以在preg_replace中使用它替换掉它们:

preg_replace($regex, '', $text); // Remove all invalid UTF-8 pre-points


转载请注明:文章转载自 www.mshxw.com
本文地址:https://www.mshxw.com/it/369437.html
我们一直用心在做
关于我们 文章归档 网站地图 联系我们

版权所有 (c)2021-2022 MSHXW.COM

ICP备案号:晋ICP备2021003244-6号