栏目分类:
子分类:
返回
名师互学网用户登录
快速导航关闭
当前搜索
当前分类
子分类
实用工具
热门搜索
名师互学网 > IT > 面试经验 > 面试问答

正则表达式字符类双重否定中的错误?

面试问答 更新时间: 发布时间: IT归档 最新发布 模块sitemap 名妆网 法律咨询 聚返吧 英语巴士网 伯小乐 网商动力

正则表达式字符类双重否定中的错误?

Oracle的

Pattern
类的实现的字符类解析代码中存在一些奇怪的巫毒,如果从Oracle网站下载JRE / JDK或使用OpenJDK,则JDO /
JDK会附带这些巫毒。我还没有检查其他JVM(尤其是GNU
Classpath)实现如何解析问题中的正则表达式。

从这一点出发,对

Pattern
类及其内部工作的任何引用都严格限于Oracle的实现(参考实现)。

Pattern
如问题所示,阅读和理解类如何解析嵌套的否定将花费一些时间。但是,我编写了一个程序1从
Pattern
对象中提取信息(使用Reflection
API),以查看编译结果。下面的输出来自在Java
HotSpot Client VM版本1.7.0_51上运行我的程序。

1:目前,该程序令人尴尬。完成并重构后,我将使用链接更新此帖子。

[^0-9]Start. Start unanchored match (minLength=1)CharProperty.complement (character class negation). Match any character NOT matched by the following character class:  Pattern.rangeFor (character range). Match any character within the range from pre point U+0030 to pre point U+0039 (both ends inclusive)LastNodeNode. Accept match

这没什么奇怪的。

[^[^0-9]]Start. Start unanchored match (minLength=1)CharProperty.complement (character class negation). Match any character NOT matched by the following character class:  Pattern.rangeFor (character range). Match any character within the range from pre point U+0030 to pre point U+0039 (both ends inclusive)LastNodeNode. Accept match[^[^[^0-9]]]Start. Start unanchored match (minLength=1)CharProperty.complement (character class negation). Match any character NOT matched by the following character class:  Pattern.rangeFor (character range). Match any character within the range from pre point U+0030 to pre point U+0039 (both ends inclusive)LastNodeNode. Accept match

上面接下来的2种情况被编译为与相同的程序

[^0-9]
,这是 违反直觉的

[[^0-9]2]Start. Start unanchored match (minLength=1)Pattern.union (character class union). Match any character matched by either character classes below:  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:    Pattern.rangeFor (character range). Match any character within the range from pre point U+0030 to pre point U+0039 (both ends inclusive)  BitClass. Optimized character class with boolean[] to match characters in Latin-1 (pre point <= 255). Match the following 1 character(s):    [U+0032]    2LastNodeNode. Accept match[D2]Start. Start unanchored match (minLength=1)Pattern.union (character class union). Match any character matched by either character classes below:  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:    Ctype. Match POSIX character class DIGIT (US-ASCII)  BitClass. Optimized character class with boolean[] to match characters in Latin-1 (pre point <= 255). Match the following 1 character(s):    [U+0032]    2LastNodeNode. Accept match

如问题中所述,在上述2种情况下没有什么奇怪的。

[013-9]Start. Start unanchored match (minLength=1)Pattern.union (character class union). Match any character matched by either character classes below:  BitClass. Optimized character class with boolean[] to match characters in Latin-1 (pre point <= 255). Match the following 2 character(s):    [U+0030][U+0031]    01  Pattern.rangeFor (character range). Match any character within the range from pre point U+0033 to pre point U+0039 (both ends inclusive)LastNodeNode. Accept match[^D2]Start. Start unanchored match (minLength=1)Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class:  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:    CharProperty.complement (character class negation). Match any character NOT matched by the following character class:      Ctype. Match POSIX character class DIGIT (US-ASCII)  BitClass. Optimized character class with boolean[] to match characters in Latin-1 (pre point <= 255). Match the following 1 character(s):    [U+0032]    2LastNodeNode. Accept match

如问题中所述,这两个案例按预期工作。但是,请注意引擎如何对第一个字符类(

D
)进行补充,并将集差异应用于由剩余字符组成的字符类。

[^[^0-9]2]Start. Start unanchored match (minLength=1)Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class:  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:    Pattern.rangeFor (character range). Match any character within the range from pre point U+0030 to pre point U+0039 (both ends inclusive)  BitClass. Optimized character class with boolean[] to match characters in Latin-1 (pre point <= 255). Match the following 1 character(s):    [U+0032]    2LastNodeNode. Accept match[^[^[^0-9]]2]Start. Start unanchored match (minLength=1)Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class:  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:    Pattern.rangeFor (character range). Match any character within the range from pre point U+0030 to pre point U+0039 (both ends inclusive)  BitClass. Optimized character class with boolean[] to match characters in Latin-1 (pre point <= 255). Match the following 1 character(s):    [U+0032]    2LastNodeNode. Accept match[^[^[^[^0-9]]]2]Start. Start unanchored match (minLength=1)Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class:  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:    Pattern.rangeFor (character range). Match any character within the range from pre point U+0030 to pre point U+0039 (both ends inclusive)  BitClass. Optimized character class with boolean[] to match characters in Latin-1 (pre point <= 255). Match the following 1 character(s):    [U+0032]    2LastNodeNode. Accept match

正如Keppil在评论中进行的测试所证实的那样,上面的输出显示上述所有3个正则表达式都编译到了同一程序中!

[^2[^0-9]]Start. Start unanchored match (minLength=1)Pattern.union (character class union). Match any character matched by either character classes below:  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:    BitClass. Optimized character class with boolean[] to match characters in Latin-1 (pre point <= 255). Match the following 1 character(s):      [U+0032]      2  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:    Pattern.rangeFor (character range). Match any character within the range from pre point U+0030 to pre point U+0039 (both ends inclusive)LastNodeNode. Accept match

取而代之的

NOT(UNIOn(2, NOT(0-9))
0-13-9
UNIOn(NOT(2), NOT(0-9))
等于
NOT(2)

[^2[^[^0-9]]]Start. Start unanchored match (minLength=1)Pattern.union (character class union). Match any character matched by either character classes below:  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:    BitClass. Optimized character class with boolean[] to match characters in Latin-1 (pre point <= 255). Match the following 1 character(s):      [U+0032]      2  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:    Pattern.rangeFor (character range). Match any character within the range from pre point U+0030 to pre point U+0039 (both ends inclusive)LastNodeNode. Accept match

由于相同的错误,正则表达式可

[^2[^[^0-9]]]
编译到相同的程序
[^2[^0-9]]

有一个未解决的错误,似乎具有相同的性质:JDK-6609854。


说明

初步

下面是

Pattern
在进一步阅读之前应该知道的类的实现细节:

  • Pattern
    类将a编译
    String
    成一个节点链,每个节点负责一个明确的小责任,并将工作委托给链中的下一个节点。
    Node
    class是所有节点的基类。
  • CharProperty
    class是所有与字符类相关
    Node
    的的基类。
  • BitClass
    class是
    CharProperty
    使用
    boolean[]
    数组加快对Latin-1字符(代码点<= 255)的匹配的类的子类。它有一个
    add
    方法,允许在编译过程中添加字符。
  • CharProperty.complement
    Pattern.union
    Pattern.intersection
    是组对应的操作方法。他们所做的是不言自明的。
  • Pattern.setDifference
    是不对称集差。

乍一看解析字符类

在查看完整的

CharProperty clazz(booleanconsume)
方法代码(负责解析字符类的方法)之前,让我们看一下代码的极为简化的版本以了解代码流程:

private CharProperty clazz(boolean consume) {    // [Declaration and initialization of local variables - OMITTED]    BitClass bits = new BitClass();    int ch = next();    for (;;) {        switch (ch) { case '^':     // Negates if first char in a class, otherwise literal     if (firstInClass) {         // [CODE OMITTED]         ch = next();         continue;     } else {         // ^ not first in class, treat as literal         break;     } case '[':     // [CODE OMITTED]     ch = peek();     continue; case '&':     // [CODE OMITTED]     continue; case 0:     // [CODE OMITTED]     // Unclosed character class is checked here     break; case ']':     // [CODE OMITTED]     // The only return statement in this method     // is in this case     break; default:     // [CODE OMITTED]     break;        }        node = range(bits);        // [CODE OMITTED]        ch = peek();    }}

代码基本上会读取输入(输入

String
转换为代码点的以 null结束
int[]
的输入),直到命中
]
或到达String的末尾(未封闭字符类)为止。

该代码在该模块内部有点混乱

continue
break
混在一起
switch
。但是,只要您意识到它
continue
属于外
for
循环并
break
属于
switch
块,代码就很容易理解:

  • 结尾的案例
    continue
    将永远不会在
    switch
    语句后执行代码。
  • 结尾的案例
    break
    可以在
    switch
    语句之后执行代码(如果
    return
    尚未执行)。

通过上面的观察,我们可以看到,每当 发现一个非特殊字符并且应将其包括在字符类中时 ,我们将在

switch
语句之后执行代码,这
node =range(bits);
是第一个语句。

如果检查源代码,则该方法

CharPropertyrange(BitClassbits)
将解析“字符类中的单个字符或字符范围”。该方法要么返回
BitClass
传入的相同对象(添加新字符),要么返回
CharProperty
类的新实例。

血腥细节

接下来,让我们看一下完整的代码版本(

&&
省略解析字符类交集的部分):

private CharProperty clazz(boolean consume) {    CharProperty prev = null;    CharProperty node = null;    BitClass bits = new BitClass();    boolean include = true;    boolean firstInClass = true;    int ch = next();    for (;;) {        switch (ch) { case '^':     // Negates if first char in a class, otherwise literal     if (firstInClass) {         if (temp[cursor-1] != '[')  break;         ch = next();         include = !include;         continue;     } else {         // ^ not first in class, treat as literal         break;     } case '[':     firstInClass = false;     node = clazz(true);     if (prev == null)         prev = node;     else         prev = union(prev, node);     ch = peek();     continue; case '&':     // [CODE OMITTED]     // There are interesting things (bugs) here,     // but it is not relevant to the discussion.     continue; case 0:     firstInClass = false;     if (cursor >= patternLength)         throw error("Unclosed character class");     break; case ']':     firstInClass = false;     if (prev != null) {         if (consume)  next();         return prev;     }     break; default:     firstInClass = false;     break;        }        node = range(bits);        if (include) { if (prev == null) {     prev = node; } else {     if (prev != node)         prev = union(prev, node); }        } else { if (prev == null) {     prev = node.complement(); } else {     if (prev != node)         prev = setDifference(prev, node); }        }        ch = peek();    }}

纵观代码

case '[':
中的
switch
语句和之后的代码
switch
语句:

  • 所述
    node
    变量存储解析的结果 单元 (一个独立的字符,字符范围,一个速记字符类,一POSIX / Unipre字符类或嵌套字符类)
  • prev
    变量存储编译的结果,到目前为止,始终更新我们编译一个之后 单元
    node

由于

booleaninclude
记录字符类是否被否定的局部变量永远不会传递给任何方法调用,因此只能在此方法中对其进行操作。唯一
include
的读取和处理的地方是
switch
语句之后。

正在建设中



转载请注明:文章转载自 www.mshxw.com
本文地址:https://www.mshxw.com/it/508588.html
我们一直用心在做
关于我们 文章归档 网站地图 联系我们

版权所有 (c)2021-2022 MSHXW.COM

ICP备案号:晋ICP备2021003244-6号