正则表达式字符类双重否定中的错误？

Oracle的

Pattern

类的实现的字符类解析代码中存在一些奇怪的巫毒，如果从Oracle网站下载JRE / JDK或使用OpenJDK，则JDO /
JDK会附带这些巫毒。我还没有检查其他JVM（尤其是GNU
Classpath）实现如何解析问题中的正则表达式。

从这一点出发，对

Pattern

类及其内部工作的任何引用都严格限于Oracle的实现（参考实现）。

Pattern

如问题所示，阅读和理解类如何解析嵌套的否定将花费一些时间。但是，我编写了一个程序1从

Pattern

对象中提取信息（使用Reflection
API），以查看编译结果。下面的输出来自在Java
HotSpot Client VM版本1.7.0_51上运行我的程序。

1：目前，该程序令人尴尬。完成并重构后，我将使用链接更新此帖子。

[^0-9]Start. Start unanchored match (minLength=1)CharProperty.complement (character class negation). Match any character NOT matched by the following character class:  Pattern.rangeFor (character range). Match any character within the range from pre point U+0030 to pre point U+0039 (both ends inclusive)LastNodeNode. Accept match

这没什么奇怪的。

[^[^0-9]]Start. Start unanchored match (minLength=1)CharProperty.complement (character class negation). Match any character NOT matched by the following character class:  Pattern.rangeFor (character range). Match any character within the range from pre point U+0030 to pre point U+0039 (both ends inclusive)LastNodeNode. Accept match[^[^[^0-9]]]Start. Start unanchored match (minLength=1)CharProperty.complement (character class negation). Match any character NOT matched by the following character class:  Pattern.rangeFor (character range). Match any character within the range from pre point U+0030 to pre point U+0039 (both ends inclusive)LastNodeNode. Accept match

上面接下来的2种情况被编译为与相同的程序

[^0-9]

，这是 违反直觉的 。

[[^0-9]2]Start. Start unanchored match (minLength=1)Pattern.union (character class union). Match any character matched by either character classes below:  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:    Pattern.rangeFor (character range). Match any character within the range from pre point U+0030 to pre point U+0039 (both ends inclusive)  BitClass. Optimized character class with boolean[] to match characters in Latin-1 (pre point <= 255). Match the following 1 character(s):    [U+0032]    2LastNodeNode. Accept match[D2]Start. Start unanchored match (minLength=1)Pattern.union (character class union). Match any character matched by either character classes below:  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:    Ctype. Match POSIX character class DIGIT (US-ASCII)  BitClass. Optimized character class with boolean[] to match characters in Latin-1 (pre point <= 255). Match the following 1 character(s):    [U+0032]    2LastNodeNode. Accept match

如问题中所述，在上述2种情况下没有什么奇怪的。

[013-9]Start. Start unanchored match (minLength=1)Pattern.union (character class union). Match any character matched by either character classes below:  BitClass. Optimized character class with boolean[] to match characters in Latin-1 (pre point <= 255). Match the following 2 character(s):    [U+0030][U+0031]    01  Pattern.rangeFor (character range). Match any character within the range from pre point U+0033 to pre point U+0039 (both ends inclusive)LastNodeNode. Accept match[^D2]Start. Start unanchored match (minLength=1)Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class:  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:    CharProperty.complement (character class negation). Match any character NOT matched by the following character class:      Ctype. Match POSIX character class DIGIT (US-ASCII)  BitClass. Optimized character class with boolean[] to match characters in Latin-1 (pre point <= 255). Match the following 1 character(s):    [U+0032]    2LastNodeNode. Accept match

如问题中所述，这两个案例按预期工作。但是，请注意引擎如何对第一个字符类（

）进行补充，并将集差异应用于由剩余字符组成的字符类。

[^[^0-9]2]Start. Start unanchored match (minLength=1)Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class:  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:    Pattern.rangeFor (character range). Match any character within the range from pre point U+0030 to pre point U+0039 (both ends inclusive)  BitClass. Optimized character class with boolean[] to match characters in Latin-1 (pre point <= 255). Match the following 1 character(s):    [U+0032]    2LastNodeNode. Accept match[^[^[^0-9]]2]Start. Start unanchored match (minLength=1)Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class:  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:    Pattern.rangeFor (character range). Match any character within the range from pre point U+0030 to pre point U+0039 (both ends inclusive)  BitClass. Optimized character class with boolean[] to match characters in Latin-1 (pre point <= 255). Match the following 1 character(s):    [U+0032]    2LastNodeNode. Accept match[^[^[^[^0-9]]]2]Start. Start unanchored match (minLength=1)Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class:  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:    Pattern.rangeFor (character range). Match any character within the range from pre point U+0030 to pre point U+0039 (both ends inclusive)  BitClass. Optimized character class with boolean[] to match characters in Latin-1 (pre point <= 255). Match the following 1 character(s):    [U+0032]    2LastNodeNode. Accept match

正如Keppil在评论中进行的测试所证实的那样，上面的输出显示上述所有3个正则表达式都编译到了同一程序中！

[^2[^0-9]]Start. Start unanchored match (minLength=1)Pattern.union (character class union). Match any character matched by either character classes below:  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:    BitClass. Optimized character class with boolean[] to match characters in Latin-1 (pre point <= 255). Match the following 1 character(s):      [U+0032]      2  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:    Pattern.rangeFor (character range). Match any character within the range from pre point U+0030 to pre point U+0039 (both ends inclusive)LastNodeNode. Accept match

取而代之的

NOT(UNIOn(2, NOT(0-9))

是

0-13-9

，

UNIOn(NOT(2), NOT(0-9))

等于

NOT(2)

。

[^2[^[^0-9]]]Start. Start unanchored match (minLength=1)Pattern.union (character class union). Match any character matched by either character classes below:  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:    BitClass. Optimized character class with boolean[] to match characters in Latin-1 (pre point <= 255). Match the following 1 character(s):      [U+0032]      2  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:    Pattern.rangeFor (character range). Match any character within the range from pre point U+0030 to pre point U+0039 (both ends inclusive)LastNodeNode. Accept match

由于相同的错误，正则表达式可

[^2[^[^0-9]]]

编译到相同的程序

[^2[^0-9]]

。

有一个未解决的错误，似乎具有相同的性质：JDK-6609854。

说明

初步

下面是

Pattern

在进一步阅读之前应该知道的类的实现细节：

```
Pattern
```
类将a编译
```
String
```
成一个节点链，每个节点负责一个明确的小责任，并将工作委托给链中的下一个节点。
```
Node
```
class是所有节点的基类。
```
CharProperty
```
class是所有与字符类相关
```
Node
```
的的基类。
```
BitClass
```
class是
```
CharProperty
```
使用
```
boolean[]
```
数组加快对Latin-1字符（代码点<= 255）的匹配的类的子类。它有一个
```
add
```
方法，允许在编译过程中添加字符。
```
CharProperty.complement
```
，
```
Pattern.union
```
，
```
Pattern.intersection
```
是组对应的操作方法。他们所做的是不言自明的。
```
Pattern.setDifference
```
是不对称集差。

乍一看解析字符类

在查看完整的

CharProperty clazz(booleanconsume)

方法代码（负责解析字符类的方法）之前，让我们看一下代码的极为简化的版本以了解代码流程：

private CharProperty clazz(boolean consume) {    // [Declaration and initialization of local variables - OMITTED]    BitClass bits = new BitClass();    int ch = next();    for (;;) {        switch (ch) { case '^':     // Negates if first char in a class, otherwise literal     if (firstInClass) {         // [CODE OMITTED]         ch = next();         continue;     } else {         // ^ not first in class, treat as literal         break;     } case '[':     // [CODE OMITTED]     ch = peek();     continue; case '&':     // [CODE OMITTED]     continue; case 0:     // [CODE OMITTED]     // Unclosed character class is checked here     break; case ']':     // [CODE OMITTED]     // The only return statement in this method     // is in this case     break; default:     // [CODE OMITTED]     break;        }        node = range(bits);        // [CODE OMITTED]        ch = peek();    }}

代码基本上会读取输入（输入

String

转换为代码点的以 null结束

int[]

的输入），直到命中

或到达String的末尾（未封闭字符类）为止。

该代码在该模块内部有点混乱

continue

和

break

混在一起

switch

。但是，只要您意识到它

continue

属于外

for

循环并

break

属于

switch

块，代码就很容易理解：

结尾的案例
```
continue
```
将永远不会在
```
switch
```
语句后执行代码。
结尾的案例
```
break
```
可以在
```
switch
```
语句之后执行代码（如果
```
return
```
尚未执行）。

通过上面的观察，我们可以看到，每当 发现一个非特殊字符并且应将其包括在字符类中时 ，我们将在

switch

语句之后执行代码，这

node =range(bits);

是第一个语句。

如果检查源代码，则该方法

CharPropertyrange(BitClassbits)

将解析“字符类中的单个字符或字符范围”。该方法要么返回

BitClass

传入的相同对象（添加新字符），要么返回

CharProperty

类的新实例。

血腥细节

接下来，让我们看一下完整的代码版本（

&&

省略解析字符类交集的部分）：

private CharProperty clazz(boolean consume) {    CharProperty prev = null;    CharProperty node = null;    BitClass bits = new BitClass();    boolean include = true;    boolean firstInClass = true;    int ch = next();    for (;;) {        switch (ch) { case '^':     // Negates if first char in a class, otherwise literal     if (firstInClass) {         if (temp[cursor-1] != '[')  break;         ch = next();         include = !include;         continue;     } else {         // ^ not first in class, treat as literal         break;     } case '[':     firstInClass = false;     node = clazz(true);     if (prev == null)         prev = node;     else         prev = union(prev, node);     ch = peek();     continue; case '&':     // [CODE OMITTED]     // There are interesting things (bugs) here,     // but it is not relevant to the discussion.     continue; case 0:     firstInClass = false;     if (cursor >= patternLength)         throw error("Unclosed character class");     break; case ']':     firstInClass = false;     if (prev != null) {         if (consume)  next();         return prev;     }     break; default:     firstInClass = false;     break;        }        node = range(bits);        if (include) { if (prev == null) {     prev = node; } else {     if (prev != node)         prev = union(prev, node); }        } else { if (prev == null) {     prev = node.complement(); } else {     if (prev != node)         prev = setDifference(prev, node); }        }        ch = peek();    }}

纵观代码

case '[':

中的

switch

语句和之后的代码

switch

语句：

所述
```
node
```
变量存储解析的结果单元（一个独立的字符，字符范围，一个速记字符类，一POSIX / Unipre字符类或嵌套字符类）
该
```
prev
```
变量存储编译的结果，到目前为止，始终更新我们编译一个之后单元在
```
node
```
。

由于

booleaninclude

记录字符类是否被否定的局部变量永远不会传递给任何方法调用，因此只能在此方法中对其进行操作。唯一

include

的读取和处理的地方是

switch

语句之后。

正则表达式字符类双重否定中的错误？

说明

初步

乍一看解析字符类

血腥细节

正在建设中

面试问答相关栏目本月热门文章