Oracle的
Pattern类的实现的字符类解析代码中存在一些奇怪的巫毒,如果从Oracle网站下载JRE / JDK或使用OpenJDK,则JDO /
JDK会附带这些巫毒。我还没有检查其他JVM(尤其是GNU
Classpath)实现如何解析问题中的正则表达式。
从这一点出发,对
Pattern类及其内部工作的任何引用都严格限于Oracle的实现(参考实现)。
Pattern如问题所示,阅读和理解类如何解析嵌套的否定将花费一些时间。但是,我编写了一个程序1从
Pattern对象中提取信息(使用Reflection
API),以查看编译结果。下面的输出来自在Java
HotSpot Client VM版本1.7.0_51上运行我的程序。
1:目前,该程序令人尴尬。完成并重构后,我将使用链接更新此帖子。
[^0-9]Start. Start unanchored match (minLength=1)CharProperty.complement (character class negation). Match any character NOT matched by the following character class: Pattern.rangeFor (character range). Match any character within the range from pre point U+0030 to pre point U+0039 (both ends inclusive)LastNodeNode. Accept match
这没什么奇怪的。
[^[^0-9]]Start. Start unanchored match (minLength=1)CharProperty.complement (character class negation). Match any character NOT matched by the following character class: Pattern.rangeFor (character range). Match any character within the range from pre point U+0030 to pre point U+0039 (both ends inclusive)LastNodeNode. Accept match[^[^[^0-9]]]Start. Start unanchored match (minLength=1)CharProperty.complement (character class negation). Match any character NOT matched by the following character class: Pattern.rangeFor (character range). Match any character within the range from pre point U+0030 to pre point U+0039 (both ends inclusive)LastNodeNode. Accept match
上面接下来的2种情况被编译为与相同的程序
[^0-9],这是 违反直觉的 。
[[^0-9]2]Start. Start unanchored match (minLength=1)Pattern.union (character class union). Match any character matched by either character classes below: CharProperty.complement (character class negation). Match any character NOT matched by the following character class: Pattern.rangeFor (character range). Match any character within the range from pre point U+0030 to pre point U+0039 (both ends inclusive) BitClass. Optimized character class with boolean[] to match characters in Latin-1 (pre point <= 255). Match the following 1 character(s): [U+0032] 2LastNodeNode. Accept match[D2]Start. Start unanchored match (minLength=1)Pattern.union (character class union). Match any character matched by either character classes below: CharProperty.complement (character class negation). Match any character NOT matched by the following character class: Ctype. Match POSIX character class DIGIT (US-ASCII) BitClass. Optimized character class with boolean[] to match characters in Latin-1 (pre point <= 255). Match the following 1 character(s): [U+0032] 2LastNodeNode. Accept match
如问题中所述,在上述2种情况下没有什么奇怪的。
[013-9]Start. Start unanchored match (minLength=1)Pattern.union (character class union). Match any character matched by either character classes below: BitClass. Optimized character class with boolean[] to match characters in Latin-1 (pre point <= 255). Match the following 2 character(s): [U+0030][U+0031] 01 Pattern.rangeFor (character range). Match any character within the range from pre point U+0033 to pre point U+0039 (both ends inclusive)LastNodeNode. Accept match[^D2]Start. Start unanchored match (minLength=1)Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class: CharProperty.complement (character class negation). Match any character NOT matched by the following character class: CharProperty.complement (character class negation). Match any character NOT matched by the following character class: Ctype. Match POSIX character class DIGIT (US-ASCII) BitClass. Optimized character class with boolean[] to match characters in Latin-1 (pre point <= 255). Match the following 1 character(s): [U+0032] 2LastNodeNode. Accept match
如问题中所述,这两个案例按预期工作。但是,请注意引擎如何对第一个字符类(
D)进行补充,并将集差异应用于由剩余字符组成的字符类。
[^[^0-9]2]Start. Start unanchored match (minLength=1)Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class: CharProperty.complement (character class negation). Match any character NOT matched by the following character class: Pattern.rangeFor (character range). Match any character within the range from pre point U+0030 to pre point U+0039 (both ends inclusive) BitClass. Optimized character class with boolean[] to match characters in Latin-1 (pre point <= 255). Match the following 1 character(s): [U+0032] 2LastNodeNode. Accept match[^[^[^0-9]]2]Start. Start unanchored match (minLength=1)Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class: CharProperty.complement (character class negation). Match any character NOT matched by the following character class: Pattern.rangeFor (character range). Match any character within the range from pre point U+0030 to pre point U+0039 (both ends inclusive) BitClass. Optimized character class with boolean[] to match characters in Latin-1 (pre point <= 255). Match the following 1 character(s): [U+0032] 2LastNodeNode. Accept match[^[^[^[^0-9]]]2]Start. Start unanchored match (minLength=1)Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class: CharProperty.complement (character class negation). Match any character NOT matched by the following character class: Pattern.rangeFor (character range). Match any character within the range from pre point U+0030 to pre point U+0039 (both ends inclusive) BitClass. Optimized character class with boolean[] to match characters in Latin-1 (pre point <= 255). Match the following 1 character(s): [U+0032] 2LastNodeNode. Accept match
正如Keppil在评论中进行的测试所证实的那样,上面的输出显示上述所有3个正则表达式都编译到了同一程序中!
[^2[^0-9]]Start. Start unanchored match (minLength=1)Pattern.union (character class union). Match any character matched by either character classes below: CharProperty.complement (character class negation). Match any character NOT matched by the following character class: BitClass. Optimized character class with boolean[] to match characters in Latin-1 (pre point <= 255). Match the following 1 character(s): [U+0032] 2 CharProperty.complement (character class negation). Match any character NOT matched by the following character class: Pattern.rangeFor (character range). Match any character within the range from pre point U+0030 to pre point U+0039 (both ends inclusive)LastNodeNode. Accept match
取而代之的
NOT(UNIOn(2, NOT(0-9))是
0-13-9,
UNIOn(NOT(2), NOT(0-9))等于
NOT(2)。
[^2[^[^0-9]]]Start. Start unanchored match (minLength=1)Pattern.union (character class union). Match any character matched by either character classes below: CharProperty.complement (character class negation). Match any character NOT matched by the following character class: BitClass. Optimized character class with boolean[] to match characters in Latin-1 (pre point <= 255). Match the following 1 character(s): [U+0032] 2 CharProperty.complement (character class negation). Match any character NOT matched by the following character class: Pattern.rangeFor (character range). Match any character within the range from pre point U+0030 to pre point U+0039 (both ends inclusive)LastNodeNode. Accept match
由于相同的错误,正则表达式可
[^2[^[^0-9]]]编译到相同的程序
[^2[^0-9]]。
有一个未解决的错误,似乎具有相同的性质:JDK-6609854。
说明
初步
下面是
Pattern在进一步阅读之前应该知道的类的实现细节:
Pattern
类将a编译String
成一个节点链,每个节点负责一个明确的小责任,并将工作委托给链中的下一个节点。Node
class是所有节点的基类。CharProperty
class是所有与字符类相关Node
的的基类。BitClass
class是CharProperty
使用boolean[]
数组加快对Latin-1字符(代码点<= 255)的匹配的类的子类。它有一个add
方法,允许在编译过程中添加字符。CharProperty.complement
,Pattern.union
,Pattern.intersection
是组对应的操作方法。他们所做的是不言自明的。Pattern.setDifference
是不对称集差。
乍一看解析字符类
在查看完整的
CharProperty clazz(booleanconsume)方法代码(负责解析字符类的方法)之前,让我们看一下代码的极为简化的版本以了解代码流程:
private CharProperty clazz(boolean consume) { // [Declaration and initialization of local variables - OMITTED] BitClass bits = new BitClass(); int ch = next(); for (;;) { switch (ch) { case '^': // Negates if first char in a class, otherwise literal if (firstInClass) { // [CODE OMITTED] ch = next(); continue; } else { // ^ not first in class, treat as literal break; } case '[': // [CODE OMITTED] ch = peek(); continue; case '&': // [CODE OMITTED] continue; case 0: // [CODE OMITTED] // Unclosed character class is checked here break; case ']': // [CODE OMITTED] // The only return statement in this method // is in this case break; default: // [CODE OMITTED] break; } node = range(bits); // [CODE OMITTED] ch = peek(); }}代码基本上会读取输入(输入
String转换为代码点的以 null结束
int[]的输入),直到命中
]或到达String的末尾(未封闭字符类)为止。
该代码在该模块内部有点混乱
continue和
break混在一起
switch。但是,只要您意识到它
continue属于外
for循环并
break属于
switch块,代码就很容易理解:
- 结尾的案例
continue
将永远不会在switch
语句后执行代码。 - 结尾的案例
break
可以在switch
语句之后执行代码(如果return
尚未执行)。
通过上面的观察,我们可以看到,每当 发现一个非特殊字符并且应将其包括在字符类中时 ,我们将在
switch语句之后执行代码,这
node =range(bits);是第一个语句。
如果检查源代码,则该方法
CharPropertyrange(BitClassbits)将解析“字符类中的单个字符或字符范围”。该方法要么返回
BitClass传入的相同对象(添加新字符),要么返回
CharProperty类的新实例。
血腥细节
接下来,让我们看一下完整的代码版本(
&&省略解析字符类交集的部分):
private CharProperty clazz(boolean consume) { CharProperty prev = null; CharProperty node = null; BitClass bits = new BitClass(); boolean include = true; boolean firstInClass = true; int ch = next(); for (;;) { switch (ch) { case '^': // Negates if first char in a class, otherwise literal if (firstInClass) { if (temp[cursor-1] != '[') break; ch = next(); include = !include; continue; } else { // ^ not first in class, treat as literal break; } case '[': firstInClass = false; node = clazz(true); if (prev == null) prev = node; else prev = union(prev, node); ch = peek(); continue; case '&': // [CODE OMITTED] // There are interesting things (bugs) here, // but it is not relevant to the discussion. continue; case 0: firstInClass = false; if (cursor >= patternLength) throw error("Unclosed character class"); break; case ']': firstInClass = false; if (prev != null) { if (consume) next(); return prev; } break; default: firstInClass = false; break; } node = range(bits); if (include) { if (prev == null) { prev = node; } else { if (prev != node) prev = union(prev, node); } } else { if (prev == null) { prev = node.complement(); } else { if (prev != node) prev = setDifference(prev, node); } } ch = peek(); }}纵观代码
case '[':中的
switch语句和之后的代码
switch语句:
- 所述
node
变量存储解析的结果 单元 (一个独立的字符,字符范围,一个速记字符类,一POSIX / Unipre字符类或嵌套字符类) - 该
prev
变量存储编译的结果,到目前为止,始终更新我们编译一个之后 单元 在node
。
由于
booleaninclude记录字符类是否被否定的局部变量永远不会传递给任何方法调用,因此只能在此方法中对其进行操作。唯一
include的读取和处理的地方是
switch语句之后。



