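To reliably find all comments in Java source files, I wouldn't use regex, but a real lexer (aka tokenizer).
Two popular choices for Java are:
- JFlex: http://jflex.de
- ANTLR: http://www.antlr.org
Contrary to popular belief, ANTLR can also be used to create only a lexer, without a parser.
Here's a quick ANTLR demo. You need the following files in the same directory: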
- antlr-3.2.jar
- JavaCommentLexer.g (the grammar)
- Main.java
- Test.java (a valid (!) Java source file with exotic comments)
JavaCommentLexer.g

    lexer grammar JavaCommentLexer;

    // filter mode: input that matches no rule is silently skipped
    options {
      filter=true;
    }

    SingleLineComment
      :  FSlash FSlash ~('\r' | '\n')*
      ;

    MultiLineComment
      :  FSlash Star .* Star FSlash
      ;

    StringLiteral
      :  DQuote
         ( (EscapedDQuote)=> EscapedDQuote
         | (EscapedBSlash)=> EscapedBSlash
         | Octal
         | Unicode
         | ~('\\' | '"' | '\r' | '\n')
         )*
         DQuote {skip();}
      ;

    CharLiteral
      :  SQuote
         ( (EscapedSQuote)=> EscapedSQuote
         | (EscapedBSlash)=> EscapedBSlash
         | Octal
         | Unicode
         | ~('\\' | '\'' | '\r' | '\n')
         )
         SQuote {skip();}
      ;

    fragment EscapedDQuote
      :  BSlash DQuote
      ;

    fragment EscapedSQuote
      :  BSlash SQuote
      ;

    fragment EscapedBSlash
      :  BSlash BSlash
      ;

    fragment FSlash
      :  '/' | '\\' ('u002f' | 'u002F')
      ;

    fragment Star
      :  '*' | '\\' ('u002a' | 'u002A')
      ;

    fragment BSlash
      :  '\\' ('u005c' | 'u005C')?
      ;

    fragment DQuote
      :  '"'
      |  '\\u0022'
      ;

    fragment SQuote
      :  '\''
      |  '\\u0027'
      ;

    fragment Unicode
      :  '\\u' Hex Hex Hex Hex
      ;

    fragment Octal
      :  '\\' ('0'..'3' Oct Oct | Oct Oct | Oct)
      ;

    fragment Hex
      :  '0'..'9' | 'a'..'f' | 'A'..'F'
      ;

    fragment Oct
      :  '0'..'7'
      ;

Main.java

    import org.antlr.runtime.*;

    public class Main {

      public static void main(String[] args) throws Exception {

        JavaCommentLexer lexer = new JavaCommentLexer(new ANTLRFileStream("Test.java"));
        CommonTokenStream tokens = new CommonTokenStream(lexer);

        // Print only the comment tokens; line breaks inside a token are shown as a literal \n
        for(Object o : tokens.getTokens()) {
          CommonToken t = (CommonToken)o;
          if(t.getType() == JavaCommentLexer.SingleLineComment) {
            System.out.println("SingleLineComment :: " + t.getText().replace("\n", "\\n"));
          }
          if(t.getType() == JavaCommentLexer.MultiLineComment) {
            System.out.println("MultiLineComment :: " + t.getText().replace("\n", "\\n"));
          }
        }
      }
    }

Test.java

    \u002f\u002a <- multi line comment start
    multi
    line
    comment // not a single line comment
    \u002A/
    public class Test {

      // single line "not a string"

      String s = "\u005C" \242 not // a comment \\\" \u002f \u005C\u005C \u0022;

      char c = \u0027"'; // the " is not the start of a string

      char q1 = '\u005c''; // == '\''

      char q2 = '\u005c\u0027'; // == '\''

      char q3 = \u0027\u005c\u0027\u0027; // == '\''

      char c4 = '\047';

      String t = "";

      \u002f\u002f another single line comment
    }

Now, to run the demo, do:

    bart@hades:~/Programming/ANTLR/Demos/JavaComment$ java -cp antlr-3.2.jar org.antlr.Tool JavaCommentLexer.g
    bart@hades:~/Programming/ANTLR/Demos/JavaComment$ javac -cp antlr-3.2.jar *.java
    bart@hades:~/Programming/ANTLR/Demos/JavaComment$ java -cp .:antlr-3.2.jar Main

and you'll see the following being printed to the console:

    MultiLineComment :: \u002f\u002a <- multi line comment start\nmulti\nline\ncomment // not a single line comment\n\u002A/
    SingleLineComment :: // single line "not a string"
    SingleLineComment :: // a comment \\\" \u002f \u005C\u005C \u0022;
    MultiLineComment :: 
    SingleLineComment :: // the " is not the start of a string
    SingleLineComment :: // == '\''
    SingleLineComment :: // == '\''
    SingleLineComment :: // == '\''
    SingleLineComment :: \u002f\u002f another single line comment

EDIT
You can create a sort of lexer with regex yourself, of course. The following demo does not handle Unicode literals inside source files, however:
Test2.java

    public class Test2 {

      // single line "not a string"

      String s = "\" \242 not // a comment \\\" ";

      char c = '"'; // the " is not the start of a string

      char q1 = '\''; // == '\''

      char c4 = '\047';

      String t = "";

      // another single line comment
    }

Main2.java

    import java.util.*;
    import java.io.*;
    import java.util.regex.*;

    public class Main2 {

      // Reads the whole file into one string, normalizing line breaks to '\n'
      private static String read(File file) throws IOException {
        StringBuilder b = new StringBuilder();
        try (Scanner scan = new Scanner(file)) {
          while(scan.hasNextLine()) {
            String line = scan.nextLine();
            b.append(line).append('\n');
          }
        }
        return b.toString();
      }

      public static void main(String[] args) throws Exception {

        String contents = read(new File("Test2.java"));

        String slComment = "//[^\r\n]*";
        String mlComment = "/\\*[\\s\\S]*?\\*/";
        String strLit = "\"(?:\\\\.|[^\\\\\"\r\n])*\"";
        String chLit = "'(?:\\\\.|[^\\\\'\r\n])+'";
        String any = "[\\s\\S]";

        // Only the two comment alternatives are captured; string literals, char
        // literals and any other single character are matched and then ignored
        Pattern p = Pattern.compile(
            String.format("(%s)|(%s)|%s|%s|%s", slComment, mlComment, strLit, chLit, any)
        );

        Matcher m = p.matcher(contents);

        while(m.find()) {
          String hit = m.group();
          if(m.group(1) != null) {
            System.out.println("SingleLine :: " + hit.replace("\n", "\\n"));
          }
          if(m.group(2) != null) {
            System.out.println("MultiLine :: " + hit.replace("\n", "\\n"));
          }
        }
      }
    }

If you run Main2, the following is printed to the console:

    MultiLine :: 
    SingleLine :: // single line "not a string"
    MultiLine :: 
    SingleLine :: // the " is not the start of a string
    SingleLine :: // == '\''
    SingleLine :: // another single line comment
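
The regex demo skips Unicode escapes, but one way to close that gap might be to translate the escapes before matching, which is roughly what the Java compiler does as its first step. The sketch below is only an illustration under assumptions: the class name `UnicodeEscapeDecoder` and the method `decode` are made up for this example and are not part of the demo above. It follows the rule that a backslash starts an escape only when it is preceded by an even number of contiguous backslashes, and that one or more `u` characters may follow it.

```java
public class UnicodeEscapeDecoder {

    /**
     * Hypothetical helper: replaces every eligible Unicode escape
     * (backslash, one or more 'u', four hex digits) with the character
     * it denotes and leaves everything else untouched.
     */
    public static String decode(String source) {
        StringBuilder out = new StringBuilder(source.length());
        int backslashRun = 0; // contiguous raw backslashes directly before index i
        int i = 0;
        while (i < source.length()) {
            char c = source.charAt(i);
            if (c == '\\' && backslashRun % 2 == 0) {
                int j = i + 1;
                while (j < source.length() && source.charAt(j) == 'u') {
                    j++; // the language allows several 'u' characters in a row
                }
                if (j > i + 1 && j + 4 <= source.length() && isHex(source, j)) {
                    out.append((char) Integer.parseInt(source.substring(j, j + 4), 16));
                    i = j + 4;
                    backslashRun = 0; // the previous raw character was a hex digit
                    continue;
                }
            }
            backslashRun = (c == '\\') ? backslashRun + 1 : 0;
            out.append(c);
            i++;
        }
        return out.toString();
    }

    // True when the four characters starting at 'start' are ASCII hex digits
    private static boolean isHex(String s, int start) {
        for (int k = start; k < start + 4; k++) {
            char ch = s.charAt(k);
            boolean hex = (ch >= '0' && ch <= '9')
                       || (ch >= 'a' && ch <= 'f')
                       || (ch >= 'A' && ch <= 'F');
            if (!hex) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The Java literal below is the raw text: \u002f\u002f another single line comment
        String raw = "\\u002f\\u002f another single line comment";
        System.out.println(decode(raw)); // prints "// another single line comment"
    }
}
```

In Main2, this would presumably be applied as `decode(read(new File("Test2.java")))` before the pattern is matched. Note that the matched comment text is then the decoded text, so it no longer lines up character for character with the raw file.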



