ANTLR 4 and Unicode

1 Lexers and Unicode

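As of version 4.7, ANTLR lexers fully support Unicode, i.e. code points up to U+10FFFF.

The C++, Python, Go and Swift targets support this natively and need no code changes.

Code for the Java, C# and JavaScript targets must create its input through the CharStreams factory API (for example CharStreams.fromPath() in Java). The older ANTLRInputStream only supports code points up to U+FFFF.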

public static void main(String[] args) throws IOException {
  CharStream charStream = CharStreams.fromPath(Paths.get(args[0]));
  Lexer lexer = new UnicodeLexer(charStream);
  CommonTokenStream tokens = new CommonTokenStream(lexer);
  tokens.fill();
  for (Token token : tokens.getTokens()) {
    System.out.println("Got token: " + token.toString());
  }
}
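To see why a stream indexed by 16-bit chars stops at U+FFFF, it can help to look at how Java stores a supplementary code point. The sketch below uses only the JDK, no ANTLR classes; the class name CodePointVsChar and its helper methods are made up for illustration.

```java
final class CodePointVsChar {
  // Number of UTF-16 code units (Java chars) in the string.
  static int codeUnits(String s) {
    return s.length();
  }

  // Number of Unicode code points in the string.
  static int codePoints(String s) {
    return s.codePointCount(0, s.length());
  }

  public static void main(String[] args) {
    // U+1F600 lies above U+FFFF, so Java stores it as a surrogate pair.
    String smiley = new String(Character.toChars(0x1F600));
    System.out.println("code units:  " + codeUnits(smiley));
    System.out.println("code points: " + codePoints(smiley));
    System.out.println("first code point: U+"
        + Integer.toHexString(smiley.codePointAt(0)).toUpperCase());
  }
}
```

A stream that hands out one char at a time would show the lexer two surrogate halves here rather than one symbol, which is the gap the code-point-based streams close.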

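2 Unicode code points in lexer grammars

Characters up to U+FFFF are written as \u followed by exactly four hexadecimal digits, e.g. \u0400.

Characters above U+FFFF are written in the braced form, e.g. \u{12345}.

Lexer rules can also refer to Unicode properties with \p{...}, or \P{...} for the negation. Because a property denotes a set of characters, it can only be used inside a lexer character set ([...]).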

CYRILLIC : '\u0400'..'\u04FF' ; // or [\u0400-\u04FF] without quotes
EMOTICONS : ('\u{1F600}' | '\u{1F602}' | '\u{1F615}') ; // or [\u{1F600}\u{1F602}\u{1F615}]
EMOJI : [\p{Emoji}] ;
JAPANESE : [\p{Script=Hiragana}\p{Script=Katakana}\p{Script=Han}] ;
NOT_CYRILLIC : [\P{Script=Cyrillic}] ;

See lexer-rules.md for details.
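For intuition about what a script-based character set matches, the JDK offers a comparable lookup through Character.UnicodeScript. The sketch below mirrors the JAPANESE and NOT_CYRILLIC rules in plain Java; it is not how ANTLR evaluates \p{...}, the class and method names are hypothetical, and the JDK's Unicode tables may be a different Unicode version from the one ANTLR ships with, so edge cases can differ.

```java
final class ScriptCheck {
  // Rough analogue of [\p{Script=Hiragana}\p{Script=Katakana}\p{Script=Han}].
  static boolean isJapanese(int codePoint) {
    Character.UnicodeScript script = Character.UnicodeScript.of(codePoint);
    return script == Character.UnicodeScript.HIRAGANA
        || script == Character.UnicodeScript.KATAKANA
        || script == Character.UnicodeScript.HAN;
  }

  // Rough analogue of [\P{Script=Cyrillic}].
  static boolean isNotCyrillic(int codePoint) {
    return Character.UnicodeScript.of(codePoint) != Character.UnicodeScript.CYRILLIC;
  }

  public static void main(String[] args) {
    int[] samples = {0x3042, 0x30AB, 0x6F22, 0x0414, 'A'};
    for (int cp : samples) {
      System.out.printf("U+%04X japanese=%b notCyrillic=%b%n",
          cp, isJapanese(cp), isNotCyrillic(cp));
    }
  }
}
```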

3 Migration

// Version 4.6: uses the default encoding of the calling environment
CharStream input = new ANTLRFileStream("myinputfile");
JavaLexer lexer = new JavaLexer(input);
CommonTokenStream tokens = new CommonTokenStream(lexer);

// Version 4.7: defaults to UTF-8; pass a Charset explicitly for anything else
CharStream input = CharStreams.fromFileName("inputfile");
// CharStream input = CharStreams.fromFileName("inputfile", Charset.forName("windows-1252"));
JavaLexer lexer = new JavaLexer(input);
CommonTokenStream tokens = new CommonTokenStream(lexer);
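The change of default matters for files saved in a legacy encoding. The sketch below decodes the same bytes twice with plain JDK calls to show what goes wrong when the charset is not the one the file was written in. It uses ISO-8859-1 as a stand-in for windows-1252 only because every JDK is guaranteed to provide it; the byte 0xE9 means "é" in both. The class name DecodeDemo is hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DecodeDemo {
  // Decodes raw bytes with the given charset; malformed input is replaced
  // with U+FFFD rather than throwing.
  static String decode(byte[] bytes, Charset charset) {
    return new String(bytes, charset);
  }

  public static void main(String[] args) {
    // "caf" followed by 0xE9, as a legacy single-byte editor would save "café".
    byte[] raw = {'c', 'a', 'f', (byte) 0xE9};

    String legacy = decode(raw, StandardCharsets.ISO_8859_1);
    String utf8 = decode(raw, StandardCharsets.UTF_8);

    System.out.println("as ISO-8859-1: " + legacy);
    System.out.println("as UTF-8 contains U+FFFD: " + (utf8.indexOf('\uFFFD') >= 0));
  }
}
```

If input that lexed correctly under 4.6 starts producing unexpected tokens after the switch, passing the original charset to the stream factory, as in the commented-out line above, is the first thing to check.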

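(1) Motivation

Modifying the existing API would have cost more than adding a new one.
The new API performs at least as well as the existing one.

Avoid mixing the old and new APIs in the same application; doing so degrades performance.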

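(2) Existing grammars

Migrate the parser first, then the lexer.

Before version 4.7, application code could pass Token.getStartIndex() and Token.getStopIndex() directly to Java or C# string APIs, because both languages use UTF-16 as the underlying string representation.

With the new input streams these indices count code points, so they must be converted to UTF-16 offsets before being used on a string. Example logic: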

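public final class CodePointCounter {
  private final String input;
  public int inputIndex = 0;      // UTF-16 code-unit offset into input
  public int codePointIndex = 0;  // code-point index corresponding to inputIndex

  public CodePointCounter(String input) {
    this.input = input;
  }

  // Advances to the given code-point index and returns the matching UTF-16 offset.
  public int advanceToIndex(int newCodePointIndex) {
    assert newCodePointIndex >= codePointIndex;
    while (codePointIndex < newCodePointIndex) {
      int codePoint = Character.codePointAt(input, inputIndex);
      inputIndex += Character.charCount(codePoint);
      codePointIndex++;
    }
    return inputIndex;
  }
}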
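As a cross-check of the conversion, the JDK's String.offsetByCodePoints performs the same mapping in one call. The sketch below uses it to cut a token's text out of a Java string given code-point start and stop indices. The helper names are hypothetical, and the stop index is assumed to be inclusive, as with Token.getStopIndex().

```java
final class Utf16Offsets {
  // Maps a code-point index to the corresponding UTF-16 offset in input.
  static int toUtf16Offset(String input, int codePointIndex) {
    return input.offsetByCodePoints(0, codePointIndex);
  }

  // Extracts the text between two code-point indices (stop is inclusive).
  static String tokenText(String input, int startIndex, int stopIndex) {
    int begin = toUtf16Offset(input, startIndex);
    int end = toUtf16Offset(input, stopIndex + 1);
    return input.substring(begin, end);
  }

  public static void main(String[] args) {
    // Code points: 'a', U+1F600, 'b', 'c'. The emoji takes two chars.
    String input = "a" + new String(Character.toChars(0x1F600)) + "bc";
    System.out.println(toUtf16Offset(input, 2));
    System.out.println(tokenText(input, 2, 3));
  }
}
```

Each call here rescans from the start of the string, so for many tokens in increasing order an incremental counter that remembers its last position does less work.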

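(3) Buffered and unbuffered character streams

By default, ANTLR buffers the entire input when the stream is created.

Currently only the Java, C++ and C# targets support unbuffered streams.

Note that unbuffered streams interfere with parse-tree construction. A parse tree has to refer to its tokens, and each token must either point at a position in the stream, which an unbuffered stream may already have discarded, or carry its own copy of the text.

Unbuffered streams are mainly meant for unbounded input and require the application to manage buffering manually.

The following example uses UnbufferedCharStream and UnbufferedTokenStream: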

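// `is` is an already opened java.io.InputStream (not shown here)
CharStream input = new UnbufferedCharStream(is);
CSVLexer lex = new CSVLexer(input);
// Copy text out of the sliding buffer and store it in the tokens:
// new CommonTokenFactory(true) calls CommonToken.setText explicitly right after each token is created, as unbuffered streams require
lex.setTokenFactory(new CommonTokenFactory(true));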
TokenStream tokens = new UnbufferedTokenStream<CommonToken>(lex);
CSVParser parser = new CSVParser(tokens);
parser.setBuildParseTree(false);
parser.file();

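The grammar must use embedded actions to access each token as it is created, before the token disappears from the buffer, for example: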

data : a=INT {int x = Integer.parseInt($a.text);} ;
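The same discipline, using each piece of text before the underlying buffer moves on, can be illustrated without ANTLR. The sketch below is a plain-JDK analogy, not ANTLR code: it reads comma- or newline-separated integers from a Reader, converts each field as soon as it is complete, and keeps nothing else in memory. The class name StreamingIntSum and the input format are assumptions made for the example.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

final class StreamingIntSum {
  // Sums integers separated by ',' or '\n', holding only the current field.
  static long sum(Reader in) {
    long total = 0;
    StringBuilder field = new StringBuilder();
    try {
      int c;
      while ((c = in.read()) != -1) {
        if (c == ',' || c == '\n') {
          total += consume(field);
        } else if (!Character.isWhitespace(c)) {
          field.append((char) c);
        }
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return total + consume(field); // last field may lack a separator
  }

  // Parses the pending field, if any, then clears it so its text is gone.
  private static long consume(StringBuilder field) {
    if (field.length() == 0) {
      return 0;
    }
    long value = Integer.parseInt(field.toString());
    field.setLength(0);
    return value;
  }

  public static void main(String[] args) {
    System.out.println(sum(new StringReader("1,2,3\n40,50\n")));
  }
}
```

As in the embedded action, the value is extracted at the moment the field is recognized; once the field buffer is cleared there is no text left to go back to.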