Class TextFormat.Tokenizer

java.lang.Object
com.google.protobuf.TextFormat.Tokenizer
Enclosing class:
TextFormat

private static final class TextFormat.Tokenizer extends Object
Represents a stream of tokens parsed from a String.

The Java standard library provides several classes that look as though they would be useful for implementing this, but turn out not to be. For example:

  • java.io.StreamTokenizer: This almost does what we want -- or, at least, something that would get us close to what we want -- except for one fatal flaw: It automatically un-escapes strings using Java escape sequences, which do not include all the escape sequences we need to support (e.g. '\x').
  • java.util.Scanner: This seems like a good way to at least match regular expressions against a stream (so we wouldn't have to load the entire input into a single string before parsing). Sadly, Scanner requires that tokens be separated by a delimiter. Thus, although the text "foo:" should parse to two tokens ("foo" and ":"), Scanner would recognize it only as a single token. Furthermore, Scanner provides no way to inspect the text of the delimiters it skips, making it impossible to keep track of line and column numbers.
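
The two limitations above can be reproduced with a few lines of Java. This is only an illustrative sketch and not part of TextFormat: the class name LibraryLimitationsDemo and its helper methods are hypothetical, and the inputs are arbitrary examples.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class LibraryLimitationsDemo {
    // Collects the tokens Scanner yields with its default (whitespace) delimiter.
    static List<String> scannerTokens(String input) {
        List<String> tokens = new ArrayList<>();
        try (Scanner scanner = new Scanner(input)) {
            while (scanner.hasNext()) {
                tokens.add(scanner.next());
            }
        }
        return tokens;
    }

    // Returns the body of the first quoted string as StreamTokenizer sees it.
    // The tokenizer has already applied its own escape handling by this point.
    static String firstQuotedString(String input) throws IOException {
        StreamTokenizer tokenizer = new StreamTokenizer(new StringReader(input));
        tokenizer.nextToken();
        return tokenizer.sval;
    }

    public static void main(String[] args) throws IOException {
        // "foo:" stays a single token because nothing delimits "foo" from ":".
        System.out.println(scannerTokens("foo: 1"));          // [foo:, 1]

        // The input is the six characters "\x41" including the quotes.
        // StreamTokenizer does not know '\x', so it drops the backslash
        // and the original escape can no longer be recovered.
        System.out.println(firstQuotedString("\"\\x41\""));   // x41
    }
}
```

In both cases the information the tokenizer needs (the boundary between "foo" and ":", or the raw escape sequence) is gone by the time the caller sees the token.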

Luckily, Java's regular expression support does manage to be useful to us. (Barely: We need Matcher.usePattern(), which is new in Java 1.5.) So, we can use that, at least. Unfortunately, a Matcher operates on a CharSequence rather than a stream, so we need to have the entire input in one contiguous string.
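
A minimal sketch of the regular-expression approach, assuming a much simpler token grammar than the real one: a single Matcher is kept over the whole input, and usePattern() switches it between a whitespace pattern and a token pattern at the current position. The class name MatcherTokenizerSketch, the two patterns, and the tokenize() helper are hypothetical and are not the actual Tokenizer implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatcherTokenizerSketch {
    // Hypothetical patterns: runs of whitespace, and either an identifier
    // or any single character (DOTALL so that '.' never fails to match).
    private static final Pattern WHITESPACE = Pattern.compile("\\s*");
    private static final Pattern TOKEN =
        Pattern.compile("[a-zA-Z_][0-9a-zA-Z_]*|.", Pattern.DOTALL);

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = WHITESPACE.matcher(text);
        int pos = 0;
        while (true) {
            // Skip whitespace starting at the current position.
            matcher.usePattern(WHITESPACE);
            matcher.region(pos, text.length());
            if (matcher.lookingAt()) {
                pos = matcher.end();
            }
            if (pos == text.length()) {
                break;
            }
            // Reuse the same Matcher with the token pattern.
            matcher.usePattern(TOKEN);
            matcher.region(pos, text.length());
            if (!matcher.lookingAt()) {
                throw new IllegalStateException("No token at offset " + pos);
            }
            tokens.add(matcher.group());
            pos = matcher.end();
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("foo: bar"));  // [foo, :, bar]
    }
}
```

Because the skipped whitespace is matched explicitly, a fuller version could count the newlines in each whitespace match to maintain line and column numbers, which is exactly what Scanner does not allow.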