Class Tokenizer


  • public class Tokenizer
    extends java.lang.Object
    Simple XML Tokenizer (SXT) performs input stream tokenizing. Advantages:
    • utility class to simplify creation of XML parsers, especially suited for the pull event model (see the usage sketch below) but can also support push (SAX2)
    • small footprint: the whole tokenizer is in one file
    • minimal memory utilization: uses no memory except for the input and content buffers (which can grow in size)
    • fast: all parsing is done in one function (a simple automaton)
    • supports most of XML 1.0 (except validation and external entities)
    • low level: supports on-demand parsing of Characters, CDSect, Comments, PIs, etc.
    • parsed content: supports providing parsed content to the application on demand (standard entities expanded, all CDATA sections inserted, Comments and PIs removed), both for attribute values and element content
    • mixed content: allows mixed content to be disabled dynamically
    • small: total compiled size is around 15K
    Limitations:
    • it is just a tokenizer: it does not enforce the XML grammar
    • readName() uses Java identifier rules, not XML name rules
    • does not parse the DOCTYPE declaration (skips everything in [...])
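    A minimal pull-style usage sketch (not taken from the library itself; it assumes the TokenizerException class from this package and that the END_DOCUMENT constant referenced by next() is accessible as Tokenizer.END_DOCUMENT):

     import java.io.IOException;
     import java.io.StringReader;

     public class PullLoopSketch {
         public static void main(String[] args) throws IOException, TokenizerException {
             Tokenizer tokenizer = new Tokenizer();
             // Any setInput() overload resets the tokenizer and sets the input source.
             tokenizer.setInput(new StringReader("<greeting lang='en'>Hello</greeting>"));

             byte token;
             // Pull tokens one at a time until the end of the document is reached.
             while ((token = tokenizer.next()) != Tokenizer.END_DOCUMENT) {
                 System.out.println("got token type: " + token);
             }
         }
     }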
    Author:
    Aleksander Slominski
    • Field Detail

      • paramNotifyCharacters

        public boolean paramNotifyCharacters
      • paramNotifyComment

        public boolean paramNotifyComment
      • paramNotifyCDSect

        public boolean paramNotifyCDSect
      • paramNotifyDoctype

        public boolean paramNotifyDoctype
      • paramNotifyPI

        public boolean paramNotifyPI
      • paramNotifyCharRef

        public boolean paramNotifyCharRef
      • paramNotifyEntityRef

        public boolean paramNotifyEntityRef
      • paramNotifyAttValue

        public boolean paramNotifyAttValue
      • buf

        public char[] buf
      • pos

        public int pos
        position of the next char that will be read from the buffer
      • posStart

        public int posStart
        Range [posStart, posEnd) defines the part of buf that is the content of the current token iff parsedContent == false
      • posEnd

        public int posEnd
      • posNsColon

        public int posNsColon
      • nsColonCount

        public int nsColonCount
      • seenContent

        public boolean seenContent
      • parsedContent

        public boolean parsedContent
        This flag decides which buffer is used to retrieve the content of the current token: if true, use pc and [pcStart, pcEnd); if false, use buf and [posStart, posEnd). See the sketch after the pcEnd field.
      • pc

        public char[] pc
        This is the buffer for parsed content, such as the actual value of an entity ('&lt;' in buf becomes '<' in pc)
      • pcStart

        public int pcStart
        Range [pcStart, pcEnd) defines the part of pc that is the content of the current token iff parsedContent == true
      • pcEnd

        public int pcEnd
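        A hedged sketch (the helper below is hypothetical, not part of the class) of how these public fields combine to obtain the text of the current token:

          // Hypothetical helper: pick the right buffer and range for the current token.
          static String currentTokenText(Tokenizer tokenizer) {
              if (tokenizer.parsedContent) {
                  // Parsed content (entities expanded etc.) lives in pc[pcStart, pcEnd).
                  return new String(tokenizer.pc, tokenizer.pcStart,
                                    tokenizer.pcEnd - tokenizer.pcStart);
              }
              // Otherwise the raw content is in buf[posStart, posEnd).
              return new String(tokenizer.buf, tokenizer.posStart,
                                tokenizer.posEnd - tokenizer.posStart);
          }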
      • lookupNameStartChar

        protected static boolean[] lookupNameStartChar
      • lookupNameChar

        protected static boolean[] lookupNameChar
    • Constructor Detail

      • Tokenizer

        public Tokenizer()
    • Method Detail

      • reset

        public void reset()
      • setInput

        public void setInput​(java.io.Reader r)
        Reset tokenizer state and set new input source
      • setInput

        public void setInput​(char[] data)
        Reset tokenizer state and set new input source
      • setInput

        public void setInput​(char[] data,
                             int off,
                             int len)
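        The Reader and char[] overloads are documented to reset the tokenizer state and set a new input source; assuming the range-based overload behaves the same, one instance can be reused for several documents, e.g.:

          Tokenizer tokenizer = new Tokenizer();
          tokenizer.setInput(new java.io.StringReader("<doc>first</doc>"));
          // ... tokenize the first document ...

          char[] data = "<doc>second</doc>".toCharArray();
          tokenizer.setInput(data, 0, data.length);   // reuse the same instance
          // ... tokenize the second document ...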
      • setNotifyAll

        public void setNotifyAll​(boolean enable)
        Set notification of all XML content tokens: Characters, Comment, CDSect, Doctype, PI, EntityRef, CharRef and AttValue (tokens for STag, ETag and Attribute are always sent).
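        For example (treating the public paramNotify* fields as individually togglable is an assumption based on their names):

          tokenizer.setNotifyAll(true);          // also report Characters, Comment, CDSect,
                                                 // Doctype, PI, EntityRef, CharRef and AttValue
          tokenizer.paramNotifyDoctype = false;  // the paramNotify* fields are public, so a
                                                 // single token kind can presumably be
                                                 // switched back off selectively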
      • setParseContent

        public void setParseContent​(boolean enable)
        Allow reporting parsed content for element content and attribute content (no need to deal with the low-level tokens enabled by setNotifyAll).
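        A short sketch (assuming a tokenizer set up as in the class-level example):

          tokenizer.setParseContent(true);
          // After this, content tokens may set parsedContent == true, in which case the
          // expanded text is in pc[pcStart, pcEnd) instead of buf[posStart, posEnd).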
      • isAllowedMixedContent

        public boolean isAllowedMixedContent()
      • setAllowedMixedContent

        public void setAllowedMixedContent​(boolean enable)
        Set support for mixed content. If mixed content is disabled, the tokenizer will do its best to ensure that no element has a mixed content model; ignorable whitespace will also not be reported as element content.
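        For example, for data-oriented XML where elements are expected to contain either child elements or character data but not both:

          if (tokenizer.isAllowedMixedContent()) {
              tokenizer.setAllowedMixedContent(false);  // reject mixed content and do not
                                                        // report ignorable whitespace
          }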
      • getSoftLimit

        public int getSoftLimit()
      • setSoftLimit

        public void setSoftLimit​(int value)
                          throws TokenizerException
        Set a soft limit on the internal buffer size, i.e. the suggested size that the tokenizer will try to keep.
        Throws:
        TokenizerException
      • getHardLimit

        public int getHardLimit()
      • setHardLimit

        public void setHardLimit​(int value)
                          throws TokenizerException
        Set a hard limit on the internal buffer size: if the input (such as element content) is bigger than the hard limit, the tokenizer will throw XmlTokenizerBufferOverflowException.
        Throws:
        TokenizerException
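        A sketch of configuring both limits (the sizes are arbitrary; the exact conditions under which the setters throw are not documented here):

          try {
              tokenizer.setSoftLimit(4 * 1024);    // suggested buffer size to keep
              tokenizer.setHardLimit(64 * 1024);   // content larger than this triggers
                                                   // XmlTokenizerBufferOverflowException
          } catch (TokenizerException e) {
              throw new IllegalStateException("could not configure buffer limits", e);
          }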
      • getBufferShrinkOffset

        public int getBufferShrinkOffset()
      • isBufferShrinkable

        public boolean isBufferShrinkable()
      • getPosDesc

        public java.lang.String getPosDesc()
        Return a string describing the current position of the parser as text 'at line %d (row) and column %d (column) [seen %s...]'.
      • getLineNumber

        public int getLineNumber()
      • getColumnNumber

        public int getColumnNumber()
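        These methods are useful for error reporting, e.g. (a hedged sketch reusing the pull loop from the class description; as before, END_DOCUMENT is assumed to be accessible as Tokenizer.END_DOCUMENT):

          try {
              while (tokenizer.next() != Tokenizer.END_DOCUMENT) {
                  // ... process tokens ...
              }
          } catch (TokenizerException e) {
              System.err.println("parse error " + tokenizer.getPosDesc()
                      + " at line " + tokenizer.getLineNumber()
                      + ", column " + tokenizer.getColumnNumber());
          } catch (java.io.IOException e) {
              System.err.println("I/O error while reading input: " + e);
          }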
      • isNameStartChar

        protected boolean isNameStartChar​(char ch)
      • isNameChar

        protected boolean isNameChar​(char ch)
      • isS

        protected boolean isS​(char ch)
        Determine if ch is whitespace ([3] S)
      • next

        public byte next()
                  throws TokenizerException,
                         java.io.IOException
        Return the next recognized token, or END_DOCUMENT if there is no more input.

        This is a simple automaton (in pseudo-code):

         byte next() {
            while(state != END_DOCUMENT) {
              ch = more();  // read character from input
              state = func(ch, state); // do transition
              if(state is accepting)
                return state;  // return token to caller
            }
            return END_DOCUMENT;  // no more input
         }
         

        For speed (and simplicity?) it uses a few helper procedures such as readName() or isS().

        Throws:
        TokenizerException
        java.io.IOException