Interface HTMLParser

    • Method Summary

      Modifier and Type Method Description
      java.lang.String getCleanedText(java.lang.String string)
      Removes any string artifacts placed in the text by the parser.
      void parse(java.net.URL baseURL, java.lang.String pageText, DocumentAdapter adapter)
      Parses the specified text string as a Document, registering it in the HTMLPage.
      boolean supportsForceTagCase()
      Returns true if this parser supports forcing the upper/lower case of tag and attribute names.
      boolean supportsParserWarnings()
      Returns true if this parser can display parser warnings.
      boolean supportsPreserveTagCase()
      Returns true if this parser supports preservation of the case of tag and attribute names.
      boolean supportsReturnHTMLDocument()
      Returns true if this parser can return an HTMLDocument object.
    • Method Detail

      • parse

        void parse(java.net.URL baseURL,
                   java.lang.String pageText,
                   DocumentAdapter adapter)
            throws java.io.IOException,
                   org.xml.sax.SAXException
        Parses the specified text string as a Document, registering it in the HTMLPage. Any errors reported during parsing are annotated with the specified URL.
        Throws:
        java.io.IOException
        org.xml.sax.SAXException
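        As an illustration of the two checked exceptions declared above, the following is a minimal sketch of a caller that distinguishes them. It does not use the real library types: `Adapter` is a hypothetical stand-in for DocumentAdapter (whose methods are not shown in this section), `Parser` is a local mirror of the parse signature, and the stub parser and example.com URL are assumptions made only so the example runs on its own.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URL;
import org.xml.sax.SAXException;

public class ParseCallSketch {
    /** Hypothetical stand-in for DocumentAdapter; the real interface is not shown here. */
    public interface Adapter {}

    /** Local mirror of the parse signature documented above (an assumption, not the library type). */
    public interface Parser {
        void parse(URL baseURL, String pageText, Adapter adapter)
                throws IOException, SAXException;
    }

    /** Calls parse and reports which of the documented outcomes occurred. */
    public static String tryParse(Parser parser, String pageText) {
        try {
            // Placeholder base URL; toURL() may throw MalformedURLException, an IOException.
            URL base = URI.create("http://example.com/").toURL();
            parser.parse(base, pageText, new Adapter() {});
            return "parsed";
        } catch (SAXException e) {
            return "markup error: " + e.getMessage();
        } catch (IOException e) {
            return "I/O error: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Stub parser that rejects empty input, for illustration only.
        Parser stub = (baseURL, pageText, adapter) -> {
            if (pageText.isEmpty()) {
                throw new SAXException("empty document");
            }
        };
        System.out.println(tryParse(stub, "<html></html>"));
        System.out.println(tryParse(stub, ""));
    }
}
```

        Catching the two exceptions separately lets a caller tell a malformed document apart from a failure to read input. Whether a given parser actually raises either one depends on the implementation.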
      • getCleanedText

        java.lang.String getCleanedText(java.lang.String string)
        Removes any string artifacts placed in the text by the parser. For example, a parser may choose to encode an HTML entity as a special character. This method should convert that character to normal text.
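        A minimal sketch of one possible implementation follows. It assumes a hypothetical parser that encodes the `&nbsp;` entity as the non-breaking space character U+00A0. The class name, the `NBSP` constant, and the choice to return an empty string for null input are all assumptions for illustration, not behavior stated by this interface.

```java
public class CleanedTextSketch {
    // Assumption: the hypothetical parser represents &nbsp; as U+00A0.
    private static final char NBSP = '\u00A0';

    /** Replaces the parser's special character with an ordinary space. */
    public static String getCleanedText(String string) {
        // Returning "" for null is a choice made for this sketch only.
        return (string == null) ? "" : string.replace(NBSP, ' ');
    }

    public static void main(String[] args) {
        System.out.println(getCleanedText("Hello\u00A0world"));
    }
}
```

        A parser that leaves no artifacts in the text could simply return its argument unchanged.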
      • supportsPreserveTagCase

        boolean supportsPreserveTagCase()
        Returns true if this parser supports preservation of the case of tag and attribute names.
      • supportsForceTagCase

        boolean supportsForceTagCase()
        Returns true if this parser supports forcing the upper/lower case of tag and attribute names.
      • supportsReturnHTMLDocument

        boolean supportsReturnHTMLDocument()
        Returns true if this parser can return an HTMLDocument object.
      • supportsParserWarnings

        boolean supportsParserWarnings()
        Returns true if this parser can display parser warnings.
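        The four supports methods above let a caller check a capability before relying on it. The sketch below shows that pattern under stated assumptions: `Capabilities` is a local mirror of the four methods rather than the library interface, and `FixedCapabilities` is a hypothetical parser description used only to make the example runnable.

```java
import java.util.ArrayList;
import java.util.List;

public class CapabilitySketch {
    /** Local mirror of the four capability methods documented above. */
    public interface Capabilities {
        boolean supportsPreserveTagCase();
        boolean supportsForceTagCase();
        boolean supportsReturnHTMLDocument();
        boolean supportsParserWarnings();
    }

    /** Hypothetical parser description with fixed answers, for illustration only. */
    public static class FixedCapabilities implements Capabilities {
        private final boolean preserve, force, htmlDocument, warnings;

        public FixedCapabilities(boolean preserve, boolean force,
                                 boolean htmlDocument, boolean warnings) {
            this.preserve = preserve;
            this.force = force;
            this.htmlDocument = htmlDocument;
            this.warnings = warnings;
        }

        public boolean supportsPreserveTagCase() { return preserve; }
        public boolean supportsForceTagCase() { return force; }
        public boolean supportsReturnHTMLDocument() { return htmlDocument; }
        public boolean supportsParserWarnings() { return warnings; }
    }

    /** Lists the optional features a parser reports as supported. */
    public static List<String> supportedFeatures(Capabilities p) {
        List<String> features = new ArrayList<>();
        if (p.supportsPreserveTagCase()) features.add("preserve tag case");
        if (p.supportsForceTagCase()) features.add("force tag case");
        if (p.supportsReturnHTMLDocument()) features.add("return HTMLDocument");
        if (p.supportsParserWarnings()) features.add("parser warnings");
        return features;
    }

    public static void main(String[] args) {
        Capabilities parser = new FixedCapabilities(true, false, false, true);
        System.out.println(supportedFeatures(parser));
    }
}
```

        Checking first lets calling code skip or report an unsupported option instead of assuming every parser behaves the same way.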