class Oga::XML::Lexer

Low-level lexer that supports both XML and HTML (using an extra option). To lex HTML input, set the `:html` option to `true` when creating an instance of the lexer:

lexer = Oga::XML::Lexer.new('...', :html => true)

This lexer can process both String and IO instances. IO instances are processed on a line-by-line basis. This can greatly reduce memory usage in exchange for a slightly slower runtime.

## Thread Safety

Because this class keeps track of internal state, the same instance cannot be used by multiple threads at the same time. For example, the following will not work reliably:

# Don't do this!
lexer   = Oga::XML::Lexer.new('....')
threads = []

2.times do
  threads << Thread.new do
    lexer.advance do |*args|
      p args
    end
  end
end

threads.each(&:join)

However, it is perfectly safe to use a different instance per thread. There is no global state used by this lexer.
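The safe pattern can be sketched as one lexer instance per thread. So that the snippet runs without the oga gem, `ToyLexer` is a hypothetical stand-in that only mimics the relevant property (mutable per-instance state); with Oga you would build an `Oga::XML::Lexer` inside each thread instead.

```ruby
# ToyLexer is a hypothetical stand-in, NOT Oga's lexer. It only mimics the
# relevant property: every instance keeps its own mutable state (@line).
class ToyLexer
  def initialize(data)
    @data = data
    @line = 1
  end

  # Yields type, value and line number for every line of input.
  def advance
    @data.each_line do |text|
      yield :T_TEXT, text.chomp, @line
      @line += 1
    end
  end
end

inputs  = ["a\nb", "c\nd"]
results = Array.new(inputs.length)

threads = inputs.each_with_index.map do |input, index|
  Thread.new do
    # Each thread builds its own instance, so no lexer state is shared.
    lexer  = ToyLexer.new(input)
    tokens = []

    lexer.advance { |*args| tokens << args }

    results[index] = tokens
  end
end

threads.each(&:join)
```

Each thread writes to its own slot of `results`, so the only shared object is the outer Array, and no two threads touch the same lexer.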

## Strict Mode

By default the lexer is rather permissive regarding the input. For example, missing closing tags are inserted by default. To disable this behaviour the lexer can be run in "strict mode" by setting `:strict` to `true`:

lexer = Oga::XML::Lexer.new('...', :strict => true)

Strict mode only applies to XML documents.

@private

Constants

HTML_CLOSE_SELF

Elements that should be closed automatically before a new opening tag is processed.

HTML_SCRIPT

These are all constant/frozen to remove the need for String allocations every time they are referenced in the lexer.

HTML_SCRIPT_ELEMENTS
HTML_STYLE
HTML_TABLE_ALLOWED

Elements that are allowed directly in a <table> element.

HTML_TABLE_ROW_ELEMENTS

The elements that may occur in a thead, tbody, or tfoot.

Technically “th” is not allowed per the HTML5 spec, but it’s so commonly used in these elements that we allow it anyway.

LITERAL_HTML_ELEMENTS

Names of HTML tags of which the content should be lexed as-is.

Public Class Methods

new(data, options = {})

@param [String|IO] data The data to lex. This can either be a String or an IO instance.

@param [Hash] options

@option options [TrueClass|FalseClass] :html When set to `true` the lexer will treat the input as HTML instead of XML. This makes it possible to lex HTML void elements such as `<link href="">`.

@option options [TrueClass|FalseClass] :strict Enables/disables strict parsing of XML documents, disabled by default.
# File lib/oga/xml/lexer.rb, line 115
def initialize(data, options = {})
  @data   = data
  @html   = options[:html]
  @strict = options[:strict] || false
  @line     = 1
  @elements = []
  reset_native
end

Public Instance Methods

advance(&block)

Advances through the input and generates the corresponding tokens. Each token is yielded to the supplied block.

Each token consists of three values, yielded in the following order:

[TYPE, VALUE, LINE]

The type is a Symbol, the value is either nil or a String, and the line is the lexer's current line number.

This method stores the supplied block in `@block` and resets it after the lexer loop has finished.

@yieldparam [Symbol] type
@yieldparam [String] value
@yieldparam [Fixnum] line

# File lib/oga/xml/lexer.rb, line 172
def advance(&block)
  @block = block

  read_data do |chunk|
    advance_native(chunk)
  end

  # Add any missing closing tags
  if !strict? and !@elements.empty?
    @elements.length.times { on_element_end }
  end
ensure
  @block = nil
end
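The two behaviours of `advance` shown above (the block is always reset through `ensure`, and missing closing tags are only added outside strict mode) can be illustrated in isolation. `MiniLexer` and its `open_tags` argument are hypothetical and stand in for the native lexer, which is not reproduced here.

```ruby
# Hypothetical miniature of the pattern used by `advance`; not Oga's code.
class MiniLexer
  attr_reader :block

  def initialize(strict = false)
    @strict   = strict
    @elements = []
    @block    = nil
  end

  def advance(open_tags, &block)
    @block = block

    # Pretend the native lexer saw these opening tags and no closing tags.
    open_tags.each do |name|
      @elements << name
      @block.call(:T_ELEM_NAME, name)
    end

    # Add any missing closing tags, unless strict mode is enabled.
    unless @strict
      @elements.length.times do
        @block.call(:T_ELEM_END, nil)
        @elements.pop
      end
    end
  ensure
    # Runs even when the block raises, so no stale block is kept around.
    @block = nil
  end
end

permissive = []
MiniLexer.new(false).advance(%w[a b]) { |type, value| permissive << [type, value] }

strict = []
MiniLexer.new(true).advance(%w[a b]) { |type, value| strict << [type, value] }
```

In the permissive run two `:T_ELEM_END` tokens follow the two element names; in the strict run only the element names are emitted.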
html?()

@return [TrueClass|FalseClass]

# File lib/oga/xml/lexer.rb, line 188
def html?
  @html == true
end
html_script?()

@return [TrueClass|FalseClass]

# File lib/oga/xml/lexer.rb, line 198
def html_script?
  html? && current_element == HTML_SCRIPT
end
html_style?()

@return [TrueClass|FalseClass]

# File lib/oga/xml/lexer.rb, line 203
def html_style?
  html? && current_element == HTML_STYLE
end
lex()

Gathers all the tokens for the input and returns them as an Array.

@see advance
@return [Array]

# File lib/oga/xml/lexer.rb, line 147
def lex
  tokens = []

  advance do |type, value, line|
    tokens << [type, value, line]
  end

  tokens
end
read_data() { |data| ... }

Yields the data to lex to the supplied block.

@return [String]
@yieldparam [String]

# File lib/oga/xml/lexer.rb, line 128
def read_data
  if @data.is_a?(String)
    yield @data

  # IO, StringIO, etc
  # THINK: read(N) would be nice, but currently this screws up the C code
  elsif @data.respond_to?(:each_line)
    @data.each_line { |line| yield line }

  # Enumerator, Array, etc
  elsif @data.respond_to?(:each)
    @data.each { |chunk| yield chunk }
  end
end
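The dispatch above can be tried outside the lexer with a stand-alone copy. `each_chunk` is a hypothetical name used only for this sketch; the branching mirrors `read_data`.

```ruby
require 'stringio'

# Stand-alone copy of the dispatch in `read_data`; `each_chunk` is a
# hypothetical name and not part of Oga.
def each_chunk(data)
  if data.is_a?(String)
    # Strings are handed over in one piece.
    yield data
  elsif data.respond_to?(:each_line)
    # IO, StringIO, etc. are read line by line.
    data.each_line { |line| yield line }
  elsif data.respond_to?(:each)
    # Enumerator, Array, etc. are passed through chunk by chunk.
    data.each { |chunk| yield chunk }
  end
end

from_string = []
each_chunk("<a>\n</a>") { |chunk| from_string << chunk }

from_io = []
each_chunk(StringIO.new("<a>\n</a>")) { |chunk| from_io << chunk }

from_array = []
each_chunk(["<a>", "</a>"]) { |chunk| from_array << chunk }
```

A String arrives as a single chunk, while the same content wrapped in a StringIO arrives as two lines, which is where the memory saving for IO input comes from.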
strict?()

@return [TrueClass|FalseClass]

# File lib/oga/xml/lexer.rb, line 193
def strict?
  @strict
end

Private Instance Methods

add_element(name)

@param [String] name

# File lib/oga/xml/lexer.rb, line 381
def add_element(name)
  @elements << name

  add_token(:T_ELEM_NAME, name)
end
add_token(type, value = nil)

Calls the supplied block with the information of the current token.

@param [Symbol] type The token type.
@param [String] value The token value.

@yieldparam [Symbol] type
@yieldparam [String] value
@yieldparam [Fixnum] line

# File lib/oga/xml/lexer.rb, line 222
def add_token(type, value = nil)
  @block.call(type, value, @line)
end
advance_line(amount = 1)

@param [Fixnum] amount The amount of lines to advance.

# File lib/oga/xml/lexer.rb, line 210
def advance_line(amount = 1)
  @line += amount
end
before_html_element_name(name)

Handles inserting of any missing tags whenever a new HTML tag is opened.

@param [String] name

# File lib/oga/xml/lexer.rb, line 361
def before_html_element_name(name)
  close_current = HTML_CLOSE_SELF[current_element]

  if close_current and !close_current.allow?(name)
    on_element_end
  end

  # Close remaining parent elements. This for example ensures that a
  # "<tbody>" not only closes an unclosed "<th>" but also the surrounding,
  # unclosed "<tr>".
  while close_current = HTML_CLOSE_SELF[current_element]
    if close_current.allow?(name)
      break
    else
      on_element_end
    end
  end
end
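The auto-closing logic above can be modelled on a plain Array used as the element stack. Everything named here is hypothetical: `Only`, `Except`, `open_element` and the rule table are made up for illustration, and the contents of Oga's real `HTML_CLOSE_SELF` are not shown on this page. The sketch also condenses the initial check and the loop into a single loop.

```ruby
# Hypothetical rule objects that respond to `allow?`, as the values of
# HTML_CLOSE_SELF do. These are not Oga's classes.
class Only
  def initialize(*names)
    @names = names
  end

  def allow?(name)
    @names.include?(name)
  end
end

class Except
  def initialize(*names)
    @names = names
  end

  def allow?(name)
    !@names.include?(name)
  end
end

# Illustrative rules only: which child names each open element tolerates.
rules = {
  'tbody' => Except.new('tbody'),
  'tr'    => Only.new('td', 'th'),
  'td'    => Except.new('td', 'th', 'tr', 'tbody'),
  'th'    => Except.new('td', 'th', 'tr', 'tbody'),
  'li'    => Except.new('li')
}

# Closes every open element that does not allow `name`, then opens `name`.
# Returns the names that were closed implicitly.
def open_element(elements, name, rules)
  closed = []

  while (rule = rules[elements.last])
    break if rule.allow?(name)

    closed << elements.pop
  end

  elements << name
  closed
end

table_stack  = %w[table tbody tr th]
table_closed = open_element(table_stack, 'tbody', rules)

list_stack  = %w[ul li]
list_closed = open_element(list_stack, 'li', rules)
```

With these made-up rules, opening a second `tbody` closes the unclosed `th`, the surrounding `tr`, and the previous `tbody` before the new element is pushed.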
current_element()

Returns the name of the element we’re currently in.

@return [String]

# File lib/oga/xml/lexer.rb, line 229
def current_element
  @elements.last
end
on_attribute(value)

Called on tag attributes.

@param [String] value

# File lib/oga/xml/lexer.rb, line 448
def on_attribute(value)
  add_token(:T_ATTR, value)
end
on_attribute_ns(value)

Called on attribute namespaces.

@param [String] value

# File lib/oga/xml/lexer.rb, line 441
def on_attribute_ns(value)
  add_token(:T_ATTR_NS, value)
end
on_cdata_body(value)

Called for the body of a CDATA tag.

@param [String] value

# File lib/oga/xml/lexer.rb, line 294
def on_cdata_body(value)
  add_token(:T_CDATA_BODY, value)
end
on_cdata_end()

Called on the closing CDATA tag.

# File lib/oga/xml/lexer.rb, line 287
def on_cdata_end
  add_token(:T_CDATA_END)
end
on_cdata_start()

Called on the opening CDATA tag.

# File lib/oga/xml/lexer.rb, line 282
def on_cdata_start
  add_token(:T_CDATA_START)
end
on_comment_body(value)

Called on a comment.

@param [String] value

# File lib/oga/xml/lexer.rb, line 311
def on_comment_body(value)
  add_token(:T_COMMENT_BODY, value)
end
on_comment_end()

Called on the closing comment tag.

# File lib/oga/xml/lexer.rb, line 304
def on_comment_end
  add_token(:T_COMMENT_END)
end
on_comment_start()

Called on the opening comment tag.

# File lib/oga/xml/lexer.rb, line 299
def on_comment_start
  add_token(:T_COMMENT_START)
end
on_doctype_end()

Called on the end of a doctype.

# File lib/oga/xml/lexer.rb, line 270
def on_doctype_end
  add_token(:T_DOCTYPE_END)
end
on_doctype_inline(value)

Called on an inline doctype block.

@param [String] value

# File lib/oga/xml/lexer.rb, line 277
def on_doctype_inline(value)
  add_token(:T_DOCTYPE_INLINE, value)
end
on_doctype_name(value)

Called on the identifier specifying the name of the doctype.

@param [String] value

# File lib/oga/xml/lexer.rb, line 265
def on_doctype_name(value)
  add_token(:T_DOCTYPE_NAME, value)
end
on_doctype_start()

Called when a doctype starts.

# File lib/oga/xml/lexer.rb, line 251
def on_doctype_start
  add_token(:T_DOCTYPE_START)
end
on_doctype_type(value)

Called on the identifier specifying the type of the doctype.

@param [String] value

# File lib/oga/xml/lexer.rb, line 258
def on_doctype_type(value)
  add_token(:T_DOCTYPE_TYPE, value)
end
on_element_end(name = nil)

Called on the closing tag of an element.

@param [String] name The name of the element (minus namespace prefix). This is not set for self closing tags.
# File lib/oga/xml/lexer.rb, line 411
def on_element_end(name = nil)
  return if @elements.empty?

  if html? and name and @elements.include?(name)
    while current_element != name
      add_token(:T_ELEM_END)
      @elements.pop
    end
  end

  # Prevents a superfluous end tag of a self-closing HTML tag from
  # closing its parent element prematurely
  return if html? && name && name != current_element

  add_token(:T_ELEM_END)
  @elements.pop
end
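The closing-tag handling above can be followed more easily on a plain Array. In this sketch `close_element` is a hypothetical stand-alone function that returns the names it closes instead of emitting `:T_ELEM_END` tokens; the branching follows `on_element_end`.

```ruby
# Stand-alone model of `on_element_end`; `close_element` is a hypothetical
# name. `elements` is the stack of open element names.
def close_element(elements, name = nil, html: true)
  closed = []
  return closed if elements.empty?

  # In HTML mode a closing tag for an outer element first closes every
  # element still open inside it.
  if html && name && elements.include?(name)
    closed << elements.pop while elements.last != name
  end

  # A stray closing tag (for example for an already closed void element)
  # must not close the current parent.
  return closed if html && name && name != elements.last

  closed << elements.pop
  closed
end

nested        = %w[div p b]
nested_closed = close_element(nested, 'div')

stray        = %w[div]
stray_closed = close_element(stray, 'br')

xml        = %w[a b]
xml_closed = close_element(xml, 'x', html: false)
```

The third call shows the difference in XML mode: the name is not compared against the stack, so the innermost element is closed regardless.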
on_element_name(name)

Called on the name of an element.

@param [String] name The name of the element, including namespace.

# File lib/oga/xml/lexer.rb, line 352
def on_element_name(name)
  before_html_element_name(name) if html?

  add_element(name)
end
on_element_ns(namespace)

Called on the element namespace.

@param [String] namespace

# File lib/oga/xml/lexer.rb, line 390
def on_element_ns(namespace)
  add_token(:T_ELEM_NS, namespace)
end
on_element_open_end()

Called on the closing `>` of the open tag of an element.

# File lib/oga/xml/lexer.rb, line 395
def on_element_open_end
  return unless html?

  # Only downcase the name if we can't find an all lower/upper version of
  # the element name. This can save us a *lot* of String allocations.
  if HTML_VOID_ELEMENTS.allow?(current_element) \
  or HTML_VOID_ELEMENTS.allow?(current_element.downcase)
    add_token(:T_ELEM_END)
    @elements.pop
  end
end
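The lookup order in the comment above (try the name as written, downcase only as a fallback) can be sketched as follows. `HTML_VOID_ELEMENTS` is not documented on this page, so the sketch assumes it holds lowercase and uppercase variants of each name; `VOID_NAMES` is an illustrative subset, and `VOID_LOOKUP`, `void_element?` and `finish_open_tag` are hypothetical.

```ruby
require 'set'

# Illustrative subset of HTML void elements; not Oga's actual list.
VOID_NAMES = %w[br hr img input link meta].freeze

# Assumption: both all-lowercase and all-uppercase spellings are stored, so
# the common cases need no new String.
VOID_LOOKUP = Set.new(VOID_NAMES + VOID_NAMES.map(&:upcase)).freeze

def void_element?(name)
  # `downcase` allocates a new String, so it is only reached for mixed case.
  VOID_LOOKUP.include?(name) || VOID_LOOKUP.include?(name.downcase)
end

# Pops the current element when it is void. Returns the closed names.
def finish_open_tag(elements)
  return [] if elements.empty?
  return [] unless void_element?(elements.last)

  [elements.pop]
end

stack  = %w[p img]
closed = finish_open_tag(stack)
```

Because `||` short-circuits, `br` and `BR` are matched by the first lookup and only a spelling such as `Br` pays for the extra String.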
on_proc_ins_body(value)

Called on the body of a processing instruction.

@param [String] value

# File lib/oga/xml/lexer.rb, line 340
def on_proc_ins_body(value)
  add_token(:T_PROC_INS_BODY, value)
end
on_proc_ins_end()

Called on the end of a processing instruction.

# File lib/oga/xml/lexer.rb, line 345
def on_proc_ins_end
  add_token(:T_PROC_INS_END)
end
on_proc_ins_name(value)

Called on a processing instruction name.

@param [String] value

# File lib/oga/xml/lexer.rb, line 333
def on_proc_ins_name(value)
  add_token(:T_PROC_INS_NAME, value)
end
on_proc_ins_start()

Called on the start of a processing instruction.

# File lib/oga/xml/lexer.rb, line 326
def on_proc_ins_start
  add_token(:T_PROC_INS_START)
end
on_string_body(value)

Called when processing the body of a string.

@param [String] value The data between the quotes.

# File lib/oga/xml/lexer.rb, line 246
def on_string_body(value)
  add_token(:T_STRING_BODY, value)
end
on_string_dquote()

Called when processing a double quote.

# File lib/oga/xml/lexer.rb, line 239
def on_string_dquote
  add_token(:T_STRING_DQUOTE)
end
on_string_squote()

Called when processing a single quote.

# File lib/oga/xml/lexer.rb, line 234
def on_string_squote
  add_token(:T_STRING_SQUOTE)
end
on_text(value)

Called on regular text values.

@param [String] value

# File lib/oga/xml/lexer.rb, line 432
def on_text(value)
  return if value.empty?

  add_token(:T_TEXT, value)
end
on_xml_decl_end()

Called on the end of an XML declaration tag.

# File lib/oga/xml/lexer.rb, line 321
def on_xml_decl_end
  add_token(:T_XML_DECL_END)
end
on_xml_decl_start()

Called on the start of an XML declaration tag.

# File lib/oga/xml/lexer.rb, line 316
def on_xml_decl_start
  add_token(:T_XML_DECL_START)
end