class Traject::ExperimentalNokogiriStreamingReader::PathTracker

Initialized with a specification (a very small subset of XPath) identifying which nodes should be yielded as records. Tests whether a Nokogiri::XML::Reader node matches that spec. A spec can be floating, matching at any depth:

'//record'

or anchored to root:

'/body/head/meta', which is the same thing as './body/head/meta' or 'body/head/meta'
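
As a rough illustration of how these spec strings are interpreted, the sketch below re-implements the normalization as a standalone function. The name normalize_spec is hypothetical and not part of Traject; it only mirrors the behavior described here.

```ruby
# Hypothetical standalone helper, not part of Traject's API.
# Returns [bare_path, floating] for a spec string.
def normalize_spec(spec)
  if spec.start_with?("//")
    # '//record' floats: it may match at any depth
    [spec[2..-1], true]
  else
    spec = spec[1..-1] if spec.start_with?(".")    # drop a leading '.'
    spec = "/" + spec unless spec.start_with?("/") # anchor to the root
    [spec, false]
  end
end

p normalize_spec("//record")          # ["record", true]
p normalize_spec("/body/head/meta")   # ["/body/head/meta", false]
p normalize_spec("./body/head/meta")  # ["/body/head/meta", false]
p normalize_spec("body/head/meta")    # ["/body/head/meta", false]
```

All three anchored forms normalize to the same root-anchored path, while the floating form keeps only the trailing component.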

Elements in the spec can (and must, to match) carry XML namespace prefixes, if and only if those namespaces are registered in the nokogiri.namespaces setting.

Sadly, Nokogiri on JRuby is incompatible with Nokogiri on MRI here: it doesn't preserve our namespaces in outer_xml. On JRuby we therefore have to track namespaces ourselves, and then also do yet ANOTHER parse in Nokogiri. This may make the reader even LESS performant on JRuby, I'm afraid.

Attributes

clipboard[R]
current_path[R]
extra_xpath_hooks[R]
inverted_namespaces[R]
namespaces_stack[R]
path_spec[R]

Public Class Methods

new(str_spec, clipboard:, namespaces: {}, extra_xpath_hooks: {})
# File lib/traject/experimental_nokogiri_streaming_reader.rb, line 133
def initialize(str_spec, clipboard:, namespaces: {}, extra_xpath_hooks: {})
  @inverted_namespaces  = namespaces.invert
  @clipboard = clipboard
  # We're guessing using a string will be more efficient than an array
  @current_path         = ""
  @floating             = false

  @path_spec, @floating = parse_path(str_spec)

  @namespaces_stack = []


  @extra_xpath_hooks = extra_xpath_hooks.collect do |path, callable|
    bare_path, floating = parse_path(path)
    {
      path: bare_path,
      floating: floating,
      callable: callable
    }
  end
end

Public Instance Methods

current_node_doc()
# File lib/traject/experimental_nokogiri_streaming_reader.rb, line 195
def current_node_doc
  return nil unless @current_node

  # yeah, sadly we got to have nokogiri parse it again
  fix_namespaces(Nokogiri::XML.parse(@current_node.outer_xml))
end

fix_namespaces(doc)

No-op unless running on JRuby. On JRuby, we use our namespace stack to add the correct namespaces to the Nokogiri::XML::Document, because on JRuby outer_xml on the Reader doesn't do it for us. :(

# File lib/traject/experimental_nokogiri_streaming_reader.rb, line 241
def fix_namespaces(doc)
  if is_jruby?
    # Only needed in jruby: nokogiri's jruby implementation doesn't handle
    # namespaces in exactly the same way as MRI. We need to keep
    # track of the namespaces in outer contexts ourselves, and then check
    # whether they are actually needed. :(
    namespaces = namespaces_stack.compact.reduce({}, :merge)
    default_ns = namespaces.delete("xmlns")

    namespaces.each_pair do |attrib, uri|
      ns_prefix = attrib.sub(/\Axmlns:/, '')

      # gotta make sure it's actually used in the doc, so we don't add it
      # unnecessarily. GAH.
      if    doc.xpath("//*[starts-with(name(), '#{ns_prefix}:')][1]").empty? &&
            doc.xpath("//@*[starts-with(name(), '#{ns_prefix}:')][1]").empty?
        next
      end
      doc.root.add_namespace_definition(ns_prefix, uri)
    end

    if default_ns
      doc.root.default_namespace = default_ns
      # OMG nokogiri, really?
      default_ns = doc.root.namespace
      doc.xpath("//*[namespace-uri()='']").each do |node|
        node.namespace = default_ns
      end
    end

  end
  return doc
end
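
The merging step at the top of fix_namespaces can be tried without Nokogiri. The following is a minimal sketch using plain hashes; the helper name merge_namespace_stack and the example.org URIs are made up for illustration, and the nil entry assumes (as the compact call suggests) that some elements contribute no namespace hash.

```ruby
# Hypothetical helper mirroring the first lines of fix_namespaces.
# stack: one hash of xmlns attributes per open element, outermost first.
# Returns [default_namespace_uri, { prefix => uri }].
def merge_namespace_stack(stack)
  # Later (inner) hashes are merged over earlier (outer) ones, so inner wins.
  merged = stack.compact.reduce({}, :merge)
  default_ns = merged.delete("xmlns")
  prefixes = merged.map { |attrib, uri| [attrib.sub(/\Axmlns:/, ""), uri] }.to_h
  [default_ns, prefixes]
end

stack = [
  { "xmlns" => "http://example.org/default", "xmlns:dc" => "http://example.org/outer-dc" },
  nil, # an element that declared nothing
  { "xmlns:dc" => "http://example.org/inner-dc" }
]

default_ns, prefixes = merge_namespace_stack(stack)
p default_ns      # "http://example.org/default"
p prefixes["dc"]  # "http://example.org/inner-dc"
```

The real method then goes one step further and adds a prefix to the document root only if some element or attribute name in the document actually uses it.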

floating?()
# File lib/traject/experimental_nokogiri_streaming_reader.rb, line 212
def floating?
  !!@floating
end

is_jruby?()
# File lib/traject/experimental_nokogiri_streaming_reader.rb, line 170
def is_jruby?
  Traject::Util.is_jruby?
end

match?()
# File lib/traject/experimental_nokogiri_streaming_reader.rb, line 216
def match?
  match_path?(path_spec, floating: floating?)
end

match_path?(path_to_match, floating:)
# File lib/traject/experimental_nokogiri_streaming_reader.rb, line 220
def match_path?(path_to_match, floating:)
  # the floating: argument (not the tracker's own spec) selects the comparison
  if floating
    current_path.end_with?(path_to_match)
  else
    current_path == path_to_match
  end
end
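
To make the two comparison modes concrete, here is a standalone sketch in which the floating argument selects the comparison. The function path_matches? is hypothetical, and current_path is passed in explicitly rather than read from the tracker.

```ruby
# Hypothetical standalone version of the comparison, not Traject's API.
def path_matches?(current_path, path_to_match, floating:)
  if floating
    # floating specs match any path that ends with the bare spec
    current_path.end_with?(path_to_match)
  else
    # anchored specs must equal the whole path
    current_path == path_to_match
  end
end

p path_matches?("/collection/record", "record", floating: true)      # true
p path_matches?("/collection/record", "/record", floating: false)    # false
p path_matches?("/record", "/record", floating: false)               # true
p path_matches?("/collection/subrecord", "record", floating: true)   # true
```

The last call shows a consequence of using a plain string suffix test: a floating spec also matches an element whose name merely ends with the spec, since the comparison does not look at path component boundaries.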

pop()

removes the last slash-separated component from current_path

# File lib/traject/experimental_nokogiri_streaming_reader.rb, line 203
def pop
  current_path.slice!( current_path.rindex('/')..-1 )
  @current_node = nil

  if is_jruby?
    namespaces_stack.pop
  end
end

push(reader_node)

adds a component to slash-separated current_path, with namespace prefix.

# File lib/traject/experimental_nokogiri_streaming_reader.rb, line 175
def push(reader_node)
  namespace_prefix = reader_node.namespace_uri && inverted_namespaces[reader_node.namespace_uri]

  # gah, reader_node.name has the namespace prefix in there
  node_name = reader_node.name.gsub(/[^:]+:/, '')

  node_str = if namespace_prefix
    namespace_prefix + ":" + node_name
  else
    reader_node.name
  end

  current_path << ("/" + node_str)

  if is_jruby?
    namespaces_stack << reader_node.namespaces
  end
  @current_node = reader_node
end
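
The path bookkeeping done by push and pop can be sketched with plain strings. The helpers below are hypothetical stand-ins that take the node name and namespace URI directly instead of a Nokogiri::XML::Reader node; the OAI namespace URI is used only as an example of a registered namespace.

```ruby
# Hypothetical helpers mirroring push/pop; not part of Traject's API.
def path_component(name, namespace_uri, inverted_namespaces)
  # look up the registered prefix for this node's namespace URI, if any
  prefix = namespace_uri && inverted_namespaces[namespace_uri]
  local_name = name.gsub(/[^:]+:/, "") # strip the document's own prefix
  prefix ? "#{prefix}:#{local_name}" : name
end

def push_component(current_path, component)
  current_path << ("/" + component)
end

def pop_component(current_path)
  # remove everything from the last slash onward
  current_path.slice!(current_path.rindex("/")..-1)
end

oai_uri = "http://www.openarchives.org/OAI/2.0/"
inverted = { "oai" => oai_uri }.invert # uri => registered prefix

path = +"" # unary plus gives a mutable string
push_component(path, path_component("collection", nil, inverted))
push_component(path, path_component("x:record", oai_uri, inverted))
p path # "/collection/oai:record"

pop_component(path)
p path # "/collection"
```

Note that the document's own prefix (x here) is replaced by the registered one (oai), so specs are written against registered prefixes regardless of what the document uses.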

run_extra_xpath_hooks()
# File lib/traject/experimental_nokogiri_streaming_reader.rb, line 228
def run_extra_xpath_hooks
  return unless @current_node

  extra_xpath_hooks.each do |hook_spec|
    if match_path?(hook_spec[:path], floating: hook_spec[:floating])
      hook_spec[:callable].call(current_node_doc, clipboard)
    end
  end
end
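
A hook is a callable that receives the parsed node document and the shared clipboard. The sketch below imitates that dispatch with plain strings standing in for Nokogiri documents and a Hash standing in for the clipboard; run_hooks and the hook paths are hypothetical, and each hook is matched using its own floating flag.

```ruby
# Hypothetical dispatcher mirroring run_extra_xpath_hooks; not Traject's API.
def run_hooks(hooks, current_path, doc, clipboard)
  hooks.each do |hook|
    matched = if hook[:floating]
      current_path.end_with?(hook[:path])
    else
      current_path == hook[:path]
    end
    hook[:callable].call(doc, clipboard) if matched
  end
end

clipboard = {}
hooks = [
  # anchored hook: remember the collection header
  { path: "/collection/header", floating: false,
    callable: ->(doc, clip) { clip[:header] = doc } },
  # floating hook: count every record seen, at any depth
  { path: "record", floating: true,
    callable: ->(_doc, clip) { clip[:records_seen] = clip.fetch(:records_seen, 0) + 1 } }
]

run_hooks(hooks, "/collection/header", "<header/>", clipboard)
run_hooks(hooks, "/collection/record", "<record/>", clipboard)
run_hooks(hooks, "/collection/record", "<record/>", clipboard)

p clipboard[:header]       # "<header/>"
p clipboard[:records_seen] # 2
```

The clipboard lets a hook stash data from one part of the stream (such as a header) for use when later nodes are processed.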

Protected Instance Methods

parse_path(str_spec)

returns [bare_path, is_floating]

# File lib/traject/experimental_nokogiri_streaming_reader.rb, line 156
def parse_path(str_spec)
  floating = false

  if str_spec.start_with?('//')
    str_spec = str_spec.slice(2..-1)
    floating = true
  else
    str_spec = str_spec.slice(1..-1) if str_spec.start_with?(".")
    str_spec = "/" + str_spec unless str_spec.start_with?("/")
  end

  return [str_spec, floating]
end