class Traject::ExperimentalNokogiriStreamingReader::PathTracker
Initialized with a path specification (a very small subset of XPath) that identifies which records to yield. Tests whether a Nokogiri::XML::Reader node matches that spec.
A spec can be floating, matching at any depth:
'//record'
or anchored to root:
'/body/head/meta', which is the same thing as './body/head/meta' or 'body/head/meta'
Elements in a spec can (and, to match a namespaced element, must) carry XML namespace prefixes. A prefix is recognized if and only if it is registered in the nokogiri.namespaces setting.
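As a rough illustration of how registered namespaces end up in path components, the sketch below inverts a prefix-to-URI hash and looks up an element's namespace URI, much as the constructor and push listings on this page do. The URIs, the prefixes, and the prefixed_name helper are assumptions made up for the example and are not part of Traject.

```ruby
# Example registration in the shape of the nokogiri.namespaces setting
# (prefix => URI). The entries are illustrative data only.
namespaces = {
  "oai"  => "http://www.openarchives.org/OAI/2.0/",
  "marc" => "http://www.loc.gov/MARC21/slim"
}

# The tracker keeps the inverted hash (URI => prefix) so it can look up
# a prefix from the namespace URI reported for a node.
inverted_namespaces = namespaces.invert

# Hypothetical helper: builds the path component for an element from its
# namespace URI (or nil) and local name. Simplification: the real push
# falls back to the reader node's full name when no prefix is registered.
def prefixed_name(inverted_namespaces, namespace_uri, local_name)
  prefix = namespace_uri && inverted_namespaces[namespace_uri]
  prefix ? "#{prefix}:#{local_name}" : local_name
end

registered   = prefixed_name(inverted_namespaces, "http://www.loc.gov/MARC21/slim", "record")
unregistered = prefixed_name(inverted_namespaces, "http://example.org/unknown", "record")

puts registered    # component built from the registered prefix
puts unregistered  # no registered prefix, so the bare local name is used
```

The practical consequence is that the prefix used in a spec is the one registered in the settings, not whatever prefix the XML document happens to use for the same URI.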
Sadly, Nokogiri on JRuby is incompatible with Nokogiri on MRI here: it does not preserve our namespaces in outer_xml. On JRuby we therefore have to track the namespaces ourselves, and then also do yet ANOTHER parse in Nokogiri. This may make the reader even LESS performant on JRuby, I'm afraid.
Attributes
Public Class Methods
# File lib/traject/experimental_nokogiri_streaming_reader.rb, line 133
def initialize(str_spec, clipboard:, namespaces: {}, extra_xpath_hooks: {})
  @inverted_namespaces = namespaces.invert
  @clipboard = clipboard
  # We're guessing using a string will be more efficient than an array
  @current_path = ""
  @floating = false

  @path_spec, @floating = parse_path(str_spec)

  @namespaces_stack = []

  @extra_xpath_hooks = extra_xpath_hooks.collect do |path, callable|
    bare_path, floating = parse_path(path)
    {
      path: bare_path,
      floating: floating,
      callable: callable
    }
  end
end
Public Instance Methods
# File lib/traject/experimental_nokogiri_streaming_reader.rb, line 195
def current_node_doc
  return nil unless @current_node

  # yeah, sadly we got to have nokogiri parse it again
  fix_namespaces(Nokogiri::XML.parse(@current_node.outer_xml))
end
No-op unless running on JRuby. On JRuby, uses our namespace stack to add the correct namespaces to the Nokogiri::XML::Document, because on JRuby outer_xml on the Reader doesn't do it for us. :(
# File lib/traject/experimental_nokogiri_streaming_reader.rb, line 241
def fix_namespaces(doc)
  if is_jruby?
    # Only needed in jruby, nokogiri's jruby implementation isn't weird
    # around namespaces in exactly the same way as MRI. We need to keep
    # track of the namespaces in outer contexts ourselves, and then see
    # if they are needed ourselves. :(
    namespaces = namespaces_stack.compact.reduce({}, :merge)
    default_ns = namespaces.delete("xmlns")

    namespaces.each_pair do |attrib, uri|
      ns_prefix = attrib.sub(/\Axmlns:/, '')

      # gotta make sure it's actually used in the doc to not add it
      # unnecessarily. GAH.
      if doc.xpath("//*[starts-with(name(), '#{ns_prefix}:')][1]").empty? &&
         doc.xpath("//@*[starts-with(name(), '#{ns_prefix}:')][1]").empty?
        next
      end
      doc.root.add_namespace_definition(ns_prefix, uri)
    end

    if default_ns
      doc.root.default_namespace = default_ns
      # OMG nokogiri, really?
      default_ns = doc.root.namespace
      doc.xpath("//*[namespace-uri()='']").each do |node|
        node.namespace = default_ns
      end
    end
  end
  return doc
end
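The first few lines of fix_namespaces flatten the per-element namespace stack into one set of in-scope declarations. A minimal sketch of that step in plain Ruby follows, assuming each stack entry is a hash keyed by "xmlns" or "xmlns:prefix" (the shape the listing's delete and sub calls imply). The URIs are invented for illustration.

```ruby
# One entry per open element, outermost first. The listing calls compact
# before merging, so nil entries are tolerated and skipped.
namespaces_stack = [
  { "xmlns" => "http://example.org/default", "xmlns:dc" => "http://example.org/dc" },
  nil,
  { "xmlns:dc" => "http://example.org/dc-override" }
]

# Merge outer to inner, so a declaration on an inner element wins over
# an outer declaration of the same prefix.
in_scope = namespaces_stack.compact.reduce({}, :merge)

# The default namespace is pulled out and handled separately.
default_ns = in_scope.delete("xmlns")

# The remaining keys are reduced to bare prefixes.
prefixes = in_scope.keys.map { |attrib| attrib.sub(/\Axmlns:/, "") }

puts default_ns
puts prefixes.inspect
puts in_scope["xmlns:dc"]
```

Because the merge runs from the outermost element inward, this mirrors ordinary XML scoping, where the nearest declaration of a prefix takes precedence.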
# File lib/traject/experimental_nokogiri_streaming_reader.rb, line 212
def floating?
  !!@floating
end
# File lib/traject/experimental_nokogiri_streaming_reader.rb, line 170
def is_jruby?
  Traject::Util.is_jruby?
end
# File lib/traject/experimental_nokogiri_streaming_reader.rb, line 216
def match?
  match_path?(path_spec, floating: floating?)
end
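# File lib/traject/experimental_nokogiri_streaming_reader.rb, line 220
def match_path?(path_to_match, floating:)
  # note: this branches on the tracker's own floating? state; the
  # floating: argument is accepted but not consulted here
  if floating?
    current_path.end_with?(path_to_match)
  else
    current_path == path_to_match
  end
end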
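The two matching modes can be tried out in isolation with a small standalone function. This is a sketch, not the library's method: path_matches? is a hypothetical name, and it branches on its floating: argument, whereas the listing above branches on floating?.

```ruby
# Floating specs match when the current path ends with the spec;
# anchored specs must equal the current path exactly.
def path_matches?(current_path, path_to_match, floating:)
  if floating
    current_path.end_with?(path_to_match)
  else
    current_path == path_to_match
  end
end

# A floating spec such as '//record' is stored as "record".
floating_hit  = path_matches?("/collection/record", "record", floating: true)

# An anchored spec is stored with its leading slash.
anchored_hit  = path_matches?("/body/head/meta", "/body/head/meta", floating: false)
anchored_miss = path_matches?("/html/body/head/meta", "/body/head/meta", floating: false)

# The floating test is a plain string suffix check, so it does not stop at
# component boundaries: "subrecord" also ends with "record".
suffix_hit = path_matches?("/collection/subrecord", "record", floating: true)

puts [floating_hit, anchored_hit, anchored_miss, suffix_hit].inspect
```

The last case is worth keeping in mind when choosing element names for a floating spec, since a suffix comparison on the path string cannot tell "record" from an element whose name merely ends in "record".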
removes the last slash-separated component from current_path
# File lib/traject/experimental_nokogiri_streaming_reader.rb, line 203
def pop
  current_path.slice!( current_path.rindex('/')..-1 )
  @current_node = nil

  if is_jruby?
    namespaces_stack.pop
  end
end
adds a component, with namespace prefix, to slash-separated current_path
# File lib/traject/experimental_nokogiri_streaming_reader.rb, line 175
def push(reader_node)
  namespace_prefix = reader_node.namespace_uri && inverted_namespaces[reader_node.namespace_uri]

  # gah, reader_node.name has the namespace prefix in there
  node_name = reader_node.name.gsub(/[^:]+:/, '')

  node_str = if namespace_prefix
    namespace_prefix + ":" + node_name
  else
    reader_node.name
  end

  current_path << ("/" + node_str)

  if is_jruby?
    namespaces_stack << reader_node.namespaces
  end
  @current_node = reader_node
end
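To see how push and pop keep current_path in step with the reader's position, here is a stripped-down sketch that uses plain strings instead of reader nodes. TinyPathStack is a hypothetical class written for this example. It leaves out the namespace lookup and the JRuby namespace stack.

```ruby
# Hypothetical stand-in for the path bookkeeping only.
class TinyPathStack
  attr_reader :current_path

  def initialize
    # Unary plus gives an unfrozen string even under frozen_string_literal.
    @current_path = +""
  end

  # Appends one component to the slash-separated path.
  def push(component)
    @current_path << "/" << component
    self
  end

  # Removes everything from the last slash onward.
  # Assumes at least one component has been pushed.
  def pop
    @current_path.slice!(@current_path.rindex("/")..-1)
    self
  end
end

stack = TinyPathStack.new
stack.push("collection").push("marc:record")
deep = stack.current_path.dup

stack.pop
shallow = stack.current_path.dup

# The same regex the listing uses to strip a prefix from a node name.
local_name = "marc:record".gsub(/[^:]+:/, "")

puts deep
puts shallow
puts local_name
```

Keeping the path as one mutable string means matching is a single string comparison per node, which is the efficiency guess mentioned in the constructor's comment.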
# File lib/traject/experimental_nokogiri_streaming_reader.rb, line 228
def run_extra_xpath_hooks
  return unless @current_node

  extra_xpath_hooks.each do |hook_spec|
    if match_path?(hook_spec[:path], floating: hook_spec[:floating])
      hook_spec[:callable].call(current_node_doc, clipboard)
    end
  end
end
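The hook mechanism can be sketched without Nokogiri by letting a string stand in for the parsed node document. The hook specs below are already in the normalized form the constructor produces (path, floating, callable). The run_hooks helper, the paths, and the clipboard keys are assumptions for illustration, and matching here uses each hook's own floating flag.

```ruby
# A plain hash stands in for the clipboard that callables write into.
clipboard = {}

# Normalized hook specs, in the shape built by the constructor.
hook_specs = [
  { path: "title", floating: true,
    callable: ->(doc, clip) { clip[:title] = doc } },
  { path: "/collection/info", floating: false,
    callable: ->(doc, clip) { clip[:info] = doc } }
]

# Hypothetical helper: calls every hook whose path matches current_path,
# passing the node document and the shared clipboard.
def run_hooks(hook_specs, current_path, node_doc, clipboard)
  hook_specs.each do |spec|
    matched =
      if spec[:floating]
        current_path.end_with?(spec[:path])
      else
        current_path == spec[:path]
      end
    spec[:callable].call(node_doc, clipboard) if matched
  end
end

run_hooks(hook_specs, "/collection/header/title", "<title>Example</title>", clipboard)

puts clipboard.inspect
```

Only the floating title hook fires for this path, so the clipboard ends up holding the title and nothing under the info key.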
Protected Instance Methods
returns [bare_path, is_floating]
# File lib/traject/experimental_nokogiri_streaming_reader.rb, line 156
def parse_path(str_spec)
  floating = false

  if str_spec.start_with?('//')
    str_spec = str_spec.slice(2..-1)
    floating = true
  else
    str_spec = str_spec.slice(1..-1) if str_spec.start_with?(".")
    str_spec = "/" + str_spec unless str_spec.start_with?("/")
  end

  return [str_spec, floating]
end
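Since parse_path is protected, a quick way to check how a given spec is normalized is to copy its logic into a standalone function. The sketch below does that under the hypothetical name normalize_spec and runs it over a few sample specs.

```ruby
# Standalone copy of the normalization logic, for experimentation only.
# Returns [bare_path, is_floating].
def normalize_spec(str_spec)
  floating = false

  if str_spec.start_with?("//")
    # Floating spec: drop the leading '//' and remember the flag.
    str_spec = str_spec.slice(2..-1)
    floating = true
  else
    # Anchored spec: drop a leading '.', then ensure a leading '/'.
    str_spec = str_spec.slice(1..-1) if str_spec.start_with?(".")
    str_spec = "/" + str_spec unless str_spec.start_with?("/")
  end

  [str_spec, floating]
end

samples = ["//record", "/body/head/meta", "./body/head/meta", "body/head/meta"]
results = samples.map { |spec| normalize_spec(spec) }

samples.zip(results).each do |spec, (bare_path, is_floating)|
  puts "#{spec.inspect} -> #{bare_path.inspect}, floating: #{is_floating}"
end
```

The three anchored spellings all normalize to the same rooted path, while the floating spec keeps no leading slash, which is what lets it be compared as a suffix of the current path.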