class CHECKING::YOU::MrMIME

Push-event-based parser for freedesktop-dot-org `shared-mime-info`-format XML package files, including the main `shared-mime-info` database itself (GPLv2+), Apache Tika (MIT), and our own (AGPLv3). specifications.freedesktop.org/shared-mime-info-spec/shared-mime-info-spec-latest.html gitlab.freedesktop.org/xdg/shared-mime-info/-/blob/master/src/update-mime-database.c

Example pulled from `freedesktop.org.xml.in`:

<mime-type type="application/vnd.oasis.opendocument.text">
  <comment>ODT document</comment>
  <acronym>ODT</acronym>
  <expanded-acronym>OpenDocument Text</expanded-acronym>
  <sub-class-of type="application/zip"/>
  <generic-icon name="x-office-document"/>
  <magic priority="70">
    <match type="string" value="PK\003\004" offset="0">
      <match type="string" value="mimetype" offset="30">
        <match type="string" value="application/vnd.oasis.opendocument.text" offset="38"/>
      </match>
    </match>
  </magic>
  <glob pattern="*.odt"/>
</mime-type>

Constants

BASED_STRING

Turn an arbitrary String into the correctly-based Integer it represents. It would be nice if I could do this directly in `Ox::Sax::Value`. Base-16 Ints can be written as literals in Ruby, e.g. irb> 0xFF

> 255

DEFAULT_LOADS

`MrMIME::new` will take any of these keys as keyword arguments whose value will override the default defined in this Hash.

FDO_ELEMENT_CATEGORY

Map of `shared-mime-info` XML Element names to our generic category names.

FDO_MAGIC_FORMATS
ORIGIN_OF_SYMMETRY

Public Class Methods

new(**kwargs) click to toggle source
# File lib/checking-you-out/ghost_revival/mr_mime.rb, line 144
def initialize(**kwargs)
  # Per the `Ox::Sax` dox:
  # "Initializing `line` attribute in the initializer will cause that variable to
  #    be updated before each callback with the XML line number.
  #  The same is true for the `column` attribute, but it will be updated with
  #    the column in the XML file that is the start of the element or node just read.
  #  `@pos`, if defined, will hold the number of bytes from the start of the document."
  #@pos = nil
  #@line = nil
  #@column = nil

  # We receive separate events for Elements and Attributes, so we need to keep track of
  # the current Element to know what to do with Attributes since we can't rely on Attribute
  # names to be unique. For example, `shared-mime-info` has an attribute `type` on
  # `<mime-type>`, `<alias>`, `<match>`, and `<sub-class-of>`.
  @parse_stack = Array.new

  # We need a separate stack and a flag boolean for building content-match structures since the source XML
  # represents OR and AND byte-sequence relationships as sibling and child Elements respectively, e.g.
  # <magic><match1><match2/><match3/></match1><match4/></magic> => [m1 AND m2] OR [m1 AND m3], OR [m4].
  @i_can_haz_magic = true

  # Allow the user to control the conditions on which we ignore data from the source XML.
  @skips = self.class::DEFAULT_LOADS.merge(kwargs.slice(*self.class::DEFAULT_LOADS.keys)).keep_if { |k,v|
    v == false
  }.keys.to_set

  # Here's where I would put a call to `super()` a.k.a `Ox::Sax#initialize` — IF IT HAD ONE
end

Public Instance Methods

attr_value(attr_name, value) click to toggle source
# File lib/checking-you-out/ghost_revival/mr_mime.rb, line 239
def attr_value(attr_name, value)
  return if self.skips.include?(@parse_stack.last)
  # `parse_stack` can be empty here in which case its `#last` will be `nil`.
  # This happens e.g. for the two attributes of the XML declaration '<?xml version="1.0" encoding="UTF-8"?>'.
  #
  # Avoid the `Array` allocation necessary when using pattern matching `case` syntax.
  # I'd still like to refactor this to avoid the redundant `attr_name` `case`s.
  # Maybe a `Hash` of `proc`s?
  case @parse_stack.last
  when :"mime-type" then
    case attr_name
    when :type then @media_type.replace(value.as_s)
    end
  when :match then
    case attr_name
    when :type then @speedy_cat.last.format = FDO_MAGIC_FORMATS[value.as_s]
    when :value then @speedy_cat.last.cat = value.as_s
    when :offset then @speedy_cat.last.boundary = value.as_s
    when :mask then @speedy_cat.last.mask = BASED_STRING.call(value.as_s)
    end
  when :magic then
    case attr_name
    when :priority then @speedy_cat&.weight = value.as_i
    end
  when :alias then
    case attr_name
    when :type then self.cyo.add_aka(::CHECKING::YOU::IN::from_ietf_media_type(value.as_s))
    end
  when :"sub-class-of" then
    case attr_name
    when :type then self.cyo.add_parent(::CHECKING::YOU::OUT::from_ietf_media_type(value.as_s))
    end
  when :glob then
    case attr_name
    when :weight then @stick_around.weight = value.as_i
    when :pattern then @stick_around.replace(value.as_s)
    when :"case-sensitive" then @stick_around.case_sensitive = value.as_bool
    end
  when :"root-XML" then
    #case attr_name
    #when :namespaceURI then  # TODO
    #when :localName then  # TODO
    #end
  end
end
cyo() click to toggle source
# File lib/checking-you-out/ghost_revival/mr_mime.rb, line 174
def cyo
  @cyo ||= ::CHECKING::YOU::OUT::from_ietf_media_type(@media_type)
end
end_element(name) click to toggle source
# File lib/checking-you-out/ghost_revival/mr_mime.rb, line 293
def end_element(name)
  raise Exception.new('Parse stack element mismatch') unless @parse_stack.pop == name
  return if self.skips.include?(@parse_stack.last)
  case name
  when :"mime-type" then
    @media_type.clear
    @cyo = nil
  when :match then
    # The Sequence stack represents a complete match once we start popping Sequences from it,
    # which we can know because every `<match>` stack push sets `@i_can_haz_magic = true`.
    # If there is only a single sub-sequence we can just add that instead of the container.
    self.cyo.add_content_match(
      # Add single-sequences directly instead of adding their container.
      @speedy_cat.one? ?
        # Transfer any non-default `weight` from the container to that single-sequence.
        @speedy_cat.pop.tap { _1.weight = @speedy_cat.weight } :
        # Otherwise go ahead and add a copy of the container while also preparing the
        # local container for a possible next-branch to the `<magic>` tree.
        @speedy_cat.dup.tap { @speedy_cat.pop }
    ) if @i_can_haz_magic
    # Mark any remaining partial Sequences as ineligible to be a full match candidate,
    # e.g. if we had a stack of [<match1/><match2/><match3/>] we would want to add a
    # candidate [m1, m2, m3] but not the partials [m1, m2] or [m1] as we clear out the stack.
    @i_can_haz_magic = false
  when :magic then
    # `SpeedyCat#clear` will unset any non-default `weight` so we can re-use it cleanly.
    @speedy_cat.clear
  when :glob then
    self.cyo.add_pathname_fragment(@stick_around) unless @stick_around.nil?
  end
end
open(path, **kwargs) click to toggle source
# File lib/checking-you-out/ghost_revival/mr_mime.rb, line 325
def open(path, **kwargs)
  # Use the block form of `IO::open` so the file handle is implicitly closed after we leave this scope.
  # Per Ruby's `IO` module docs:
  # "With no associated block, `::open` is a synonym for `::new`.  If the optional code block is given,
  #  it will be passed the opened file as an argument and the File object will automatically be closed
  #  when the block terminates.  The value of the block will be returned from `::open`."
  File.open(path, File::Constants::RDONLY) { |mime_xml|

    # "Announce an intention to access data from the current file in a specific pattern.
    # On platforms that do not support the posix_fadvise(2) system call, this method is a no-op."
    #
    # This was probably a bigger deal when we all stored our files on spinning rust, but it shouldn't hurt :)
    #
    # I'm using `:sequential` because I am doing event-based XML parsing, and I'm avoiding `:noreuse`
    # because back-to-back invocations of DistorteD will benefit from the OS caching the data files.
    #
    # N0TE: `:noreuse` is a no-op on Lunix anyway, at least as of ver 5.12 as I write this in 2021:
    # https://linux.die.net/man/2/posix_fadvise
    # https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/fadvise.c
    # https://web.archive.org/web/20130513093816/http://kerneltrap.org/node/7563
    #
    # But it works on FreeBSD, amusingly even when using Linux ABI compat:
    # https://www.freebsd.org/cgi/man.cgi?query=posix_fadvise&sektion=2
    # https://cgit.freebsd.org/src/tree/sys/kern/vfs_syscalls.c
    # https://cgit.freebsd.org/src/tree/sys/compat/linux/linux_file.c
    mime_xml.advise(:sequential)

    # Docs: http://www.ohler.com/ox/Ox.html#method-c-sax_parse
    #
    # Code for this method is defined in Ox's C extension, not in Ox's Ruby lib:
    #   https://github.com/ohler55/ox/blob/master/ext/ox/ox.c  CTRL+F "call-seq: sax_parse".
    #   (Not linking a particular line number since that would require linking a particular revision too)
    #
    # Here is an actual example `<magic><match/>` element pulled from `freedesktop.org.xml`
    # showing why I am using Ox's `:convert_special` here:
    #   `<match type="string" value="&lt;&lt;&lt; QEMU VM Virtual Disk Image >>>\n" offset="0"/>`
    #
    # I can't figure out how exactly `:skip_none` and `:skip_off` differ here.
    # They appear a few times as equal/fall-through `case`s in some C extension `switch` statements:
    #   https://github.com/ohler55/ox/search?q=OffSkip
    #   https://github.com/ohler55/ox/search?q=NoSkip
    # In `sax.c`'s `read_text` function they are used slightly differently in the following conditional
    # where `:skip_none` checks if the end of the Element has been reached but `:skip_off` doesn't.
    #   https://github.com/ohler55/ox/blob/master/ext/ox/sax.c  CTRL+F "read_text"
    #   `((NoSkip == dr->options.skip && !isEnd) || (OffSkip == dr->options.skip)))`
    #
    # TOD0: Probably String allocation gainz to be had inside Ox's C extension once the API is available:
    # https://bugs.ruby-lang.org/issues/13381
    # https://bugs.ruby-lang.org/issues/16029
    # e.g. https://github.com/msgpack/msgpack-ruby/pull/196
    Ox.sax_parse(
      self,                     # Instance of a class that responds to `Ox::Sax`'s callback messages.
      mime_xml,                 # IO stream or String of XML to parse. Won't close File handles automatically.
      **{
        convert_special: true,  # [boolean] Convert encoded entities back to their unencoded form, e.g. `"&lt"` to `"<"`.
        skip: :skip_off,        # [:skip_none|:skip_return|:skip_white|:skip_off] (from Element text/value) Strip CRs, whitespace, or nothing.
        smart: false,           # [boolean] Toggle Ox's built-in hints for HTML parsing: https://github.com/ohler55/ox/blob/master/ext/ox/sax_hint.c
        strip_namespace: true,  # [nil|String|true|false] (from Element names) Strip no namespaces, all namespaces, or a specific namespace.
        symbolize: true,        # [boolean] Fill callback method `name` arguments with Symbols instead of with Strings.
        intern_strings: true,   # [boolean] Intern (freeze and deduplicate) String return values.
      }.update(kwargs),
    )
  }
end
skips() click to toggle source

You shouldn't abuse the power of the Solid.

# File lib/checking-you-out/ghost_revival/mr_mime.rb, line 140
def skips
  @skips ||= self.class::DEFAULT_LOADS.keep_if { |k,v| v == false }.keys.to_set
end
start_element(name) click to toggle source

Callback methods we can implement in this Handler per www.ohler.com/ox/Ox/Sax.html

def instruct(target); end def end_instruct(target); end def attr(name, str); end def attr_value(name, value); end def attrs_done(); end def doctype(str); end def comment(str); end def cdata(str); end def text(str); end def value(value); end def start_element(name); end def end_element(name); end def error(message, line, column); end def abort(name); end

PROTIPs:

  • Our callback methods must be public or they won't be called!

  • Some argument names in the documented method definitions describe the type (in the Ruby sense) of the argument, and some argument names describe their semantic meaning w/r/t the XML document!

    • Arguments called `name` will be Symbols by default unless `Ox::parse_sax` was given an options hash where `:symbolize` => `false` in which case `name` arguments will contain Strings.

    • Arguments called `str` will be Strings always.

    • Arguments called `value` will be `Ox::Sax::Value` objects that can be further differentiated in several ways: www.ohler.com/ox/Ox/Sax/Value.html

  • For example, an invocation of `attr(name, str)` emitted while parsing a fd.o `<magic>` Element might have a `name` argument containing the Symbol `:priority` and a `str` argument containing its String value e.g. `“50”`.

  • The `value` versions of these callback methods have priority over their `str` equivalents if both are defined, and only one of them will ever be called, e.g. `attr_value()` > `attr()` iff `defined? attr_value()`.

# File lib/checking-you-out/ghost_revival/mr_mime.rb, line 211
def start_element(name)
  @parse_stack.push(name)
  return if self.skips.include?(name)
  case name
  when :"mime-type" then
    @media_type = String.new if @media_type.nil?
    @cyo = nil
  when :match then
    # Mark any newly-added Sequence as eligible for a full match candidate.
    @i_can_haz_magic = true
    @speedy_cat.append(::CHECKING::YOU::OUT::SequenceCat.new)
  when :magic then
    @speedy_cat = ::CHECKING::YOU::OUT::SpeedyCat.new if @speedy_cat.nil?
  when :"magic-deleteall" then
    # TODO
  when :glob then
    @stick_around = ::CHECKING::YOU::StickAround.new
  when :"glob-deleteall" then
    # TODO
  when :treemagic then
    # TODO
  when :acronym then
    # TODO
  when :"expanded-acronym" then
    # TODO
  end
end
text(element_text) click to toggle source
# File lib/checking-you-out/ghost_revival/mr_mime.rb, line 285
def text(element_text)
  return if self.skips.include?(@parse_stack.last)
  case @parse_stack.last
  when :comment then
    self.cyo.description = element_text
  end
end