class HexaPDF::Content::Parser

This class knows how to correctly parse a content stream.

Overview

A content stream is mostly just a stream of PDF objects. However, there is one exception: inline images.

Inline images don't follow the normal PDF object parsing rules, so they need special handling, which is the reason this class exists. The parser consumes the ID and EI operators itself. For an inline image, the processor is therefore only ever called with the BI operator, and it receives the image dictionary and the image data as parameters.

To parse a content stream, call the parse method with the contents and either a Processor object or a block, which is then used for processing the parsed operators.

Public Class Methods

parse(contents, processor = nil, &block)

Creates a new Parser object and calls parse.

# File lib/hexapdf/content/parser.rb, line 164
def self.parse(contents, processor = nil, &block)
  new.parse(contents, processor, &block)
end

Public Instance Methods

parse(contents, processor = nil) { |object, params| ... }

Parses the contents and calls the processor object or the given block for each parsed operator.

If a full-blown Processor is not needed (e.g. because the graphics state doesn't need to be maintained), one can use the block form to handle each parsed operator and its parameters.
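
The block form relies on giving the block a process method that forwards to call, so the rest of the code can treat it like a processor. The following is a minimal sketch of that idea in plain Ruby, independent of HexaPDF; the run_operators method and the operator list are hypothetical and exist only for illustration.

```ruby
# Hypothetical driver: calls handler.process(operator, params) for each entry.
def run_operators(operators, handler = nil, &block)
  raise ArgumentError, "handler or block is needed" if handler.nil? && block.nil?
  if handler.nil?
    # Give this one Proc object a #process method that is an alias for #call,
    # so it can be used wherever an object responding to #process is expected.
    block.singleton_class.send(:alias_method, :process, :call)
    handler = block
  end
  operators.each { |name, params| handler.process(name, params) }
end

seen = []
run_operators([[:m, [10, 10]], [:l, [50, 50]], [:S, []]]) do |name, params|
  seen << "#{name}(#{params.join(', ')})"
end
puts seen.inspect
```

Because the alias is defined on the singleton class, only this one Proc instance gains the process method; other procs are unaffected.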

Note: The parameters array is reused for each processed operator, so duplicate it (e.g. with dup) if it needs to be kept after the processor or block returns.
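
The effect of the reused array can be shown without HexaPDF. The sketch below uses a hypothetical each_operator method that, like the parser, yields one shared array and clears it afterwards; it compares storing the array itself with storing a copy.

```ruby
# Hypothetical emitter that reuses ONE params array for all operators.
def each_operator
  params = []
  [[:m, [10, 20]], [:l, [30, 40]]].each do |name, operands|
    params.concat(operands)
    yield name, params
    params.clear # the same array object is emptied for the next operator
  end
end

kept_by_reference = []
kept_by_copy = []
each_operator do |name, params|
  kept_by_reference << [name, params]  # stores the shared array itself
  kept_by_copy << [name, params.dup]   # stores a snapshot of its contents
end

p kept_by_reference # every entry points at the same, now empty, array
p kept_by_copy      # every entry kept its own operands
```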

# File lib/hexapdf/content/parser.rb, line 176
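def parse(contents, processor = nil, &block) #:yields: object, params
  raise ArgumentError, "Argument processor or block is needed" if processor.nil? && block.nil?
  if processor.nil?
    # Let the block stand in for a processor by aliasing #process to #call
    block.singleton_class.send(:alias_method, :process, :call)
    processor = block
  end

  # With raise_on_eos: true the tokenizer raises at the end of the stream,
  # which is what ends the loop below
  tokenizer = Tokenizer.new(contents, raise_on_eos: true)
  params = []
  loop do
    obj = tokenizer.next_object(allow_keyword: true)
    if obj.kind_of?(Tokenizer::Token)
      # An operator: hand it over together with the operands collected so far
      if obj == 'BI'
        # Inline image: the dictionary and the image data become the parameters
        params = parse_inline_image(tokenizer)
      end
      processor.process(obj.to_sym, params)
      params.clear
    else
      # An operand: collect it for the next operator
      params << obj
    end
  end
end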
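
The core of the method is a postfix loop: operands are collected until an operator arrives, and the operator is then dispatched together with everything collected so far. A simplified sketch of this pattern follows. It is not the HexaPDF tokenizer: toy_tokens and toy_parse are hypothetical helpers that split on whitespace, treat numbers as operands and everything else as operators, and ignore names, strings, arrays and inline images. The loop ends when the enumerator raises StopIteration, which Kernel#loop rescues.

```ruby
# Toy stand-in for a tokenizer: numbers become operands, anything else an operator.
def toy_tokens(contents)
  contents.split.map { |t| t.match?(/\A-?\d+(\.\d+)?\z/) ? Float(t) : t.to_sym }.each
end

# Collects operands until an operator is seen, then records the pair.
def toy_parse(contents)
  tokens = toy_tokens(contents)
  params = []
  result = []
  loop do
    obj = tokens.next # raises StopIteration at the end; Kernel#loop rescues it
    if obj.is_a?(Symbol)
      result << [obj, params.dup] # copy, because params is cleared right after
      params.clear
    else
      params << obj
    end
  end
  result
end

parsed = toy_parse("1 0 0 RG 10 10 m 50 50 l S")
p parsed
```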

Private Instance Methods

parse_inline_image(tokenizer)

Parses the inline image at the current position.

# File lib/hexapdf/content/parser.rb, line 204
def parse_inline_image(tokenizer)
  # BI has already been read, so read the image dictionary
  dict = {}
  while (key = tokenizer.next_object(allow_keyword: true) rescue Tokenizer::NO_MORE_TOKENS)
    if key == 'ID'
      break
    elsif key == Tokenizer::NO_MORE_TOKENS
      raise HexaPDF::Error, "EOS while trying to read dictionary key for inline image"
    elsif !key.kind_of?(Symbol)
      raise HexaPDF::Error, "Inline image dictionary keys must be PDF name objects"
    end
    value = tokenizer.next_object rescue Tokenizer::NO_MORE_TOKENS
    if value == Tokenizer::NO_MORE_TOKENS
      raise HexaPDF::Error, "EOS while trying to read dictionary value for inline image"
    end
    dict[key] = value
  end

  # one whitespace character after ID
  tokenizer.next_byte

  real_end_found = false
  image_data = ''.b

  # find the EI operator and handle EI appearing inside the image data
  until real_end_found
    data = tokenizer.scan_until(/(?=EI(?:[#{Tokenizer::WHITESPACE}]|\z))/o)
    if data.nil?
      raise HexaPDF::Error, "End inline image marker EI not found"
    end
    image_data << data
    tokenizer.pos += 2
    last_pos = tokenizer.pos

    # Check if we found EI inside of the image data
    count = 0
    while count < MAX_TOKEN_CHECK
      token = tokenizer.next_object(allow_keyword: true) rescue Tokenizer::NO_MORE_TOKENS
      if token == Tokenizer::NO_MORE_TOKENS
        count += MAX_TOKEN_CHECK
      elsif token.kind_of?(Tokenizer::Token) &&
          !Processor::OPERATOR_MESSAGE_NAME_MAP.key?(token.to_sym)
        break #  invalid token
      end
      count += 1
    end

    if count >= MAX_TOKEN_CHECK
      real_end_found = true
    else
      image_data << "EI"
    end
    tokenizer.pos = last_pos
  end

  [dict, image_data]
end
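
The search for the real EI marker is a lookahead heuristic: an EI followed by whitespace is only accepted if what follows still looks like valid content stream syntax; otherwise the EI is treated as part of the image data and the search continues. The sketch below illustrates this with StringScanner from the Ruby standard library. It is a simplification under stated assumptions: TOY_OPERATORS, TOY_MAX_TOKEN_CHECK and read_inline_image_data are hypothetical, the lookahead splits on whitespace instead of using a real tokenizer, and it accepts only numbers and the listed operators, which is stricter than the method above.

```ruby
require 'strscan'

TOY_OPERATORS = %w[q Q cm m l S f BT ET Tj].freeze # hypothetical subset of operators
TOY_MAX_TOKEN_CHECK = 5                            # hypothetical lookahead limit

# Reads image data up to the first EI that is followed by plausible tokens.
def read_inline_image_data(scanner)
  image_data = ''.b
  loop do
    # Stop right before an EI that is followed by whitespace or the end of the data
    data = scanner.scan_until(/(?=EI(?:[ \t\r\n\f\0]|\z))/)
    raise "End inline image marker EI not found" if data.nil?
    image_data << data
    scanner.pos += 2 # skip the EI candidate
    resume_pos = scanner.pos

    # Look at the next few tokens: only numbers and known operators are plausible
    lookahead = scanner.rest.split.first(TOY_MAX_TOKEN_CHECK)
    plausible = lookahead.all? do |t|
      t.match?(/\A-?\d+(\.\d+)?\z/) || TOY_OPERATORS.include?(t)
    end

    scanner.pos = resume_pos
    return image_data if plausible
    image_data << "EI" # the candidate was part of the image data, keep searching
  end
end

scanner = StringScanner.new("ab EI \xFFzz EI Q 10 10 m".b)
image = read_inline_image_data(scanner)
p image
p scanner.rest
```

In this input the first EI is followed by bytes that do not form a plausible token, so it ends up inside the image data; the second EI is accepted and the scanner is left positioned directly after it.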