class HexaPDF::Content::Processor

This class is used for processing content operators extracted from a content stream.

General Information

When a content stream is read, operators and their operands are extracted. After extraction, these operators are normally processed with a Processor instance, which ensures that the needed setup (like modifying the graphics state) is done before further processing.

How Processing Works

The operator implementations (see the Operator module) are called first, and they ensure that the processing state is consistent. For example, operators that modify the graphics state actually modify the graphics_state object. However, operator implementations are used only for this task and nothing more, so they are very specific and normally don't need to be changed.

After that, methods corresponding to the operator names are invoked on the processor object (if they exist). Each PDF operator name is mapped to a nicer message name via the OPERATOR_MESSAGE_NAME_MAP constant. For example, the operator 'q' is mapped to 'save_graphics_state'.

The task of these methods is to do something useful with the content itself; they don't need to concern themselves with ensuring the consistency of the processing state. For example, the processor could use the processing state to extract the text or to paint the content on a canvas.

For inline images, only the 'BI' operator, mapped to 'inline_image', is used. Although the operators 'ID' and 'EI' also exist for inline images, they are not used because they are consumed while parsing inline images and do not reflect separate operators.
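
The two-stage dispatch described in this section can be sketched without HexaPDF itself. The following is a minimal, self-contained model and not the library's code: MiniProcessor, SaveState and the two-entry name map are hypothetical stand-ins, and only the body of process follows the source shown for the process method on this page.

```ruby
# Hypothetical stand-in for an operator implementation: it only keeps the
# processing state consistent (here, a stack of saved states).
class SaveState
  def invoke(processor)
    processor.state_stack.push(:saved)
  end
end

# Hypothetical miniature processor; not HexaPDF's class.
class MiniProcessor
  # Two-entry excerpt in the spirit of OPERATOR_MESSAGE_NAME_MAP.
  OPERATOR_MESSAGE_NAME_MAP = {
    q: :save_graphics_state,
    Tj: :show_text,
  }.freeze

  attr_reader :state_stack, :log

  def initialize
    @operators = {q: SaveState.new}
    @state_stack = []
    @log = []
  end

  # Stage 1: run the operator implementation, if one is registered.
  # Stage 2: send the mapped message, if this object responds to it.
  def process(operator, operands = [])
    @operators[operator].invoke(self, *operands) if @operators.key?(operator)
    msg = OPERATOR_MESSAGE_NAME_MAP[operator]
    send(msg, *operands) if msg && respond_to?(msg, true)
  end

  private

  # Invoked after SaveState has already updated the state stack.
  def save_graphics_state
    @log << [:save_graphics_state, @state_stack.size]
  end
end

processor = MiniProcessor.new
processor.process(:q)            # both stages run
processor.process(:Tj, ['Hi'])   # no implementation, no show_text method: ignored
processor.process(:unknown)      # not in the name map: ignored
```

Because the message is only sent when the object responds to it, a subclass in this model needs to define just the handlers it cares about; every other operator passes through silently.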

Text Processing

Two utility methods for extracting text are provided: decode_text and decode_text_with_positioning. Both can be invoked directly from the 'show_text' and 'show_text_with_positioning' methods.

Constants

OPERATOR_MESSAGE_NAME_MAP

Mapping of PDF operator names to message names that are sent to renderer implementations.

Attributes

graphics_object[RW]

The current graphics object.

Changing this attribute manually is not advised; it is adjusted automatically according to the processed operators!

This attribute can have the following values:

:none

No current graphics object, i.e. the processor is at the page description level.

:path

The current graphics object is a path.

:clipping_path

The current graphics object is a clipping path.

:text

The current graphics object is text.

See: PDF1.7 s8.2

graphics_state[R]

The GraphicsState object containing the current graphics state.

Changing this attribute manually is not advised; it is adjusted automatically according to the processed operators!

operators[R]

Mapping from operator name (Symbol) to a callable object.

This hash is prepopulated with the default operator implementations (see Operator::DEFAULT_OPERATORS). If a default operator implementation is not satisfactory, it can easily be changed by modifying this hash.
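
As a rough illustration of what modifying this hash involves, the snippet below models it with plain Ruby objects. It assumes only what the source of process on this page shows, namely that an entry must respond to invoke(processor, *operands). The names defaults, NoOp and Counting are hypothetical, and the real default implementations are not reproduced here.

```ruby
# Hypothetical default implementation that does nothing.
class NoOp
  def invoke(_processor, *_operands); end
end

# Hypothetical replacement that counts how often it was invoked.
class Counting
  attr_reader :count

  def initialize
    @count = 0
  end

  def invoke(_processor, *_operands)
    @count += 1
  end
end

# Stand-in for the shared defaults; frozen so accidental edits fail loudly.
defaults = {q: NoOp.new, Q: NoOp.new}.freeze

# A processor works on its own copy (dup returns an unfrozen hash), so
# replacing an entry stays local to that copy.
operators = defaults.dup
counter = Counting.new
operators[:q] = counter

operators[:q].invoke(nil)
operators[:q].invoke(nil)
```

The untouched entry for :Q is still the very same object as in the defaults, so only the replaced operator behaves differently.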

resources[R]

The resources dictionary used during processing.

Public Class Methods

new(resources = nil)

Initializes a new processor that uses the resources PDF dictionary for resolving resources while processing operators.

It is not mandatory to set the resources dictionary on initialization, but it needs to be set before processing operators!

# File lib/hexapdf/content/processor.rb, line 337
def initialize(resources = nil)
  @operators = Operator::DEFAULT_OPERATORS.dup
  @graphics_state = GraphicsState.new
  @graphics_object = :none
  @original_resources = nil
  self.resources = resources
end

Public Instance Methods

process(operator, operands = [])

Processes the operator with the given operands.

The operator is first processed with an operator implementation (if any) to ensure correct operations and then the corresponding method on this object is invoked.

# File lib/hexapdf/content/processor.rb, line 359
def process(operator, operands = [])
  @operators[operator].invoke(self, *operands) if @operators.key?(operator)
  msg = OPERATOR_MESSAGE_NAME_MAP[operator]
  send(msg, *operands) if msg && respond_to?(msg, true)
end

resources=(res)

Sets the resources dictionary used during processing.

The first time resources are set, they are also stored as the “original” resources. This is needed because form XObjects don't need to have a resources dictionary and can use the page's resources dictionary instead.

# File lib/hexapdf/content/processor.rb, line 350
def resources=(res)
  @original_resources = res if @original_resources.nil?
  @resources = res
end
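
The effect of the setter can be seen in isolation with a small model. This is a sketch under the assumption that plain hashes may stand in for PDF resource dictionaries; ResourceHolder and the original_resources reader are hypothetical additions for illustration, while the setter body is copied from the source above.

```ruby
# Hypothetical miniature holder; only the setter follows the documented source.
class ResourceHolder
  attr_reader :resources, :original_resources

  def initialize(resources = nil)
    @original_resources = nil
    self.resources = resources
  end

  def resources=(res)
    @original_resources = res if @original_resources.nil?
    @resources = res
  end
end

page_resources = {Font: {F1: :page_font}}
form_resources = {Font: {F2: :form_font}}

holder = ResourceHolder.new          # nothing stored yet
holder.resources = page_resources    # first non-nil value becomes the original
holder.resources = form_resources    # later values only change the current one
```

After these calls the current resources are the form's, while the remembered original still points at the page's dictionary.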

Protected Instance Methods

decode_text(data)

Decodes the given text object and returns it as a UTF-8 string.

The argument may either be a simple text string (Tj operator) or an array that contains text strings together with positioning information (TJ operator).

# File lib/hexapdf/content/processor.rb, line 389
def decode_text(data)
  if data.kind_of?(Array)
    data = data.each_with_object(''.b) {|obj, result| result << obj if obj.kind_of?(String) }
  end
  font = graphics_state.font
  font.decode(data).map {|code_point| font.to_utf8(code_point) }.join('')
end
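
The array handling at the start of this method can be tried on its own. The helper below is a hypothetical extraction of that first step only (join_text_strings is not a HexaPDF method); it does not perform the font-based decoding, which needs a real font object.

```ruby
# Collects the string parts of a TJ-style array and drops the numeric
# positioning adjustments, like the first step of decode_text.
def join_text_strings(data)
  return data unless data.kind_of?(Array)

  data.each_with_object(''.b) { |obj, result| result << obj if obj.kind_of?(String) }
end

# Made-up operand: strings interleaved with positioning numbers.
tj_operand = ['Hel', -120, 'lo ', 35.5, 'PDF']
joined = join_text_strings(tj_operand)
```

A plain string (the Tj case) is returned unchanged, so the same code path can serve both operators.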

decode_text_with_positioning(data)

Decodes the given text object and returns it as a CompositeBox object.

The argument may either be a simple text string (Tj operator) or an array that contains text strings together with positioning information (TJ operator).

For each glyph a GlyphBox object is computed. For horizontal fonts the width is predetermined but not the height. The latter is chosen to be the height and offset of the font's bounding box.

# File lib/hexapdf/content/processor.rb, line 405
def decode_text_with_positioning(data)
  data = Array(data)
  if graphics_state.font.writing_mode == :horizontal
    decode_horizontal_text(data)
  else
    decode_vertical_text(data)
  end
end

paint_xobject(name)

Provides a default implementation for the 'Do' operator.

It checks if the XObject is a Form XObject and if so, processes the contents of the Form XObject.

# File lib/hexapdf/content/processor.rb, line 371
def paint_xobject(name)
  xobject = resources.xobject(name)
  return unless xobject[:Subtype] == :Form

  res = resources
  graphics_state.save

  graphics_state.ctm.premultiply(*xobject[:Matrix]) if xobject.key?(:Matrix)
  xobject.process_contents(self, original_resources: @original_resources)

  graphics_state.restore
  self.resources = res
end

Private Instance Methods

decode_horizontal_text(array)

Decodes the given array containing text and positioning information while assuming that the writing direction is horizontal.

See: PDF1.7 s9.4.4

# File lib/hexapdf/content/processor.rb, line 420
def decode_horizontal_text(array)
  font = graphics_state.font
  scaled_char_space = graphics_state.scaled_character_spacing
  scaled_word_space = (font.word_spacing_applicable? ? graphics_state.scaled_word_spacing : 0)
  scaled_font_size = graphics_state.scaled_font_size

  below_baseline = font.bounding_box[1] * scaled_font_size / \
    graphics_state.scaled_horizontal_scaling + graphics_state.text_rise
  above_baseline = font.bounding_box[3] * scaled_font_size / \
    graphics_state.scaled_horizontal_scaling + graphics_state.text_rise

  text = CompositeBox.new
  array.each do |item|
    if item.kind_of?(Numeric)
      graphics_state.tm.translate(-item * scaled_font_size, 0)
    else
      font.decode(item).each do |code_point|
        char = font.to_utf8(code_point)
        width = font.width(code_point) * scaled_font_size + scaled_char_space + \
          (code_point == 32 ? scaled_word_space : 0)
        matrix = graphics_state.ctm.dup.premultiply(*graphics_state.tm)
        fragment = GlyphBox.new(code_point, char,
                                *matrix.evaluate(0, below_baseline),
                                *matrix.evaluate(width, below_baseline),
                                *matrix.evaluate(0, above_baseline))
        text << fragment
        graphics_state.tm.translate(width, 0)
      end
    end
  end

  text.freeze
end
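
The per-glyph advance computed in the loop above can be checked with plain numbers. This is a worked sketch, not library code: glyph_advance is a hypothetical helper that repeats the width expression from the source, and all input values are made up. In the method itself, the word spacing term is additionally zero when the font reports that word spacing is not applicable.

```ruby
# Advance for one glyph, following the `width` expression in the source above.
# All arguments are plain numbers; in the library they come from the font and
# the graphics state.
def glyph_advance(code_point, glyph_width, scaled_font_size,
                  scaled_char_space, scaled_word_space)
  glyph_width * scaled_font_size + scaled_char_space +
    (code_point == 32 ? scaled_word_space : 0)
end

# Hypothetical values, chosen to be exactly representable as floats.
size = 0.015625
char_space = 0.5
word_space = 2.0

advance_a = glyph_advance(65, 512, size, char_space, word_space)     # letter 'A'
advance_space = glyph_advance(32, 256, size, char_space, word_space) # space
```

With these inputs the letter advances by 8.5 and the space by 6.5: character spacing is added to every glyph, word spacing only to code point 32.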

decode_vertical_text(_data)

Decodes the given array containing text and positioning information while assuming that the writing direction is vertical.

# File lib/hexapdf/content/processor.rb, line 456
def decode_vertical_text(_data)
  raise NotImplementedError
end