class HexaPDF::Content::Parser
This class knows how to correctly parse a content stream.
Overview
A content stream is mostly just a stream of PDF objects. However, there is one exception: inline images.
Since inline images don't follow the normal PDF object parsing rules, they need to be handled specially, and this is the reason for this class. As a result, the processor only ever receives the BI operator for an inline image, because the ID and EI operators are consumed by the parser itself.
To parse some contents, the parse method needs to be called with the contents to be parsed and a Processor object, which is used for processing the parsed operators.
Public Class Methods
Public Instance Methods
parse(contents, processor = nil, &block)
Parses the contents and calls the processor object or the given block for each parsed operator.
If a full-blown Processor is not needed (e.g. because the graphics state doesn't need to be maintained), one can use the block form to handle the parsed operators and their parameters.
Note: The parameters array is reused for each processed operator, so duplicate it if necessary.
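The note above can be illustrated without HexaPDF itself. The following is a minimal sketch, not the library's implementation: `each_operator` is a hypothetical stand-in for the parser loop that reuses and clears a single array, and treating every Symbol as an operator is a simplification made only for this example.

```ruby
# Hypothetical stand-in for the parser loop: one params array is reused
# and cleared after every operator, mirroring the behaviour in the note.
# Simplification: every Symbol in the token list counts as an operator.
def each_operator(tokens)
  params = []
  tokens.each do |token|
    if token.is_a?(Symbol)
      yield token, params
      params.clear
    else
      params << token
    end
  end
end

tokens = [1, 0, 0, :rg, 10, 20, :m, 30, 40, :l]

kept_by_reference = []
kept_by_copy = []
each_operator(tokens) do |operator, params|
  kept_by_reference << [operator, params]     # stores the shared array itself
  kept_by_copy << [operator, params.dup]      # stores a snapshot of the operands
end

# Every entry in kept_by_reference points at the same, now empty, array,
# while kept_by_copy still holds the operands seen at each call.
p kept_by_reference
p kept_by_copy
```

Duplicating is only needed when the operands must outlive the block call; a block that consumes them immediately can use the array as given.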
  # File lib/hexapdf/content/parser.rb, line 176
  def parse(contents, processor = nil, &block) #:yields: object, params
    raise ArgumentError, "Argument processor or block is needed" if processor.nil? && block.nil?
    if processor.nil?
      block.singleton_class.send(:alias_method, :process, :call)
      processor = block
    end

    tokenizer = Tokenizer.new(contents, raise_on_eos: true)
    params = []
    loop do
      obj = tokenizer.next_object(allow_keyword: true)
      if obj.kind_of?(Tokenizer::Token)
        if obj == 'BI'
          params = parse_inline_image(tokenizer)
        end
        processor.process(obj.to_sym, params)
        params.clear
      else
        params << obj
      end
    end
  end
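The listing gives a block a process method by aliasing call on the block's singleton class, so the rest of the method can treat an object and a block the same way. The sketch below shows that idea in isolation; `dispatch` and `OperatorLog` are hypothetical names invented for this example and are not part of HexaPDF.

```ruby
# Hypothetical driver standing in for the parser: all it knows is how to
# call processor.process(operator, params).
def dispatch(processor, operator, params)
  processor.process(operator, params)
end

# A regular object that implements #process.
class OperatorLog
  attr_reader :seen

  def initialize
    @seen = []
  end

  def process(operator, params)
    @seen << [operator, params.dup]
  end
end

log = OperatorLog.new
dispatch(log, :m, [10, 20])

# A block that gains a #process alias for #call on its singleton class,
# which is the same trick the listing uses for the block form.
from_block = []
handler = proc { |operator, params| from_block << [operator, params.dup] }
handler.singleton_class.send(:alias_method, :process, :call)
dispatch(handler, :l, [30, 40])

p log.seen
p from_block
```

Because the alias lives on the singleton class, only that one block object responds to process; other Proc instances are unaffected.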
Private Instance Methods
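parse_inline_image(tokenizer)
Parses the inline image at the current position. Returns the image dictionary and the raw image data as a two-element array.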
  # File lib/hexapdf/content/parser.rb, line 204
  def parse_inline_image(tokenizer)
    # BI has already been read, so read the image dictionary
    dict = {}
    while (key = tokenizer.next_object(allow_keyword: true) rescue Tokenizer::NO_MORE_TOKENS)
      if key == 'ID'
        break
      elsif key == Tokenizer::NO_MORE_TOKENS
        raise HexaPDF::Error, "EOS while trying to read dictionary key for inline image"
      elsif !key.kind_of?(Symbol)
        raise HexaPDF::Error, "Inline image dictionary keys must be PDF name objects"
      end
      value = tokenizer.next_object rescue Tokenizer::NO_MORE_TOKENS
      if value == Tokenizer::NO_MORE_TOKENS
        raise HexaPDF::Error, "EOS while trying to read dictionary value for inline image"
      end
      dict[key] = value
    end

    # one whitespace character after ID
    tokenizer.next_byte

    real_end_found = false
    image_data = ''.b

    # find the EI operator and handle EI appearing inside the image data
    until real_end_found
      data = tokenizer.scan_until(/(?=EI(?:[#{Tokenizer::WHITESPACE}]|\z))/o)
      if data.nil?
        raise HexaPDF::Error, "End inline image marker EI not found"
      end
      image_data << data
      tokenizer.pos += 2
      last_pos = tokenizer.pos

      # Check if we found EI inside of the image data
      count = 0
      while count < MAX_TOKEN_CHECK
        token = tokenizer.next_object(allow_keyword: true) rescue Tokenizer::NO_MORE_TOKENS
        if token == Tokenizer::NO_MORE_TOKENS
          count += MAX_TOKEN_CHECK
        elsif token.kind_of?(Tokenizer::Token) && !Processor::OPERATOR_MESSAGE_NAME_MAP.key?(token.to_sym)
          break # invalid token
        end
        count += 1
      end

      if count >= MAX_TOKEN_CHECK
        real_end_found = true
      else
        image_data << "EI"
      end
      tokenizer.pos = last_pos
    end

    [dict, image_data]
  end
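The search for the real EI marker can be sketched with nothing but the standard library's StringScanner. This is a simplified illustration rather than HexaPDF's code: `read_inline_image_data`, `plausible_content?`, the tiny `KNOWN_OPERATORS` list and the value chosen for `MAX_TOKEN_CHECK` are all assumptions made for this example, and whitespace-separated words stand in for the tokens that the real method reads through the tokenizer and checks against Processor::OPERATOR_MESSAGE_NAME_MAP.

```ruby
require 'strscan'

MAX_TOKEN_CHECK = 5  # assumed value for this sketch
KNOWN_OPERATORS = %w[q Q cm Do BT ET Tj m l S f re].freeze  # small hypothetical subset
EI_LOOKAHEAD = /(?=EI(?:[ \t\r\n\f\0]|\z))/

# Accept a candidate EI only if the next few words look like content stream
# data: each one must be a number or a known operator.
def plausible_content?(text)
  text.split.first(MAX_TOKEN_CHECK).all? do |word|
    word.match?(/\A[-+]?\d*\.?\d+\z/) || KNOWN_OPERATORS.include?(word)
  end
end

def read_inline_image_data(scanner)
  image_data = ''.b
  loop do
    data = scanner.scan_until(EI_LOOKAHEAD)
    raise ArgumentError, "End inline image marker EI not found" if data.nil?
    image_data << data
    scanner.pos += 2                 # step over the candidate "EI"
    return image_data if plausible_content?(scanner.rest)
    image_data << "EI"               # false alarm: this EI was part of the data
  end
end

# The first "EI" is followed by bytes that are not valid content, so it is
# kept as image data; the second one is followed by ordinary operators.
scanner = StringScanner.new("\x01\x02 EI \x03zz EI Q q 1 0 0 1 0 0 cm")
image_data = read_inline_image_data(scanner)
remaining = scanner.rest
```

Checking a few tokens after each candidate is a heuristic: binary image data can contain the bytes "EI" followed by whitespace, and looking ahead for plausible operators makes a false match less likely without ruling it out.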