module PDFTDX::Parser
Constants
- LINE_REGEX: Line Regex
- MAX_CELL_LEN: Maximum Cell Length (to be considered usable data)
- PAGE_MAX_TOP: Maximum Allowed Offset from Page Top
- PAGE_OFF: Page Offset
- TITLE_CELL_REGEX: Title Cell Regex
Public Class Methods
Build Data Table: Produces an organized table (in the form of a 2-level nested hash) from an array of HTML chunks.

@param [Array] data An array of document chunks, each represented as a hash containing the position and body of the chunk.
  Example: [{ top: 10, left: 100, data: 'Machine OS' }, { top: 10, left: 220, data: 'Win32' }, { top: 10, left: 340, data: 'Linux' }, { top: 10, left: 460, data: 'MacOS' }]
@return [Hash] A hash of table rows, mapped by their offset from the top, where each row is represented as a hash of table cells, mapped by their offset from the left.
  Example: { 10 => { 100 => 'Machine OS', 220 => 'Win32', 340 => 'Linux', 460 => 'MacOS' }, 35 => { 100 => 'IP Address', 220 => '10.0.232.48', 340 => '10.0.232.134', 460 => '10.0.232.108' } }
# File lib/pdftdx/parser.rb, line 80
def self.build_table data
  table = {}
  data.each { |d| table[d[:top]] ||= {}; table[d[:top]][d[:left]] = d[:data] }
  table
end
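A usage sketch, assuming the gem is loaded and reusing the chunk format documented above:

  chunks = [
    { top: 10, left: 100, data: 'Machine OS' },
    { top: 10, left: 220, data: 'Win32' },
    { top: 35, left: 100, data: 'IP Address' },
    { top: 35, left: 220, data: '10.0.232.48' }
  ]

  PDFTDX::Parser.build_table chunks
  # => { 10 => { 100 => 'Machine OS', 220 => 'Win32' },
  #      35 => { 100 => 'IP Address', 220 => '10.0.232.48' } }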
Collect Data: Extracts table-like chunks of HTML data from a hash of HTML pages.

@param [Hash] data A hash of document pages, mapped by their page index. Each page is an array of chomp'd lines of HTML data.
  Example: { 1 => ['<h1>Hello World!</h1>', 'This is page one.'], 2 => ['Wow, another page of data !', 'Important stuff', "That's it for page 2 !"] }
@return [Array] An array of HTML chunks, each represented as a hash containing the chunk position and data.
  Example: [{ top: 10, left: 100, data: 'Machine OS' }, { top: 10, left: 220, data: 'Win32' }, { top: 10, left: 340, data: 'Linux' }, { top: 10, left: 460, data: 'MacOS' }]
# File lib/pdftdx/parser.rb, line 60
def self.collect_data data
  # Build HTML Entity Decoder
  coder = HTMLEntities.new

  # Collect File Data
  off = 0
  data.collect do |_idx, page|
    off = off + PAGE_OFF
    page
      .select { |l| LINE_REGEX =~ l }                                                               # Collect Table-like data
      .collect { |l| LINE_REGEX.match l }                                                           # Extract Table Element Metadata (Position)
      .collect { |d| { top: off + d[1].to_i, left: d[2].to_i, data: hfilter(coder.decode(d[3])) } } # Produce Hash of Raw Table Data
  end.flatten
end
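A usage sketch. The <text> line format below is an assumption based on pdftohtml's XML output and on the capture groups used in the source (top, left, body); the exact pattern accepted by LINE_REGEX may differ:

  pages = {
    1 => ['<text top="10" left="100" width="80" height="12" font="0">Machine OS</text>',
          '<text top="10" left="220" width="60" height="12" font="0">Win32</text>']
  }

  PDFTDX::Parser.collect_data pages
  # => [{ top: PAGE_OFF + 10, left: 100, data: 'Machine OS' },
  #     { top: PAGE_OFF + 10, left: 220, data: 'Win32' }]
  # (each page's top offsets are shifted by one PAGE_OFF per page)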
Contains Unusable Data (Empty / Long Strings): Determines whether a row contains unusable data.

@param [Hash] row_data A hash of table cells, mapped by their offset from the left.
  Example: { 100 => 'Machine OS', 220 => 'Win32', 340 => 'Linux', 460 => 'MacOS' }
@return [Boolean] True if at least one cell is unusable (empty, oversize), False otherwise
# File lib/pdftdx/parser.rb, line 44
def self.contains_unusable? row_data
  row_data.inject(false) { |b, e| b || (e[1].length == 0) || (e[1].length > MAX_CELL_LEN) }
end
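A usage sketch, assuming the sample cells stay under MAX_CELL_LEN:

  PDFTDX::Parser.contains_unusable?({ 100 => 'Machine OS', 220 => 'Win32' })  # => false
  PDFTDX::Parser.contains_unusable?({ 100 => 'Machine OS', 220 => '' })       # => true (empty cell)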
Filter Table Rows: Filters out rows considered unusable (empty, oversize, footers, etc.). Also strips Top Offset info from Table Rows.

@param [Hash] data A hash of table rows, mapped by their offset from the top, where each row is represented as a hash of table cells, mapped by their offset from the left.
  Example: { 10 => { 100 => 'Machine OS', 220 => 'Win32', 340 => 'Linux', 460 => 'MacOS' }, 35 => { 100 => 'IP Address', 220 => '10.0.232.48', 340 => '10.0.232.134', 460 => '10.0.232.108' } }
@return [Array] An array of table rows, each represented as a hash of table cells, mapped by their offset from the left.
  Example: [{ 100 => 'Machine OS', 220 => 'Win32', 340 => 'Linux', 460 => 'MacOS' }, { 100 => 'IP Address', 220 => '10.0.232.48', 340 => '10.0.232.134', 460 => '10.0.232.108' }]
# File lib/pdftdx/parser.rb, line 91
def self.filter_rows data
  data
    .reject { |top, row| row.size < 2 || (top % PAGE_OFF) >= PAGE_MAX_TOP || is_all_same?(row) || contains_unusable?(row) } # Drop Single-Element Rows, Footer Data, Useless Rows (all cells identical) & Unusable Rows (Empty / Oversize Cells)
    .collect { |_top, r| r }.reject { |r| r.size < 2 }                                                                      # Remove 'top offset' information and re-drop single-element rows
end
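A usage sketch, assuming the sample top offsets fall under PAGE_MAX_TOP and the cells under MAX_CELL_LEN:

  table = {
    10 => { 100 => 'Machine OS', 220 => 'Win32' },
    35 => { 100 => 'IP Address', 220 => '10.0.232.48' },
    60 => { 100 => 'Lonely cell' },       # single-element row, dropped
    85 => { 100 => 'N/A', 220 => 'N/A' }  # all cells identical, dropped
  }

  PDFTDX::Parser.filter_rows table
  # => [{ 100 => 'Machine OS', 220 => 'Win32' }, { 100 => 'IP Address', 220 => '10.0.232.48' }]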
Fix Dupes: Shifts Duplicate Cells (Cells which share their x-offset with others) to the right (so they don't get overwritten).

@param [Array] r A row of data in the form [[xoffset, cell]]
  Example: [[120, 'cell 0'], [200, 'cell 1'], [280, 'cell 2']]
@return [Array] The same row of data, but with duplicate cells shifted so that no x-offset collisions occur
# File lib/pdftdx/parser.rb, line 159
def self.fix_dupes r
  # Deep-Duplicate Row
  nr = r.collect { |e| e.clone }

  # Run through Cells
  nr.length.times do |i|
    # Acquire Duplicate Length
    dupes = nr.slice(i + 1, nr.length).inject(0) { |a, c| a + (c[0] == nr[i][0] ? 1 : 0) }

    # Fix Dupes
    dupes.times { |j| nr[i + j + 1][0] = nr[i + j + 1][0] + 1 }
  end
  nr
end
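A usage sketch showing two cells colliding at x-offset 120:

  PDFTDX::Parser.fix_dupes [[120, 'cell 0'], [120, 'cell 1'], [200, 'cell 2']]
  # => [[120, 'cell 0'], [121, 'cell 1'], [200, 'cell 2']]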
HTML Filter: Replaces HTML line breaks with UNIX-style newlines.

@param [String] s A string of HTML data
@return [String] The same string of HTML data, with all line breaks (<br/> tags) converted to UNIX newlines
# File lib/pdftdx/parser.rb, line 52
def self.hfilter s
  s.gsub '<br/>', "\n"
end
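A usage sketch:

  PDFTDX::Parser.hfilter 'Line one<br/>Line two'
  # => "Line one\nLine two"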
Determine Headered Table Length: Computes the number of rows to be included in a given headered table.

@param [Array] table An array of table rows, each represented as a hash of table cells, mapped by their offset from the left.
  Example: [{ 100 => 'Machine OS', 220 => 'Win32', 340 => 'Linux', 460 => 'MacOS' }, { 100 => 'IP Address', 220 => '10.0.232.48', 340 => '10.0.232.134', 460 => '10.0.232.108' }]
@param [Array] headers An array of header rows, each represented as a hash containing the header row's index within the table array, and the actual row data.
  Example: [{ idx: 0, row: ['trauma.eresse.net', 'durjaya.dooba.io', 'suessmost.eresse.net'] }]
@param [Hash] h The current header row (determine htable length from this)
@param [Fixnum] i The current header's index within the headers array
@return [Fixnum] The number of rows
# File lib/pdftdx/parser.rb, line 104
def self.htable_length table, headers, h, i
  (headers[i + 1] ? headers[i + 1][:idx] : table.length) - h[:idx]
end
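A usage sketch with a 7-row table split by two headers (row contents are placeholders, irrelevant to the length computation):

  table   = Array.new(7) { {} }
  headers = [{ idx: 0, row: {} }, { idx: 4, row: {} }]

  PDFTDX::Parser.htable_length table, headers, headers[0], 0  # => 4 (rows 0..3)
  PDFTDX::Parser.htable_length table, headers, headers[1], 1  # => 3 (rows 4..6, up to the end of the table)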
Is All Same Data: Determines whether a row's cells all contain the same data.

@param [Hash] row_data A hash of table cells, mapped by their offset from the left.
  Example: { 100 => 'Machine OS', 220 => 'Win32', 340 => 'Linux', 460 => 'MacOS' }
@return [Boolean] True if all cells contain the same data, False otherwise
# File lib/pdftdx/parser.rb, line 35
def self.is_all_same? row_data
  n = row_data[row_data.keys[0]]
  row_data.inject(true) { |b, e| b && (e[1] == n) }
end
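A usage sketch:

  PDFTDX::Parser.is_all_same?({ 100 => 'N/A', 220 => 'N/A', 340 => 'N/A' })  # => true
  PDFTDX::Parser.is_all_same?({ 100 => 'Machine OS', 220 => 'Win32' })       # => false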
Process: Transforms a hash of page data (as produced by pdftohtml) into a usable information table tree structure.

@param [Hash] page_data A hash of document pages, mapped by their page index. Each page is an array of chomp'd lines of HTML data.
  Example: { 1 => ['<h1>Hello World!</h1>', 'This is page one.'], 2 => ['Wow, another page of data !', 'Important stuff', "That's it for page 2 !"] }
@return [Array] An array of tables, each represented as a hash containing an optional header and table data, in the form of either one single array of rows, or a hash of sub-tables (arrays of rows) mapped by name. Table rows are represented as an array of table cells.
  Example: [{ head: ['trauma.eresse.net', 'durjaya.dooba.io', 'suessmost.eresse.net'], data: { 'System' => [['Machine OS', 'Win32', 'Linux', 'MacOS'], ['IP Address', '10.0.232.48', '10.0.232.134', '10.0.232.108']] } }]
# File lib/pdftdx/parser.rb, line 226
def self.process page_data
  # Collect Data
  data = collect_data page_data

  # Build Data Table
  table = build_table data

  # Filter Rows
  table = filter_rows table

  # Filter Table Cells & Touch up
  touch_up table
end
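A usage sketch for the full pipeline. The require statement and file names are assumptions; the page hash is built from one pdftohtml output file per page:

  require 'pdftdx'  # assumed gem entry point

  # 'page1.html' / 'page2.html' are hypothetical pdftohtml output files
  page_data = {
    1 => File.readlines('page1.html', chomp: true),
    2 => File.readlines('page2.html', chomp: true)
  }

  tables = PDFTDX::Parser.process page_data
  # => an array of { head: ..., data: ... } tables, in the format documented above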
Sort Row: Sorts Cells according to their x-offset.

@param [Hash] r A row of data in the form { xoffset => cell }
  Example: { 120 => 'cell 0', 200 => 'cell 1', 280 => 'cell 2' }
@return [Hash] The same row of data, but sorted according to x-offset
# File lib/pdftdx/parser.rb, line 151
def self.sort_row r
  Hash[*(r.to_a.sort { |a, b| ((a[0] == b[0]) ? 0 : (a[0] > b[0] ? 1 : -1)) }.flatten)]
end
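A usage sketch:

  PDFTDX::Parser.sort_row({ 280 => 'cell 2', 120 => 'cell 0', 200 => 'cell 1' })
  # => { 120 => 'cell 0', 200 => 'cell 1', 280 => 'cell 2' }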
Sub Table Length: Computes the number of rows to be included in a given sub-table.

@param [Array] table An array of table rows, each represented as an array of table cells.
  Example: [['System', 'Machine OS', 'Win32', 'Linux', 'MacOS'], ['IP Address', '10.0.232.48', '10.0.232.134', '10.0.232.108']]
@param [Array] stables An array of named tables, each represented as a hash containing the name and its starting index within the table array.
  Example: [{ title: 'System Info', idx: 0 }]
@param [Hash] t The current sub-table title row (determine stable length from this)
@param [Fixnum] i The current sub-table title's index within the stables array
@return [Fixnum] The number of rows
# File lib/pdftdx/parser.rb, line 115
def self.sub_tab_len table, stables, t, i
  (stables[i + 1] ? stables[i + 1][:idx] : table.length) - t[:idx]
end
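A usage sketch with a 6-row table split into two sub-tables (row contents are placeholders, irrelevant to the length computation):

  table   = Array.new(6) { [] }
  stables = [{ title: 'System Info', idx: 0 }, { title: 'Network Info', idx: 3 }]

  PDFTDX::Parser.sub_tab_len table, stables, stables[0], 0  # => 3 (rows 0..2)
  PDFTDX::Parser.sub_tab_len table, stables, stables[1], 1  # => 3 (rows 3..5, up to the end of the table)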
Sub-Tablize: Splits a table into multiple named tables.

@param [Array] htable_data An array of table rows, each represented as an array of table cells.
  Example: [['System', 'Machine OS', 'Win32', 'Linux', 'MacOS'], ['IP Address', '10.0.232.48', '10.0.232.134', '10.0.232.108']]
@return [Array] An array of named tables, each represented as a hash containing the name and the table itself. May also contain a single array, containing all remaining table data (unnamed).
  Example: [{ name: 'System', data: [['Machine OS', 'Win32', 'Linux', 'MacOS'], ['IP Address', '10.0.232.48', '10.0.232.134', '10.0.232.108']] }, [['32.40 $', '34.00 $', '88.40 $'], ['21.40 km', '12.00 km', '99.10 km']]]
# File lib/pdftdx/parser.rb, line 123
def self.sub_tablize htable_data
  # Collect Sub-table Title Rows
  subtab_titles = htable_data.collect.with_index { |r, i| { idx: i, row: r } }
                             .select { |e| TITLE_CELL_REGEX =~ e[:row][0] }
                             .collect { |e| { title: e[:row][0], idx: e[:idx] } }

  # Pull up Sub-tables
  stables = subtab_titles.collect.with_index do |t, i|
    {
      name: t[:title].gsub(/<\/?b>/, ''),                                                       # Extract Sub-Table Name
      data: htable_data                                                                         # Extract Sub-Table Data
              .slice(t[:idx], sub_tab_len(htable_data, subtab_titles, t, i))                    # Slice Table Data until next Sub-Table
              .collect { |e| e.reject.with_index { |c, ii| ii == 0 && TITLE_CELL_REGEX =~ c } } # Reject Table Headers
    }
  end

  # Data until first sub-table index is considered 'unsorted'
  unsorted_end = subtab_titles.empty? ? htable_data.length : subtab_titles[0][:idx]

  # Insert last part (Unsorted)
  stables << htable_data.slice(0, unsorted_end) if unsorted_end > 0
  stables
end
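A usage sketch reusing the documented example. The '<b>...</b>' markup on the title cell is an assumption: TITLE_CELL_REGEX appears to target bold cells (given the <b> tag stripping in the source), but its exact pattern is not shown here:

  rows = [
    ['<b>System</b>', 'Machine OS', 'Win32', 'Linux', 'MacOS'],
    ['IP Address', '10.0.232.48', '10.0.232.134', '10.0.232.108']
  ]

  PDFTDX::Parser.sub_tablize rows
  # => [{ name: 'System',
  #       data: [['Machine OS', 'Win32', 'Linux', 'MacOS'],
  #              ['IP Address', '10.0.232.48', '10.0.232.134', '10.0.232.108']] }]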
Touch up Table: Splits Table into multiple headered tables. Also strips Left Offset info from Table Cells.

@param [Array] table An array of table rows, each represented as a hash of table cells, mapped by their offset from the left.
  Example: [{ 100 => 'Machine OS', 220 => 'Win32', 340 => 'Linux', 460 => 'MacOS' }, { 100 => 'IP Address', 220 => '10.0.232.48', 340 => '10.0.232.134', 460 => '10.0.232.108' }]
@return [Array] An array of tables, each represented as either a single array of rows, or a hash containing a header and table data, in the form of either one single array of rows, or a hash of sub-tables (arrays of rows) mapped by name. Table rows are represented as an array of table cells.
  Example: [{ head: ['trauma.eresse.net', 'durjaya.dooba.io', 'suessmost.eresse.net'], data: [{ name: 'System', data: [['Machine OS', 'Win32', 'Linux', 'MacOS'], ['IP Address', '10.0.232.48', '10.0.232.134', '10.0.232.108']] }] }]
# File lib/pdftdx/parser.rb, line 182
def self.touch_up table
  # Split Table into multiple Headered Tables
  headers = table
            .collect.with_index { |r, i| { idx: i, row: r } }
            .select { |e| e[:row].inject(true) { |b, c| b && (TITLE_CELL_REGEX =~ c[1]) } }
            .collect { |r| { idx: r[:idx], row: r[:row].collect { |o, v| { o => v.gsub(/<\/?b>/, '') } } } }

  # Pull up Headered Tables
  htables = headers.collect.with_index { |h, i| { head: h[:row], data: table.slice(h[:idx] + 1, htable_length(table, headers, h, i) - 1) } }

  # Fix Rows
  nh = htables.collect do |t|
    # Acquire Column Offsets
    cols = t[:head].collect { |o| o.first[0] }.sort

    # Compute Row Base (Default Columns)
    row_base = Hash[*(cols.collect { |c| [c, ''] }.flatten)]

    # Re-Build Table
    { head: t[:head], data: t[:data].collect { |r| sort_row row_base.merge(Hash[*((fix_dupes r.collect { |o, c| [(cols.reverse.find { |co| co <= o }) || o, c] }).flatten)]) } }
  end

  # Drop Offsets
  htables = nh.collect { |t| { head: t[:head].collect { |h| h.first[1] }, data: t[:data].collect { |r| r.collect { |_o, c| c } } } }
  ntable = table.collect { |r| r.collect { |_o, c| c } }

  # Split Headered Tables into multiple Named Sub-Tables
  htables.collect! { |ht| { head: ht[:head], data: sub_tablize(ht[:data]) } }

  # Data until first Header index is considered 'unsorted'
  unsorted_end = headers.empty? ? ntable.length : headers[0][:idx]

  # Insert last part (Unsorted)
  htables << sub_tablize(ntable.slice(0, unsorted_end)) if unsorted_end > 0
  htables
end
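A usage sketch under the assumption that TITLE_CELL_REGEX matches the bold header cells (see the note on sub_tablize above); the exact output nesting depends on that constant:

  table = [
    { 220 => '<b>trauma.eresse.net</b>', 340 => '<b>durjaya.dooba.io</b>', 460 => '<b>suessmost.eresse.net</b>' },
    { 100 => 'Machine OS', 220 => 'Win32', 340 => 'Linux', 460 => 'MacOS' }
  ]

  PDFTDX::Parser.touch_up table
  # => [{ head: ['trauma.eresse.net', 'durjaya.dooba.io', 'suessmost.eresse.net'],
  #       data: [[['Machine OS', 'Win32', 'Linux', 'MacOS']]] }]
  # (the header row is stripped of offsets and <b> tags; the remaining row is grouped
  #  under it as a single unnamed sub-table)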