extractTables {orderanalyzer} | R Documentation |
Extract tables from a given words-dataframe
Description
This function extracts order-position tables from PDF-based order documents. It first tries to identify table rows using a clustering approach and then identifies the column structure. A table row can consist of multiple text rows, and the text rows can span different columns. The function also tries to identify the meaning of the columns (position, articleID, description, quantity, quantity unit, unit price, total price, currency, date).
Usage
extractTables(text, minCols = 3, maxDistance = 20, entityNames = NA)
Arguments
text |
List including several representations of text extracted from a PDF file. This list is generated by the function extractText. |
minCols |
Minimum number of columns a table must have |
maxDistance |
Maximum number of text lines allowed between the starts of two table rows |
entityNames |
A list of four name vectors (currencyUnits, quantityUnits, headerNames, noTableNames). Each vector contains strings that correspond to currency units, quantity units, header names, or names of entities that are not tables, respectively. |
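As a rough sketch of how such a list might be assembled, the following builds an entityNames list in base R. The helper makeEntityNames is hypothetical and not part of orderanalyzer; the four element names come from the argument description above, and the sample strings mirror those used in the Examples section. Whether the function accepts empty vectors for some of the four elements is not stated in this documentation, so the defaults below are an assumption.

```r
# Hypothetical helper (not part of orderanalyzer): assembles the four
# name vectors into the list shape described for entityNames.
makeEntityNames <- function(currencyUnits = character(),
                            quantityUnits = character(),
                            headerNames = character(),
                            noTableNames = character()) {
  entityNames <- list(
    currencyUnits = currencyUnits,
    quantityUnits = quantityUnits,
    headerNames   = headerNames,
    noTableNames  = noTableNames
  )
  # Every element must be a character vector
  stopifnot(all(vapply(entityNames, is.character, logical(1))))
  # Convert to UTF-8, as the package examples do with enc2utf8()
  lapply(entityNames, enc2utf8)
}

entityNames <- makeEntityNames(
  currencyUnits = c("eur", "euro", "\u20AC"),
  quantityUnits = c("pcs", "pcs."),
  headerNames   = c("pos", "item", "quantity"),
  noTableNames  = c("order total", "supplier number")
)
str(entityNames)
```

The resulting list could then be passed as extractTables(text, entityNames = entityNames).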
Value
List of lists describing the tables. Each sublist includes a data frame (data) containing the identified table, the positions of the text lines that constitute the table, and the positions of the significant lines.
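The sketch below illustrates one way to walk over such a list of lists and summarise every detected table rather than only the first. It uses a mock return value so that it runs without a PDF; only the element name data is taken from this documentation, and the column names and values in the mock data frames are made up for illustration.

```r
# Mock return value for illustration only; a real one would come from
# extractTables(text). Only the `data` element name is documented here,
# so the other elements of each sublist are omitted.
tables <- list(
  list(data = data.frame(position = 1:2,
                         quantity = c(10, 5),
                         stringsAsFactors = FALSE)),
  list(data = data.frame(position = 1:3,
                         quantity = c(1, 2, 3),
                         stringsAsFactors = FALSE))
)

# Summarise the size of each detected table
tableSizes <- data.frame(
  table = seq_along(tables),
  rows  = vapply(tables, function(tbl) nrow(tbl$data), integer(1)),
  cols  = vapply(tables, function(tbl) ncol(tbl$data), integer(1))
)
print(tableSizes)
```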
Examples
file <- system.file("extdata", "OrderDocument_en.pdf", package = "orderanalyzer")
text <- extractText(file)
# Extracting order tables without any further information
tables <- extractTables(text)
tables[[1]]$data
# Extracting order tables with further information
tables <- extractTables(text,
  entityNames = list(currencyUnits = enc2utf8(c("eur", "euro", "\u20AC")),
                     quantityUnits = enc2utf8(c("pcs", "pcs.")),
                     headerNames = enc2utf8(c("pos", "item", "quantity")),
                     noTableNames = enc2utf8(c("order total", "supplier number")))
)
tables[[1]]$data
# Extracting order tables from a German document
file <- system.file("extdata", "OrderDocument_de.pdf", package = "orderanalyzer")
text <- extractText(file)
tables <- extractTables(text)
tables[[1]]$data