class Daru::IO::Importers::HTML

HTML Importer Class, that extends `read_html` method to `Daru::DataFrame`

@note

Please note that this module works only for static table elements on a
HTML page, and won't work in cases where the data is being loaded into
the HTML table by inline Javascript.

Public Class Methods

new() click to toggle source

Checks for required gem dependencies of HTML Importer

# File lib/daru/io/importers/html.rb, line 16
def initialize
  require 'open-uri'
  optional_gem 'nokogiri'
end

Public Instance Methods

call(match: nil, order: nil, index: nil, name: nil) click to toggle source

Imports Array of `Daru::DataFrame`s from a HTML Importer instance

@param match [String] A `String` to match and choose a particular table(s)

from multiple tables of a HTML page.

@param index [Array or Daru::Index or Daru::MultiIndex] If given, it

overrides the parsed index. Have a look at `:index` option, at
{http://www.rubydoc.info/gems/daru/0.1.5/Daru%2FDataFrame:initialize
Daru::DataFrame#initialize}

@param order [Array or Daru::Index or Daru::MultiIndex] If given, it

overrides the parsed order. Have a look at `:order` option
[here](http://www.rubydoc.info/gems/daru/0.1.5/Daru%2FDataFrame:initialize)

@param name [String] As `name` of the imported `Daru::DataFrame` isn't

parsed automatically by the module, users can set the name attribute to
their `Daru::DataFrame` manually, through this option.

See `:name` option
[here](http://www.rubydoc.info/gems/daru/0.1.5/Daru%2FDataFrame:initialize)

@return [Array<Daru::DataFrame>]

@example Importing with matching tables

list_of_dfs = instance.call(match: 'Sun Pharma')
list_of_dfs.count
#=> 4

df = list_of_dfs.first

# As the website keeps changing everyday, the output might not be exactly
# the same as the one obtained below. Nevertheless, a Daru::DataFrame
# should be obtained (as long as 'Sun Pharma' is there on the website).

#=> <Daru::DataFrame(5x4)>
#        Company      Price     Change Value (Rs
#   0 Sun Pharma     502.60     -65.05   2,117.87
#   1   Reliance    1356.90      19.60     745.10
#   2 Tech Mahin     379.45     -49.70     650.22
#   3        ITC     315.85       6.75     621.12
#   4       HDFC    1598.85      50.95     553.91
# File lib/daru/io/importers/html.rb, line 75
def call(match: nil, order: nil, index: nil, name: nil)
  @match   = match
  @options = {name: name, index: index, order: order}

  @file_data
    .search('table')
    .map { |table| parse_table(table) }
    .compact
    .keep_if { |table| satisfy_dimension(table) && search(table) }
    .map { |table| decide_values(table, @options) }
    .map { |table| table_to_dataframe(table) }
end
read(path) click to toggle source

Reads from a html file / website

@!method self.read(path)

@param path [String] Website URL / path to HTML file, where the

DataFrame is to be imported from.

@return [Daru::IO::Importers::HTML]

@example Reading from a website url file

instance = Daru::IO::Importers::HTML.read('http://www.moneycontrol.com/')
# File lib/daru/io/importers/html.rb, line 32
def read(path)
  @file_data = Nokogiri.parse(open(path).read)
  self
end

Private Instance Methods

decide_values(scraped_val, user_val) click to toggle source

Allows user to override the scraped order / index / data

# File lib/daru/io/importers/html.rb, line 91
def decide_values(scraped_val, user_val)
  scraped_val.merge(user_val) { |_key, scraped, user| user || scraped }
end
parse_hash(headers, size, headers_size) click to toggle source

Splits headers (all th tags) into order and index. Wherein, Order : All <th> tags on first proper row of HTML table index : All <th> tags on first proper column of HTML table

# File lib/daru/io/importers/html.rb, line 98
def parse_hash(headers, size, headers_size)
  headers_index = headers.find_index { |x| x.count == headers_size }
  order = headers[headers_index]
  order_index = order.count - size
  order = order[order_index..-1]
  indice = headers[headers_index+1..-1].flatten
  indice = nil if indice.to_a.empty?
  [order, indice]
end
parse_table(table) click to toggle source
# File lib/daru/io/importers/html.rb, line 108
def parse_table(table)
  headers, headers_size = scrape_tag(table,'th')
  data, size = scrape_tag(table, 'td')
  data = data.keep_if { |x| x.count == size }
  order, indice = parse_hash(headers, size, headers_size) if headers_size >= size
  return unless (indice.nil? || indice.count == data.count) && !order.nil? && order.count>0
  {data: data.compact, index: indice, order: order}
end
satisfy_dimension(table) click to toggle source
# File lib/daru/io/importers/html.rb, line 123
def satisfy_dimension(table)
  return false if @options[:order] && table[:data].first.size != @options[:order].size
  return false if @options[:index] && table[:data].size != @options[:index].size
  true
end
scrape_tag(table, tag) click to toggle source
# File lib/daru/io/importers/html.rb, line 117
def scrape_tag(table, tag)
  arr  = table.search('tr').map { |row| row.search(tag).map { |val| val.text.strip } }
  size = arr.map(&:count).max
  [arr, size]
end
table_to_dataframe(table) click to toggle source
# File lib/daru/io/importers/html.rb, line 133
def table_to_dataframe(table)
  Daru::DataFrame.rows(
    table[:data],
    index: table[:index],
    order: table[:order],
    name: table[:name]
  )
end