class SearchYJ::Searcher

Search from the search engine, parse HTML, dig the atound page

@author [indeep-xyz]

Constants

ENCODING
OpenUriError
USER_AGENT

Attributes

limit_loop[RW]
pager[RW]
results[R]
sleep_time[RW]
uri[RW]
user_agent[RW]

Public Class Methods

new( encoding: ENCODING, from: 1, sleep_time: 1, limit_loop: 50, user_agent: USER_AGENT) click to toggle source

Initialize myself. @param encoding [String]

The character encoding that is used to parse HTML

@param from [Integer]

Start to search from this number of the search ranking

@param sleep_time [Integer]

The time of sleep after fetching from internet

@param limit_loop [Integer]

The number of limit that is connectable in one process

@param user_agent [String]

Specify the user agent when open uri
# File lib/searchyj/searcher.rb, line 39
def initialize(
    encoding:   ENCODING,
    from:       1,
    sleep_time: 1,
    limit_loop: 50,
    user_agent: USER_AGENT)
  @pager      = PageSizeAdjuster.new
  @uri        = UriManager.new
  @uri.index  = from
  @encoding   = encoding
  @limit_loop = limit_loop
  @sleep_time = sleep_time
  @user_agent = user_agent
end

Public Instance Methods

run(&block) click to toggle source
# File lib/searchyj/searcher.rb, line 54
def run(&block)
  loop_count = 0
  sorter = RecordSorter.new(@uri.index, @pager.size)

  while loop_count < @limit_loop
    fetch_html
    records = extract_records

    sorter.run(records, &block)

    if records.empty? || final_page?
      break
    end

    next_page(records.size + sorter.page_gap)
    sleep @sleep_time
    loop_count += 1
  end
end

Private Instance Methods

download_raw_html() click to toggle source

Download raw HTML from YJ and return it.

@return [String] raw HTML

# File lib/searchyj/searcher.rb, line 98
def download_raw_html
  uri    = @uri.to_s
  params = {
      'User-Agent' => @user_agent
  }

  params = @pager.attach_cookie(params)

  open(uri, params) do |f|
    fail OpenUriError unless f.status[0] == '200'
    f.read
  end
end
extract_records() click to toggle source

Extract and optimize the records from my own HTML instance.

@return [Array]

Include Hash, [:uri, title]
# File lib/searchyj/searcher.rb, line 81
def extract_records
  results = []
  nodes   = @html.css('#WS2m>.w h3 a')

  nodes.each do |node|
    results.push(
        uri:   node.attribute('href').text,
        title: node.text
    )
  end

  results
end
fetch_html() click to toggle source

Download HTML from YJ and set the parsed HTML data to my own instance.

# File lib/searchyj/searcher.rb, line 114
def fetch_html
  raw_html = download_raw_html
  @html = Nokogiri::HTML.parse(raw_html, nil, @encoding)
end
final_page?() click to toggle source

Check whether or not the next page is exist.

@return [bool]

It is true if the navigation element
for the next page is exist.
Else false.
# File lib/searchyj/searcher.rb, line 125
def final_page?
  a = @html.css('#Sp1 .m a').last

  !(a.is_a?(Nokogiri::XML::Element) &&
    a.text.include?('次へ'))
end
next_page(page_size) click to toggle source

Move to the next page.

# File lib/searchyj/searcher.rb, line 133
def next_page(page_size)
  @uri.move_index(page_size)
end