class Aquanaut::Worker

The worker contains the actual crawling procedure.

Public Class Methods

new(target)
# File lib/aquanaut/worker.rb, line 8
def initialize(target)
  uri = URI.parse(target)
  @queue = [uri]                          # work queue, seeded with the start URI
  @domain = PublicSuffix.parse(uri.host)  # remembered to keep the crawl on-domain

  @visited = Hash.new(false)              # URIs already crawled, defaulting to false

  @agent = Mechanize.new do |agent|
    agent.open_timeout = 5                # seconds before giving up on a connection
    agent.read_timeout = 5                # seconds before giving up on a response
  end
end
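
A minimal usage sketch, assuming the worker is loaded via the file path shown above and that the gem's dependencies (mechanize, public_suffix) come with it; the target URL is only a placeholder:

require 'aquanaut/worker'  # assumed require path, matching lib/aquanaut/worker.rb

# Any absolute HTTP(S) URL works as the starting point; example.com is a placeholder.
worker = Aquanaut::Worker.new("http://example.com")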

Public Instance Methods

explore() { |uri, links, assets| ... }

Triggers the crawling process.

# File lib/aquanaut/worker.rb, line 23
def explore
  until @queue.empty?
    uri = @queue.shift  # dequeue
    next if @visited[uri]

    @visited[uri] = true
    puts "Visit #{uri}"

    links, assets = links(uri)  # links and assets found on the page
    links.each do |link|
      @queue.push(link) unless @visited[link]  # enqueue
    end

    yield uri, links, assets if block_given?
  end
end
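
Driving the crawl could look like the following sketch; the block arguments match the documented signature, and the example assumes links and assets arrive as arrays:

worker = Aquanaut::Worker.new("http://example.com")

worker.explore do |uri, links, assets|
  # Called once per visited page with the links and assets found on it.
  puts "#{uri}: #{links.size} links, #{assets.size} assets"
end
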
internal?(link)

Evaluates whether a link stays within the initial domain.

Used to keep the crawler inside the initial domain. To determine this, it compares the second-level and top-level domains. If the public suffix cannot be parsed, possibly because the host is invalid, it returns true so that the link does not go unchecked.

@param link [URI] the link to be checked.

@return [Boolean] whether the link is internal or not.

# File lib/aquanaut/worker.rb, line 105
def internal?(link)
  return true unless PublicSuffix.valid?(link.host)
  link_domain = PublicSuffix.parse(link.host)
  @domain.sld == link_domain.sld and @domain.tld == link_domain.tld
end
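
For illustration, assuming the worker was initialised with http://example.com as above, the comparison behaves roughly as follows (a sketch, not output from the gem's test suite):

require 'uri'

worker = Aquanaut::Worker.new("http://example.com")

worker.internal?(URI.parse("http://blog.example.com/post"))  # => true, same SLD and TLD
worker.internal?(URI.parse("http://example.org/"))           # => false, different TLD
worker.internal?(URI.parse("http://localhost/"))             # => true, host has no valid public suffix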