class Shelob::Spider

This is the central workhorse class of Shelob. It takes a root url, fetches it, extracts the links on that page, and then fetches every link that is a child of the root url, repeating until no unvisited children remain.

Attributes

hostname[RW]

The root url beneath which this Spider instance works. Only links that start with this string are followed.

queue[RW]

The current queue of urls to check

Public Class Methods

new(hostname, options = {})

Create a new spider with the given hostname and options.

Valid options:

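  • verbose: 0 for no output, 1 for progress output (one dot per url), 2 for verbose output (each url is printed, followed by "checked!")

  • seed: the url to start spidering from, if it differs from the root url you’re providing. Defaults to the root url.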

# File lib/shelob.rb, line 26
def initialize hostname, options = {}
  # Data
  @hostname = hostname

  # Options
  @verbose = options[:verbose] == 1 # progress dots
  @chatty = options[:verbose] == 2  # one line per url

  # Internal
  if options[:seed].nil?
    @queue = [ hostname ]
  else
    @queue = [ options[:seed] ]
  end
end
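
As a minimal sketch of how the verbose option is interpreted, the hypothetical helper below (verbosity_flags is not part of Shelob) repeats the two comparisons made in the constructor. It suggests that the levels are exclusive: passing 2 enables the per-url output but not the progress dots.

```ruby
# Hypothetical helper, not part of Shelob: mirrors how initialize
# turns options[:verbose] into its two internal flags.
def verbosity_flags(options = {})
  {
    verbose: options[:verbose] == 1, # progress dots
    chatty: options[:verbose] == 2   # one line per url
  }
end

p verbosity_flags(verbose: 0)
p verbosity_flags(verbose: 1)
p verbosity_flags(verbose: 2)
p verbosity_flags # no option given: both flags are false
```

Any value other than 1 or 2, including a missing option, leaves the spider silent.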

Public Instance Methods

check()

Entry point to the main spider process. This is the main API method; it blocks, and returns once the site has been completely spidered.

Returns an array of all failed fetches, one entry per failed url, each with its particular error code (404, 500, etc.).

# File lib/shelob.rb, line 48
def check
  # set up variables
  @urls ||= Set.new
  @failures ||= []

  # kick the spider off
  run_spider

  @failures
end

enqueue(links)

Add the given links to our internal queue to ensure they are checked.

# File lib/shelob.rb, line 127
def enqueue links
  children = filter links

  @queue.push(*children)
end

extract(url)

Fetch the given url and extract the links from the resulting page.

Returns an array of all link targets on the page.

# File lib/shelob.rb, line 111
def extract url
  page = fetch url

  Extractor.new(page).extract
end

fetch(url)

Load a page from the internet, appending it to the failures array if the fetch encountered an error.

Returns a LinkResult with the results of fetching the page.

# File lib/shelob.rb, line 100
def fetch url
  page = Resolver.new(url).resolve

  @failures << page if page.failed?

  page
end
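
The failure bookkeeping described above can be sketched without any network access. Everything in this block is an assumption made for illustration: FakeResult stands in for LinkResult (whose real fields are not shown here), FetchSketch stands in for the spider, and the response codes are canned values rather than real requests.

```ruby
# Hypothetical stand-in for Shelob's LinkResult; the real class may differ.
FakeResult = Struct.new(:url, :code) do
  def failed?
    code >= 400
  end
end

# Hypothetical stand-in for the spider, reduced to its fetch logic.
class FetchSketch
  attr_reader :failures

  # responses maps a url to a canned status code (no real HTTP is done).
  def initialize(responses)
    @responses = responses
    @failures = []
  end

  # Same shape as fetch above: resolve, record a failure, return the result.
  def fetch(url)
    page = FakeResult.new(url, @responses.fetch(url, 500))
    @failures << page if page.failed?
    page
  end
end

sketch = FetchSketch.new({
  "http://example.com/" => 200,
  "http://example.com/missing" => 404
})
sketch.fetch("http://example.com/")
sketch.fetch("http://example.com/missing")
sketch.failures.each { |f| puts "#{f.url} failed with #{f.code}" }
```

Because the result is returned whether or not it failed, the caller can still pass it on for link extraction.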

fetched()

Return the set of all urls that were fetched in the process of spidering the site. The value is a Set rather than an array, and is nil until check has been called.

# File lib/shelob.rb, line 78
def fetched
  return @urls
end

filter(links)

Filter links, keeping only those that are children of the root url, and remove duplicates.

# File lib/shelob.rb, line 119
def filter links
  links.select do |link|
    link.start_with? @hostname
  end.uniq
end
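
To make the prefix check concrete, here is a small standalone sketch. The helper filter_links and the sample urls are assumptions made for illustration; only the select, start_with? and uniq calls come from the method above.

```ruby
# Hypothetical standalone version of the prefix filter shown above.
def filter_links(root, links)
  links.select { |link| link.start_with?(root) }.uniq
end

links = [
  "http://example.com/about",
  "http://example.com/about",    # duplicate, removed by uniq
  "http://other.org/",           # different site, dropped
  "/contact",                    # relative link, dropped
  "http://example.com.evil.org/" # different host that shares the prefix
]

p filter_links("http://example.com", links)
p filter_links("http://example.com/", links)
```

Since the check is a plain string prefix match, a root url without a trailing slash also matches hosts that merely begin with the same characters. Passing the root url with a trailing slash avoids this.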

finish(url)

Signal that processing is done on a given url, so that it won’t be checked again.

# File lib/shelob.rb, line 135
def finish url
  @urls << url
end

post_process_notify(url)

Notify that a url has just been processed. Currently only used to print status.

# File lib/shelob.rb, line 90
def post_process_notify url
  print '.' if @verbose
  puts "checked!" if @chatty
end

pre_process_notify(url)

Notify that a url is about to be processed. Currently only used to print status.

# File lib/shelob.rb, line 84
def pre_process_notify url
  print "#{url}... " if @chatty
end

process(url)

Given a url, fetch it, extract all links, and enqueue those links for later processing.

# File lib/shelob.rb, line 141
def process url
  links = extract url

  enqueue links

  finish url
end

remaining()

Returns a count of the urls still waiting in the queue. This number is only a snapshot of the current state, as more urls are added whenever another url resolves. It can also include urls that have already been fetched, because duplicates are only skipped when they are taken off the queue.

At this time, this is only useful to call from another thread, as check is a blocking call.

# File lib/shelob.rb, line 66
def remaining
  return @queue.count
end

requests()

Return the total number of urls that were fetched in the spidering process. The set of fetched urls is only created by check, so calling this before check raises a NoMethodError.

# File lib/shelob.rb, line 72
def requests
  return @urls.count
end

run_spider()

Internal helper method to kick off the spider once everything has been properly configured.

# File lib/shelob.rb, line 151
def run_spider
  until @queue.empty?
    url = @queue.shift

    next if @urls.include? url

    pre_process_notify url

    process url

    post_process_notify url
  end
end
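
The loop above can be tried out against an in-memory site instead of the network. This is only a sketch under stated assumptions: the SITE hash and the crawl method are hypothetical, and a hash lookup replaces the real fetch and extract steps. The queue, the set of finished urls and the prefix filter follow the methods documented on this page.

```ruby
require 'set'

# Hypothetical in-memory site: each url maps to the links found on it.
SITE = {
  "http://example.com/" => ["http://example.com/a", "http://example.com/b"],
  "http://example.com/a" => ["http://example.com/", "http://example.com/b"],
  "http://example.com/b" => ["http://elsewhere.org/"]
}.freeze

# Hypothetical condensed version of run_spider, process, enqueue and finish.
def crawl(root, site)
  queue = [root]
  seen = Set.new

  until queue.empty?
    url = queue.shift
    next if seen.include?(url) # already finished, skip it

    links = site.fetch(url, []) # stands in for extract(url)
    queue.push(*links.select { |link| link.start_with?(root) }.uniq)
    seen << url # stands in for finish(url)
  end

  seen.to_a
end

p crawl("http://example.com/", SITE)
```

Links back to pages that were already visited are still pushed onto the queue. They are discarded only when dequeued, which is why the loop terminates even though the pages link to each other.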