class SpiderInstance

Public Instance Methods

add_url_check(&block)

Add a predicate that determines whether to continue down this URL's path. All predicates must be true in order for a URL to proceed.

Takes a block that receives the URL as a string and returns a boolean. For example, this ensures that the URL starts with 'http://cashcats.biz':

add_url_check { |a_url| a_url =~ %r{^http://cashcats.biz.*} }
# File lib/spider/spider_instance.rb, line 48
def add_url_check(&block)
  @url_checks << block
end
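
As a rough illustration of the "all predicates must be true" rule, the sketch below stacks two checks and evaluates them with all?. It does not load Spider; url_checks and proceed are hypothetical stand-ins for however the library combines the stored blocks.

```ruby
# Stand-in for the list that add_url_check appends to (hypothetical).
url_checks = []

# Only follow URLs that start with http://cashcats.biz ...
url_checks << ->(a_url) { a_url.start_with?('http://cashcats.biz') }
# ... and skip links that end in .pdf.
url_checks << ->(a_url) { !a_url.end_with?('.pdf') }

# A URL proceeds only when every predicate returns a truthy value.
proceed = ->(a_url) { url_checks.all? { |check| check.call(a_url) } }

puts proceed.call('http://cashcats.biz/cats')      # true
puts proceed.call('http://cashcats.biz/menu.pdf')  # false
puts proceed.call('http://example.com/cats')       # false
```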
check_already_seen_with(cacher)

The Web is a graph; to avoid cycles we store the nodes (URLs) already visited. The Web is a really, really, really big graph; as such, this list of visited nodes grows really, really, really big.

Use this method to change the object that stores the seen nodes. The default object is an instance of Array. Spider also ships with a wrapper for memcached.

You can implement a custom class for this; any object passed to check_already_seen_with only needs to respond to << and include?.

# default
check_already_seen_with Array.new

# memcached
require 'spider/included_in_memcached'
check_already_seen_with IncludedInMemcached.new('localhost:11211')
# File lib/spider/spider_instance.rb, line 69
def check_already_seen_with(cacher)
  if cacher.respond_to?(:<<) && cacher.respond_to?(:include?)
    @seen = cacher
  else
    raise ArgumentError, 'expected something that responds to << and included?'
  end
end
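
A minimal sketch of such a custom class, assuming only the two methods checked in the source above (<< and include?). The class name SeenInSet is hypothetical; it keeps URLs in a Set so that membership tests stay fast as the list grows.

```ruby
require 'set'

# Hypothetical store for visited URLs, backed by a Set.
class SeenInSet
  def initialize
    @urls = Set.new
  end

  # Record a visited URL; returns self so calls can be chained.
  def <<(a_url)
    @urls.add(a_url)
    self
  end

  # True if the URL has been recorded before.
  def include?(a_url)
    @urls.include?(a_url)
  end
end

seen = SeenInSet.new
seen << 'http://cashcats.biz/'
puts seen.include?('http://cashcats.biz/')      # true
puts seen.include?('http://cashcats.biz/cats')  # false
```

With Spider loaded, an instance would presumably be passed as check_already_seen_with SeenInSet.new.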
clear_headers()

Reset the headers hash.

# File lib/spider/spider_instance.rb, line 160
def clear_headers
  @headers = {}
end
headers()

Use like a hash:

headers['Cookie'] = 'user_id=1;password=btrross3'
# File lib/spider/spider_instance.rb, line 148
def headers
  HeaderSetter.new(self)
end
on(code, p = nil, &block)

Add a response handler. A response handler's trigger can be :every, :success, :failure, or any HTTP status code. The handler itself can be either a Proc or a block.

The arguments to the block are: the URL as a string, an instance of Net::HTTPResponse, and the prior URL as a string.

For example:

on 404 do |a_url, resp, prior_url|
  puts "URL not found: #{a_url}"
end

on :success do |a_url, resp, prior_url|
  puts a_url
  puts resp.body
end

on :every do |a_url, resp, prior_url|
  puts "Given this code: #{resp.code}"
end
# File lib/spider/spider_instance.rb, line 123
def on(code, p = nil, &block)
  f = p ? p : block
  case code
  when Integer
    @callbacks[code] = f
  else
    @callbacks[code.to_sym] = f
  end
end
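
The sketch below illustrates the two ways of supplying a handler, a Proc or a block, and how the trigger is stored. HandlerTable is a hypothetical stand-in that copies the method body shown above so the example runs without Spider; it is not part of the library.

```ruby
# Hypothetical stand-in that mirrors the `on` method shown above.
class HandlerTable
  attr_reader :callbacks

  def initialize
    @callbacks = {}
  end

  def on(code, p = nil, &block)
    f = p ? p : block
    case code
    when Integer
      @callbacks[code] = f          # numeric status codes are kept as-is
    else
      @callbacks[code.to_sym] = f   # anything else is converted to a symbol
    end
  end
end

table = HandlerTable.new

# Handler given as a Proc, with the trigger given as a string.
log_url = proc { |a_url, resp, prior_url| puts "Fetched: #{a_url}" }
table.on('success', log_url)

# Handler given as a block, with an HTTP status code as the trigger.
table.on(404) { |a_url, resp, prior_url| puts "URL not found: #{a_url}" }

p table.callbacks.keys  # [:success, 404]
```

Because non-integer triggers go through to_sym, 'success' and :success refer to the same handler slot, and registering a second handler for the same trigger replaces the first.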
setup(p = nil, &block)

Runs before the HTTP request. The handler is given the URL as a string.

setup do |a_url|
  headers['Cookie'] = 'user_id=1;admin=true'
end
# File lib/spider/spider_instance.rb, line 137
def setup(p = nil, &block)
  @setup = p ? p : block
end
store_next_urls_with(a_store)

The Web is a really, really, really big graph; as such, this list of nodes to visit grows really, really, really big.

Use this method to change the object that stores the nodes we have yet to walk. The default object is an instance of Array. Spider also ships with a wrapper for Amazon SQS.

You can implement a custom class for this; any object passed to store_next_urls_with only needs to respond to push and pop.

# default
store_next_urls_with Array.new

# AmazonSQS
require 'spider/next_urls_in_sqs'
store_next_urls_with NextUrlsInSQS.new(AWS_ACCESS_KEY, AWS_SECRET_ACCESS_KEY, queue_name)
# File lib/spider/spider_instance.rb, line 93
def store_next_urls_with(a_store)
  tmp_next_urls = @next_urls
  @next_urls = a_store
  tmp_next_urls.each do |a_url_hash|
    @next_urls.push a_url_hash
  end
end
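
A minimal sketch of a custom store, assuming only the push and pop interface described above. The class name UniqueUrlStore is hypothetical, as is the choice to drop duplicate entries. Note that the source above calls each on the store being replaced, so each is included here in case the store is itself swapped out later.

```ruby
# Hypothetical store for URLs still to be visited; ignores duplicates.
class UniqueUrlStore
  def initialize
    @items = []
  end

  # Queue an entry unless an equal one is already waiting.
  def push(a_url_hash)
    @items.push(a_url_hash) unless @items.include?(a_url_hash)
    self
  end

  # Remove and return the most recently queued entry (nil when empty).
  def pop
    @items.pop
  end

  # Needed only if this store is later replaced via store_next_urls_with.
  def each(&block)
    @items.each(&block)
  end
end

store = UniqueUrlStore.new
store.push('http://cashcats.biz/')
store.push('http://cashcats.biz/cats')
store.push('http://cashcats.biz/cats')  # duplicate, ignored

puts store.pop          # http://cashcats.biz/cats
puts store.pop          # http://cashcats.biz/
puts store.pop.inspect  # nil
```

The entries are plain strings here for brevity; the variable name a_url_hash in the source suggests Spider queues hashes, which this store would handle the same way.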
teardown(p = nil, &block)

Runs last, once for each page. The handler is given the URL as a string.

# File lib/spider/spider_instance.rb, line 142
def teardown(p = nil, &block)
  @teardown = p ? p : block
end
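
To make the ordering concrete, here is a small sketch of setup and teardown running around a page visit. TinyCrawler and its visit method are hypothetical stand-ins, not Spider's implementation; only the setup and teardown bodies are copied from the source shown on this page, and the block passed to visit takes the place of the real HTTP request.

```ruby
# Hypothetical stand-in; only setup and teardown mirror the source above.
class TinyCrawler
  def setup(p = nil, &block)
    @setup = p ? p : block
  end

  def teardown(p = nil, &block)
    @teardown = p ? p : block
  end

  # Assumed ordering: setup, then the request, then teardown.
  def visit(a_url)
    @setup.call(a_url) if @setup
    yield a_url
    @teardown.call(a_url) if @teardown
  end
end

log = []
crawler = TinyCrawler.new
crawler.setup    { |a_url| log << "setup #{a_url}" }
crawler.teardown { |a_url| log << "teardown #{a_url}" }
crawler.visit('http://cashcats.biz/') { |a_url| log << "fetch #{a_url}" }

puts log
```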