class NextUrlsInSQS

A specialized class using AmazonSQS to track nodes to walk. It supports two operations: push and pop . Together these can be used to add items to the queue, then pull items off the queue.

This is useful if you want multiple Spider processes crawling the same data set.

To use it with Spider use the store_next_urls_with method:

Spider.start_at('http://example.com/') do |s|
  s.store_next_urls_with NextUrlsInSQS.new(AWS_ACCESS_KEY, AWS_SECRET_ACCESS_KEY, queue_name)
end

Public Class Methods

new(aws_access_key, aws_secret_access_key, queue_name = 'ruby-spider') click to toggle source

Construct a new NextUrlsInSQS instance. All arguments here are passed to RightAWS::SqsGen2 (part of the right_aws gem) or used to set the AmazonSQS queue name (optional).

# File lib/spider/next_urls_in_sqs.rb, line 23
def initialize(aws_access_key, aws_secret_access_key, queue_name = 'ruby-spider')
  @sqs = RightAws::SqsGen2.new(aws_access_key, aws_secret_access_key)
  @queue = @sqs.queue(queue_name)
end

Public Instance Methods

pop() click to toggle source

Pull an item off the queue, loop until data is found. Data is encoded with YAML.

# File lib/spider/next_urls_in_sqs.rb, line 30
def pop
  while true
    message = @queue.pop
    return YAML::load(message.to_s) unless message.nil?
    sleep 5
  end
end
push(a_msg) click to toggle source

Put data on the queue. Data is encoded with YAML.

# File lib/spider/next_urls_in_sqs.rb, line 39
def push(a_msg)
  encoded_message = YAML::dump(a_msg)
  @queue.push(a_msg)
end