module WaybackArchiver

WaybackArchiver sends URLs to the Wayback Machine. URLs are collected by crawling a site, by reading its Sitemap, or from a list of URLs passed in.

Constants

DEFAULT_CONCURRENCY

Default concurrency for archiving URLs

DEFAULT_MAX_LIMIT

Maximum number of links posted (-1 is no limit)

DEFAULT_RESPECT_ROBOTS_TXT

Default for whether to respect robots.txt files

Link to gem on rubygems.org, part of the sent User-Agent

Response

Response data struct

USER_AGENT

WaybackArchiver User-Agent

VERSION

Gem version

Public Class Methods

adapter()

Returns the configured adapter
@return [Object, #call] the configured or the default adapter

# File lib/wayback_archiver.rb, line 230
def self.adapter
  @adapter ||= WaybackMachine
end
adapter=(adapter)

Sets the adapter
@return [Object, #call] the configured adapter
@param [Object, #call] adapter the adapter to use; it must respond to call

# File lib/wayback_archiver.rb, line 220
def self.adapter=(adapter)
  unless adapter.respond_to?(:call)
    raise(ArgumentError, 'adapter must implement #call')
  end

  @adapter = adapter
end
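Any object that responds to `call` passes the check in the setter above, so a lambda is enough for a custom adapter. The sketch below reproduces that check in a stand-in module (`AdapterHolder`, a hypothetical name used only so the example runs without the gem). The single `url` argument given to the lambda is an assumption, since this page does not show how the adapter is invoked.

```ruby
# Stand-in that mirrors the adapter= check shown above; not part of the gem.
module AdapterHolder
  def self.adapter=(adapter)
    unless adapter.respond_to?(:call)
      raise(ArgumentError, 'adapter must implement #call')
    end

    @adapter = adapter
  end

  def self.adapter
    @adapter
  end
end

# A lambda responds to #call, so it is accepted.
# Assumption: the adapter is called with one URL argument.
dry_run = ->(url) { "would archive #{url}" }
AdapterHolder.adapter = dry_run
puts AdapterHolder.adapter.call('example.com')

# A plain string does not respond to #call and is rejected.
begin
  AdapterHolder.adapter = 'not callable'
rescue ArgumentError => e
  puts e.message
end
```

With the gem loaded, the equivalent assignment would be `WaybackArchiver.adapter = dry_run`.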
archive(source, legacy_strategy = nil, strategy: :auto, hosts: [], concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block)

Send URLs to the Wayback Machine.
@return [Array<ArchiveResult>] of URLs sent to the Wayback Machine.
@param [String/Array<String>] source for URL(s).
@param [String/Symbol] strategy of source. Supported strategies: crawl, sitemap, url, urls, auto.
@param [Array<String, Regexp>] hosts to crawl.
@example Archive example.com with the default auto strategy (Sitemap lookup, falling back to crawling)

WaybackArchiver.archive('example.com') # Default strategy is :auto
WaybackArchiver.archive('example.com', strategy: :auto)
WaybackArchiver.archive('example.com', strategy: :auto, concurrency: 10)
WaybackArchiver.archive('example.com', strategy: :auto, limit: 100) # send max 100 URLs
WaybackArchiver.archive('example.com', :auto)

@example Crawl example.com and send all URLs of the same domain

WaybackArchiver.archive('example.com', strategy: :crawl)
WaybackArchiver.archive('example.com', strategy: :crawl, concurrency: 10)
WaybackArchiver.archive('example.com', strategy: :crawl, limit: 100) # send max 100 URLs
WaybackArchiver.archive('example.com', :crawl)

@example Send example.com Sitemap URLs

WaybackArchiver.archive('example.com', strategy: :sitemap)
WaybackArchiver.archive('example.com', strategy: :sitemap, concurrency: 10)
WaybackArchiver.archive('example.com', strategy: :sitemap, limit: 100) # send max 100 URLs
WaybackArchiver.archive('example.com', :sitemap)

@example Send only example.com

WaybackArchiver.archive('example.com', strategy: :url)
WaybackArchiver.archive('example.com', strategy: :url, concurrency: 10)
WaybackArchiver.archive('example.com', strategy: :url, limit: 100) # send max 100 URLs
WaybackArchiver.archive('example.com', :url)

@example Crawl multiple hosts

WaybackArchiver.archive(
  'http://example.com',
  hosts: [
    'example.com',
    /host[\d]+\.example\.com/
  ]
)
# File lib/wayback_archiver.rb, line 57
def self.archive(source, legacy_strategy = nil, strategy: :auto, hosts: [], concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block)
  strategy = legacy_strategy || strategy

  case strategy.to_s
  when 'crawl'   then crawl(source, concurrency: concurrency, limit: limit, hosts: hosts, &block)
  when 'auto'    then auto(source, concurrency: concurrency, limit: limit, &block)
  when 'sitemap' then sitemap(source, concurrency: concurrency, limit: limit, &block)
  when 'urls'    then urls(source, concurrency: concurrency, limit: limit, &block)
  when 'url'     then urls(source, concurrency: concurrency, limit: limit, &block)
  else
    raise ArgumentError, "Unknown strategy: '#{strategy}'. Allowed strategies: sitemap, urls, url, crawl"
  end
end
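The first lines of `archive` decide which strategy runs: a positional `legacy_strategy`, when given, overrides the `strategy:` keyword, and `to_s` lets symbols and strings be used interchangeably. A minimal sketch of just that resolution step follows; `resolve_strategy` and `SUPPORTED_STRATEGIES` are hypothetical names for illustration and do not exist in the gem.

```ruby
# Hypothetical helper mirroring the strategy selection at the top of archive.
SUPPORTED_STRATEGIES = %w[crawl auto sitemap urls url].freeze

def resolve_strategy(legacy_strategy = nil, strategy: :auto)
  # The positional argument wins over the keyword when both are present.
  chosen = (legacy_strategy || strategy).to_s
  raise ArgumentError, "Unknown strategy: '#{chosen}'" unless SUPPORTED_STRATEGIES.include?(chosen)

  chosen
end

puts resolve_strategy                             # auto
puts resolve_strategy(strategy: 'sitemap')        # sitemap
puts resolve_strategy(:crawl, strategy: :sitemap) # crawl
```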
auto(source, concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block)

Look for Sitemap(s) and, if none are found, fall back to crawling. Then send the found URLs to the Wayback Machine.
@return [Array<ArchiveResult>] of URLs sent to the Wayback Machine.
@param [String] source (must be a valid URL).
@param concurrency [Integer]
@example Auto archive example.com

WaybackArchiver.auto('example.com') # Default concurrency is 1

@example Auto archive example.com with low concurrency

WaybackArchiver.auto('example.com', concurrency: 1)

@example Auto archive example.com and archive max 100 URLs

WaybackArchiver.auto('example.com', limit: 100)

@see www.sitemaps.org

# File lib/wayback_archiver.rb, line 83
def self.auto(source, concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block)
  urls = Sitemapper.autodiscover(source)
  return urls(urls, concurrency: concurrency, &block) if urls.any?

  crawl(source, concurrency: concurrency, &block)
end
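The control flow of `auto` is a simple fallback: use the Sitemap URLs when any are found, otherwise crawl. The sketch below isolates that decision with both collaborators injected as callables so it runs without network access. `collect_urls`, `discover` and `crawl` are hypothetical names, not part of the gem.

```ruby
# Hypothetical stand-in for the fallback in auto.
# Returns which path was taken together with the URLs it produced.
def collect_urls(source, discover:, crawl:)
  found = discover.call(source)
  return [:sitemap, found] if found.any?

  [:crawl, crawl.call(source)]
end

# Fake collaborators: one site with a Sitemap, one without.
with_sitemap    = ->(_source) { ['http://example.com/', 'http://example.com/about'] }
without_sitemap = ->(_source) { [] }
crawler         = ->(source) { ["http://#{source}/"] }

p collect_urls('example.com', discover: with_sitemap, crawl: crawler)
p collect_urls('example.com', discover: without_sitemap, crawl: crawler)
```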
concurrency()

Returns the default concurrency
@return [Integer] the configured or the default concurrency

# File lib/wayback_archiver.rb, line 200
def self.concurrency
  @concurrency ||= DEFAULT_CONCURRENCY
end
concurrency=(concurrency)

Sets the default concurrency
@return [Integer] the desired default concurrency
@param [Integer] concurrency the desired default concurrency

# File lib/wayback_archiver.rb, line 194
def self.concurrency=(concurrency)
  @concurrency = concurrency
end
crawl(url, hosts: [], concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block)

Crawl site for URLs to send to the Wayback Machine.
@return [Array<ArchiveResult>] of URLs sent to the Wayback Machine.
@param [String] url to start crawling from.
@param [Array<String, Regexp>] hosts to crawl
@param concurrency [Integer]
@example Crawl example.com and send all URLs of the same domain

WaybackArchiver.crawl('example.com') # Default concurrency is 1

@example Crawl example.com and send all URLs of the same domain with low concurrency

WaybackArchiver.crawl('example.com', concurrency: 1)

@example Crawl example.com and archive max 100 URLs

WaybackArchiver.crawl('example.com', limit: 100)

@example Crawl multiple hosts

WaybackArchiver.crawl(
  'http://example.com',
  hosts: [
    'example.com',
    /host[\d]+\.example\.com/
  ]
)
# File lib/wayback_archiver.rb, line 109
def self.crawl(url, hosts: [], concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block)
  WaybackArchiver.logger.info "Crawling #{url}"
  Archive.crawl(url, hosts: hosts, concurrency: concurrency, limit: limit, &block)
end
default_logger!()

Resets the logger to the default
@return [NullLogger] a new instance of NullLogger

# File lib/wayback_archiver.rb, line 161
def self.default_logger!
  @logger = NullLogger.new
end
logger()

Returns the current logger
@return [Object] the current logger instance

# File lib/wayback_archiver.rb, line 155
def self.logger
  @logger ||= NullLogger.new
end
logger=(logger)

Set logger
@return [Object] the set logger
@param [Object] logger an object that quacks like a Logger
@example Set a logger that prints to standard out (STDOUT)

WaybackArchiver.logger = Logger.new(STDOUT)
# File lib/wayback_archiver.rb, line 149
def self.logger=(logger)
  @logger = logger
end
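Since the setter stores whatever it is given, any object with the usual Logger methods will do. A minimal sketch using only Ruby's standard library follows. It writes to a `StringIO` so the example is self-contained, and the level and formatter settings are illustrative choices rather than anything the gem requires.

```ruby
require 'logger'
require 'stringio'

# A StringIO target keeps the example self-contained; use $stdout to print.
buffer = StringIO.new
logger = Logger.new(buffer)
logger.level = Logger::INFO
# Compact one-line format: severity followed by the message.
logger.formatter = ->(severity, _time, _progname, msg) { "#{severity}: #{msg}\n" }

logger.info 'Crawling example.com'
logger.debug 'hidden at INFO level'

puts buffer.string
```

With the gem loaded, `WaybackArchiver.logger = logger` would route messages such as the "Crawling ..." line emitted by `crawl` to this logger.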
max_limit()

Returns the default max_limit
@return [Integer] the configured or the default max_limit

# File lib/wayback_archiver.rb, line 213
def self.max_limit
  @max_limit ||= DEFAULT_MAX_LIMIT
end
max_limit=(max_limit)

Sets the default max_limit
@return [Integer] the desired default max_limit
@param [Integer] max_limit the desired default max_limit

# File lib/wayback_archiver.rb, line 207
def self.max_limit=(max_limit)
  @max_limit = max_limit
end
respect_robots_txt()

Returns the default respect_robots_txt
@return [Boolean] the configured or the default respect_robots_txt

# File lib/wayback_archiver.rb, line 187
def self.respect_robots_txt
  @respect_robots_txt ||= DEFAULT_RESPECT_ROBOTS_TXT
end
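The getter above uses `||=`, which treats a stored `false` the same as an unset value. The sketch below shows the consequence with a hypothetical `Settings` module and an assumed default of `true`. This page does not show the real value of `DEFAULT_RESPECT_ROBOTS_TXT`; if it is `false`, the behavior below does not arise.

```ruby
# Hypothetical module for illustration; not part of the gem.
module Settings
  DEFAULT_FLAG = true # assumed value, for illustration only

  class << self
    attr_writer :flag

    # Same shape as the getter above: a stored false falls through to the default.
    def flag
      @flag ||= DEFAULT_FLAG
    end

    # Alternative that keeps an explicit false by checking for nil instead.
    def strict_flag
      @flag.nil? ? DEFAULT_FLAG : @flag
    end
  end
end

Settings.flag = false
p Settings.strict_flag # false
p Settings.flag        # true
```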
respect_robots_txt=(respect_robots_txt)

Sets the default respect_robots_txt
@return [Boolean] the desired default for respect_robots_txt
@param [Boolean] respect_robots_txt the desired default

# File lib/wayback_archiver.rb, line 181
def self.respect_robots_txt=(respect_robots_txt)
  @respect_robots_txt = respect_robots_txt
end
sitemap(url, concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block)

Get URLs from sitemap and send found URLs to the Wayback Machine.
@return [Array<ArchiveResult>] of URLs sent to the Wayback Machine.
@param [String] url to the sitemap.
@param concurrency [Integer]
@example Get example.com sitemap and archive all found URLs

WaybackArchiver.sitemap('example.com/sitemap.xml') # Default concurrency is 1

@example Get example.com sitemap and archive all found URLs with low concurrency

WaybackArchiver.sitemap('example.com/sitemap.xml', concurrency: 1)

@example Get example.com sitemap and archive max 100 URLs

WaybackArchiver.sitemap('example.com/sitemap.xml', limit: 100)

@see www.sitemaps.org

# File lib/wayback_archiver.rb, line 125
def self.sitemap(url, concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block)
  WaybackArchiver.logger.info "Fetching Sitemap"
  Archive.post(URLCollector.sitemap(url), concurrency: concurrency, limit: limit, &block)
end
urls(urls, concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block)

Send URL(s) to the Wayback Machine.
@return [Array<ArchiveResult>] of URLs sent to the Wayback Machine.
@param [Array<String>/String] urls or url.
@param concurrency [Integer]
@example Archive example.com

WaybackArchiver.urls('example.com')

@example Archive example.com and google.com

WaybackArchiver.urls(%w(example.com google.com))

@example Archive example.com, max 100 URLs

WaybackArchiver.urls(%w(example.com www.example.com), limit: 100)
# File lib/wayback_archiver.rb, line 140
def self.urls(urls, concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block)
  Archive.post(Array(urls), concurrency: concurrency, &block)
end
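The `Array(urls)` call in the source above is what lets `urls` accept either a single String or an Array of Strings. A short sketch of how `Kernel#Array` normalizes its argument:

```ruby
# Kernel#Array wraps a single value, leaves arrays untouched,
# and turns nil into an empty array.
single = Array('example.com')
many   = Array(%w[example.com google.com])
none   = Array(nil)

p single # ["example.com"]
p many   # ["example.com", "google.com"]
p none   # []
```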
user_agent()

Returns the configured user agent
@return [String] the configured or the default user agent

# File lib/wayback_archiver.rb, line 174
def self.user_agent
  @user_agent ||= USER_AGENT
end
user_agent=(user_agent)

Sets the user agent
@return [String] the configured user agent
@param [String] user_agent the desired user agent

# File lib/wayback_archiver.rb, line 168
def self.user_agent=(user_agent)
  @user_agent = user_agent
end