class WaybackArchiver::Sitemap

Parses sitemaps that follow the Sitemaps protocol (www.sitemaps.org).

Attributes

document[R]

Public Class Methods

new(xml, strict: false)
# File lib/wayback_archiver/sitemap.rb, line 8
def initialize(xml, strict: false)
  @document = REXML::Document.new(xml)
rescue REXML::ParseException => _e
  raise if strict

  @document = REXML::Document.new('')
end
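
The rescue logic above can be sketched in isolation. This is a minimal illustration, not the library's code: `parse_sitemap` is a hypothetical helper that mirrors the constructor, and the malformed input string is made up for the example.

```ruby
require 'rexml/document'

# Hypothetical helper mirroring the constructor's rescue logic.
def parse_sitemap(xml, strict: false)
  REXML::Document.new(xml)
rescue REXML::ParseException
  # In strict mode the parse error is re-raised to the caller.
  raise if strict

  # Otherwise fall back to an empty document.
  REXML::Document.new('')
end

broken = '<urlset><url></urlset>' # mismatched tags (assumed sample input)

# Lenient mode: the error is swallowed and the document has no root.
lenient_root = parse_sitemap(broken).root

# Strict mode: the parse error propagates.
strict_error = begin
  parse_sitemap(broken, strict: true)
  nil
rescue REXML::ParseException => e
  e
end
```

With `strict: false` (the default) a caller cannot tell malformed XML from an empty sitemap, so `strict: true` is probably the better choice when parse failures need to be reported.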

Public Instance Methods

plain_document?()

Check if the sitemap is a plain file.
@return [Boolean] whether the document is plain

# File lib/wayback_archiver/sitemap.rb, line 36
def plain_document?
  document.elements.empty?
end
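
As a rough illustration of what "plain" means here, the check comes down to whether REXML found any element children. The sample documents below are assumptions chosen for the example.

```ruby
require 'rexml/document'

empty_doc = REXML::Document.new('')
xml_doc = REXML::Document.new('<urlset></urlset>')

# No element children at all, so the document counts as plain.
empty_is_plain = empty_doc.elements.empty?

# A root element is present, so the document is not plain.
xml_is_plain = xml_doc.elements.empty?
```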
root_name()

Return the name of the document root (if there is one).
@return [String, nil] the document root name, or nil if the document has no root

# File lib/wayback_archiver/sitemap.rb, line 42
def root_name
  return unless document.root

  document.root.name
end
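
A small sketch of the values `root_name` would be expected to produce, using REXML directly. The sample XML strings are assumptions, and the safe-navigation operator stands in for the explicit guard above.

```ruby
require 'rexml/document'

urlset = REXML::Document.new('<urlset><url><loc>https://example.com/</loc></url></urlset>')
index = REXML::Document.new('<sitemapindex><sitemap><loc>https://example.com/sitemap1.xml</loc></sitemap></sitemapindex>')
empty = REXML::Document.new('')

# root is nil for the empty document, so &. yields nil instead of raising.
names = [urlset, index, empty].map { |doc| doc.root&.name }
# names => ["urlset", "sitemapindex", nil]
```

These root names are what `sitemap_index?` and `urlset?` compare against.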
sitemap_index?()

Returns true if the Sitemap is a Sitemap index.
@return [Boolean] whether the Sitemap is a Sitemap index
@example Check if Sitemap is a Sitemap index

sitemap = Sitemap.new(xml)
sitemap.sitemap_index?
# File lib/wayback_archiver/sitemap.rb, line 53
def sitemap_index?
  root_name == 'sitemapindex'
end
sitemaps()

Return all sitemap URLs defined in Sitemap.
@return [Array<String>] Sitemap URLs defined in Sitemap
@example Get Sitemap URLs defined in Sitemap

sitemap = Sitemap.new(xml)
sitemap.sitemaps
# File lib/wayback_archiver/sitemap.rb, line 30
def sitemaps
  @sitemaps ||= extract_urls('sitemap')
end
urls()

Return all URLs defined in Sitemap.
@return [Array<String>] URLs defined in Sitemap
@example Get URLs defined in Sitemap

sitemap = Sitemap.new(xml)
sitemap.urls
# File lib/wayback_archiver/sitemap.rb, line 21
def urls
  @urls ||= extract_urls('url')
end
urlset?()

Returns true if the Sitemap lists regular URLs.
@return [Boolean] whether the Sitemap is a regular URL list
@example Check if Sitemap is a regular URL list

sitemap = Sitemap.new(xml)
sitemap.urlset?
# File lib/wayback_archiver/sitemap.rb, line 62
def urlset?
  root_name == 'urlset'
end

Private Instance Methods

extract_urls(node_name)

Extract URLs from Sitemap

# File lib/wayback_archiver/sitemap.rb, line 69
def extract_urls(node_name)
  return document.to_s.each_line.map(&:strip) if plain_document?

  urls = []
  document.root.elements.each("#{node_name}/loc") do |element|
    urls << element.text
  end
  urls
end
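
The XPath step in `extract_urls` can be tried on its own. This is a sketch under assumptions: the sample sitemap is invented and omits the XML namespace declaration that real sitemaps usually carry.

```ruby
require 'rexml/document'

# Assumed sample input, without a namespace declaration.
xml = <<~XML
  <urlset>
    <url><loc>https://example.com/</loc></url>
    <url><loc>https://example.com/about</loc></url>
  </urlset>
XML

document = REXML::Document.new(xml)

# Collect the text of every <loc> nested directly under a <url> child of the root.
locs = []
document.root.elements.each('url/loc') { |element| locs << element.text }
# locs => ["https://example.com/", "https://example.com/about"]
```

Passing 'sitemap' instead of 'url' as the node name would select the `<loc>` entries of a sitemap index in the same way.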