class SiteValidator::Sitemap

A sitemap has a URL and holds a collection of pages to be validated

Attributes

max_pages[RW]
url[RW]

Public Class Methods

new(url, max_pages = 100)
# File lib/site_validator/sitemap.rb, line 12
def initialize(url, max_pages = 100)
  @url       = url
  @max_pages = max_pages
end
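
For example, building a sitemap capped at 50 pages (the URL is hypothetical):

sitemap = SiteValidator::Sitemap.new("http://example.com/sitemap.xml", 50)
sitemap.url       # => "http://example.com/sitemap.xml"
sitemap.max_pages # => 50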

Public Instance Methods

errors()

Returns the combined validation errors of all the pages

# File lib/site_validator/sitemap.rb, line 25
def errors
  @errors ||= pages.map {|p| p.errors}.flatten.reject {|e| e.nil?}
end
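
The result is memoized, so the pages are only fetched and validated once; a hypothetical session:

sitemap = SiteValidator::Sitemap.new("http://example.com/sitemap.xml")
sitemap.errors      # fetches and validates the pages, returns a flat Array
sitemap.errors.size # reuses the memoized result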

get_binding()

Returns the binding, needed to render the ERB template when generating the HTML report (see site_validator/reporter.rb)

# File lib/site_validator/sitemap.rb, line 38
def get_binding
  binding
end
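
A minimal sketch of how ERB consumes such a binding; the inline template is illustrative, not the one shipped in site_validator/reporter.rb:

require 'erb'

sitemap  = SiteValidator::Sitemap.new("http://example.com/sitemap.xml")
template = ERB.new("<h1>Report for <%= url %></h1>")
template.result(sitemap.get_binding)
# => "<h1>Report for http://example.com/sitemap.xml</h1>"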

pages()

Returns the first max_pages unique URLs from the sitemap (100 by default)

# File lib/site_validator/sitemap.rb, line 19
def pages
  @pages ||= pages_in_sitemap.uniq {|p| p.url}[0..max_pages-1]
end
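
A standalone sketch of the dedup-then-truncate idiom used above, with a bare Struct standing in for SiteValidator::Page and max_pages assumed to be 2:

Page  = Struct.new(:url)
urls  = ["http://a.com/", "http://a.com/", "http://a.com/b", "http://a.com/c"]
pages = urls.map {|u| Page.new(u)}
pages.uniq {|p| p.url}[0..2-1].map(&:url)
# => ["http://a.com/", "http://a.com/b"]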

warnings()

Returns the combined validation warnings of all the pages

# File lib/site_validator/sitemap.rb, line 31
def warnings
  @warnings ||= pages.map {|p| p.warnings}.flatten.reject {|e| e.nil?}
end
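
Errors and warnings can then be summarized together (hypothetical URL):

sitemap = SiteValidator::Sitemap.new("http://example.com/sitemap.xml")
puts "#{sitemap.errors.size} errors, #{sitemap.warnings.size} warnings"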

Private Instance Methods

doc()
# File lib/site_validator/sitemap.rb, line 94
def doc
  @doc ||= scraped_doc.to_s
end

looks_like_html?(url)

Tells whether the given url looks like an HTML page; that is, it does not look like JavaScript, an image, a PDF…

# File lib/site_validator/sitemap.rb, line 74
def looks_like_html?(url)
  u         = URI.parse(URI.encode(url))
  scheme    = u.scheme                if u.scheme
  extension = u.path.split(".").last  if u.path && u.path.split(".").size > 1

  (scheme =~ /http[s]?/i) && (extension !~ /gif|jpg|jpeg|png|tiff|bmp|txt|pdf|mobi|epub|doc|rtf|xml|xls|csv|wav|mp3|ogg|zip|rar|tar|gz/i)
rescue URI::InvalidURIError
  false
end
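
Some hypothetical calls and their expected results:

looks_like_html?("http://example.com/about")     # truthy: http scheme, no file extension
looks_like_html?("http://example.com/page.html") # truthy: "html" is not in the exclusion list
looks_like_html?("http://example.com/logo.png")  # falsey: image extension
looks_like_html?("ftp://example.com/index.html") # falsey: not an http(s) scheme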

pages_in_sitemap()

Scrapes the url in search of links.

It first assumes it's an XML sitemap; if no locations are found, it falls back to scraping links from the HTML.

For HTML sources, it only keeps links that start with the sitemap root url, converts relative links to absolute ones, strips anchors from links, includes the sitemap url itself, and excludes links that don't seem to point to HTML (images, multimedia, text, javascript…)

# File lib/site_validator/sitemap.rb, line 52
def pages_in_sitemap
  pages = xml_locations.select {|loc| looks_like_html?(loc.text.strip)}.map {|loc| SiteValidator::Page.new(loc.text.strip)}

  if pages.empty?
    m     = scraped_doc
    links = [m.url]

    m.links.internal.select {|l| looks_like_html?(l)}.map {|l| l.split('#')[0]}.uniq.each do |link|
      if link[-1,1] == "/"
        links << link unless (links.include?(link) || links.include?(link.chop))
      else
        links << link unless (links.include?(link) || links.include?("#{link}/"))
      end
    end

    pages = links.map {|link| SiteValidator::Page.new(link)}
  end
  pages
end
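
A standalone trace of the anchor-stripping and trailing-slash deduplication above, seeded with a hypothetical root url:

links      = ["http://example.com/"]
candidates = ["http://example.com/a", "http://example.com/a/", "http://example.com/a#top"]

candidates.map {|l| l.split('#')[0]}.uniq.each do |link|
  if link[-1,1] == "/"
    links << link unless (links.include?(link) || links.include?(link.chop))
  else
    links << link unless (links.include?(link) || links.include?("#{link}/"))
  end
end

links # => ["http://example.com/", "http://example.com/a"]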

scraped_doc()
# File lib/site_validator/sitemap.rb, line 88
def scraped_doc
  @scraped_doc ||= MetaInspector.new(url, headers: {'User-Agent'      => SiteValidator::USER_AGENT,
                                                    'Accept-Encoding' => 'none' },
                                          faraday_options: { ssl: { verify: false } })
end

xml_locations()
# File lib/site_validator/sitemap.rb, line 84
def xml_locations
  Nokogiri::XML(doc).css('loc')
end
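
For illustration, locations extracted from a minimal, namespace-free sitemap document (real sitemaps usually also declare the sitemaps.org namespace):

require 'nokogiri'

xml = %q{<?xml version="1.0"?>
<urlset>
  <url><loc>http://example.com/</loc></url>
  <url><loc>http://example.com/about</loc></url>
</urlset>}

Nokogiri::XML(xml).css('loc').map {|loc| loc.text.strip}
# => ["http://example.com/", "http://example.com/about"]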