module SQ

This module provides tools to bulk-download a set of PDF documents that are all linked from a single HTML page.

Public Class Methods

format(doc, fmt='%s.pdf', opts={})

Output a formatted filename.

@param doc [Hash] as returned from SQ.query.
@param fmt [String] format. See the project's README for more info on available format options.
@param opts [Hash] additional info. Supported keys include: :number (the current number) and :count (total files count).
@return [String]

# File lib/sq.rb, line 57
def format(doc, fmt='%s.pdf', opts={})
  opts[:number] ||= 0
  opts[:count]  ||= 0

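  # Pad to the digit count of the total. A logarithm-based width fails for a
  # count of 0 (Math.log(0, 10) is -Infinity) and is one digit short for 10, 100, ...
  padded_fmt = "%0#{opts[:count].to_s.length}d"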

  fmt.gsub(/%./) do |f|
    case f
    when '%n' then opts[:number]
    when '%N' then opts[:number]+1
    when '%z' then padded_fmt % opts[:number]
    when '%Z' then padded_fmt % (opts[:number]+1)
    when '%c' then opts[:count]
    when '%s' then doc[:name].sub(/\.pdf$/i, '')
    when '%S' then doc[:text]
    when '%_' then doc[:text].gsub(/\s+/, '_')
    when '%-' then doc[:text].gsub(/\s+/, '-')
    when '%%' then '%'
    end
  end
end
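The directive expansion described above can be illustrated with a minimal standalone sketch. It does not load the gem: `expand_name` is a hypothetical stand-in covering only a few directives, and it takes the padding width from the digit count of `count`, which is a simplification chosen for this sketch.

```ruby
# Hypothetical stand-in for SQ.format, reduced to a handful of directives.
# `doc` is assumed to be shaped like the hashes returned by SQ.query.
def expand_name(doc, fmt = '%s.pdf', number: 0, count: 0)
  # Pad to the number of digits in the total count, e.g. 3 for count = 120.
  width = count.to_s.length

  fmt.gsub(/%./) do |directive|
    case directive
    when '%n' then number.to_s
    when '%N' then (number + 1).to_s
    when '%Z' then (number + 1).to_s.rjust(width, '0')
    when '%c' then count.to_s
    when '%s' then doc[:name].sub(/\.pdf$/i, '')
    when '%_' then doc[:text].gsub(/\s+/, '_')
    when '%%' then '%'
    else directive # sketch-only choice: leave unknown directives untouched
    end
  end
end

doc = { uri: 'http://example.com/files/intro.pdf', name: 'intro.pdf', text: 'Course intro' }

puts expand_name(doc)                                      # intro.pdf
puts expand_name(doc, '%Z-%_.pdf', number: 4, count: 120)  # 005-Course_intro.pdf
```

Because `:number` is zero-based, `%N` and `%Z` are the directives to use when the first file should be numbered 1.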
process(uri, regex=/./, opts={})

Query a URI and download all PDFs which match the regex.

@param uri [String]
@param regex [Regexp] Regex to use to match PDF URIs.
@param opts [Hash] Supported options: :verbose, :directory (the directory to use for output instead of the current one), and :format (the output format). See the README for details.
@return [Integer] number of downloaded PDFs.

# File lib/sq.rb, line 87
def process(uri, regex=/./, opts={})
  uris = self.query(uri, regex)
  count = uris.count

  puts "Found #{count} PDFs:" if opts[:verbose]

  return 0 if uris.empty?

  out = File.expand_path(opts[:directory] || '.')
  fmt = opts[:format] || '%s.pdf'

  unless Dir.exist?(out) # Dir.exists? was removed in Ruby 3.2
    puts "-> mkdir #{out}" if opts[:verbose]
    FileUtils.mkdir_p(out)
  end

  p = ProgressBar.create(:title => "PDFs", :total => count)
  i = 0

  uris.each do |u|
    name = format(u, fmt, {:number => i, :count => count})
    i += 1
    File.open("#{out}/#{name}", 'wb') do |f|
      # Kernel#open no longer handles URIs since Ruby 3.0; open-uri provides URI.open
      URI.open(u[:uri], 'rb') do |resp|
        f.write(resp.read)
        p.log name if opts[:verbose]
        p.increment
      end
    end
  end.count
end
query(uri, regex=/./)

Query a URI and return a list of PDFs. Each PDF is a hash with three keys: :uri is its absolute URI, :name is its name (last part of its URI), and :text is its link text.

@param uri [String]
@param regex [Regexp]
@return [Array<Hash>]

# File lib/sq.rb, line 24
def query(uri, regex=/./)
  uri = 'http://' + uri unless uri =~ /^https?:\/\//

  # URI.open comes from open-uri; Kernel#open no longer handles URIs since Ruby 3.0
  doc = Nokogiri::HTML(URI.open(uri, 'User-Agent' => user_agent))
  links = doc.css('a[href]')

  uris = links.map do |a|
    full = begin
             URI.join(uri, a.attr('href'))
           rescue
             nil
           end

    [a.text, full]
  end
  uris.select! { |_,u| u && u.path =~ /\.pdf$/i && u.to_s =~ regex }

  uris.map do |text,u|
    {
      :uri => u.to_s,
      :name => u.path.split('/').last,
      :text => text
    }
  end
end
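The link resolution and filtering performed above can be sketched without Nokogiri or network access. `pdf_links` is a hypothetical helper, not part of SQ: it takes `[text, href]` pairs in place of the parsed `<a href>` elements and uses only the standard `uri` library.

```ruby
require 'uri'

# Hypothetical helper: resolve each href against `base`, keep only links whose
# path ends in .pdf and whose full URI matches `regex`, and build the same
# three-key hashes that the documentation above describes.
def pdf_links(base, anchors, regex = /./)
  anchors.each_with_object([]) do |(text, href), found|
    full = begin
      URI.join(base, href)
    rescue URI::Error
      nil # unparsable hrefs are skipped
    end
    next if full.nil?
    # The extension test uses the path only, so a query string does not hide it.
    next unless full.path.to_s.match?(/\.pdf$/i) && full.to_s.match?(regex)

    found << { uri: full.to_s, name: full.path.split('/').last, text: text }
  end
end

anchors = [
  ['Week 1 slides', 'slides/week1.pdf'],
  ['Lecture notes', '/static/Notes.PDF'],
  ['Home',          'index.html'],
  ['Paper',         'http://other.example.org/a/b/paper.pdf?dl=1']
]

pdf_links('http://example.com/courses/', anchors).each do |pdf|
  puts "#{pdf[:name]} <- #{pdf[:uri]}"
end

# Restrict the result with a regex on the absolute URI.
puts pdf_links('http://example.com/courses/', anchors, /week/).map { |pdf| pdf[:name] }.inspect
```

Relative links are resolved against the page URI, so the returned `:uri` values are always absolute, while `:name` is derived from the path alone.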
user_agent()

@return [String] the user-agent used by SQ

# File lib/sq.rb, line 14
def user_agent
  "SQ/#{version} +github.com/bfontaine/sq"
end
version()

@return [String] current gem's version

# File lib/version.rb, line 6
def version
  '0.1.4'
end