class Hermaeus::Client
Public: Wraps a reddit client for access to reddit's API, and provides methods for downloading posts from reddit.
Constants
- USER_AGENT
Public Class Methods
Public: Connects the Hermaeus::Client to reddit.
# File lib/hermaeus/client.rb, line 18
def initialize
  Config.validate!
  cfg = Config.info[:client]
  @client = Redd.it(cfg.delete(:type).to_sym, *cfg.values, user_agent: USER_AGENT)
  @client.authorize!
  @html_filter = HTMLEntities.new
end
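A minimal usage sketch, assuming the hermaeus gem is installed and its configuration file already carries valid reddit credentials (the :client section exposed by Config.info):

  require "hermaeus"

  client = Hermaeus::Client.new # validates the config and authorizes with reddit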
Public Instance Methods
Public: Transforms a list of raw reddit links (“/r/SUB/comments/ID/NAME”) into their reddit fullname (“t3_ID”).
data - A String Array such as that returned by get_global_listing.
Optional parameters:
regex: A Regular Expression used to match the reddit ID out of a link.
Returns a String Array containing the reddit fullnames harvested from the input list. Input elements that do not match are stripped.
# File lib/hermaeus/client.rb, line 65
def get_fullnames data, **opts
  # TODO: Move this regex to the configuration file.
  regex = opts[:regex] || %r(/r/.+/(comments/)?(?<id>[0-9a-z]+)/.+)
  data.map do |item|
    m = item.match regex
    "t3_#{m[:id]}" if m
  end
  .reject { |item| item.nil? }
end
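For example, given a hypothetical pair of scraped links (client is a connected Hermaeus::Client):

  links = [
    "/r/teslore/comments/56j7pq/a_post_title/", # hypothetical post link
    "/r/teslore/about/",                        # no post ID; stripped from output
  ]
  client.get_fullnames links
  # => ["t3_56j7pq"]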
Public: Scrapes the Compilation full index.
Wraps Client#scrape_index; see it for documentation.
# File lib/hermaeus/client.rb, line 29
def get_global_listing **opts
  scrape_index Config.info[:index][:path], opts
end
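In practice this is the first step of the scrape-transform-download pipeline; a sketch, assuming a connected client:

  links = client.get_global_listing      # link list from the configured index path
  fullnames = client.get_fullnames links # ready for get_posts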
Public: Collects posts from reddit.
fullnames - A String Array of reddit fullnames (“tNUM_ID”, following reddit documentation) to query.
Yields a sequence of Hashes, each describing a reddit post.
Returns an Array of the response bodies from the reddit call(s).
Examples

  get_posts get_fullnames get_global_listing do |post|
    puts post[:selftext] # Prints the Markdown source of each post
  end
  # => returns an array of hashes, each of which includes an array of posts
# File lib/hermaeus/client.rb, line 90
def get_posts fullnames, &block
  ret = []
  # reddit has finite limits on acceptable query sizes. Split the list into
  # manageable portions.
  fullnames.each_slice(100).each do |chunk|
    # Assemble the list of reddit objects being queried
    query = "/by_id/#{chunk.join(",")}.json"
    response = scrape_posts query, &block
    ret << response.body
  end
  ret
end
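The each_slice(100) chunking exists because reddit's /by_id endpoint only accepts a bounded number of fullnames per request; a standalone illustration of how the query paths are assembled (the fullnames here are fabricated):

  fullnames = (1..250).map { |n| "t3_#{n.to_s(36)}" }
  fullnames.each_slice(100).map { |chunk| "/by_id/#{chunk.join(",")}.json" }
  # => three query paths covering 100, 100, and 50 fullnames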
Public: Scrapes a Weekly Community Thread patch index.
ids - A String Array of reddit post IDs for Weekly Community Threads.
Examples:

  get_weekly_listing ["56j7pq"]                         # Targets one Community Thread
  get_weekly_listing ["56j7pq", "55erkr"]               # Targets two Community Threads
  get_weekly_listing ["55erkr"], css: "td:last-child a" # Custom CSS selector
Wraps Client#scrape_index; see it for documentation.
# File lib/hermaeus/client.rb, line 44
def get_weekly_listing ids, **opts
  # Normalize bare post IDs into fullnames, leaving existing fullnames intact.
  ids.map! do |id|
    id.match(/^t3_/) ? id : "t3_#{id}"
  end
  # TODO: Ensure that this is safe (only query <= 100 IDs at a time), and
  # call the scraper multiple times and reassemble output if necessary.
  query = "/by_id/#{ids.join(",")}"
  scrape_index query, opts
end
Private Instance Methods
Internal: Governs the actual functionality of the index scrapers.
path - The reddit API endpoint or path being queried. It can be a post ID/fullname or a full URI.
Optional parameters:
css: The CSS selector string used to get the links referenced on the page.
Returns an Array of all the referenced links. These links will need to be broken down into reddit fullnames before Hermaeus can download them.
# File lib/hermaeus/client.rb, line 116
def scrape_index path, **opts
  # This is a magic string that targets the index format /r/teslore uses to
  # enumerate their Compendium, in the wiki page and weekly patch posts.
  query = opts[:css] || "td:first-child a"
  # Reddit will respond with an HTML dump, if we are querying a wiki page,
  # or a wrapped HTML dump, if we are querying a post.
  fetch = @client.get(path).body
  # Set fetch to be an array of hashes which have the desired text as a
  # direct child.
  if fetch[:kind] == "wikipage"
    fetch = [fetch[:data]]
  elsif fetch[:kind] == "Listing"
    fetch = fetch[:data][:children].map { |c| c[:data] }
  end
  # reddit will put the text data in :content_html if we queried a wikipage,
  # or :selftext_html if we queried a post. The two keys are mutually
  # exclusive, so this simply looks for both and remaps fetch items to point
  # to the actual data.
  [:content_html, :selftext_html].each do |k|
    fetch.map! do |item|
      if item.respond_to?(:has_key?) && item.has_key?(k)
        item[k]
      else
        item
      end
    end
  end
  # Ruby doesn't like having comments between each successive map block.
  # This sequence performs the following transformations on each entry in
  # the fetched list:
  # 1. Unescape the HTML text.
  # 2. Process the HTML text into data structures.
  # 3. Run CSS queries on the data structures to find the links sought.
  # 4. Unwrap the link elements to get the URI at which they point.
  # 5. In the event that multiple pages were queried to get data, the array
  #    that each of those queries returns is flattened so that this method
  #    only returns one single array of link URIs.
  fetch.map do |item|
    @html_filter.decode(item)
  end
  .map do |item|
    Nokogiri::HTML(item)
  end
  .map do |item|
    item.css(query).map do |link|
      link.attributes["href"].value
    end
  end
  .flatten
end
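Steps 1 through 4 of that pipeline can be exercised in isolation; a minimal sketch with an inline escaped-HTML fragment standing in for reddit's :content_html payload:

  require "htmlentities"
  require "nokogiri"

  escaped = "&lt;table&gt;&lt;tr&gt;&lt;td&gt;" \
    "&lt;a href=&quot;/r/teslore/comments/56j7pq/a_post_title/&quot;&gt;A Post&lt;/a&gt;" \
    "&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;"

  html = HTMLEntities.new.decode(escaped)   # 1. unescape the HTML text
  doc  = Nokogiri::HTML(html)               # 2. parse it into a document
  doc.css("td:first-child a").map do |a|    # 3. run the CSS query
    a.attributes["href"].value              # 4. unwrap the href URI
  end
  # => ["/r/teslore/comments/56j7pq/a_post_title/"]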
Internal: Provides the actual functionality for collecting posts.
query - The reddit API endpoint or path being queried.
opts - Options for the reddit API call.
block - This method yields each post fetched to its block.
tries - Hidden parameter used to prevent infinite stalling on rate limits.
Returns reddit's response to the query.
# File lib/hermaeus/client.rb, line 175
def scrape_posts query, tries = 0, **opts, &block
  begin
    # Ask reddit to procure our items
    response = @client.get(query, opts)
    if response.success?
      payload = response.body
      # The payload should be a Listing even for a single-item query; the
      # :children array will just have one element.
      if payload[:kind] == "Listing"
        payload[:data][:children].each do |item|
          yield item[:data]
        end
      end
      return response
    end
  # If at first you don't succeed...
  rescue Redd::Error::RateLimited => e
    sleep e.time + 1
    # Try try again, forwarding the options and block so retries behave
    # identically to the original call.
    if tries < 3
      scrape_posts query, tries + 1, **opts, &block
    else
      raise RuntimeError, "reddit rate limit will not unlock"
    end
  end
end
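Two design points worth noting in this retry loop: the sleep honors the wait time reddit itself reports (e.time) plus one second of slack, and the recursion is capped at three retries so a persistently locked rate limit surfaces as an error instead of stalling Hermaeus forever.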