class Primal::InputTermExtraction
The InputTermExtraction class abstracts access to AlchemyAPI (www.alchemyapi.com/) to extract important information from Web pages, news articles, blog posts, and plain text.
The main function is getPrimalRequest, which accepts a string that represents a Web page URL or some text, and builds a Primal topic URI that directs the user to the Primal Web App.
Public Class Methods
Constructor for the InputTermExtraction class.
Pass in the API key for Alchemy services. You can register for a free API key to test your application here:
http://www.alchemyapi.com/api/register.html
Please read the AlchemyAPI license.
# File lib/primal/AlchemyAPIWrapper.rb, line 58
def initialize(alchemyApiKey)
  @alchemyApiKey = alchemyApiKey
end
Public Instance Methods
Uses the category, entities, and keywords extracted by Alchemy to build a valid Primal topic URI.
Returns nil if any of the three extractions failed.
# File lib/primal/AlchemyAPIWrapper.rb, line 163
def buildPrimalRequest(categoryJSON, entitiesJSON, keywordsJSON)
  # Check if any of the extractions failed
  if !categoryJSON or !entitiesJSON or !keywordsJSON
    $stderr.puts "Cannot build Primal request. Alchemy failed to extract information."
    return nil
  end
  if @@debugMe
    $stderr.puts "Building Primal request..."
  end

  ### Get information required for building a Primal request
  # Get the category from the extracted data
  category = rewriteCategory(categoryJSON['category'])
  if @@debugMe
    $stderr.puts "Category = #{category}"
  end

  ### Select top entities from all extracted entities
  entitiesList = entitiesJSON['entities'].collect { |entity|
    entity['text'].downcase
  }[0, @@entitiesLimit]
  if @@debugMe
    prettified = entitiesJSON['entities'].collect { |entity| entity['text'] }.join(', ')
    $stderr.puts "Entities = #{prettified}"
  end

  ### Select top keywords from all extracted keywords
  # Remove keywords that intersect with entities of the types in @@categoryIgnores
  allEntities = entitiesJSON['entities'].select { |entity|
    @@categoryIgnores.has_key? entity['type'].downcase
  }.collect { |entity|
    entity['text'].downcase
  }
  if @@debugMe
    prettified = keywordsJSON['keywords'].collect { |keyword| keyword['text'] }.join(', ')
    $stderr.puts "Keywords = #{prettified}"
  end
  keywordsList = keywordsJSON['keywords'].select { |keyword|
    normalizedKw = keyword['text'].downcase
    intersectsWithEntity = !(allEntities.select { |entity|
      entity.include? normalizedKw or normalizedKw.include? entity
    }.empty?)
    # Ignore keywords > 4 words or those that intersect with entities
    normalizedKw.split.size < 5 && !intersectsWithEntity
  }.collect { |keyword|
    keyword['text'].downcase
  }
  # Remove repeated keywords
  keywordsList = getNonRepeatedKeywords(keywordsList)

  ### Build Primal topic URI
  primalRequest = ""
  unless category.nil? then
    primalRequest = "/" + category
  end
  if entitiesList.size > 0 then
    primalRequest = primalRequest + "/" + entitiesList.join("/")
  end
  if keywordsList.size > 0 then
    primalRequest = primalRequest + "/" + keywordsList.join(";")
  end
  if @@debugMe
    $stderr.puts "Primal request = #{primalRequest}"
  end
  URI::encode(primalRequest)
end
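The shape of the resulting topic URI can be illustrated in isolation. The following is a minimal sketch, not the library's code: build_topic_path is a hypothetical helper, and it percent-encodes each segment with ERB::Util.url_encode as a stand-in for the URI::encode call in the listing, which was removed in Ruby 3.0.

```ruby
require 'erb'

# Hypothetical helper (not part of the library): joins a category, entities,
# and keywords into a path shaped like /category/entity1/entity2/kw1;kw2.
# Entities are separated by "/" and keywords by ";", as in buildPrimalRequest.
def build_topic_path(category, entities, keywords)
  # Assumption: encode each segment separately so the "/" and ";"
  # separators themselves stay unescaped.
  encode = ->(segment) { ERB::Util.url_encode(segment) }
  path = +''
  path << '/' << encode.call(category) unless category.nil?
  path << '/' << entities.map(&encode).join('/') unless entities.empty?
  path << '/' << keywords.map(&encode).join(';') unless keywords.empty?
  path
end

puts build_topic_path('arts', ['pablo picasso'], ['cubism', 'modern art'])
# prints /arts/pablo%20picasso/cubism;modern%20art
```

A nil category or an empty list simply contributes no segment, which mirrors the three conditionals at the end of the listing.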
Performs a GET request to the Alchemy service URL and returns the response as a JSON object.
Returns nil on error.
# File lib/primal/AlchemyAPIWrapper.rb, line 313
def getAlchemy(serviceURL, parameters)
  response = self.class.get(serviceURL, parameters)
  returnAlchemyResponseJSON(response)
end
Returns the top @@keywordsLimit keywords, ignoring those contained within other keywords
# File lib/primal/AlchemyAPIWrapper.rb, line 238
def getNonRepeatedKeywords(keywordsList)
  # If there is less than @@keywordsLimit, or if the first @@keywordsLimit keywords
  # are unique, return the first @@keywordsLimit keywords
  repeatedKeywords = getRepeatedKeywords(keywordsList[0, @@keywordsLimit])
  if keywordsList.length <= @@keywordsLimit or repeatedKeywords.empty?
    keywordsList[0, @@keywordsLimit]
  else
    # Remove repeated elements from the first @@keywordsLimit keywords, and recursively
    # call this function
    getNonRepeatedKeywords(keywordsList - repeatedKeywords)
  end
end
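The recursion is easier to follow on a small input. The sketch below re-creates the two keyword helpers as standalone functions under assumed names (repeated_keywords, non_repeated_keywords) and takes the limit as a parameter, since the value of @@keywordsLimit is not shown on this page.

```ruby
# Hypothetical standalone version of getRepeatedKeywords: keeps every
# keyword that is a substring of a different keyword in the same list.
def repeated_keywords(keywords)
  keywords.select do |keyword|
    keywords.any? { |other| other != keyword && other.include?(keyword) }
  end
end

# Hypothetical standalone version of getNonRepeatedKeywords. `limit`
# stands in for @@keywordsLimit, whose real value is not known here.
def non_repeated_keywords(keywords, limit)
  head = keywords.first(limit)
  repeated = repeated_keywords(head)
  # Stop when the list already fits the limit or the head has no repeats.
  return head if keywords.length <= limit || repeated.empty?

  # Otherwise drop the repeats and try again, letting later keywords move up.
  non_repeated_keywords(keywords - repeated, limit)
end

p non_repeated_keywords(['art', 'modern art', 'cubism', 'paris'], 3)
# prints ["modern art", "cubism", "paris"]
```

Note that when the list is no longer than the limit, it is returned as-is even if it contains repeats, because the length check short-circuits the recursion. This matches the condition in the listing.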
Receives a string that represents a Web page URL or some text, and returns a Primal topic URI.
Returns nil on error.
# File lib/primal/AlchemyAPIWrapper.rb, line 68
def getPrimalRequest(urlOrText)
  if isURI(urlOrText)
    getPrimalRequestURL(urlOrText)
  else
    getPrimalRequestTEXT(urlOrText)
  end
end
Processes the given text at Alchemy and then translates the results into a valid Primal topic URI.
# File lib/primal/AlchemyAPIWrapper.rb, line 128
def getPrimalRequestTEXT(textToProcess)
  if @@debugMe
    $stderr.puts "Extracting information from text..."
  end
  # get category
  categoryJSON = postAlchemy("#{@@alchemyText}/TextGetCategory", :query => {
    :outputMode => 'json',
    :apikey => @alchemyApiKey,
    :text => textToProcess
  })
  # get entities
  entitiesJSON = postAlchemy("#{@@alchemyText}/TextGetRankedNamedEntities", :query => {
    :outputMode => 'json',
    :apikey => @alchemyApiKey,
    :text => textToProcess
  })
  # get keywords
  keywordsJSON = postAlchemy("#{@@alchemyText}/TextGetRankedKeywords", :query => {
    :outputMode => 'json',
    :apikey => @alchemyApiKey,
    :text => textToProcess
  })
  buildPrimalRequest(categoryJSON, entitiesJSON, keywordsJSON)
end
Processes the given URL at Alchemy and then translates the results into a valid Primal topic URI.
# File lib/primal/AlchemyAPIWrapper.rb, line 92
def getPrimalRequestURL(urlToProcess)
  if @@debugMe
    $stderr.puts "Extracting information from URL..."
  end
  # get category of the Web page
  categoryJSON = getAlchemy("#{@@alchemyURL}/URLGetCategory", :query => {
    :outputMode => 'json',
    :apikey => @alchemyApiKey,
    :url => urlToProcess
  })
  # get entities in the Web page
  entitiesJSON = getAlchemy("#{@@alchemyURL}/URLGetRankedNamedEntities", :query => {
    :outputMode => 'json',
    :apikey => @alchemyApiKey,
    :url => urlToProcess
  })
  # get keywords from the Web page
  keywordsJSON = getAlchemy("#{@@alchemyURL}/URLGetRankedKeywords", :query => {
    :outputMode => 'json',
    :apikey => @alchemyApiKey,
    :url => urlToProcess
  })
  buildPrimalRequest(categoryJSON, entitiesJSON, keywordsJSON)
end
Returns the keywords in the list that are contained within another, different keyword in the same list
# File lib/primal/AlchemyAPIWrapper.rb, line 254
def getRepeatedKeywords(keywordsList)
  keywordsList.select { |keyword|
    not(keywordsList.select { |other|
      other != keyword and other.include? keyword
    }.empty?)
  }
end
Indicates whether or not a given string represents an http or https URL
# File lib/primal/AlchemyAPIWrapper.rb, line 79
def isURI(string)
  uri = URI.parse(string)
  %w( http https ).include?(uri.scheme)
rescue URI::BadURIError
  false
rescue URI::InvalidURIError
  false
end
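To see which inputs are routed to the URL extraction and which to the text extraction, the check can be tried on its own. This is a minimal sketch under an assumed name (http_url?). It rescues URI::Error, the common parent of the two exception classes handled in the listing, instead of listing them separately.

```ruby
require 'uri'

# Hypothetical standalone version of isURI: true only for strings that
# parse as a URI whose scheme is http or https.
def http_url?(string)
  uri = URI.parse(string)
  %w[http https].include?(uri.scheme)
rescue URI::Error
  # Strings that cannot be parsed as a URI at all (for example, free text
  # containing spaces) are treated as plain text.
  false
end

['http://www.example.com/news', 'ftp://example.com/file', 'Some plain text.'].each do |input|
  puts "#{input.inspect} => #{http_url?(input)}"
end
```

Only the first input is treated as a URL. The ftp address parses successfully but has a different scheme, and the sentence fails to parse, so both would be handled as text.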
Performs a POST request to the Alchemy service URL and returns the response as a JSON object.
Returns nil on error.
# File lib/primal/AlchemyAPIWrapper.rb, line 302
def postAlchemy(serviceURL, parameters)
  response = self.class.post(serviceURL, parameters)
  returnAlchemyResponseJSON(response)
end
Returns the body of the response as a JSON object, or nil if the response status is not "OK"
# File lib/primal/AlchemyAPIWrapper.rb, line 322
def returnAlchemyResponseJSON(response)
  code = response.code
  body = response.body
  bodyJSON = JSON.parse(body)
  # A statusInfo field contains the details of the error
  if bodyJSON['status'] != "OK"
    puts bodyJSON['statusInfo']
    nil
  else
    bodyJSON
  end
end
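The status check can be exercised without a network call by parsing a response body directly. The sketch below uses an assumed helper name (parse_alchemy_body) and made-up sample bodies; only the status and statusInfo field names come from the listing. Unlike the listing, it also treats malformed JSON as an error and returns nil, which is an addition for illustration.

```ruby
require 'json'

# Hypothetical helper: returns the parsed body when its status is "OK",
# otherwise reports the statusInfo details on stderr and returns nil.
def parse_alchemy_body(body)
  parsed = JSON.parse(body)
  if parsed['status'] == 'OK'
    parsed
  else
    warn parsed['statusInfo']
    nil
  end
rescue JSON::ParserError => e
  # Assumption: a body that is not valid JSON counts as an error too.
  warn "Invalid JSON: #{e.message}"
  nil
end

# Illustrative bodies only; real responses contain more fields.
ok = parse_alchemy_body('{"status":"OK","category":"arts_entertainment"}')
puts ok['category']
failed = parse_alchemy_body('{"status":"ERROR","statusInfo":"invalid-api-key"}')
puts failed.inspect
```

Returning nil on failure is what lets buildPrimalRequest detect a failed extraction with a simple truthiness check.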
Rewrites the extracted category string into a clear topic for the Primal request.
AlchemyAPI categorizes text into a limited set of category types.
See http://www.alchemyapi.com/api/categ/categs.html for a complete list.
Some of the category type names consist of two strings joined by an underscore character. This function selects one of the two strings (or an entirely new string) to be the topic in the Primal request. The category "unknown" is rewritten to nil.
# File lib/primal/AlchemyAPIWrapper.rb, line 274
def rewriteCategory(category)
  case category
  when "unknown"
    # AlchemyAPI failed to classify the text
    category = nil
  when "arts_entertainment"
    category = "arts" # rewrite to 'arts'
  when "computer_internet"
    category = "technology" # rewrite to 'technology', a clearer topic for this category
  when "culture_politics"
    category = "politics" # rewrite to 'politics'
  when "law_crime"
    category = "law" # rewrite to 'law'
  when "science_technology"
    category = "science" # rewrite to 'science'
  else
    # The previous conditions should cover all the categories extracted by
    # Alchemy. In case of a new category that contains an underscore, replace
    # it and keep the two words as the topic.
    category = category.sub('_', ' ')
  end
end
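The same mapping can be expressed as a lookup table, which makes the rewritten categories easy to scan. This is a sketch under assumed names (CATEGORY_TOPICS, topic_for), not the library's implementation. The nil guard is an addition, since the listing would raise on a nil category.

```ruby
# The five rewrites listed in rewriteCategory, as a lookup table.
CATEGORY_TOPICS = {
  'arts_entertainment' => 'arts',
  'computer_internet'  => 'technology',
  'culture_politics'   => 'politics',
  'law_crime'          => 'law',
  'science_technology' => 'science'
}.freeze

# Hypothetical equivalent of rewriteCategory.
def topic_for(category)
  # "unknown" means AlchemyAPI could not classify the text.
  # Assumption: a missing category is treated the same way.
  return nil if category.nil? || category == 'unknown'

  # Unlisted categories keep their name, with the first underscore
  # replaced by a space.
  CATEGORY_TOPICS.fetch(category) { category.sub('_', ' ') }
end

%w[computer_internet sports unknown].each do |category|
  puts "#{category} => #{topic_for(category).inspect}"
end
```

A category without an underscore, such as "sports" in the loop above, passes through unchanged.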