Module: Textminer

Extended by:
Configuration
Defined in:
lib/textminer/mined.rb,
lib/textminer.rb,
lib/textminer/miner.rb,
lib/textminer/request.rb,
lib/textminer/version.rb,
lib/textminer/response.rb

Overview

Textminer::Miner

Class that returns a text-mining object

Defined Under Namespace

Classes: Mined, Miner, Request, Response

Constant Summary

VERSION =
"0.1.5"

Class Method Summary

Methods included from Configuration

configuration, define_setting
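
The fetch examples on this page call `Textminer.configuration` with a block to set `tdm_key`. As a rough illustration of how a mixin offering `configuration` and `define_setting` could behave, here is a minimal sketch. `SketchConfiguration` and `SketchMiner` are hypothetical names, and this is an assumption about the general pattern, not the gem's actual implementation.

```ruby
# Hypothetical stand-in for a Configuration mixin; not the gem's actual code.
module SketchConfiguration
  # Yield the extending module so settings can be assigned inside a block.
  def configuration
    yield self
  end

  # Define a module-level reader and writer for a named setting.
  def define_setting(name, default = nil)
    instance_variable_set("@#{name}", default)
    define_singleton_method(name) { instance_variable_get("@#{name}") }
    define_singleton_method("#{name}=") { |value| instance_variable_set("@#{name}", value) }
  end
end

# Hypothetical module that picks up the two methods via extend.
module SketchMiner
  extend SketchConfiguration
  define_setting :tdm_key
end

SketchMiner.configuration do |config|
  config.tdm_key = "<your key>"
end

puts SketchMiner.tdm_key
```

Extending rather than including makes the two methods available on the module itself, which matches the "Extended by: Configuration" note at the top of this page.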

Class Method Details

+ (Object) extract(path)

Thin layer around pdf-reader gem's PDF::Reader

This method is used internally within fetch to parse PDFs.

Examples:

require 'textminer'
res = Textminer.search(member: 2258, filter: {has_full_text: true});
links = res.links_pdf(true);
# Get full text for an article
out = Textminer.fetch(links[0]);
# extract pdf to text
Textminer.extract(out.path)

Parameters:

  • path (String)

    Path to a PDF file, downloaded via fetch or obtained another way.



# File 'lib/textminer.rb', line 140

def self.extract(path)
  rr = PDF::Reader.new(path)
  rr.pages.map { |page| page.text }.join("\n")
end
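
The page-joining step in extract can be tried in isolation, without a PDF on disk. The sketch below uses a hypothetical `FakePage` struct that only mimics the `text` method of a pdf-reader page, and a hypothetical helper `join_page_text`. It illustrates the map-and-join idea under those assumptions and is not part of the gem.

```ruby
# Hypothetical stand-in for a PDF page: it only responds to #text.
FakePage = Struct.new(:text)

# Collect the text of each page and join the pieces with newlines,
# mirroring the body of extract shown above.
def join_page_text(pages)
  pages.map(&:text).join("\n")
end

pages = [FakePage.new("Title page"), FakePage.new("Methods")]
puts join_page_text(pages)
```

Because pages are joined with a single newline, page boundaries are not distinguishable from ordinary line breaks in the result.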

+ (Mined) fetch(url)

Get full text

Works easily for open access papers, but not for closed-access ones. For non-OA papers, use Crossref's Text and Data Mining (TDM) service, which requires authentication and a pre-authorized IP address. Go to apps.crossref.org/clickthrough/researchers to sign up for the TDM service and get your key. The only publishers taking part at this time are Elsevier and Wiley.

The returned object gives access to the URL requested and the file path, and can parse plain text or XML, or extract text from a PDF.

Examples:

require 'textminer'
# Set authorization
Textminer.configuration do |config|
  config.tdm_key = "<your key>"
end
# Get some Elsevier works
res = Textminer.search(member: 78, filter: {has_full_text: true});
links = res.links_xml(true);
# Get full text for an article
out = Textminer.fetch(links[0]);
out.url
out.path
out.type
xml = out.parse()
puts xml
xml.xpath('//xocs:cover-date-text', xml.root.namespaces).text
# Get lots of articles
links = links[1..3]
out = links.collect{ |x| Textminer.fetch(x) }
out.collect{ |z| z.path }
out.collect{ |z| z.parse }
zz = out[0].parse
zz.xpath('//xocs:cover-date-text', zz.root.namespaces).text

## plain text
# get full text links, here doing plain text
links = res.links_plain(true);
# Get full text for an article
res = Textminer.fetch(links[0]);
res.url
res.parse

# With open access content - using Pensoft
res = Textminer.search(member: 2258, filter: {has_full_text: true});
links = res.links_xml(true);
# Get full text for an article
res = Textminer.fetch(links[0]);
res.url
res.parse

# OA content - PDFs, using Pensoft again
res = Textminer.search(member: 2258, filter: {has_full_text: true});
links = res.links_pdf(true);
# Get full text for an article
res = Textminer.fetch(links[0]);
# url used
res.url
# document type
res.type
# document path on your machine
res.path
# get text
res.parse

Parameters:

  • url (String)

    A URL for the full text

Returns:

  • (Mined)

    An object of class Mined, with methods for extracting the URL requested, the file path, the document type, and the parsed text



# File 'lib/textminer.rb', line 120

def self.fetch(url)
  Miner.new(url).perform
end
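
The examples above use four readers on the object returned by fetch: `url`, `path`, `type`, and `parse`. The sketch below shows one way an object with that shape could be modelled. `SketchMined` and `sketch_fetch` are hypothetical names, the file is written locally instead of downloaded, and only plain text is handled, so treat it as an illustration of the interface rather than the gem's Mined class.

```ruby
require 'tempfile'

# Hypothetical stand-in for a fetched document; not the gem's Mined class.
SketchMined = Struct.new(:url, :path, :type) do
  # Return the document text. Only the "plain" type is sketched here.
  def parse
    case type
    when "plain" then File.read(path)
    else raise NotImplementedError, "no parser sketched for #{type}"
    end
  end
end

# Hypothetical helper: write a body to a temporary file and wrap it,
# standing in for the download step that fetch performs.
def sketch_fetch(url, body)
  file = Tempfile.create(["sketch", ".txt"])
  file.write(body)
  file.close
  SketchMined.new(url, file.path, "plain")
end

doc = sketch_fetch("https://example.org/article.txt", "Full text here")
puts doc.url
puts doc.type
puts doc.parse
```

Keeping the path on the object means the downloaded file can be handed to other tools, as the extract example does with `out.path`.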

+ (Array) search(doi: nil, member: nil, filter: nil, limit: nil, options: nil)

Search for papers and get full text links

Examples:

require 'textminer'
# link to full text available
Textminer.search(doi: '10.3897/phytokeys.42.7604')
# no link to full text available
Textminer.search(doi: "10.1371/journal.pone.0000308")
# many DOIs at once
require 'serrano'
dois = Serrano.random_dois(sample: 6)
res = Textminer.search(doi: dois)
res = Textminer.search(doi: ["10.3897/phytokeys.42.7604", "10.3897/zookeys.516.9439"])
res.links
res.links_pdf
res.links_xml
res.links_plain
# only full text available
x = Textminer.search(doi: '10.3816/clm.2001.n.006')
x.links_xml
x.links_plain
x.links_pdf
# no dois
x = Textminer.search(filter: {has_full_text: true})
x.links_xml
x.links_plain
x = Textminer.search(member: 311, filter: {has_full_text: true})
x.links_pdf

Parameters:

  • doi (Array)

    A DOI (digital object identifier), or an array of DOIs

  • options (Array)

    Curl request options

Returns:



# File 'lib/textminer.rb', line 48

def self.search(doi: nil, member: nil, filter: nil, limit: nil, options: nil)
  Request.new(doi, member, filter, limit, options).perform
end
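
The `links_pdf`, `links_xml`, and `links_plain` calls in the examples each return links for one format. A minimal sketch of that kind of filtering is shown below, assuming link records carry "URL" and "content-type" keys as the link entries in Crossref work metadata do. `SAMPLE_LINKS` is made-up data and `urls_of_type` is a hypothetical helper, not a method of the gem.

```ruby
# Made-up link records shaped like the "link" entries of a Crossref work.
SAMPLE_LINKS = [
  { "URL" => "https://example.org/article.pdf", "content-type" => "application/pdf" },
  { "URL" => "https://example.org/article.xml", "content-type" => "application/xml" },
  { "URL" => "https://example.org/article.txt", "content-type" => "text/plain" }
].freeze

# Return the URLs whose content type matches the requested one.
def urls_of_type(links, content_type)
  links.select { |link| link["content-type"] == content_type }
       .map { |link| link["URL"] }
end

puts urls_of_type(SAMPLE_LINKS, "application/pdf")
puts urls_of_type(SAMPLE_LINKS, "text/plain")
```

A work with no link of the requested type yields an empty array, so callers should check for that before passing `links[0]` to fetch.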