term_extract - Term Extract

Description:

term_extract extracts proper nouns (named things like 'Manchester United') and ordinary nouns (like 'event') from text documents.

Usage:

An example extracting terms from a piece of content:

require 'term_extract'

content = <<DOC
Business Secretary Vince Cable will stay in cabinet despite
"declaring war" on Rupert Murdoch, says Downing Street.
DOC

terms = TermExtract.extract(content)
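To give a feel for what a "proper noun" term looks like in the sample text above, here is a rough, self-contained sketch. It is not the algorithm term_extract uses (the extractor works from POS tags, as described under Term Extraction Types); naive_proper_nouns is a hypothetical helper that simply groups runs of capitalised words.

```ruby
# Naive stand-in for illustration only; NOT the algorithm term_extract uses.
# Treats each run of consecutive capitalised words as one candidate proper noun.
def naive_proper_nouns(text)
  text.scan(/\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*/)
end

content = <<DOC
Business Secretary Vince Cable will stay in cabinet despite
"declaring war" on Rupert Murdoch, says Downing Street.
DOC

# Print one candidate term per line.
puts naive_proper_nouns(content)
```

On this sample the heuristic yields "Business Secretary Vince Cable", "Rupert Murdoch" and "Downing Street". It would also pick up any capitalised word at the start of a sentence, which is one reason a tagger-based extractor is preferable.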

Options

The extract method takes an optional options hash that modifies the behaviour of the term extractor. Options are passed as hash keys, for example :types and :include_tags.

Sample usage:

terms = TermExtract.extract(content, :types => :nnp, :include_tags => true)

Term Extraction Types

By default, the term extractor attempts to extract both ordinary nouns and proper nouns. This behaviour can be configured with the types option, which accepts :all (both), :nn (ordinary nouns only) or :nnp (proper nouns only). These codes correspond to the relevant POS (part-of-speech) tags used during the term extraction process. Sample usage is shown below:

terms = TermExtract.extract(content, :types => :nnp)
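The relationship between the types values and POS tags can be sketched as a simple lookup. This is only an illustration: TYPE_TAGS and tags_for are hypothetical names, not part of term_extract, and the exact set of tags the extractor matches internally (for example, whether plural tags are included) is an assumption here.

```ruby
# Hypothetical helper, not part of term_extract: maps a :types value
# to the Penn Treebank-style POS tags it stands for.
TYPE_TAGS = {
  :nn  => ['NN'],        # ordinary nouns
  :nnp => ['NNP'],       # proper nouns
  :all => ['NN', 'NNP']  # both (the default)
}.freeze

def tags_for(types = :all)
  # Reject anything other than the three documented values.
  TYPE_TAGS.fetch(types) do
    raise ArgumentError, "unknown type #{types.inspect}; expected :all, :nn or :nnp"
  end
end

puts tags_for(:nnp).inspect
puts tags_for.inspect
```

Raising on an unknown value, rather than silently falling back to :all, makes a mistyped option visible straight away.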

Command Line Tool

A command line tool is provided for testing the term extractor. It is best used in conjunction with another tool that extracts the relevant content (e.g. pismo):

pismo http://www.bbc.co.uk/news/uk-politics-12085506 body | ruby -rubygems -e 'puts YAML.parse($stdin.read)[:body].value' | ./term-extract nnp | ruby -rubygems -e 'puts YAML.load($stdin.read)'

Note on Patches/Pull Requests

Acknowledgements

The algorithm and extraction code are based on the original Python code at:

pypi.python.org/pypi/topia.termextract/

Copyright and License

GPL v3 - See LICENSE.txt for details. Copyright © 2010, Rob Lee