class Company::Mapping::TFIDF
TFIDF
class implements Term Frequency Inverse Document Frequency statistic. Term frequency–inverse document frequency,
is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.
Attributes
idf[RW]
tf[RW]
Public Class Methods
new(corpus)
click to toggle source
# File lib/company/mapping/tfidf/tfidf.rb, line 9 def initialize(corpus) @corpus = corpus end
Public Instance Methods
calculate()
click to toggle source
Calculates the tf-idf weights in the given corpus
# File lib/company/mapping/tfidf/tfidf.rb, line 14 def calculate @tfidf = Hash.new @idf ||= InverseDocumentFrequency.new(@corpus) @tf ||= NormalizedTermFrequency.new(BasicTokenizer.new) @idf_weights = @idf.calculate @corpus.each do |doc| termfreq = @tf.calculate(doc.contents) @tfidf[doc.id] = termfreq.each_with_object({}) do |(term, tf), tfidf_weights| weight = tf * @idf_weights[term] tfidf_weights[term] = weight end end @tfidf end
calculate_tfidf_weights_of_new_document(new_doc)
click to toggle source
Calculates tfidf weights of new incoming document without importing the document in the corpus and re-calculating the tf-idf weights for the entire corpus
# File lib/company/mapping/tfidf/tfidf.rb, line 34 def calculate_tfidf_weights_of_new_document(new_doc) termfreq = @tf.calculate(new_doc.contents) @tfidf[new_doc.id] = termfreq.each_with_object({}) do |(term, tf), tfidf_weights| weight = tf * (@idf_weights[term] || @idf.maxIDF) tfidf_weights[term] = weight end @tfidf end
similarity(doc1_id, doc2_id)
click to toggle source
Calculates tf-idf similarity between two given documents. It is actually the calculated Cosine Similarity by using tf*idf weights.
# File lib/company/mapping/tfidf/tfidf.rb, line 46 def similarity(doc1_id, doc2_id) @tfidf ||= calculate CosineSimilarity.new.calculate(@tfidf[doc1_id], @tfidf[doc2_id]) end