class Ai::Nlp::Hasher

Class managing an n-gram hash

Public Class Methods

new(input)

Initialises the hasher with an empty n-gram hash and cleans the input.
@param string input The string to process

# File lib/ai/nlp/n_gram/hasher.rb, line 18
def initialize(input)
  @input = input
  @hash = {}
  clean
end

Public Instance Methods

calculate()

Calculates the n-gram frequencies of the cleaned input.
@return An array of [n-gram, frequency] pairs, sorted by descending frequency

# File lib/ai/nlp/n_gram/hasher.rb, line 27
def calculate
  @input.split(/[\d\s\[\]]/).each do |word|
    calculate_word_gram("_#{word}_")
  end
  drop_unwanted_keys
  @hash.sort { |one, other| other[1] <=> one[1] }
end
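
To illustrate the shape of the value returned by calculate, here is a minimal standalone sketch. It does not use the Hasher class; the helper name ngram_frequencies is hypothetical, and it assumes that only complete 1- to 3-character slices of each padded word are counted, as the guard in calculate_letter_gram suggests.

```ruby
# Hypothetical helper, not part of Ai::Nlp::Hasher.
# Counts every complete 1-, 2- and 3-character gram of each padded word,
# then sorts the pairs by descending frequency.
def ngram_frequencies(text)
  counts = Hash.new(0)
  text.split(/[\d\s\[\]]/).each do |word|
    padded = "_#{word}_" # underscores mark the word boundaries
    (0...padded.size).each do |start|
      (1..3).each do |n|
        gram = padded[start, n]
        # Skip slices that were cut short by the end of the word.
        counts[gram] += 1 if gram.size == n
      end
    end
  end
  counts.sort { |one, other| other[1] <=> one[1] }
end

result = ngram_frequencies("aa")
# result is an array of [gram, count] pairs; "_" and "a" both occur twice,
# so one of them comes first (the order among ties is not guaranteed).
```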

Private Instance Methods

calculate_letter_gram(parameters)

Stores the unigram, bigram and trigram that start at the given letter position in the hash.
@param hash parameters The necessary parameters:

- letter_position The position of the letter to be processed
- word The word being processed
- length The number of characters remaining in the word from letter_position

# File lib/ai/nlp/n_gram/hasher.rb, line 63
def calculate_letter_gram(parameters)
  (1..3).each do |nth|
    letters = parameters[:word][parameters[:letter_position], nth]
    next unless letters
    init_key(letters)
    @hash[letters] += 1 if parameters[:length] > (nth - 1)
  end
end

calculate_word_gram(word)

Adds the n-grams of a word to the hash, one letter position at a time.
@param string word The word to process

# File lib/ai/nlp/n_gram/hasher.rb, line 40
def calculate_word_gram(word)
  length = word.size
  (0..length).each do |letter_position|
    parameters = { letter_position: letter_position, word: word, length: length }
    calculate_letter_gram(parameters)
    length -= 1
  end
end
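
The two methods above rely on how Ruby's String#[] behaves with a start index and a length near the end of a string. The following sketch, using an arbitrary sample word, shows the cases that matter: a slice may come back shorter than requested, empty, or nil.

```ruby
word = "_ab_" # sample padded word, 4 characters long

full      = word[0, 3] # "_ab": a complete trigram
truncated = word[2, 3] # "b_": only two characters remain, so the slice is shorter than requested
at_end    = word[4, 1] # "": starting exactly at word.size yields an empty string
past_end  = word[5, 1] # nil: starting beyond word.size yields nil
```

This is why calculate_letter_gram only increments a count when the remaining length is at least the gram size, and why the loop over 0..length leaves an empty-string key with a count of zero in the hash, which drop_unwanted_keys removes afterwards.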

clean()

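Cleans the input string: applies the generic, specific and Latin clean-up steps, then trims it and collapses runs of whitespace into single spaces.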

# File lib/ai/nlp/n_gram/hasher.rb, line 81
def clean
  safe_clean
  specific_clean
  clean_latin
  @input = @input.strip.split(" ").join(" ")
end

clean_latin()

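Removes the ASCII Latin letters (a-z) from the string when the text is predominantly non-Latin, that is, when the number of non-Latin letters is at least half the number of Latin letters. Nothing is removed if the string contains only Latin or only non-Latin letters.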

# File lib/ai/nlp/n_gram/hasher.rb, line 90
def clean_latin
  latin = @input.scan(/[a-z]/)
  nonlatin = @input.scan(/[\p{L}&&[^a-z]]/)
  nonlatin_ratio = nonlatin.size / (latin.size * 1.0)
  return if nonlatin_ratio < 0.5
  @input.gsub!(/[a-zA-Z]/, "") if !latin.empty? && !nonlatin.empty?
end
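
The threshold can be checked on small inputs with a standalone sketch of the same rule. The helper name strip_latin_if_outnumbered is hypothetical; it returns a new string instead of mutating @input and assumes the text has already been downcased.

```ruby
# Hypothetical helper mirroring the ratio test of clean_latin on downcased text.
def strip_latin_if_outnumbered(text)
  latin = text.scan(/[a-z]/)
  nonlatin = text.scan(/[\p{L}&&[^a-z]]/) # any letter that is not a-z
  # Nothing to do when only one kind of letter is present.
  return text if latin.empty? || nonlatin.empty?
  # Keep the text as is when non-Latin letters are fewer than half the Latin ones.
  return text if nonlatin.size / latin.size.to_f < 0.5
  text.gsub(/[a-z]/, "")
end

strip_latin_if_outnumbered("hello мир")     # ratio 3 / 5.0 = 0.6, Latin letters removed
strip_latin_if_outnumbered("hello world я") # ratio 1 / 10.0 = 0.1, text unchanged
```

Note that accented letters such as "é" match the non-Latin pattern, because the Latin set is limited to a-z.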

drop_unwanted_keys()

Deletes every empty key, that is, every key whose size is zero.

# File lib/ai/nlp/n_gram/hasher.rb, line 51
def drop_unwanted_keys
  @hash.each_key do |key|
    @hash.delete(key) if key.size <= 0
  end
end

init_key(letters)

Initialises the key's count to zero if the key is not yet present.
@param string letters The group of letters

# File lib/ai/nlp/n_gram/hasher.rb, line 75
def init_key(letters)
  @hash[letters] ||= 0
end

safe_clean()

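Cleans the input with existing libraries: strips HTML with Sanitize, unescapes HTML entities with CGI, and downcases the text with Unicode.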

# File lib/ai/nlp/n_gram/hasher.rb, line 111
def safe_clean
  @input = Sanitize.clean(@input)
  @input = CGI.unescapeHTML(@input)
  @input = Unicode.downcase(@input)
end

specific_clean()

Removes web addresses and email addresses, then replaces polluting characters (punctuation, digits and line breaks) with spaces.

# File lib/ai/nlp/n_gram/hasher.rb, line 100
def specific_clean
  # Remove web addresses
  uri_regex = %r/(?:http|https):\/\/[a-z0-9]+(?:[\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(?:(?::[0-9]{1,5})?\/[^\s]*)?/
  @input.gsub!(uri_regex, "")
  # Remove email addresses
  @input.gsub!(/[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,4}/, "")
  # Replace polluting non-alphabetical characters, punctuation included, with a space
  @input.gsub!(%r/[\*\^><!\"#\$%&\'\(\)\*\+:;,\._\/=\?@\{\}\[\]|\-\n\r0-9]/, " ")
end
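
As an illustration of the three substitutions, the sketch below applies the same regular expressions to a sample sentence outside the class. The constant names, the strip_noise helper and the sample text are hypothetical; the final whitespace collapse is the one performed by clean.

```ruby
# Regular expressions copied from specific_clean; the names are illustrative only.
URI_REGEX = %r/(?:http|https):\/\/[a-z0-9]+(?:[\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(?:(?::[0-9]{1,5})?\/[^\s]*)?/
MAIL_REGEX = /[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,4}/
NOISE_REGEX = %r/[\*\^><!\"#\$%&\'\(\)\*\+:;,\._\/=\?@\{\}\[\]|\-\n\r0-9]/

# Hypothetical helper: returns a cleaned copy instead of mutating the input.
def strip_noise(text)
  text.gsub(URI_REGEX, "")    # drop web addresses
      .gsub(MAIL_REGEX, "")   # drop email addresses
      .gsub(NOISE_REGEX, " ") # turn punctuation and digits into spaces
end

sample = "visit http://example.com/page or mail bob@example.org now!"
cleaned = strip_noise(sample).strip.split(" ").join(" ")
# cleaned is "visit or mail now"
```

The order matters: addresses are removed first, because replacing ".", "/" and "@" with spaces beforehand would break them into fragments that the first two patterns no longer match.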