module Edits::JaroWinkler

Implements the Jaro-Winkler similarity algorithm.

@see en.wikipedia.org/wiki/Jaro-Winkler_distance

Constants

WINKLER_PREFIX_WEIGHT

Prefix scaling factor for the Jaro-Winkler metric. Default is 0.1. Should not exceed 0.25: the common prefix counts up to 4 characters, so a larger weight lets the metric leave the range 0..1.
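To see where the 0.25 limit comes from, here is a minimal sketch of the prefix boost in isolation. The helper name `winkler_boost` is hypothetical (it is not part of the Edits API); it applies the same expression that `similarity` uses, with the prefix length capped at 4.

```ruby
# Hypothetical helper, not part of the Edits gem: applies the Winkler
# prefix boost to an already-computed Jaro similarity `sj`.
def winkler_boost(sj, prefix_length, weight)
  l = [prefix_length, 4].min # the prefix contribution is capped at 4
  sj + (l * weight * (1 - sj))
end

# With the maximum prefix (4) and weight 0.25 the multiplier is exactly 1,
# so the boosted value can reach 1.0 but cannot pass it.
puts winkler_boost(0.8, 4, 0.25)

# A larger weight makes the multiplier exceed 1 and pushes the result past 1.0.
puts winkler_boost(0.8, 4, 0.3)
```

With the default weight of 0.1 the multiplier is at most 0.4, so the boost only closes part of the gap between the Jaro similarity and 1.0.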

WINKLER_THRESHOLD

Threshold for boosting Jaro with the Winkler prefix multiplier. Default is 0.7. The boost is applied only when the Jaro similarity is strictly greater than this value.
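A short sketch of how the threshold gates the boost. The helper `apply_threshold` is hypothetical and takes a precomputed Jaro similarity and prefix length as plain numbers; only the comparison and the boost expression mirror the `similarity` source.

```ruby
# Hypothetical helper, not part of the Edits gem. `sj` stands in for a
# Jaro similarity that has already been computed elsewhere.
def apply_threshold(sj, prefix_length, threshold: 0.7, weight: 0.1)
  # Strict comparison: a score equal to the threshold gets no boost.
  return sj unless sj > threshold

  sj + (prefix_length * weight * (1 - sj))
end

puts apply_threshold(0.6, 4) # below the threshold: returned unchanged
puts apply_threshold(0.7, 4) # equal to the threshold: also unchanged
puts apply_threshold(0.8, 4) # above the threshold: boosted towards 1.0
```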

Public Class Methods

distance( seq1, seq2, threshold: WINKLER_THRESHOLD, weight: WINKLER_PREFIX_WEIGHT )

Calculate Jaro-Winkler distance, defined as 1.0 minus the Jaro-Winkler similarity.

@note Not a true distance metric: it fails to satisfy the triangle inequality.

@example

Edits::JaroWinkler.distance("information", "informant")
# => 0.05858585858585863

@param (see similarity)
@return [Float] distance, from 0.0 (identical) to 1.0 (distant)

# File lib/edits/jaro_winkler.rb, line 63
def self.distance(
  seq1, seq2,
  threshold: WINKLER_THRESHOLD,
  weight: WINKLER_PREFIX_WEIGHT
)
  1.0 - similarity(seq1, seq2, threshold: threshold, weight: weight)
end

similarity( seq1, seq2, threshold: WINKLER_THRESHOLD, weight: WINKLER_PREFIX_WEIGHT )

Calculate Jaro-Winkler similarity of the given sequences (strings or arrays).

Adds weight to the Jaro similarity according to the length of the common prefix, counting at most 4 elements. The additional weighting is applied only when the original Jaro similarity exceeds the threshold; if the sequences share no prefix, the Jaro similarity is returned unchanged.

`Sw = Sj + (l * p * (1 - Sj))`

Where `Sw` is the Jaro-Winkler similarity, `Sj` is the Jaro similarity, `l` is the common prefix length (at most 4), and `p` is the prefix weight.
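As a worked example, the formula can be checked against "information" and "informant". The sketch below does not call the library. It assumes the standard Jaro definition, and the match and transposition counts (9 matches, 1 transposition) are worked out by hand for this pair rather than taken from `Edits::Jaro`. Both helper names are hypothetical.

```ruby
# Standard Jaro formula from a match count `m` and transposition count `t`.
# Hypothetical helper: it skips the matching step and takes the counts as given.
def jaro_from_counts(m, t, len1, len2)
  ((m.to_f / len1) + (m.to_f / len2) + ((m - t).to_f / m)) / 3
end

# Sw = Sj + (l * p * (1 - Sj))
def winkler_adjust(sj, l, prefix_weight)
  sj + (l * prefix_weight * (1 - sj))
end

# "information" has 11 characters, "informant" has 9.
sj = jaro_from_counts(9, 1, 11, 9)

# The common prefix "inform" is longer than 4, so l is capped at 4.
sw = winkler_adjust(sj, 4, 0.1)

puts sj.round(4) # => 0.9024
puts sw.round(4) # => 0.9414
```

The rounded result agrees with the documented value of about 0.9414 for this pair.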

@example

Edits::JaroWinkler.similarity("information", "informant")
# => 0.9414141414141414

@param seq1 [String, Array]
@param seq2 [String, Array]
@param threshold [Float] threshold for applying Winkler prefix weighting
@param weight [Float] weighting for common prefix, should not exceed 0.25
@return [Float] similarity, from 0.0 (none) to 1.0 (identical)

# File lib/edits/jaro_winkler.rb, line 35
def self.similarity(
  seq1, seq2,
  threshold: WINKLER_THRESHOLD,
  weight: WINKLER_PREFIX_WEIGHT
)
  sj = Jaro.similarity(seq1, seq2)
  return sj unless sj > threshold

  # size of common prefix, max 4
  max_bound = seq1.length > seq2.length ? seq2.length : seq1.length
  max_bound = 4 if max_bound > 4

  l = 0
  l += 1 until seq1[l] != seq2[l] || l >= max_bound

  l < 1 ? sj : sj + (l * weight * (1 - sj))
end
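
The prefix count in the listing above only uses `length` and `[]`, so it behaves the same way for a String or an Array. Here is a standalone sketch of that step, using a hypothetical helper name `common_prefix_length` and a `cap` parameter that is an assumption for illustration (the library hard-codes 4).

```ruby
# Hypothetical helper, not part of the Edits gem: counts how many leading
# elements two sequences share, up to `cap` elements.
def common_prefix_length(seq1, seq2, cap = 4)
  # Never look past the end of the shorter sequence or past the cap.
  bound = [seq1.length, seq2.length, cap].min

  l = 0
  l += 1 while l < bound && seq1[l] == seq2[l]
  l
end

puts common_prefix_length("information", "informant") # => 4
puts common_prefix_length("dixon", "dicksonx")        # => 2
puts common_prefix_length(%w[a b c], %w[a b d])       # => 2
puts common_prefix_length("", "abc")                  # => 0
```

Bounding the loop by the shorter length matters because indexing past the end of either sequence returns `nil`, and two `nil` values would otherwise compare as equal.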