class Amatch::DiceCoefficient

The pair distance between two strings is based on the number of adjacent character pairs that are contained in both strings. The similarity metric of two strings s1 and s2 is

2 * |intersection(pairs(s1), pairs(s2))| / (|pairs(s1)| + |pairs(s2)|)

A value of 1.0 means the two strings are an exact match; the lower the value, the more dissimilar they are. The advantage of considering adjacent characters is that the metric takes account not only of the characters, but also of the character ordering in the original strings.
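As an illustration only, the metric can be sketched in plain Ruby. This is not the library's C implementation: the helper names pairs and pair_similarity are hypothetical, pairs are counted as a multiset (a repeated pair is matched at most as often as it occurs in both strings), and returning 0.0 when neither string yields a pair is an assumption of this sketch.

```ruby
# Hypothetical helper: adjacent character pairs, e.g. "abc" -> ["ab", "bc"].
def pairs(string)
  string.chars.each_cons(2).map(&:join)
end

# Sketch of the similarity metric: twice the number of shared pairs,
# divided by the total number of pairs in both strings.
def pair_similarity(s1, s2)
  p1 = pairs(s1)
  remaining = pairs(s2)
  total = p1.size + remaining.size
  return 0.0 if total.zero? # assumption: no pairs at all counts as "no match"

  # Count each pair of s1 that can still be matched in s2, consuming
  # the matched pair so that duplicates are not counted twice.
  common = p1.count do |pair|
    index = remaining.index(pair)
    index && remaining.delete_at(index)
  end
  2.0 * common / total
end

pair_similarity("night", "nacht") # => 0.25 (only "ht" is shared, 2 * 1 / 8)
pair_similarity("abc", "abc")     # => 1.0
```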

This metric is well suited to finding similarities in natural-language text. It is explained in more detail in Simon White's article “How to Strike a Match”, located at www.catalysoft.com/articles/StrikeAMatch.html. It is also a special case of the method described in “Using q-grams in a DBMS for Approximate String Processing” (citeseer.lcs.mit.edu/gravano01using.html).

Public Class Methods

new(pattern)

Creates a new Amatch::PairDistance instance from pattern.

static VALUE rb_PairDistance_initialize(VALUE self, VALUE pattern)
{
    GET_STRUCT(PairDistance)
    PairDistance_pattern_set(amatch, pattern);
    return self;
}

Public Instance Methods

match(strings, regexp = /\s+/) → results

Uses this Amatch::PairDistance instance to match PairDistance#pattern against strings. It returns the pair distance measure: a value of 1.0 is an exact match, partial matches yield lower values, and 0.0 means no match at all.

strings has to be either a String or an Array of Strings. The argument regexp is used to split the pattern and strings into tokens first. It defaults to /\s+/. To omit the splitting, call the method with nil as regexp explicitly.

The returned result is either a Float or an Array of Floats, respectively.
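The effect of the regexp argument and of the String/Array dispatch can be illustrated with a small pure-Ruby sketch. It does not call the library; sketch_match and token_pairs are hypothetical names, and the sketch assumes that pairs are collected within each token and then pooled, as in Simon White's article.

```ruby
# Hypothetical helper: split into tokens (unless regexp is nil) and
# collect the adjacent character pairs of every token.
def token_pairs(string, regexp)
  tokens = regexp ? string.split(regexp) : [string]
  tokens.flat_map { |token| token.chars.each_cons(2).map(&:join) }
end

# Sketch of match: returns a Float for a String, an Array of Floats
# for an Array of Strings.
def sketch_match(pattern, strings, regexp = /\s+/)
  score = lambda do |string|
    unless string.is_a?(String)
      raise TypeError, "array has to contain only strings (#{string.class} given)"
    end
    wanted = token_pairs(pattern, regexp)
    remaining = token_pairs(string, regexp)
    total = wanted.size + remaining.size
    next 0.0 if total.zero? # assumption: no pairs at all counts as "no match"

    common = wanted.count do |pair|
      index = remaining.index(pair)
      index && remaining.delete_at(index)
    end
    2.0 * common / total
  end

  return score.call(strings) if strings.is_a?(String)
  raise TypeError, "expected a String or an Array" unless strings.is_a?(Array)
  strings.map(&score)
end

sketch_match("ab cd", "cd ab")        # => 1.0, tokens are compared regardless of order
sketch_match("ab cd", "cd ab", nil)   # => 0.5, pairs containing the space no longer match
sketch_match("abc", ["abc", "xyz"])   # => [1.0, 0.0]
```

With splitting enabled, word order does not affect the score, because pairs never span a token boundary; passing nil makes the whitespace part of the pairs.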

static VALUE rb_PairDistance_match(int argc, VALUE *argv, VALUE self)
{
    VALUE result, strings, regexp = Qnil;
    int use_regexp;
    GET_STRUCT(PairDistance)

    rb_scan_args(argc, argv, "11", &strings, &regexp);
    use_regexp = NIL_P(regexp) && argc != 2;
    if (TYPE(strings) == T_STRING) {
        result = PairDistance_match(amatch, strings, regexp, use_regexp);
    } else {
        int i;
        Check_Type(strings, T_ARRAY);
        result = rb_ary_new2(RARRAY_LEN(strings));
        for (i = 0; i < RARRAY_LEN(strings); i++) {
            VALUE string = rb_ary_entry(strings, i);
            if (TYPE(string) != T_STRING) {
                rb_raise(rb_eTypeError,
                    "array has to contain only strings (%s given)",
                    NIL_P(string) ?
                        "NilClass" :
                        rb_class2name(CLASS_OF(string)));
            }
            rb_ary_push(result,
                PairDistance_match(amatch, string, regexp, use_regexp));
        }
    }
    pair_array_destroy(amatch->pattern_pair_array);
    amatch->pattern_pair_array = NULL;
    return result;
}

Also aliased as: similar

pattern → pattern string

Returns the current pattern string of this Amatch::PairDistance instance.

pattern=(pattern)

Sets the current pattern string of this Amatch::PairDistance instance to pattern.

similar(strings, regexp = /\s+/) → results

Alias for: match