module GreenMidget

Copyright © 2011, SoundCloud Ltd., Nikola Chochkov

A mixin that implements feature checks and allows Base subclasses to define their own features for spam/ham detection.

By default, texts are checked for the presence of external URL or email references. An example of an additional feature would be the presence of particular words or expressions.

See the example in ‘lib/green_midget/extensions/sample.rb’
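
As a rough illustration of what a feature check of this kind amounts to, the following standalone sketch tests a text for URL and email references. The module, class, and method names and both regular expressions are assumptions made for illustration; they are not GreenMidget's actual URL_REGEX, EMAIL_REGEX, or API.

```ruby
# Hypothetical, standalone sketch; not GreenMidget's actual implementation.
module SketchFeatures
  # Simplified patterns, assumed for illustration only.
  SKETCH_URL_REGEX   = %r{https?://\S+}i
  SKETCH_EMAIL_REGEX = /[\w.+-]+@[\w-]+(\.[\w-]+)+/

  # Returns the names of the features that are present in the given text.
  def present_features(text)
    checks = {
      'url_present'   => SKETCH_URL_REGEX,
      'email_present' => SKETCH_EMAIL_REGEX
    }
    checks.select { |_name, regex| text.match?(regex) }.keys
  end
end

# A class that mixes the feature checks in.
class SketchChecker
  include SketchFeatures
end
```

An additional feature, such as the presence of a particular word, would be one more entry in the `checks` hash of this sketch.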

A mixin that implements heuristic checks for both categories. If there are conditions under which a spammable object can be classified directly into one of the categories, that logic can be implemented as heuristic checks in your subclasses.

See the example in ‘lib/green_midget/extensions/sample.rb’
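
The idea of a heuristic check that short-circuits classification could be sketched as follows. All class and method names here are hypothetical and chosen for illustration, and the "short text" rule is an invented example; none of this is taken from GreenMidget's sample extension.

```ruby
# Hypothetical sketch; class and method names are assumptions.
class SketchClassifier
  # Returns :ham or :spam when a heuristic decides outright, nil otherwise.
  def classify_by_heuristics(text)
    return :ham  if ham_heuristic?(text)
    return :spam if spam_heuristic?(text)
    nil
  end

  # Subclasses override these; by default no heuristic fires.
  def ham_heuristic?(_text)
    false
  end

  def spam_heuristic?(_text)
    false
  end
end

class ShortTextClassifier < SketchClassifier
  # Assumed rule for illustration: texts under three words count as ham.
  def ham_heuristic?(text)
    text.split.size < 3
  end
end
```

When no heuristic fires, the sketch returns nil, which stands for "fall through to the regular classification".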

This is an abstraction from Words, Examples and Features. It provides common methods for building the record keys for individual countables in any category.

For example, the data record key for the word ‘legit’ in the Spam category would be something like “word::legit::spam_count”. The record key for the feature ‘url_present’ in Ham would be something like “feature::url_present::ham_count”. The count of all training examples given for the Spam category would be “example::any::spam_count”.

The example counts for individual features are stored as well. For example, for ‘url_present’ there are two records: “example::url_present::spam_count” and “example::url_present::ham_count”. They store how much training GreenMidget received for this feature in each category.

This class is the link between a countable and the Records data store adapter.
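
The key layout described above can be reproduced with a small helper. The `record_key` method is a hypothetical name used only for this sketch; the way Countable actually builds its keys may differ.

```ruby
# Hypothetical helper; GreenMidget's Countable may build keys differently.
# prefix   - kind of countable: 'word', 'feature' or 'example'
# key      - the word, the feature name, or 'any'
# category - :spam or :ham
def record_key(prefix, key, category)
  "#{prefix}::#{key}::#{category}_count"
end

record_key('word', 'legit', :spam)          # "word::legit::spam_count"
record_key('feature', 'url_present', :ham)  # "feature::url_present::ham_count"
record_key('example', 'any', :spam)         # "example::any::spam_count"
```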

A model for Examples used in GreenMidget. Examples represent the counts for how much training GreenMidget received in each respective category.

Example['url_present'][:spam]
# the number of spam training examples having a URL

Example['any'][:ham]
# the total number of Ham training examples

See Countable

A model for Features used in GreenMidget. Features can be defined by the user. An example would be ‘url_found_in_text’, which is true for spammable objects that have a URL in their text and false otherwise.

Features['url_in_text'][:spam]
# the count of spam messages that have the feature

See Countable

GreenMidget’s simple data store adapter, with only three public methods. It is currently ActiveRecord-based, but there is a plan to add a Redis-based extension as well.

A model for Words used in GreenMidget. See Countable

Constants

ACCEPT_ALTERNATIVE_MIN

Decision making: the log-ratio Log(Pr(alternative | text)) - Log(Pr(null | text)) is compared against the range ( REJECT_ALTERNATIVE_MAX..ACCEPT_ALTERNATIVE_MIN )

ALTERNATIVE
CATEGORIES
EMAIL_REGEX
FEATURES
MAX_CHARACTERS_IN_WORD
MIN_CHARACTERS_IN_WORD
NULL
REJECT_ALTERNATIVE_MAX
RESPONSES
STOP_WORDS
TOLERATED_URLS
URL_REGEX
VERSION
WORDS_SPLIT_REGEX
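
The decision rule documented for ACCEPT_ALTERNATIVE_MIN above can be read as a three-way comparison of the log-ratio against the two thresholds. The sketch below is one possible reading: the threshold values, the response symbols, the treatment of the boundaries, and the names `SketchDecision` and `decide` are all assumptions, not GreenMidget's actual constants or API.

```ruby
# Hypothetical sketch; threshold values and response symbols are assumed.
module SketchDecision
  SKETCH_REJECT_MAX = 0.0            # assumed value
  SKETCH_ACCEPT_MIN = Math.log(3.0)  # assumed value

  module_function

  # Compares Log(Pr(alternative | text)) - Log(Pr(null | text))
  # against the range SKETCH_REJECT_MAX..SKETCH_ACCEPT_MIN.
  def decide(log_pr_alternative, log_pr_null)
    ratio = log_pr_alternative - log_pr_null
    if ratio >= SKETCH_ACCEPT_MIN
      :accept_alternative   # evidence clearly favours the alternative
    elsif ratio < SKETCH_REJECT_MAX
      :reject_alternative   # evidence favours the null category
    else
      :undecided            # ratio falls inside the range
    end
  end
end
```

Keeping a gap between the two thresholds leaves room for an "undecided" answer when the evidence for either category is weak.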