module GreenMidget
Copyright © 2011, SoundCloud Ltd., Nikola Chochkov
A mixin that implements feature checks and allows Base
subclasses to define their own features for spam/ham detection.
By default, texts are checked for the presence of external URLs or email references. An example of an additional feature would be the presence of particular words or expressions.
See the example in ‘lib/green_midget/extensions/sample.rb’
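The idea of default and subclass-defined features can be sketched as follows. This is a minimal illustration, not GreenMidget's actual code: the module and class names, the method names, the two simplified patterns, and the banned-word list are all hypothetical, and the real library uses its own constants (such as URL_REGEX and EMAIL_REGEX) whose definitions are not shown here.

```ruby
# Hypothetical sketch of a features mixin; none of these names are GreenMidget's API.
module SketchFeatures
  # Simplified, assumed patterns (the library's own regexes are not reproduced here).
  URL_PATTERN   = %r{https?://\S+}i
  EMAIL_PATTERN = /\b[\w.+-]+@[\w-]+(\.[\w-]+)+\b/

  # Names of the features checked by default.
  def feature_names
    %w[url_in_text email_in_text]
  end

  def url_in_text?
    !!(text =~ URL_PATTERN)
  end

  def email_in_text?
    !!(text =~ EMAIL_PATTERN)
  end

  # Returns the names of all features that are present in the text.
  def present_features
    feature_names.select { |name| send("#{name}?") }
  end
end

class SketchBase
  include SketchFeatures
  attr_reader :text

  def initialize(text)
    @text = text
  end
end

# A subclass adds its own feature: presence of particular words.
class SketchWithBannedWords < SketchBase
  BANNED = %w[casino lottery].freeze # assumed word list

  def feature_names
    super + %w[banned_word_in_text]
  end

  def banned_word_in_text?
    (text.downcase.scan(/[a-z]+/) & BANNED).any?
  end
end
```

With these definitions, `SketchWithBannedWords.new("Win the LOTTERY").present_features` reports only the subclass-defined feature, while the default URL and email checks still run for every object.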
A mixin that implements heuristic checks for both categories. If there are conditions under which a spammable object can be classified directly into one of the categories, that logic can be implemented as heuristic checks in your subclasses.
See the example in ‘lib/green_midget/extensions/sample.rb’
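A heuristic check of this kind might look like the sketch below. The method names, the `author_trusted` flag, and the link-count threshold are assumptions made for illustration only; the hook names GreenMidget actually expects are defined in the library and in the sample extension.

```ruby
# Hypothetical sketch of heuristic short-circuiting; names are not GreenMidget's API.
module SketchHeuristics
  # Defaults: no heuristic applies, so classification falls through
  # to the statistical check.
  def ham_by_heuristics?
    false
  end

  def spam_by_heuristics?
    false
  end

  # Returns :ham or :spam when a heuristic decides directly, nil otherwise.
  def heuristic_category
    return :ham  if ham_by_heuristics?
    return :spam if spam_by_heuristics?
    nil
  end
end

class SketchComment
  include SketchHeuristics
  attr_reader :text, :author_trusted

  def initialize(text, author_trusted: false)
    @text = text
    @author_trusted = author_trusted
  end

  # Assumed rule: anything from a trusted author is ham.
  def ham_by_heuristics?
    author_trusted
  end

  # Assumed rule: more than three links is spam.
  def spam_by_heuristics?
    text.scan(%r{https?://}).size > 3
  end
end
```

In this sketch the ham heuristic is consulted first, so a trusted author's text is accepted even when it contains many links; the order is a design choice of the example, not a documented behaviour.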
This is an abstraction from Words, Examples and Features. It provides common methods for building the record keys for individual countables in any category.
For example, the data record key for the word ‘legit’ in the Spam category would be something like “word::legit::spam_count”. The record key for a feature ‘url_present’ in Ham would be something like “feature::url_present::ham_count”. The count of all training examples given for the Spam category would be “example::any::spam_count”.
The example counts for individual features are stored as well. For example, for ‘url_present’ we will have two records: “example::url_present::spam_count” and “example::url_present::ham_count”. They store how much training GreenMidget received for this feature in each category.
This class is the link between a countable and the Records data store adapter.
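The key scheme described above can be sketched in a few lines. The module name, the method names, and the in-memory Hash standing in for the data store are hypothetical; only the key layout (“prefix::name::category_count”) is taken from the examples in the text, which themselves describe it as “something like” the real format.

```ruby
# Hypothetical sketch of record-key building; not GreenMidget's actual implementation.
module SketchKeys
  SEPARATOR = "::"

  # Builds a key such as "word::legit::spam_count".
  def self.record_key(prefix, name, category)
    [prefix, name, "#{category}_count"].join(SEPARATOR)
  end

  # Looks up a count in a Hash that stands in for the data store.
  # Missing records are treated as zero (an assumption of this sketch).
  def self.count(store, prefix, name, category)
    store.fetch(record_key(prefix, name, category), 0)
  end
end

store = {
  "word::legit::spam_count"          => 2,
  "example::any::spam_count"         => 40,
  "example::url_present::ham_count"  => 7
}

SketchKeys.record_key(:feature, "url_present", :ham) # => "feature::url_present::ham_count"
SketchKeys.count(store, :example, "any", :spam)      # => 40
```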
A model for Examples used in GreenMidget. Examples represent the counts for how much training GreenMidget received in each respective category.
  Example['url_present'][:spam] # the number of spam training examples having a URL
  Example['any'][:ham]          # the total number of Ham training examples
See Countable
A model for Features used in GreenMidget. A Feature can be defined by the user. An example would be ‘url_found_in_text’, which is true for spammable objects that have a URL in their text and false otherwise.
Features['url_in_text'][:spam] # the count of spam messages that have the feature
See Countable
GreenMidget’s simple data store adapter with only three public methods. It is currently ActiveRecord-based, but a Redis-based extension is planned as well.
A model for Words used in GreenMidget.
See Countable
Constants
- ACCEPT_ALTERNATIVE_MIN
  Decision making: Log(Pr(alternative | text)) - Log(Pr(null | text)) <=> ( REJECT_ALTERNATIVE_MAX..ACCEPT_ALTERNATIVE_MIN )
- ALTERNATIVE
- CATEGORIES
- EMAIL_REGEX
- EXTERNAL_LINK_REGEX
- FEATURES
- MAX_CHARACTERS_IN_WORD
- MIN_CHARACTERS_IN_WORD
- NULL
- REJECT_ALTERNATIVE_MAX
- RESPONSES
- STOP_WORDS
- TOLERATED_URLS
- URL_REGEX
- VERSION
- WORDS_SPLIT_REGEX
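The decision rule noted under ACCEPT_ALTERNATIVE_MIN compares the log-probability difference against the range REJECT_ALTERNATIVE_MAX..ACCEPT_ALTERNATIVE_MIN. A minimal sketch of that comparison follows; the threshold values, the result symbols, and the handling of the boundaries are assumptions made for illustration, since the actual constant values and RESPONSES are not shown in this page.

```ruby
# Hypothetical sketch of the three-way decision; values and names are assumed.
module SketchDecision
  REJECT_MAX = 0.0            # assumed value, stands in for REJECT_ALTERNATIVE_MAX
  ACCEPT_MIN = Math.log(3.0)  # assumed value, stands in for ACCEPT_ALTERNATIVE_MIN

  # log_ratio is Log(Pr(alternative | text)) - Log(Pr(null | text)).
  def self.decide(log_ratio)
    if log_ratio >= ACCEPT_MIN
      :accept_alternative   # evidence clearly favours the alternative
    elsif log_ratio < REJECT_MAX
      :reject_alternative   # evidence clearly favours the null
    else
      :undecided            # ratio falls inside the range between the thresholds
    end
  end
end

SketchDecision.decide(2.0)   # => :accept_alternative
SketchDecision.decide(0.5)   # => :undecided
SketchDecision.decide(-1.0)  # => :reject_alternative
```

Keeping a gap between the two thresholds gives a middle band in which the classifier declines to decide, rather than forcing every text into one of the two categories.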