Sunday, October 19, 2008

Learn to filter spam

Almost every day we receive lots of spam emails. Training classifiers to filter spam is one of the most successful real-life applications of machine learning. But spammers are getting smarter at counteracting spam filters! One trick they use is the good word attack. By inserting a set of words that often appear in legitimate messages but not in spam, they make spam look legitimate and confuse the filters (increasing false negatives). Even worse, when end users label such emails as spam, adaptive spam filters will learn from these new training samples, and may associate those good words with spam (increasing false positives). These days we do sometimes receive such emails, don't we?

So this paper, "A Multiple Instance Learning Strategy for Combating Good Word Attacks on Spam Filters" (JMLR Jun 08), proposes using multi-instance learning for spam filters. Multi-instance learning learns from and makes predictions on bags of instances, instead of individual instances. If at least one of the instances in a bag is positive, then the bag is positive; otherwise, the bag is negative. Let a bag be an email, and an instance be a part of the email. You see this is a perfect match for the good word attack: even if the inserted good words make the email as a whole look legitimate, the part that carries the actual spam content can still be classified as spam, and that is enough to make the whole bag positive. Now I'm wondering what spammers will do next....
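To make the bag rule concrete, here is a minimal toy sketch in Python. It is not the method from the paper: the word lists, the scoring function, the fixed-size splitting, and all names (`score`, `split_into_instances`, `is_spam_single`, `is_spam_bag`) are assumptions made up for illustration. It only shows why "the bag is positive if any instance is positive" can resist padding with good words.

```python
# Toy word lists (hypothetical, for illustration only).
SPAM_WORDS = {"cheap", "viagra", "lottery", "winner"}
GOOD_WORDS = {"meeting", "report", "schedule", "thanks", "project"}


def score(tokens):
    """Toy spam score: (spam hits - good hits) / number of tokens."""
    if not tokens:
        return 0.0
    spam_hits = sum(t in SPAM_WORDS for t in tokens)
    good_hits = sum(t in GOOD_WORDS for t in tokens)
    return (spam_hits - good_hits) / len(tokens)


def split_into_instances(tokens, size=4):
    """Split an email (the bag) into consecutive chunks (the instances)."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def is_spam_single(tokens, threshold=0.0):
    """Single-instance view: score the whole email at once."""
    return score(tokens) > threshold


def is_spam_bag(tokens, size=4, threshold=0.0):
    """Multi-instance view: the bag is spam if ANY instance is spam."""
    return any(score(inst) > threshold
               for inst in split_into_instances(tokens, size))


spam = "cheap viagra lottery winner".split()
padding = "meeting report schedule thanks project meeting report schedule".split()
attacked = spam + padding   # good word attack: pad spam with good words
ham = "meeting report schedule thanks".split()

print(is_spam_single(spam))      # plain spam, whole-email view
print(is_spam_single(attacked))  # padded spam, whole-email view
print(is_spam_bag(attacked))     # padded spam, bag view
print(is_spam_bag(ham))          # legitimate email, bag view
```

In this toy setup the padding pulls the whole-email score below the threshold, while the chunk that holds the spam words still scores as spam on its own. Note that a fixed-size split like this is easy to defeat (for example by interleaving good words with spam words), so how an email is split into instances matters a lot in practice.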

4 comments:

Unknown said...

Hi, guy! In my opinion, you might have used "false positive" and "false negative" the wrong way round.
I appreciate these interesting posts as well as your motivation, and wish for more discussion with you in the future.

TU said...

Thank you :)
For the "false positive/negative" issue, here the classifier is trained to identify spam, so "positive" means "classified as spam". This is also the convention used in that paper.

Unknown said...

Thanks for your illumination, I had misunderstood it.

thinktank said...

Hello
This is asraful.
I am new to AI.
And I am interested in "Complex Mapping in Ontology Alignment".
So far I have found your research blog helpful.