Sunday, October 19, 2008

Learn to filter spam

Almost every day we receive lots of spam emails. Training classifiers to filter spam is one of the most successful real-life applications of machine learning. But spammers are getting smarter at counteracting spam filters! One trick they use is the good word attack: by inserting words that often appear in legitimate messages but rarely in spam, they make spam look legitimate and confuse the filters (increasing false negatives). Even worse, when end users label such emails as spam, adaptive spam filters will learn from these new training samples and may come to associate those good words with spam (increasing false positives). These days we do sometimes receive such emails, don't we?
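
To see why the attack works, consider a toy linear (naive-Bayes-style) filter; the word weights below are entirely made up for illustration:

```python
# Toy illustration of a good word attack on a linear spam filter.
# The per-word log-odds weights here are invented for this example.
weights = {"viagra": 3.0, "free": 1.5, "meeting": -2.0, "thanks": -1.5}

def spam_score(words):
    """Sum of per-word weights; classify as spam if the score is positive."""
    return sum(weights.get(w, 0.0) for w in words)

spam = ["viagra", "free"]
print(spam_score(spam))                          # 4.5 -> flagged as spam
print(spam_score(spam + ["meeting", "thanks"]))  # 1.0; a few more good
# words would push the score below zero -> false negative
```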

So this paper, "A Multiple Instance Learning Strategy for Combating Good Word Attacks on Spam Filters" (JMLR, June 2008), proposes using multiple-instance learning for spam filtering. Multiple-instance learning learns from and makes predictions on bags of instances, instead of individual instances. If at least one of the instances in a bag is positive, then the bag is positive; otherwise, the bag is negative. Let a bag be an email, and an instance be a part of the email. You can see this is a perfect match against the good word attack: the inserted good words only contribute negative-looking instances, while the spam payload still forms a positive instance, so the bag as a whole is still spam (see the sketch below). Now I'm wondering what spammers will do next....
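
To make the idea concrete, here is a minimal sketch of the bag logic, assuming the email is split into fixed-size chunks of words and some per-instance classifier instance_is_spam is available (both are my placeholder assumptions, not the paper's actual method):

```python
# A minimal sketch of multiple-instance spam filtering.
# Assumptions (not from the paper): the email is split into fixed-size
# chunks of words, and instance_is_spam is any per-instance classifier
# that returns True for a spammy chunk.

def split_into_instances(email_text, chunk_size=50):
    """Treat the email as a bag: each chunk of words is one instance."""
    words = email_text.split()
    return [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]

def bag_is_spam(email_text, instance_is_spam):
    """A bag is positive (spam) if at least one instance is positive."""
    return any(instance_is_spam(chunk)
               for chunk in split_into_instances(email_text))
```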

Sunday, October 12, 2008

Feature selection and graphical model structure learning

Feature selection is a technique used in machine learning (classification or regression) to select a subset of the input features, so that the problem size is reduced without compromising learning performance. Graphical model structure learning aims to learn the graph structure that best represents the independence relations among the random variables of the target probability distribution. At first glance, the two don't seem related. But think about it: feature selection tries to find the minimal set of variables sufficient to predict the target variable, and that set is exactly the Markov blanket of the target variable; finding Markov blankets is obviously helpful in learning the graphical model structure (and the reverse is also true). This is exactly what the paper "Using Markov Blankets for Causal Structure Learning" in a recent issue of JMLR does.
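
Here is a rough sketch of one direction of that connection: growing a Markov blanket estimate around a target variable with conditional independence tests. The test cond_indep is an assumed black box, and real algorithms (e.g., grow-shrink) also need a shrinking phase, which I omit:

```python
# Sketch: grow an estimate of the Markov blanket of a target variable.
# cond_indep(x, y, given, data) is an assumed black-box CI test; real
# algorithms also shrink the set afterward to remove false positives.

def grow_markov_blanket(target, variables, data, cond_indep):
    blanket = set()
    changed = True
    while changed:
        changed = False
        for x in variables:
            if x != target and x not in blanket:
                # Add x if it is still dependent on the target
                # given the current blanket estimate.
                if not cond_indep(x, target, blanket, data):
                    blanket.add(x)
                    changed = True
    return blanket
```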

Btw, the first two posts of this blog are both about learning graphical model structure, but... that's mere coincidence! My current research is only remotely related to this area. I'll talk about some other topic in the next post.

Thursday, October 2, 2008

Graphical model structure learning with unreliable tests

Just listened to an interesting talk by Dimitris Margaritis on his work (with Facundo Bromberg) "Improving the Reliability of Causal Discovery from Small Data Sets using Argumentation", which will be published in JMLR. It's a new constraint-based method of learning graphical model structures in the presence of inaccurate independence tests.

Learning graphical model structures is an important but difficult problem in machine learning, in which one has to learn the structure of the graphical model from a set of data generated by the model. There are two types of methods for this: score-based and constraint-based. The former searches for a structure (usually by greedy search) that maximizes some score, e.g. the likelihood or the posterior. The latter performs independence tests among variables, takes the results as constraints, and solves a constraint satisfaction problem.
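
For a taste of the constraint-based flavor, here is a minimal marginal independence test on discrete data using scipy's chi-square test (my own illustrative snippet, not from the talk or paper):

```python
# A marginal independence test via the chi-square test, one building
# block of a constraint-based method (conditional tests are similar,
# applied within each configuration of the conditioning set).
import numpy as np
from scipy.stats import chi2_contingency

def independent(x, y, alpha=0.05):
    """Test whether discrete samples x and y look independent."""
    x, y = np.asarray(x), np.asarray(y)
    # Contingency table of joint counts over the observed values.
    xs, ys = np.unique(x), np.unique(y)
    table = np.array([[np.sum((x == a) & (y == b)) for b in ys]
                      for a in xs])
    _, p_value, _, _ = chi2_contingency(table)
    return p_value > alpha  # fail to reject the independence hypothesis
```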

Traditionally, in constraint-based methods, each independence test removes the candidate structures inconsistent with the test result (according to a set of axioms). This implicitly assumes the tests are reliable, which is mostly untrue in practice. So what can be done when facing unreliable independence tests? The work presented by Margaritis proposes to do multiple tests and find all the contradictions among the results. There is also a reliability score associated with each test (e.g., based on its p-value). So now we can try to accept or reject test results so as to make them mutually consistent while optimizing the overall reliability. Margaritis actually uses a simple, somewhat greedy method which, in my understanding, may not optimize the reliability, but the experimental results are still better than those of traditional methods that ignore the reliability issue.
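
Here is a rough sketch of how such a greedy resolution might look, under my own simplified data structures (the paper actually uses an argumentation framework, so this is only the flavor of the idea, not its procedure):

```python
# Sketch of greedily resolving contradictory independence test results.
# Each result carries a reliability score, e.g. derived from its
# p-value; contradicts(result, accepted) is an assumed oracle that
# checks a result against the accepted set via the independence axioms.

def resolve(results, contradicts):
    """results: list of (test_result, reliability) pairs."""
    accepted = []
    # Consider the most reliable results first.
    for result, _ in sorted(results, key=lambda r: r[1], reverse=True):
        if not contradicts(result, accepted):
            accepted.append(result)
    return accepted
```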