Saturday, November 1, 2008

Learning from positive and unlabeled data

In some machine learning scenarios, it's hard or unnatural to get negative training samples. For example, it's easy to find a collection of articles on some topic, but it may be hard to find a representative collection of articles that are not on that topic. Another example: in the research community, people tend to publish their positive results but not their negative ones. This has led to research on one-class classification, which learns a classifier from positive data only. One such technique is the one-class SVM. However, one-class classifiers are usually sensitive to their parameters, and how to tune those parameters is still largely an open problem (correct me if I'm wrong). You can't tune them by means like cross-validation, because you don't have negative data!

So semi-supervised learning comes to the rescue. Semi-supervised learning tries to learn a classifier from both labeled and unlabeled data, and unlabeled data is usually easy to get. There are already quite a few methods for learning from positive and unlabeled data. A recent one is "Learning Classifiers from Only Positive and Unlabeled Data" (Elkan and Noto, KDD 08). This paper makes the "selected completely at random" assumption, which says that whether a positive sample is labeled or not is independent of the sample itself. From that they derive that the probability of a sample being labeled is a constant times the probability of it being positive, which basically turns the problem into a traditional classification problem (labeled vs. unlabeled) and is simpler than many of the previous methods. The experimental results in the paper show that this method is as good as the best previous work, biased-SVM (which is also much less efficient due to parameter tuning).
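To make the idea concrete, here is a minimal sketch of the core trick as I understand it, not the authors' code. Let s=1 mean "labeled" and y=1 mean "positive". Under the "selected completely at random" assumption, p(s=1|x) = c * p(y=1|x), where c = p(s=1|y=1) is a constant. So you train any probabilistic classifier g to separate labeled from unlabeled samples, estimate c as the average of g(x) over held-out labeled samples, and divide. Everything specific below is my own assumption for illustration: the toy 1-D Gaussian data, the label frequency of 0.3, and the hypothetical `HistogramClassifier`, which just stands in for whatever well-calibrated classifier you would really use.

```python
import random

def make_pu_data(n_pos=4000, n_neg=4000, label_freq=0.3, seed=0):
    """Toy 1-D PU data as (x, s, y) triples; y is hidden in a real PU problem."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_pos):
        x = rng.gauss(2.0, 1.0)
        # "Selected completely at random": the labeling decision ignores x.
        s = 1 if rng.random() < label_freq else 0
        data.append((x, s, 1))
    for _ in range(n_neg):
        data.append((rng.gauss(-2.0, 1.0), 0, 0))  # negatives are never labeled
    rng.shuffle(data)
    return data

class HistogramClassifier:
    """Hypothetical stand-in classifier: estimates p(s=1 | x) per fixed-width bin."""
    def __init__(self, lo=-6.0, hi=6.0, width=0.5):
        self.lo, self.width = lo, width
        self.n_bins = int((hi - lo) / width)
        self.labeled = [0] * self.n_bins
        self.total = [0] * self.n_bins

    def _bin(self, x):
        i = int((x - self.lo) / self.width)
        return min(max(i, 0), self.n_bins - 1)  # clamp outliers to the edge bins

    def fit(self, xs, ss):
        for x, s in zip(xs, ss):
            b = self._bin(x)
            self.total[b] += 1
            self.labeled[b] += s
        return self

    def predict_proba(self, x):
        b = self._bin(x)
        return self.labeled[b] / self.total[b] if self.total[b] else 0.0

data = make_pu_data()
split = int(0.75 * len(data))
train, holdout = data[:split], data[split:]

# Step 1: an ordinary classifier for "labeled vs. unlabeled" (only x and s are used).
g = HistogramClassifier().fit([x for x, s, _ in train], [s for x, s, _ in train])

# Step 2: estimate c = p(s=1 | y=1) as the mean of g(x) over held-out labeled samples.
held_labeled = [x for x, s, _ in holdout if s == 1]
c_hat = sum(g.predict_proba(x) for x in held_labeled) / len(held_labeled)

# Step 3: rescale to get p(y=1 | x); clip because the estimates are noisy.
def p_positive(x):
    return min(1.0, g.predict_proba(x) / c_hat)

# Averaging the same identity over x gives p(s=1) = c * p(y=1).
labeled_rate = sum(s for _, s, _ in data) / len(data)
prior_hat = labeled_rate / c_hat

print("estimated c:", round(c_hat, 3))
print("estimated p(y=1):", round(prior_hat, 3))
print("p(y=1 | x=2):", round(p_positive(2.0), 3))
print("p(y=1 | x=-2):", round(p_positive(-2.0), 3))
```

Note that the true labels y are never touched during training or estimation; they are only in the toy data so you can check the result afterwards. How well this works depends on g being reasonably calibrated, which is why I used a simple per-bin frequency estimate here rather than a model whose raw scores might need calibrating first.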

2 comments:

Anonymous said...

I am wondering how to step into this field, machine learning. It seems various learning methods and techniques are under discussion, so how did you manage to find one?

TU said...

If you are asking how to find a research topic... I guess you could spend a year or two reading and studying to get a big picture of the field. You don't need to understand the details of every single method you come across, except the classic ones. In this process you will find some interesting topics that you'd like to work on, and then you can study those specific sub-fields in more detail, which may bring you further interesting topics and ideas, and so on. Just my two cents :)