Wednesday, November 26, 2008

Soft clustering with exemplar-based model

My previous post discussed affinity propagation, which is a hard clustering algorithm, i.e., it assigns each data point to exactly one cluster. Soft clustering, on the other hand, gives the degree or probability of one data point belonging to each cluster. This could be more flexible than hard clustering in many scenarios. The most common way of doing soft clustering is running EM on a mixture model (typically a Gaussian mixture). The problem is that the objective function (likelihood) may contain many local optima, making the algorithm unlikely to find a good solution in one run.

Remember that affinity propagation is different from typical clustering in that it requires cluster centers to be data points (exemplars). We can extend this idea to soft-clustering, and doing this greatly reduces the search space. Even better, this paper "Convex Clustering with Exemplar-Based Models" at NIPS-07 finds that the search space is actually convex. So, the problem becomes a convex optimization problem, and the global optimum can be found. Of course there's a downside of being exemplar-based, that we have to assume there are data points close enough to the real cluster centers, which is not always true.

Tuesday, November 11, 2008

Clustering by affinity propagation

Affinity propagation (AP) is a novel clustering algorithm, first published in NIPS 05 (Mixture modeling by Affinity Propagation) and then in Science, Feb 07 (Clustering by Passing Messages Between Data Points). Slightly different from common clustering setting, it tries to find a subset of the date points as the cluster centers called "exemplars". It iteratively passes two types of messages between each pair of data points. The first type of messages is called responsibility, which is how much the source date point prefers the target data point as its exemplar. The other type of messages is called availability, which is how much the source data point would like to be the exemplar of the target data point. These two types of messages influence each other (for example, if b is more available to a than c is, then a would probably prefer b to c as its exemplar), so we can update them iteratively until convergence. Then we combine responsibility with availability to determine the clustering. Very intuitive right? A nice feature is that AP can determine the cluster number automatically, based on the a priori setting of how preferably each data point is an exemplar.

What impressed me is that, such an intuitive algorithm is actually derived from a standard inference algorithm on factor graphs. A factor graph can be constructed representing the factorization of the energy function of the clustering as well as the constraint that each cluster center must belong to the cluster. The max-sum algorithm (a message passing algorithm, highly related to the loopy belief propagation algorithm on Bayesian networks) can then be used to find an approximate solution (i.e., clustering) that minimizes the energy function. The original max-sum algorithm has exponential time complexity on this fact graph, but can be simplified to O(n^2), resulting in the AP algorithm.

Some additional story:
One year after the publication of AP in Science, there was a technical comment published in Science, Feb 2008. The authors of the comment found that a 40-year-old algorithm called vertex substitution heuristic or VSH (a greedy search algorithm starting from a random initialization) often produces lower error rate than AP with comparable computational time. So they concluded that "...(AP) although a welcome addition to the clustering literature, does not possess the relative advantages over existing procedures that were touted in (the AP paper)." Apparently they were not very happy with AP. Then there was a response from the AP guys, in the same issue of Science. They confirmed the experimental results in the comment, but they also tested some much larger data sets. What they found is that, AP and VSH had comparable error rates over all the data sets, but VSH took significantly more time on larger data sets. On the largest data set that they tested (with 18k data points), AP took 2 hours while VSH took 10 days. So AP is actually much more efficient. Still, I think it would be nice to see how AP compares with many other widely used clustering algorithms on problems of different sizes. From one of the authors' homepage, it seems that they've done some more comprehensive tests, but the result is yet to be published.

Saturday, November 1, 2008

Learning from positive and unlabeled data

In some machine learning scenario, it's hard or unnatural to get negative training samples. For example, it's easy to find a collection of articles on some topic, but it may be hard to find a collection not on some topic. Another example is, in research community people tend to publish their positive results but not negative results. This leads to the research of one-class classification, which learns a classifier from positive data only. One such technique is one-class SVM. However, one-class classifiers are usually sensitive to parameters, and how to do parameter tuning on them is still largely an open problem (correct me if I'm wrong). You can't tune them by means like cross-validation because you don't have negative data!

So semi-supervised learning comes to rescue. Semi-supervised learning tries to learn a classifier from both labeled and unlabeled data, and unlabeled data is usually easy to get. There have been quite some methods on learning from positive and unlabeled data. A recent one is "Learning Classifiers from Only Positive and Unlabeled Data" (KDD 08). This paper makes the assumption of "selected completely at random", which says that whether a positive sample is labeled or not is independent of the data sample itself. From that they make some nice derivations and basically turn the problem into a traditional classification problem, which is simpler than many of the previous methods. The experimental results show that this method is as good as the best previous work, biased-SVM (which is also much less efficient due to parameter tuning).