Wednesday, December 10, 2008
Months ago, I had a post about the work by Bromberg and Margaritis on graphical model structure learning. Here is another paper by them, "Efficient and Robust Independence-Based Markov Network Structure Discovery" (IJCAI-07), which seems even more interesting to me.
The paper has two main contributions. The first is making constraint-based structure learning more efficient by always choosing the most informative and least expensive test to perform. The second is making the structure learning robust to unreliable test results. This second goal is the same as in their other work discussed in my previous post, but here they use a totally different method: they update the posterior over structures given the test results, instead of simply accepting or rejecting structures as traditional constraint-based methods do. Then there's the problem that there are exponentially many structures, so you can't keep track of all of them. Their solution: a particle filter!
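To make the idea concrete, here is a minimal sketch of the particle filter part as I understand it. This is just a generic sequential importance resampling loop, not the authors' exact algorithm; the structures, tests, and the likelihood function (which models how probable a test outcome is under a given structure and some test-reliability model) are all placeholders.

```python
import random

def particle_filter_structures(particles, weights, tests, likelihood, resample_threshold):
    """particles: candidate structures; weights: their current posterior weights;
    tests: a stream of (test, outcome) pairs; likelihood(structure, test, outcome)
    returns P(outcome | structure) under some model of test reliability."""
    for test, outcome in tests:
        # Reweight each candidate structure by how well it explains the test outcome.
        weights = [w * likelihood(s, test, outcome) for s, w in zip(particles, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Resample when the weights degenerate (low effective sample size).
        ess = 1.0 / sum(w * w for w in weights)
        if ess < resample_threshold:
            particles = random.choices(particles, weights=weights, k=len(particles))
            weights = [1.0 / len(particles)] * len(particles)
    return particles, weights
```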
Remember there are two types of methods for structure learning: score-based and constraint-based (a.k.a. independence-based). Although this method is still constraint-based (because it's based on independence tests), I feel it also resembles score-based methods. By using a posterior instead of hard constraints, the problem becomes more like an optimization problem than a constraint satisfaction problem.
I'm currently refreshing/supplementing my knowledge of particle filters, and perhaps my next post will be about them.
Wednesday, November 26, 2008
Soft clustering with exemplar-based model
My previous post discussed affinity propagation, which is a hard clustering algorithm, i.e., it assigns each data point to exactly one cluster. Soft clustering, on the other hand, gives the degree or probability of each data point belonging to each cluster. This can be more flexible than hard clustering in many scenarios. The most common way of doing soft clustering is running EM on a mixture model (typically a Gaussian mixture). The problem is that the objective function (the likelihood) may contain many local optima, making the algorithm unlikely to find a good solution in one run.
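As a tiny illustration of what "soft" means here, this is Gaussian mixture soft clustering with scikit-learn (used just for convenience; the data is made up, and multiple random restarts are the usual workaround for the local-optima problem):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two made-up blobs in 2D.
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 4])

gmm = GaussianMixture(n_components=2, n_init=5).fit(X)  # n_init restarts to dodge local optima
soft = gmm.predict_proba(X)  # soft[i, k] = probability that point i belongs to cluster k
```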
Remember that affinity propagation is different from typical clustering in that it requires cluster centers to be data points (exemplars). We can extend this idea to soft clustering, and doing so greatly reduces the search space. Even better, the paper "Convex Clustering with Exemplar-Based Models" (NIPS-07) shows that the resulting optimization problem is actually convex, so the global optimum can be found. Of course there's a downside to being exemplar-based: we have to assume there are data points close enough to the real cluster centers, which is not always true.
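Here is my rough sketch of the exemplar-based formulation (assuming I read the paper right): every data point doubles as a candidate cluster center with a fixed-bandwidth Gaussian component, and only the mixture weights over exemplars are optimized, which is what makes the problem convex. The bandwidth beta is a free parameter of this sketch, and the update is just the standard EM step for mixture weights:

```python
import numpy as np

def convex_exemplar_clustering(X, beta=1.0, n_iter=200):
    # f[i, j] = density of point i under the fixed component centered at exemplar j
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    f = np.exp(-beta * d2)
    n = len(X)
    q = np.full(n, 1.0 / n)  # mixture weights over exemplars
    for _ in range(n_iter):
        p = f * q                          # unnormalized responsibilities
        p /= p.sum(axis=1, keepdims=True)  # soft assignment of each point to each exemplar
        q = p.mean(axis=0)                 # EM update for the weights (the only free parameters)
    return q, p  # q: exemplar weights (most shrink toward 0); p: soft assignments
```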
Tuesday, November 11, 2008
Clustering by affinity propagation
Affinity propagation (AP) is a novel clustering algorithm, first published at NIPS 05 ("Mixture Modeling by Affinity Propagation") and then in Science, Feb 07 ("Clustering by Passing Messages Between Data Points"). Slightly different from the common clustering setting, it tries to find a subset of the data points, called "exemplars", to serve as cluster centers. It iteratively passes two types of messages between each pair of data points. The first type of message is called responsibility, which reflects how much the source data point prefers the target data point as its exemplar. The other type is called availability, which reflects how much the source data point would like to be the exemplar of the target data point. These two types of messages influence each other (for example, if b is more available to a than c is, then a would probably prefer b to c as its exemplar), so we can update them iteratively until convergence. Then we combine responsibility and availability to determine the clustering. Very intuitive, right? A nice feature is that AP can determine the number of clusters automatically, based on an a priori preference for how suitable each data point is as an exemplar.
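For the curious, here is a compact sketch of the two message updates, following the equations in the Science paper; s is the similarity matrix, with the diagonal s[k, k] set to the preferences mentioned above, and damping is the usual trick to help convergence:

```python
import numpy as np

def affinity_propagation(s, n_iter=200, damping=0.5):
    n = s.shape[0]
    r = np.zeros((n, n))  # responsibility r(i, k): how strongly i prefers k as its exemplar
    a = np.zeros((n, n))  # availability a(i, k): how appropriate it is for i to choose k
    for _ in range(n_iter):
        # r(i, k) = s(i, k) - max_{k' != k} [a(i, k') + s(i, k')]
        as_ = a + s
        idx = np.argmax(as_, axis=1)
        first = as_[np.arange(n), idx]
        as_[np.arange(n), idx] = -np.inf
        second = as_.max(axis=1)
        r_new = s - first[:, None]
        r_new[np.arange(n), idx] = s[np.arange(n), idx] - second
        r = damping * r + (1 - damping) * r_new
        # a(i, k) = min(0, r(k, k) + sum_{i' not in {i, k}} max(0, r(i', k))), and
        # a(k, k) = sum_{i' != k} max(0, r(i', k))
        rp = np.maximum(r, 0)
        np.fill_diagonal(rp, r.diagonal())
        a_new = rp.sum(axis=0)[None, :] - rp
        diag = a_new.diagonal().copy()
        a_new = np.minimum(a_new, 0)
        np.fill_diagonal(a_new, diag)
        a = damping * a + (1 - damping) * a_new
    exemplars = np.flatnonzero((r + a).diagonal() > 0)
    if len(exemplars) == 0:
        return exemplars, None
    labels = np.argmax(s[:, exemplars], axis=1)    # assign each point to its most similar exemplar
    labels[exemplars] = np.arange(len(exemplars))  # an exemplar is its own exemplar
    return exemplars, labels
```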
What impressed me is that such an intuitive algorithm is actually derived from a standard inference algorithm on factor graphs. A factor graph can be constructed representing the factorization of the clustering's energy function, as well as the constraint that a point chosen as an exemplar must be its own exemplar. The max-sum algorithm (a message passing algorithm, closely related to loopy belief propagation on Bayesian networks) can then be used to find an approximate solution (i.e., a clustering) that minimizes the energy function. The original max-sum algorithm has exponential time complexity on this factor graph, but it can be simplified to O(n^2), resulting in the AP algorithm.
Some additional story:
One year after the publication of AP in Science, a technical comment was published in Science, Feb 2008. The authors of the comment found that a 40-year-old algorithm called the vertex substitution heuristic, or VSH (a greedy search algorithm starting from a random initialization), often produces a lower error rate than AP with comparable computation time. So they concluded that "...(AP) although a welcome addition to the clustering literature, does not possess the relative advantages over existing procedures that were touted in (the AP paper)." Apparently they were not very happy with AP. Then there was a response from the AP guys in the same issue of Science. They confirmed the experimental results in the comment, but they also tested some much larger data sets. What they found is that AP and VSH had comparable error rates over all the data sets, but VSH took significantly more time on larger data sets. On the largest data set they tested (with 18k data points), AP took 2 hours while VSH took 10 days. So AP is actually much more efficient. Still, I think it would be nice to see how AP compares with many other widely used clustering algorithms on problems of different sizes. From the homepage of one of the authors, it seems they've done some more comprehensive tests, but the results are yet to be published.
Saturday, November 1, 2008
Learning from positive and unlabeled data
In some machine learning scenarios, it's hard or unnatural to get negative training samples. For example, it's easy to find a collection of articles on some topic, but it may be hard to find a representative collection of articles that are not on that topic. Another example: in the research community, people tend to publish their positive results but not their negative results. This has led to research on one-class classification, which learns a classifier from positive data only. One such technique is the one-class SVM. However, one-class classifiers are usually sensitive to their parameters, and how to do parameter tuning for them is still largely an open problem (correct me if I'm wrong). You can't tune them by means like cross-validation because you don't have negative data!
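Just to illustrate, here is what one-class classification looks like with scikit-learn's OneClassSVM (the data is made up; nu and the kernel width are exactly the kind of parameters that are hard to tune without any negative examples):

```python
import numpy as np
from sklearn.svm import OneClassSVM

X_pos = np.random.randn(200, 5)             # positive training samples only
clf = OneClassSVM(nu=0.1, kernel="rbf").fit(X_pos)
pred = clf.predict(np.random.randn(10, 5))  # +1 = looks like the positive class, -1 = outlier
```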
So semi-supervised learning comes to the rescue. Semi-supervised learning tries to learn a classifier from both labeled and unlabeled data, and unlabeled data is usually easy to get. There have been quite a few methods for learning from positive and unlabeled data. A recent one is "Learning Classifiers from Only Positive and Unlabeled Data" (KDD 08). This paper makes the assumption of "selected completely at random", which says that whether a positive sample is labeled or not is independent of the sample itself. From that they make some nice derivations and basically turn the problem into a traditional classification problem, which is simpler than many of the previous methods. The experimental results show that this method is as good as the best previous work, biased-SVM (which is also much less efficient due to parameter tuning).
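Here is a minimal sketch of the recipe as I read it: train any probabilistic classifier to separate labeled samples from unlabeled ones, estimate the constant c = P(labeled | positive) on held-out labeled positives, and rescale the predicted probabilities. Logistic regression here is just an arbitrary choice of classifier:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def fit_pu(X, s):
    """X: all samples; s[i] = 1 if sample i is a labeled positive, 0 if unlabeled."""
    X, s = np.asarray(X), np.asarray(s)
    X_tr, X_ho, s_tr, s_ho = train_test_split(X, s, test_size=0.2, stratify=s)
    g = LogisticRegression(max_iter=1000).fit(X_tr, s_tr)  # models P(s=1 | x)
    c = g.predict_proba(X_ho[s_ho == 1])[:, 1].mean()      # estimate of c = P(s=1 | y=1)
    return g, c

def predict_pu(g, c, X_new):
    return g.predict_proba(X_new)[:, 1] / c                # estimate of P(y=1 | x)
```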
Sunday, October 19, 2008
Learn to filter spam
Almost every day we receive lots of spam emails. Training classifiers to filter spam is one of the most successful real-life applications of machine learning. But spammers are getting smarter at counteracting spam filters! One trick they use is the good word attack. By inserting a set of words that often appear in legitimate messages but not in spam, they make spam look legitimate and confuse the filters (increasing false negatives). Even worse, when end users label such emails as spam, adaptive spam filters will learn from these new training samples and may associate those good words with spam (increasing false positives). These days we do sometimes receive such emails, don't we?
So this paper, "A Multiple Instance Learning Strategy for Combating Good Word Attacks on Spam Filters" (JMLR Jun 08), proposes using multi-instance learning for spam filtering. Multi-instance learning learns from and makes predictions on bags of instances, instead of individual instances. If at least one of the instances in a bag is positive, then the bag is positive; otherwise, the bag is negative. Let a bag be an email, and an instance be a part of the email. You see, this is a perfect match for the good word attack. Now I'm wondering what spammers will do next....
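The bag-level rule is almost a one-liner. In the sketch below, split_into_parts and instance_is_spam are placeholders for whatever email segmentation and per-instance classifier one plugs in:

```python
def email_is_spam(email, split_into_parts, instance_is_spam):
    # An email (bag) is spam if at least one of its parts (instances) looks like spam,
    # so inserted blocks of "good words" cannot whitewash a spammy part.
    return any(instance_is_spam(part) for part in split_into_parts(email))
```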
Sunday, October 12, 2008
Feature selection and graphical model structure learning
Feature selection is a technique used in machine learning (classification or regression) to select a subset of the input features so that the problem size is reduced while the learning performance is not compromised. Graphical model structure learning is to learn the graph structure that best represents the independence relations among the random variables of the target probability distribution. At first glance, the two don't seem to be related. But think about it: feature selection tries to find the minimal set of variables that is sufficient to predict the target variable, which is exactly the Markov blanket of the target variable; and finding the Markov blanket is obviously helpful in learning the graphical model structure (and the reverse is also true). This is exactly what the paper "Using Markov Blankets for Causal Structure Learning" in a recent issue of JMLR does.
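As a rough illustration of the connection, here is a grow-shrink style sketch of Markov blanket discovery (in the spirit of Margaritis and Thrun's GS algorithm, much simplified; indep(t, x, given) stands for whatever conditional independence test is used):

```python
def markov_blanket(t, variables, indep):
    mb = set()
    # Grow phase: keep adding variables that are still dependent on t given the current blanket.
    changed = True
    while changed:
        changed = False
        for x in variables:
            if x != t and x not in mb and not indep(t, x, mb):
                mb.add(x)
                changed = True
    # Shrink phase: remove false positives that became independent given the rest of the blanket.
    for x in list(mb):
        if indep(t, x, mb - {x}):
            mb.discard(x)
    return mb
```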
Btw, the first two posts of this blog are both about learning graphical model structure, but... that's mere coincidence! My current research is only remotely related to this area. I'll talk about some other topic in the next post.
Thursday, October 2, 2008
Graphical model structure learning with unreliable tests
Just listened to an interesting talk by Dimitris Margaritis on his work (with Facundo Bromberg) "Improving the Reliability of Causal Discovery from Small Data Sets using Argumentation", which will be published in JMLR. It's a new constraint-based method for learning graphical model structures in the presence of inaccurate independence tests.
Learning graphical model structures is an important but difficult problem in machine learning, in which one has to learn the structure of the graphical model from a set of data generated by the model. There are two types of methods for this: score-based and constraint-based. The former searches for a structure (usually by greedy search) that maximizes some score, e.g., likelihood or posterior probability. The latter performs independence tests among variables, takes the results as constraints, and solves a constraint satisfaction problem.
Traditionally, in constraint-based methods, each independence test removes the candidate structures inconsistent with its result (according to a set of axioms). This implicitly assumes the tests are reliable, which is mostly untrue in practice. So what can be done when facing unreliable independence tests? This work presented by Margaritis proposes doing multiple tests and finding all the contradictions among the results. There is also a reliability score associated with each test (e.g., based on the p-value). So now we can try to accept and reject test results so that they become consistent while also optimizing the overall reliability. Margaritis actually uses a simple, somewhat greedy method which, in my understanding, may not fully optimize the reliability, but the experimental results are still better than those of traditional methods that ignore the reliability issue.
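Just to make the flavor concrete, here is a toy sketch of resolving contradictory tests by reliability. This is not the paper's argumentation framework, only the simplest greedy idea; consistent(accepted, r) stands for a check against the independence axioms:

```python
def resolve_tests(results, reliability, consistent):
    """results: independence-test outcomes; reliability(r): a score, e.g. derived from the p-value;
    consistent(accepted, r): whether r can be added without contradicting what's accepted."""
    accepted = []
    for r in sorted(results, key=reliability, reverse=True):  # most reliable first
        if consistent(accepted, r):
            accepted.append(r)
    return accepted
```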
Monday, September 15, 2008
About this blog
So I'm starting this blog, mainly to share and record what I learn (from conferences, journals, textbooks, etc.) in artificial intelligence (AI), machine learning (ML), natural language processing (NLP), knowledge representation and reasoning (KR), cognitive science (CogSci), and other related fields. Among these areas, I will probably focus more on machine learning, which is my current research area. Another motivation for starting this blog is to make something that gives me an additional push to keep working while adding some fun at the same time.
I'll try to write a new post every one or two weeks, unless I'm too busy trying to meet some hard deadline. If you don't find any update for over a month, chances are I'm being lazy again, so please kick me with some comments to remind me :)
Comments on my posts are more than welcome, as I believe discussion always leads to better understanding. Also, since I'm not a native English speaker, please forgive me if you find some "Engrish" in my posts, and I'd appreciate it if you could point out any mistakes or vagueness.