- Rethinking LDA: Why Priors Matter. A second Dirichlet prior is added on top of the Dirichlet prior of the topic distributions in LDA, instead of fixed symmetric hyperparameters. This makes the model much more robust. For efficiency consideration, point estimation can be used on the parameters of the first Dirichlet prior.
- Decoupling Sparsity and Smoothness in the Discrete Hierarchical Dirichlet Process. A Dirichlet prior (with small parameters) has two effects on the posterior of the discrete distribution (e.g., the topic distribution in LDA): sparsity and smoothness. The two are controlled by the same concentration parameter. To decouple the two, this paper adds a set of Bernoulli variables to the model.
- Non-Parametric Bayesian Dictionary Learning for Sparse Image Representations. An application of the beta process (i.e., the Indian buffet process) to computer vision for image dictionary learning, which can be extended for classification, denoising and inpainting, etc.
- Differential Use of Implicit Negative Evidence in Generative and Discriminative Language Learning. Discriminative and generative languages learning differ in the use of the "implicit negative evidence" i.e., the absence of sentences. Psychological experiments show that human are capable of both, depending on how the task is presented.
- Randomized Pruning: Efficiently Calculating Expectations in Large Dynamic Programs. In problems like grammar induction, dynamic programming is usually used to compute expectations, which can be very slow on large data. Pruning can accelerate the computation but introduces bias. Here is a new technique that uses MCMC.
- Reading Tea Leaves: How Humans Interpret Topic Models. A human evaluation of the quality of the topics learned by topic models. They actually used Amazon Mechanical Turk to conduct the evaluation. One interesting result is that the held-out likelihood (or perplexity) may have a negative correlation with the measured quality of the topics.
- Posterior vs Parameter Sparsity in Latent Variable Models. A novel kind of sparsity bias that uses the posterior plus a regularization term as the object function, which is optimized by an EM-like algorithm. The regularization term can represent sparsity bias that can't be easily expressed as priors of model parameters (e.g., the Dirichlet prior).
- An Infinite Factor Model Hierarchy Via a Noisy-Or Mechanism. An extension of Indian buffet process that adds a second (or more) layer of features, which are connected to the first layer by noisy-or, i.e., a low level feature is the noisy-or of a few high level features. This new model learns more compact features and has better performance in various tasks than IBP.
Monday, December 21, 2009
NIPS 2009 My List
Here is a list of NIPS2009 papers that are interesting to me.
Thursday, September 17, 2009
Curriculum Learning
Most machine learning algorithms look at their training data all at once. However, human usually learn in a different, more organized way. We start with simple cases, concepts, or skills, and then gradually increase the difficulty level of the learning. Can this learning strategy also apply to machines? This is what the ICML-09 paper "Curriculum Learning" tries to address. In their three experiments on toy classification problems, shape recognition and language modeling, curriculum learning was shown to be both faster and more accurate than learning without curriculum. Some important questions regarding curriculum learning are discussed but remain largely open. First of all, why does curriculum learning help (or in some cases, doesn't help)? One speculation is that at the beginning of learning, difficult samples can confuse the learner instead of help. A related question is, how shall we find, either for a specific problem or in general, a good curriculum? In their experiments they used some quite intuitive curriculum; but an intuitive curriculum doesn't always help (for example, see Hal Daume's comment on this paper in his blog).
Saturday, June 6, 2009
Convergence of Loopy Belief Propagation
I've been using loopy belief propagation (LBP) for inference recently. LBP is exact on tree-structures; but on graphs with cycles it is approximate and may not even converge. Generally speaking, the less cycles the graph contains and more smooth the factors are, the more likely LBP will converge. Unfortunately, the graphical model that I'm working on is a complete graph; even worse, part of the factors defined over the graph encode hard constraints. In addition, I'm using the max-sum version of LBP, which is thought to be more prone to oscillation than the sum-product version. The combined effect of all these was a nightmare of oscillation, even in simple cases.
There doesn't seem to be much work done on how to help LBP converge, and here is what I've found and tried.
With this experience, I suspect that the convergence problem is one of the reasons that LBP is less used for graphical model inference than MCMC and variational methods. For the latter two, in general we would know in advance whether we can use them or not; but for LBP, in many cases running it seems the only way to know if it works or not.
There doesn't seem to be much work done on how to help LBP converge, and here is what I've found and tried.
- Damping. That is, when updating a message, set it to a weighted average of its old values and new values. This somewhat slows down the pace of LBP and avoids possible overshooting.
- Using a more informed message passing schedule. Instead of updating all the messages synchronously or in a fixed order, we can choose the next message according to some criterion. A very simple yet powerful schedule is called "residual belief propagation" (RBP), which always updates the message that has the most change (residue). An improved (but maybe less general) version of RBP is RBP0L, which estimates the residues instead of actually computing them.
With this experience, I suspect that the convergence problem is one of the reasons that LBP is less used for graphical model inference than MCMC and variational methods. For the latter two, in general we would know in advance whether we can use them or not; but for LBP, in many cases running it seems the only way to know if it works or not.
Wednesday, April 15, 2009
Discretization for naive Bayes
When using naive Bayes, quantitative attributes are usually discretized (binning). Two simple ways of discretization are partitioning the data into k intervals of equal width, or into k intervals so each contains equal number of data points. There are also supervised methods, like the one used in Weka discretizer, which recursively selects the cut-point based on information gain (just like in some decision tree construction algorithms). In the paper "Discretization for naive-Bayes learning: managing discretization bias and variance" (MLJ, Jan 09), the authors find out that, increasing the number of intervals helps reduce the bias of the learned NB classifier, while the number of data points within each interval (called "interval frequency") shall be large enough to reduce the variance of the learned classifier. Therefore, when we have a fixed size of training data, there's a trade-off between the interval number and frequency. Based on this observation they propose two novel discretization methods. "Proportional discretization" requires the interval number and frequency to be the same; "fixed frequency discretization" simply fixes the interval frequency to a sufficiently large value. Their experiments showed that these two methods outperformed existing discretization techniques.
Thursday, April 9, 2009
More on Indian buffet process
I had my lunch today in an Indian buffet for the first time. Since I knew nothing of Indian food, I decided to follow the Indian buffet process, as described in my last post. That is, I chose dishes with probabilities proportional to their popularity, but also chose some random new dishes. The result was good. That said, however, I still don't think I have enough courage to test the Chinese restaurant process in a Chinese restaurant :)
Back to mathematics. Remember that the Chinese restaurant process actually produces samples according to a distribution that is sampled from a Dirichlet process; and the stick-breaking construction is another way to sample from the Dirichlet process. It turns out that there exist counterparts for the Indian buffet process (IBP). The paper "Hierarchical beta processes and the Indian buffet process" (AISTATS 07) shows that for IBP the beta process is the counterpart of the Dirichlet process; and the paper "Stick-breaking Construction for the Indian Buffet Process" (also in AISTATS 07) shows a stick-breaking construction for IBP.
Back to mathematics. Remember that the Chinese restaurant process actually produces samples according to a distribution that is sampled from a Dirichlet process; and the stick-breaking construction is another way to sample from the Dirichlet process. It turns out that there exist counterparts for the Indian buffet process (IBP). The paper "Hierarchical beta processes and the Indian buffet process" (AISTATS 07) shows that for IBP the beta process is the counterpart of the Dirichlet process; and the paper "Stick-breaking Construction for the Indian Buffet Process" (also in AISTATS 07) shows a stick-breaking construction for IBP.
Wednesday, March 25, 2009
Indian Buffet Process
Nonparametric models don't have a fixed number of effective parameters, but instead have it determined by data. This is very useful if we don't know the parameter number in advance. For example, in Gaussian mixture model we have to specify K, the number of Gaussian distributions (components). If we want to use the model to fit some data but don't know the value of K, we have to make a set of models with different values of K, and make our choice based on some criteria like AIC or BIC. With a nonparametric model (more specifically, using Dirichlet process (DP), or equivalently, Chinese restaurant process (CRP) as the prior distribution), we assume an infinite number of components, and do a posterior inference which determines how many of them are actually used. The expected number of effective components is controlled by a hyperparameter.
Because of the popularity of mixture models in AI and machine learning, DP (and CRP) becomes one of the best known nonparametric methods. One limitation of mixture models is that each data point belongs to exactly one component, while in some cases we'd like to associate each data point with multiple labels (or no label at all). For example, an object may have several features, or belong to multiple classes at the same time. To go beyond mixture models, we need a new nonparametric method. The NIPS 06 paper "Infinite latent feature models and the Indian buffet process" by Griffiths and Ghahramani described such a model. Since we can represent the labels of a data point by a binary vector, the labeling of a set of data points can be represented by a matrix. By taking the limit of the distribution over such matrices (actually the equivalence classes of such matrices so that the label order doesn't matter; see their paper) as the number of labels approaches infinity, they get an extension of CRP where each data point can have multiple labels. Just like CRP, they also described a (restaurant-related) stochastic process that samples from the distribution, called Indian buffet process (IBP). Now instead of choosing a table, each customer need to choose buffet dishes, while the number of dishes they can choose is unfixed.
An interesting example is shown in their paper demonstrating how IBP can be used to generate features by learning from a set of images, each containing zero to four objects (with fixed positions however). By running Gibbs sampling on their model with IBP as the prior, they got seven features, where the four most frequent ones correspond exactly to the four objects.
Because of the popularity of mixture models in AI and machine learning, DP (and CRP) becomes one of the best known nonparametric methods. One limitation of mixture models is that each data point belongs to exactly one component, while in some cases we'd like to associate each data point with multiple labels (or no label at all). For example, an object may have several features, or belong to multiple classes at the same time. To go beyond mixture models, we need a new nonparametric method. The NIPS 06 paper "Infinite latent feature models and the Indian buffet process" by Griffiths and Ghahramani described such a model. Since we can represent the labels of a data point by a binary vector, the labeling of a set of data points can be represented by a matrix. By taking the limit of the distribution over such matrices (actually the equivalence classes of such matrices so that the label order doesn't matter; see their paper) as the number of labels approaches infinity, they get an extension of CRP where each data point can have multiple labels. Just like CRP, they also described a (restaurant-related) stochastic process that samples from the distribution, called Indian buffet process (IBP). Now instead of choosing a table, each customer need to choose buffet dishes, while the number of dishes they can choose is unfixed.
An interesting example is shown in their paper demonstrating how IBP can be used to generate features by learning from a set of images, each containing zero to four objects (with fixed positions however). By running Gibbs sampling on their model with IBP as the prior, they got seven features, where the four most frequent ones correspond exactly to the four objects.
Friday, March 13, 2009
Learning to select features
Effective feature selection can be computationally expensive when the feature number is large; and in some cases feature values are expensive to get, so we don't want to examine every feature with the traditional feature selection methods (see budgeted learning). This paper, "Learning to Select Features using their Properties" (JMLR 08), proposes a simple idea to such problems, that they learn to predict good features based on a set of meta-features. Here meta-features are a set of properties of the features. For example, in object recognition people sometimes use the presence of a set of prototype patterns as features, then the meta-features could be the statistics of such prototype patterns. So in their method, they first evaluate only a subset of features using traditional feature selection methods, and then use the result to train a predictor that can be used to select unmeasured features. In their experiments they show their method achieved classification accuracy similar to the traditional methods but was much faster. Unfortunately meta-features have to be specified manually, but nevertheless in many cases it's much more obvious than selecting features.
Monday, January 5, 2009
Modeling online sentence parsing by particle filter
This is an interesting paper from NIPS2008: modeling the effects of memory on human online sentence processing with particle filters. When we read or listen to a sentence, we receive the words incrementally (i.e., one after another), and construct a mental comprehension of the sentence. Dynamic programming can be used to parse a sentence incrementally, but since it can always find the right parsing, it can't explain why people may fail to comprehend "garden-path sentences" in their first attempt. For example, when reading the sentence "the horse raced past the barn fell", we may fail at the last word because we are likely to take "raced" as the main verb before reading the last word "fell". Previous work used pruning to model such an effect: only a set of high-probability partial parsings are kept after receiving each word, so the correct parsing may be dropped halfway. This paper adopts a different idea: in a resource-bounded way, estimating the posterior of partial parsings given the words that have been received. So particle filter becomes a natural choice, where each word is an observation and the partial parsings are the hidden states. Again, the right parsing may be dropped halfway because only a set of particles are maintained, which explains the garden-path effect. In addition, this method can explain the "digging-in effect", which says a longer garden-path sentence is harder to comprehend than a shorter one. For example, compare this sentence with the previous one: "the horse raced past the barn that is big and old fell". The explanation is, with more words in the sentence before the disambiguation point, the particles for the right parsing are more likely to be dropped due to resampling.
Subscribe to:
Posts (Atom)