Wednesday, March 25, 2009

Indian Buffet Process

Nonparametric models don't have a fixed number of effective parameters; instead, the number is determined by the data. This is very useful when we don't know the number of parameters in advance. For example, in a Gaussian mixture model we have to specify K, the number of Gaussian components. If we want to fit the model to some data but don't know the value of K, we have to train a set of models with different values of K and choose among them using some criterion like AIC or BIC. With a nonparametric model (more specifically, using the Dirichlet process (DP), or equivalently the Chinese restaurant process (CRP), as the prior distribution), we assume an infinite number of components, and posterior inference determines how many of them are actually used. The expected number of effective components is controlled by a hyperparameter.
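
To make the role of that hyperparameter concrete, here is a minimal sketch (my own illustration, not code from any of the papers mentioned) of sampling a partition from a CRP in plain Python; the concentration parameter `alpha` controls how many components (tables) end up being occupied:

```python
import random

def chinese_restaurant_process(n_customers, alpha, seed=0):
    """Sample table assignments for n_customers under a CRP with
    concentration parameter alpha (larger alpha -> more tables)."""
    rng = random.Random(seed)
    tables = []          # tables[k] = number of customers at table k
    assignments = []     # assignments[i] = table index of customer i
    for i in range(n_customers):
        # Customer i joins an existing table with probability proportional
        # to its size, or opens a new table with probability prop. to alpha.
        weights = tables + [alpha]
        r = rng.random() * (i + alpha)
        cumulative, choice = 0.0, len(tables)
        for k, w in enumerate(weights):
            cumulative += w
            if r < cumulative:
                choice = k
                break
        if choice == len(tables):
            tables.append(1)      # a new component comes into use
        else:
            tables[choice] += 1
        assignments.append(choice)
    return assignments, tables

# The expected number of occupied tables grows roughly like alpha * log(n).
assignments, tables = chinese_restaurant_process(1000, alpha=2.0)
print(len(tables), "components used")
```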

Because of the popularity of mixture models in AI and machine learning, the DP (and CRP) has become one of the best-known nonparametric methods. One limitation of mixture models is that each data point belongs to exactly one component, while in some cases we'd like to associate each data point with multiple labels (or no label at all). For example, an object may have several features, or belong to multiple classes at the same time. To go beyond mixture models, we need a new nonparametric method. The NIPS 06 paper "Infinite latent feature models and the Indian buffet process" by Griffiths and Ghahramani describes such a model. Since the labels of a data point can be represented by a binary vector, the labeling of a set of data points can be represented by a binary matrix. By taking the limit of a distribution over such matrices (actually over equivalence classes of such matrices, so that the label order doesn't matter; see their paper) as the number of labels approaches infinity, they obtain an extension of the CRP in which each data point can have multiple labels. Just as with the CRP, they also describe a (restaurant-themed) stochastic process that samples from this distribution, called the Indian buffet process (IBP). Now, instead of choosing a single table, each customer chooses buffet dishes, and the number of dishes they can choose is not fixed.
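
As a rough illustration of the process (again my own sketch, not the authors' code), here is how one could sample a binary feature matrix Z from an IBP: customer i takes each existing dish with probability proportional to its popularity so far, then tries a Poisson(alpha/i) number of brand-new dishes.

```python
import numpy as np

def indian_buffet_process(n_customers, alpha, seed=0):
    """Sample a binary matrix Z from an IBP with parameter alpha.
    Row i lists which dishes (features) customer (data point) i takes."""
    rng = np.random.default_rng(seed)
    dish_counts = []                 # how many customers took each dish so far
    rows = []
    for i in range(1, n_customers + 1):
        row = []
        # Take each previously seen dish k with probability m_k / i,
        # where m_k is the number of earlier customers who took it.
        for m_k in dish_counts:
            row.append(1 if rng.random() < m_k / i else 0)
        # Then sample some new dishes that no one has tried before.
        n_new = rng.poisson(alpha / i)
        row.extend([1] * n_new)
        # Update dish counts and pad earlier rows with zeros for new dishes.
        dish_counts = [m + z for m, z in zip(dish_counts, row)] + [1] * n_new
        rows = [r + [0] * n_new for r in rows] + [row]
    return np.array(rows) if rows else np.zeros((0, 0), dtype=int)

Z = indian_buffet_process(10, alpha=3.0)
print(Z.shape)   # (10, K), with K determined by the sampling run itself
```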

An interesting example in their paper demonstrates how the IBP can be used to learn features from a set of images, each containing zero to four objects (with fixed positions, however). By running Gibbs sampling on their model with the IBP as the prior, they obtained seven features, of which the four most frequent corresponded exactly to the four objects.

Friday, March 13, 2009

Learning to select features

Effective feature selection can be computationally expensive when the number of features is large, and in some cases feature values are expensive to obtain, so we don't want to examine every feature with traditional feature selection methods (see budgeted learning). The paper "Learning to Select Features using their Properties" (JMLR 08) proposes a simple idea for such problems: learn to predict good features from a set of meta-features, i.e., properties of the features themselves. For example, in object recognition people sometimes use the presence of a set of prototype patterns as features; the meta-features could then be statistics of those prototype patterns. In their method, they first evaluate only a subset of features using traditional feature selection, and then use the result to train a predictor that selects among the unmeasured features. In their experiments, their method achieved classification accuracy similar to the traditional methods while being much faster. Unfortunately, the meta-features have to be specified manually, but in many cases designing them is much easier than selecting the features directly. A rough sketch of the overall recipe is shown below.
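
The recipe is simple enough to sketch in a few lines. The following is a hypothetical illustration rather than the paper's exact algorithm (their scoring criterion and predictor differ), assuming we already have a matrix `meta` of meta-features describing each feature:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import mutual_info_classif

def select_features_via_metafeatures(X, y, meta, n_probe=50, n_select=20, seed=0):
    """Score a small random subset of features with a standard criterion,
    learn to predict those scores from per-feature meta-features, then rank
    the remaining (unmeasured) features by their predicted scores.

    X    : (n_samples, n_features) data matrix
    y    : (n_samples,) class labels
    meta : (n_features, n_meta) matrix of meta-features, one row per feature
    """
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    probed = rng.choice(n_features, size=min(n_probe, n_features), replace=False)

    # Traditional feature scoring, but only on the probed subset.
    probe_scores = mutual_info_classif(X[:, probed], y, random_state=seed)

    # Learn score ~ meta-features from the probed subset.
    predictor = RandomForestRegressor(n_estimators=100, random_state=seed)
    predictor.fit(meta[probed], probe_scores)

    # Rank all remaining features using their meta-features alone,
    # without ever computing the expensive score for them.
    remaining = np.setdiff1d(np.arange(n_features), probed)
    predicted = predictor.predict(meta[remaining])
    return remaining[np.argsort(predicted)[::-1][:n_select]]
```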