Monday, September 27, 2010

Three types of information for learning structured predictors

Structured prediction tries to predict the hidden structure of the input. In other words, it's a prediction problem with multiple mutually dependent outputs. An example is sentence parsing, where given an input sentence we want to predict its syntactic parse tree. In the classic setting (i.e., supervised structured prediction), the training data is:

1. input data with their hidden structures labeled, e.g., sentences with their syntactic parse trees

And of course, we can also have:

2. input data without structure labeling, e.g., just sentences

When both type-1 and type-2 info is available, it's called semi-supervised structured prediction. When only type-2 info is available, it's an unsupervised structured prediction problem, which is usually very hard.

For many structured prediction problems, there is a related binary classification problem, which distinguishes valid inputs from invalid ones. One can view a valid input as one that has at least one valid structure, and an invalid input as one that has none. In the field of formal grammar, the binary classification problem is associated with the weak generative capacity of a grammar, and the structured prediction problem is associated with the strong generative capacity of a grammar. So we can have a third type of information:

3. invalid inputs, e.g., ungrammatical sentences

The question is whether this type-3 info can help structured prediction. This is studied in "Structured Output Learning with Indirect Supervision" (ICML 2010). They found that it does help, because invalid inputs further restrict the space of correct structures, as illustrated in their Figure 1.

Now let's turn to the binary classification problem. If both type-2 and type-3 info is available, this is classic supervised learning; if only type-2 or only type-3 info is available, it is a one-class classification problem. So one might wonder whether type-1 info would help the binary classification. This is also tested in the same paper, and interestingly, the answer is again yes.
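To make the combination of the two kinds of supervision concrete, here is a minimal sketch of a joint objective in the spirit of that paper: labeled structures contribute a structured hinge loss, while valid/invalid inputs contribute a binary loss on the best-scoring structure. The feature setup and function names are hypothetical, not the paper's exact formulation.

```python
import numpy as np

def score(w, phi):
    """Linear score of one (input, structure) feature vector."""
    return float(w @ phi)

def joint_loss(w, labeled, binary, c_struct=1.0, c_bin=1.0):
    """Joint objective: direct structured supervision plus indirect
    binary supervision. `labeled` holds (phi_gold, candidate_phis)
    pairs; `binary` holds (y, candidate_phis) pairs with y in {+1, -1}."""
    loss = 0.5 * float(w @ w)
    # Direct supervision: structured hinge loss against the gold structure.
    for phi_gold, candidates in labeled:
        worst = max(score(w, phi) - score(w, phi_gold) + 1.0
                    for phi in candidates)
        loss += c_struct * max(0.0, worst)
    # Indirect supervision: a valid input (y=+1) should admit at least one
    # high-scoring structure; an invalid input (y=-1) should admit none.
    for y, candidates in binary:
        best = max(score(w, phi) for phi in candidates)
        loss += c_bin * max(0.0, 1.0 - y * best)
    return loss

# Toy usage with 3-dimensional features (made-up numbers).
w = np.zeros(3)
labeled = [(np.array([1., 0., 1.]),
            [np.array([1., 0., 1.]), np.array([0., 1., 0.])])]
binary = [(-1, [np.array([0., 1., 0.])])]
print(joint_loss(w, labeled, binary))
```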

Wednesday, September 1, 2010

Unsupervised Multilingual Grammar Induction

Grammar induction is the process of learning a grammar (e.g., a context-free grammar) from texts. If no grammatical annotation of the texts is available to the learner, then it's called unsupervised grammar induction. This sounds like (and actually is) a hard problem, especially for complicated grammars like those of natural languages. But every human actually goes through such a procedure early in life, although many researchers argue that humans have more than plain language input when learning their first language, e.g., perceptual context.

There's some recent work on multilingual grammar induction, that is, learning from the texts of multiple languages simultaneously, in the hope that through the commonality of different languages one can get more useful info than is available from a monolingual corpus of limited size. The commonality is usually represented by the alignment of a parallel corpus, i.e., the mapping of words between sentences of the same meaning in different languages. At ACL 2010, however, there happened to be two papers on unsupervised multilingual grammar induction that use no alignment at all.

The first is "Learning Common Grammar from Multilingual Corpus", which learns a probabilistic context-free grammar for each language. The assumption is that all grammars share the same set of nonterminals, and for each nonterminal, the rule probability distribution differs across languages but is drawn from the same Dirichlet prior. One problem with this is that the different orderings of grammatical components in different languages are not taken into account. Also, there's no quantitative evaluation in the paper.
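A toy sketch of this style of parameter tying, assuming a single shared nonterminal with four possible expansions (all numbers are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# One shared nonterminal with four possible expansions; the concentration
# vector alpha is the common Dirichlet prior shared by all languages.
alpha = np.array([2.0, 1.0, 1.0, 0.5])

# Each language draws its own rule distribution from the SAME prior, so
# the distributions differ per language but are pulled toward the common
# prior mean alpha / alpha.sum().
theta_en = rng.dirichlet(alpha)
theta_ja = rng.dirichlet(alpha)
print(theta_en, theta_ja, alpha / alpha.sum())
```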

The second paper, "Phylogenetic Grammar Induction", proposes a more sophisticated method. Here a type of dependency grammar is used. The parameters of the grammars of different languages are tied, not by a single common prior, but by a set of priors arranged in a phylogenetic tree. The commonality of different languages is represented by the mapping of their part-of-speech tags. They did a more complete evaluation and found that a more accurate/complete phylogenetic tree resulted in better grammars.
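A hedged sketch of tree-structured tying, where each language's Dirichlet prior is centered on its parent's parameters; the tree, the languages, and the coupling strength are illustrative assumptions, not the paper's actual model:

```python
import numpy as np

rng = np.random.default_rng(1)
kappa = 50.0                              # coupling strength (assumed)

# Each node's Dirichlet prior is centered on its parent's parameters, so
# closely related languages end up with similar distributions.
root = rng.dirichlet(np.ones(4))          # "proto-language" parameters
romance = rng.dirichlet(kappa * root)     # drawn around the root
spanish = rng.dirichlet(kappa * romance)  # drawn around its parent
italian = rng.dirichlet(kappa * romance)
print(spanish, italian)                   # siblings tend to be similar
```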

The success of multilingual grammar induction, especially the result of the 2nd paper, has some very interesting implications. One difficulty of unsupervised grammar induction is that it usually yields an alternative grammar different from the one we use (see error type 3 in my previous post "Errors of Unsupervised Learning"). From the results of these papers, it seems that the sets of alternative grammars of different languages are more or less different, but they all contain the grammar we actually use, so multilingual grammar induction can eliminate those alternatives while finding the right one. But why? Because of the mysterious universal grammar?

Wednesday, August 25, 2010

Graphical model structure learning using logistic regression

It seems to me that one of the reasons learning graphical model structure is hard is that the optimization is over a discrete space. In addition, when using score-based methods for structure learning (see my old post for some intro), one has to do inference over parameters separately from the search over structures, because a parameter set is specific to a structure.

One strategy that circumvents these problems is to encode the structure in the parameters. That is, we assume a fully connected graph, and the parameters indicate which edges are effective. A prior over structures can now be folded into the prior over parameters; for example, L0 or L1 regularization of the parameters can enforce structural sparsity.

One problem with this strategy is that, in the case of discrete variables, a fully connected graph means we have exponentially many parameters. To avoid this problem, one can use micro-models for the CPTs (conditional probability tables), e.g., noisy-OR, logistic regression, etc. L1-regularized logistic regression is a well-studied problem that can be optimized efficiently.

It turns out that similar ideas have already been studied in recent years. For example, this paper from NIPS 2006 used L1-regularized logistic regression to learn the structure of a discrete Markov network. The case of learning Bayesian network structure is more difficult, because there's no intuitive way to represent edge directions in parameters while avoiding cycles. One solution is to first learn the skeleton of the Bayesian network using the above idea, and then determine the edge directions, just as in constraint-based structure learning methods. Another solution is to assume an ordering of the variables, which uniquely determines the edge directions; an order search can then be carried out to find a good ordering. These two methods are discussed in this paper from AAAI 2007.
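Here is a minimal sketch of the Markov network idea on synthetic binary data: regress each variable on all the others with an L1 penalty, and read the estimated neighborhood off the nonzero coefficients. The data, regularization strength, and variable indices are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
X = rng.integers(0, 2, size=(n, 5))          # binary variables 1..5
# Variable 0 depends only on variables 1 and 3 (its true neighbors).
logits = 2.0 * X[:, 0] - 2.0 * X[:, 2]
x0 = (rng.random(n) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

# The L1 penalty drives the weights of non-neighbors toward zero, so the
# nonzero coefficients estimate the neighborhood of variable 0. Repeat
# for every variable to recover the full graph.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(X, x0)
print(clf.coef_)   # roughly [w, 0, -w, 0, 0] for some positive w
```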

Friday, August 6, 2010

Mind Reading

Decoding neural activity using machine learning methods has been an emerging area over the past few years. The neural data is usually obtained by presenting the word and/or image of a concept to an experiment participant and recording their brain images (e.g., fMRI, EEG). The types of concepts tested so far are very simple (e.g., concrete nouns, and more recently adjective-noun compositions), but I believe experiments on more complex and abstract concepts are to be expected in the near future (or are already in progress!). Given the neural imaging data, one natural task is to find the mapping between concepts and images. An intermediate layer of semantic features can be added between concepts and images, which is intuitive and also makes things more tractable. So now the problems are what the right semantic features are, and how to learn the mappings between these layers.

For the first problem, in earlier work the features were constructed more or less manually. In "Predicting Human Brain Activity Associated with the Meanings of Nouns" (Science, May 2008), the semantic features are the co-occurrence frequencies of the stimulus noun with 25 manually selected verbs in a large corpus. In a more recent paper, "A Neurosemantic Theory of Concrete Noun Representation Based on the Underlying Brain Codes" (PLoS ONE, Jan 2010), these features are discovered from the fMRI data by means of factor analysis (and the result is very interesting: the three main features are related to manipulation, shelter and eating, all of which were among the most important things for the survival of our primitive ancestors). With the semantic features specified, the second problem can be addressed by simply applying common machine learning predictors like Naive Bayes.
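As a concrete illustration of the feature-to-image mapping, here is a minimal sketch that fits a separate linear model per voxel, in the spirit of the Science 2008 paper; the shapes and the random data are placeholders, not the actual experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n_nouns, n_feats, n_voxels = 60, 25, 500

F = rng.random((n_nouns, n_feats))     # semantic features per noun
Y = rng.random((n_nouns, n_voxels))    # fMRI activation per noun (toy data)

# Least-squares fit: every voxel gets its own weight vector over the 25
# semantic features; predicting a new noun's image is a matrix product.
W, *_ = np.linalg.lstsq(F, Y, rcond=None)
predicted = F @ W
print(predicted.shape)                 # (60, 500)
```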

Thursday, July 22, 2010

Errors of Unsupervised Learning

The paper "Analyzing the Errors of Unsupervised Learning" (ACL 2008) decomposes the errors of unsupervised learning into four types.

1. Optimization error. In most cases we cannot find the global optimum of the objective function, and we resort to methods like EM or gradient descent that yield a local optimum. This seems to me the most widely recognized type of error.

2. Estimation error. We only have a finite training set, from which we cannot fully recover the true distribution. Obviously, this error can be reduced by increasing the size of the training set. The paper also found that a larger training set can help reduce the first type of error as well, possibly by making the objective function smoother.

3. Identifiability error. There are hidden variables (e.g., labels) in unsupervised learning. So there could be a few different models that achieve the same maximal objective function value (over observables, e.g., likelihood of the observables) but are at different distances from the true distribution (over both observable and hidden variables); see the toy illustration after this list.

4. Approximation error. This is the discrepancy between the best model that maximizes the objective function and the true model. In my opinion this can be further decomposed into two types of errors:
4.1. The model family we search over may not include the true model. For example, using a hidden Markov model to fit data sampled from a probabilistic context-free grammar.
4.2. The objective function we use may not find the best model in the model family. For example, if the objective function is the posterior given the data, then different priors would lead to different models, and the better the prior reflects the true model, the better the resulting model is.
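Here is a tiny numeric illustration of the identifiability error in type 3, assuming a two-component mixture over a binary observable (all numbers are made up):

```python
# Two mixture models over a binary observable x with a hidden label z.
# Model B is model A with the component labels swapped.
def p_x1(p_z0, p_x1_given_z0, p_x1_given_z1):
    return p_z0 * p_x1_given_z0 + (1 - p_z0) * p_x1_given_z1

print(p_x1(0.7, 0.9, 0.2))   # model A: P(x=1) = 0.69
print(p_x1(0.3, 0.2, 0.9))   # model B: P(x=1) = 0.69, identical
# Both models assign every observation the same likelihood, yet they
# disagree on which hidden label generated it, so their distances to the
# true joint distribution over (x, z) can differ.
```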

Monday, March 1, 2010

Measuring Model Complexity

In model selection and model averaging, it is important to have a good measure of model complexity. Of two models that fit the data equally well, the simpler one should be preferred (Occam's razor). One reason is that the simpler model usually generalizes better than the more complex one. However, it's not clear what the best way to measure model complexity is.

The paper "Measuring Model Complexity with the Prior Predictive" (NIPS 2009) lists a few aspects of a model that a good complexity measure should take into account:
  • the number of parameters
  • the functional form (the way the parameters are combined in the model equation)
  • the parameter range
  • the prior distribution of the parameters
Therefore the complexity measures used in traditional criteria like AIC and BIC, which essentially only count the number of parameters, are not satisfactory. This paper proposes a measure called prior predictive complexity (PPC). Recall that a model defines a distribution over all possible outcomes (i.e., observables). PPC is defined as the range of outcomes on which the model puts most of its probability mass (say, 95%), normalized by the total range of outcomes. Since all four aspects of a model listed above influence the distribution over outcomes, PPC is sensitive to all of them. A natural alternative measure, in my opinion, would be the information entropy of the distribution over outcomes defined by the model, so we don't have to artificially specify a threshold like 95%.
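To make the definition concrete, here is a minimal sketch of a PPC-style computation for a model of n binomial trials, estimated by sampling from the prior; the binomial setup and the 95% threshold are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np

def ppc(prior_sampler, n_trials=20, n_sims=100_000, mass=0.95):
    """Fraction of the outcome range needed to cover `mass` of the
    prior predictive distribution of a binomial(n_trials) model."""
    rng = np.random.default_rng(0)
    theta = prior_sampler(rng, n_sims)            # sample parameters
    outcomes = rng.binomial(n_trials, theta)      # prior predictive samples
    probs = np.bincount(outcomes, minlength=n_trials + 1) / n_sims
    probs = np.sort(probs)[::-1]                  # most probable first
    k = int(np.searchsorted(np.cumsum(probs), mass)) + 1
    return k / (n_trials + 1)                     # normalized outcome range

# A concentrated prior yields a "simpler" model than a vague one.
print(ppc(lambda rng, n: rng.beta(50, 50, n)))    # tight prior near 0.5
print(ppc(lambda rng, n: rng.beta(1, 1, n)))      # uniform prior
```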