Monday, September 27, 2010

Three types of information for learning a structured predictor

Structured prediction tries to predict the hidden structure of an input. In other words, it is a prediction problem with multiple outputs that are mutually dependent. An example is sentence parsing: given an input sentence, we want to predict its syntactic parse tree. In the classic setting (i.e., supervised structured prediction), the training data is:

1. input data with their hidden structures labeled, e.g., sentences with their syntactic parse trees

And of course, we can also have:

2. input data without structure labeling, e.g., just sentences

When both type-1 and type-2 info are available, this is called semi-supervised structured prediction. When only type-2 info is available, it is an unsupervised structured prediction problem, which is usually very hard.
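
To make the supervised case concrete, here is a minimal sketch of a structured predictor trained on type-1 data. To keep it tiny, the structure is a part-of-speech tag sequence rather than a parse tree, the model is a structured perceptron, and the argmax is brute force rather than Viterbi; the labels, features, and toy sentences are my own illustration, not from any paper discussed here.

```python
# A minimal sketch of supervised structured prediction (type-1 info only).
from collections import defaultdict
from itertools import product

LABELS = ["DT", "NN", "VB"]

def features(words, labels):
    """Emission and transition features of a (sentence, tag sequence) pair."""
    feats = defaultdict(float)
    prev = "<s>"
    for w, y in zip(words, labels):
        feats[("emit", w, y)] += 1.0
        feats[("trans", prev, y)] += 1.0
        prev = y
    return feats

def score(words, labels, weights):
    return sum(weights.get(f, 0.0) * v for f, v in features(words, labels).items())

def predict(words, weights):
    """Argmax over all tag sequences (exponential, but fine for toy inputs)."""
    return list(max(product(LABELS, repeat=len(words)),
                    key=lambda labels: score(words, labels, weights)))

def train(data, epochs=5):
    """Structured perceptron: reward gold features, penalize predicted ones."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for words, gold in data:
            pred = predict(words, weights)
            if pred != gold:
                for f, v in features(words, gold).items():
                    weights[f] += v
                for f, v in features(words, pred).items():
                    weights[f] -= v
    return weights

# Type-1 data: inputs paired with their hidden structures (here, tag sequences).
data = [(["the", "dog", "barks"], ["DT", "NN", "VB"]),
        (["a", "cat", "sleeps"], ["DT", "NN", "VB"])]
weights = train(data)
print(predict(["the", "cat", "barks"], weights))   # prints ['DT', 'NN', 'VB']
```

The point is simply that the output is a whole structure whose parts are scored jointly, and that this style of training needs inputs paired with their gold structures.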

For many structured prediction problems, there is a related binary classification problem: distinguishing valid inputs from invalid ones. A valid input is one that has at least one valid structure; an invalid input has none. In the theory of formal grammars, the binary classification problem is associated with the weak generative capacity of a grammar, and the structured prediction problem is associated with its strong generative capacity. So we can have a third type of information:

3. invalid inputs, e.g., ungrammatical sentences

The question is whether this type-3 info can help structured prediction. This is studied in "Structured Output Learning with Indirect Supervision" from ICML 2010. The authors found that it does help, because invalid inputs further restrict the space of correct structures, as illustrated in their Figure 1.
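
To make the intuition slightly more concrete, here is a rough sketch, in my own notation rather than the paper's exact formulation, of how type-1 and type-3 information could be combined in a single training objective, assuming a linear scoring function $w^\top\phi(x,y)$ over input-structure pairs and a margin loss $\ell$ (e.g., the hinge $\ell(m)=\max(0,1-m)$):

$$
\min_w \; \frac{\lambda}{2}\lVert w\rVert^2
\;+\; \sum_{(x_i,\,y_i)\in\mathcal{L}} \ell\!\Big(w^\top\phi(x_i,y_i) - \max_{y\neq y_i} w^\top\phi(x_i,y)\Big)
\;+\; C \sum_{x_j\in\mathcal{N}} \ell\!\Big(-\max_{y} w^\top\phi(x_j,y)\Big)
$$

Here $\mathcal{L}$ is the type-1 data (inputs with labeled structures) and $\mathcal{N}$ is the type-3 data (invalid inputs). The last term says that for an ungrammatical sentence even its best-scoring structure must score low, which is exactly how invalid inputs shrink the region where plausible structures can live.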

Now let's turn to the binary classification problem. If both type-2 and type-3 info are available, this is classic supervised learning; if only type-2 or only type-3 info is available, it is a one-class classification problem. So one may wonder whether type-1 info can help the binary classification. This is also tested in the same paper, and interestingly, the answer is again yes.

Wednesday, September 1, 2010

Unsupervised Multilingual Grammar Induction

Grammar induction is the process of learning a grammar (e.g., a context-free grammar) from text. If no grammatical annotation of the text is available to the learner, it is called unsupervised grammar induction. This sounds like (and actually is) a hard problem, especially for grammars as complicated as those of natural languages. Yet every human goes through such a procedure early in life, although many researchers argue that humans have more than plain language input when learning their first language, e.g., perceptual context.
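
As a concrete (and hedged) illustration of what "learning a grammar from text" means in the easiest case, here is a sketch that estimates PCFG rule probabilities by counting rules in a tiny hand-made treebank. Unsupervised grammar induction is the much harder setting where the trees below are not given, only the raw sentences; the treebank and labels are toy examples of my own.

```python
# Supervised PCFG estimation by relative-frequency counting over a toy treebank.
from collections import Counter, defaultdict

# Each tree is a nested (label, children...) tuple; leaves are words.
treebank = [
    ("S", ("NP", ("DT", "the"), ("NN", "dog")), ("VP", ("VB", "barks"))),
    ("S", ("NP", ("DT", "a"), ("NN", "cat")), ("VP", ("VB", "sleeps"))),
]

def count_rules(tree, counts):
    """Recursively count one rule per internal node of the tree."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        counts[(label, (children[0],))] += 1     # lexical rule, e.g. NN -> dog
        return
    counts[(label, tuple(c[0] for c in children))] += 1
    for c in children:
        count_rules(c, counts)

counts = Counter()
for t in treebank:
    count_rules(t, counts)

# Relative-frequency estimate: P(rule) = count(rule) / count(left-hand side).
lhs_totals = defaultdict(int)
for (lhs, _), n in counts.items():
    lhs_totals[lhs] += n
pcfg = {rule: n / lhs_totals[rule[0]] for rule, n in counts.items()}
for rule, p in sorted(pcfg.items()):
    print(rule, round(p, 2))
```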

There is some recent work on multilingual grammar induction, that is, learning from texts of multiple languages simultaneously, in the hope that the commonality among languages provides more useful information than a monolingual corpus of limited size. That commonality is usually represented by the alignment of a parallel corpus, i.e., the mapping between words of sentences that have the same meaning but are in different languages. At ACL 2010, however, there happened to be two papers on unsupervised multilingual grammar induction that do not use alignment.

The first is "Learning Common Grammar from Multilingual Corpus", which learns a probabilistic context-free grammar for each language. The assumption is that the grammars share the same set of nonterminals, and for each nonterminal, the rule probability distribution differs across languages but is drawn from the same Dirichlet prior. One problem with this is that the different ordering of grammatical constituents in different languages is not taken into account. Also, there is no quantitative evaluation in the paper.
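
Here is a minimal sketch of that kind of parameter tying as I understand it: each language gets its own rule probabilities, but for every nonterminal those probabilities are drawn from one Dirichlet prior shared across languages. The nonterminals, rules, languages, and hyperparameters are made up for illustration and are not the paper's actual grammar.

```python
# Per-language PCFG rule probabilities drawn from shared Dirichlet priors.
import numpy as np

rng = np.random.default_rng(0)

# For each nonterminal, the candidate right-hand sides shared by all languages.
RULES = {
    "S":  [("NP", "VP")],
    "NP": [("DT", "NN"), ("NN",)],
    "VP": [("VB", "NP"), ("VB",)],
}

# One shared Dirichlet prior per nonterminal (identical across languages).
shared_alpha = {nt: np.full(len(rhs_list), 1.0) for nt, rhs_list in RULES.items()}

def draw_language_grammar():
    """Draw one language's rule probabilities from the shared priors."""
    return {nt: rng.dirichlet(shared_alpha[nt]) for nt in RULES}

grammars = {lang: draw_language_grammar() for lang in ["en", "ja"]}
for lang, g in grammars.items():
    print(lang, {nt: np.round(p, 2).tolist() for nt, p in g.items()})
```

Note that the candidate right-hand sides, including their orderings, are identical for every language in this sketch, which is exactly the word-order limitation mentioned above.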

The second paper, "Phylogenetic Grammar Induction", proposes a more elaborate method. Here a type of dependency grammar is used. The parameters of the grammars of different languages are tied, not by a single common prior, but by a set of priors arranged in a phylogenetic tree. The commonality of different languages is represented by a mapping between their part-of-speech tags. The authors did a more complete evaluation and found that a more accurate and complete phylogenetic tree results in better grammars.
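
And here is a correspondingly hedged sketch of tying parameters through a phylogenetic tree instead of a single shared prior: each node's parameters are drawn close to its parent's, so closely related languages end up with similar grammars. The toy tree, the single multinomial standing in for grammar parameters, and the drift model are my own illustration, not the paper's formulation.

```python
# Parameters drift down a toy phylogenetic tree, so sibling languages stay close.
import numpy as np

rng = np.random.default_rng(0)

# Toy phylogeny: child -> parent. The root has no parent.
TREE = {"Germanic": "root", "Romance": "root",
        "en": "Germanic", "de": "Germanic",
        "es": "Romance", "it": "Romance"}

def drift(parent_probs, concentration=50.0):
    """Draw child parameters centered on the parent's (Dirichlet around parent)."""
    return rng.dirichlet(concentration * parent_probs)

# Root distribution over, say, three dependency-attachment choices.
params = {"root": np.array([0.5, 0.3, 0.2])}
# Fill in the tree top-down so every node drifts from its parent.
for node in ["Germanic", "Romance", "en", "de", "es", "it"]:
    params[node] = drift(params[TREE[node]])

for lang in ["en", "de", "es", "it"]:
    print(lang, np.round(params[lang], 2))
```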

The success of multilingual grammar induction, especially the results of the second paper, has some very interesting implications. One difficulty of unsupervised grammar induction is that it often yields an alternative grammar different from the one we actually use (see error type 3 in my previous post "Errors of Unsupervised Learning"). From the results of these papers, it seems that each language's set of plausible grammars is somewhat different, but they all contain the grammar we actually use, so multilingual grammar induction can eliminate the spurious alternatives while keeping the right one. But why? Because of the mysterious universal grammar?