Wednesday, September 1, 2010

Unsupervised Multilingual Grammar Induction

Grammar induction is the process of learning a grammar (e.g., a context-free grammar) from text. If no grammatical annotation of the text is available to the learner, it is called unsupervised grammar induction. This sounds like (and actually is) a hard problem, especially for complicated grammars like those of natural languages. Yet every human goes through such a procedure early in life, although many researchers argue that humans have more than plain language input when learning their first language, e.g., perceptual context.
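To make the setting concrete, here is a minimal Python sketch of the task (a hypothetical toy example of my own, not from any paper): the target is a grammar like the one below, but the unsupervised learner only ever sees the flat sentences.

```python
# A toy context-free grammar as plain data: nonterminal -> list of right-hand sides.
# (Hypothetical toy example; the symbols S, NP, VP, etc. are just illustrative.)
toy_grammar = {
    "S":   [("NP", "VP")],
    "NP":  [("Det", "N")],
    "VP":  [("V", "NP"), ("V",)],
    "Det": [("the",)],
    "N":   [("dog",), ("cat",)],
    "V":   [("chased",), ("slept",)],
}

# In *unsupervised* grammar induction, the learner sees only flat sentences
# like these -- no trees, no nonterminal labels -- and must recover
# something like toy_grammar from many such strings.
corpus = [
    ["the", "dog", "chased", "the", "cat"],
    ["the", "cat", "slept"],
]
```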

There has been some recent work on multilingual grammar induction, that is, learning from texts of multiple languages simultaneously, in the hope that the commonalities among languages provide more useful information than a monolingual corpus of limited size can. The commonality is usually represented by the alignment of a parallel corpus, i.e., the mapping between words of sentences that have the same meaning but are in different languages. At ACL 2010, however, there happened to be two papers on unsupervised multilingual grammar induction that do not use alignment.
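For readers unfamiliar with alignment, here is a minimal sketch of what it looks like as a data structure; the sentence pair and the index links are made up for illustration (in practice such links come from tools like GIZA++). The two ACL 2010 papers below manage to do without this kind of resource.

```python
# A word alignment for one sentence pair, represented as index pairs
# (English position, German position). Hypothetical example.
english = ["the", "dog", "sleeps"]
german = ["der", "Hund", "schläft"]

# alignment contains (e, g) meaning english[e] corresponds to german[g]
alignment = [(0, 0), (1, 1), (2, 2)]

for e, g in alignment:
    print(f"{english[e]} <-> {german[g]}")
```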

The first is "Learning Common Grammar from Multilingual Corpus", which learns a probabilistic context-free grammar for each language. The assumption is that each grammar has the same set of nonterminals, and for each nonterminal, the rule probability distribution differs across languages but shares the same Dirichlet prior. One problem with this is that the different orderings of grammatical components in different languages are not taken into account. Also, there is no quantitative evaluation in the paper.
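The tying can be pictured with a small simulation: one Dirichlet prior, and several languages each drawing their own rule distribution from it. The alpha value and the language codes below are arbitrary assumptions of mine, just to show that the per-language distributions differ while the prior is shared.

```python
import numpy as np

rng = np.random.default_rng(0)

# Rules expanding one shared nonterminal, say VP -> {rule 1, rule 2, rule 3}.
n_rules = 3

# One Dirichlet prior shared across languages (alpha is an assumed value);
# each language draws its own rule-probability vector from that prior.
alpha = np.full(n_rules, 2.0)
languages = ["en", "de", "ja"]
rule_probs = {lang: rng.dirichlet(alpha) for lang in languages}

for lang, theta in rule_probs.items():
    print(lang, np.round(theta, 3))  # distributions differ, prior is common
```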

The second paper, "Phylogenetic Grammar Induction", proposes a more complicated method. Here a type of dependency grammar is used. The parameters of the grammars of different languages are tied, not by a single common prior, but by a set of priors arranged in a phylogenetic tree. The commonality of different languages is represented by the mapping of their part-of-speech tags. The authors did a more complete evaluation and found that a more accurate/complete phylogenetic tree resulted in better grammars.
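A rough sketch of the idea (my own simplification, not the paper's actual model): parameters start at a root language and drift along the branches of the tree, so closely related languages end up with similar grammars. The tree, the drift scale, and the Gaussian noise on logits here are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical phylogeny: a proto-language node with two daughter languages.
# Parameters (logits over 3 grammar choices) start at the root and drift by
# Gaussian noise along each branch, so sister languages stay similar.
tree = {"proto-germanic": ["english", "german"]}
drift = 0.5  # assumed branch noise scale

logits = {"proto-germanic": rng.normal(size=3)}
for parent, children in tree.items():
    for child in children:
        logits[child] = logits[parent] + rng.normal(scale=drift, size=3)

for lang, z in logits.items():
    print(lang, np.round(softmax(z), 3))  # related languages get close parameters
```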

The success of multilingual grammar induction, especially the result of the second paper, has some very interesting implications. One difficulty of unsupervised grammar induction is that it usually yields an alternative grammar different from the one we use (see error type 3 in my previous post "Errors of Unsupervised Learning"). From the results of these papers, it seems that the sets of alternative grammars of different languages are more or less different, but they all contain the grammar we use, so multilingual grammar induction can eliminate those alternatives while finding the right one. But why? Because of the mysterious universal grammar?
