Thursday, July 22, 2010

Errors of Unsupervised Learning

The paper "Analyzing the Errors of Unsupervised Learning" (ACL08) decomposes the errors of unsupervised learning into four types.

1. Optimization error. In most cases we cannot find the global optimum of the objective function, so we resort to methods like EM or gradient descent and settle for a local optimum. This seems to me the most widely recognized of the four errors (the first sketch after this list illustrates it).

2. Estimation error. We only have a finite training set, from which we cannot really recover the true distribution. Obviously, this error can be reduced by enlarging the training set. The paper also found that a larger training set can help reduce the first error (optimization error) as well, possibly by making the objective function smoother. The second sketch after this list shows estimation error shrinking as the sample grows.

3. Identifiability error. There are hidden variables (e.g., labels) in unsupervised learning, so there can be several different models that all attain the same maximal objective value (computed over the observables, e.g., the likelihood of the observables) yet lie at different distances from the true distribution (which is over both the observable and the hidden variables). The third sketch after this list gives the classic label-permutation example.

4. Approximation error. This is the discrepancy between the best model under the objective function (even with unlimited data and perfect optimization) and the true model. In my opinion this can be further decomposed into two types of errors:
4.1. The model family we learn in may not include the true model, for example, using a hidden Markov model to fit data sampled from a probabilistic context-free grammar. The last sketch after this list shows a simpler instance of this kind of mismatch.
4.2. The objective function we use may not pick out the best model in the family. For example, if the objective is the posterior probability given the data, then different priors lead to different models, and the better the prior reflects the true model, the better the resulting model will be.
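
To make these concrete for myself, here is a tiny sketch of the optimization error, using numpy and scikit-learn (the data and all names are my own toy choices, not from the paper). EM only guarantees a local optimum, so refitting the same mixture model on the same data from different random initializations can end at different log-likelihoods:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
# Observable data: three well-separated 1-D Gaussian clusters.
X = np.concatenate([rng.normal(mu, 0.3, size=200) for mu in (-5.0, 0.0, 5.0)])
X = X.reshape(-1, 1)

for seed in range(5):
    gmm = GaussianMixture(n_components=3, init_params='random',
                          n_init=1, random_state=seed).fit(X)
    # score() is the average log-likelihood; it can differ across seeds
    # when EM gets stuck in different local optima.
    print('seed', seed, 'avg log-likelihood', round(gmm.score(X), 3))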
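
A similarly small sketch of the estimation error (numpy only; again my own toy example): with a finite sample we can only approximate the true distribution, and the approximation improves as the training set grows. Here the "distribution" is just a Gaussian whose mean we estimate:

import numpy as np

rng = np.random.RandomState(0)
true_mean = 2.0
for n in (10, 100, 1000, 10000, 100000):
    sample = rng.normal(true_mean, 1.0, size=n)
    # The gap between the estimated and the true mean shrinks as n grows.
    print(n, 'estimation error of the mean:', round(abs(sample.mean() - true_mean), 4))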
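
For the identifiability error, the classic toy case is label permutation in a mixture (numpy + scikit-learn; my own example, not the paper's): swapping the hidden cluster labels leaves the likelihood of the observables unchanged, but only one of the two labelings matches the true hidden labels:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
X = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(2.0, 1.0, 300)]).reshape(-1, 1)
y_true = np.array([0] * 300 + [1] * 300)   # hidden labels, never shown to EM

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
y_hat = gmm.predict(X)

print('avg log-likelihood of observables:', round(gmm.score(X), 3))
# The permuted labeling corresponds to a model with exactly the same
# likelihood over the observables, yet its distance to the truth differs.
print('accuracy of one labeling:  ', (y_hat == y_true).mean())
print('accuracy of the other one: ', ((1 - y_hat) == y_true).mean())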
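
Finally, a sketch of approximation error of type 4.1 (numpy + scipy; my own simpler stand-in for the HMM-vs-PCFG example): if the model family is a single Gaussian but the true model is a two-component mixture, even the best member of the family falls short of the truth, no matter how much data we have:

import numpy as np
from scipy.stats import norm

rng = np.random.RandomState(0)
# Data from the true model: an even mixture of two Gaussians.
X = np.concatenate([rng.normal(-3.0, 1.0, 50000), rng.normal(3.0, 1.0, 50000)])

# Maximum-likelihood single Gaussian fitted to the data.
mu, sigma = X.mean(), X.std()
ll_single = norm.logpdf(X, mu, sigma).mean()

# Average log-likelihood under the true mixture.
ll_true = np.log(0.5 * norm.pdf(X, -3.0, 1.0) + 0.5 * norm.pdf(X, 3.0, 1.0)).mean()

print('best single Gaussian:', round(ll_single, 3))
print('true mixture:        ', round(ll_true, 3))
# The gap does not vanish as the sample grows: it is approximation error,
# not estimation error.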