Wednesday, April 15, 2009

Discretization for naive Bayes

When using naive Bayes, quantitative attributes are usually discretized (binned). Two simple ways to discretize are to partition the data into k intervals of equal width, or into k intervals each containing an equal number of data points. There are also supervised methods, like the one used in Weka's discretizer, which recursively selects cut points based on information gain (just as in some decision tree construction algorithms).

In the paper "Discretization for naive-Bayes learning: managing discretization bias and variance" (MLJ, Jan 09), the authors observe that increasing the number of intervals helps reduce the bias of the learned NB classifier, while the number of data points within each interval (the "interval frequency") should be large enough to reduce its variance. With a fixed amount of training data, there is therefore a trade-off between interval number and interval frequency. Based on this observation they propose two novel discretization methods: "proportional discretization" requires the interval number and the interval frequency to be equal (so both grow as the square root of the training size n, since their product must be n), while "fixed frequency discretization" simply fixes the interval frequency to a sufficiently large value, letting the number of intervals grow with n. Their experiments show that these two methods outperform existing discretization techniques.
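To make the four strategies concrete, here is a minimal NumPy sketch. The function names are my own, and the default m=30 in the fixed-frequency version is just an illustrative stand-in for the paper's "sufficiently large" frequency; this is not the authors' code.

```python
import numpy as np

def equal_width_cut_points(x, k):
    """k intervals of equal width over the observed range; returns interior cut points."""
    lo, hi = x.min(), x.max()
    return np.linspace(lo, hi, k + 1)[1:-1]

def equal_frequency_cut_points(x, k):
    """k intervals, each holding roughly the same number of points."""
    qs = np.linspace(0.0, 1.0, k + 1)[1:-1]
    return np.quantile(x, qs)

def proportional_cut_points(x):
    """Proportional discretization: interval number == interval frequency,
    so both grow as sqrt(n) with training size n."""
    k = max(1, int(np.sqrt(len(x))))
    return equal_frequency_cut_points(x, k)

def fixed_frequency_cut_points(x, m=30):
    """Fixed frequency discretization: each interval holds ~m points
    (m an assumed 'sufficiently large' value); interval count grows with n."""
    k = max(1, len(x) // m)
    return equal_frequency_cut_points(x, k)

def discretize(x, cut_points):
    """Map each value to an interval index, usable as a nominal NB attribute."""
    return np.searchsorted(cut_points, x)

# Example: with n = 900 points, proportional discretization yields ~30
# intervals of ~30 points each, while fixed frequency with m = 30 yields 30.
rng = np.random.default_rng(0)
x = rng.normal(size=900)
bins = discretize(x, proportional_cut_points(x))
```

Note how the bias-variance trade-off shows up directly in the code: proportional discretization splits the "budget" n evenly between interval number and frequency, while fixed frequency pins the variance side and spends everything else on reducing bias.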
