
Discretization of continuous attributes


Presentation Transcript


  1. Discretization of continuous attributes CS583 Spring 2005 Yi Zhang Peng Fan

  2. Round 1

  3. Outline • Introduction • Three discretization methods comparison • Experimental Results • Related Work • Summary

  4. Introduction • A discretization algorithm converts continuous features into discrete features • A good discretization algorithm should not significantly increase the error rate of the learning algorithm that uses its output

  5. Discretization • Divide the range of a continuous (numeric) attribute into intervals, e.g. Age → [5,15], [16,24], [25,67] • Store only the interval labels • Important for association rules and classification

  6. Motivation • Some algorithms are limited to discrete inputs • Many algorithms discretize as part of the learning algorithm (e.g. decision trees); could this part be improved? • Efficiency: continuous features drastically slow down the learning algorithms

  7. Classifying Discretization Algorithms Discretization algorithms can be classified along two dimensions: • Supervised vs. Unsupervised • Global vs. Local

  8. Supervised vs. Unsupervised • Supervised algorithms make use of the class label: • Entropy-based [Fayyad & Irani 93 and others] • Binning [Holte 93] • Unsupervised algorithms: • Equal-width intervals • Equal-frequency intervals

  9. Global vs. Local • Global methods (e.g. binning) produce a mesh over the entire n-dimensional continuous instance space • Local methods (e.g. decision trees) produce partitions that are applied to localized regions of the instance space

  10. Equal Interval Width (Binning) • Given the number of bins k, divide the training-set range into k equal-sized bins • Problems: • Where does k come from? • Sensitive to outliers
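A minimal sketch of equal-width binning, using the temperature values from the OneR example on the next slide; the function name and the choice of k = 3 are illustrative.

```python
# Minimal sketch of equal-width binning; function name and k are illustrative.
def equal_width_bins(values, k):
    """Assign each value to one of k equal-width bins over [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0                  # guard against all-equal values
    # Bin index in 0..k-1; clamp the maximum value into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]

temps = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
print(equal_width_bins(temps, k=3))               # three 7-degree-wide bins
```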

  11. Holte’s OneR • Sort the training examples in increasing order according to the value of the numeric attribute. Values of temperature and their classes:
  Temperature: 64 65 68 69 70 71 72 72 75 75 80 81 83 85
  Class:       Y  N  Y  Y  Y  N  N  Y  Y  Y  N  Y  Y  N

  12. Holte’s OneR • Place breakpoints in the sequence whenever the class changes • Take the majority class for each partition • Problem: splitting on 72 separates two different classes! Easy fix: move the breakpoint to 73.5, producing a mixed partition with “no” as the majority class

  13. Holte’s OneR • Since this may lead to a bin containing only a single value, constrain the method to form bins of at least some minimum size MIN_INST • Holte’s empirical analysis suggests MIN_INST = 6
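A simplified sketch of the idea on these slides, using the temperature example above: break wherever the class changes (never between identical values), then merge undersized partitions into the next one. It is a simplification for illustration, not Holte's exact 1R algorithm.

```python
# Simplified 1R-style discretizer: break where the class label changes (never
# between identical values), then merge partitions smaller than MIN_INST into
# the partition that follows. A simplification, not Holte's exact algorithm.
MIN_INST = 6

def oner_breakpoints(values, labels, min_inst=MIN_INST):
    pairs = sorted(zip(values, labels))
    partitions = [[pairs[0]]]
    for prev, cur in zip(pairs, pairs[1:]):
        if cur[1] != prev[1] and cur[0] != prev[0]:
            partitions.append([])                 # class changed: start a new partition
        partitions[-1].append(cur)
    merged = [partitions[0]]
    for part in partitions[1:]:
        if len(merged[-1]) < min_inst:
            merged[-1].extend(part)               # too small: absorb the next partition
        else:
            merged.append(part)
    # Breakpoints are midpoints between adjacent merged partitions.
    return [(a[-1][0] + b[0][0]) / 2 for a, b in zip(merged, merged[1:])]

temps  = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
labels = list("YNYYYNNYYYNYYN")
print(oner_breakpoints(temps, labels))            # -> [77.5] with this simplification
```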

  14. Entropy-based Partitioning • Intuitively, find the best split so that the bins are as pure as possible • Class entropy: Entropy(S) = – Σc pc log2 pc • Information of the split: I(S1,S2) = (|S1|/|S|) Entropy(S1) + (|S2|/|S|) Entropy(S2) • Information gain of the split: Gain(v,S) = Entropy(S) – I(S1,S2) • Goal: choose the split with maximal information gain • The process is applied recursively to the partitions obtained until some stopping criterion is met, creating multiple intervals • This is a supervised method
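A minimal sketch of choosing a single cut point by the Gain(v,S) criterion above; the function names are illustrative. Fayyad & Irani apply this recursively with an MDL-based stopping criterion, whereas the sketch stops after one cut.

```python
# Minimal sketch: pick the single cut point that maximizes
# Gain(v, S) = Entropy(S) - I(S1, S2), as defined on the slide above.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    pairs = sorted(zip(values, labels))
    base = entropy([l for _, l in pairs])
    best_cut, best_gain = None, 0.0
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                              # cannot cut between identical values
        s1, s2 = pairs[:i], pairs[i:]
        info = (len(s1) * entropy([l for _, l in s1]) +
                len(s2) * entropy([l for _, l in s2])) / len(pairs)
        if base - info > best_gain:
            best_cut, best_gain = (pairs[i - 1][0] + pairs[i][0]) / 2, base - info
    return best_cut, best_gain

temps  = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
labels = list("YNYYYNNYYYNYYN")
print(best_split(temps, labels))                  # best single cut point and its gain
```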

  15. Experimental Setup • Compare equal-width binning (k = 10 and k = 2*log l), Holte’s 1R (Min = 6), and the entropy-based method • Compare C4.5 & Naïve-Bayes with and without the filter discretization methods • Choose 16 datasets from UCI, all containing at least one continuous feature

  16. Accuracies using C4.5 with different discretization methods

  17. Accuracies using Naïve-Bayes with different discretization methods

  18. Results for C4.5: accuracy difference of pre-discretized data minus original C4.5

  19. Results for Naïve-Bayes: accuracy difference of pre-discretized data minus original Naïve-Bayes

  20. Conclusions • Average accuracies: entropy-discretized Naïve-Bayes 83.97% vs. original Naïve-Bayes 76.57%; entropy-discretized C4.5 83.26% vs. original C4.5 82.25% • The supervised discretization methods are slightly better • The global entropy-based method seems to be the best choice here

  21. Round 2

  22. Chi-square Based Methods

  23. Chi-square based methods • In the discretization problem, a compromise must be found between information quality and statistical quality: • Information quality: homogeneous intervals with regard to the attribute to predict • Statistical quality: sufficient sample size in every interval to ensure generalization • Entropy-based criteria focus on the information quality, while chi-square criteria focus on the statistical quality • Examples: • ChiSplit: top-down, local • ChiMerge: bottom-up, local • Khiops: top-down, global

  24. Chi-Square Calculation • The χ² (chi-square) test • Two main uses of the chi-square test: • Test independence of two factors: the larger the χ² value, the more likely the factors are related • Test similarity of two distributions: the smaller the χ² value, the more likely the two distributions are similar

  25. Chi-Square Calculation: An Example • Expected counts from the marginal totals (N = 150; row totals 45 and 105; column totals 30 and 120): Exp(S,A) = 45 * 30 / 150 = 9; Exp(S,B) = 45 * 120 / 150 = 36; Exp(¬S,A) = 105 * 30 / 150 = 21; Exp(¬S,B) = 105 * 120 / 150 = 84 • The resulting χ² value shows that like_science_fiction and play_computer_game are correlated in the group
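A sketch of the χ² computation for a 2x2 contingency table. The slide's observed counts were in a figure that did not survive, so the observed table below is hypothetical, chosen only to match the margins above (row totals 45/105, column totals 30/120, N = 150) and hence the expected counts 9, 36, 21, 84.

```python
# Sketch of the chi-square computation for a 2x2 contingency table. The observed
# counts below are hypothetical (the slide's table was a lost figure); they are
# chosen only to match the margins above and hence the expected counts 9, 36, 21, 84.
def chi_square(observed):
    rows = [sum(r) for r in observed]
    cols = [sum(c) for c in zip(*observed)]
    n = sum(rows)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            exp = rows[i] * cols[j] / n           # Exp(i, j) = row_i * col_j / N
            chi2 += (obs - exp) ** 2 / exp
    return chi2

observed = [[20, 25],   # hypothetical row: like_science_fiction = yes
            [10, 95]]   # hypothetical row: like_science_fiction = no
print(chi_square(observed))   # ~24.0: a large value, so the attributes look correlated
```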

  26. ChiSplit (Bertier & Bouroche) • Top-down approach • Searches for the best split of an interval by maximizing the chi-square criterion applied to the two sub-intervals adjacent to the splitting point: the interval is split if the two sub-intervals differ substantially in the statistical sense • Stopping rule: a user-defined chi-square threshold

  27. ChiMerge (Kerber 1991) • Initialization step: place each distinct continuous value into its own interval • Bottom-up fashion: use the chi-square test to determine when adjacent intervals should be merged • Repeat until a stopping criterion (set manually) is met

  28. ChiMerge Example • [Worked table of adjacent intervals and their class counts not reproduced] • In this case, the interval at 17.92 will merge with 16.92 instead of 18.08, because that adjacent pair has the lower chi-square value
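A rough sketch of ChiMerge's bottom-up loop as described on the previous slide; `threshold` stands in for the manually chosen stopping criterion (Kerber derives it from a significance level), and the helper names are illustrative rather than Kerber's implementation.

```python
# Rough sketch of ChiMerge's bottom-up loop; `threshold` stands in for the
# manually set stopping criterion. Illustrative names, not Kerber's code.
def pair_chi2(counts_a, counts_b):
    """Chi-square statistic for the class-count rows of two adjacent intervals."""
    table = [counts_a, counts_b]
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    chi2 = 0.0
    for i in range(2):
        for j in range(len(cols)):
            exp = rows[i] * cols[j] / n
            exp = exp if exp > 0 else 0.1          # avoid division by zero on empty columns
            chi2 += (table[i][j] - exp) ** 2 / exp
    return chi2

def chi_merge(values, labels, threshold):
    classes = sorted(set(labels))
    # Initialization: one interval per distinct value, holding per-class counts.
    intervals = []
    for v, l in sorted(zip(values, labels)):
        if not intervals or intervals[-1][0] != v:
            intervals.append((v, {c: 0 for c in classes}))
        intervals[-1][1][l] += 1
    # Repeatedly merge the adjacent pair with the lowest chi-square value.
    while len(intervals) > 1:
        scores = [pair_chi2([a[1][c] for c in classes], [b[1][c] for c in classes])
                  for a, b in zip(intervals, intervals[1:])]
        i = min(range(len(scores)), key=scores.__getitem__)
        if scores[i] >= threshold:
            break                                  # all adjacent pairs differ enough: stop
        for c in classes:
            intervals[i][1][c] += intervals[i + 1][1][c]
        del intervals[i + 1]
    return [v for v, _ in intervals]               # lower bound of each final interval
```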

  29. Several Other Discretization Methods

  30. Iterative Discretization (Pazzani) • Initially, form a set of intervals using Equal Width Discretization or the Entropy-based Method • Iteratively adjust the intervals to minimize the naïve-Bayesian classifier’s classification error on the training data • Two adjustment operations: merging or splitting • Very time-consuming when the data set is large
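A very rough sketch of the adjustment loop just described. The callback `nb_error` is hypothetical: it would retrain a naïve-Bayes classifier on the data discretized with the given cut points and return its training error. The split operation here just tries midpoints between existing cut points, a simplification of Pazzani's method.

```python
# Very rough sketch of the iterative merge/split adjustment loop. `nb_error` is a
# hypothetical callback returning the naive-Bayes training error for a cut-point set.
def iterative_discretize(cut_points, nb_error, max_iters=100):
    cuts = sorted(cut_points)
    best_err = nb_error(cuts)
    for _ in range(max_iters):
        candidates = []
        # Splitting: insert a midpoint between each pair of adjacent cut points.
        for a, b in zip(cuts, cuts[1:]):
            candidates.append(sorted(cuts + [(a + b) / 2]))
        # Merging: drop one cut point, fusing its two neighbouring intervals.
        for i in range(len(cuts)):
            candidates.append(cuts[:i] + cuts[i + 1:])
        scored = [(nb_error(c), c) for c in candidates if c]
        if not scored:
            break
        err, cand = min(scored)
        if err >= best_err:
            break                                  # no adjustment improves training error
        best_err, cuts = err, cand
    return cuts
```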

  31. Proportional k-Interval Discretization (Yang & Webb) • Discretization bias and variance: • Bias: the component of error that results from systematic error of the algorithm • Variance: the component of error that results from random variation in the training data and random behavior of the algorithm • Bias and variance trade-off (when the training data are fixed): smaller intervals lower the bias but raise the variance, and larger intervals do the opposite

  32. Proportional k-Interval Discretization (Contd.) • For N training instances, set s*t = N and s = t (s: desired interval size; t: number of intervals) • i.e. discretize into sqrt(N) intervals, with sqrt(N) instances in each interval • Advantage: as N increases, both bias and variance decrease!
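A minimal sketch of the sqrt(N)-by-sqrt(N) scheme above; it ignores ties and the boundary adjustments of Yang & Webb's actual method, and the function name is illustrative.

```python
# Minimal sketch of PKID: roughly sqrt(N) equal-frequency intervals of roughly
# sqrt(N) instances each. Ties and boundary adjustments are ignored.
from math import isqrt

def pkid_cut_points(values):
    data = sorted(values)
    n = len(data)
    size = max(isqrt(n), 1)                       # desired interval size s ~ sqrt(N)
    # Cut between consecutive chunks of `size` instances.
    return [(data[i - 1] + data[i]) / 2 for i in range(size, n, size)]

print(pkid_cut_points(range(1, 101)))             # 100 values -> 9 cuts, 10 intervals of 10
```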

  33. Weighted Proportional k-Interval Discretization (Yang & Webb) • When N is small, Proportional k-Interval Discretization tends to produce small intervals, resulting in high variance • Ensure a minimum interval size m to prevent high variance: • s*t = N, s - m = t (s: desired interval size; t: number of intervals; m = 30)
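Solving the two equations on this slide for s gives s = (m + sqrt(m² + 4N)) / 2; the sketch below just evaluates that, with illustrative names.

```python
# Sketch of the WPKID interval size: solve s*t = N and s - m = t with m = 30,
# i.e. s^2 - m*s - N = 0, so s = (m + sqrt(m^2 + 4N)) / 2.
from math import sqrt

def wpkid_interval_size(n, m=30):
    s = (m + sqrt(m * m + 4 * n)) / 2             # desired interval size
    return s, s - m                               # (interval size s, interval count t)

print(wpkid_interval_size(100))     # N=100   -> s ~ 33, t ~ 3: few intervals, each of size >= ~30
print(wpkid_interval_size(10000))   # N=10000 -> both s and t grow with N
```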

  34. Non-Disjoint Discretization (Yang & Webb) • In naïve-Bayesian classification, discretization relies on the estimation P(C = c | Xi = xi) ≈ P(C = c | Xi ∈ (a, b]), where (a, b] is the interval containing xi • When xi is near either boundary of (a, b], the right-hand side is less likely to provide relevant information about xi • NDD therefore builds overlapping intervals out of smaller atomic intervals, so that each value falls toward the middle of the interval used for it
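A hedged sketch of the overlapping-interval idea, assuming (as slide 36 suggests) that each working interval spans three consecutive atomic intervals with the value's own atomic interval in the middle; names and boundary handling are illustrative.

```python
# Hedged sketch: build small "atomic" intervals, then give each value x an interval
# made of the atomic interval containing x plus one atomic interval on each side,
# so that x sits near the middle. Names and boundary handling are illustrative.
import bisect

def atomic_cut_points(values, num_atomic):
    """Equal-frequency cut points delimiting the atomic intervals."""
    data = sorted(values)
    step = max(len(data) // num_atomic, 1)
    return [(data[i - 1] + data[i]) / 2 for i in range(step, len(data), step)]

def interval_for(x, cuts):
    """Bounds of the three-atomic-interval window whose middle interval contains x."""
    k = bisect.bisect_right(cuts, x)              # index of x's atomic interval
    lower = cuts[k - 2] if k >= 2 else float("-inf")
    upper = cuts[k + 1] if k + 1 < len(cuts) else float("inf")
    return lower, upper

cuts = atomic_cut_points(range(1, 101), num_atomic=10)
print(interval_for(42, cuts))    # -> (30.5, 60.5): 42 lies in the middle atomic interval
```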

  35. Experimental Validation • Discretization algorithms: Equal Width Discretization, Equal Frequency Discretization, Entropy Minimization Discretization, Proportional k-Interval Discretization, Lazy Discretization (LD), Non-Disjoint Discretization (NDD), and Weighted Proportional k-Interval Discretization (WPKID) • Classification algorithm: naïve-Bayes • Results: LD, NDD and WPKID perform better, but LD cannot scale to large data sets

  36. Combine NDD and WPKID? • Weighted Non-Disjoint Discretization • Interval size produced by WPKID. • Combining three atomic intervals into one interval.

  37. Thank you, and questions?

  38. References • Supervised and Unsupervised Discretization of Continuous Features (Dougherty, Kohavi & Sahami) • ChiMerge: Discretization of Numeric Attributes (Kerber) • Khiops: A Statistical Discretization Method of Continuous Attributes (Boullé) • Analyse des données multidimensionnelles (Bertier & Bouroche) • An Iterative Improvement Approach for the Discretization of Numeric Attributes in Bayesian Classifiers (Pazzani) • Proportional k-Interval Discretization for Naïve-Bayes Classifiers (Yang & Webb) • Weighted Proportional k-Interval Discretization for Naïve-Bayes Classifiers (Yang & Webb) • Non-Disjoint Discretization for Naïve-Bayes Classifiers (Yang & Webb) • A Comparative Study of Discretization Methods for Naïve-Bayes Classifiers (Yang & Webb)
