
Discretization of continuous attributes


Presentation Transcript


  1. Discretization of continuous attributes CS583 Spring 2005 Yi Zhang Peng Fan

  2. Round 1

  3. Outline • Introduction • Three discretization methods comparison • Experimental Results • Related Work • Summary

  4. Introduction • A discretization algorithm converts continuous features into discrete features • A good discretization algorithm should not significantly increase the error rate of the learning algorithm that uses its output

  5. Discretization • Divide the range of a continuous (numeric) attribute into intervals, e.g. Age → [5,15], [16,24], [25,67] • Store only the interval labels • Important for association rules and classification

  6. Motivation • Some algorithms are limited to discrete inputs • Many algorithms discretize as part of the learning algorithm (e.g. decision trees); could this part be improved? • Efficiency: continuous features drastically slow down the learning algorithms

  7. Classifying Discretization Algorithms Discretization algorithms can be classified along two dimensions: • Supervised vs. Unsupervised • Global vs. Local

  8. Supervised vs. Unsupervised • Supervised algorithms make use of the class label: • Entropy-based [Fayyad & Irani 93 and others] • Binning [Holte 93] • Unsupervised algorithms: • Equal-width intervals • Equal-frequency intervals

  9. Global vs. Local • Global methods (e.g. binning) produce a mesh over the entire n-dimensional continuous instance space • Local methods (e.g. decision trees) produce partitions that are applied to localized regions of the instance space

  10. Equal Interval Width (Binning) • Given the number of bins k, divide the training-set range into k equal-sized bins • Problems: • Where does k come from? • Sensitive to outliers
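A minimal sketch of equal-width binning, using the temperature values from the OneR example on the next slide; the function name and the choice of k = 3 are illustrative.

```python
# Minimal sketch of equal-width binning; function name and k are illustrative.
def equal_width_bins(values, k):
    """Assign each value to one of k equal-width bins over [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0                  # guard against all-equal values
    # Bin index in 0..k-1; clamp the maximum value into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]

temps = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
print(equal_width_bins(temps, k=3))               # three 7-degree-wide bins
```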

  11. Holte’s OneR • Sort the training examples in increasing order according to the value of the numeric attribute. Values of temperature and their classes:
  Temperature: 64 65 68 69 70 71 72 72 75 75 80 81 83 85
  Class:       Y  N  Y  Y  Y  N  N  Y  Y  Y  N  Y  Y  N

  12. Holte’s OneR • Place breakpoints in the sequence whenever the class changes • Take the majority class for each partition • Problem: splitting on 72 separates two different classes! Easy fix: move the breakpoint to 73.5, producing a mixed partition with “no” as the majority class

  13. Holte’s OneR • Since this may lead to a bin containing only a single value, constrain the method to form bins of at least some minimum size MIN_INST • Holte’s empirical analysis suggests MIN_INST = 6
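A simplified sketch of the idea on these slides, using the temperature example above: break wherever the class changes (never between identical values), then merge undersized partitions into the next one. It is a simplification for illustration, not Holte's exact 1R algorithm.

```python
# Simplified 1R-style discretizer: break where the class label changes (never
# between identical values), then merge partitions smaller than MIN_INST into
# the partition that follows. A simplification, not Holte's exact algorithm.
MIN_INST = 6

def oner_breakpoints(values, labels, min_inst=MIN_INST):
    pairs = sorted(zip(values, labels))
    partitions = [[pairs[0]]]
    for prev, cur in zip(pairs, pairs[1:]):
        if cur[1] != prev[1] and cur[0] != prev[0]:
            partitions.append([])                 # class changed: start a new partition
        partitions[-1].append(cur)
    merged = [partitions[0]]
    for part in partitions[1:]:
        if len(merged[-1]) < min_inst:
            merged[-1].extend(part)               # too small: absorb the next partition
        else:
            merged.append(part)
    # Breakpoints are midpoints between adjacent merged partitions.
    return [(a[-1][0] + b[0][0]) / 2 for a, b in zip(merged, merged[1:])]

temps  = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
labels = list("YNYYYNNYYYNYYN")
print(oner_breakpoints(temps, labels))            # -> [77.5] with this simplification
```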

  14. Entropy-based Partitioning • Intuitively, find the best split so that the bins are as pure as possible • Class entropy: Entropy(S) = – Σc pc log2 pc • Information of the split: I(S1,S2) = (|S1|/|S|) Entropy(S1) + (|S2|/|S|) Entropy(S2) • Information gain of the split: Gain(v,S) = Entropy(S) – I(S1,S2) • Goal: choose the split with maximal information gain • The process is applied recursively to the partitions obtained until some stopping criterion is met, creating multiple intervals • This is a supervised method
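A minimal sketch of choosing a single cut point by the Gain(v,S) criterion above; the function names are illustrative. Fayyad & Irani apply this recursively with an MDL-based stopping criterion, whereas the sketch stops after one cut.

```python
# Minimal sketch: pick the single cut point that maximizes
# Gain(v, S) = Entropy(S) - I(S1, S2), as defined on the slide above.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    pairs = sorted(zip(values, labels))
    base = entropy([l for _, l in pairs])
    best_cut, best_gain = None, 0.0
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                              # cannot cut between identical values
        s1, s2 = pairs[:i], pairs[i:]
        info = (len(s1) * entropy([l for _, l in s1]) +
                len(s2) * entropy([l for _, l in s2])) / len(pairs)
        if base - info > best_gain:
            best_cut, best_gain = (pairs[i - 1][0] + pairs[i][0]) / 2, base - info
    return best_cut, best_gain

temps  = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
labels = list("YNYYYNNYYYNYYN")
print(best_split(temps, labels))                  # best single cut point and its gain
```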

  15. Experimental Setup • Compare equal-width binning (k = 10 and k = 2*log l), Holte’s 1R (Min = 6), and the entropy-based method • Compare C4.5 & Naïve-Bayes with and without the filter discretization methods • Choose 16 datasets from UCI, all containing at least one continuous feature

  16. Accuracies using C4.5 with different discretization methods

  17. Accuracies using Naïve-Bayes with different discretization methods

  18. Results for C4.5: accuracy difference of pre-discretized data minus original C4.5

  19. Results for Naïve-Bayes: accuracy difference of pre-discretized data minus original Naïve-Bayes

  20. Conclusions • Average accuracies: entropy-discretized Naïve-Bayes 83.97% vs. original Naïve-Bayes 76.57%; entropy-discretized C4.5 83.26% vs. original C4.5 82.25% • The supervised discretization methods are slightly better • The global entropy-based method seems to be the best choice here

  21. Round 2

  22. Chi-square Based Methods

  23. Chi-square based methods • In the discretization problem, a compromise must be found between information quality and statistical quality: • Information quality: homogeneous intervals with regard to the attribute to predict • Statistical quality: sufficient sample size in every interval to ensure generalization • Entropy-based criteria focus on the information quality, while chi-square criteria focus on the statistical quality • Examples: • ChiSplit: top-down, local • ChiMerge: bottom-up, local • Khiops: top-down, global

  24. Chi-Square Calculation • The χ² (chi-square) test • Two main uses of the chi-square test: • Test independence of two factors: the larger the χ² value, the more likely the factors are related • Test similarity of two distributions: the smaller the χ² value, the more likely the two distributions are similar

  25. Chi-Square Calculation: An Example • Expected counts from the marginal totals (N = 150; row totals 45 and 105; column totals 30 and 120): Exp(S,A) = 45 * 30 / 150 = 9; Exp(S,B) = 45 * 120 / 150 = 36; Exp(¬S,A) = 105 * 30 / 150 = 21; Exp(¬S,B) = 105 * 120 / 150 = 84 • The resulting χ² value shows that like_science_fiction and play_computer_game are correlated in the group
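A sketch of the χ² computation for a 2x2 contingency table. The slide's observed counts were in a figure that did not survive, so the observed table below is hypothetical, chosen only to match the margins above (row totals 45/105, column totals 30/120, N = 150) and hence the expected counts 9, 36, 21, 84.

```python
# Sketch of the chi-square computation for a 2x2 contingency table. The observed
# counts below are hypothetical (the slide's table was a lost figure); they are
# chosen only to match the margins above and hence the expected counts 9, 36, 21, 84.
def chi_square(observed):
    rows = [sum(r) for r in observed]
    cols = [sum(c) for c in zip(*observed)]
    n = sum(rows)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            exp = rows[i] * cols[j] / n           # Exp(i, j) = row_i * col_j / N
            chi2 += (obs - exp) ** 2 / exp
    return chi2

observed = [[20, 25],   # hypothetical row: like_science_fiction = yes
            [10, 95]]   # hypothetical row: like_science_fiction = no
print(chi_square(observed))   # ~24.0: a large value, so the attributes look correlated
```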

  26. ChiSplit (Bertier & Bouroche) • Top-down approach • Searches for the best split of an interval by maximizing the chi-square criterion applied to the two sub-intervals adjacent to the splitting point: the interval is split if the two sub-intervals differ substantially in the statistical sense • Stopping rule: a user-defined chi-square threshold

  27. ChiMerge (Kerber 1991) • Initialization step: place each distinct continuous value into its own interval • Bottom-up fashion: use the chi-square test to determine when adjacent intervals should be merged • Repeat until a stopping criterion (set manually) is met

  28. ChiMerge Example • [Worked table of adjacent intervals and their class counts not reproduced] • In this case, the interval at 17.92 will merge with 16.92 instead of 18.08, because that adjacent pair has the lower chi-square value
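A rough sketch of ChiMerge's bottom-up loop as described on the previous slide; `threshold` stands in for the manually chosen stopping criterion (Kerber derives it from a significance level), and the helper names are illustrative rather than Kerber's implementation.

```python
# Rough sketch of ChiMerge's bottom-up loop; `threshold` stands in for the
# manually set stopping criterion. Illustrative names, not Kerber's code.
def pair_chi2(counts_a, counts_b):
    """Chi-square statistic for the class-count rows of two adjacent intervals."""
    table = [counts_a, counts_b]
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    chi2 = 0.0
    for i in range(2):
        for j in range(len(cols)):
            exp = rows[i] * cols[j] / n
            exp = exp if exp > 0 else 0.1          # avoid division by zero on empty columns
            chi2 += (table[i][j] - exp) ** 2 / exp
    return chi2

def chi_merge(values, labels, threshold):
    classes = sorted(set(labels))
    # Initialization: one interval per distinct value, holding per-class counts.
    intervals = []
    for v, l in sorted(zip(values, labels)):
        if not intervals or intervals[-1][0] != v:
            intervals.append((v, {c: 0 for c in classes}))
        intervals[-1][1][l] += 1
    # Repeatedly merge the adjacent pair with the lowest chi-square value.
    while len(intervals) > 1:
        scores = [pair_chi2([a[1][c] for c in classes], [b[1][c] for c in classes])
                  for a, b in zip(intervals, intervals[1:])]
        i = min(range(len(scores)), key=scores.__getitem__)
        if scores[i] >= threshold:
            break                                  # all adjacent pairs differ enough: stop
        for c in classes:
            intervals[i][1][c] += intervals[i + 1][1][c]
        del intervals[i + 1]
    return [v for v, _ in intervals]               # lower bound of each final interval
```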

  29. Several Other Discretization Methods

  30. Iterative Discretization (Pazzani) • Initially, form a set of intervals using Equal Width Discretization or the Entropy-based Method • Iteratively adjust the intervals to minimize the naïve-Bayesian classifier’s classification error on the training data • Two adjustment operations: merging or splitting • Very time-consuming when the data set is large
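A very rough sketch of the adjustment loop just described. The callback `nb_error` is hypothetical: it would retrain a naïve-Bayes classifier on the data discretized with the given cut points and return its training error. The split operation here just tries midpoints between existing cut points, a simplification of Pazzani's method.

```python
# Very rough sketch of the iterative merge/split adjustment loop. `nb_error` is a
# hypothetical callback returning the naive-Bayes training error for a cut-point set.
def iterative_discretize(cut_points, nb_error, max_iters=100):
    cuts = sorted(cut_points)
    best_err = nb_error(cuts)
    for _ in range(max_iters):
        candidates = []
        # Splitting: insert a midpoint between each pair of adjacent cut points.
        for a, b in zip(cuts, cuts[1:]):
            candidates.append(sorted(cuts + [(a + b) / 2]))
        # Merging: drop one cut point, fusing its two neighbouring intervals.
        for i in range(len(cuts)):
            candidates.append(cuts[:i] + cuts[i + 1:])
        scored = [(nb_error(c), c) for c in candidates if c]
        if not scored:
            break
        err, cand = min(scored)
        if err >= best_err:
            break                                  # no adjustment improves training error
        best_err, cuts = err, cand
    return cuts
```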

  31. Proportional k-Interval Discretization (Yang & Webb) • Discretization bias and variance: • Bias: the component of error that results from systematic error of the algorithm • Variance: the component of error that results from random variation in the training data and random behavior of the algorithm • Bias and variance trade-off (when the training data are fixed): smaller intervals lower the bias but raise the variance, and larger intervals do the opposite

  32. Proportional k-Interval Discretization (Contd.) • For N training instances, set s*t = N and s = t (s: desired interval size; t: number of intervals) • i.e. discretize into sqrt(N) intervals, with sqrt(N) instances in each interval • Advantage: as N increases, both bias and variance decrease!
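A minimal sketch of the sqrt(N)-by-sqrt(N) scheme above; it ignores ties and the boundary adjustments of Yang & Webb's actual method, and the function name is illustrative.

```python
# Minimal sketch of PKID: roughly sqrt(N) equal-frequency intervals of roughly
# sqrt(N) instances each. Ties and boundary adjustments are ignored.
from math import isqrt

def pkid_cut_points(values):
    data = sorted(values)
    n = len(data)
    size = max(isqrt(n), 1)                       # desired interval size s ~ sqrt(N)
    # Cut between consecutive chunks of `size` instances.
    return [(data[i - 1] + data[i]) / 2 for i in range(size, n, size)]

print(pkid_cut_points(range(1, 101)))             # 100 values -> 9 cuts, 10 intervals of 10
```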

  33. Weighted Proportional k-Interval Discretization (Yang & Webb) • When N is small, Proportional k-Interval Discretization tends to produce small intervals, resulting in high variance • Ensure a minimum interval size m to prevent high variance: • s*t = N, s - m = t (s: desired interval size; t: number of intervals; m = 30)
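Solving the two equations on this slide for s gives s = (m + sqrt(m² + 4N)) / 2; the sketch below just evaluates that, with illustrative names.

```python
# Sketch of the WPKID interval size: solve s*t = N and s - m = t with m = 30,
# i.e. s^2 - m*s - N = 0, so s = (m + sqrt(m^2 + 4N)) / 2.
from math import sqrt

def wpkid_interval_size(n, m=30):
    s = (m + sqrt(m * m + 4 * n)) / 2             # desired interval size
    return s, s - m                               # (interval size s, interval count t)

print(wpkid_interval_size(100))     # N=100   -> s ~ 33, t ~ 3: few intervals, each of size >= ~30
print(wpkid_interval_size(10000))   # N=10000 -> both s and t grow with N
```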

  34. Non-Disjoint Discretization (Yang & Webb) • In naïve-Bayesian classification, discretization relies on the estimation P(C = c | Xi = xi) ≈ P(C = c | Xi ∈ (a, b]), where (a, b] is the interval containing xi • When xi is near either boundary of (a, b], the right-hand side is less likely to provide relevant information about xi • NDD therefore builds overlapping intervals out of smaller atomic intervals, so that each value falls toward the middle of the interval used for it
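A hedged sketch of the overlapping-interval idea, assuming (as slide 36 suggests) that each working interval spans three consecutive atomic intervals with the value's own atomic interval in the middle; names and boundary handling are illustrative.

```python
# Hedged sketch: build small "atomic" intervals, then give each value x an interval
# made of the atomic interval containing x plus one atomic interval on each side,
# so that x sits near the middle. Names and boundary handling are illustrative.
import bisect

def atomic_cut_points(values, num_atomic):
    """Equal-frequency cut points delimiting the atomic intervals."""
    data = sorted(values)
    step = max(len(data) // num_atomic, 1)
    return [(data[i - 1] + data[i]) / 2 for i in range(step, len(data), step)]

def interval_for(x, cuts):
    """Bounds of the three-atomic-interval window whose middle interval contains x."""
    k = bisect.bisect_right(cuts, x)              # index of x's atomic interval
    lower = cuts[k - 2] if k >= 2 else float("-inf")
    upper = cuts[k + 1] if k + 1 < len(cuts) else float("inf")
    return lower, upper

cuts = atomic_cut_points(range(1, 101), num_atomic=10)
print(interval_for(42, cuts))    # -> (30.5, 60.5): 42 lies in the middle atomic interval
```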

  35. Experimental Validation • Discretization algorithms: Equal Width Discretization, Equal Frequency Discretization, Entropy Minimization Discretization, Proportional k-Interval Discretization, Lazy Discretization (LD), Non-Disjoint Discretization (NDD), and Weighted Proportional k-Interval Discretization (WPKID) • Classification algorithm: naïve-Bayes • Results: LD, NDD and WPKID perform better, but LD cannot scale to large data sets

  36. Combine NDD and WPKID? • Weighted Non-Disjoint Discretization • Interval size produced by WPKID. • Combining three atomic intervals into one interval.

  37. Thank you, and questions?

  38. References • Supervised and Unsupervised Discretization of Continuous Features (Dougherty, Kohavi & Sahami) • ChiMerge: Discretization of Numeric Attributes (Kerber) • Khiops: A Statistical Discretization Method of Continuous Attributes (Boullé) • Analyse des données multidimensionnelles (Bertier & Bouroche) • An Iterative Improvement Approach for the Discretization of Numeric Attributes in Bayesian Classifiers (Pazzani) • Proportional k-Interval Discretization for Naïve-Bayes Classifiers (Yang & Webb) • Weighted Proportional k-Interval Discretization for Naïve-Bayes Classifiers (Yang & Webb) • Non-Disjoint Discretization for Naïve-Bayes Classifiers (Yang & Webb) • A Comparative Study of Discretization Methods for Naïve-Bayes Classifiers (Yang & Webb)
