

  1. G54DMT – Data Mining Techniques and Applications http://www.cs.nott.ac.uk/~jqb/G54DMT Dr. Jaume Bacardit jqb@cs.nott.ac.uk Topic 2: Data Preprocessing Lecture 5: Discretisation methods

  2. Outline of the lecture • Definition and taxonomy • Static discretisation techniques • The ADI representation: a dynamic, local, discretisation method • Resources

  3. Definition • Discretisation: • The process of converting a continuous variable into a discrete one. That is, partitioning it into a finite (discrete) set of elements (intervals) • Every data point inside one interval will be treated equally (figure: a domain partitioned into intervals I1, I2, I3)

  4. Definition • How? • A discretisation algorithm proposes a series of cut-points. These, plus the domain's bounds, define the intervals • Why? • Many methods in mathematics and computer science cannot deal with continuous variables • A discretisation process is required to be able to use them, despite the loss of information

  5. Taxonomy • (Liu et al., 2002) • Supervised vs Unsupervised • Supervised methods use the class label of each instance to decide where to place the cut-points • Dynamic vs Static • Dynamic discretisation is performed at the same time as the learning process. Static discretisation is performed before learning • Rule-based systems that generate arbitrary intervals can be considered a case of dynamic discretisation

  6. Taxonomy • Global vs Local • Global methods apply the same discretisation criteria (cut-points) to all instances • Local methods use different cut-points for different groups of instances • Again, rule-learning methods that generate arbitrary intervals can be considered a case of local discretisation

  7. Taxonomy • Splitting vs Merging methods • Splitting methods: start with a single interval (no cut-points) and divide it • Merging methods: start with every possible interval (n-1 cut-points) and merge some of them (figure: splitting works top-down, merging bottom-up)

  8. Equal-length discretisation • Unsupervised discretisation method • Given a domain with bounds d_l and d_u and a number of bins b • It generates b intervals of equal size (d_u - d_l)/b
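A minimal Python sketch of equal-width binning (the function name and example data are mine, not from the lecture):

```python
def equal_width_cutpoints(values, b):
    """Return the b-1 cut-points splitting [d_l, d_u] into b equal-width intervals."""
    d_l, d_u = min(values), max(values)
    width = (d_u - d_l) / b
    return [d_l + i * width for i in range(1, b)]

# b=4 over the domain [0, 8] gives intervals of width 2: cut-points 2.0, 4.0, 6.0
print(equal_width_cutpoints([0, 1, 3, 5, 8], b=4))
```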

  9. Equal-frequency discretisation • Unsupervised discretisation method • Given a domain with bounds d_l and d_u and a number of bins b • It generates b intervals, each of them containing (approximately) the same number of values
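A matching sketch for equal-frequency binning (names mine; with tied values the bins are only approximately equal in size):

```python
def equal_frequency_cutpoints(values, b):
    """Return b-1 cut-points so that each bin holds roughly len(values)/b values."""
    s = sorted(values)
    n = len(s)
    # Convention: the cut-point is the first value of the next block of n/b
    # sorted values; other conventions (e.g. midpoints) are equally valid.
    return [s[(i * n) // b] for i in range(1, b)]

# 8 values, 4 bins -> 2 values per bin: cut-points 3, 5, 8
print(equal_frequency_cutpoints([7, 1, 5, 3, 9, 2, 8, 4], b=4))
```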

  10. ID3 discretisation (Quinlan, 86) • Supervised, splitting method • Inspired by the ID3 decision-tree induction algorithm • It chooses the cut-points that create intervals with minimal entropy, that is, maximising the information gain

  11. ID3 splitting procedure • Start with a single interval • Identify the cut-point that creates the two intervals with minimal entropy • Split the interval using the best cut-point • Recursively apply the method to S1 and S2 • Stop criterion: all instances in an interval have the same class (notation: S = original interval; S1, S2 = the two candidate subintervals)
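The procedure can be sketched in a few lines of Python (a simplified illustration of the recursion, not the original ID3 code; all names are mine):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def id3_cutpoints(points):
    """points: list of (value, class) pairs. Recursively splits at the
    minimum-entropy cut-point until every interval is pure."""
    points = sorted(points, key=lambda p: p[0])
    labels = [c for _, c in points]
    if len(set(labels)) <= 1:                 # stop: all instances share one class
        return []
    best = None
    for i in range(1, len(points)):
        if points[i - 1][0] == points[i][0]:  # no cut between identical values
            continue
        left, right = labels[:i], labels[i:]
        w = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        if best is None or w < best[0]:
            best = (w, i, (points[i - 1][0] + points[i][0]) / 2)
    if best is None:                          # identical values, mixed classes
        return []
    _, i, cut = best
    return sorted(id3_cutpoints(points[:i]) + [cut] + id3_cutpoints(points[i:]))

print(id3_cutpoints([(1, 'a'), (2, 'a'), (3, 'b'), (4, 'b')]))  # -> [2.5]
```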

  12. (Fayyad & Irani, 93) discretisation • Refinement of ID3 to make it more conservative • ID3 generates lots of intervals because its stopping criterion is very loose • In this method, in order to split an interval, the difference between Entropy(S) and EntropyPartition(S,S1,S2) needs to be large enough • Stopping criterion based on the Minimum Description Length (MDL) principle (Rissanen, 78) • MDL is a modern reformulation of the classic Occam’s Razor principle: “If you have two equally good explanations, always choose the simplest one”

  13. Minimum Description Length • MDL also comes from the information theory field, dealing with information transmission • Imagine a sender that must transmit the class of each instance to a receiver who already has the instances. Two options: 1) send the class labels directly, or 2) generate a theory and send its description plus its exceptions • The theory that minimises its own size plus the size of the exceptions is the best one (figure: sender transmits instances + class to receiver)

  14. MDL for discretisation • Stop partitioning when the gain of discretising does not exceed the cost of discretising, i.e. when this inequality is true: $Gain(A,T;S) < \frac{\log_2(N-1)}{N} + \frac{\Delta(A,T;S)}{N}$ where $\Delta(A,T;S) = \log_2(3^k - 2) - [k \, Ent(S) - k_1 \, Ent(S_1) - k_2 \, Ent(S_2)]$ • N = number of instances in the original interval, k = number of classes in S • k1, k2 = number of classes represented in the subpartitions S1 and S2
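Assuming the slide refers to the standard Fayyad & Irani (1993) criterion written above, the stopping test can be sketched as follows (entropy() repeated for self-containment; names mine):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def should_stop(s, s1, s2):
    """s, s1, s2: class labels of the original interval and the two candidate
    subpartitions. True if the gain does not pay for the MDL cost."""
    n = len(s)
    k, k1, k2 = len(set(s)), len(set(s1)), len(set(s2))
    gain = entropy(s) - (len(s1) * entropy(s1) + len(s2) * entropy(s2)) / n
    delta = math.log2(3 ** k - 2) - (k * entropy(s) - k1 * entropy(s1) - k2 * entropy(s2))
    return gain < (math.log2(n - 1) + delta) / n

# A clean split into two pure halves is clearly worth its cost:
print(should_stop(['a', 'a', 'b', 'b'], ['a', 'a'], ['b', 'b']))  # -> False
```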

  15. Unparametrized Supervised Discretizer (Giraldez et al., 02) • Supervised merging algorithm • It defines the quality of an interval I by a measure called goodness, computed from: • maxC(I) = number of examples belonging to the majority class in I • Errors(I) = number of examples not belonging to the majority class in I (the goodness formula itself appeared as an image in the original slide)

  16. Unparametrized Supervised Discretizer (Giraldez et al., 02) • Discretisation process: • Starts with every possible cut-point in the domain • Identifies candidate pairs of intervals to join (Ii, Ii+1) if both conditions are true: • Ii and Ii+1 have the same majority class, or there is a tie in Ii or Ii+1 • goodness(Ii ∪ Ii+1) > [goodness(Ii) + goodness(Ii+1)] / 2 • Merges the candidate pair with the highest goodness • Repeats steps 2-4 until no more intervals are merged (a sketch of this merging loop follows below)
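A sketch of the USD merging loop. The actual goodness formula appears only as an image in the original slides, so goodness() below is a placeholder assumption (majority count penalised by errors), and the candidate condition is simplified to "same majority class":

```python
from collections import Counter

def goodness(labels):
    """Placeholder assumption, NOT necessarily the USD formula: majority-class
    count divided by (1 + number of errors)."""
    max_c = max(Counter(labels).values())
    return max_c / (1 + len(labels) - max_c)

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def usd_merge(intervals):
    """intervals: one list of class labels per initial interval, in domain order.
    Greedily merges the adjacent pair whose union has the highest goodness."""
    while True:
        best = None
        for i in range(len(intervals) - 1):
            a, b = intervals[i], intervals[i + 1]
            if majority(a) != majority(b):            # simplified candidate condition
                continue
            g = goodness(a + b)
            if g > (goodness(a) + goodness(b)) / 2:   # condition 2 from the slide
                if best is None or g > best[0]:
                    best = (g, i)
        if best is None:                              # no candidate pair left
            return intervals
        i = best[1]
        intervals[i:i + 2] = [intervals[i] + intervals[i + 1]]
```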

  17. ChiMerge (Kerber, 92) • Supervised merging discretisation method • Uses the χ² (chi-square) statistical test to decide whether to merge intervals or not • This test checks whether a discrete random variable follows a certain distribution • E.g. testing whether a die is fair (all 6 outcomes are equally likely to occur) • Two intervals are merged if the null hypothesis (both have the same distribution of classes) cannot be rejected

  18. ChiMerge • χ² formula: $\chi^2 = \sum_{i=1}^{2} \sum_{j=1}^{p} \frac{(A_{ij} - E_{ij})^2}{E_{ij}}$, with expected counts $E_{ij} = \frac{R_i \cdot C_j}{N}$ • A p-value can be computed from χ² and the degrees of freedom; a predefined confidence level is used to decide whether to reject the null hypothesis • Aij = examples in interval i from class j • p = number of classes • Ri = examples in interval i • Cj = examples from class j • N = total number of examples • It iteratively merges intervals until the statistical test fails for every possible pair of consecutive intervals
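A sketch of the χ² computation for one pair of adjacent intervals, following the contingency-table quantities defined on the slide (names mine; intervals assumed non-empty):

```python
from collections import Counter

def chi2(a_labels, b_labels):
    """Chi-square statistic for two adjacent intervals given their class labels:
    A_ij = observed counts, E_ij = R_i * C_j / N expected under the null."""
    classes = sorted(set(a_labels) | set(b_labels))
    counts = [Counter(a_labels), Counter(b_labels)]
    r = [len(a_labels), len(b_labels)]            # R_i: examples per interval
    n = r[0] + r[1]                               # N: total number of examples
    chi = 0.0
    for i in (0, 1):
        for j in classes:
            c_j = counts[0][j] + counts[1][j]     # C_j: examples per class
            e_ij = r[i] * c_j / n                 # expected count (> 0 here)
            chi += (counts[i][j] - e_ij) ** 2 / e_ij
    return chi

# Similar class mixes give a small statistic, i.e. a good merge candidate
print(chi2(['a', 'a', 'b'], ['a', 'b']))
```

ChiMerge would then repeatedly merge the adjacent pair with the lowest χ² while that value stays below the threshold implied by the chosen confidence level.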

  19. Differences between discretisers (Bacardit, 2004)

  20. Resources • Good survey on discretisation methods with empirical validation using C4.5 • Implementations of the methods described in this lecture are available in the KEEL package • List of the 27 discretisation algorithms (with references) in KEEL

  21. Questions?
