
An Interval Classifier for Database Mining Applications



  1. An Interval Classifier for Database Mining Applications Rakesh Agrawal, Sakti Ghosh, Tomasz Imielinski, Bala Iyer, Arun Swami Proceedings of the 18th VLDB Conference, Vancouver, Canada, 1992. Presentation by: Vladan Radosavljevic

  2. Outline • Introduction • Motivation • Interval Classifier • Example • Results • Conclusion

  3. Introduction • Given a small set of labeled examples, find a classifier that can efficiently classify a large unlabeled population stored in a database • Or, equivalently: retrieve all examples from the database that belong to a desired class • Assumptions: the labeled examples are representative of the entire population, and the number of classes (m) is known in advance

  4. Motivation • Why an Interval Classifier? • Neural Networks – not database oriented; tuples have to be retrieved into memory one at a time before classification • Decision Trees (ID3, CART) – binary splits increase computation time, and pruning the tree after building makes tree generation more expensive

  5. Interval Classifier (IC) • Key features: • Tree classifier • Categorical attributes – one branch per value • Numerical attributes – the range is decomposed into k intervals, with k determined algorithmically at each node • IC generates SQL queries as the final classification functions (see the sketch below)!
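The point of emitting SQL is that classification of the large population can run inside the DBMS instead of pulling tuples into memory. A minimal sketch of the idea in Python; the table name, leaf intervals, and conditions here are made up for illustration and are not from the paper:

```python
# Illustrative only: rendering hypothetical strong leaves for one class
# as a single SQL retrieval query over an assumed "people" table.
leaves = [("age", 20, 40, "elevel <= 1"),   # made-up leaf intervals
          ("age", 40, 60, "elevel <= 3")]
clauses = [f"({attr} >= {lo} AND {attr} < {hi} AND {cond})"
           for attr, lo, hi, cond in leaves]
query = "SELECT * FROM people WHERE " + " OR ".join(clauses)
print(query)
```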

  6. Interval Classifier - Algorithm • Algorithm: • Partition the domain of each numerical attribute into a predefined number of intervals, and for each interval determine the winning class (the class with the largest frequency in that interval), as sketched below • For each attribute, compute the value of the goodness function – information gain ratio (or resubstitution error rate) – and pick the winning attribute A • Then, for each partition of attribute A, set the strength of the winning class (weak or strong) based on its frequency and a predefined threshold
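A minimal Python sketch of the first step (illustrative, not the paper's code): equal-width partitioning of one numeric attribute and the winning class per interval.

```python
# Split a numeric attribute into k equal-width intervals and record the
# winning class (largest frequency) in each interval.
from collections import Counter

def interval_winners(values, labels, k):
    """values: numeric attribute column; labels: class label per tuple."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0               # guard against a constant column
    hist = [Counter() for _ in range(k)]
    for v, c in zip(values, labels):
        i = min(int((v - lo) / width), k - 1)  # clamp the max value into the last bin
        hist[i][c] += 1
    # (winner, frequency) for each interval; (None, 0) for empty intervals
    return [h.most_common(1)[0] if h else (None, 0) for h in hist]
```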

  7. Interval Classifier - Algorithm • … • Merge adjacent intervals that have the same winning class with equal strength • Divide the training set of examples using the calculated intervals • Strong intervals become leaves with the winning class assigned • Recursively proceed with the weak intervals; stop when all intervals are strong or the specified maximum tree depth is reached
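The merge step, sketched under the same assumptions as above (each partition carries its winning class and its weak/strong strength):

```python
# Coalesce adjacent intervals whose winning class and strength both match.
def merge_intervals(parts):
    """parts: ordered list of (low, high, winner, strength) tuples."""
    merged = [parts[0]]
    for lo, hi, win, strength in parts[1:]:
        plo, phi, pwin, pstr = merged[-1]
        if (win, strength) == (pwin, pstr):
            merged[-1] = (plo, hi, win, strength)   # extend previous interval
        else:
            merged.append((lo, hi, win, strength))
    return merged
```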

  8. Interval Classifier - Pruning • Pruning is dynamic, performed while the tree is generated • Find the accuracy of a node using the training set • Expand the node only if its classification error is below a threshold that depends on the number of leaves and the overall accuracy • The aim is to check whether the expansion will bring an error reduction or not • To avoid pruning too aggressively, each node inherits a certain number of credits from its parent
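A rough sketch of how the credit mechanism could behave; this is an assumed reading of the slide, not the paper's exact scheme:

```python
# Assumed reading: a node passing the error test is expanded normally;
# a node failing it may still be expanded by spending one inherited credit.
def should_expand(node_error, threshold, credits):
    if node_error < threshold:
        return True, credits            # expansion expected to reduce error
    if credits > 0:
        return True, credits - 1        # speculative expansion spends a credit
    return False, credits               # prune: no credits left
```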

  9. Example • Age: numerical, uniformly distributed over 20–80 • Zip code: categorical, uniformly distributed • Level of education (elevel): categorical, uniformly distributed • Two classes: • A: (age < 40 and elevel 0 to 1) OR (40 <= age < 60 and elevel 0 to 3) OR (age >= 60 and elevel 0) • B: otherwise
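The labeling rule spelled out as a predicate (attribute names follow the slide; "elevel 0 to 1" is read as elevel <= 1, and the data generation itself is not shown):

```python
# Synthetic labeling rule from the slide: class A on three (age, elevel)
# regions, class B everywhere else.
def label(age, elevel):
    if (age < 40 and elevel <= 1) or \
       (40 <= age < 60 and elevel <= 3) or \
       (age >= 60 and elevel == 0):
        return "A"
    return "B"
```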

  10. Example • 1000 training tuples • Compute the class histogram for the numerical attribute age by choosing 100 equidistant intervals and determining the winning class for each partition • Find the best attribute based on the resubstitution error rate: 1 − Σ over partitions of win_freq(part) / total_freq
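The resubstitution error written out in code, reusing the per-interval winners from the earlier sketch:

```python
# 1 minus the fraction of tuples that fall into their interval's winning
# class: the attribute with the smallest value wins the split.
def resubstitution_error(winners, total):
    """winners: (class, frequency) per interval, from interval_winners."""
    return 1.0 - sum(freq for _, freq in winners) / total
```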

  11. Example • Choose age (the smallest error rate) and partition its domain by merging adjacent intervals that have the same winning class with equal strength

  12. Example • Proceed with the weak nodes and repeat the same procedure • Finally, the learned classifier recovers the classes defined at the beginning: • A: (age < 40 and elevel 0 to 1) OR • (40 <= age < 60 and elevel 0 to 3) OR • (age >= 60 and elevel 0) • B: otherwise

  13. Results • Examples are generated with smooth boundaries between the groups • Training set: 2500 tuples; test set: 10000 tuples • Fixed precision – threshold 0.9 • Adaptive precision – adaptive threshold • Error pruning – credits • Function 5 – nonlinear

  14. Results • Comparison with ID3:

  15. Conclusion • IC interfaces efficiently with database systems • Careful treatment of numerical attributes • Dynamic pruning • Too many user-defined parameters? • Scalability? • In practice, are k-ary trees less accurate than binary ones?

  16. References [1] R. Agrawal, S. Ghosh, T. Imielinski, B. Iyer, A. Swami: "An Interval Classifier for Database Mining Applications", Proceedings of the 18th VLDB Conference, Vancouver, BC, Canada, 1992, pp. 560–573

  17. THANK YOU!
