
A General Framework for Mining Massive Data Streams

Geoff Hulten. Advised by Pedro Domingos.


Presentation Transcript


  1. A General Framework for Mining Massive Data Streams Geoff Hulten Advised by Pedro Domingos

  2. Mining Massive Data Streams • High-speed data streams abundant • Large retailers • Long distance & cellular phone call records • Scientific projects • Large Web sites • Build model of the process creating data • Use model to interact more efficiently

  3. Growing Mismatch Between Algorithms and Data • State-of-the-art data mining algorithms • One-shot learning • Work with static databases • Maximum of 1 million – 10 million records • Properties of Data Streams • Data stream exists over months or years • 10s – 100s of millions of new records per day • Process generating data changes over time

  4. The Cost of This Mismatch • Fraction of data we can effectively mine shrinking towards zero • Models learned from heuristically selected samples of data • Models out of date before being deployed

  5. Need New Algorithms • Monitor a data stream and have a model available at all times • Improve the model as data arrives • Adapt the model as process generating data changes • Have quality guarantees • Work within strict resource constraints

  6. Solution: General Framework • Applicable to algorithms based on discrete search • Semi-automatically converts algorithm to meet our design needs • Uses sampling to select data size for each search step • Extensions to continuous searches and relational data

  7. Outline • Introduction • Scaling up Decision Trees • Our Framework for Scaling • Other Applications and Results • Conclusion

  8. Decision Trees • Examples: • Encode: • Nodes contain tests • Leaves contain predictions [Figure: example decision tree with tests Gender? (Male / Female) and Age? (< 25 / >= 25) and leaves False, False, True]

  9. Decision Tree Induction
     DecisionTree(Data D, Tree T, Attributes A)
       If D is pure
         Let T be a leaf predicting class in D
         Return
       Let X be best of A according to D and G()
       Let T be a node that splits on X
       For each value V of X
         Let D^ be the portion of D with V for X
         Let T^ be the child of T for V
         DecisionTree(D^, T^, A – X)
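The recursive procedure on slide 9 can be sketched in Python as follows. This is a minimal illustration, not the presenter's code: it assumes discrete attributes stored in dicts, uses information gain as the evaluation function G(), and the names `entropy`, `info_gain`, `decision_tree`, and `predict` are hypothetical. The toy data mirrors the Gender/Age example from slide 8.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(data, attr):
    """Entropy reduction from splitting `data` on `attr` (stand-in for G)."""
    labels = [y for _, y in data]
    remainder = 0.0
    for value in {x[attr] for x, _ in data}:
        subset = [y for x, y in data if x[attr] == value]
        remainder += len(subset) / len(data) * entropy(subset)
    return entropy(labels) - remainder

def decision_tree(data, attrs):
    """Return a leaf (a class label) or a node {'split': attr, 'children': {...}}."""
    labels = [y for _, y in data]
    # Stop when D is pure or no attributes remain; predict the majority class.
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(data, a))
    children = {}
    for value in {x[best] for x, _ in data}:
        subset = [(x, y) for x, y in data if x[best] == value]
        # Recurse on D^ with the chosen attribute removed (A - X).
        children[value] = decision_tree(subset, [a for a in attrs if a != best])
    return {"split": best, "children": children}

def predict(tree, x):
    """Follow the tests from the root until a leaf label is reached."""
    while isinstance(tree, dict):
        tree = tree["children"][x[tree["split"]]]
    return tree

# Toy data (assumed for illustration): only females aged >= 25 are positive.
data = [
    ({"gender": "male", "age": "<25"}, False),
    ({"gender": "male", "age": ">=25"}, False),
    ({"gender": "female", "age": "<25"}, False),
    ({"gender": "female", "age": ">=25"}, True),
]
tree = decision_tree(data, ["gender", "age"])
```

Note that this batch version needs all of D in memory at every node, which is exactly the cost the streaming approach on the following slides avoids.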

  10. VFDT (Very Fast Decision Tree) • In order to pick the split attribute for a node, looking at a few examples may be sufficient • Given a stream of examples: • Use the first to pick the split at the root • Sort succeeding ones to the leaves • Pick best attribute there • Continue… • Leaves predict most common class • Very fast, incremental, anytime decision tree induction algorithm

  11. How Much Data? • Make sure best attribute is better than second • That is: G(best) – G(2nd best) > 0 • Using a sample, so need the Hoeffding bound • Collect data until: G(best) – G(2nd best) > ε
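For reference, the Hoeffding bound that this slide relies on can be written out as below. This is the standard form of the bound, not text recovered from the slide. The symbols are assumptions for illustration: R is the range of the evaluation function (for information gain, the base-2 log of the number of classes), n is the number of examples seen at the leaf, δ is the allowed probability of error, and the bars denote sample averages.

```latex
% With probability at least 1 - \delta, the true mean of a variable with
% range R is at least the sample mean over n observations minus \epsilon:
\epsilon = \sqrt{\frac{R^{2}\,\ln(1/\delta)}{2n}}

% Split once the observed gap between the two best attributes exceeds \epsilon:
\bar{G}(X_{\text{best}}) - \bar{G}(X_{\text{2nd best}}) > \epsilon
```

Because ε shrinks as 1/√n, a large gap between the two best attributes is resolved after few examples, while a small gap needs many more.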

  12. Core VFDT Algorithm
      Procedure VFDT(Stream, δ)
        Let T = Tree with single leaf (root)
        Initialize sufficient statistics at root
        For each example (X, y) in Stream
          Sort (X, y) to leaf using T
          Update sufficient statistics at leaf
          Compute G for each attribute
          If G(best) – G(2nd best) > ε, then
            Split leaf on best attribute
            For each branch
              Start new leaf, init sufficient statistics
        Return T
      [Figure: example tree with tests x1? (male / female) and x2? (> 65 / <= 65) and leaves y=0, y=0, y=1]
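The core loop above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the VFDT system itself: attributes are discrete, G is information gain (so the range R is log2 of the number of classes), splits are checked every `check_every` examples, and there is no tie handling or memory management. The class and function names (`Leaf`, `Node`, `vfdt`, `predict`) and the `n_classes` and `check_every` parameters are hypothetical.

```python
import math
from collections import defaultdict
from itertools import cycle, islice

def hoeffding_eps(value_range, delta, n):
    return math.sqrt(value_range ** 2 * math.log(1 / delta) / (2 * n))

def entropy(counts):
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values() if c)

class Leaf:
    def __init__(self, attrs):
        self.attrs = attrs
        self.n = 0
        self.class_counts = defaultdict(int)
        # Sufficient statistics: stats[attr][value][label] -> count.
        self.stats = {a: defaultdict(lambda: defaultdict(int)) for a in attrs}

    def update(self, x, y):
        self.n += 1
        self.class_counts[y] += 1
        for a in self.attrs:
            self.stats[a][x[a]][y] += 1

    def gain(self, attr):
        remainder = sum(sum(cc.values()) / self.n * entropy(cc)
                        for cc in self.stats[attr].values())
        return entropy(self.class_counts) - remainder

    def predict(self):
        # Most common class seen at this leaf (None if no data yet).
        return max(self.class_counts, key=self.class_counts.get) if self.n else None

class Node:
    def __init__(self, attr, attrs_left):
        self.attr, self.attrs_left, self.children = attr, attrs_left, {}

    def child(self, value):
        # New leaves are created lazily with fresh sufficient statistics.
        if value not in self.children:
            self.children[value] = Leaf(self.attrs_left)
        return self.children[value]

def vfdt(stream, attrs, delta, n_classes=2, check_every=50):
    root = Leaf(list(attrs))
    for x, y in stream:
        parent, leaf = None, root
        while isinstance(leaf, Node):          # sort the example to a leaf
            parent, leaf = leaf, leaf.child(x[leaf.attr])
        leaf.update(x, y)                      # then discard the example
        if leaf.n % check_every or len(leaf.attrs) < 2 or len(leaf.class_counts) < 2:
            continue
        ranked = sorted(leaf.attrs, key=leaf.gain, reverse=True)
        eps = hoeffding_eps(math.log2(n_classes), delta, leaf.n)
        if leaf.gain(ranked[0]) - leaf.gain(ranked[1]) > eps:
            node = Node(ranked[0], [a for a in leaf.attrs if a != ranked[0]])
            if parent is None:
                root = node
            else:
                parent.children[x[parent.attr]] = node
    return root

def predict(tree, x):
    while isinstance(tree, Node):
        tree = tree.child(x[tree.attr])
    return tree.predict()

# Toy stream (assumed): the class equals x1, and x2 is irrelevant.
combos = [(0, 0), (0, 1), (1, 0), (1, 1)]
stream = (({"x1": a, "x2": b}, a) for a, b in islice(cycle(combos), 400))
tree = vfdt(stream, ["x1", "x2"], delta=1e-7)
```

On this toy stream the gap between the gains of `x1` and `x2` is close to one bit, so the root splits on `x1` at the first check. Each example is read once and only the counts are kept.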

  13. Quality of Trees from VFDT • Model may contain incorrect splits; is it still useful? • Bound the difference from the infinite-data tree • Chance an arbitrary example takes a different path • Intuition: example on level i of tree has i chances to go through a mistaken node

  14. Complete VFDT System • Memory management • Memory dominated by sufficient statistics • Deactivate less promising leaves when needed • Ties: • Wasteful to decide between identical attributes • Check for splits periodically • Pre-pruning • Only make splits that improve the value of G(.) • Early stop on bad attributes

  15. VFDT (Continued) • Bootstrap with traditional learner • Rescan dataset when time available • Time changing data streams • Post pruning • Continuous attributes • Batch mode

  16. Experiments • Compared VFDT and C4.5 (Quinlan, 1993) • Same memory limit for both (40 MB) • 100k examples for C4.5 • VFDT settings: δ = 10^-7, τ = 5% • Domains: 2 classes, 100 binary attributes • Fifteen synthetic trees 2.2k – 500k leaves • Noise from 0% to 30%

  17. Running Times • Pentium III at 500 MHz running Linux • C4.5 takes 35 seconds to read and process 100k examples; VFDT takes 47 seconds • VFDT takes 6377 seconds for 20 million examples: 5752 s to read, 625 s to process • VFDT processes 32k examples per second (excluding I/O)
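The throughput figure follows directly from the timings on this slide. The short check below uses only the numbers quoted above; the variable names are for illustration.

```python
examples = 20_000_000
total_s, read_s, process_s = 6377, 5752, 625

# Reading and processing together account for the full run time.
assert read_s + process_s == total_s

throughput_excl_io = examples / process_s   # 32000.0 examples per second
throughput_incl_io = examples / total_s     # roughly 3.1k examples per second
io_fraction = read_s / total_s              # about 0.90 of the time is I/O
```

In other words, on this run about 90% of the wall-clock time went to reading the data rather than to learning.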

  18. Real World Data Sets: Trace of UW Web Requests • Stream of Web page requests from UW • One week: 23k clients, 170 orgs., 244k hosts, 82.8M requests (peak: 17k/min), 20 GB • Goal: improve cache by predicting requests • 1.6M examples, 61% default class • C4.5 on 75k examples: 2975 secs., 73.3% accuracy • VFDT: ~3000 secs., 74.3% accuracy

  19. Outline • Introduction • Scaling up Decision Trees • Our Framework for Scaling • Overview of Applications and Results • Conclusion

  20. Data Mining as Discrete Search • Initial state • Empty – prior – random • Search operators • Refine structure • Evaluation function • Likelihood – many other • Goal state • Local optimum, etc. ...

  21. Data Mining As Search [Figure: search states, each scored against the training data (example scores 1.5 to 2.0)]

  22. Example: Decision Tree • Initial state • Root node • Search operators • Turn any leaf into a test on attribute • Evaluation • Entropy Reduction • Goal state • No further gain • Post prune [Figure: candidate trees that split on X1 ... Xd, each scored against the training data]

  23. Overview of Framework • Cast the learning algorithm as a search • Begin monitoring data stream • Use each example to update sufficient statistics where appropriate (then discard it) • Periodically pause and use statistical tests • Take steps that can be made with high confidence • Monitor old search decisions • Change them when data stream changes

  24. How Much Data is Enough? [Figure: candidate splits X1 ... Xd scored on the full training data, e.g. 1.65 vs. 1.38]

  25. How Much Data is Enough? • Use statistical bounds • Normal distribution • Hoeffding bound • Applies to scores that are averages over examples • Can select a winner if • Score1 > Score2 + ε [Figure: candidate splits X1 ... Xd scored on a sample of the data, e.g. 1.6 +/- ε vs. 1.4 +/- ε]
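The winner test on this slide can be sketched as a small Python helper. This is an illustrative sketch, not the framework's implementation: it uses the Hoeffding bound only, the names `hoeffding_eps` and `select_winner` are hypothetical, and the score range of 1.0 and δ of 1e-7 in the example are assumed values. The two scores are the ones shown in the slide's figure.

```python
import math

def hoeffding_eps(value_range, delta, n):
    """Width of the Hoeffding interval for a mean over n examples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

def select_winner(scores, n, value_range, delta):
    """Return the best candidate if Score1 > Score2 + eps, otherwise None.

    `scores` maps a candidate name to its average score over the n
    examples seen so far.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    best, second = ranked[0], ranked[1]
    eps = hoeffding_eps(value_range, delta, n)
    return best if scores[best] > scores[second] + eps else None

scores = {"X1": 1.6, "Xd": 1.4}
early = select_winner(scores, n=100, value_range=1.0, delta=1e-7)  # eps is about 0.28
later = select_winner(scores, n=500, value_range=1.0, delta=1e-7)  # eps is about 0.13
```

With these assumed settings the gap of 0.2 is not yet significant after 100 examples (`early` is `None`), but it is after 500 (`later` is `"X1"`).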

  26. Global Quality Guarantee • δ – probability of error in single decision • b – branching factor of search • d – depth of search • c – number of checks for winner δ* = δbdc
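Reading δbdc on this slide as the product δ · b · d · c (a union bound over every decision the search makes), the relationship can be used in both directions: to see what global guarantee a given per-decision δ buys, or to choose δ for a target δ*. The sketch below rests on that reading. The function names and the example values of b, d, and c are assumptions for illustration.

```python
import math

def global_error_bound(delta, b, d, c):
    """Upper bound on the chance that any decision in the search is wrong."""
    return delta * b * d * c

def per_decision_delta(delta_star, b, d, c):
    """Per-decision delta needed to reach a target global bound delta_star."""
    return delta_star / (b * d * c)

# Assumed example: 100 candidates per step, 10 steps, 20 checks per step.
delta_star = global_error_bound(1e-7, b=100, d=10, c=20)
assert math.isclose(delta_star, 2e-3)
```

Because the Hoeffding bound depends on δ only through ln(1/δ), tightening δ by several orders of magnitude to cover many decisions costs only a modest increase in the number of examples.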

  27. Identical States And Ties • Fails if states are identical (or nearly so) • τ – user-supplied tie parameter • Select winner early if alternatives differ by less than τ • Score1 > Score2 + ε or • ε <= τ
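The tie rule puts a ceiling on how long a decision can be delayed: once ε has shrunk to τ, a winner is chosen regardless of the gap. The sketch below, which assumes the Hoeffding bound and hypothetical function names, combines the two conditions and computes how many examples that ceiling corresponds to. The example uses the settings from the experiments slide (δ = 10^-7, τ = 5%) and assumes a score range of 1, which is what information gain has with two classes.

```python
import math

def should_split(score1, score2, eps, tau):
    """Pick a winner if it is clearly better, or if the candidates are tied."""
    return score1 > score2 + eps or eps <= tau

def examples_until_tie_break(value_range, delta, tau):
    """Smallest n for which the Hoeffding eps is at most tau."""
    return math.ceil(value_range ** 2 * math.log(1.0 / delta) / (2.0 * tau ** 2))

n_tie = examples_until_tie_break(value_range=1.0, delta=1e-7, tau=0.05)  # 3224
```

Under these assumptions no decision waits for more than about 3,200 examples, even when the two best candidates are identical.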

  28. Dealing with Time Changing Concepts • Maintain a window of the most recent examples • Keep model up to date with this window • Effective when window size similar to concept drift rate • Traditional approach • Periodically reapply learner • Very inefficient! • Our approach • Monitor quality of old decisions as window shifts • Correct decisions in fine-grained manner
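Keeping a model up to date with a window means adding the newest example to the sufficient statistics and subtracting the oldest one as it leaves the window. A minimal sketch of that bookkeeping is shown below, assuming discrete attributes and simple co-occurrence counts. The class name `WindowedCounts` and its methods are hypothetical, not part of the presented system.

```python
from collections import Counter, deque

class WindowedCounts:
    """Counts of (attribute, value, label) over the most recent examples."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.window = deque()
        self.counts = Counter()

    def add(self, x, y):
        """Incorporate a new example and forget the oldest if the window is full."""
        self.window.append((x, y))
        for attr, value in x.items():
            self.counts[(attr, value, y)] += 1
        if len(self.window) > self.window_size:
            self._forget(*self.window.popleft())

    def _forget(self, x, y):
        """Subtract an example that has slid out of the window."""
        for attr, value in x.items():
            key = (attr, value, y)
            self.counts[key] -= 1
            if self.counts[key] == 0:
                del self.counts[key]

w = WindowedCounts(window_size=2)
w.add({"x1": "a"}, 0)
w.add({"x1": "b"}, 1)
w.add({"x1": "a"}, 1)   # the first example now falls out of the window
```

Each update costs time proportional to the number of attributes, independent of the window size, which is what makes fine-grained correction cheaper than periodically reapplying the learner.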

  29. Alternate Searches • When new test looks better grow alternate sub-tree • Replace the old when new is more accurate • This smoothly adjusts to changing concepts [Figure: tree with tests Gender?, Pets?, College?, and Hair? and true/false leaves, showing an alternate sub-tree]

  30. RAM Limitations • Each search requires sufficient statistics structure • Decision Tree • O(avc) RAM • Bayesian Network • O(c^p) RAM

  31. RAM Limitations [Figure: searches marked as active or temporarily inactive]

  32. Outline • Introduction • Data Mining as Discrete Search • Our Framework for Scaling • Application to Decision Trees • Other Applications and Results • Conclusion

  33. Applications • VFDT (KDD ’00) – Decision Trees • CVFDT (KDD ’01) – VFDT + concept drift • VFBN & VFBN2 (KDD ’02) – Bayesian Networks • Continuous Searches • VFKM (ICML ’01) – K-Means clustering • VFEM (NIPS ’01) – EM for mixtures of Gaussians • Relational Data Sets • VFREL (Submitted) – Feature selection in relational data

  34. CVFDT Experiments

  35. Activity Profile for VFBN

  36. Other Real World Data Sets • Trace of all web requests from UW campus • Use clustering to find good locations for proxy caches • KDD Cup 2000 data set • 700k page requests from an e-commerce site • Categorize pages into 65 categories, predict which a session will visit • UW CSE data set • 8 million sessions over two years • Predict which of 80 level-2 directories each session visits • Web crawl of .edu sites • Two data sets, each with two million web pages • Use relational structure to predict which will increase in popularity over time

  37. Related Work • DB Mine: A Performance Perspective (Agrawal, Imielinski, Swami ’93) • Framework for scaling rule learning • RainForest (Gehrke, Ramakrishnan, Ganti ’98) • Framework for scaling decision trees • ADtrees (Moore, Lee ’97) • Accelerate computing sufficient stats • PALO (Greiner ’92) • Accelerate hill climbing search via sampling • DEMON (Ganti, Gehrke, Ramakrishnan ’00) • Framework for converting incremental algs. for time changing data streams

  38. Future Work • Combine framework for discrete search with frameworks for continuous search and relational learning • Further study time changing processes • Develop a language for specifying data stream learning algorithms • Use framework to develop novel algorithms for massive data streams • Apply algorithms to more real-world problems

  39. Conclusion • Framework helps scale up learning algorithms based on discrete search • Resulting algorithms: • Work on databases and data streams • Work with limited resources • Adapt to time changing concepts • Learn in time proportional to concept complexity • Independent of amount of training data! • Benefits have been demonstrated in a series of applications
