Data Stream Management Systems: Checkpoint • CS240B Notes by Carlo Zaniolo, UCLA CSD • With slides from a KDD04 tutorial by Haixun Wang, Jian Pei & Philip Yu
Mining Data Streams: Challenges • On-line response (NB), limited memory, most recent windows only • Fast & Light algorithms needed: • Must minimize usage of memory and CPU • Must require only one (or a few) passes through the data • Concept shift/drift: changes in the statistics of the mining set • Render previously learned models inaccurate or invalid • Robustness and Adaptability: quickly recover/adjust after concept changes • Popular machine learning algorithms no longer effective: • Neural nets: slow learners that require many passes • Support Vector Machines (SVM): computationally expensive • Apriori: many passes and expensive (association rule mining is difficult on data streams)
The Decision Tree Classifier • Learning (Training): • Input: a data set of pairs (a, b), where a is a vector and b a class label • Output: a model (decision tree) • Testing: • Input: a test sample (x, ?) • Output: a class label prediction for x
Decision Tree Classifiers • A divide-and-conquer approach • Simple algorithm, intuitive model • Typically a decision tree grows one level for each scan of the data • Multiple scans are required • But if we can use small samples, this problem disappears • But the data structure is not ‘stable’ • Subtle changes in the data can cause global changes in the data structure
Stable Trees Using Samples • How many samples do we need to build a tree in constant time that is nearly identical to the tree a batch learner (C4.5, SPRINT, ...) would build? • Nearly identical? • Categorical attributes: • with high probability, the attribute we choose for a split is the same attribute that a batch learner would choose • identical decision tree • Continuous attributes: • discretize them into categorical ones • ...Forget concept changes for now
Hoeffding Trees • The Hoeffding bound is applied to the information gain • Error decreases as n (the number of samples) increases • At each node, we accumulate enough samples (n) before we make a split • Scales better than traditional DT algorithms • Incremental: nodes are created incrementally as new samples stream in • Sub-linear with sampling • Small memory requirement • Cons: • Only considers the top 2 attributes • Tie breaking takes time • Growing a deep tree takes time • Discrete attributes only
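The split test on the Hoeffding Trees slide can be sketched in a few lines. The Hoeffding bound states that, with probability 1 - δ, the true mean of a variable with range R lies within ε = sqrt(R² ln(1/δ) / (2n)) of the mean observed over n samples. The function names below are illustrative, and taking R = log2(number of classes) as the range of the information gain is an assumption of this sketch, not something stated on the slides.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Epsilon such that, with probability 1 - delta, the observed mean of n
    samples is within epsilon of the true mean (variable range = value_range)."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def samples_needed(value_range: float, delta: float, epsilon: float) -> int:
    """Smallest n for which the Hoeffding bound is at most epsilon."""
    return math.ceil(value_range ** 2 * math.log(1.0 / delta) / (2.0 * epsilon ** 2))


def should_split(gain_best: float, gain_second: float,
                 n_classes: int, delta: float, n: int) -> bool:
    """Split when the gap between the top two attributes exceeds epsilon.

    Assumption: information gain lies in [0, log2(n_classes)].
    """
    value_range = math.log2(n_classes)
    return (gain_best - gain_second) > hoeffding_bound(value_range, delta, n)
```

With two classes and δ = 1e-7, a gain gap of 0.05 is not enough to split after 1,000 samples but is enough after 10,000, which illustrates why a node waits until it has accumulated enough samples.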
VFDT • Very Fast Decision Tree [Domingos, Hulten 2000] • Several improvements: faster and less memory • Concept changes? A naïve approach: • Place a sliding window on the stream • Reapply C4.5 or VFDT whenever the window moves • Time consuming!
CVFDT • Concept-adapting VFDT • Hulten, Spencer, Domingos, 2001 • Goal • Classifying concept-drifting data streams • Approach • Make use of the Hoeffding bound • Incorporate “windowing” • Monitor changes in the information gain of attributes • If the change reaches a threshold, generate an alternate subtree with the new “best” attribute, but keep it in the background • Replace the old subtree if the new one becomes more accurate
Classifiers for Data Streams • Fast and Light Classifiers: • Naïve Bayesian: one pass to count occurrences • Sliding windows, tumbles and slides • Adaptive Nearest Neighbor Classification Algorithm (ANNCAD) • Ensembles of Classifiers (decision trees or others) • Bagging Ensembles and • Boosting Ensembles
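As an illustration of "one pass to count occurrences", here is a minimal sketch of a Naïve Bayesian classifier over categorical attributes that updates its counts one tuple at a time. The class name and the use of Laplace (add-one) smoothing are assumptions of this sketch, not details given on the slides.

```python
import math
from collections import defaultdict


class StreamingNaiveBayes:
    """Categorical Naive Bayes maintained with a single pass over the stream."""

    def __init__(self):
        self.n = 0
        self.class_counts = defaultdict(int)    # label -> count
        self.feature_counts = defaultdict(int)  # (attr index, value, label) -> count
        self.values_seen = defaultdict(set)     # attr index -> distinct values

    def update(self, x, label):
        """Fold one labeled tuple into the counts (constant work per attribute)."""
        self.n += 1
        self.class_counts[label] += 1
        for i, v in enumerate(x):
            self.feature_counts[(i, v, label)] += 1
            self.values_seen[i].add(v)

    def predict(self, x):
        """Return the label with the highest smoothed log-posterior, or None."""
        best_label, best_score = None, -math.inf
        for label, c in self.class_counts.items():
            score = math.log(c / self.n)
            for i, v in enumerate(x):
                k = len(self.values_seen[i]) or 1
                count = self.feature_counts.get((i, v, label), 0)
                score += math.log((count + 1) / (c + k))  # add-one smoothing
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Because the model is nothing but counts, memory depends on the number of distinct attribute values and classes rather than on the length of the stream.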
Basic Ideas • Stream partitioned into sequential chunks • Train a classifier from each chunk • Accuracy of voting ensembles is normally better than that of a single classifier • Method 1: Bagging • Weighted voting: weights are assigned to classifiers based on their recent performance on the current test examples • Only the top K classifiers are used • Method 2: Boosting • Majority voting • Classifiers retired by age • Boosting used in training
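The weighted-voting scheme of Method 1 might be sketched as follows. Classifiers are modeled as plain callables, and each classifier's weight is simply its accuracy on the most recent labeled chunk; both choices are simplifying assumptions of this sketch (the cited ensemble papers use their own weighting formulas), and the function names are hypothetical.

```python
from collections import defaultdict


def accuracy(classifier, chunk):
    """Fraction of (x, y) pairs in a non-empty chunk that the classifier gets right."""
    correct = sum(1 for x, y in chunk if classifier(x) == y)
    return correct / len(chunk)


def select_top_k(classifiers, recent_chunk, k):
    """Weight every classifier by recent accuracy and keep the k best."""
    scored = [(accuracy(c, recent_chunk), c) for c in classifiers]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def weighted_vote(weighted_classifiers, x):
    """Each classifier adds its weight to the label it predicts; highest total wins."""
    votes = defaultdict(float)
    for weight, clf in weighted_classifiers:
        votes[clf(x)] += weight
    return max(votes, key=votes.get)
```

Re-scoring on every new chunk lets classifiers trained on an older concept fall out of the top K, and return later if that concept recurs.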
Mining Streams with Concept Changes • Changes are detected by a drop in accuracy or by other methods • Build new classifiers on new windows • Search among the old classifiers for those that have now become accurate again
Boosting Ensembles for Adaptive Mining of Data Streams • Andrea Fang Chu, Carlo Zaniolo [PAKDD 2004]
Mining Data Streams: Desiderata • Fast learning (preferably in one pass over the data) • Light requirements (low time complexity, low memory requirement) • Adaptation (the model always reflects the time-changing concept)
Adaptive Boosting Ensembles • The training stream is split into blocks (i.e., windows) • Each individual classifier is learned from one block • A boosting ensemble of 7 to 19 members is maintained over time • Decisions are taken by simple majority • As the (N+1)-th classifier is built, boost the weights of the tuples misclassified by the first N • Change detection is explored to achieve adaptation
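The block-by-block procedure above can be sketched as follows. This is a simplified illustration, not the published algorithm: the weak learner is a decision stump over a single numeric attribute with 0/1 labels, misclassified tuples get a fixed hypothetical weight `boost`, and the oldest member is retired once the ensemble is full. All names are assumptions of this sketch.

```python
from collections import Counter, deque


def majority_vote(ensemble, x):
    """Simple (unweighted) majority decision of all ensemble members."""
    votes = Counter(clf(x) for clf in ensemble)
    return votes.most_common(1)[0][0]


def boosted_weights(ensemble, block, boost=2.0):
    """Weight `boost` for tuples the current ensemble misclassifies, 1.0 otherwise."""
    if not ensemble:
        return [1.0] * len(block)
    return [boost if majority_vote(ensemble, x) != y else 1.0 for x, y in block]


def train_stump(block, weights):
    """Weak learner: the threshold rule with the lowest weighted error on the block."""
    best = None
    for threshold in sorted({x for x, _ in block}):
        for above in (0, 1):  # label predicted when x >= threshold
            err = sum(w for (x, y), w in zip(block, weights)
                      if (above if x >= threshold else 1 - above) != y)
            if best is None or err < best[0]:
                best = (err, threshold, above)
    _, threshold, above = best
    return lambda x: above if x >= threshold else 1 - above


def update_ensemble(ensemble, block):
    """Learn one classifier from the new block, boosted against the current ensemble."""
    weights = boosted_weights(ensemble, block)
    ensemble.append(train_stump(block, weights))  # deque(maxlen=...) drops the oldest
    return ensemble
```

Creating the ensemble as `deque(maxlen=7)` keeps its size fixed; an odd size avoids ties in the majority vote for two-class problems.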
Fast and Light • Experiments show that boosting ensembles of “weak learners” provide accurate prediction • Weak Learners • An aggressively pruned decision tree, e.g., a shallow tree (this means fast!) • Trained on a small set of examples (this means light in memory requirements!)
Adaptation: Detect changes that cause significant drops in ensemble performance • gradual changes: concept drift • abrupt changes: concept shift
Adaptability • The error rate is viewed as a random variable • When it rises significantly above the recent average (i.e., performance drops significantly), the whole ensemble is dropped • And a new one is quickly re-learned • Cost/performance of boosting ensembles is better than that of bagging ensembles [KDD04] • BUT ???
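One way to treat the error rate as a random variable is a simple significance test against the recent average. The sketch below assumes the per-block error rate is approximately binomial and flags a change when the current block's error exceeds the recent mean by more than z standard deviations. The class name, the history length, and the threshold z are hypothetical, and the cited papers may use a different test.

```python
import math
from collections import deque


class ErrorRateMonitor:
    """Flags a significant rise of the block error rate over its recent average."""

    def __init__(self, history=10, z=3.0):
        self.recent = deque(maxlen=history)  # error rates of recent blocks
        self.z = z

    def significant_rise(self, block_error, block_size):
        """Return True when the ensemble should be dropped and re-learned."""
        if not self.recent:
            self.recent.append(block_error)
            return False
        p = sum(self.recent) / len(self.recent)
        std = math.sqrt(p * (1.0 - p) / block_size)  # binomial assumption
        changed = block_error > p + self.z * std
        if changed:
            self.recent.clear()  # the history belongs to the discarded ensemble
        else:
            self.recent.append(block_error)
        return changed
```

A caller would invoke `significant_rise` once per block and, on `True`, discard the ensemble and start training a new one from the current block.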
References • Haixun Wang, Wei Fan, Philip S. Yu, Jiawei Han. Mining Concept Drifting Data Streams using Ensemble Classifiers. In the ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), 2003. • Pedro Domingos, Geoff Hulten. Mining High Speed Data Streams. In the ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), 2000. • Geoff Hulten, Laurie Spencer, Pedro Domingos. Mining Time-Changing Data Streams. In the ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), 2001. • Wei Fan, Yi-an Huang, Haixun Wang, Philip S. Yu. Active Mining of Data Streams. In the SIAM International Conference on Data Mining (SIAM DM), 2004. • Fang Chu, Yizhou Wang, Carlo Zaniolo. An Adaptive Learning Approach for Noisy Data Streams. In the 4th IEEE International Conference on Data Mining (ICDM), 2004. • Fang Chu, Carlo Zaniolo. Fast and Light Boosting for Adaptive Mining of Data Streams. PAKDD 2004: 282-292. • Yan-Nei Law, Carlo Zaniolo. An Adaptive Nearest Neighbor Classification Algorithm for Data Streams. In the ECML/PKDD Conference, Porto, Portugal, October 3-7, 2005.