1 / 20

A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions

This paper introduces a framework for mining concept-drifting data streams with skewed distributions using sampling and ensemble techniques, providing accurate and efficient classification models. It also discusses general issues in stream classification and showcases experiments on synthetic and real data.

Download Presentation

A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions Jing Gao† Wei Fan‡ Jiawei Han† Philip S. Yu‡ †University of Illinois at Urbana-Champaign ‡IBM T. J. Watson Research Center

  2. 1 1 0 0 1 0 1 1 1 0 1 Introduction (1) • Data Stream • Continuously arriving data flow • Applications: network traffic, credit card transaction flow, phone calling records, etc.

  3. Introduction (2) • Stream Classification • Construct a classification model based on past records • Use the model to predict labels for new data • Help decision making Classification model Fraud? Labeling Fraud

  4. Framework ? ……… ……… Classification Model Predict

  5. Concept Drifts • Changes in P(x,y) • P(x,y)=P(y|x)P(x) x-feature vector, y-class label • No Change, Feature Change, Conditional Change, Dual Change • Expected error is not a good indicator of concept drifts • Training on the most recent data could help reduce expected error Time Stamp 1 Time Stamp 11 Time Stamp 21

  6. Issues in Stream Classification(1) • Generative Model • P(y|x) follows some distribution • Descriptive Model • Let data decides • Stream Data • Distribution unknown and evolving

  7. Issues in Stream Classification(2) • Label Prediction • Classify x into one class • Probability Estimation • x is assigned to all classes with different probabilities • Stream Applications • Stochastic, prediction confidence information is needed

  8. Mining Skewed Data Stream - • Skewed Distribution • Credit card frauds, network intrusions • Existing Stream Classification Algorithms • Evaluated on balanced data • Problems • Ignore minority examples • The cost of misclassifying minority examples is usually huge + Classify every leaf node as negative

  9. Stream Ensemble Approach (1) ? ……… ……… Training set?Insufficient positive examples! Step 1 Sampling

  10. Stream Ensemble Approach (2) 1 2 …… k C1 C2 …… Ck Step 2 Ensemble

  11. Why this approach works? • Incorporation of old positive examples • increase the training size, reduce variance • negative examples reflect current concepts, so the increase in boundary bias is small • Ensemble • reduce variance caused by single model • disjoint sets of negative examples—the classifiers will make uncorrelated errors • Bagging & Boosting • running cost is much higher • cannot generate reliable probability estimates for skewed distributions

  12. Analysis • Error Reduction • Sampling • Ensemble • Efficiency Analysis • Single model • Ensemble • Ensemble is more efficient

  13. Experiments • Measures • Mean Squared Error • ROC Curve • Recall-Precision Curve • Baseline Methods • NS: No sampling +Single Model • SS: Sampling + Single Model • SE: Sampling + Ensemble

  14. Experimental Results (1) Feature Change only P(x) changes Dual Change both P(x) and P(y|x) changes Conditional Change only P(y|x) changes Mean Squared Error on Synthetic Data

  15. Experimental Results (2) Mean Squared Error on Real Data

  16. Experimental Results (3) Plots on Synthetic Data ROC Curve Recall-Precision Plot

  17. Experimental Results (4) Plots on Real Data ROC Curve Recall-Precision Plot

  18. Experimental Results (5) Training Time

  19. Conclusions • General issues in stream classification • concept drifts • descriptive model • probability estimation • Mining skewed data streams • sampling and ensemble techniques • accurate and efficient • Wide applications • graph data • airforce data

  20. Thanks! • Any questions?

More Related