
Classification and Novel Class Detection in Data Streams


Presentation Transcript


  1. Classification and Novel Class Detection in Data Streams • Mehedy Masud¹, Latifur Khan¹, Jing Gao², Jiawei Han², and Bhavani Thuraisingham¹ • ¹Department of Computer Science, University of Texas at Dallas • ²Department of Computer Science, University of Illinois at Urbana-Champaign • This work was funded in part by

  2. Presentation Overview • Stream Mining Background • Novel Class Detection – Concept Evolution

  3. Data Streams • Continuous flows of data • Examples: network traffic, sensor data, call center records • Data streams are potentially unbounded and evolve over time (see Challenges)

  4. Data Stream Classification • (Figure: network traffic scenario with firewall, server, classification model, attack vs. benign traffic, expert analysis and labeling, block and quarantine, model update) • Uses past labeled data to build a classification model • Predicts the labels of future instances using the model • Helps decision making

  5. Challenges (Introduction) • Infinite length • Concept-drift • Concept-evolution (emergence of novel classes) • Recurrence (seasonal) classes

  6. Infinite Length • (Figure: an unbounded stream of bits, e.g. 1 0 0 1 1 1 0 0 0 1 1 0 ...) • Impractical to store and use all historical data • Requires infinite storage and running time

  7. Concept-Drift • (Figure: a data chunk of positive and negative instances; the decision hyperplane shifts from its previous to its current position, and instances near the old boundary become victims of concept-drift)

  8. Concept-Evolution • (Figure: a two-dimensional feature space with axes x and y, split by thresholds x1, y1, y2 into regions A, B, C, D containing '+' and '-' instances; a cluster of 'X' instances from a novel class appears in a region the rules assign to an existing class) • Classification rules: • R1. if (x > x1 and y < y2) or (x < x1 and y < y1) then class = + • R2. if (x > x1 and y > y2) or (x < x1 and y > y1) then class = - • Existing classification models misclassify novel class instances
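
A minimal sketch of rules R1/R2 from the figure, with placeholder values assumed for the thresholds x1, y1, y2: because the rules only ever return one of the existing labels, an instance from an emerging class is silently forced into '+' or '-'.

```python
# Sketch of the slide's rules R1/R2; the threshold values are placeholders,
# not taken from the figure.
def classify(x, y, x1=0.5, y1=0.4, y2=0.6):
    """Rule-based classifier with a fixed set of known classes (+/-)."""
    if (x > x1 and y < y2) or (x < x1 and y < y1):
        return "+"          # rule R1
    return "-"              # rule R2 covers everything else

# An instance drawn from an emerging (novel) class still receives one of the
# existing labels, because the model has no way to say "unknown".
print(classify(0.9, 0.9))   # forced to '-' even if it belongs to a new class
```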

  9. Background: Ensemble of Classifiers • (Figure: an unlabeled input instance x is passed to classifiers C1, C2, C3; their individual outputs (+, +, -) are combined by voting into the ensemble output +)
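
A small illustration of the voting scheme on this slide; the tiny training set and the decision-tree base learners are purely hypothetical, and any scikit-learn-style models would work.

```python
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

# Three toy classifiers trained on hypothetical data.
X_train, y_train = [[0.0], [1.0]], ["+", "-"]
classifiers = [DecisionTreeClassifier().fit(X_train, y_train) for _ in range(3)]

def ensemble_predict(classifiers, x):
    """Each classifier gives an individual output; the majority vote is the
    ensemble output."""
    votes = [clf.predict([x])[0] for clf in classifiers]
    return Counter(votes).most_common(1)[0][0]

print(ensemble_predict(classifiers, [0.2]))  # e.g. '+'
```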

  10. Background: Ensemble Classification of Data Streams • Divide the data stream into equal-sized chunks • Train a classifier from each labeled data chunk (note: a chunk Di may contain data points from different classes) • Keep only the best L such classifiers as the ensemble (example: L = 3) • Addresses infinite length and concept-drift • (Figure: labeled chunks D1-D5 yield classifiers C1-C5; an ensemble of the best L classifiers predicts the labels of the latest unlabeled chunk D6)
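
The chunk-based ensemble maintenance could look roughly like the sketch below. The decision-tree base learner and the choice to rank candidates by accuracy on the latest labeled chunk are illustrative assumptions, not necessarily the exact criterion used in the cited papers.

```python
from sklearn.tree import DecisionTreeClassifier

def update_ensemble(ensemble, chunk_X, chunk_y, L=3):
    """Train a classifier on the newest labeled chunk, then keep only the
    best L classifiers as ranked on that chunk; older, poorly performing
    models fall out of the ensemble."""
    new_clf = DecisionTreeClassifier().fit(chunk_X, chunk_y)
    candidates = ensemble + [new_clf]
    candidates.sort(key=lambda c: c.score(chunk_X, chunk_y), reverse=True)
    return candidates[:L]
```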

  11. Examples of Recurrence and Novel Classes (Introduction) • Twitter stream: a stream of messages • Each message may be given a category or "class" based on its topic • Examples: "Election 2012", "London Olympic", "Halloween", "Christmas", "Hurricane Sandy", etc. • Among these, "Election 2012" and "Hurricane Sandy" are novel classes because they are new events • "Halloween" is a recurrence class because it "recurs" every year

  12. Concept-Evolution and Feature Space (Introduction) • (Same feature-space figure and classification rules R1/R2 as slide 8) • Existing classification models misclassify novel class instances

  13. Novel Class Detection – Prior Work • Three steps: • Training and building the decision boundary • Outlier detection and filtering • Computing cohesion and separation

  14. Training: Creating Decision Boundary (Prior work) • Training is done chunk-by-chunk (one classifier per chunk); an ensemble of classifiers is used for classification • The raw training data of a chunk is clustered, and each cluster becomes a pseudopoint; together the pseudopoints define the model's decision boundary • Addresses the infinite length problem • (Figure: raw '+'/'-' training data in the feature space is replaced by cluster pseudopoints)
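
A rough sketch of how a chunk might be summarized into pseudopoints. The use of K-means, the number of clusters k, and the max-distance radius are illustrative assumptions rather than the exact procedure from the prior work.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_pseudopoints(chunk_X, k=5):
    """Summarize a chunk's raw training data as k cluster 'pseudopoints'
    (centroid, radius); the union of these hyperspheres serves as the
    model's decision boundary."""
    chunk_X = np.asarray(chunk_X)
    km = KMeans(n_clusters=k, n_init=10).fit(chunk_X)
    pseudopoints = []
    for c in range(k):
        members = chunk_X[km.labels_ == c]
        centroid = km.cluster_centers_[c]
        radius = float(np.max(np.linalg.norm(members - centroid, axis=1)))
        pseudopoints.append((centroid, radius))
    return pseudopoints
```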

  15. Outlier Detection and Filtering (Prior work) • A test instance inside the decision boundary is not an outlier; a test instance outside the decision boundary is a raw outlier (Routlier) • A test instance x is checked against all L models M1, ..., ML of the ensemble: if every model flags x as a Routlier, x is a filtered outlier (Foutlier), a potential novel class instance; otherwise x is an existing class instance • Routliers may appear as a result of novel classes, concept-drift, or noise; therefore they are filtered to reduce noise as much as possible
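
Continuing the sketch above, the Routlier and Foutlier checks against pseudopoint-based decision boundaries might look like this; the hypersphere test is an assumption consistent with the centroid/radius pseudopoints sketched earlier.

```python
import numpy as np

def is_routlier(x, pseudopoints):
    """A test instance is a raw outlier (Routlier) for one model if it falls
    outside every pseudopoint hypersphere of that model's decision boundary."""
    x = np.asarray(x)
    return all(np.linalg.norm(x - centroid) > radius
               for centroid, radius in pseudopoints)

def is_foutlier(x, models):
    """x becomes a filtered outlier (Foutlier), i.e. a potential novel class
    instance, only if all L models agree it is a Routlier; otherwise it is
    treated as an existing class instance."""
    return all(is_routlier(x, pseudopoints) for pseudopoints in models)
```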

  16. Computing Cohesion & Separation (Prior work) • λo,q(x): the q nearest Foutlier neighbors of an Foutlier x; λc,q(x): its q nearest neighbors in existing class c (q = 5 in the figure) • a(x) = mean distance from an Foutlier x to the instances in λo,q(x) • bc(x) = mean distance from x to the instances in λc,q(x); bmin(x) = minimum among all bc(x) (e.g. b+(x) in the figure) • q-Neighborhood Silhouette Coefficient: q-NSC(x) = (bmin(x) − a(x)) / max(a(x), bmin(x)) • If q-NSC(x) is positive, x is closer to the Foutliers (cohesion) than to any existing class (separation)
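
A direct translation of these definitions into code, assuming plain Euclidean distance and that the set of other Foutliers passed in excludes x itself.

```python
import numpy as np

def q_nsc(x, other_foutliers, existing_by_class, q=5):
    """q-Neighborhood Silhouette Coefficient of an Foutlier x.
    a(x): mean distance to its q nearest Foutliers (cohesion);
    b_min(x): smallest mean distance to the q nearest instances of any
    existing class (separation)."""
    x = np.asarray(x)

    def mean_q_nearest(points):
        d = np.sort(np.linalg.norm(np.asarray(points) - x, axis=1))
        return d[:q].mean()

    a = mean_q_nearest(other_foutliers)
    b_min = min(mean_q_nearest(pts) for pts in existing_by_class.values())
    return (b_min - a) / max(a, b_min)   # positive => x looks like a novel class
```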

  17. Limitation: Recurrence Class (Prior work) • (Figure: the stream as a sequence of chunks 0-50; a novel class appears in chunks 51-100, and the same class recurs in chunks 101-150)

  18. Why Recurrence Classes are Forgotten (Prior work) • Divide the data stream into equal-sized chunks • Train a classifier from each whole data chunk • Keep only the best L such classifiers in the ensemble (example: L = 3) • Addresses infinite length and concept-drift • Therefore, old models are discarded and old classes are "forgotten" after a while • (Figure: the same chunk/classifier pipeline as slide 10)

  19. CLAM: The Proposed Approach (Proposed method) • CLAss-based Micro-classifier ensemble (CLAM) • Training path: the latest labeled chunk trains a new model, which updates the ensemble M; M keeps all classes • Classification path: the latest unlabeled instance goes through outlier detection against M; if it is not an outlier it is classified with M as an existing class instance, otherwise it is buffered for novel class detection

  20. Training and Updating (Proposed method) • Each chunk is first separated into its different classes • A micro-classifier is trained from each class's data • Each new micro-classifier replaces one existing micro-classifier of its class • A total of L micro-classifiers make up a Micro-Classifier Ensemble (MCE) • C such MCEs, one per class, constitute the whole ensemble E (see the sketch below)
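
A minimal sketch of this class-based update, where `ensemble` maps each class label to its MCE. Replacing the oldest micro-classifier is an assumed policy, and `train_micro` stands in for whatever per-class model builder is used (for instance, the pseudopoint builder sketched earlier).

```python
from collections import defaultdict

def update_clam_ensemble(ensemble, chunk_X, chunk_y, train_micro, L=3):
    """ensemble: dict mapping class label -> its Micro-Classifier Ensemble
    (MCE), a list of up to L micro-classifiers."""
    per_class = defaultdict(list)
    for x, y in zip(chunk_X, chunk_y):       # separate the chunk by class
        per_class[y].append(x)
    for cls, X_cls in per_class.items():
        micro = train_micro(X_cls)           # one micro-classifier per class
        mce = ensemble.setdefault(cls, [])
        mce.append(micro)
        if len(mce) > L:
            mce.pop(0)                       # replace one existing micro-classifier
    return ensemble                          # classes are never dropped from E
```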

  21. CLAM: The Proposed Approach (Proposed method) • (CLAM overview diagram repeated from slide 19: training and update on labeled chunks; outlier detection, classification, and buffering for novel class detection on unlabeled instances)

  22. Outlier Detection and Classification (Proposed method) • A test instance x is first classified with each micro-classifier ensemble • Each micro-classifier ensemble gives a partial output (Yr) and an outlier flag (boolean) • If all ensembles flag x as an outlier, it is buffered and sent to the novel class detector • Otherwise, the partial outputs are combined and a class label is predicted (see the sketch below)
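
A sketch of this decision logic; combining the partial outputs by simple majority vote is an assumption, since the slide does not specify the combination rule.

```python
from collections import Counter

def classify_or_buffer(x, partial_results, novel_buffer):
    """partial_results: one (partial_output, outlier_flag) pair per
    Micro-Classifier Ensemble. If every MCE flags x as an outlier, x is
    buffered for novel class detection; otherwise the partial outputs are
    combined into the predicted label."""
    if all(flag for _, flag in partial_results):
        novel_buffer.append(x)               # potential novel class instance
        return None
    votes = [label for label, flag in partial_results if not flag]
    return Counter(votes).most_common(1)[0][0]
```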

  23. Evaluation • Competitors: • CLAM (CL) – proposed work • SCANR (SC) [1] – prior work • ECSMiner (EM) [2] – prior work • Olindda [3] with WCE [4] (OW) – another baseline • Datasets: Synthetic, KDD Cup 1999, and Forest Covertype
  1. M. M. Masud, T. M. Al-Khateeb, L. Khan, C. C. Aggarwal, J. Gao, J. Han, and B. M. Thuraisingham. Detecting recurring and novel classes in concept-drifting data streams. In Proc. ICDM '11, Dec. 2011, pp. 1176–1181.
  2. M. M. Masud, J. Gao, L. Khan, J. Han, and B. M. Thuraisingham. Classification and novel class detection in concept-drifting data streams under time constraints. IEEE Transactions on Knowledge and Data Engineering (TKDE), 23(6): 859–874, 2011.
  3. E. J. Spinosa, A. P. de Leon F. de Carvalho, and J. Gama. Cluster-based novel concept detection in data streams applied to intrusion detection in computer networks. In Proc. 2008 ACM Symposium on Applied Computing, pp. 976–980, 2008.
  4. H. Wang, W. Fan, P. S. Yu, and J. Han. Mining concept-drifting data streams using ensemble classifiers. In Proc. Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 226–235, Washington, DC, USA, Aug. 2003.

  24. Overall Error (Evaluation) • (Figure: error rates on (a) SynC20, (b) SynC40, (c) Forest, and (d) KDD)

  25. Number of Recurring Classes vs. Error (Evaluation)

  26. Error vs. Drift and Chunk Size (Evaluation)

  27. Summary Table (Evaluation)

  28. Conclusion • Detects recurrence • Improved accuracy • Running time • Reduced human interaction • Future work: use other base learners

  29. Questions?

  30. Thanks
