
Data Stream Classification and Novel Class Detection

Data Stream Classification and Novel Class Detection. Mehedy Masud, Latifur Khan, Qing Chen and Bhavani Thuraisingham, Department of Computer Science, University of Texas at Dallas; Jing Gao, Jiawei Han, Department of Computer Science, University of Illinois at Urbana-Champaign







  1. Data Stream Classification and Novel Class Detection
Mehedy Masud, Latifur Khan, Qing Chen and Bhavani Thuraisingham, Department of Computer Science, University of Texas at Dallas
Jing Gao, Jiawei Han, Department of Computer Science, University of Illinois at Urbana-Champaign
Charu Aggarwal, IBM T. J. Watson
This work was funded in part by
Masud et al.

  2. Outline of The Presentation
• Background
• Data Stream Classification
• Novel Class Detection

  3. Introduction
• Continuous flow of data
• Examples: network traffic, sensor data, call center records
Characteristics of data streams:

  4. Data Stream Classification
• Uses past labeled data to build a classification model
• Predicts the labels of future instances using the model
• Helps decision making
Figure: network traffic passes through a firewall into a classification model; attack traffic is blocked and quarantined, benign traffic reaches the server, and expert analysis and labeling of past traffic drives model updates.

  5. Data Stream Classification (cont.)
What are the applications?
• Security monitoring
• Network monitoring and traffic engineering
• Business: credit card transaction flows
• Telecommunication calling records
• Web logs and web page click streams

  6. Challenges
• Infinite length
• Concept-drift
• Concept-evolution
• Feature evolution

  7. Infinite Length
• Impractical to store and use all historical data
• Requires infinite storage and running time
Figure: a stream of instances (0s and 1s) arriving without end.

  8. Concept-Drift
Figure: a data chunk of positive and negative instances; the separating hyperplane shifts from its previous position to the current one, leaving some instances victims of concept-drift.

  9. Concept-Evolution
Figure: a feature space partitioned into regions A–D by thresholds x1, y1, y2; a novel class (denoted by x) appears amid the existing + and − classes.
Classification rules:
R1. if (x > x1 and y < y2) or (x < x1 and y < y1) then class = +
R2. if (x > x1 and y > y2) or (x < x1 and y > y1) then class = −
Existing classification models misclassify novel class instances.
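A minimal sketch of rules R1 and R2 as code, showing why a fixed model misclassifies novel class instances: every point is forced into + or −, including points from a region no training class ever occupied. The threshold values X1, Y1, Y2 below are illustrative choices, not taken from the slide.

```java
public class RuleClassifier {
    // Hypothetical split points; the slide only names them x1, y1, y2.
    static final double X1 = 5.0, Y1 = 3.0, Y2 = 7.0;

    // R1/R2 from the slide: every instance receives an existing label.
    static char classify(double x, double y) {
        if ((x > X1 && y < Y2) || (x < X1 && y < Y1)) return '+'; // R1
        return '-';                                               // R2
    }

    public static void main(String[] args) {
        // A point from an unseen (novel) region is still labeled + or -,
        // because the model has no way to say "none of the above".
        System.out.println(classify(6.0, 2.0)); // falls under R1
        System.out.println(classify(6.0, 9.0)); // falls under R2
    }
}
```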

  10. Dynamic Features
Different chunks may have different feature sets.
Why do new features evolve?
• Infinite data stream: the global feature set is normally unknown, so new features may appear
• Concept-drift: as concepts drift, new features may appear
• Concept-evolution: a new class normally brings a new set of features

  11. Dynamic Features
The ith chunk, the (i+1)st chunk, and the models may have different feature sets, e.g.:
• ith chunk: runway, climb
• (i+1)st chunk: runway, clear, ramp
• Current model: runway, ground, ramp
Pipeline: Feature Extraction & Selection → Feature Space Conversion → Classification & Novel Class Detection → Training New Model.
Existing classification models need a complete, fixed feature set applied to all chunks, but global features are difficult to predict. One solution is to use all English words and generate a vector, but the dimension of that vector would be too high.

  12. Outline of The Presentation
• Introduction
• Data Stream Classification
• Novel Class Detection

  13. Data Stream Classification (cont.)
• Single model incremental classification
• Ensemble (model-based) classification
• Supervised
• Semi-supervised
• Active learning

  14. Overview
• Single model incremental classification
• Ensemble (model-based) classification
• Data selection
• Semi-supervised
• Skewed data

  15. Ensemble of Classifiers
Figure: an input x with unknown label is passed to classifiers C1, C2, C3; their individual outputs (+, +, −) are combined by voting into the ensemble output (+).

  16. Ensemble Classification of Data Streams
• Divide the data stream into equal sized chunks
• Train a classifier from each data chunk
• Keep the best L such classifiers as the ensemble
• Example: for L = 3, labeled chunks D1–D5 yield classifiers C1–C5, and the three best of them form the ensemble that predicts the labels of the next unlabeled chunk
Note: Di may contain data points from different classes.
Addresses infinite length and concept-drift.
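The chunk-based ensemble above can be sketched as follows. This is a simplified illustration, not the authors' implementation: the `Classifier` interface, the accuracy-based pruning rule, and all names are assumptions made here for the sketch.

```java
import java.util.*;

public class ChunkEnsemble {
    interface Classifier { int predict(double[] x); }

    private final int L;                         // max ensemble size, e.g. 3
    private final List<Classifier> ensemble = new ArrayList<>();
    private final List<Double> accuracy = new ArrayList<>();

    ChunkEnsemble(int L) { this.L = L; }

    // Add the model trained on the newest labeled chunk, then keep
    // only the best L models (dropping the least accurate one).
    void update(Classifier newModel, double newModelAccuracy) {
        ensemble.add(newModel);
        accuracy.add(newModelAccuracy);
        while (ensemble.size() > L) {
            int worst = 0;
            for (int i = 1; i < accuracy.size(); i++)
                if (accuracy.get(i) < accuracy.get(worst)) worst = i;
            ensemble.remove(worst);
            accuracy.remove(worst);
        }
    }

    // Majority vote of the individual outputs, as in slide 15.
    int predict(double[] x) {
        Map<Integer, Integer> votes = new HashMap<>();
        for (Classifier c : ensemble)
            votes.merge(c.predict(x), 1, Integer::sum);
        return votes.entrySet().stream()
                    .max(Map.Entry.comparingByValue()).get().getKey();
    }
}
```

Pruning by a quality score is what keeps the ensemble bounded despite the infinite-length stream, and replacing old models with ones trained on recent chunks is what tracks concept-drift.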

  17. Concept-Evolution Problem (ECSMiner)
A completely new class of data arrives in the stream.
Figure: (a) a decision tree testing x < x1, y < y1, y < y2; (b) the corresponding feature space partitioning into regions A–D; (c) a novel class (denoted by x) arrives in the stream.

  18. ECSMiner: Overview
Data stream: older labeled instances, newer unlabeled instances, and the just-arrived instance xnow. Unlabeled instances are classified by an ensemble of L models M1 … ML; potential outliers are buffered for novel class detection, while the last labeled chunk is used to train a new model that updates the ensemble.
Based on: Mohammad M. Masud, Jing Gao, Latifur Khan, Jiawei Han, and Bhavani Thuraisingham. "Integrating Novel Class Detection with Classification for Concept-Drifting Data Streams". In Proceedings of the 2009 European Conf. on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD'09), Bled, Slovenia, 7-11 Sept 2009, pp. 79-94 (extended version appeared in IEEE Transactions on Knowledge and Data Engineering (TKDE)).

  19. Algorithm (ECSMiner)
• Training
• Novel class detection and classification

  20. Novel Class Detection (ECSMiner)
Non-parametric: does not assume any underlying model of the existing classes.
Steps:
• Creating and saving a decision boundary during training
• Detecting and filtering outliers
• Measuring cohesion and separation among test and training instances

  21. Training: Creating Decision Boundary (ECSMiner)
Clusters are created from the raw training data and saved as pseudopoints, which together form the decision boundary.
Figure: raw training data in regions A–D of the feature space is summarized into pseudopoint clusters.
Addresses the infinite length problem.

  22. Outlier Detection and Filtering (ECSMiner)
A test instance x inside the decision boundary of a model is not an outlier; a test instance outside the decision boundary is a raw outlier (Routlier) for that model. If x is an Routlier for all L models of the ensemble M1 … ML, it becomes a filtered outlier (Foutlier), a potential novel class instance; otherwise it is treated as an existing class instance.
Routliers may appear as a result of novel class, concept-drift, or noise. Therefore, they are filtered to reduce noise as much as possible.

  23. Novel Class Detection (ECSMiner)
(Step 1) Test instance x against the ensemble of L models M1 … ML.
(Step 2) If x is an Routlier for all models, x is a filtered outlier (Foutlier), a potential novel class instance; otherwise treat it as an existing class instance.
(Step 3) Compute q-NSC with all models and the other Foutliers.
(Step 4) If q-NSC > 0 for q' > q Foutliers with all models, a novel class is found.

  24. Computing Cohesion & Separation (ECSMiner)
• a(x) = mean distance from an Foutlier x to the instances in λo,q(x), its q nearest Foutlier neighbors
• bc(x) = mean distance from x to its q nearest neighbors in existing class c (e.g., b+(x) and b−(x) in the figure)
• bmin(x) = minimum among all bc(x)
• q-Neighborhood Silhouette Coefficient: q-NSC(x) = (bmin(x) − a(x)) / max(bmin(x), a(x))
If q-NSC(x) is positive, it means x is closer to the Foutliers than to any existing class.
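A direct sketch of the q-NSC definition above: cohesion a(x) over the q nearest other Foutliers, separation bmin(x) over the existing classes. The helper names and data layout (each class as a list of points) are assumptions for this sketch.

```java
import java.util.*;

public class QNSC {
    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    // Mean distance from x to its q nearest points within the given set.
    static double meanQNearest(double[] x, List<double[]> set, int q) {
        double[] d = set.stream().mapToDouble(p -> dist(x, p)).sorted().toArray();
        int n = Math.min(q, d.length);
        double s = 0;
        for (int i = 0; i < n; i++) s += d[i];
        return s / n;
    }

    // q-NSC(x) = (bmin(x) - a(x)) / max(bmin(x), a(x)), in [-1, +1].
    static double qNSC(double[] x, List<double[]> otherFoutliers,
                       List<List<double[]>> existingClasses, int q) {
        double a = meanQNearest(x, otherFoutliers, q);   // cohesion
        double bmin = Double.MAX_VALUE;                  // separation
        for (List<double[]> cls : existingClasses)
            bmin = Math.min(bmin, meanQNearest(x, cls, q));
        return (bmin - a) / Math.max(bmin, a);
    }
}
```

For an Foutlier surrounded by other Foutliers and far from every existing class, a(x) is small and bmin(x) large, so q-NSC approaches +1, which is the novel-class signal used in Step 4.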

  25. Speeding Up
• Computing q-NSC for every Foutlier instance x takes quadratic time in the number of Foutliers.
• To make the computation faster, we create Ko pseudopoints (Fpseudopoints) from the Foutliers using K-means clustering, where Ko = (No/S) * K; here S is the chunk size and No is the number of Foutliers. We then perform the computations on the Fpseudopoints.
• Thus, the time complexity to compute the q-NSC of all the Fpseudopoints is O(Ko(Ko+K)), which is constant, since both Ko and K are independent of the input size.
• However, by gaining speed we lose some precision, although the loss is negligible (analyzed shortly).
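The bookkeeping behind the speed-up can be sketched numerically: Ko scales K by the fraction of the chunk that became Foutliers, and the pairwise cost drops from quadratic in No to O(Ko(Ko+K)). The rounding via `ceil` is an assumption here; the slide does not say how a fractional Ko is handled.

```java
public class SpeedUp {
    // Ko = (No / S) * K, the number of Fpseudopoints built by K-means.
    static int numFpseudopoints(int No, int S, int K) {
        return (int) Math.ceil((double) No / S * K);
    }

    public static void main(String[] args) {
        int S = 2000, K = 50;      // chunk size and pseudopoints per chunk,
                                   // the values from the experiment setup
        int No = 200;              // hypothetical number of Foutliers
        int Ko = numFpseudopoints(No, S, K);
        long exactCost  = (long) No * No;       // pairwise over all Foutliers
        long approxCost = (long) Ko * (Ko + K); // over Fpseudopoints only
        System.out.println(Ko + " " + exactCost + " " + approxCost);
    }
}
```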

  26. Algorithm To Detect Novel Class (ECSMiner)

  27. “Speedup” Penalty
• As discussed earlier, by speeding up the computation in Step 3 we lose some precision, since the result deviates from the exact result.
• This analysis shows that the deviation is negligible.
Figure 6: illustrating the computation of deviation from the squared distances (x−i)², (i−j)² and (x−j)², where i is an Fpseudopoint, i.e., a cluster of Foutliers, and j is an existing class pseudopoint, i.e., a cluster of existing class instances. In this particular example, all instances in i belong to a novel class.

  28. “Speedup” Penalty
The slide compares the approximate computation (on Fpseudopoints) with the exact computation (on individual Foutliers) and derives their deviation; the equations appear on the slide image.

  29. Experiments – Datasets
We evaluated our approach on two synthetic and two real datasets:
• SynC – synthetic data with only concept-drift, generated using a hyperplane equation. 2 classes, 10 attributes, 250K instances
• SynCN – synthetic data with concept-drift and novel classes, generated using Gaussian distributions. 20 classes, 40 attributes, 400K instances
• KDD Cup 1999 intrusion detection (10% version) – real dataset. 23 classes, 34 attributes, 490K instances
• Forest Cover – real dataset. 7 classes, 54 attributes, 581K instances

  30. Experiments – Setup
• Development language: Java
• H/W: Intel P-IV with 2GB memory and a 3GHz dual-processor CPU
• Parameter settings:
• K (number of pseudopoints per chunk) = 50
• N (minimum number of instances required to declare a novel class) = 50
• M (ensemble size) = 6
• S (chunk size) = 2,000

  31. Experiments – Baseline
Competing approaches:
• i) MineClass (MC): our approach
• ii) WCE-OLINDDA_Parallel (W-OP)
• iii) WCE-OLINDDA_Single (W-OS)
where WCE-OLINDDA is a combination of the Weighted Classifier Ensemble (WCE) and the novel class detector OLINDDA, with default parameter settings for both. We use this combination since, to the best of our knowledge, there is no other approach that can classify and detect novel classes simultaneously. OLINDDA assumes there is only one normal class and all other classes are novel. Therefore, we apply two variations:
• W-OP keeps parallel OLINDDA models, one for each class
• W-OS keeps a single model that absorbs a novel class when encountered

  32. Experiments – Results
Evaluation metrics:
• Mnew = % of novel class instances Misclassified as existing class = Fn ∗ 100 / Nc
• Fnew = % of existing class instances Falsely identified as novel class = Fp ∗ 100 / (N − Nc)
• ERR = total misclassification error (%) (including Mnew and Fnew) = (Fp + Fn + Fe) ∗ 100 / N
where Fn = total novel class instances misclassified as existing class, Fp = total existing class instances misclassified as novel class, Fe = total existing class instances misclassified (other than Fp), Nc = total novel class instances in the stream, and N = total instances in the stream.
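The three metrics translate directly into code; a minimal sketch, assuming the counts Fn, Fp, Fe, Nc, N have been tallied elsewhere:

```java
public class StreamMetrics {
    // Mnew: % of novel class instances misclassified as existing class.
    static double mNew(long Fn, long Nc) { return Fn * 100.0 / Nc; }

    // Fnew: % of existing class instances falsely identified as novel class.
    static double fNew(long Fp, long N, long Nc) { return Fp * 100.0 / (N - Nc); }

    // ERR: total misclassification error (%), including Mnew and Fnew.
    static double err(long Fp, long Fn, long Fe, long N) {
        return (Fp + Fn + Fe) * 100.0 / N;
    }
}
```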

  33. Experiments – Results (Forest Cover, KDD Cup, SynCN)

  34. Experiments – Results

  35. Experiments – Parameter Sensitivity

  36. Experiments – Runtime

  37. Dynamic Features
Solution:
• Global features
• Local features
• Union
Mohammad Masud, Qing Chen, Latifur Khan, Jing Gao, Jiawei Han, and Bhavani Thuraisingham, “Classification and Novel Class Detection of Data Streams in A Dynamic Feature Space,” in Proc. of Machine Learning and Knowledge Discovery in Databases, European Conference, ECML PKDD 2010, Barcelona, Spain, Sept 2010, Springer, pp. 337-352

  38. Feature Mapping Across Models and Test Data Points
The feature set varies across chunks; especially when a new class appears, new features should be selected and added to the feature set.
• Strategy 1 – Lossy fixed (Lossy-F) conversion / Global: use the same fixed feature set for the entire stream. We call this a lossy conversion because future models and instances may lose important features due to this mapping.
• Strategy 2 – Lossy local (Lossy-L) conversion / Local: we call this a lossy conversion because it may lose feature values during mapping.
• Strategy 3 – Dimension preserving (D-Preserving) mapping / Union

  39. Feature Space Conversion – Lossy-L Mapping (Local)
• Assume that each data chunk has a different feature vector.
• When a classification model is trained, we save the feature vector with the model.
• When an instance is tested, its feature vector is mapped (i.e., projected) onto the model’s feature vector.

  40. Feature Space Conversion – Lossy-L Mapping
For example:
• Suppose the model has two features (x, y) and the instance has two features (y, z).
• When testing, the instance is treated as having the model’s two features (x, y), where x = 0 and the y value is kept as it is (the instance’s z value is dropped).
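The Lossy-L projection can be sketched as a lookup of the model's saved feature names in the instance's feature–value map: model features missing from the instance become 0, and instance features unknown to the model are silently dropped. The map-based representation is an assumption of this sketch.

```java
import java.util.*;

public class LossyL {
    // Project a test instance onto the feature vector saved with the model.
    static double[] project(Map<String, Double> instance, String[] modelFeatures) {
        double[] out = new double[modelFeatures.length];
        for (int i = 0; i < modelFeatures.length; i++)
            out[i] = instance.getOrDefault(modelFeatures[i], 0.0); // missing -> 0
        return out;
    }
}
```

For the slide's example, an instance with features (y, z) projected onto a model with features (x, y) yields x = 0, y kept, z lost.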

  41. Conversion Strategy II – Lossy-L Mapping
Graphically (shown on the slide).

  42. Conversion Strategy III – D-Preserving Mapping
• When an instance is tested, both the model’s feature vector and the instance’s feature vector are mapped (i.e., projected) onto the union of their feature vectors.
• The feature dimension is increased.
• In the mapping, the features of both the testing instance and the model are preserved; the extra features are filled with 0s.

  43. Conversion Strategy III – D-Preserving Mapping
For example:
• Suppose the model has three features (a, b, c) and the instance has four features (b, c, d, e).
• When testing, we project both the model’s feature vector and the instance’s feature vector onto (a, b, c, d, e).
• Therefore, in the model, d and e will be considered 0s, and in the instance, a will be considered 0.
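The union mapping above can be sketched in the same style as the Lossy-L sketch: build the union of the two feature lists, then project each side onto it with 0s in the positions it lacks. Names and the map-based representation are again assumptions.

```java
import java.util.*;

public class UnionMapping {
    // Union of the model's and instance's feature sets, model features first.
    static List<String> unionFeatures(List<String> model, List<String> instance) {
        LinkedHashSet<String> u = new LinkedHashSet<>(model);
        u.addAll(instance);
        return new ArrayList<>(u);
    }

    // Project a feature-value map onto the union; absent features become 0.
    static double[] project(Map<String, Double> values, List<String> union) {
        double[] out = new double[union.size()];
        for (int i = 0; i < union.size(); i++)
            out[i] = values.getOrDefault(union.get(i), 0.0);
        return out;
    }
}
```

Unlike Lossy-L, no feature value is dropped here, which is why this conversion preserves the properties a novel class exhibits in its new features.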

  44. Conversion Strategy III – D-Preserving Mapping (previous example, shown graphically)

  45. Discussion
• Local does not favor the novel class; it favors the existing classes, since local features are enough to model the existing classes.
• Union favors the novel class: new features may be discriminating for the novel class, hence Union works.

  46. Comparison
Which strategy is better?
Assumption: the lossless conversion (Union) preserves the properties of a novel class. In other words, if an instance belongs to a novel class, it remains outside the decision boundary of every model Mi of the ensemble M in the converted feature space.
Lemma: if a test point x belongs to a novel class, it will be misclassified by the ensemble M as an existing class instance under certain conditions when the Lossy-L conversion is used.

  47. Comparison
Proof: let X1, …, XL, XL+1, …, XM be the dimensions of the model and X1, …, XL, XM+1, …, XN be the dimensions of the test point, so the combined feature space is X1, …, XL, XL+1, …, XM, XM+1, …, XN. Suppose the radius of the closest cluster (in the higher dimension) is R, and let the test point be a novel class instance.

  48. Comparison
Proof (continued): combined feature space = X1, …, XL, XL+1, …, XM, XM+1, …, XN.
Centroid of the cluster (original space): X1 = x1, …, XL = xL, XL+1 = xL+1, …, XM = xM, i.e., (x1, …, xL, xL+1, …, xM).
Centroid of the cluster (combined space): (x1, …, xL, xL+1, …, xM, 0, …, 0).
Test point (original space): X1 = x′1, …, XL = x′L, XM+1 = x′M+1, …, XN = x′N, i.e., (x′1, …, x′L, x′M+1, …, x′N).
Test point (combined space): (x′1, …, x′L, 0, …, 0, x′M+1, …, x′N).

  49. Comparison
Proof (continued): with centroid (x1, …, xL, xL+1, …, xM, 0, …, 0) and test point (x′1, …, x′L, 0, …, 0, x′M+1, …, x′N) in the combined space, the novel test point lies outside the closest cluster, so
R² < ((x1 − x′1)² + … + (xL − x′L)² + x²L+1 + … + x²M) + (x′²M+1 + … + x′²N)
Writing a² for the first parenthesized term (the squared distance in the Lossy-L space) and b² for the second, we have R² < a² + b², i.e., R² = a² + b² − e² for some e² > 0. Then a² = R² + (e² − b²), so a² < R² provided that e² < b². Therefore, under the Lossy-L conversion, the test point will not be an outlier, and the novel class instance is misclassified as an existing class instance.

  50. Baseline Approaches
• WCE is the Weighted Classifier Ensemble [1], a multi-class ensemble classifier.
• OLINDDA is a novel class detector [2] that works only for binary classes.
• FAE is an ensemble classifier that addresses feature evolution [3] and concept drift.
• ECSMiner is a multi-class ensemble classifier that addresses concept drift and concept evolution [4].
