
An Ensemble-based Approach to Fast Classification of Multi-label Data Streams




Presentation Transcript


  1. An Ensemble-based Approach to Fast Classification of Multi-label Data Streams
  Xiangnan Kong, Philip S. Yu
  Dept. of Computer Science, University of Illinois at Chicago

  2. Introduction: Data Stream
  • Data Stream: a high-speed data flow, continuously arriving and changing
  • Applications:
    • online message classification
    • network traffic monitoring
    • credit card transaction classification

  3. Introduction: Stream Classification
  • Stream Classification:
    • Construct a classification model on past stream data
    • Use the model to predict the class labels of incoming data
  [Figure: training data from the stream train a classification model, which then classifies incoming data]

  4. Multi-Label Stream Data
  • Conventional stream classification uses single-label settings: it assumes each stream object can only have one label
  • In many real applications, one stream object can have multiple labels
  [Figure: example news articles and emails, each tagged with multiple labels]

  5. Multi-Label Stream Classification
  • Traditional stream classification: each object/instance is associated with a single label
  • Multi-label stream classification: each object/instance can be associated with multiple labels

  6. The Problem
  • Stream data:
    • Huge data volume + limited memory: we cannot store the entire dataset for training, so a one-pass algorithm over the stream is required
    • High speed: data must be processed promptly
    • Concept drifts: old data become outdated
  • Multi-label classification:
    • Large number of possible label sets (exponential in the number of labels)
    • Conventional multi-label classification approaches focus on offline settings and cannot be applied here

  7. Our Solution
  • Random trees: very fast in training and testing
  • Ensemble of multiple trees: effective, and reduces prediction variance
  • Statistics of multiple labels on the tree nodes: effective training/testing with multiple labels
  • Fading function: reduces the influence of old data

  8. Multi-label Random Tree
  • Conventional decision trees:
    • Multiple passes over the dataset
    • Variable selection at each node split
    • Single-label prediction
    • Static updates, using the entire dataset including outdated data
  • Multi-label random tree:
    • Single pass over the data
    • Each node split on a random variable with a random threshold
    • Ensemble of multiple trees
    • Multi-label predictions
    • Fades out old data

  9. Training: Update Trees
  [Figure: each incoming instance is routed down every tree in the ensemble (Tree 1 … Tree Nt), and the node statistics along its path are updated]
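The training step on this slide can be sketched in Python. This is a minimal illustration, not the authors' implementation: the class and parameter names (`RandomTreeNode`, `max_depth`) are assumptions, features are taken to lie in [0, 1], and the fading of old statistics is omitted here for clarity.

```python
import random

class RandomTreeNode:
    """Node of a multi-label random tree (illustrative sketch)."""
    def __init__(self, depth, n_features, n_labels, max_depth=6, rng=random):
        self.label_counts = [0.0] * n_labels  # aggregated label relevance vector
        self.n_instances = 0.0                # aggregated number of instances
        self.card_sum = 0.0                   # aggregated label-set cardinalities
        self.left = self.right = None
        if depth < max_depth:
            # random split: a random feature with a random threshold in [0, 1]
            self.feature = rng.randrange(n_features)
            self.threshold = rng.random()
            self.left = RandomTreeNode(depth + 1, n_features, n_labels, max_depth, rng)
            self.right = RandomTreeNode(depth + 1, n_features, n_labels, max_depth, rng)

def update(root, x, labels):
    """Single-pass update: route x to a leaf, updating statistics on the path."""
    node = root
    while node is not None:
        node.n_instances += 1
        node.card_sum += len(labels)
        for l in labels:
            node.label_counts[l] += 1
        if node.left is None:
            break
        node = node.left if x[node.feature] <= node.threshold else node.right
```

Because the splits are chosen randomly rather than by variable selection, each update touches only one root-to-leaf path per tree, which is what makes training a single fast pass over the stream.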

  10. On the Tree Nodes
  • Statistics on each node:
    • Aggregated label relevance vector
    • Aggregated number of instances
    • Aggregated label-set cardinalities
    • Time stamp of the latest update
  • Fading function:
    • The statistics are rescaled with a time-fading function
    • This reduces the effect of old data on the node statistics
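The slide does not give the exact form of the fading function; a common choice for rescaling stream statistics is exponential decay, sketched below. The decay base and the rate `lam` are assumptions; the stored time stamp makes the rescaling lazy, applied only when a node is next touched.

```python
def faded(value, t_now, t_last, lam=0.5):
    """Rescale an aggregated statistic by 2**(-lam * elapsed_time).

    Assumed exponential fading function; 'lam' controls how quickly
    old data lose influence. Apply before reading or adding to the
    statistic, then reset the node's time stamp to t_now.
    """
    return value * 2.0 ** (-lam * (t_now - t_last))
```

For example, with `lam = 0.5` a statistic loses half its weight every two time units, so concept drift gradually overwrites outdated counts without any explicit deletion.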

  11. Prediction
  [Figure: a test instance is routed down each tree (Tree 1 … Tree Nt) and the leaf predictions are aggregated]
  • Use the aggregated label relevance to rank all possible labels
  • Use the aggregated set cardinality to decide how many labels are included in the predicted label set
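The two aggregation steps above can be sketched as follows. The function takes, for each tree, the statistics of the leaf the test instance reaches; the tuple layout and names are illustrative, since the slide only describes the idea of ranking by relevance and cutting by cardinality.

```python
def aggregate_predict(leaf_stats, n_labels):
    """Combine per-tree leaf statistics into a label ranking and a label set.

    leaf_stats: one (label_counts, n_instances, card_sum) tuple per tree,
    i.e. the statistics stored at the leaf the test instance reached.
    """
    relevance = [0.0] * n_labels
    card = 0.0
    used = 0
    for counts, n, card_sum in leaf_stats:
        if n > 0:
            used += 1
            for l in range(n_labels):
                relevance[l] += counts[l] / n   # per-tree label relevance
            card += card_sum / n                # per-tree average cardinality
    # rank all labels by aggregated relevance
    ranked = sorted(range(n_labels), key=lambda l: -relevance[l])
    # use the aggregated cardinality to decide how many labels to output
    k = max(1, round(card / used)) if used else 0
    return set(ranked[:k]), relevance
```

The returned `relevance` vector supports ranking-based metrics, while the top-`k` set is the predicted label set.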

  12. Experiment Setup
  • Three methods are compared:
    • SMART (Stream Multi-lAbel Random Tree): multi-label stream classification with random trees [This Paper]
    • SMART (static): SMART without the fading function, i.e., the trees keep updating without fading
    • Multi-label kNN: a state-of-the-art multi-label classification method + a sliding window

  13. Data Sets
  • Three multi-label stream classification datasets:
    • MediaMill: video annotation task, from the "MediaMill Challenge"
    • TMC2007: text classification task, from the SDM text mining competition
    • RCV1-v2: large-scale text classification task, from the Reuters dataset
  [Table: # instances, # labels, # features, and label density for each dataset]

  14. Evaluation
  • Multi-label metrics [Elisseeff & Weston, NIPS '02]:
    • Ranking Loss ↓: evaluates performance on the probability outputs; the average number of label pairs ranked incorrectly; the smaller the better
    • Micro F1 ↑: evaluates performance on label-set prediction; considers the micro-averages of both precision and recall; the larger the better
  • Sequential evaluation with concept drifts, simulated by mixing two streams
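The two metrics above have standard definitions that can be sketched directly; this is a simplified per-instance version (the cited paper averages Ranking Loss over all instances), and the function names are my own.

```python
def ranking_loss(scores, relevant):
    """Fraction of (relevant, irrelevant) label pairs ranked incorrectly,
    i.e., where an irrelevant label scores at least as high as a relevant one."""
    labels = range(len(scores))
    irrelevant = [l for l in labels if l not in relevant]
    if not relevant or not irrelevant:
        return 0.0
    bad = sum(1 for r in relevant for i in irrelevant if scores[i] >= scores[r])
    return bad / (len(relevant) * len(irrelevant))

def micro_f1(pred_sets, true_sets):
    """Micro-averaged F1 over a sequence of predicted / true label sets:
    true positives, false positives, and false negatives are pooled
    across all instances before computing F1."""
    tp = sum(len(p & t) for p, t in zip(pred_sets, true_sets))
    fp = sum(len(p - t) for p, t in zip(pred_sets, true_sets))
    fn = sum(len(t - p) for p, t in zip(pred_sets, true_sets))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0
```

Ranking Loss consumes the relevance scores (probability outputs), while Micro F1 consumes the thresholded label sets, which is why the method keeps both a relevance ranking and a cardinality-based cutoff.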

  15. Throughput / Efficiency
  [Figure: throughput comparison of the methods]

  16. Effectiveness: MediaMill Dataset
  [Figure: Ranking Loss (lower is better) over the stream (× 4,300 instances), comparing SMART, SMART (static, without the fading function), and Multi-Label kNN with window sizes w = 100, 200, 400]
  • Our approach with multi-label streaming random trees performed best on the MediaMill dataset

  17. Effectiveness: MediaMill Dataset
  [Figure: Micro F1 (higher is better) over the stream (× 4,300 instances), comparing SMART, SMART (static, without the fading function), and Multi-Label kNN with w = 100, 200, 400]

  18. Experiment Results
  [Figures: Micro F1 and Ranking Loss on the MediaMill, RCV1-v2, and TMC2007 datasets]

  19. Experiment Results
  [Figures: Micro F1 and Ranking Loss on the MediaMill, RCV1-v2, and TMC2007 datasets, continued]

  20. Conclusions
  • An ensemble-based approach for fast classification of multi-label data streams:
    • Ensemble-based approach: effective
    • Predicts multiple labels
    • Very fast in training, updating node statistics, and prediction using random trees: efficient
  Thank you!
