Improving Supervised Classification using Confidence Weighted Learning



  1. Improving Supervised Classification using Confidence Weighted Learning Koby Crammer Joint work with Mark Dredze, Alex Kulesza and Fernando Pereira Workshop in Machine Learning, the EE Department, Technion, January 20, 2010

  2. Linear Classifiers • Input: an instance to be classified • Model: the weight vector of the classifier; the prediction is the sign of their inner product
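
A minimal sketch of this prediction rule in Python (the names are illustrative, not from the talk):

import numpy as np

def predict(w, x):
    # Linear classifier: the predicted label is the sign of the inner product w.x
    return 1 if np.dot(w, x) >= 0 else -1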

  3. Natural Language Processing • Big datasets, large number of features • Many features are only weakly correlated with the target label • Linear classifiers: features are associated with word counts • Heavy-tailed feature distribution [Plot: feature counts vs. feature rank]

  4. Sentiment Classification • Who needs this Simpsons book? You DOOOOOOOO This is one of the most extraordinary volumes I've ever encountered … . Exhaustive, informative, and ridiculously entertaining, it is the best accompaniment to the best television show … . … Very highly recommended! Pang, Lee, Vaithyanathan, EMNLP 2002

  5. Online Learning • Maintain a model M • Get an instance x • Predict a label ŷ = M(x) • Get the true label y • Suffer a loss ℓ(ŷ, y) • Update the model M
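
The loop above as a Python sketch; the plain perceptron step stands in for the generic "update model" box, and all names are illustrative:

import numpy as np

def online_learn(stream, n_features):
    """Generic online loop: predict, get the true label, suffer loss, update."""
    w = np.zeros(n_features)
    mistakes = 0
    for x, y in stream:                         # y in {-1, +1}
        y_hat = 1 if np.dot(w, x) >= 0 else -1  # predict with the current model
        if y_hat != y:                          # suffer the 0/1 loss
            mistakes += 1
            w += y * x                          # perceptron update (a placeholder;
                                                # CW replaces it with a confidence-aware step)
    return w, mistakes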

  6. Sentiment Classification • Many positive reviews with the word best increase w_best • Later, a negative review arrives: “boring book – best if you want to sleep in seconds” • A linear update will reduce both w_best and w_boring by the same amount • But best has appeared far more often than boring • Better to reduce the weights at different rates: shrink the rarely-seen w_boring aggressively and the well-established w_best only slightly

  7. Linear Model → Distribution over Linear Models • Replace the single weight vector with a distribution over weight vectors; its mean plays the role of the weight vector [Figure: example distribution]

  8. New Prediction Models • Gaussian distributions over weight vectors • The covariance is either full or diagonal • In NLP we have many features and use a diagonal covariance
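
A minimal sketch of such a model with a diagonal covariance, assuming w ~ N(mu, diag(sigma2)); the signed margin y(w·x) is then Gaussian with mean y(mu·x) and variance x^T diag(sigma2) x (names illustrative):

import numpy as np

class DiagonalGaussianModel:
    def __init__(self, n_features, init_var=1.0):
        self.mu = np.zeros(n_features)               # mean weight vector
        self.sigma2 = np.full(n_features, init_var)  # per-feature variance (confidence)

    def margin_stats(self, x, y):
        # The signed margin y*(w.x) is Gaussian with this mean and variance
        mean = y * np.dot(self.mu, x)
        var = np.dot(self.sigma2, x * x)             # x^T diag(sigma2) x
        return mean, var

    def predict(self, x):
        # Predict with the mean weight vector
        return 1 if np.dot(self.mu, x) >= 0 else -1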

  9. Weight Vector (Version) Space The algorithm forces most of the probability mass of the weight distribution to reside in this region (the half-space of weight vectors that classify the example correctly)

  10. Passive Step Nothing to do: most of the weight vectors already classify the example correctly

  11. Aggressive Step • The mean is moved beyond the mistake-line (large margin) • The covariance is shrunk in the direction of the input example • The algorithm projects the current Gaussian distribution onto the half-space

  12. The Update • Projection update (shown below) • Can be solved analytically
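
The equation itself did not survive the transcript; as published for CW learning (Dredze, Crammer, Pereira, ICML 2008), the projection update is the KL projection of the current Gaussian onto the probabilistic large-margin constraint:

\[
(\mu_{t+1}, \Sigma_{t+1}) = \arg\min_{\mu, \Sigma}\,
\mathrm{D_{KL}}\!\left(\mathcal{N}(\mu, \Sigma)\,\middle\|\,\mathcal{N}(\mu_t, \Sigma_t)\right)
\quad \text{s.t.} \quad
\Pr_{w \sim \mathcal{N}(\mu, \Sigma)}\!\left[\, y_t\,(w \cdot x_t) \ge 0 \,\right] \ge \eta .
\]

Under the Gaussian, the constraint reduces to the deterministic form \( y_t (\mu \cdot x_t) \ge \phi \sqrt{x_t^{\top} \Sigma\, x_t} \) with \( \phi = \Phi^{-1}(\eta) \), which is what makes the analytical solution mentioned on the slide possible.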

  13. Synthetic Data • 20 features: 2 informative (drawn from a rotated, skewed Gaussian) and 18 noisy • Using a single feature is as good as random prediction
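
One way to reconstruct such data; the slide gives no exact parameters, so the directions and noise scales below are assumptions. The class signal lies along one diagonal with small noise while the orthogonal diagonal carries large noise, so each raw feature alone predicts at chance:

import numpy as np

rng = np.random.default_rng(0)

def make_synthetic(n=1000):
    y = rng.choice([-1.0, 1.0], size=n)
    u = np.array([1.0, 1.0]) / np.sqrt(2)   # informative (low-noise) direction
    v = np.array([1.0, -1.0]) / np.sqrt(2)  # uninformative (high-noise) direction
    informative = (np.outer(y, u)                        # class means at +/- u
                   + np.outer(0.1 * rng.standard_normal(n), u)
                   + np.outer(3.0 * rng.standard_normal(n), v))
    noise = rng.standard_normal((n, 18))    # 18 purely noisy features
    return np.hstack([informative, noise]), y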

  14. Synthetic Data (cont’d) [Figure: the learned distribution after 50 examples (feature x1)]

  15. Synthetic Data (results) [Plot comparing Perceptron, PA, 2nd Order, CW-full, and CW-diag]

  16. Data • Binary document classification • Sentiment reviews: 6 Amazon domains (Blitzer et al) • Reuters (RCV1): 3 pairs of labels • 20 News Groups: 3 pairs of labels • About 2000 instances per dataset • Bag-of-words representation • 10-fold cross-validation; 5 epochs

  17. Results vs Batch - Sentiment • Always better than the batch methods • Significantly better on 3 of 6 datasets

  18. Results vs Batch - 20NG + Reuters • Better than the batch methods on 5 of 6 datasets • Significantly better on 3 of those 5; significantly worse on the remaining 1

  19. Parallel Training • Split large data into disjoint sets • Train on each set independently • Combine the resulting classifiers • Baseline: average the classifiers’ performance • Uniform mean of linear weights • Weighted mean of linear weights using confidence information (see the sketch below)
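
A sketch of the two weight-combination rules, assuming each split i yields a CW mean mus[i] and per-feature variance sigma2s[i]; precision weighting is one natural reading of "weighted mean using confidence information", not necessarily the talk's exact scheme:

import numpy as np

def combine_uniform(mus):
    # Uniform mean of the per-split weight vectors
    return np.mean(mus, axis=0)

def combine_confidence(mus, sigma2s):
    # Precision-weighted mean: low-variance (high-confidence) weights count more
    precisions = [1.0 / s2 for s2 in sigma2s]
    total = np.sum(precisions, axis=0)
    weighted = np.sum([p * m for p, m in zip(precisions, mus)], axis=0)
    return weighted / total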

  20. Parallel Training • Data size: Sentiment ~1M; Reuters ~0.8M • #Features/#Docs: Sentiment ~13; Reuters ~0.35 • Performance degrades with the number of splits • Weighting improves performance [Plot; the “Baseline (CW)” line marks CW trained without splitting]

  21. Multi-Class Update Crammer, Dredze, Kulesza. EMNLP 2008 • Constraints for all competing labels: multiple constraints per instance • Approximation: use a single constraint (see the sketch below)
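
A hedged sketch of that approximation: of the "true label outscores label r" constraints, keep only the one for the currently highest-scoring wrong label. Here feat(x, r) is an assumed joint feature map, not something named in the talk:

import numpy as np

def single_constraint(mu, x, y, n_labels, feat):
    scores = [np.dot(mu, feat(x, r)) for r in range(n_labels)]
    scores[y] = -np.inf                 # exclude the true label
    r_star = int(np.argmax(scores))     # most competitive wrong label
    # The kept constraint asks that mu . (feat(x, y) - feat(x, r_star)) be large
    return feat(x, y) - feat(x, r_star)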

  22. Evaluation Setup Nine multi-class datasets Crammer, Dredze, Kulesza. EMNLP 2008

  23. Evaluation Crammer, Dredze, Kulesza. EMNLP 2008 Better than all baselines (online and batch): 8 of 9 datasets

  24. 20 Newsgroups Crammer, Dredze, Kulesza. EMNLP 2008 Better than all online baselines: 8 of 9 datasets

  25. Dredze, Kulesza, Crammer. MLJ 2009 Multi-Domain Learning • Task: sentiment classification for reviews from several domains • Electronics • Books • Movies • Kitchen Appliances • Challenge: domains differ • Domains use different features • Domains may behave differently towards the same features Blitzer, Dredze, Pereira, ACL 2007

  26. Differing Feature Behaviors • Share similar behaviors across domains • Learn domain-specific behaviors • Shared parameters: used for every domain • Domain parameters: separate parameters for every domain

  27. Combining Domain Parameters Example: shared weight 2, domain-specific weight -1, combined weight .5 (their average)
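
As a one-line rule consistent with the slide's numbers, (2 + (-1)) / 2 = .5, assuming the combination is a plain average:

def combine(shared_w, domain_w):
    # Combined weight = average of the shared and domain-specific weights
    return 0.5 * (shared_w + domain_w)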

  28. Classifier Combination • A CW classifier is a distribution over weight vectors [Figure: individual classifiers are weighted and merged into a combined classifier]

  29. Multi-Domain Regularization • Combined classifier used for prediction and updates • Based on Evgeniou and Pontil, KDD 2004 • Passive-aggressive update rule • Find the shared model and the individual model closest to the current corresponding models (1: smallest parameter change) • Such that their combination performs well on the current example (2: classify the example correctly)
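
In symbols, one plausible rendering of this step (the notation is assumed here, not taken from the slide):

\[
(w_s^{t+1}, w_d^{t+1}) = \arg\min_{w_s, w_d}\,
\|w_s - w_s^{t}\|^2 + \|w_d - w_d^{t}\|^2
\quad \text{s.t.} \quad
y_t\, \tfrac{1}{2}(w_s + w_d) \cdot x_t \ge 1 ,
\]

where the objective is requirement 1 (smallest parameter change) and the constraint is requirement 2 (the combined model classifies the example correctly).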

  30. Evaluation on Sentiment • Sentiment classification • Rate product reviews: positive/negative • 4 datasets • All: 7 Amazon product types • Books: different rating thresholds • DVDs: different rating thresholds • Books+DVDs • 1500 train, 100 test per domain

  31. Results [Plot: test error (smaller is better) on Books, DVDs, and Books+DVDs; p = .001; 10-fold CV, one pass of online training]

  32. Dredze & Crammer, ACL 2008 Active Learning • Start with a pool of unlabeled examples • Use a few labeled examples to choose an initial hypothesis • Iterative algorithm: • Use the current classifier to pick an example to be labeled • Train using all labeled examples

  39. Dredze & Crammer, ACL 2008 Picking the Next Example • Random • Linear classifiers: the example with the lowest margin • Active Confidence Learning: the example with the least confidence • Equivalent to the lowest normalized margin (see the sketch below)
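
A sketch of the selection rule under the diagonal CW model (illustrative names): query the pooled example with the smallest normalized margin |mu·x| / sqrt(x^T diag(sigma2) x):

import numpy as np

def pick_next(pool, mu, sigma2):
    # Active Confidence Learning: least confident = lowest normalized margin
    def norm_margin(x):
        return abs(np.dot(mu, x)) / np.sqrt(np.dot(sigma2, x * x))
    return min(range(len(pool)), key=lambda i: norm_margin(pool[i]))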

  40. Dredze & Crammer, ACL 2008 Active Learning • 13 Datasets : • Sentiment (4), 20NG (3), Reuters (3), SPAM (3)

  41. Dredze & Crammer, ACL 2008 Active Learning • Number of labels needed by CW Margin and ACL to achieve 80% of the accuracy of training with all data

  42. Summary • Online training is fast and effective … • … but NLP data has heavy-tailed feature distributions • New model: • Add per-feature confidence parameters • Benefits: • Better than state-of-the-art training algorithms for linear classifiers • Converges faster • Theoretical guarantees • Allows better combination of models trained in parallel and better active learning • Better domain adaptation
