
A Black-Box approach to machine learning



Presentation Transcript


  1. A Black-Box approach to machine learning Yoav Freund

  2. Why do we need learning? • Computers need functions that map highly variable data: • Speech recognition: Audio signal -> words • Image analysis: Video signal -> objects • Bio-Informatics: Micro-array Images -> gene function • Data Mining: Transaction logs -> customer classification • For accuracy, functions must be tuned to fit the data source. • For real-time processing, function computation has to be very fast.

  3. The complexity/accuracy tradeoff [Figure: error vs. complexity, with the 'trivial performance' level marked.]

  4. The speed/flexibility tradeoff [Figure: flexibility vs. speed, spanning Matlab code, Java code, machine code, digital hardware, and analog hardware.]

  5. Theory Vs. Practice • Theoretician: I want a polynomial-time algorithm which is guaranteed to perform arbitrarily well in “all” situations. - I prove theorems. • Practitioner: I want a real-time algorithm that performs well on my problem. - I experiment. • My approach: I want combining algorithms whose performance and speed are guaranteed relative to the performance and speed of their components. - I do both.

  6. Plan of talk • The black-box approach • Boosting • Alternating decision trees • A commercial application • Boosting the margin • Confidence rated predictions • Online learning

  7. The black-box approach • Statistical models are not generators, they are predictors. • A predictor is a function from observation X to action Z. • After the action is taken, outcome Y is observed, which implies a loss L (a real-valued number). • Goal: find a predictor with small loss (in expectation, with high probability, cumulative…)

  8. Main software components [Figure: a learner takes training examples and produces a predictor; the predictor maps an observation x to an action z.] We assume the predictor will be applied to examples similar to those on which it was trained.
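In code, these two components reduce to a pair of function types. The sketch below is illustrative only; the aliases Predictor and Learner are not from the talk.

```python
from typing import Callable, Iterable, Tuple, TypeVar

X = TypeVar("X")   # observation
Z = TypeVar("Z")   # action / prediction
Y = TypeVar("Y")   # observed outcome

# A predictor is a function from an observation to an action.
Predictor = Callable[[X], Z]

# A learner maps training examples (observation, outcome) to a predictor.
Learner = Callable[[Iterable[Tuple[X, Y]]], Predictor]
```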

  9. Learning in a system [Figure: a learning system receives sensor data and training examples, produces a predictor that chooses actions for the target system, and receives feedback.]

  10. Special case: classification • Observation X - arbitrary (measurable) space • Outcome Y - finite set {1,..,K} • Prediction Z - {1,…,K} • Usually K=2 (binary classification)

  11. Batch learning for binary classification • Data distribution: examples (x,y) drawn from a distribution D over X × {-1,+1} • Generalization error: err(h) = Pr_{(x,y)~D}[h(x) ≠ y] • Training set: S = (x_1,y_1),…,(x_m,y_m), drawn i.i.d. from D • Training error: the fraction of training examples on which h(x_i) ≠ y_i

  12. Boosting Combining weak learners

  13. A weighted training set • Feature vectors x_1,…,x_n • Binary labels y_i in {-1,+1} • Positive weights w_i

  14. A weak learner [Figure: the weak learner receives a weighted training set and outputs a weak rule h mapping instances to predictions.] The weak requirement: h must beat random guessing on the weighted set, i.e. its weighted error is at most 1/2 - γ for some advantage γ > 0.

  15. The boosting process [Figure: starting from uniform weights (x_1,y_1,1/n),…,(x_n,y_n,1/n), each round passes the reweighted set (x_1,y_1,w_1),…,(x_n,y_n,w_n) to the weak learner, yielding rules h_1, h_2, …, h_T; the final rule combines them.]

  16. Adaboost

  17. Main property of Adaboost If the advantages of the weak rules over random guessing are γ_1, γ_2, …, γ_T, then the training error of the final rule is at most ∏_t √(1 - 4γ_t²) ≤ exp(-2 Σ_t γ_t²).
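For reference, here is a minimal AdaBoost sketch in Python. It is a generic reconstruction, not the slide's pseudocode; weak_learner stands for any routine that returns a ±1-valued rule for a weighted sample.

```python
import numpy as np

def adaboost(X, y, weak_learner, T):
    """Minimal AdaBoost sketch.
    X: (n, d) feature matrix; y: labels in {-1, +1};
    weak_learner(X, y, w) -> h, where h(X) returns {-1, +1} predictions."""
    n = len(y)
    w = np.full(n, 1.0 / n)                     # start with uniform example weights
    rules, alphas = [], []
    for _ in range(T):
        h = weak_learner(X, y, w)
        pred = h(X)
        err = np.clip(np.sum(w[pred != y]), 1e-12, 1 - 1e-12)
        alpha = 0.5 * np.log((1 - err) / err)   # rule weight from its weighted error
        w *= np.exp(-alpha * y * pred)          # up-weight misclassified examples
        w /= w.sum()
        rules.append(h)
        alphas.append(alpha)

    def final_rule(Xq):
        # Final rule: sign of the weighted vote of all weak rules.
        return np.sign(sum(a * h(Xq) for a, h in zip(alphas, rules)))
    return final_rule
```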

  18. Boosting block diagram [Figure: the booster maintains example weights and repeatedly calls the weak learner; each call returns a weak rule, and the weak rules are combined into a strong learner producing an accurate rule.]

  19. What is a good weak learner? The set of weak rules (features) should be: • Flexible enough to be (weakly) correlated with most conceivable relations between feature vector and label. • Simple enough to allow efficient search for a rule with non-trivial weighted training error. • Small enough to avoid over-fitting. Calculation of prediction from observations should be very fast.
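Decision stumps (single-feature threshold rules) are a classic family meeting these criteria; the brute-force search below is an illustrative weak learner that plugs into the AdaBoost sketch above.

```python
import numpy as np

def stump_learner(X, y, w):
    """Illustrative weak learner: exhaustive search over decision stumps,
    returning the single-feature threshold rule with lowest weighted error."""
    n, d = X.shape
    best = (0, 0.0, +1, np.inf)              # (feature, threshold, sign, error)
    for j in range(d):
        for thr in np.unique(X[:, j]):
            for s in (+1, -1):
                pred = np.where(X[:, j] > thr, s, -s)
                err = np.sum(w[pred != y])
                if err < best[3]:
                    best = (j, thr, s, err)
    j, thr, s, _ = best
    return lambda Xq: np.where(Xq[:, j] > thr, s, -s)
```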

  20. Alternating decision trees Freund, Mason 1997

  21. Decision trees [Figure: a two-dimensional example with tests X>3 and Y>5; each instance is routed through yes/no branches to a leaf labeled +1 or -1.]

  22. A decision tree as a sum of weak rules [Figure: the same tree rewritten so that each node contributes a real-valued score (+0.2, -0.1, +0.1, -0.3, …); the prediction is the sign of the sum of the scores along the instance's path.]

  23. An alternating decision tree [Figure: prediction nodes carrying real-valued scores (e.g. +0.7, -0.3, +0.2) alternate with decision nodes (X>3, Y>5, Y<1); the prediction is the sign of the sum of the scores on all prediction nodes the instance reaches.]
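To make the "sum of weak rules" reading concrete, here is a minimal sketch of evaluating an alternating decision tree, using an illustrative nested-dict representation (not the paper's data format).

```python
def adt_score(node, x):
    """Sum the scores of every prediction node that instance x reaches."""
    total = node["score"]                        # this prediction node's contribution
    for test, branches in node.get("splits", []):
        branch = branches["yes"] if test(x) else branches["no"]
        total += adt_score(branch, x)            # recurse into the chosen branch
    return total

# Toy tree: root score +0.7 and one decision node testing x[0] > 3.
tree = {"score": 0.7,
        "splits": [(lambda x: x[0] > 3,
                    {"yes": {"score": -0.3}, "no": {"score": 0.1}})]}
prediction = 1 if adt_score(tree, (5.0, 2.0)) > 0 else -1   # sign of the total
```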

  24. Example: Medical Diagnostics • Cleve dataset from UC Irvine database. • Heart disease diagnostics (+1=healthy,-1=sick) • 13 features from tests (real valued and discrete). • 303 instances.

  25. AD-tree for heart-disease diagnostics [Figure: the learned tree; a total score >0 predicts healthy, <0 predicts sick.]

  26. Commercial Deployment.

  27. AT&T “buisosity” problem Freund, Mason, Rogers, Pregibon, Cortes 2000 • Distinguish business/residence customers from call detail information (time of day, length of call, …). • 230M telephone numbers, label unknown for ~30% • 260M calls / day • Required computer resources: • Huge: counting log entries to produce statistics -- use specialized I/O-efficient sorting algorithms (Hancock). • Significant: calculating the classification for ~70M customers. • Negligible: learning (2 hours on 10K training examples on an off-line computer).

  28. AD-tree for “buisosity”

  29. AD-tree (Detail)

  30. Quantifiable results [Figure: precision/recall, accuracy plotted against score.] • At 94% accuracy, coverage increased from 44% to 56%. • Saved AT&T $15M in the year 2000 in operations costs and missed opportunities.

  31. Adaboost’s resistance to over-fitting Why statisticians find Adaboost interesting.

  32. A very curious phenomenon: boosting decision trees. Using <10,000 training examples we fit >2,000,000 parameters.

  33. Large margins Thesis: large margins => reliable predictions Very similar to SVM.
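The margin in question is the normalized vote of the combined rule; the slide's formula did not survive, so the standard definition from the boosting-margins literature is reproduced here.

```latex
\operatorname{margin}_f(x,y) \;=\; \frac{y\,\sum_{t=1}^{T}\alpha_t h_t(x)}{\sum_{t=1}^{T}\alpha_t} \;\in\; [-1,+1]
```

It is positive exactly when the combined rule classifies (x,y) correctly, and close to +1 when the weighted vote is nearly unanimous.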

  34. Experimental Evidence

  35. Theorem (Schapire, Freund, Bartlett & Lee / Annals of Statistics 1998) H: set of binary functions with VC-dimension d. No dependence on the number of combined functions!!!
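The bound itself was a formula on the slide; reconstructed from the 1998 paper (up to constants and log factors), it states that for every margin threshold θ > 0, with high probability over a training set of size m,

```latex
\Pr_{D}\!\bigl[\,y f(x) \le 0\,\bigr]
  \;\le\;
\Pr_{S}\!\bigl[\,y f(x) \le \theta\,\bigr]
  \;+\;
\tilde{O}\!\left(\sqrt{\tfrac{d}{m\,\theta^{2}}}\right)
```

where f is the normalized combined rule. The right-hand side depends on the VC-dimension d of the weak-rule class and on θ, but not on T, the number of combined rules.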

  36. Idea of Proof

  37. Confidence rated predictions Agreement gives confidence

  38. A motivating example [Figure: a cloud of + and - training points with three query points marked '?'; regions where the two labels mix are marked 'Unsure'.]

  39. The algorithm Freund, Mansour, Schapire 2001 [Slide gives the parameters, the hypothesis weights, the empirical log ratio, and the prediction rule as formulas; a sketch follows below.]
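A hedged sketch of this style of rule, following the general recipe (exponential weights over a finite hypothesis class, a log-ratio score, and an abstention band); the constants, the exact weighting, and the threshold tuning are illustrative, not taken from the slide.

```python
import numpy as np

def confidence_rated_rule(hypotheses, X, y, eta=1.0, delta=0.5):
    """Illustrative sketch: weight each hypothesis by its empirical error,
    score a query point by a log ratio of the weighted votes, and abstain
    when the score falls inside a band around zero.
    hypotheses: list of functions h(x) -> {-1, +1}; X, y: training data."""
    # Exponential weighting by number of training mistakes.
    errors = np.array([np.mean([h(x) != yi for x, yi in zip(X, y)])
                       for h in hypotheses])
    w = np.exp(-eta * len(y) * errors)

    def predict(x):
        votes_pos = w[[h(x) == +1 for h in hypotheses]].sum()
        votes_neg = w[[h(x) == -1 for h in hypotheses]].sum()
        score = np.log((votes_pos + 1e-12) / (votes_neg + 1e-12))  # log ratio
        if score > delta:
            return +1
        if score < -delta:
            return -1
        return 0          # abstain: "unsure"
    return predict
```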

  40. Suggested tuning Suppose H is a finite set. Yields:

  41. Confidence-rating block diagram [Figure: a rater-combiner takes training examples and candidate rules and outputs a confidence-rated rule.]

  42. Face Detection Viola & Jones 1999 • Paul Viola and Mike Jones developed a face detector that can work in real time (15 frames per second).

  43. Using confidence to save time The detector combines 6000 simple features using Adaboost. In most boxes, only 8-9 features are calculated. [Figure: all boxes are checked against feature 1, then feature 2, …; most are rejected early as 'definitely not a face', the remainder 'might be a face'.]
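The early-rejection idea can be sketched as a cascade; the code below is illustrative (stage scorers and thresholds are hypothetical), not the Viola-Jones implementation.

```python
def cascade_detect(box, stages):
    """Attentional-cascade sketch: stages is a list of (scorer, threshold)
    pairs ordered from cheapest to most expensive. A box is rejected as soon
    as one stage's score falls below its threshold, so most boxes only ever
    evaluate the first few features."""
    for scorer, threshold in stages:
        if scorer(box) < threshold:
            return False      # definitely not a face
    return True               # passed every stage: might be a face
```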

  44. Using confidence to train car detectors

  45. Original Image Vs. difference image

  46. Co-training Blum and Mitchell 98 [Figure: two partially trained classifiers, one based on raw B/W highway images and one on difference images, pass their confident predictions to each other as labels.]

  47. Co-training results Levin, Freund, Viola 2002 [Figure: performance of the raw-image detector and the difference-image detector, before and after co-training.]
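A minimal co-training loop might look like the sketch below; the fit/confidence interface is hypothetical, chosen only to make the exchange of confident predictions explicit.

```python
def co_train(clf_a, clf_b, labeled, unlabeled, rounds=10, k=5):
    """Co-training sketch: two classifiers trained on different views label
    the unlabeled examples they are most confident about for each other.
    Each classifier must provide fit(examples) and confidence(x) -> (label, score)."""
    pool = list(unlabeled)
    for _ in range(rounds):
        clf_a.fit(labeled)
        clf_b.fit(labeled)
        for clf in (clf_a, clf_b):
            # Move this classifier's k most confident unlabeled examples
            # into the labeled set, using its own predicted labels.
            confident = sorted(pool, key=lambda x: clf.confidence(x)[1],
                               reverse=True)[:k]
            for x in confident:
                labeled.append((x, clf.confidence(x)[0]))
                pool.remove(x)
    return clf_a, clf_b
```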

  48. Selective sampling [Figure: a partially trained classifier scans unlabeled data and requests labels only for a sample of unconfident examples, which are added to the labeled examples.] Query-by-committee: Seung, Opper & Sompolinsky; Freund, Seung, Shamir & Tishby.
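In the query-by-committee spirit, "unconfident" can be read as "the committee disagrees"; the following sketch (illustrative interface) picks the most-disputed unlabeled examples to send for labeling.

```python
def select_queries(committee, unlabeled, budget):
    """Selective-sampling sketch: rank unlabeled examples by committee
    disagreement and return the most disputed ones for labeling.
    committee: list of hypotheses h(x) -> {-1, +1}."""
    def disagreement(x):
        votes = [h(x) for h in committee]
        return 1.0 - abs(sum(votes)) / len(votes)   # 0 = unanimous, 1 = even split
    return sorted(unlabeled, key=disagreement, reverse=True)[:budget]
```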

  49. Online learning Adapting to changes

  50. Online learning So far, the only statistical assumption was that the data is generated IID. Can we get rid of that assumption? Yes, if we view prediction as a repeated game. An expert is an algorithm that maps the past to a prediction. Suppose we have a set of experts; we believe one is good, but we don’t know which one.
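The standard tool for this setting is an exponential-weights (Hedge-style) strategy over the experts; the sketch below is generic, with squared-error losses chosen only for illustration.

```python
import numpy as np

def hedge(expert_predictions, outcomes, eta=0.5):
    """Exponential-weights sketch for combining experts online.
    expert_predictions: (T, N) array, one prediction per expert per round;
    outcomes: length-T array of observed outcomes; losses are squared errors."""
    T, N = expert_predictions.shape
    w = np.ones(N)                              # start by trusting every expert equally
    combined = []
    for t in range(T):
        p = w / w.sum()                         # normalized expert weights
        combined.append(p @ expert_predictions[t])     # weighted prediction
        losses = (expert_predictions[t] - outcomes[t]) ** 2
        w *= np.exp(-eta * losses)              # down-weight experts that erred
    return np.array(combined)
```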
