
Chapter 1: Introduction / Chapter 2: Overview of Supervised Learning


Presentation Transcript


  1. Chapter 1: Introduction / Chapter 2: Overview of Supervised Learning 2006.01.20

  2. Supervised learning • Training data set: several features and an outcome for each observation • Build a learner based on the training data set • Predict the unseen outcome of future data from its observed features

  3. An example of supervised learning: email spam • [diagram: emails known to be normal or spam train a learner, which then classifies new, unlabeled emails as spam or normal]

  4. Input & Output • Input = predictor = independent variable • Output = response = dependent variable

  5. Output Types • Quantitative >> regression • Ex) stock price, temperature, age • Qualitative >> classification • Ex) Yes/No

  6. Input Types • Quantitative • Qualitative • Ordered categorical • Ex) small, medium, big

  7. Terminology • X : input variable • X_j : j-th component of X • X (bold) : matrix of observed inputs • x_j : j-th observed value • Y : quantitative output • Ŷ : prediction • G : qualitative output

  8. General model • Given input X and output Y • Want to estimate the unknown function f relating them, based on a known data set (the training data)

  9. Two simple methods • Linear model, linear regression • Nearest neighbor method

  10. Linear model • Given a vector of input features X = (X_1, …, X_p) • Assume the linear relationship Ŷ = β̂_0 + Σ_j X_j β̂_j • Least squares criterion: choose β to minimize RSS(β) = Σ_i (y_i − x_iᵀβ)²
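
A minimal numpy sketch of the least-squares fit described above; the synthetic data, coefficient values, and variable names are assumptions for illustration, not from the slides.

    # Least-squares fit of a linear model on synthetic data (illustrative only).
    import numpy as np

    rng = np.random.default_rng(0)
    N, p = 100, 3
    X = rng.normal(size=(N, p))                        # N observations of p features
    beta_true = np.array([1.0, -2.0, 0.5])             # assumed "true" coefficients
    y = X @ beta_true + 0.1 * rng.normal(size=N)

    Xb = np.hstack([np.ones((N, 1)), X])               # prepend a column of 1s for the intercept
    beta_hat, *_ = np.linalg.lstsq(Xb, y, rcond=None)  # minimizes RSS(beta) = sum (y_i - x_i'beta)^2

    y_hat = Xb @ beta_hat                               # fitted values Y-hat
    rss = np.sum((y - y_hat) ** 2)                      # residual sum of squares
    print(beta_hat, rss)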

  11. Classification example in two dimensions (1) [figure]

  12. Nearest-neighbor method • Classify by majority vote within the k nearest neighbors • In the figure, the new point is classified brown for k = 1 and green for k = 3
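
A small sketch of the majority-vote rule on this slide; the toy training points, labels, and the helper name knn_predict are invented so that, as in the figure, k = 1 and k = 3 give different answers.

    # Toy k-nearest-neighbor classifier: Euclidean distance, majority vote.
    import numpy as np
    from collections import Counter

    def knn_predict(X_train, y_train, x_new, k=3):
        """Classify x_new by majority vote among its k nearest training points."""
        dists = np.linalg.norm(X_train - x_new, axis=1)
        nearest = np.argsort(dists)[:k]
        return Counter(y_train[nearest]).most_common(1)[0][0]

    X_train = np.array([[0.50, 0.50], [2.00, 2.00], [0.65, 0.60], [0.70, 0.45]])
    y_train = np.array(["brown", "brown", "green", "green"])
    x_new = np.array([0.55, 0.52])
    print(knn_predict(X_train, y_train, x_new, k=1))  # -> "brown" (single closest point)
    print(knn_predict(X_train, y_train, x_new, k=3))  # -> "green" (two of the three neighbors)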

  13. Classification example in two dimensions (2) [figure]

  14. Linear model vs. k-nearest neighbor • Linear model: p parameters; stable, smooth fit; low variance, high bias • k-nearest neighbor: effectively N/k parameters; unstable, wiggly fit; high variance, low bias • Each method has its own situations for which it works best.

  15. Misclassification curves

  16. Enhanced Methods • Kernel methods using weights • Kernels that modify the distance metric • Locally weighted least squares • Expansion of the inputs for arbitrarily complex models • Projection pursuit and neural networks
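
The first bullet (kernel methods using weights) can be sketched as a kernel-weighted local average in the Nadaraya-Watson style; the Gaussian kernel, bandwidth, and sine test function are assumptions for illustration, not the slides' own example.

    # Kernel-weighted local average: nearby points get large weights, distant ones nearly zero.
    import numpy as np

    def kernel_smooth(x_train, y_train, x0, bandwidth=0.3):
        """Weighted average of y, with Gaussian weights shrinking as |x - x0| grows."""
        w = np.exp(-0.5 * ((x_train - x0) / bandwidth) ** 2)
        return np.sum(w * y_train) / np.sum(w)

    rng = np.random.default_rng(1)
    x = rng.uniform(0, 2 * np.pi, size=200)
    y = np.sin(x) + 0.2 * rng.normal(size=200)
    print(kernel_smooth(x, y, x0=np.pi / 2))   # close to sin(pi/2) = 1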

  17. Statistical decision theory (1) • Given input X ∈ R^p and output Y ∈ R • Joint distribution: Pr(X, Y) • Looking for a prediction function f(X) • Squared error loss: L(Y, f(X)) = (Y − f(X))² • The f minimizing EPE is the conditional mean f(x) = E(Y | X = x); nearest-neighbor methods approximate it by averaging the y_i whose x_i lie in a neighborhood of x
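
A quick Monte Carlo check of the claim behind this slide, that the conditional mean minimizes the expected squared prediction error at a point; the toy model Y = X² + noise is assumed purely for illustration.

    # Any prediction other than E(Y | X = x0) gives a larger estimated EPE.
    import numpy as np

    rng = np.random.default_rng(2)
    x0 = 1.5
    y_samples = x0 ** 2 + rng.normal(scale=0.5, size=100_000)   # draws of Y given X = x0

    def epe(f_x0):
        """Estimate of E[(Y - f(x0))^2 | X = x0]."""
        return np.mean((y_samples - f_x0) ** 2)

    cond_mean = y_samples.mean()
    print(epe(cond_mean))         # smallest value (about the noise variance 0.25)
    print(epe(cond_mean + 0.5))   # larger: shifting the prediction adds squared bias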

  18. Statistical decision theory (2) • k-nearest neighbor: as N, k → ∞ with k/N → 0, the estimate converges to E(Y | X = x) • In practice samples are insufficient: the curse of dimensionality! • Linear model: assume f(x) ≈ xᵀβ • But the true function might not be linear!

  19. Statistical decision theory (3) • If the squared error loss is replaced by the absolute error E|Y − f(X)|, the solution is the conditional median f̂(x) = median(Y | X = x) • More robust than the conditional mean • But absolute differences are discontinuous in their derivatives

  20. Statistical decision theory (4) • G : categorical output variable • L : loss function • EPE = E[L(G, Ĝ(X))] • Minimizing EPE pointwise gives the Bayes classifier Ĝ(x) = argmax_g Pr(g | X = x)
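
A minimal sketch of the Bayes classifier for the case where class priors and class-conditional densities are known exactly; the two Gaussian classes and the spam/normal labels are assumptions for illustration (in practice Pr(G | X) must be estimated).

    # Bayes classifier: pick the class with the largest posterior Pr(g | x).
    import numpy as np

    def gauss_pdf(x, mu, sigma):
        """Density of N(mu, sigma^2) at x."""
        return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

    priors = {"spam": 0.4, "normal": 0.6}                  # assumed Pr(g)
    params = {"spam": (2.0, 1.0), "normal": (0.0, 1.0)}    # assumed class-conditional N(mu, sigma)

    def bayes_classify(x):
        """argmax_g Pr(g | X = x), using Pr(g | x) proportional to Pr(x | g) * Pr(g)."""
        post = {g: gauss_pdf(x, *params[g]) * priors[g] for g in priors}
        return max(post, key=post.get)

    print(bayes_classify(0.3))   # -> "normal"
    print(bayes_classify(2.5))   # -> "spam"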

  21. References • Reading group on "The Elements of Statistical Learning" – overview.ppt, http://sifaka.cs.uiuc.edu/taotao/stat.html • Welcome to STAT 894 – SupervisedLearningOVERVIEW05.pdf, http://www.stat.ohio-state.edu/~goel/STATLEARN/ • The Matrix Cookbook, http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/3274/pdf/imm3274.pdf • A First Course in Probability

  22. 2.5 Local Methods in High Dimensions • With a reasonably large set of training data, we could always approximate the theoretically optimal conditional expectation by k-nearest-neighbor averaging. • The curse of dimensionality: with p = 10 inputs, to capture 1% of the data to form a local average we must cover 63% of the range of each input variable. • The expected edge length is e_p(r) = r^(1/p), so e_10(0.01) ≈ 0.63. • All sample points are close to an edge of the sample. • Median distance from the origin to the closest of N data points uniform in the unit ball: d(p, N) = (1 − (1/2)^(1/N))^(1/p).
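
The two formulas above written as small functions, so the numbers quoted on the slide can be reproduced; the function names are mine.

    # Curse-of-dimensionality quantities from Section 2.5.
    def edge_length(r, p):
        """Expected edge length e_p(r) = r**(1/p) of a cube capturing a fraction r of the data."""
        return r ** (1.0 / p)

    def median_nearest_distance(p, N):
        """Median distance from the origin to the closest of N uniform points in the unit p-ball."""
        return (1 - 0.5 ** (1.0 / N)) ** (1.0 / p)

    print(edge_length(0.01, 10))             # ~0.63: 63% of each axis to capture 1% of the data
    print(median_nearest_distance(10, 500))  # ~0.52: the closest point is over halfway to the boundary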

  23. 2.5 Local Methods in High Dimensions • Example: 1-NN vs. linear model • 1-NN: as p increases, the MSE and its squared-bias component tend to 1.0 • Linear model: taking the expectation over x0, the expected EPE grows only linearly as a function of p • By relying on rigid assumptions, the linear model has no bias at all and negligible variance, while the error in 1-nearest neighbor is substantially larger.
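
A hedged re-creation of the 1-NN half of this comparison, assuming the setup of the book's example: Y = exp(-8 · ||X||²) with no noise, inputs uniform on [-1, 1]^p, and prediction at x0 = 0 by the nearest neighbor.

    # As p grows, the nearest neighbor drifts away from the origin, so the
    # prediction of f(0) = 1 collapses toward 0 and the squared bias approaches 1.
    import numpy as np

    rng = np.random.default_rng(5)

    def one_nn_at_origin(p, n=1000, repeats=200):
        """Return (squared bias, variance, MSE) of the 1-NN estimate of f(0)."""
        preds = np.empty(repeats)
        for r in range(repeats):
            X = rng.uniform(-1, 1, size=(n, p))
            y = np.exp(-8 * np.sum(X ** 2, axis=1))
            preds[r] = y[np.argmin(np.sum(X ** 2, axis=1))]   # value at the nearest neighbor of 0
        bias2 = (preds.mean() - 1.0) ** 2
        return bias2, preds.var(), bias2 + preds.var()

    for p in (1, 2, 5, 10):
        print(p, one_nn_at_origin(p))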

  24. 2.6 Statistical Models, Supervised Learning and Function Approximation • Goal: find a useful approximation to the function f(x) that underlies the predictive relationship between the inputs and outputs. • Supervised learning: the machine learning point of view • Function approximation: the mathematics and statistics point of view

  25. 2.7 Structured Regression Models • Nearest-neighbor and other local methods face problems in high dimensions. • They may be inappropriate even in low dimensions. • Need for structured approaches. • Difficulty of the problem • Infinitely many solutions to minimizing RSS. • Unique solution comes from restrictions on f.

  26. 2.8 Classes of Restricted Estimators • Methods categorized by the nature of the restrictions. • Roughness penalty and Bayesian methods • Penalize functions that vary too rapidly over small regions of input space. • Kernel methods and local regression • Explicitly specify the nature of the local neighborhood (the kernel function). • Need adaptation in high dimensions. • Basis functions and dictionary methods • Linear expansions of basis functions.

  27. 2.9 Model Selection and the Bias-Variance Tradeoff • All models have a smoothing or complexity parameter to be determined • Multiplier of the penalty term • Width of the kernel • Number of basis functions

  28. Bias-variance tradeoff • The irreducible part of the error comes from ε: no model can remove it • The remaining error decomposes into squared bias plus variance, and reducing one tends to increase the other. Tradeoff!

  29. Bias-Variance tradeoff in kNN
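
A simulation sketch of that tradeoff at a single point x0, under an assumed toy target f(x) = sin(4x) with noise variance sigma² and fixed training inputs: the variance of the k-NN fit behaves like sigma²/k, while the squared bias grows with k as more distant neighbors are averaged in.

    # Repeatedly redraw the noise, refit k-NN at x0, and split the error into bias^2 and variance.
    import numpy as np

    rng = np.random.default_rng(3)
    f = lambda x: np.sin(4 * x)
    sigma = 0.3
    x_train = np.sort(rng.uniform(0, 1, size=100))   # inputs held fixed across repeats
    x0, n_repeats = 0.5, 2000

    def knn_fit_at_x0(y_train, k):
        idx = np.argsort(np.abs(x_train - x0))[:k]   # k nearest training inputs to x0
        return y_train[idx].mean()

    for k in (1, 5, 25):
        fits = np.array([knn_fit_at_x0(f(x_train) + sigma * rng.normal(size=100), k)
                         for _ in range(n_repeats)])
        bias2 = (fits.mean() - f(x0)) ** 2
        print(f"k={k:2d}  bias^2={bias2:.4f}  variance={fits.var():.4f}")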

  30. Model complexity • [figure: prediction error vs. model complexity; training error decreases steadily with complexity, while test error is U-shaped: high bias / low variance at low complexity, low bias / high variance at high complexity]
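
A sketch of these curves using k-NN regression on assumed synthetic data, where complexity increases as k decreases (roughly N/k effective parameters): training error keeps falling toward k = 1, while test error eventually turns back up.

    # Training vs. test error as k-NN model complexity varies.
    import numpy as np

    rng = np.random.default_rng(4)
    f = lambda x: np.sin(4 * x)

    def make_data(n):
        x = rng.uniform(0, 1, size=n)
        return x, f(x) + 0.3 * rng.normal(size=n)

    def knn_mse(x_tr, y_tr, x_eval, y_eval, k):
        preds = np.array([y_tr[np.argsort(np.abs(x_tr - x))[:k]].mean() for x in x_eval])
        return np.mean((y_eval - preds) ** 2)

    x_tr, y_tr = make_data(100)
    x_te, y_te = make_data(1000)
    for k in (50, 20, 10, 5, 2, 1):   # left to right: increasing complexity
        print(k, knn_mse(x_tr, y_tr, x_tr, y_tr, k), knn_mse(x_tr, y_tr, x_te, y_te, k))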
