
Lecture 3 – Overview of Supervised Learning


Presentation Transcript


  1. Lecture 3 – Overview of Supervised Learning Rice ELEC 697 Farinaz Koushanfar Fall 2006

  2. Summary • Variable types and terminology • Two simple approaches to prediction • Linear model and least squares • Nearest neighbor methods • Statistical decision theory • Curse of dimensionality • Structured regression models • Classes of restricted estimators • Reading (Ch. 2, ESL)

  3. Variable Types and Terminology • Input/output variables • Quantitative • Qualitative (categorical, discrete, factors) • Ordered categorical • Regression: quantitative output • Classification: qualitative output (usually given a numeric code) • Terminology: • X: input, Y: regression output, G: classification output • xi: the i-th observed value of X (either scalar or vector) • Ŷ: prediction of Y, Ĝ: prediction of G

  4. Two Approaches to Prediction (1) Linear model (OLS) • Given a vector of inputs X = (X1, .., Xp), predict Y via Ŷ = X^Tβ̂, or with an explicit intercept: Ŷ = β̂0 + Σj Xj β̂j • Least squares method: choose β to minimize RSS(β) = Σi (yi − xi^Tβ)² • Differentiate w.r.t. β: X^T(y − Xβ) = 0 • For X^TX nonsingular: β̂ = (X^TX)⁻¹X^Ty (2) Nearest Neighbor (NN) • The k-NN fit for Ŷ is: Ŷ(x) = (1/k) Σ_{xi ∈ Nk(x)} yi (both fits are sketched below)
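For concreteness, a minimal numpy sketch of both fits on toy one-dimensional data (the data, names, and choice of k are my own, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D regression data: y = 2 + 3x + noise
N = 200
x = rng.uniform(-1, 1, size=(N, 1))
y = 2 + 3 * x[:, 0] + rng.normal(0, 0.5, size=N)

# (1) Linear model: add an intercept column and solve the least squares problem;
#     lstsq solves the same normal equations as beta_hat = (X^T X)^{-1} X^T y
X = np.hstack([np.ones((N, 1)), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

def ols_predict(x0):
    return beta_hat[0] + beta_hat[1] * x0

# (2) k-NN: average the y values of the k training points closest to x0
def knn_predict(x0, k=15):
    neighbors = np.argsort(np.abs(x[:, 0] - x0))[:k]   # indices of N_k(x0)
    return y[neighbors].mean()

print(ols_predict(0.3), knn_predict(0.3))
```

Using lstsq rather than forming (X^TX)⁻¹ explicitly gives the same β̂ but is numerically more stable.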

  5. Example

  6. Example - Linear Model • Example: the output G is either GREEN or RED • The two classes are separated by a linear decision boundary • Possible data scenarios: 1- Gaussian, uncorrelated components, same variance, different means for the two classes 2- Each class is a mixture of 10 different Gaussians

  7. Example – 15-Nearest Neighbor • The k-NN fit for Ŷ is: Ŷ(x) = (1/k) Σ_{xi ∈ Nk(x)} yi • Nk(x) is the neighborhood of x containing the k closest training points • The classification rule is a majority vote among the k neighbors in Nk(x) (see the sketch below)
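A minimal sketch of that voting rule (function and variable names are my own; g_train is assumed to be an array of class labels):

```python
import numpy as np
from collections import Counter

def knn_classify(x0, X_train, g_train, k=15):
    """Classify x0 by a majority vote among its k nearest training points."""
    dist = np.linalg.norm(X_train - x0, axis=1)   # Euclidean distances to x0
    neighbors = np.argsort(dist)[:k]              # indices of N_k(x0)
    votes = Counter(g_train[neighbors])           # count class labels among neighbors
    return votes.most_common(1)[0][0]             # the majority class
```

For two classes coded 0/1, this is the same as averaging the codes with the k-NN formula above and assigning the class whose average exceeds 0.5.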

  8. Example – 1-Nearest Neighbor • With 1-NN classification, no training points are misclassified • OLS had 3 parameters; does k-NN have only 1 (namely k)? • In fact, k-NN uses roughly N/k effective parameters

  9. Example - Data Scenario • Data scenario in the previous example: • Density for each class: a mixture of 10 Gaussians • GREEN points: 10 component means drawn from N((1,0)^T, I) • RED points: 10 component means drawn from N((0,1)^T, I) • The variance around each mean was 0.2 for both classes • See the book website for the actual data • The Bayes error is the best possible performance (generating process sketched below)
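A minimal sketch of this generating process (the per-class sample size and the 0/1 label coding are my own choices; the component means and the 0.2 variance follow the slide):

```python
import numpy as np

rng = np.random.default_rng(1)

def make_class(center, n_means=10, n_points=100, var=0.2):
    """Draw 10 Gaussian component means around `center`, then sample points
    from a randomly chosen component with variance `var` per coordinate."""
    means = rng.multivariate_normal(center, np.eye(2), size=n_means)
    picks = rng.integers(0, n_means, size=n_points)
    return means[picks] + rng.normal(0, np.sqrt(var), size=(n_points, 2))

X_green = make_class(np.array([1.0, 0.0]))               # GREEN class
X_red = make_class(np.array([0.0, 1.0]))                 # RED class
X = np.vstack([X_green, X_red])
g = np.array([0] * len(X_green) + [1] * len(X_red))      # 0 = GREEN, 1 = RED
```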

  10. From OLS to NN… • Many modern modeling procedures are variants of OLS or k-NN • Kernel smoothers • Local linear regression • Local basis expansion • Projection pursuit and neural networks

  11. Statistical Decision Theory • Case 1 – quantitative output Y • X ∈ R^p: a real-valued random input vector • L(Y, f(X)): loss function for penalizing prediction errors • The most common form of L is squared-error loss: L(Y, f(X)) = (Y − f(X))² • Criterion for choosing f: EPE(f) = E(Y − f(X))² • Conditioning on X and minimizing pointwise gives the solution f(x) = E(Y | X = x) • This is also known as the regression function (derivation sketched below)
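Writing out the pointwise-minimization step (a standard derivation, not shown on the slide):

```latex
\mathrm{EPE}(f) = \mathbb{E}\,(Y - f(X))^2
              = \mathbb{E}_X\,\mathbb{E}_{Y\mid X}\!\big[(Y - f(X))^2 \,\big|\, X\big],
\qquad\text{so}\qquad
f(x) = \operatorname*{arg\,min}_{c}\ \mathbb{E}\big[(Y - c)^2 \,\big|\, X = x\big]
     = \mathbb{E}(Y \mid X = x).
```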

  12. Statistical Decision Theory (Cont’d) • Case 2 – qualitative output G • The prediction rule is Ĝ(X), where G and Ĝ(X) take values in the set G with |G| = K • L(k, l): the loss of classifying an observation from class Gk as Gl • Unit loss for any misclassification: the 0-1 loss • The expected prediction error is: EPE = E[L(G, Ĝ(X))] • The solution is: Ĝ(x) = argmin_{g ∈ G} Σ_{k=1}^K L(Gk, g) Pr(Gk | X = x)

  13. Statistical Decision Theory (Cont’d) • Case 2 – qualitative output G (cont’d) • With the 0-1 loss function, the solution is: Ĝ(x) = argmin_{g ∈ G} [1 − Pr(g | X = x)] • Or, simply: Ĝ(x) = Gk if Pr(Gk | X = x) = max_{g ∈ G} Pr(g | X = x) • This is the Bayes classifier: pick the class having the maximum conditional probability at the point x

  14. Further Discussions • k-NN uses the conditional expectation directly by • Approximating the expectation by a sample average • Relaxing conditioning at a point to conditioning on a region around the point • As N → ∞ and k → ∞ such that k/N → 0, the k-NN estimate f̂(x) → E(Y | X = x), and is therefore consistent! • OLS assumes a linear structural form f(X) = X^Tβ and minimizes the sample version of EPE directly • As the sample size grows, the coefficient estimate converges to the optimal linear coefficients βopt = [E(XX^T)]⁻¹ E(XY) • The model is limited by the linearity assumption

  15. Example - Bayes Classifier • Question: how did we build the classifier for our simulation example? • Answer: since we generated the data ourselves, the class-conditional densities are known exactly, so Pr(G | X = x) and hence the Bayes decision boundary can be computed directly (sketch below)
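A minimal sketch, assuming the 10 GREEN and 10 RED component means from the simulation are available as arrays means_green and means_red (hypothetical names); with equal class priors, the Bayes rule simply compares the two mixture densities:

```python
import numpy as np

def mixture_density(x0, means, var=0.2):
    """Density at x0 of an equally weighted mixture of Gaussians with covariance var*I."""
    d = means.shape[1]
    sq_dist = np.sum((means - x0) ** 2, axis=1)
    components = np.exp(-sq_dist / (2 * var)) / (2 * np.pi * var) ** (d / 2)
    return components.mean()

def bayes_classify(x0, means_green, means_red, var=0.2):
    # Equal class priors, so compare the class-conditional densities directly
    if mixture_density(x0, means_green, var) >= mixture_density(x0, means_red, var):
        return "GREEN"
    return "RED"
```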

  16. Curse of Dimensionality • k-NN becomes difficult in higher dimensions: • It becomes difficult to gather k points close to x0 • Neighborhoods become spatially large and the estimates are biased • Reducing the spatial size of the neighborhood means reducing k → the variance of the estimate increases

  17. Example 1 – Curse of Dimensionality • Sampling density is proportional to N^(1/p) • If 100 points suffice to estimate a function in R^1, then 100^10 points are needed for the same accuracy in R^10 Example 1: • 1000 training points xi, generated uniformly on [-1,1]^p • The responses are yi = f(xi), with no measurement error • Training set: T; use the 1-NN rule to predict y0 at the test point x0 = 0 • The mean squared error (MSE) for estimating f(0) is: MSE(x0) = E_T[f(x0) − ŷ0]²

  18. Example 1 – Curse of Dimensionality • Bias-variance decomposition: MSE(x0) = E_T[f(x0) − ŷ0]² = E_T[ŷ0 − E_T(ŷ0)]² + [E_T(ŷ0) − f(x0)]² = Var_T(ŷ0) + Bias²(ŷ0) (simulation sketch below)
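A Monte Carlo sketch of how this MSE grows with the dimension p (the target function and the simulation sizes are my own choices; any smooth bump function illustrates the point):

```python
import numpy as np

rng = np.random.default_rng(2)

def f(x):
    # A smooth bump function of ||x||, with f(0) = 1
    return np.exp(-8.0 * np.sum(x ** 2, axis=-1))

def mse_1nn_at_origin(p, N=1000, reps=200):
    """Monte Carlo bias^2 and variance of the 1-NN estimate of f(0) in dimension p."""
    preds = np.empty(reps)
    for r in range(reps):
        X = rng.uniform(-1, 1, size=(N, p))           # training inputs in [-1,1]^p
        nearest = np.argmin(np.sum(X ** 2, axis=1))   # nearest neighbor of x0 = 0
        preds[r] = f(X[nearest])                      # y = f(x): no measurement error
    bias2 = (preds.mean() - 1.0) ** 2                 # squared bias against f(0) = 1
    return bias2, preds.var()

for p in (1, 2, 5, 10):
    bias2, var = mse_1nn_at_origin(p)
    print(f"p={p:2d}  bias^2={bias2:.4f}  var={var:.4f}  MSE={bias2 + var:.4f}")
```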

  19. Example 2 – Curse of Dimensionality • If the linear model is correct, or almost correct, k-NN will do much worse than OLS • Assuming we know that this is the case, simple OLS is essentially unaffected by the dimension

  20. Statistical Models • Y = f(X) + ε (X and ε independent) • The random additive error ε satisfies E(ε) = 0 • Pr(Y|X) depends on X only via the conditional mean f(x) = E(Y | X = x) • This is an approximation to the truth; all unmeasured variables are absorbed into ε • N realizations: yi = f(xi) + εi, with the εi mutually independent • Generally it can be more complicated, e.g. Var(Y|X) = σ²(X) • Additive errors are not typically used with a qualitative response • E.g. for binary trials, E(Y|X=x) = p(x) and Var(Y|X=x) = p(x)[1 − p(x)] • For qualitative outputs, model the conditional probabilities Pr(G | X = x) directly

  21. Function Approximation • The approximation fθ has a set of parameters θ • E.g. fθ(x) = x^Tβ (θ = β), or a linear basis expansion fθ(x) = Σk hk(x) θk • Estimate θ by minimizing RSS(θ) = Σi (yi − fθ(xi))² • This assumes a parametric form for f and a squared-error loss function • A more general principle: Maximum Likelihood (ML) • E.g. a random sample yi, i = 1, .., N from a density Prθ(y) • The log-probability (log-likelihood) of the sample is: L(θ) = Σ_{i=1}^N log Prθ(yi) • E.g. the multinomial likelihood for a qualitative output (basis-expansion fit sketched below)
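A minimal least-squares fit of a basis expansion fθ(x) = Σk hk(x)θk (the polynomial basis, toy data, and names are my own choices):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy data
x = rng.uniform(-1, 1, size=100)
y = np.sin(3 * x) + rng.normal(0, 0.1, size=100)

# Basis expansion f_theta(x) = sum_k h_k(x) * theta_k with h_k(x) = x^k
K = 5
H = np.vander(x, K + 1, increasing=True)          # design matrix [h_0(x) ... h_K(x)]
theta, *_ = np.linalg.lstsq(H, y, rcond=None)     # minimizes RSS(theta)

def f_theta(x0):
    return np.vander(np.atleast_1d(x0), K + 1, increasing=True) @ theta

print(f_theta(0.25))
```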

  22. Example – Least Square Function Approximation

  23. Structured Regression Models • Any function passing through all the training points (xi, yi) has RSS = 0 • We therefore need to restrict the class of functions considered • Usually the restrictions impose some form of local regular behavior • Any method that attempts to approximate a locally varying function in high dimensions is itself “cursed” • Any method that overcomes the curse assumes an implicit metric that does not allow the neighborhood to be simultaneously small in all directions

  24. Classes of Restricted Estimators Some of the classes of restricted methods that we cover: • Roughness penalty and Bayesian methods • Kernel methods and local regression • E.g. for k-NN the equivalent kernel is Kk(x, x0) = I(‖x − x0‖ ≤ ‖x(k) − x0‖), where x(k) is the k-th closest training point to x0 (weight sketch below)
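A minimal sketch of those indicator weights (names are my own):

```python
import numpy as np

def knn_kernel_weights(x0, X_train, k):
    """Indicator weights K_k(x, x0) = I(||x - x0|| <= ||x_(k) - x0||): weight 1 for
    the k training points nearest to x0, 0 otherwise (ties may admit a few extras)."""
    dist = np.linalg.norm(X_train - x0, axis=1)
    cutoff = np.sort(dist)[k - 1]          # distance to the k-th nearest point x_(k)
    return (dist <= cutoff).astype(float)
```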

  25. Model Selection and Bias-Variance Trade-offs • Many of the flexible methods have a smoothing or complexity parameter: • The multiplier of the penalty term • The width of the kernel • Or the number of basis functions • We cannot use RSS on the training data to determine this parameter – that would always select an interpolating fit of the training data • Instead, use the prediction error on unseen test cases to guide the choice • Generally, as the model complexity increases, the variance increases and the squared bias decreases (and vice versa) • Choose the model complexity that trades bias against variance so as to minimize the test error (see the k-NN sketch below)
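A small illustration with k-NN regression, where k plays the role of the complexity parameter (toy data and sizes are my own): the training error is useless for picking k (it is 0 at k = 1), while the test error is typically U-shaped in k.

```python
import numpy as np

rng = np.random.default_rng(4)

def knn_predict(x0, X, y, k):
    return y[np.argsort(np.abs(X - x0))[:k]].mean()

f = lambda x: np.sin(2 * np.pi * x)
X_train = rng.uniform(0, 1, 200); y_train = f(X_train) + rng.normal(0, 0.3, 200)
X_test = rng.uniform(0, 1, 500);  y_test = f(X_test) + rng.normal(0, 0.3, 500)

for k in (1, 5, 15, 50, 150):
    train_mse = np.mean([(knn_predict(x0, X_train, y_train, k) - y0) ** 2
                         for x0, y0 in zip(X_train, y_train)])
    test_mse = np.mean([(knn_predict(x0, X_train, y_train, k) - y0) ** 2
                        for x0, y0 in zip(X_test, y_test)])
    print(f"k={k:3d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```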

  26. Example – Bias-Variance Trade-offs • k-NN on data from Y = f(X) + ε, with E(ε) = 0 and Var(ε) = σ² • For nonrandom (fixed) training inputs xi, the test (generalization) error at x0 is: EPE_k(x0) = E[(Y − f̂k(x0))² | X = x0] = σ² + [f(x0) − (1/k) Σ_{l=1}^k f(x(l))]² + σ²/k • i.e. irreducible error + squared bias + variance (numerical check below)
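A quick numerical check of this decomposition (the target function, σ, k, and input grid are my own choices):

```python
import numpy as np

rng = np.random.default_rng(5)

f = lambda x: x ** 2                 # target function (my choice)
sigma, k, x0 = 0.5, 10, 0.0
X = np.linspace(-1, 1, 201)          # fixed, nonrandom training inputs
nn = np.argsort(np.abs(X - x0))[:k]  # the k inputs nearest to x0

bias2 = (f(x0) - f(X[nn]).mean()) ** 2
theory = sigma ** 2 + bias2 + sigma ** 2 / k        # sigma^2 + Bias^2 + sigma^2/k

# Monte Carlo: redraw the noise, refit the k-NN average, test against a fresh Y at x0
reps = 200_000
eps = rng.normal(0, sigma, size=(reps, k))
f_hat = (f(X[nn]) + eps).mean(axis=1)               # k-NN estimate under each noise draw
y0 = f(x0) + rng.normal(0, sigma, size=reps)        # independent test response at x0
print(theory, np.mean((y0 - f_hat) ** 2))           # the two numbers should agree closely
```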

  27. Bias-Variance Trade-offs • More generally, as the model complexity of our procedure increases, the variance tends to increase and the squared bias tends to decrease • The opposite behavior occurs as the model complexity is decreased • In k-NN the model complexity is controlled by k • Choose your model complexity to trade off variance against bias
