Lecture 6: Linear Regression II

  1. Lecture 6: Linear Regression II Machine Learning CUNY Graduate Center

  2. Extension to polynomial regression

  3. Extension to polynomial regression Polynomial regression is the same as linear regression in D dimensions

  4. Generate new features • Standard polynomial with coefficients w • Risk
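
  The polynomial and risk formulas shown on this slide are not reproduced in the transcript. In standard notation (a reconstruction, not the slide's exact equations):

      y(x, w) = w_0 + w_1 x + w_2 x^2 + \dots + w_D x^D = \sum_{j=0}^{D} w_j x^j

      R(w) = \frac{1}{2} \sum_{i=1}^{N} \bigl( t_i - y(x_i, w) \bigr)^2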

  5. Generate new features • Feature trick: to fit a D-dimensional polynomial, create a D-element feature vector from xi, then run standard linear regression in D dimensions (sketched below).
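
  A minimal NumPy sketch of the feature trick (the data, noise level, and degree below are illustrative, not from the lecture): expand each scalar xi into its powers, then solve ordinary least squares on the expanded features.

      import numpy as np

      def polynomial_features(x, degree):
          """Map each scalar x_i to the vector [1, x_i, x_i^2, ..., x_i^degree]."""
          return np.vstack([x ** j for j in range(degree + 1)]).T

      # Illustrative 1-D data: a noisy cubic.
      rng = np.random.default_rng(0)
      x = np.linspace(-1, 1, 20)
      t = x ** 3 - 0.5 * x + rng.normal(scale=0.05, size=x.shape)

      # "Feature trick": polynomial regression is linear regression on expanded features.
      Phi = polynomial_features(x, degree=3)         # N x (D+1) design matrix
      w, *_ = np.linalg.lstsq(Phi, t, rcond=None)    # least-squares fit of the weights
      print(w)                                       # roughly [0, -0.5, 0, 1]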

  6. How is this still linear regression? • The regression is linear in the parameters, despite projecting xi from one dimension to D dimensions. • Now we fit a plane (or hyperplane) to a representation of xi in a higher dimensional feature space. • This generalizes to any set of functions

  7. Basis functions as feature extraction • These functions are called basis functions. • They define the basis of the feature space • They allow a linear decomposition of many types of functions over the data points • Common choices: • Polynomial • Gaussian • Sigmoids • Wave functions (sine, etc.)
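
  With any fixed basis phi the fit is still linear in w, i.e. y(x, w) = w^T phi(x). A NumPy sketch with Gaussian basis functions (the centres, width, and data are illustrative choices, not values from the lecture):

      import numpy as np

      def gaussian_basis(x, centers, width=0.2):
          """phi_j(x) = exp(-(x - mu_j)^2 / (2 width^2)), plus a constant bias feature."""
          phi = np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2 * width ** 2))
          return np.hstack([np.ones((x.size, 1)), phi])   # prepend the bias column

      rng = np.random.default_rng(1)
      x = np.linspace(0, 1, 30)
      t = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=x.shape)

      centers = np.linspace(0, 1, 9)                  # 9 Gaussian bumps spread over [0, 1]
      Phi = gaussian_basis(x, centers)                # 30 x 10 design matrix
      w, *_ = np.linalg.lstsq(Phi, t, rcond=None)     # still ordinary linear regression
      print(Phi.shape, w.shape)                       # (30, 10) (10,)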

  8. Training data vs. Testing Data • Evaluating the performance of a classifier on training data is meaningless. • With enough parameters, a model can simply memorize (encode) every training point • To evaluate performance, data is divided into training and testing (or evaluation) data. • Training data is used to learn model parameters • Testing data is used to evaluate performance
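
  A minimal NumPy-only sketch of the protocol described here (the 80/20 split and the synthetic data are illustrative choices):

      import numpy as np

      def poly_design(x, degree=3):
          return np.vstack([x ** j for j in range(degree + 1)]).T

      rng = np.random.default_rng(2)
      x = np.linspace(-1, 1, 50)
      t = x ** 3 - 0.5 * x + rng.normal(scale=0.05, size=x.shape)

      # Shuffle, then hold out 20% of the points for evaluation only.
      idx = rng.permutation(x.size)
      n_train = int(0.8 * x.size)
      train_idx, test_idx = idx[:n_train], idx[n_train:]

      # Fit on the training split; report error on the held-out split.
      w, *_ = np.linalg.lstsq(poly_design(x[train_idx]), t[train_idx], rcond=None)
      test_mse = np.mean((poly_design(x[test_idx]) @ w - t[test_idx]) ** 2)
      print(test_mse)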

  9. Overfitting

  10. Overfitting

  11. Overfitting performance

  12. Definition of overfitting • Overfitting is when the model describes the noise rather than the signal. • How can you tell the difference between overfitting and a bad model?

  13. Possible detection of overfitting • Stability • An appropriately fit model is stable under different samples of the training data • An overfit model generates inconsistent performance • Performance • A good model has low test error • A bad model has high test error

  14. What is the optimal model size? • The best model size is the one that generalizes best to unseen data. • Approximate this by testing error. • One way to optimize parameters is to minimize testing error (see the sketch below). • This operation uses testing data as tuning or development data • Sacrifices training data in favor of parameter optimization • Can we do this without explicit evaluation data?
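
  One way to read this slide as code (a sketch; the candidate degrees and the split sizes are illustrative): carve a tuning/development set out of the data and pick the polynomial degree that minimizes its error.

      import numpy as np

      def poly_design(x, degree):
          return np.vstack([x ** j for j in range(degree + 1)]).T

      rng = np.random.default_rng(3)
      x = np.linspace(-1, 1, 60)
      t = np.sin(np.pi * x) + rng.normal(scale=0.1, size=x.shape)

      # Split into training data and tuning (development) data.
      idx = rng.permutation(x.size)
      train_idx, tune_idx = idx[:40], idx[40:]

      best_degree, best_err = None, np.inf
      for degree in range(1, 10):
          w, *_ = np.linalg.lstsq(poly_design(x[train_idx], degree), t[train_idx], rcond=None)
          err = np.mean((poly_design(x[tune_idx], degree) @ w - t[tune_idx]) ** 2)
          if err < best_err:
              best_degree, best_err = degree, err

      print(best_degree)   # the model size that generalizes best to the held-out points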

  15. Context for linear regression • Simple approach • Efficient learning • Extensible • Regularization provides robust models

  16. Linear Regression Identify the best parameters, w, for a regression function
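
  The objective and its solution are not reproduced in the transcript; in the usual matrix notation (a reconstruction), with design matrix \Phi whose rows are \phi(x_n)^T and target vector t:

      E(w) = \frac{1}{2} \sum_{n=1}^{N} \bigl( t_n - w^T \phi(x_n) \bigr)^2
      \quad\Rightarrow\quad
      w^* = (\Phi^T \Phi)^{-1} \Phi^T t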

  17. Overfitting • Recall: overfitting happens when a model is capturing idiosyncrasies of the data rather than generalities. • Often caused by too many parameters relative to the amount of training data. • E.g. an order-N polynomial can intersect any N+1 data points

  18. Dealing with Overfitting • Use more data • Use a tuning set • Regularization • Be a Bayesian

  19. Regularization • In a linear regression model, overfitting is characterized by large weights.

  20. Penalize large weights • Regularized regression (L2-regularization or ridge regression) • Introduce a penalty term in the loss function.

  21. Regularization Derivation
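
  The derivation itself is not in the transcript; a standard reconstruction, adding a squared-weight penalty with coefficient \lambda and setting the gradient to zero:

      E(w) = \frac{1}{2} \sum_{n=1}^{N} \bigl( t_n - w^T \phi(x_n) \bigr)^2 + \frac{\lambda}{2} w^T w

      \nabla_w E = -\Phi^T (t - \Phi w) + \lambda w = 0
      \quad\Rightarrow\quad
      w^* = (\lambda I + \Phi^T \Phi)^{-1} \Phi^T t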

  22. Regularization in Practice
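
  A NumPy sketch of regularization in practice (the degree-9 model and lambda = 1e-3 are illustrative choices; in practice lambda would be tuned on held-out data, as in slide 14):

      import numpy as np

      def ridge_fit(Phi, t, lam):
          """Closed-form L2-regularized weights: w = (lam*I + Phi^T Phi)^-1 Phi^T t."""
          d = Phi.shape[1]
          return np.linalg.solve(lam * np.eye(d) + Phi.T @ Phi, Phi.T @ t)

      rng = np.random.default_rng(4)
      x = np.linspace(0, 1, 15)
      t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

      Phi = np.vstack([x ** j for j in range(10)]).T      # deliberately flexible degree-9 model
      w_unreg = np.linalg.lstsq(Phi, t, rcond=None)[0]    # unregularized: large, unstable weights
      w_ridge = ridge_fit(Phi, t, lam=1e-3)               # penalized: much smaller weights
      print(np.abs(w_unreg).max(), np.abs(w_ridge).max())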

  23. Regularization Results

  24. More regularization • The penalty term defines the style of regularization • L2-Regularization • L1-Regularization • L0-Regularization • Minimizing the L0-norm amounts to selecting the optimal subset of features
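
  Written out (standard definitions of the three penalties, with regularization weight \lambda):

      \text{L2: } \frac{\lambda}{2} \sum_j w_j^2
      \qquad
      \text{L1: } \lambda \sum_j |w_j|
      \qquad
      \text{L0: } \lambda \sum_j \mathbf{1}[w_j \neq 0]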

  25. Curse of dimensionality • Increasing dimensionality of features increases the data requirements exponentially. • For example, if a single feature can be accurately approximated with 100 data points, to optimize the joint over two features requires 100*100 data points. • Models should be small relative to the amount of available data • Dimensionality reduction techniques – feature selection – can help. • L0-regularization is explicit feature selection • L1- and L2-regularizations approximate feature selection.

  26. Bayesians v. Frequentists • What is a probability? • Frequentists • A probability is the likelihood that an event will happen • It is approximated by the ratio of the number of observed events to the number of total events • Assessment is vital to selecting a model • Point estimates are absolutely fine • Bayesians • A probability is a degree of believability of a proposition. • Bayesians require that probabilities be prior beliefs conditioned on data. • The Bayesian approach “is optimal”, given a good model, a good prior and a good loss function. Don’t worry so much about assessment. • If you are ever making a point estimate, you’ve made a mistake. The only valid probabilities are posteriors based on evidence given some prior

  27. Bayesian Linear Regression • The previous MLE derivation of linear regression uses point estimates for the weight vector, w. • Bayesians say, “hold it right there”. • Use a prior distribution over w to estimate parameters • Alpha is a hyperparameter of the prior over w: its precision, or inverse variance. • Now optimize:
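
  The prior and the quantity being optimized are not reproduced in the transcript; in the usual notation (a reconstruction), with \beta the noise precision of the likelihood:

      p(w \mid \alpha) = \mathcal{N}(w \mid 0, \alpha^{-1} I)
      \qquad
      p(w \mid t, \alpha, \beta) \propto p(t \mid w, \beta) \, p(w \mid \alpha)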

  28. Optimize the Bayesian posterior As usual it’s easier to optimize after a log transform.

  29. Optimize the Bayesian posterior (continued) As usual it’s easier to optimize after a log transform.

  30. Optimize the Bayesian posterior • Ignoring terms that do not depend on w, this is an IDENTICAL formulation to L2-regularization.
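
  In standard form (a reconstruction of the equations on slides 28-30), taking the negative log of the posterior and dropping the terms that do not depend on w gives

      -\ln p(w \mid t, \alpha, \beta) = \frac{\beta}{2} \sum_{n=1}^{N} \bigl( t_n - w^T \phi(x_n) \bigr)^2 + \frac{\alpha}{2} w^T w + \text{const},

  i.e. the L2-regularized least-squares objective with \lambda = \alpha / \beta.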

  31. Context • Overfitting is bad. • Bayesians vs. Frequentists • Is one better? • Machine Learning uses techniques from both camps.

  32. Next Time Logistic Regression
