
Lecture 3: Linear Regression

Machine Learning, CUNY Graduate Center. Today: Calculus, Lagrange Multipliers, Linear Regression.





Presentation Transcript


  1. Lecture 3: Linear Regression Machine Learning CUNY Graduate Center

  2. Today • Calculus • Lagrange Multipliers • Linear Regression

  3. Optimization with constraints • What if I want to constrain the parameters of the model? • E.g., the mean is less than 10 • Find the best likelihood, subject to a constraint. • Two functions: • An objective function to maximize • An inequality that must be satisfied

  4. Lagrange Multipliers Find maxima of f(x,y) subject to a constraint.

  5. General form Maximizing: Subject to: Introduce a new variable, and find a maximum.
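The equations on this slide did not survive extraction. As a reference point, the standard textbook form of the method is sketched below; the symbols f, g, c and the multiplier λ are assumptions here, not recovered from the slide.

```latex
\begin{aligned}
&\text{maximize } f(x, y) \quad \text{subject to } g(x, y) = c \\
&\Lambda(x, y, \lambda) = f(x, y) + \lambda \, \bigl(g(x, y) - c\bigr) \\
&\nabla_{x, y, \lambda} \, \Lambda(x, y, \lambda) = 0
\end{aligned}
```

Setting the gradient with respect to λ to zero recovers the constraint itself, which is why the constrained problem turns into an unconstrained stationary-point problem in one extra variable.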

  6. Example Maximizing: Subject to: Introduce a new variable, and find a maximum.

  7. Example We now have 3 equations with 3 unknowns.

  8. Example • Eliminate lambda • Substitute and solve
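The objective and constraint used in the slides' example are not recoverable from the transcript. The following is a hypothetical stand-in (maximize x + y on the unit circle) that walks through the same steps: form three equations in three unknowns, eliminate λ, then substitute and solve.

```latex
\begin{aligned}
&\text{maximize } f(x, y) = x + y \quad \text{subject to } x^2 + y^2 = 1 \\
&\Lambda(x, y, \lambda) = x + y + \lambda \, (x^2 + y^2 - 1) \\
&\frac{\partial \Lambda}{\partial x} = 1 + 2\lambda x = 0, \qquad
 \frac{\partial \Lambda}{\partial y} = 1 + 2\lambda y = 0, \qquad
 \frac{\partial \Lambda}{\partial \lambda} = x^2 + y^2 - 1 = 0 \\
&\text{Eliminate } \lambda: \quad x = y = -\frac{1}{2\lambda} \\
&\text{Substitute: } 2x^2 = 1 \;\Rightarrow\; x = y = \pm \frac{1}{\sqrt{2}} \\
&\text{Maximum at } x = y = \frac{1}{\sqrt{2}}, \quad f = \sqrt{2}, \quad \lambda = -\frac{1}{\sqrt{2}}
\end{aligned}
```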

  9. Basics of Linear Regression • Regression algorithm • Supervised technique. • In one dimension: • Identify • In D-dimensions: • Identify • Given: training data: • And targets:

  10. Graphical Example of Regression ?

  11. Graphical Example of Regression

  12. Graphical Example of Regression

  13. Definition Where w is a vector of weights that defines the D parameters of the model. In linear regression, we assume that the model that generates the data involves only a linear combination of the input variables.
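The defining equation is missing from the transcript. A minimal sketch of a linear model, assuming the usual form of a weighted sum of the inputs plus a separate bias term w0 (the function name `predict` and the sample numbers are illustrative only):

```python
import numpy as np

def predict(X, w0, w):
    """Linear model: y_hat = w0 + w . x for each row x of X."""
    return w0 + X @ w

# Two data points with D = 2 input variables each (made-up values).
X = np.array([[1.0, 2.0], [3.0, 4.0]])
w = np.array([0.5, -1.0])
y_hat = predict(X, w0=2.0, w=w)
```

Each prediction depends on the inputs only through the weighted sum, which is the "linear combination" the slide refers to.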

  14. Evaluation • How can we evaluate the performance of a regression solution? • Error Functions (or Loss functions) • Squared Error • Linear Error

  15. Regression Error

  16. Empirical Risk Empirical risk is the measure of the loss from data. By minimizing risk on the training data, we optimize the fit with respect to the loss function.
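A small sketch of the two error functions named above and of empirical risk as their average over the training set. Two assumptions are made here: "linear error" is read as absolute error, and the risk is normalized by the number of points (the slides may use a plain sum or a factor of 1/2 instead).

```python
import numpy as np

def squared_error(y, y_hat):
    """Squared loss per data point."""
    return (y - y_hat) ** 2

def absolute_error(y, y_hat):
    """Absolute ("linear") loss per data point; an assumed reading of the slide."""
    return np.abs(y - y_hat)

def empirical_risk(loss, y, y_hat):
    """Average loss over the data set."""
    return float(np.mean(loss(y, y_hat)))

# Made-up targets and predictions for illustration.
y = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.5, 2.0, 1.0])
risk_sq = empirical_risk(squared_error, y, y_hat)
risk_abs = empirical_risk(absolute_error, y, y_hat)
```

The squared loss punishes the single large miss (3.0 vs. 1.0) much more heavily than the absolute loss does.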

  17. Model Likelihood and Empirical Risk • Two related but distinct ways to look at a model. • Model Likelihood. • “What is the likelihood that a model generated the observed data?” • Empirical Risk • “How much error does the model have on the training data?”

  18. Model Likelihood Assuming independent and identically distributed (iid) data.

  19. Understanding Model Likelihood • Substitute the equation of a Gaussian • Apply a log function • Let the log dissolve products into sums

  20. Understanding Model Likelihood Optimize the weights (Maximum Likelihood Estimation). Maximizing the log likelihood is equivalent to minimizing empirical risk with a squared loss function.
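The equations for these steps are missing from the transcript. A sketch of the standard derivation, assuming iid targets t_i with Gaussian noise of variance σ² around the linear prediction (the symbols t_i, x_i, σ² and the 1..N indexing are assumptions):

```latex
\begin{aligned}
p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \sigma^2)
  &= \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}}
     \exp\!\left(-\frac{(t_i - \mathbf{w}^\top \mathbf{x}_i)^2}{2\sigma^2}\right) \\
\ln p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \sigma^2)
  &= -\frac{N}{2}\ln(2\pi\sigma^2)
     - \frac{1}{2\sigma^2}\sum_{i=1}^{N} (t_i - \mathbf{w}^\top \mathbf{x}_i)^2 \\
\arg\max_{\mathbf{w}} \ln p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \sigma^2)
  &= \arg\min_{\mathbf{w}} \frac{1}{2}\sum_{i=1}^{N} (t_i - \mathbf{w}^\top \mathbf{x}_i)^2
\end{aligned}
```

The first term of the log likelihood does not depend on w and the factor 1/σ² is positive, so only the sum of squared errors matters when optimizing the weights.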

  21. Maximizing Log Likelihood (1-D) Find the optimal settings of w.

  22. Maximizing Log Likelihood • Partial derivative • Set to zero • Separate the sum to isolate w0

  23. Maximizing Log Likelihood • Partial derivative • Set to zero • Separate the sum to isolate w0

  24. Maximizing Log Likelihood • From previous partial • From previous slide • Substitute • Isolate w1
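The result of this derivation is not preserved in the transcript. A sketch of the standard one-dimensional closed form that setting both partial derivatives to zero leads to, written in mean-centered form (the function name `fit_line` and the data are illustrative):

```python
import numpy as np

def fit_line(x, t):
    """Least-squares fit of t = w0 + w1 * x in one dimension."""
    x_mean, t_mean = x.mean(), t.mean()
    # Slope: covariance of x and t divided by the variance of x.
    w1 = np.sum((x - x_mean) * (t - t_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept: from the w0 partial, w0 = mean(t) - w1 * mean(x).
    w0 = t_mean - w1 * x_mean
    return w0, w1

# Noise-free data generated from t = 1 + 2x, so the fit should recover it.
x = np.array([0.0, 1.0, 2.0, 3.0])
t = 1.0 + 2.0 * x
w0, w1 = fit_line(x, t)
```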

  25. Maximizing Log Likelihood Clean and easy. Or not… Apply some linear algebra.

  26. Likelihood using linear algebra Representing the linear regression function in terms of vectors.

  27. Likelihood using linear algebra • Representation as vectors • Stack the data into a matrix and use the norm operation to handle the sum • Stack xT into a matrix of data points, X.

  28. Likelihood in multiple dimensions This representation of risk has no inherent dimensionality.

  29. Maximum Likelihood Estimation redux • Decompose the norm • FOIL (linear algebra style) • Differentiate • Combine terms • Isolate w
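The steps above lead to the usual normal equations. A minimal sketch, assuming the risk is written as 0.5 * ||t - Xw||^2 and that a column of ones is prepended to X to carry the bias (the helper name `fit_least_squares` and the synthetic data are illustrative):

```python
import numpy as np

def fit_least_squares(X, t):
    """Solve the normal equations (X^T X) w = X^T t, with a bias column."""
    # Prepend a column of ones so that w[0] plays the role of w0.
    Xb = np.column_stack([np.ones(len(X)), X])
    # Expanding 0.5 * ||t - Xb w||^2 and differentiating gives
    #   -Xb^T t + Xb^T Xb w = 0,
    # so w solves the linear system below.
    return np.linalg.solve(Xb.T @ Xb, Xb.T @ t)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_w = np.array([0.5, 1.0, -2.0, 3.0])   # [w0, w1, w2, w3]
t = true_w[0] + X @ true_w[1:]             # noise-free targets
w = fit_least_squares(X, t)
```

In practice `np.linalg.lstsq` is usually the more numerically robust choice; the explicit solve is used here only because it mirrors the derivation.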

  30. Extension to polynomial regression

  31. Extension to polynomial regression Polynomial regression is the same as linear regression in D dimensions

  32. Generate new features Standard Polynomial with coefficients, w Risk

  33. Generate new features Feature trick: to fit a degree-D polynomial, create a D-element vector from xi (its powers up to D), then apply standard linear regression in D dimensions.

  34. How is this still linear regression? The regression is linear in the parameters, despite projecting xi from one dimension to D dimensions. Now we fit a plane (or hyperplane) to a representation of xi in a higher dimensional feature space. This generalizes to any set of functions
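A sketch of the feature trick, assuming scalar inputs and using NumPy's Vandermonde helper to build the powers of x. Here the constant column is included in the feature matrix, so a degree-2 fit has three columns (the function name `polynomial_features` and the data are illustrative).

```python
import numpy as np

def polynomial_features(x, degree):
    """Map scalar inputs x to columns [1, x, x^2, ..., x^degree]."""
    return np.vander(x, N=degree + 1, increasing=True)

# Noise-free quadratic data: t = 1 - 3x + 0.5x^2.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
t = 1.0 - 3.0 * x + 0.5 * x ** 2

Phi = polynomial_features(x, degree=2)
# Ordinary linear least squares on the expanded features.
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
```

The fitted curve is nonlinear in x, but the model is still a linear function of the weights w, which is why the linear regression solution applies unchanged.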

  35. Basis functions as feature extraction • These functions are called basis functions. • They define the bases of the feature space • Allows linear decomposition of any type of function to data points • Common Choices: • Polynomial • Gaussian • Sigmoids • Wave functions (sine, etc.)
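As an illustration of a non-polynomial choice, the sketch below builds Gaussian basis functions and fits them with the same least-squares machinery. The centers, the width, and the sine-wave target are arbitrary assumptions made for the example.

```python
import numpy as np

def gaussian_basis(x, centers, width):
    """One Gaussian bump per center, plus a constant column for the bias."""
    bumps = np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2.0 * width ** 2))
    return np.column_stack([np.ones(len(x)), bumps])

x = np.linspace(0.0, 1.0, 30)
t = np.sin(2.0 * np.pi * x)              # made-up target function
centers = np.linspace(0.0, 1.0, 6)       # assumed, evenly spaced centers

Phi = gaussian_basis(x, centers, width=0.2)
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
fit_error = float(np.mean((Phi @ w - t) ** 2))
```

Only the function that builds the feature matrix changes between basis families; the fitting step is identical.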

  36. Training data vs. Testing Data • Evaluating the performance of a classifier on training data is meaningless. • With enough parameters, a model can simply memorize (encode) every training point • To evaluate performance, data is divided into training and testing (or evaluation) data. • Training data is used to learn model parameters • Testing data is used to evaluate performance
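A minimal sketch of dividing data into training and testing portions with a random permutation. The helper name `train_test_split` and the 30% test fraction are assumptions for illustration; this is not a reference to any particular library function.

```python
import numpy as np

def train_test_split(X, t, test_fraction, rng):
    """Randomly partition (X, t) into disjoint training and testing sets."""
    idx = rng.permutation(len(X))
    n_test = int(round(test_fraction * len(X)))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], t[train_idx], X[test_idx], t[test_idx]

rng = np.random.default_rng(0)
X = np.arange(20, dtype=float).reshape(10, 2)
t = np.arange(10, dtype=float)
X_train, t_train, X_test, t_test = train_test_split(X, t, 0.3, rng)
```

Parameters are then fit on `X_train` and `t_train` only, and performance is reported on `X_test` and `t_test`.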

  37. Overfitting

  38. Overfitting

  39. Overfitting performance

  40. Definition of overfitting When the model describes the noise, rather than the signal. How can you tell the difference between overfitting and a bad model?

  41. Possible detection of overfitting • Stability • An appropriately fit model is stable under different samples of the training data • An overfit model generates inconsistent performance • Performance • A good model has low test error • A bad model has high test error

  42. What is the optimal model size? • The best model size is the one that generalizes best to unseen data. • Approximate this by testing error. • One way to optimize parameters is to minimize testing error. • This operation uses testing data as tuning or development data • Sacrifices training data in favor of parameter optimization • Can we do this without explicit evaluation data?
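One way to picture this is choosing a polynomial degree by its error on held-out development data. The sketch below assumes synthetic noisy sine data, a 30/30 split, and candidate degrees 0 through 9; all of these are illustrative choices.

```python
import numpy as np

def fit_poly(x, t, degree):
    """Least-squares polynomial fit; returns weights for [1, x, ..., x^degree]."""
    Phi = np.vander(x, N=degree + 1, increasing=True)
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    return w

def mse(x, t, w):
    """Mean squared error of the polynomial with weights w on (x, t)."""
    Phi = np.vander(x, N=len(w), increasing=True)
    return float(np.mean((Phi @ w - t) ** 2))

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=60)
t = np.sin(np.pi * x) + rng.normal(scale=0.2, size=60)

# Hold out half of the data as a development (tuning) set.
x_train, t_train = x[:30], t[:30]
x_dev, t_dev = x[30:], t[30:]

# Fit each candidate model size on training data, score it on development data.
dev_errors = {d: mse(x_dev, t_dev, fit_poly(x_train, t_train, d)) for d in range(10)}
best_degree = min(dev_errors, key=dev_errors.get)
```

The 30 development points are no longer available for fitting the weights, which is the sacrifice the slide mentions.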

  43. Context for linear regression • Simple approach • Efficient learning • Extensible • Regularization provides robust models

  44. Break Coffee. Stretch.

  45. Linear Regression Identify the best parameters, w, for a regression function

  46. Overfitting • Recall: overfitting happens when a model is capturing idiosyncrasies of the data rather than generalities. • Often caused by too many parameters relative to the amount of training data. • E.g. an order-N polynomial can intersect any N+1 data points
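The last point can be checked numerically. The sketch below assumes N = 5, distinct inputs, and targets that are pure random noise, then shows that an order-N polynomial still reproduces all N + 1 points.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
x = np.linspace(-1.0, 1.0, n + 1)        # N + 1 distinct inputs
t = rng.normal(size=n + 1)               # targets are pure noise

# Square Vandermonde system: N + 1 equations, N + 1 polynomial coefficients.
Phi = np.vander(x, N=n + 1, increasing=True)
w = np.linalg.solve(Phi, t)

# Largest absolute residual on the training points.
train_error = float(np.max(np.abs(Phi @ w - t)))
```

The training error is zero up to rounding even though there is no signal to learn, which is overfitting in its purest form.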

  47. Dealing with Overfitting • Use more data • Use a tuning set • Regularization • Be a Bayesian

  48. Regularization In a linear regression model overfitting is characterized by large weights.

  49. Penalize large weights • Regularized Regression (L2-Regularization or Ridge Regression) • Introduce a penalty term in the loss function.

  50. Regularization Derivation
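The derivation itself is missing from the transcript. A sketch of the standard ridge solution, assuming the regularized loss 0.5 * ||t - Xw||^2 + 0.5 * lam * ||w||^2 and, for brevity, no separate unpenalized bias term (the name `fit_ridge`, the value of `lam`, and the data are illustrative):

```python
import numpy as np

def fit_ridge(X, t, lam):
    """Closed-form L2-regularized least squares."""
    d = X.shape[1]
    # Setting the gradient to zero:
    #   -X^T (t - X w) + lam * w = 0   =>   (X^T X + lam * I) w = X^T t
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ t)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
t = X @ np.array([1.0, -2.0, 0.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=20)

w_ols = fit_ridge(X, t, lam=0.0)      # no penalty: ordinary least squares
w_ridge = fit_ridge(X, t, lam=10.0)   # penalized: weights shrink toward zero
```

Compared with the unregularized solution, the only change is the lam * I term added to X^T X, which shrinks the weights and also keeps the system well conditioned.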
