This guide explores the fundamentals of linear regression, a supervised learning technique used in machine learning. It covers essential concepts such as features, targets, predictions, and the learning algorithm. The focus is on optimizing parameters to minimize prediction errors, with an emphasis on the Mean Squared Error (MSE) cost function. The guide also examines the potential for using nonlinear features and higher-order polynomials to improve model accuracy. Discover how to visualize cost functions and the implications of overfitting in the context of regression analysis.
Machine Learning and Data Mining: Linear regression (adapted from Prof. Alexander Ihler)
Supervised learning • Notation: features x, targets y, predictions ŷ, parameters θ • Training data (examples): features plus feedback / target values • Program (“Learner”): characterized by some parameters θ; a procedure (using θ) that outputs a prediction • Score performance with a “cost function” • Learning algorithm: change θ to improve performance
Linear regression • “Predictor”: evaluate the line at a feature value x and return the result r • Define the form of the function f(x) explicitly • Find a good f(x) within that family [Figure: fitted line, target y vs. feature x] (c) Alexander Ihler
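As a minimal illustration of the predictor (the slope and intercept here are made-up values, not from the slides):

theta = [10 1.5];               % hypothetical parameters: intercept 10, slope 1.5
x     = 12;                     % a single feature value
yhat  = theta(1) + theta(2)*x;  % evaluate the line and return the prediction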
More dimensions? [Figure: target y plotted over two features x1 and x2] (c) Alexander Ihler
Notation • Define “feature” x0 = 1 (constant) • Then the prediction is a single inner product: f(x; θ) = θ0·x0 + θ1·x1 + … = θ · xᵀ (c) Alexander Ihler
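A small sketch of this notation with made-up numbers: once the constant feature x0 = 1 is prepended, evaluating the predictor is one inner product.

theta = [10 1.5 -0.3];   % hypothetical parameter row vector [theta0 theta1 theta2]
x     = [1 12 7];        % one example: [x0 x1 x2] with the constant feature x0 = 1
yhat  = theta * x';      % f(x) = theta0*1 + theta1*x1 + theta2*x2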
Measuring error • Error or “residual”: the difference between the observation y and the prediction ŷ [Figure: observations, the fitted line, and the residuals between them] (c) Alexander Ihler
Mean squared error • How can we quantify the error? • Mean squared error: J(θ) = (1/m) Σᵢ (y⁽ⁱ⁾ − ŷ⁽ⁱ⁾)² • Could choose something else, of course… • Computationally convenient (more later) • Measures the variance of the residuals • Corresponds to likelihood under Gaussian model of “noise” (c) Alexander Ihler
MSE cost function • Rewrite using matrix form (Matlab): >> e = y' - th*X'; J = e*e'/m; (c) Alexander Ihler
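A usage sketch of the expression above on a tiny made-up training set (the numbers are arbitrary):

X  = [1 1; 1 2; 1 3; 1 4];   % m x n data matrix, constant feature in column 1
y  = [3; 5; 7; 9];           % m x 1 target vector
th = [1 2];                  % candidate parameters (row vector)
m  = size(X, 1);
e  = y' - th*X';             % 1 x m row of residuals
J  = e*e'/m;                 % mean squared error (0 here: the line fits exactly)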
Visualizing the cost function [Figure: the cost surface J(θ) plotted over θ0 and θ1] (c) Alexander Ihler
Finding good parameters • Want to find parameters which minimize our error… • Think of a cost “surface”: error residual for that θ… (c) Alexander Ihler
Machine Learning and Data Mining: Linear regression, direct minimization (adapted from Prof. Alexander Ihler)
MSE minimum • Consider a simple problem • One feature, two data points • Two unknowns: θ0, θ1 • Two equations: y⁽¹⁾ = θ0 + θ1 x⁽¹⁾ and y⁽²⁾ = θ0 + θ1 x⁽²⁾ • Can solve this system directly • However, most of the time, m > n • There may be no linear function that hits all the data exactly • Instead, solve directly for the minimum of the MSE function (c) Alexander Ihler
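A sketch of the two-point case with made-up values; with two equations and two unknowns the system is solved exactly:

X  = [1 2; 1 5];    % two data points, rows [x0 x1] with x0 = 1
y  = [7; 13];       % two targets
th = (X \ y)';      % exact solution of the 2 x 2 system: theta = [3 2]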
SSE minimum • Setting the gradient to zero and reordering, we have θ = yᵀ X (Xᵀ X)⁻¹ • X (Xᵀ X)⁻¹ is called the “pseudo-inverse” of Xᵀ • If Xᵀ is square and its columns are linearly independent, this is exactly the inverse • If m > n: overdetermined; gives the minimum-MSE fit (c) Alexander Ihler
Matlab SSE • This is easy to solve in Matlab… % y = [y1 ; … ; ym] (m x 1 targets) % X = [x1_0 x1_1 … ; x2_0 x2_1 … ; …] (one example per row, with x_0 = 1) % Solution 1: “manual” th = y' * X * inv(X' * X); % Solution 2: “mrdivide” th = y' / X'; % th*X' = y' => th = y'/X' (c) Alexander Ihler
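A quick usage sketch on made-up, overdetermined data (m = 5 > n = 2), checking that both forms give the same least-squares fit:

X    = [1 0; 1 1; 1 2; 1 3; 1 4];   % one example per row, constant feature first
y    = [1.1; 2.9; 5.2; 6.8; 9.1];   % noisy targets
th1  = y' * X * inv(X' * X);        % "manual" normal-equation solution
th2  = y' / X';                     % "mrdivide" solution; agrees with th1
yhat = th2 * X';                    % fitted values at the training inputs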
Effects of MSE choice • Sensitivity to outliers: an outlying point incurs 16² cost for this one datum; heavy penalty for large errors [Figure: squared-error cost of each residual, and a least-squares fit on data containing an outlier] (c) Alexander Ihler
L1 error [Figure: fitted lines using L2 on the original data, L1 on the original data, and L1 on the outlier data] (c) Alexander Ihler
Cost functions for regression • Mean squared error (MSE): J(θ) = (1/m) Σᵢ (y⁽ⁱ⁾ − θ·x⁽ⁱ⁾)² • Mean absolute error (MAE): J(θ) = (1/m) Σᵢ |y⁽ⁱ⁾ − θ·x⁽ⁱ⁾| • …or something else entirely (???) • “Arbitrary” cost functions can’t be minimized in closed form: use gradient descent (c) Alexander Ihler
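A minimal gradient-descent sketch, applied here to the MSE cost for concreteness (the data, step size, and iteration count are all arbitrary choices):

X     = [1 0; 1 1; 1 2; 1 3];   % made-up data with the constant feature included
y     = [1; 3; 5; 7];           % targets
m     = size(X, 1);
th    = zeros(1, size(X, 2));   % start at theta = 0
alpha = 0.1;                    % learning rate (assumed, not tuned)
for it = 1:500
    e  = y' - th*X';            % residuals, 1 x m
    g  = -(2/m) * e * X;        % gradient of J(theta) = e*e'/m
    th = th - alpha * g;        % take a step downhill
end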
Machine Learning and Data Mining: Linear regression, nonlinear features (adapted from Prof. Alexander Ihler)
Nonlinear functions • What if our hypotheses are not lines? • Ex: higher-order polynomials (c) Alexander Ihler
Nonlinear functions • Single feature x, predict target y: ŷ = θ0 + θ1 x + θ2 x² + θ3 x³ • Sometimes useful to think of this as a “feature transform”: add features Φ(x) = [1, x, x², x³], then do linear regression in the new features, ŷ = θ · Φ(x) (c) Alexander Ihler
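A sketch of the feature-transform view with made-up data: build the polynomial features, then reuse the same linear solver.

x    = [0; 1; 2; 3; 4];            % single raw feature
y    = [1; 2; 5; 10; 17];          % targets (here exactly 1 + x.^2)
Phi  = [ones(size(x)) x x.^2];     % transformed features [1, x, x^2]
th   = y' / Phi';                  % same least-squares solver as before
yhat = th * Phi';                  % predictions are still linear in theta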
Higher-order polynomials • Fit in the same way • More “features”
Features • In general, can use any features we think are useful • Other information about the problem • Sq. footage, location, age, … • Polynomial functions • Features [1, x, x2, x3, …] • Other functions • 1/x, sqrt(x), x1 * x2, … • “Linear regression” = linear in the parameters • Features we can make as complex as we want! (c) Alexander Ihler
Higher-order polynomials • Are more features better? • “Nested” hypotheses • 2nd order more general than 1st, • 3rd order more general than 2nd, … • Fits the observed data better
Overfitting and complexity • More complex models will always fit the training data better • But they may “overfit” the training data, learning complex relationships that are not really present [Figure: a simple model and a complex model fit to the same (X, Y) data] (c) Alexander Ihler
Test data • After training the model, go out and get more data from the world • New observations (x, y) • How well does our model perform? [Figure: training data and new “test” data] (c) Alexander Ihler
Training versus test error • Plot MSE as a function of model complexity (polynomial order) • Training error decreases: a more complex function fits the training data better • What about new data? • 0th to 1st order: error decreases (underfitting) • Higher order: test error increases (overfitting) [Figure: mean squared error vs. polynomial order for the training data and new “test” data] (c) Alexander Ihler
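A sketch of this experiment with made-up data: sweep the polynomial order, fit on the training set, and evaluate MSE on both sets.

xtr = (0:0.5:4)';    ytr = 1 + 2*xtr + 0.5*randn(size(xtr));    % toy training set
xte = (0.25:0.5:4)'; yte = 1 + 2*xte + 0.5*randn(size(xte));    % held-out test set
for k = 0:5
    Ptr = xtr .^ (0:k);                       % polynomial features up to order k
    Pte = xte .^ (0:k);
    th  = ytr' / Ptr';                        % least-squares fit on training data
    Jtr(k+1) = mean((ytr' - th*Ptr').^2);     % training MSE: never increases with k
    Jte(k+1) = mean((yte' - th*Pte').^2);     % test MSE: typically rises once the model overfits
end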