This guide explores the fundamentals of linear regression, a supervised learning technique used in machine learning. It covers essential concepts such as features, targets, predictions, and the learning algorithm. The focus is on optimizing parameters to minimize prediction errors, with an emphasis on the Mean Squared Error (MSE) cost function. The guide also examines the potential for using nonlinear features and higher-order polynomials to improve model accuracy. Discover how to visualize cost functions and the implications of overfitting in the context of regression analysis.
Machine Learning and Data Mining: Linear regression (adapted from Prof. Alexander Ihler)
Supervised learning • Notation: features x, targets y, predictions ŷ, parameters θ • Training data (examples): features plus feedback / target values • Program (“Learner”): characterized by some parameters θ; a procedure (using θ) that outputs a prediction • Score performance with a “cost function” • Learning algorithm: change θ to improve performance
Linear regression • “Predictor”: evaluate the line at a feature value x and return the result r • Define the form of the function f(x) explicitly • Find a good f(x) within that family [Figure: fitted line, target y vs. feature x] (c) Alexander Ihler
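As a minimal illustration of the predictor (the slope and intercept here are made-up values, not from the slides):

theta = [10 1.5];               % hypothetical parameters: intercept 10, slope 1.5
x     = 12;                     % a single feature value
yhat  = theta(1) + theta(2)*x;  % evaluate the line and return the prediction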
More dimensions? [Figure: target y plotted over two features x1 and x2] (c) Alexander Ihler
Notation • Define “feature” x0 = 1 (constant) • Then the prediction is a single inner product: f(x; θ) = θ0·x0 + θ1·x1 + … = θ · xᵀ (c) Alexander Ihler
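A small sketch of this notation with made-up numbers: once the constant feature x0 = 1 is prepended, evaluating the predictor is one inner product.

theta = [10 1.5 -0.3];   % hypothetical parameter row vector [theta0 theta1 theta2]
x     = [1 12 7];        % one example: [x0 x1 x2] with the constant feature x0 = 1
yhat  = theta * x';      % f(x) = theta0*1 + theta1*x1 + theta2*x2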
Measuring error • Error or “residual”: the difference between the observation y and the prediction ŷ [Figure: observations, the fitted line, and the residuals between them] (c) Alexander Ihler
Mean squared error • How can we quantify the error? • Mean squared error: J(θ) = (1/m) Σᵢ (y⁽ⁱ⁾ − ŷ⁽ⁱ⁾)² • Could choose something else, of course… • Computationally convenient (more later) • Measures the variance of the residuals • Corresponds to likelihood under Gaussian model of “noise” (c) Alexander Ihler
MSE cost function • Rewrite using matrix form (Matlab): >> e = y' - th*X'; J = e*e'/m; (c) Alexander Ihler
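A usage sketch of the expression above on a tiny made-up training set (the numbers are arbitrary):

X  = [1 1; 1 2; 1 3; 1 4];   % m x n data matrix, constant feature in column 1
y  = [3; 5; 7; 9];           % m x 1 target vector
th = [1 2];                  % candidate parameters (row vector)
m  = size(X, 1);
e  = y' - th*X';             % 1 x m row of residuals
J  = e*e'/m;                 % mean squared error (0 here: the line fits exactly)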
Visualizing the cost function [Figure: the cost surface J(θ) plotted over θ0 and θ1] (c) Alexander Ihler
Finding good parameters • Want to find parameters which minimize our error… • Think of a cost “surface”: error residual for that θ… (c) Alexander Ihler
Machine Learning and Data Mining: Linear regression, direct minimization (adapted from Prof. Alexander Ihler)
MSE minimum • Consider a simple problem • One feature, two data points • Two unknowns: θ0, θ1 • Two equations: y⁽¹⁾ = θ0 + θ1 x⁽¹⁾ and y⁽²⁾ = θ0 + θ1 x⁽²⁾ • Can solve this system directly • However, most of the time, m > n • There may be no linear function that hits all the data exactly • Instead, solve directly for the minimum of the MSE function (c) Alexander Ihler
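A sketch of the two-point case with made-up values; with two equations and two unknowns the system is solved exactly:

X  = [1 2; 1 5];    % two data points, rows [x0 x1] with x0 = 1
y  = [7; 13];       % two targets
th = (X \ y)';      % exact solution of the 2 x 2 system: theta = [3 2]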
SSE minimum • Setting the gradient to zero and reordering, we have θ = yᵀ X (Xᵀ X)⁻¹ • X (Xᵀ X)⁻¹ is called the “pseudo-inverse” of Xᵀ • If Xᵀ is square and its columns are linearly independent, this is exactly the inverse • If m > n: overdetermined; gives the minimum-MSE fit (c) Alexander Ihler
Matlab SSE • This is easy to solve in Matlab… % y = [y1 ; … ; ym] (m x 1 targets) % X = [x1_0 x1_1 … ; x2_0 x2_1 … ; …] (one example per row, with x_0 = 1) % Solution 1: “manual” th = y' * X * inv(X' * X); % Solution 2: “mrdivide” th = y' / X'; % th*X' = y' => th = y'/X' (c) Alexander Ihler
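A quick usage sketch on made-up, overdetermined data (m = 5 > n = 2), checking that both forms give the same least-squares fit:

X    = [1 0; 1 1; 1 2; 1 3; 1 4];   % one example per row, constant feature first
y    = [1.1; 2.9; 5.2; 6.8; 9.1];   % noisy targets
th1  = y' * X * inv(X' * X);        % "manual" normal-equation solution
th2  = y' / X';                     % "mrdivide" solution; agrees with th1
yhat = th2 * X';                    % fitted values at the training inputs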
Effects of MSE choice • Sensitivity to outliers: an outlying point incurs 16² cost for this one datum; heavy penalty for large errors [Figure: squared-error cost of each residual, and a least-squares fit on data containing an outlier] (c) Alexander Ihler
L1 error [Figure: fitted lines using L2 on the original data, L1 on the original data, and L1 on the outlier data] (c) Alexander Ihler
Cost functions for regression • Mean squared error (MSE): J(θ) = (1/m) Σᵢ (y⁽ⁱ⁾ − θ·x⁽ⁱ⁾)² • Mean absolute error (MAE): J(θ) = (1/m) Σᵢ |y⁽ⁱ⁾ − θ·x⁽ⁱ⁾| • …or something else entirely (???) • “Arbitrary” cost functions can’t be minimized in closed form: use gradient descent (c) Alexander Ihler
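A minimal gradient-descent sketch, applied here to the MSE cost for concreteness (the data, step size, and iteration count are all arbitrary choices):

X     = [1 0; 1 1; 1 2; 1 3];   % made-up data with the constant feature included
y     = [1; 3; 5; 7];           % targets
m     = size(X, 1);
th    = zeros(1, size(X, 2));   % start at theta = 0
alpha = 0.1;                    % learning rate (assumed, not tuned)
for it = 1:500
    e  = y' - th*X';            % residuals, 1 x m
    g  = -(2/m) * e * X;        % gradient of J(theta) = e*e'/m
    th = th - alpha * g;        % take a step downhill
end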
Machine Learning and Data Mining: Linear regression, nonlinear features (adapted from Prof. Alexander Ihler)
Nonlinear functions • What if our hypotheses are not lines? • Ex: higher-order polynomials (c) Alexander Ihler
Nonlinear functions • Single feature x, predict target y: ŷ = θ0 + θ1 x + θ2 x² + θ3 x³ • Sometimes useful to think of this as a “feature transform”: add features Φ(x) = [1, x, x², x³], then do linear regression in the new features, ŷ = θ · Φ(x) (c) Alexander Ihler
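A sketch of the feature-transform view with made-up data: build the polynomial features, then reuse the same linear solver.

x    = [0; 1; 2; 3; 4];            % single raw feature
y    = [1; 2; 5; 10; 17];          % targets (here exactly 1 + x.^2)
Phi  = [ones(size(x)) x x.^2];     % transformed features [1, x, x^2]
th   = y' / Phi';                  % same least-squares solver as before
yhat = th * Phi';                  % predictions are still linear in theta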
Higher-order polynomials • Fit in the same way • More “features”
Features • In general, can use any features we think are useful • Other information about the problem • Sq. footage, location, age, … • Polynomial functions • Features [1, x, x2, x3, …] • Other functions • 1/x, sqrt(x), x1 * x2, … • “Linear regression” = linear in the parameters • Features we can make as complex as we want! (c) Alexander Ihler
Higher-order polynomials • Are more features better? • “Nested” hypotheses • 2nd order more general than 1st, • 3rd order more general than 2nd, … • Fits the observed data better
Overfitting and complexity • More complex models will always fit the training data better • But they may “overfit” the training data, learning complex relationships that are not really present [Figure: a simple model and a complex model fit to the same (X, Y) data] (c) Alexander Ihler
Test data • After training the model, go out and get more data from the world • New observations (x, y) • How well does our model perform? [Figure: training data and new “test” data] (c) Alexander Ihler
Training versus test error • Plot MSE as a function of model complexity (polynomial order) • Training error decreases: a more complex function fits the training data better • What about new data? • 0th to 1st order: error decreases (underfitting) • Higher order: test error increases (overfitting) [Figure: mean squared error vs. polynomial order for the training data and new “test” data] (c) Alexander Ihler
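A sketch of this experiment with made-up data: sweep the polynomial order, fit on the training set, and evaluate MSE on both sets.

xtr = (0:0.5:4)';    ytr = 1 + 2*xtr + 0.5*randn(size(xtr));    % toy training set
xte = (0.25:0.5:4)'; yte = 1 + 2*xte + 0.5*randn(size(xte));    % held-out test set
for k = 0:5
    Ptr = xtr .^ (0:k);                       % polynomial features up to order k
    Pte = xte .^ (0:k);
    th  = ytr' / Ptr';                        % least-squares fit on training data
    Jtr(k+1) = mean((ytr' - th*Ptr').^2);     % training MSE: never increases with k
    Jte(k+1) = mean((yte' - th*Pte').^2);     % test MSE: typically rises once the model overfits
end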