Data mining and statistical learning, lecture 3

Presentation Transcript


1. Outline
• Ordinary least squares regression
• Ridge regression

2. Ordinary least squares regression (OLS)
Model: y = β0 + β1x1 + β2x2 + … + βpxp + ε
Terminology:
β0: intercept (or bias)
β1, …, βp: regression coefficients (or weights)
The response variable responds directly and linearly to changes in the inputs.

3. Least squares regression
Assume that we have observed a training set of N cases (xi, yi), i = 1, …, N.
Estimate the β coefficients by minimizing the residual sum of squares
RSS(β) = Σi (yi − β0 − Σj βj xij)², where xij is the value of the jth input for case i.

4. Matrix formulation of OLS regression
Differentiating the residual sum of squares and setting the first derivatives equal to zero, we obtain
XᵀX β = Xᵀ y
where X is the N × (p+1) matrix whose rows are the training inputs (with a leading 1 for the intercept) and y is the N-vector of observed responses.

5. Parameter estimates and predictions
Least squares estimates of the parameters: β̂ = (XᵀX)⁻¹ Xᵀ y
Predicted values: ŷ = X β̂ = X (XᵀX)⁻¹ Xᵀ y
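A minimal Python/NumPy sketch of these formulas (not part of the original slides; variable names and the simulated data are illustrative):

import numpy as np

def ols_fit(X, y):
    # Ordinary least squares via the normal equations X'X beta = X'y.
    # X: (n, p) array of inputs WITHOUT the intercept column; a column of
    # ones is prepended here, so beta[0] is the intercept.
    X1 = np.column_stack([np.ones(len(y)), X])
    return np.linalg.solve(X1.T @ X1, X1.T @ y)

def ols_predict(X, beta):
    # Fitted values y_hat = X beta_hat (intercept column added again).
    X1 = np.column_stack([np.ones(X.shape[0]), X])
    return X1 @ beta

# Usage on simulated data:
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2.0 + X @ np.array([1.0, -0.5, 0.3]) + rng.normal(scale=0.1, size=100)
beta_hat = ols_fit(X, y)
print(beta_hat)   # approximately [2.0, 1.0, -0.5, 0.3]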

6. Different sources of inputs
• Quantitative inputs
• Transformations of quantitative inputs
• Numeric or dummy coding of the levels of qualitative inputs
• Interactions between variables (e.g. X3 = X1·X2)
Example of dummy coding (a sketch follows below).
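The slide's dummy-coding table did not survive the transcript; the following sketch (levels and names are illustrative) shows one common dummy coding of a three-level qualitative input such as X4 = Equipment:

import numpy as np

# Three-level qualitative input, e.g. equipment level 1, 2 or 3.
levels = np.array([1, 2, 3, 1, 2])

# Dummy coding with level 1 as the reference category:
# one indicator column per non-reference level.
dummies = np.column_stack([(levels == 2).astype(float),
                           (levels == 3).astype(float)])
print(dummies)
# [[0. 0.]   level 1 -> reference (all zeros)
#  [1. 0.]   level 2
#  [0. 1.]   level 3
#  [0. 0.]
#  [1. 0.]]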

7. An example of multiple linear regression
Response variable: requested price of used Porsche cars (1000 SEK)
Inputs:
X1 = Manufacturing year
X2 = Mileage (km)
X3 = Model (0 or 1)
X4 = Equipment (1, 2, 3)
X5 = Colour (Red, Black, Silver, Blue, White, Green)

8. Price of used Porsche cars
Response variable: requested price of used Porsche cars (1000 SEK)
Inputs:
X1 = Manufacturing year
X2 = Mileage (km)

9. Interpretation of multiple regression coefficients
Assume that y = β0 + β1x1 + … + βpxp + ε and that the regression coefficients are estimated by ordinary least squares regression.
Then the multiple regression coefficient β̂j represents the additional contribution of xj to y, after xj has been adjusted for x0, x1, …, xj−1, xj+1, …, xp.
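A small numerical check of this interpretation (a sketch, not from the slides): orthogonalize xj against the remaining inputs and regress y on the residual; the resulting coefficient agrees with the multiple regression coefficient of xj.

import numpy as np

rng = np.random.default_rng(1)
n = 200
X = rng.normal(size=(n, 3))
y = 1.0 + X @ np.array([0.5, -1.0, 2.0]) + rng.normal(scale=0.2, size=n)

X1 = np.column_stack([np.ones(n), X])            # design matrix with intercept
beta = np.linalg.solve(X1.T @ X1, X1.T @ y)      # full multiple regression fit

j = 2                                            # look at the coefficient of x3
others = np.delete(X1, j + 1, axis=1)            # all columns except x_j
gamma = np.linalg.solve(others.T @ others, others.T @ X[:, j])
z = X[:, j] - others @ gamma                     # x_j adjusted for the other inputs

print(beta[j + 1], (z @ y) / (z @ z))            # the two numbers agree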

10. Confidence intervals for regression parameters
Assume that y = β0 + β1x1 + … + βpxp + ε, where the X-variables are fixed and the error terms are i.i.d. N(0, σ²).
Then β̂j ± z^(1−α) √vj σ̂ forms an approximate (1 − 2α) confidence interval for βj, where vj is the jth diagonal element of (XᵀX)⁻¹.
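A sketch of how such intervals can be computed (Python/SciPy, not from the slides; it uses the t-quantile rather than the normal approximation):

import numpy as np
from scipy import stats

def ols_confidence_intervals(X, y, alpha=0.05):
    # Confidence intervals for the OLS coefficients.
    # X is the (n, p) design matrix WITHOUT the intercept; one is added here.
    n, p = X.shape
    X1 = np.column_stack([np.ones(n), X])
    XtX_inv = np.linalg.inv(X1.T @ X1)
    beta = XtX_inv @ X1.T @ y
    resid = y - X1 @ beta
    sigma2 = resid @ resid / (n - p - 1)          # unbiased estimate of sigma^2
    se = np.sqrt(sigma2 * np.diag(XtX_inv))       # sqrt(v_j) * sigma_hat
    t = stats.t.ppf(1 - alpha / 2, df=n - p - 1)  # t-quantile
    return np.column_stack([beta - t * se, beta + t * se])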

11. Interpretation of software outputs
Adding new independent variables to a regression model alters at least one of the old regression coefficients unless the columns of the X-matrix are orthogonal, i.e. xjᵀxk = 0 for all j ≠ k.

12. Stepwise Regression: Price (1000SEK) versus Year, Milage (km), ...
Classical statistical model selection techniques are model-based; in data mining, model selection is data-driven.
The p-value refers to a t-test of the hypothesis that the regression coefficient of the last entered x-variable is zero.
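The Minitab stepwise output itself is not reproduced in this transcript. As a rough illustration of the forward-entry idea (a sketch, not the lecture's procedure; the alpha-to-enter threshold and the use of statsmodels are my assumptions):

import numpy as np
import statsmodels.api as sm

def forward_stepwise(X, y, alpha_enter=0.15):
    # Greedy forward selection: at each step add the candidate column whose
    # coefficient has the smallest p-value, as long as it is below alpha_enter.
    # X: (n, p) array of candidate inputs; returns the indices of the selected columns.
    selected, remaining = [], list(range(X.shape[1]))
    while remaining:
        pvals = []
        for j in remaining:
            design = sm.add_constant(X[:, selected + [j]])
            fit = sm.OLS(y, design).fit()
            pvals.append(fit.pvalues[-1])          # p-value of the last entered variable
        best = int(np.argmin(pvals))
        if pvals[best] > alpha_enter:
            break                                   # no remaining variable passes the threshold
        selected.append(remaining.pop(best))
    return selected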

13. Stepwise Regression: Price (1000SEK) versus Year, Milage (km), ...: model validation by visual inspection of residuals
Residual = Observed − Predicted
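The residual plots shown on the slide are not reproduced here; a minimal sketch of the kind of inspection meant (matplotlib-based, not part of the original slides):

import matplotlib.pyplot as plt

def residual_plot(y, y_hat):
    # Residuals versus fitted values; a patternless cloud around zero supports the model.
    residuals = y - y_hat                 # Residual = Observed - Predicted
    plt.scatter(y_hat, residuals, s=10)
    plt.axhline(0.0, linestyle="--")
    plt.xlabel("Predicted")
    plt.ylabel("Residual")
    plt.show()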

14. The Gram-Schmidt procedure for regression by successive orthogonalization and simple linear regression
1. Initialize z0 = x0 = 1
2. For j = 1, …, p, compute zj = xj − Σk<j (⟨zk, xj⟩ / ⟨zk, zk⟩) zk, where ⟨ , ⟩ denotes the inner product (the sum of coordinate-wise products)
3. Regress y on zp to obtain the multiple regression coefficient β̂p = ⟨zp, y⟩ / ⟨zp, zp⟩
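A sketch of the procedure in Python/NumPy (assuming the inputs are the columns of an (n, p) array X; the function name is illustrative):

import numpy as np

def gram_schmidt_last_coefficient(X, y):
    # Successive orthogonalization: build z_0, ..., z_p and regress y on z_p
    # to obtain the multiple regression coefficient of the last input.
    n, p = X.shape
    Z = [np.ones(n)]                       # z_0 = x_0 = 1
    for j in range(p):
        xj = X[:, j]
        zj = xj.copy()
        for zk in Z:
            zj = zj - (zk @ xj) / (zk @ zk) * zk   # subtract projection on earlier z_k
        Z.append(zj)
    zp = Z[-1]
    return (zp @ y) / (zp @ zp)            # equals the OLS coefficient of the last input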

15. Prediction of a response variable using correlated explanatory variables: daily temperatures in Stockholm, Göteborg, and Malmö

16. Absorbance records for ten samples of chopped meat
1 response variable (protein)
100 predictors (absorbance at 100 wavelengths or channels)
The predictors are strongly correlated with each other.

17. Absorbance records for 240 samples of chopped meat
The target is only weakly correlated with each individual predictor.

18. Ridge regression
The ridge regression coefficients minimize a penalized residual sum of squares:
β̂ridge = argminβ [ Σi (yi − β0 − Σj βj xij)² + λ Σj βj² ]
or, equivalently, they minimize the residual sum of squares subject to the constraint Σj βj² ≤ s.
Normally, the inputs are centred prior to the estimation of the regression coefficients.

19. Matrix formulation of ridge regression for centred inputs
β̂ridge = (XᵀX + λI)⁻¹ Xᵀ y
If the inputs are orthogonal, the ridge estimates are just a scaled version of the least squares estimates.
Shrinking enables estimation of regression coefficients even if the number of parameters exceeds the number of cases.
Figure 3.7
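A sketch of the closed-form ridge solution for centred inputs (Python/NumPy, not from the slides):

import numpy as np

def ridge_fit(X, y, lam):
    # Ridge regression for centred inputs: beta = (X'X + lambda*I)^{-1} X'y.
    # Inputs and response are centred first, so the intercept is estimated
    # separately from the means.
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    p = X.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    intercept = y_mean - x_mean @ beta
    return intercept, beta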

20. Ridge regression: pros and cons
Ridge regression is particularly useful if the explanatory variables are strongly correlated with each other.
The variance of the estimated regression coefficients is reduced at the expense of (slightly) biased estimates.

21. The Gauss-Markov theorem
Consider a linear regression model in which:
• the inputs are regarded as fixed
• the error terms are i.i.d. with mean 0 and variance σ².
Then the least squares estimator of a parameter aᵀβ has variance no bigger than that of any other linear unbiased estimator of aᵀβ.
Biased estimators may have smaller variance and mean squared error!

22. SAS code for an ordinary least squares regression

proc reg data=mining.dailytemperature outest=dtempbeta;   /* outest= stores the fitted coefficients in a dataset */
  model daily_consumption = stockholm g_teborg malm_;
run;

23. SAS code for ridge regression

proc reg data=mining.dailytemperature outest=dtempbeta ridge=0 to 10 by 1;   /* ridge= gives the grid of ridge constants */
  model daily_consumption = stockholm g_teborg malm_;
run;
proc print data=dtempbeta;   /* one row of coefficients per ridge constant */
run;
