# Linear Regression Models - PowerPoint PPT Presentation

1 / 34

Linear Regression Models. Andy Wang CIS 5930-03 Computer Systems Performance Analysis. Linear Regression Models. What is a (good) model? Estimating model parameters Allocating variation Confidence intervals for regressions Verifying assumptions visually. What Is a (Good) Model?.

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.

Linear Regression Models

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

## Linear Regression Models

Andy Wang

CIS 5930-03

Computer Systems

Performance Analysis

### Linear Regression Models

• What is a (good) model?

• Estimating model parameters

• Allocating variation

• Confidence intervals for regressions

• Verifying assumptions visually

### What Is a (Good) Model?

• For correlated data, model predicts response given an input

• Model should be equation that fits data

• Standard definition of “fits” is least-squares

• Minimize squared error

• Keep mean error zero

• Minimizes variance of errors

### Least-Squared Error

• If then error in estimate for xi is

• Minimize Sum of Squared Errors (SSE)

• Subject to the constraint

### Estimating Model Parameters

• Best regression parameters are

where

• Note error in book!

### Parameter EstimationExample

• Execution time of a script for various loop counts:

• = 6.8, = 2.32, xy = 88.7, x2 = 264

• b0 = 2.32  (0.30)(6.8) = 0.28

### Allocating Variation

• If no regression, best guess of y is

• Observed values of y differ from , giving rise to errors (variance)

• Regression gives better guess, but there are still errors

• We can evaluate quality of regression by allocating sources of errors

### The Total Sum of Squares

• Without regression, squared error is

### The Sum of Squaresfrom Regression

• Recall that regression error is

• Error without regression is SST

• So regression explains SSR = SST - SSE

• Regression quality measured by coefficient of determination

### Evaluating Coefficientof Determination

• Compute

• Compute

• Compute

• where R = R(x,y) = correlation(x,y)

### Example of Coefficientof Determination

• For previous regression example

• y = 11.60, y2 = 29.88, xy = 88.7,

• SSE = 29.88-(0.28)(11.60)-(0.30)(88.7) = 0.028

• SST = 29.88-26.9 = 2.97

• SSR = 2.97-.03 = 2.94

• R2 = (2.97-0.03)/2.97 = 0.99

### Standard Deviationof Errors

• Variance of errors is SSE divided by degrees of freedom

• DOF is n2 because we’ve calculated 2 regression parameters from the data

• So variance (mean squared error, MSE) is SSE/(n2)

• Standard dev. of errors is square root:(minor error in book)

### Checking Degreesof Freedom

• Degrees of freedom always equate:

• SS0 has 1 (computed from )

• SST has n1 (computed from data and , which uses up 1)

• SSE has n2 (needs 2 regression parameters)

• So

### Example ofStandard Deviation of Errors

• For regression example, SSE was 0.03, so MSE is 0.03/3 = 0.01 and se = 0.10

• Note high quality of our regression:

• R2 = 0.99

• se = 0.10

### Confidence Intervals for Regressions

• Regression is done from a single population sample (size n)

• Different sample might give different results

• True model is y = 0 + 1x

• Parameters b0 and b1 are really means taken from a population sample

### Calculating Intervalsfor Regression Parameters

• Standard deviations of parameters:

• Confidence intervals arewhere t has n - 2 degrees of freedom

• Not divided by sqrt(n)

### Example of Regression Confidence Intervals

• Recall se = 0.13, n = 5, x2 = 264, = 6.8

• So

• Using 90% confidence level, t0.95;3 = 2.353

### Regression Confidence Example, cont’d

• Thus, b0 interval is

• Not significant at 90%

• And b1 is

• Significant at 90% (and would survive even 99.9% test)

### Confidence Intervalsfor Predictions

• Previous confidence intervals are for parameters

• How certain can we be that the parameters are correct?

• Purpose of regression is prediction

• How accurate are the predictions?

• Regression gives mean of predicted response, based on sample we took

### Predicting m Samples

• Standard deviation for mean of future sample of m observations at xpis

• Note deviation drops as m 

• Variance minimal at x =

• Use t-quantiles with n–2 DOF for interval

### Example of Confidenceof Predictions

• Using previous equation, what is predicted time for a single run of 8 loops?

• Time = 0.28 + 0.30(8) = 2.68

• Standard deviation of errors se = 0.10

• 90% interval is then

### Verifying Assumptions Visually

• Regressions are based on assumptions:

• Linear relationship between response y and predictor x

• Or nonlinear relationship used in fitting

• Predictor x nonstochastic and error-free

• Model errors statistically independent

• With distribution N(0,c) for constant c

• If assumptions violated, model misleading or invalid

### Testing Linearity

• Scatter plot x vs. y to see basic curve type

Linear

Piecewise Linear

Outlier

Nonlinear (Power)

### TestingIndependence of Errors

• Scatter-plot i versus

• Should be no visible trend

• Example from our curve fit:

### More on Testing Independence

• May be useful to plot error residuals versus experiment number

• In previous example, this gives same plot except for x scaling

• No foolproof tests

• “Independence” test really disproves particular dependence

• Maybe next test will show different dependence

### Testing for Normal Errors

• Prepare quantile-quantile plot of errors

• Example for our regression:

### Testing for Constant Standard Deviation

• Tongue-twister: homoscedasticity

• Look for trend in spread

• Example:

• Regression throws away some information about the data

• To allow more compact summarization

• Sometimes vital characteristics are thrown away

• Often, looking at data plots can tell you whether you will have a problem

IIIIIIIV

xyxyxyxy

108.04109.14107.4686.58

86.9588.1486.7785.76

137.58138.741312.7487.71

98.8198.7797.1188.84

118.33119.26117.8188.47

149.96148.10148.8487.04

67.2466.1366.0885.25

44.2643.1045.391912.50

1210.84129.13128.1585.56

74.8277.2676.4287.91

55.6854.7455.7386.89

### What Does Regression Tell Us About These Data Sets?

• Exactly the same thing for each!

• N = 11

• Mean of y = 7.5

• Y = 3 + .5 X

• Standard error of regression is 0.118

• All the sums of squares are the same

• Correlation coefficient = .82

• R2 = .67

I

II

III

IV