
Regression Analysis

CHEE209 Module 6

J. McLellan

Outline
• assessing systematic relationships
• types of models
• least squares estimation - assumptions
• fitting a straight line to data
• least squares parameter estimates
• graphical diagnostics
• quantitative diagnostics
• multiple linear regression
• least squares parameter estimates
• diagnostics
• precision of parameter estimates, predicted responses


The Scenario

We have been given a data set consisting of measurements of a number of variables

PLUS

• background information about the “process”
• objectives for the investigation
• information about how the experimentation was conducted
• e.g., shift, operating region, product line, ...


Assessing Systematic Relationships

Is there a systematic relationship?

Two approaches:

• graphical
• quantitative

Other points -

• what is the nature of the relationship?
• linear in the “independent variables”
• nonlinear in the “independent variables”
• from engineering/scientific judgement - should there be a relationship?


Assessing Systematic Relationships

Graphical Methods

• scatterplots (x-y diagrams)
• plot values of one variable against another
• look for evidence of a trend
• look for nature of trend - linear, quadratic, exponential, other nonlinearity?
• surface plots
• plot one variable against values of two other variables
• look for evidence of a trend - surface
• look for nature of trend - linear, nonlinear?
• casement plots
• a “matrix”, or table, of scatterplots


Graphical Methods for Analyzing Data

Visualizing relationships between variables

Techniques

• scatterplots
• scatterplot matrices
• also referred to as “casement plots”


Scatterplots

… are also referred to as “x-y diagrams”

• plot values of one variable against another
• look for systematic trend in data
• nature of trend
• linear?
• exponential?
• degree of scatter - does spread increase/decrease over range?
• indication that variance isn’t constant over range of data


Scatterplots - Example
• tooth discoloration data - discoloration vs. fluoride

[scatterplot: trend - possibly nonlinear?]

Scatterplot - Example
• tooth discoloration data - discoloration vs. brushing

[scatterplot: significant trend? - doesn’t appear to be present]

Scatterplot - Example
• tooth discoloration data - discoloration vs. brushing

[scatterplot: variance appears to decrease as # of brushings increases]

Scatterplot matrices

… are a table of scatterplots for a set of variables

Look for -

• systematic trend between “independent” variable and dependent variables - to be described by estimated model
• systematic trend between supposedly independent variables - indicates that these quantities are correlated
• correlation can negatively influence model estimation results
• not independent information
• scatterplot matrices can be generated automatically with statistical software, manually using spreadsheets


Assessing Systematic Relationships

Quantitative Methods

• correlation
• formal def’n plus sample statistic (“Pearson’s r”)
• covariance
• formal def’n plus sample statistic

provide a quantitative measure of systematic LINEAR relationships


Covariance

Formal Definition

• given two random variables X and Y, the covariance is Cov(X,Y) = E{ (X - μX)(Y - μY) }
• E{ } - expected value
• sign of the covariance indicates the sign of the slope of the systematic linear relationship
• positive value --> positive slope
• negative value --> negative slope
• issue - covariance is SCALE DEPENDENT


Covariance
• motivation for covariance as a measure of systematic linear relationship
• look at pairs of departures about the mean of X, Y

[two scatterplots of Y vs. X, each marked at the mean of X, Y: one shows a strong linear relationship with negative slope, the other a strong linear relationship with positive slope]

Correlation
• is the “dimensionless” covariance
• divide covariance by standard dev’ns of X, Y
• formal definition: ρ(X,Y) = Cov(X,Y) / (σX σY)
• properties
• dimensionless
• range: -1 ≤ ρ ≤ 1

Note - the correlation gives NO information about the actual numerical value of the slope.


Estimating Covariance, Correlation

… from process data (with N pairs of observations)

Sample Covariance: sXY = Σ (xi - x̄)(yi - ȳ) / (N - 1)

Sample Correlation: rXY = sXY / (sX sY)   (“Pearson’s r”)

These quantities are statistics.

Example - Solder Thickness

Objective - study the effect of temperature on solder thickness

Data - in pairs

Solder Temperature (C)    Solder Thickness (microns)
245                       171.6
215                       201.1
218                       213.2
265                       153.3
251                       178.9
213                       226.6
234                       190.3
257                       171
244                       197.5
225                       209.8

Outline
• assessing systematic relationships
• types of models
• least squares estimation - assumptions
• fitting a straight line to data
• least squares parameter estimates
• graphical diagnostics
• quantitative diagnostics
• multiple linear regression
• least squares parameter estimates
• diagnostics
• precision of parameter estimates, predicted responses


Empirical Modeling - Terminology
• response
• “dependent” variable - responds to changes in other variables
• the response is the characteristic of interest which we are trying to predict
• explanatory variable
• “independent” variable, regressor variable, input, factor
• these are the quantities that we believe have an influence on the response
• parameter
• coefficients in the model that describe how the regressors influence the response


Models

When we are estimating a model from data, we consider the following form:

Y = f(x1, …, xp; β1, …, βm) + ε

i.e., response = function of explanatory variables and parameters + “random error”

The Random Error Term
• is included to reflect fact that measured data contain variability
• successive measurements under the same conditions (values of the explanatory variables) are likely to be slightly different
• this is the stochastic component
• the functional form describes the deterministic component
• random error is not necessarily the result of mistakes in experimental procedures - reflects inherent variability
• “noise”


Types of Models
• linear/nonlinear in the parameters
• linear/nonlinear in the explanatory variables
• number of response variables
• single response (standard regression)
• multi-response (or “multivariate” models)

From the perspective of statistical model-building, the key point is whether the model is linear or nonlinear in the PARAMETERS.


Linear Regression Models
• linear in the parameters
• can be nonlinear in the regressors
• e.g., Y = β0 + β1 x + β2 x² + ε is linear in the β’s, though nonlinear in x


Nonlinear Regression Models
• nonlinear in the parameters
• e.g., Arrhenius rate expression: k = A exp(-E/(RT))
• nonlinear in the activation energy E
• linear in the pre-exponential factor A (if E is fixed)
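Taking logs of the Arrhenius expression gives ln k = ln A - (E/R)(1/T), which is linear in the parameters ln A and E/R. A sketch with hypothetical values of A and E/R (illustrative only, not from the module):

```python
import math

# Hypothetical Arrhenius parameters (illustrative values only)
A = 1.0e7          # pre-exponential factor
E_over_R = 5000.0  # activation energy over gas constant, in K

temps = [300.0, 320.0, 340.0, 360.0, 380.0]       # K
k = [A * math.exp(-E_over_R / T) for T in temps]  # rate constants

# Linearize: ln k = ln A - (E/R) * (1/T) -- straight line in x = 1/T
x = [1.0 / T for T in temps]
y = [math.log(ki) for ki in k]

# Least squares slope and intercept for the linearized model
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) \
        / sum((xi - xbar) ** 2 for xi in x)
intercept = ybar - slope * xbar

# With noise-free data the fit recovers -E/R and ln A exactly
print(-slope, math.exp(intercept))
```

This is the classic trick of converting a model that is nonlinear in its parameters into one that is linear, at the cost of transforming the error structure as well.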

Outline
• assessing systematic relationships
• types of models
• fitting a straight line to data
• least squares estimation - assumptions
• least squares parameter estimates
• graphical diagnostics
• quantitative diagnostics
• multiple linear regression
• least squares parameter estimates
• diagnostics
• precision of parameter estimates, predicted responses


Fitting a Straight Line to Data

Consider the solder data -

Goal - predict solder thickness as a function of temperature

The trend appears to be quite linear --> try fitting a straight line model to this data, with Y = thickness and X = temperature

Estimating a Model
• what is our measure for prediction?
• examine prediction error = measured - predicted value
• square the prediction error -- closer link to “distance”, and prevents cancellation by positive, negative values
• Least Squares Estimation


Assumptions for Least Squares Estimation

Values of explanatory variables are known EXACTLY

• random error is strictly in the response variable
• practically - a random component will almost always be present in the explanatory variables as well
• we assume that this component has a substantially smaller effect on the response than the random component in the response


Assumptions for Least Squares Estimation

The form of the equation provides an adequate representation for the data

• can test adequacy of model as a diagnostic

Variance of random error is CONSTANT over range of data collected

• e.g., variance of random fluctuations in thickness measurements at high temperatures is the same as variance at low temperatures


Assumptions for Least Squares Estimation

The random fluctuations in each measurement are statistically independent from those of other measurements

• at same experimental conditions
• at other experimental conditions
• implies that random component has no “memory”
• no correlation between measurements

Random error term is normally distributed

• typical assumption
• not essential for least squares estimation
• important when determining confidence intervals, conducting hypothesis tests


Least Squares Estimation - graphically

least squares - minimize sum of squared prediction errors

[plot of response (solder thickness) vs. T: data points scattered about the deterministic “true” relationship; the vertical distance from each point to the line is the prediction error, or “residual”]

More Notation and Terminology

Random error is “independent, identically distributed” (I.I.D.) -- can say that it is IID Normal

Capitals - Y - denotes random variable
- except in case of explanatory variable - capital used to denote formal def’n

Lower case - y, x - denotes measured values of variables

Model: Y = β0 + β1 x + ε

Measurement: yi = β0 + β1 xi + εi

More Notation and Terminology

Estimate - denoted by “hat”

• examples - estimates of response, parameter: ŷ, β̂0, β̂1

Residual - difference between measured and predicted response: ei = yi - ŷi


Least Squares Estimation

Find the parameter values that minimize the sum of squares of the residuals over the data set:

SSE = Σ (yi - (β0 + β1 xi))²

Solution

• solve conditions for stationary point (“normal equations”)
• derivatives with respect to parameters = 0
• obtain analytical expressions for the least squares parameter estimates


Least Squares Parameter Estimates

β̂1 = Σ (xi - x̄)(yi - ȳ) / Σ (xi - x̄)²

β̂0 = ȳ - β̂1 x̄

Note that the parameter estimates are functions of BOTH the explanatory variable values and the measured response values --> functions of “noisy data”
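Applied to the ten solder data pairs listed earlier, the closed-form estimates give a negative slope, consistent with the downward trend in the scatterplot; a minimal sketch:

```python
# Fit a straight line (thickness vs. temperature) to the solder data
# using the closed-form least squares estimates.
temp = [245, 215, 218, 265, 251, 213, 234, 257, 244, 225]
thick = [171.6, 201.1, 213.2, 153.3, 178.9, 226.6, 190.3, 171.0, 197.5, 209.8]

n = len(temp)
xbar = sum(temp) / n
ybar = sum(thick) / n

# b1 = S_xy / S_xx,  b0 = ybar - b1 * xbar
S_xx = sum((x - xbar) ** 2 for x in temp)
S_xy = sum((x - xbar) * (y - ybar) for x, y in zip(temp, thick))
b1 = S_xy / S_xx
b0 = ybar - b1 * xbar

predicted = [b0 + b1 * x for x in temp]
residuals = [y - yh for y, yh in zip(thick, predicted)]

print(b0, b1)  # negative slope: thickness drops as temperature rises
```

Note that for any least squares fit that includes an intercept, the residuals sum to zero by construction.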

Diagnostics - Graphical

Basic Principle - extract as much trend as possible from the data

Residuals should have no remaining trend -

• with respect to the explanatory variables
• with respect to the data sequence number
• with respect to other possible explanatory variables (“secondary variables”)
• with respect to predicted values


Graphical Diagnostics

Residuals vs. Predicted Response Values

[plot of residuals ei vs. predicted values: even scatter over range of prediction, no discernable pattern, roughly half the residuals positive, half negative]

DESIRED RESIDUAL PROFILE

Graphical Diagnostics

Residuals vs. Predicted Response Values

[plot of residuals ei vs. predicted values: an outlier lies outside the main body of residuals]

RESIDUAL PROFILE WITH OUTLIERS

Graphical Diagnostics

Residuals vs. Predicted Response Values

[plot of residuals ei vs. predicted values: variance of the residuals appears to increase with higher predictions]

NON-CONSTANT VARIANCE

Graphical Diagnostics

Residuals vs. Explanatory Variables

• ideal - no systematic trend present in plot
• inadequate model - evidence of trend present

[plot of residuals ei vs. x: curved pattern in the residuals - need quadratic term in model]

Graphical Diagnostics

Residuals vs. Explanatory Variables Not in Model

• ideal - no systematic trend present in plot
• inadequate model - evidence of trend present

[plot of residuals ei vs. secondary variable w: systematic trend not accounted for in model - include a linear term in “w”]

Graphical Diagnostics

Residuals vs. Order of Data Collection

[plot of residuals ei vs. order of collection t: drift in the residuals indicates failure to account for time trend in data]

[plot of residuals ei vs. t: successive random noise components are correlated - consider more complex model - time series model for random component?]

Quantitative Diagnostics - Ratio Tests

Is the variance of the residuals significant?

• relative to a benchmark
• indication of extent of unmodeled trend

Benchmark

• variance of inherent variation in process
• provided by variance of replicate runs if possible
• replicate runs - repeated runs at the same conditions which provide indication of inherent variation
• can conduct replicate runs at several sets of conditions and compare variances - are they constant over the experimental region?


Quantitative Diagnostics - Ratio Tests

Residual Variance Ratio: MSE / s²inherent

Mean Squared Error of Residuals (Var. of Residuals):

MSE = Σ (yi - ŷi)² / (N - 2)

Quantitative Diagnostics - Ratio Tests

Is the ratio significant?

- compare to the F-distribution

Why?

• ratio is the ratio of sums of squared normal r.v.’s
• squared normal r.v.’s have a chi-squared distribution
• ratios of chi-squared r.v.’s have an F-distribution

Degrees of freedom

• number of statistically independent pieces of information used to calculate quantities
• degrees of freedom of MSE is N-2, where N is number of data points
• d. of f. for inherent variance is M-1, where M is number of data points used to estimate inherent variance


Quantitative Diagnostics - Ratio Tests

Interpretation of Ratio

• if significant, then model fit is not adequate as the residual variation is large relative to the inherent variation
• “still some signal to be accounted for”

Example - Solder Thickness

• previous data - variance is 102.2 (24 degrees of freedom)
• residual variance (MSE)


Quantitative Diagnostics - Ratio Tests

The ratio is: MSE / s²inherent = 1.32

Compare to F8,24,0.95 = 2.36 - 5% of values occur outside this fence (area of tail is 0.05)

Since 1.32 < 2.36, the residual variance is NOT statistically significant, and no evidence of inadequacy is detected.
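A sketch of the ratio calculation for the ten pairs listed earlier; this need not reproduce the slide’s quoted ratio of 1.32, since that value may be based on additional data, while the inherent variance of 102.2 (24 d.o.f.) is the replicate-based value quoted above:

```python
# Residual variance ratio test for the straight-line solder fit.
temp = [245, 215, 218, 265, 251, 213, 234, 257, 244, 225]
thick = [171.6, 201.1, 213.2, 153.3, 178.9, 226.6, 190.3, 171.0, 197.5, 209.8]

n = len(temp)
xbar, ybar = sum(temp) / n, sum(thick) / n
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(temp, thick)) \
     / sum((x - xbar) ** 2 for x in temp)
b0 = ybar - b1 * xbar

# MSE = sum of squared residuals / (N - 2)
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(temp, thick))
mse = sse / (n - 2)

s2_inherent = 102.2        # replicate-based inherent variance (24 d.o.f.)
ratio = mse / s2_inherent  # compare against F(N-2, 24) at the 5% level

print(mse, ratio)  # ratio below the 2.36 fence -> no evidence of inadequacy
```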

Quantitative Diagnostics - Ratio Tests

Mean Square Regression Ratio

- is the variance described by the model significant relative to an indication of the inherent variation?

Variance described by model:

MSR = Σ (ŷi - ȳ)² / 1   (one degree of freedom for the straight-line model)

Quantitative Diagnostics - Ratio Test

Test Ratio: MSR / MSE

is compared against F1,N-2,0.95

Conclusions?

• ratio is statistically significant --> significant trend has been modeled
• ratio is NOT statistically significant --> significant trend has NOT been modeled, and model is inadequate in its present form


Quantitative Diagnostics - Ratio Tests

Notes on MSR/MSE Ratio Test:

• MSE provides a rough indication of inherent variation
• use of MSE as indication of inherent variation assumes model form is adequate
• MSE contains effects of 1) background variation, 2) model specification error
• MSR/MSE ratio is frequently compared against F at the 75% level to guard against erroneous rejection of an adequate model
• this is a “coarse” test of adequacy!


Analysis of Variance Tables

The ratio tests involve dissection of the sum of squares:

Σ (yi - ȳ)² = Σ (ŷi - ȳ)² + Σ (yi - ŷi)²

SSTotal = SSRegression + SSError

Quantitative Diagnostics - R2

Coefficient of Determination (“R2 Coefficient”)

• square of correlation between observed and predicted values: R2 = [corr(y, ŷ)]²
• relationship to sums of squares: R2 = SSRegression / SSTotal = 1 - SSError / SSTotal
• values typically reported in “%”, i.e., 100 R2
• ideal - R2 near 100%

Example - Solder Data

• the correlation coefficient (which has not been squared) is fairly high - accounting for trend
• the MSR/MSE ratio is large and strongly significant - we are picking up significant trend
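For the ten solder pairs listed earlier, R2 and the MSR/MSE ratio can be computed from the sums of squares; a sketch:

```python
# R^2 and the MSR/MSE ratio for the straight-line solder fit.
temp = [245, 215, 218, 265, 251, 213, 234, 257, 244, 225]
thick = [171.6, 201.1, 213.2, 153.3, 178.9, 226.6, 190.3, 171.0, 197.5, 209.8]

n = len(temp)
xbar, ybar = sum(temp) / n, sum(thick) / n
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(temp, thick)) \
     / sum((x - xbar) ** 2 for x in temp)
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * x for x in temp]

# Dissection of the sum of squares: SSTotal = SSRegression + SSError
ss_total = sum((y - ybar) ** 2 for y in thick)
ss_reg = sum((yh - ybar) ** 2 for yh in yhat)
ss_err = sum((y - yh) ** 2 for y, yh in zip(thick, yhat))

r2 = ss_reg / ss_total                        # = 1 - ss_err / ss_total
msr_over_mse = (ss_reg / 1) / (ss_err / (n - 2))

print(100 * r2, msr_over_mse)  # R^2 in %, and the MSR/MSE test ratio
```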

Properties of the Parameter Estimates

Let’s look at the formal def’ns. of the parameter estimates:

β̂1 = Σ (xi - x̄)(Yi - Ȳ) / Σ (xi - x̄)²

β̂0 = Ȳ - β̂1 x̄

x’s have been left in lower case only to emphasize the fact that they aren’t random variables

Properties of the Parameter Estimates

The eqns. for the parameter estimates are of the form:

β̂ = Σ ai Yi, with coefficients ai that depend only on the xi

i.e., linear combinations of random variables.

If Y’s are normally distributed, then linear combinations of the Y’s are normally distributed, and parameter estimates are normally distributed

The parameter estimates are STATISTICS


Properties of the Parameter Estimates

Similarly, taking expectations of the estimates gives E{β̂0} = β0 and E{β̂1} = β1

Conclusion?

• the value expected on average for the least squares parameter estimates is the true value of the parameter
• if we repeated the data collection/model estimation exercise an infinite number of times, we would obtain the true parameter values “on average”

The least squares parameter estimates are UNBIASED


Variance of the Parameter Estimates

Since the parameter estimates are unbiased, Var{β̂} = E{(β̂ - β)²}

Using the definitions for the parameter estimates:

Var{β̂1} = σ² / Σ (xi - x̄)²

Var{β̂0} = σ² [ 1/N + x̄² / Σ (xi - x̄)² ]

where σ² is the variance of the random noise component in the measurements

Inferences - decisions - can be made about the true values of the parameters by taking into account the variation in the parameter estimates

• hypothesis tests
• confidence limits

The inference requires knowledge of the random behaviour - sampling behaviour - of the parameter estimate statistics

• distribution - for Normally distributed random components in the data, the parameter estimates are Normal random variables


Inferences for Parameters

We follow exactly the same argument

• use the estimate of the random noise variance to estimate the variance of the parameter estimates using the expression for parameter estimate variance
• the true value of the parameter is the mean of the parameter estimate


Confidence Intervals for Parameters

For intercept: β̂0 ± t(ν, 1 - α/2) · sβ̂0

For slope: β̂1 ± t(ν, 1 - α/2) · sβ̂1
Degrees of freedom for the t-distribution:

• comes from the degrees of freedom of the estimate of the random noise variance
• option 1 - use external estimate of noise variance - “inherent” variance that we had before
• option 2 - use mean square error of the residuals (MSE) - sometimes referred to as the “standard error”


Example - Solder Thickness

Using the MSE to estimate the inherent noise variance:


Example - Solder Thickness

95% Confidence Limits

For intercept:

For slope:


Example - Solder Thickness

Interpretation -

• slope parameter is significantly non-zero
• intercept parameter is significantly non-zero
• retain both terms in the model
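A sketch of the interval calculation for the ten solder pairs, using the MSE (option 2) as the noise-variance estimate; 2.306 is the 97.5% point of the t-distribution with 8 degrees of freedom:

```python
import math

# 95% confidence limits for the slope and intercept of the solder fit,
# with the MSE as the estimate of the random noise variance.
temp = [245, 215, 218, 265, 251, 213, 234, 257, 244, 225]
thick = [171.6, 201.1, 213.2, 153.3, 178.9, 226.6, 190.3, 171.0, 197.5, 209.8]

n = len(temp)
xbar, ybar = sum(temp) / n, sum(thick) / n
S_xx = sum((x - xbar) ** 2 for x in temp)
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(temp, thick)) / S_xx
b0 = ybar - b1 * xbar
mse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(temp, thick)) / (n - 2)

# Standard errors from the parameter variance expressions
se_b1 = math.sqrt(mse / S_xx)
se_b0 = math.sqrt(mse * (1.0 / n + xbar ** 2 / S_xx))

t_crit = 2.306  # t-distribution, N-2 = 8 degrees of freedom, 97.5% point
slope_ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)
intercept_ci = (b0 - t_crit * se_b0, b0 + t_crit * se_b0)

print(slope_ci, intercept_ci)  # neither interval contains zero
```

Since neither interval contains zero, both parameters are significantly non-zero and both terms are retained, matching the interpretation above.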


Correlation of the Parameter Estimates

Note that β̂0 = ȳ - β̂1 x̄

I.e., the parameter estimate for the intercept depends linearly on the slope!

• the slope and intercept estimates are correlated

[sketch: changing slope changes the point of intersection with the axis because the line must go through the centroid of the data]

Getting Rid of the Covariance

Let’s define the explanatory variable as the deviation from its average:

zi = xi - x̄   (average of z is zero)

Least Squares parameter estimates:

β̂1 = Σ zi (yi - ȳ) / Σ zi²

β̂0 = ȳ

- note that now there is no explicit dependence on the slope value in the intercept expression
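A quick check of the centred form on the solder data: the intercept estimate reduces to the mean of the y’s, with no reference to the slope. A minimal sketch:

```python
# Centering the explanatory variable removes the covariance between
# the slope and intercept estimates: with z = x - xbar, b0 = ybar.
temp = [245, 215, 218, 265, 251, 213, 234, 257, 244, 225]
thick = [171.6, 201.1, 213.2, 153.3, 178.9, 226.6, 190.3, 171.0, 197.5, 209.8]

n = len(temp)
xbar, ybar = sum(temp) / n, sum(thick) / n
z = [x - xbar for x in temp]   # centered regressor: average of z is zero

b1 = sum(zi * (y - ybar) for zi, y in zip(z, thick)) / sum(zi ** 2 for zi in z)
b0 = ybar                      # no explicit dependence on the slope

print(b0, b1)
```

The slope estimate is unchanged by centering; only the intercept (and its correlation with the slope) is affected.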

Getting Rid of the Covariance

In this form of the model, the slope and intercept parameter estimates are uncorrelated

Why is lack of correlation useful?

• allows independent decisions about parameter estimates
• decide whether slope is significant, intercept is significant individually
• “unique” assignment of trend
• intercept clearly associated with mean of y’s
• slope clearly associated with steepness of trend
• correlation can be eliminated by altering form of model, and choice of experimental points


More background (not on exam)

Joint Probability Distributions


Outline
• considering outcomes of random variables together
• discrete case - joint probability functions
• continuous case - joint density functions
• expected values - mean, variance, covariance
• covariance - measure of systematic linear relationships


Considering Random Variables Jointly

In some instances, we may be interested in how several random quantities occur together - “jointly”

Examples:

Discrete -

• automobile colours - {red, blue, green, black, silver}
• automobile finish - {metallic, matte}
• jointly - {(red,matte), (red,metallic), (blue,matte), (blue, metallic), …}

Continuous -

• composition -- 0.1 < composition < 0.5 g/L
• temperature -- 300 C < temperature < 350 C
• jointly -- (0.1 < composition < 0.5, 300 < T < 350)


Considering Random Variables Jointly

Graphical summary - bivariate histogram

[bivariate histogram: frequency as a function of the range in each variable]

Joint Probabilities

In the joint situation, we consider events defined in terms of pairs of outcomes - one from each random variable.

We can summarize the probability of pairs of occurrences using probability function or probability density functions.

Note that in this instance, the functions are defined on the plane -- i.e., they assign probabilities to a pair of coordinates


Joint Probabilities

Discrete Case - Joint probability function

e.g., car colour and finish:

[table of joint probabilities pXY over colours {red, blue} and finishes {metallic, matte}]
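A small sketch of a joint probability function for this example, with hypothetical probabilities (the slide’s actual values are not reproduced in the transcript), including the marginals obtained by summing out the other variable:

```python
# Hypothetical joint probability function for car colour and finish
# (illustrative numbers only - not the slide's actual values).
p_xy = {
    ("red", "metallic"): 0.20, ("red", "matte"): 0.10,
    ("blue", "metallic"): 0.45, ("blue", "matte"): 0.25,
}
assert abs(sum(p_xy.values()) - 1.0) < 1e-12  # probabilities sum to one

# Marginal probability functions: sum out the other variable
p_colour = {}
p_finish = {}
for (colour, finish), p in p_xy.items():
    p_colour[colour] = p_colour.get(colour, 0.0) + p
    p_finish[finish] = p_finish.get(finish, 0.0) + p

print(p_colour, p_finish)
```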

Joint Probabilities

Continuous Case - joint probability density function fXY(x, y)

and cumulative joint distribution function:

FXY(x, y) = P(X ≤ x, Y ≤ y)

Example - bivariate Normal distribution function


Joint Probabilities

Example - bivariate Normal distribution

[plot of the bivariate Normal density]

Recovering Individual Density Functions

We can “integrate out” the joint dependence, and recover the individual probability density functions, e.g., fX(x) = ∫ fXY(x, y) dy

• move from occurrence of X AND Y to occurrence of X for any value of Y
• same interpretation for distribution of Y
• referred to as “marginal density functions”


Expected Values

Given a function g(X,Y), we can define the expected value as:

E{g(X,Y)} = ∫ ∫ g(x, y) fXY(x, y) dx dy

Examples:

• g(X,Y) = X - recover mean of X
• g(X,Y) = Y - mean of Y
• g(X,Y) = (X - μX)(Y - μY) - covariance of X and Y -- to be discussed in regression section


Independence and Joint Distributions

Recall that independence implies that:

pXY(x, y) = pX(x) pY(y)

• similarly for continuous random variables, and cumulative distributions

This implies that E{XY} = E{X} E{Y}, so that Cov(X,Y) = 0

This plays a role for mean and variance of sample average.
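A small discrete check that independence makes the covariance vanish; the probability values are hypothetical:

```python
# Under independence, p_XY(x,y) = p_X(x) * p_Y(y), so E{XY} = E{X}E{Y}
# and the covariance is zero. A small discrete check:
p_x = {0: 0.3, 1: 0.7}
p_y = {1: 0.5, 2: 0.5}
p_xy = {(x, y): px * py for x, px in p_x.items() for y, py in p_y.items()}

ex = sum(x * p for x, p in p_x.items())
ey = sum(y * p for y, p in p_y.items())
exy = sum(x * y * p for (x, y), p in p_xy.items())

cov = exy - ex * ey
print(cov)  # zero up to roundoff
```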
