Regression analysis


# Regression analysis - PowerPoint PPT Presentation



## PowerPoint Slideshow about 'Regression analysis' - anatole

Presentation Transcript
Regression analysis

Relating two data matrices/tables to each other

Purpose: prediction and interpretation

Y-data

X-data

Typical examples
• Spectroscopy: Predict chemistry from spectral measurements
• Product development: Relating sensory to chemistry data
• Marketing: Relating sensory data to consumer preferences
Topics covered
• Simple linear regression
• The selectivity problem: a reason why multivariate methods are needed
• The collinearity problem: a reason why data compression is needed
• The outlier problem: why and how to detect
Simple linear regression
• One y and one x. Use x to predict y.
• Use a linear model/equation and fit it by least squares
Data structure

X-variable    Y-variable
2             7
4             6
1             8
.             .
.             .
.             .

Objects: same number in the x- and y-columns

Least squares (LS) is used for estimation of the regression coefficients

y=b0+b1x+e

[Figure: simple linear regression. A straight line fitted in the (x, y) plane, with intercept b0 and slope b1.]
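As a minimal sketch of the least-squares fit for y=b0+b1x+e, the closed-form estimates can be computed in a few lines. The function names and the three toy x/y pairs are chosen for illustration only and are not part of the slides.

```python
def fit_simple_linear_regression(x, y):
    """Least-squares estimates (b0, b1) for the model y = b0 + b1*x + e."""
    n = len(x)
    if n != len(y):
        raise ValueError("x and y must contain the same number of objects")
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Slope: centred cross-product divided by centred sum of squares of x.
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b1 = sxy / sxx
    # Intercept: the fitted line passes through (x_mean, y_mean).
    b0 = y_mean - b1 * x_mean
    return b0, b1


def predict(b0, b1, x_new):
    """Predicted y for a future x, using the estimated coefficients."""
    return b0 + b1 * x_new


# Toy data, assumed for illustration only.
x = [2.0, 4.0, 1.0]
y = [7.0, 6.0, 8.0]
b0, b1 = fit_simple_linear_regression(x, y)  # b0 = 8.5, b1 = -9/14
```

Once b0 and b1 are estimated, `predict(b0, b1, x_new)` gives the prediction for a future x.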

[Diagram: the regression workflow. Labels: data (X,Y), pre-processing, regression analysis, model, future X, prediction, interpretation, outliers?]

The selectivity problem

A reason why multivariate methods are needed

Multiple linear regression
• Provides predicted values, regression coefficients and diagnostics
• If there are many highly collinear variables: unstable regression equations, and coefficients that are difficult to interpret (many and unstable)

Collinearity, the problem of correlated X-variables

y=b0+b1x1+b2x2+e

Regression in this case is fitting a plane to the data (open circles)

The two x’s have high correlation

(in the direction with little variability)
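The instability can be reproduced on a small constructed data set. The sketch below assumes NumPy; all values are made up for illustration. The two x-variables differ only by a tiny alternating pattern, and a small disturbance of y along that same pattern changes the fitted coefficients drastically.

```python
import numpy as np


def fit_mlr(X, y):
    """Least-squares fit of y = b0 + b1*x1 + b2*x2 + e; returns [b0, b1, b2]."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


# Constructed data (illustrative): x2 is x1 plus a tiny alternating offset.
s = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0])
x1 = np.arange(1.0, 7.0)
x2 = x1 + 0.01 * s
X = np.column_stack([x1, x2])
corr = np.corrcoef(x1, x2)[0, 1]   # very close to 1

y_clean = x1 + x2                  # true coefficients: b1 = b2 = 1
y_noisy = y_clean + 0.05 * s       # small disturbance in the low-variance direction

b_clean = fit_mlr(X, y_clean)      # approximately [0, 1, 1]
b_noisy = fit_mlr(X, y_noisy)      # approximately [0, -4, 6]
```

The sum b1+b2 stays at 2 in both fits. Only the split between the two collinear variables is unstable, which is the direction with little variability in X.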

Possible solutions
• Select the most important wavelengths/variables (stepwise methods)
• Compress the variables to the most dominating dimensions (PCR, PLS)
• We will concentrate on the latter (can be combined)
Data compression
• We will first discuss the situation with one y-variable
• Focus on ideas and principles
• Provides regression equation (as above) and plots for interpretation

Model for data compression methods

Centred X and y:

X=TP^T+E

y=Tq+f

T: scores, carrier of information from X to y

E, f: residuals (noise)
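A minimal sketch of this model, assuming NumPy and assuming the scores are taken from PCA (computed here by a singular value decomposition). The function name and the toy data are illustrative; P denotes the loading matrix.

```python
import numpy as np


def compress_and_regress(X, y, n_components):
    """Fit X = T P^T + E and y = T q + f on centred data, using PCA scores."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # SVD of centred X; the rows of Vt are the principal directions.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_components].T                    # loadings, K x A
    T = Xc @ P                                 # scores, N x A
    # PCA scores are orthogonal, so each q_a is a separate one-variable LS fit.
    # (A component with zero variance would give a division by zero here.)
    q = (T.T @ yc) / np.sum(T**2, axis=0)
    E = Xc - T @ P.T                           # X-residuals
    f = yc - T @ q                             # y-residuals
    return T, P, q, E, f


# Toy data, assumed for illustration: the second column is twice the first,
# so two components reproduce centred X exactly.
X = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 5.9],
              [3.0, 6.0, 9.2],
              [4.0, 8.0, 11.8],
              [5.0, 10.0, 15.1]])
y = np.array([1.0, 2.1, 2.9, 4.2, 5.0])
T, P, q, E, f = compress_and_regress(X, y, n_components=2)
```

With fewer components than the rank of X, E is no longer zero and carries the part of X that the scores do not describe.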

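Regression by data compression

[Figure, two panels. "PCA to compress data": objects in the (x1, x2, x3) space, with PC1 and the score t_i. "Regression on scores": y against the t-score, with regression coefficient q.]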

[Diagram: MLR, PCR and PLS compared. In MLR, x1 to x4 are related directly to y. In PCR and PLS, x1 to x4 are related to y through the scores t1 and t2.]

PCR and PLS

For each factor/component

• PCR: maximize the variance of linear combinations of X
• PLS: maximize the covariance between linear combinations of X and y

Each factor is subtracted before the next is computed
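The two criteria can be compared on the first component with a short sketch, assuming NumPy and centred data. The function names and the four-object toy data set are illustrative: the data are built so that x1 has the largest variance while y depends only on x2.

```python
import numpy as np


def first_pcr_weight(Xc):
    """Unit vector w maximising the variance of Xc @ w (first principal axis)."""
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[0]


def first_pls_weight(Xc, yc):
    """Unit vector w maximising the covariance of Xc @ w with yc.

    For one y-variable the maximiser is proportional to Xc^T yc.
    """
    w = Xc.T @ yc
    return w / np.linalg.norm(w)


def deflate(Xc, w):
    """Subtract the factor with score t = Xc @ w before computing the next one."""
    t = Xc @ w
    p = Xc.T @ t / (t @ t)
    return Xc - np.outer(t, p)


# Toy centred data, assumed for illustration.
Xc = np.array([[-10.0, -1.0],
               [10.0, -1.0],
               [-10.0, 1.0],
               [10.0, 1.0]])
yc = Xc[:, 1].copy()

w_pcr = first_pcr_weight(Xc)       # +/-[1, 0]: the high-variance direction
w_pls = first_pls_weight(Xc, yc)   # [0, 1]: the direction that covaries with y
X_deflated = deflate(Xc, w_pls)    # the x2 column is removed, x1 remains
```

Here PCR spends its first component on x1, which carries no information about y, while PLS goes directly to x2.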

Principal component regression (PCR)
• Uses principal components
• Solves the collinearity problem, stable solutions
• Well understood
• Outlier diagnostics
• Easy to modify
• But uses only X to determine components
PLS-regression
• Easy to compute
• Stable solutions
• Often fewer components than PCR
• Sometimes better predictions
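To illustrate why PLS is easy to compute, here is a minimal sketch of PLS regression for one y-variable, assuming NumPy. It follows the common textbook variant with orthogonal scores; the slides do not specify the algorithm, so the details below, the names and the toy data are assumptions.

```python
import numpy as np


def pls1_fit(X, y, n_components):
    """PLS regression for a single y; returns (b0, b) with yhat = b0 + X @ b."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xr = X - x_mean                      # centred X, deflated factor by factor
    yr = y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xr.T @ yr                    # direction of maximal covariance with y
        w = w / np.linalg.norm(w)
        t = Xr @ w                       # score vector
        tt = t @ t
        p = Xr.T @ t / tt                # X-loading
        qa = (yr @ t) / tt               # y-loading
        Xr = Xr - np.outer(t, p)         # subtract this factor from X
        yr = yr - qa * t                 # and from y
        W.append(w)
        P.append(p)
        q.append(qa)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    b = W @ np.linalg.solve(P.T @ W, q)  # coefficients for the original x-variables
    b0 = y_mean - x_mean @ b
    return b0, b


# Toy data, assumed for illustration: y is an exact linear function of X.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0],
              [4.0, 3.0], [5.0, 6.0], [6.0, 5.0]])
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1]
b0, b = pls1_fit(X, y, n_components=2)   # approximately b0 = 1, b = [2, -3]
```

When the number of components equals the rank of X, as in this toy case, the result coincides with ordinary least squares; the benefit of PLS comes from using fewer components.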
PCR and PLS for several Y-variables
• PCR is computed for each Y. Each Y is regressed onto the principal components
• PLS: The algorithm is easily modified. Maximises the covariance between linear combinations of X and Y.
• For both methods: Regression equations and plots
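For PCR, the several-Y case can be sketched as follows, assuming NumPy: the principal components are computed once from X, and every column of Y is regressed onto the same scores. The function name and the toy data are illustrative.

```python
import numpy as np


def pcr_fit_multi(X, Y, n_components):
    """PCR for several Y-variables; returns (B0, B) with Yhat = B0 + X @ B."""
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - x_mean, Y - y_mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_components].T                           # loadings, K x A
    T = Xc @ P                                        # scores, N x A
    # One column of Q per Y-variable; the scores are shared by all of them.
    Q = (T.T @ Yc) / np.sum(T**2, axis=0)[:, None]    # A x J
    B = P @ Q                                         # K x J
    B0 = y_mean - x_mean @ B
    return B0, B


# Toy data, assumed for illustration: two Y-variables, both exactly linear in X.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0],
              [4.0, 3.0], [5.0, 6.0], [6.0, 5.0]])
Y = np.column_stack([1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1],
                     X[:, 0] + X[:, 1]])
B0, B = pcr_fit_multi(X, Y, n_components=2)
```

Each column of B is the regression equation for one Y-variable, so the result is the same as running PCR separately for each Y.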
Validation is important
• Measure quality of the predictor
• Determine A – number of components
• Compare methods

Prediction testing
• Calibration: estimate coefficients
• Testing/validation: predict y, using the coefficients

Cross-validation
• Calibrate, find y=f(x), estimate coefficients
• Predict y, using the coefficients
Validation
• Compute RMSEP (root mean square error of prediction)
• Plot RMSEP versus component
• Choose the number of components with best RMSEP properties
• Compare for different methods
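The steps above can be sketched for PCR with leave-one-out cross-validation, assuming NumPy. RMSEP is taken here as the square root of the mean squared difference between measured and predicted y. Picking the component count with the smallest RMSEP is a simplification of "best RMSEP properties", and the names and toy data are illustrative.

```python
import numpy as np


def pcr_coefficients(X, y, n_components):
    """PCR fit; returns (b0, b) with yhat = b0 + X @ b."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_components].T
    T = Xc @ P
    q = (T.T @ yc) / np.sum(T**2, axis=0)
    b = P @ q
    return y_mean - x_mean @ b, b


def rmsep_loo(X, y, n_components):
    """Leave-one-out cross-validated RMSEP = sqrt(mean((y_i - yhat_i)^2))."""
    errors = []
    for i in range(len(y)):
        keep = np.arange(len(y)) != i            # calibrate without object i
        b0, b = pcr_coefficients(X[keep], y[keep], n_components)
        errors.append(y[i] - (b0 + X[i] @ b))    # predict the left-out object
    return float(np.sqrt(np.mean(np.square(errors))))


def choose_components(X, y, max_components):
    """RMSEP for 1..max_components and the count with the smallest value."""
    curve = [rmsep_loo(X, y, a) for a in range(1, max_components + 1)]
    return int(np.argmin(curve)) + 1, curve


# Toy data, assumed for illustration: y needs both directions in X.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0],
              [4.0, 3.0], [5.0, 6.0], [6.0, 5.0]])
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1]
best, curve = choose_components(X, y, max_components=2)
```

Plotting `curve` against the component number gives the RMSEP plot. The same loop can be run with another fitting function to compare methods.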

[Figure: RMSEP plot, with MLR marked. NIR calibration of protein in wheat: 6 NIR wavelengths, 12 calibration samples, 26 test samples.]

[Figure: conceptual illustration of important phenomena, showing estimation error and model error.]

Prediction vs. cross-validation
• Prediction testing: Prediction ability of the predictor at hand. Requires much data.
• Cross-validation: Property of the method. Better for smaller data sets.
Validation
• One should also plot measured versus predicted y-value
• Correlation can be computed, but can sometimes be misleading
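A small sketch of why the correlation can mislead: predictions with a systematic error can correlate perfectly with the measured values and still be far off. The numbers are made up for illustration.

```python
import math


def rmsep(measured, predicted):
    """Root mean square error of prediction."""
    n = len(measured)
    return math.sqrt(sum((m - p) ** 2 for m, p in zip(measured, predicted)) / n)


def correlation(a, b):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(a)
    a_mean, b_mean = sum(a) / n, sum(b) / n
    sab = sum((x - a_mean) * (z - b_mean) for x, z in zip(a, b))
    saa = sum((x - a_mean) ** 2 for x in a)
    sbb = sum((z - b_mean) ** 2 for z in b)
    return sab / math.sqrt(saa * sbb)


# Toy values, assumed for illustration: the predictions have a wrong slope and offset.
measured = [10.0, 11.0, 12.0, 13.0, 14.0]
predicted = [2.0 * m + 3.0 for m in measured]

r = correlation(measured, predicted)   # 1 up to rounding, a "perfect" correlation
error = rmsep(measured, predicted)     # about 15, so the predictions are useless
```

The plot of measured versus predicted y shows the problem immediately, because the points lie on a line far from the diagonal.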

[Figure: example plot of measured versus predicted y. Measured and predicted protein, NIR calibration.]

Outlier detection
• Instrument error or noise
• Drift of signal (over time)
• Misprints
• Samples outside normal range (different population)
Outlier detection
• Outliers can be detected because we have
• a model for the spectral data (X=TP^T+E)
• a model for the relationship between X and y (y=Tq+f)
Outlier detection tools
• Residuals: X- and y-residuals. X-residuals as before; the y-residual is the difference between measured and predicted y
• Leverage: h_i
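These tools can be sketched with PCA scores, assuming NumPy. The slides give only the symbol h_i; the leverage formula used below, h_i = 1/N + sum over components of t_ia^2 / (t_a^T t_a), is a common textbook definition and is an assumption here, as are the function name and the toy data.

```python
import numpy as np


def outlier_diagnostics(X, y, n_components):
    """Per-object X-residual size, y-residual and leverage from a PCR model."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_components].T
    T = Xc @ P
    q = (T.T @ yc) / np.sum(T**2, axis=0)
    E = Xc - T @ P.T                           # X-residuals from X = T P^T + E
    f = yc - T @ q                             # y-residuals from y = T q + f
    x_resid = np.sqrt(np.sum(E**2, axis=1))    # length of each row of E
    # Assumed leverage definition: position of each object in the score space.
    leverage = 1.0 / len(y) + np.sum(T**2 / np.sum(T**2, axis=0), axis=1)
    return x_resid, f, leverage


# Toy data, assumed for illustration: the last object lies far from the others.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0],
              [4.0, 3.0], [5.0, 6.0], [20.0, 19.0]])
y = X[:, 0] + X[:, 1]
x_resid, y_resid, leverage = outlier_diagnostics(X, y, n_components=1)
```

With this definition the leverages sum to 1 plus the number of components, so an object with a leverage far above the average stands out, as the last object does here.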