## PowerPoint Slideshow about 'Regression analysis' - anatole

Presentation Transcript

Regression analysis

Relating two data matrices/tables to each other

Purpose: prediction and interpretation

Y-data

X-data

Typical examples

- Spectroscopy: Predict chemistry from spectral measurements
- Product development: Relating sensory to chemistry data
- Marketing: Relating sensory data to consumer preferences

Topics covered

- Simple linear regression
- The selectivity problem: a reason why multivariate methods are needed
- The collinearity problem: a reason why data compression is needed
- The outlier problem: why and how to detect

Simple linear regression

- One y and one x. Use x to predict y.
- Use a linear model/equation and fit it by least squares for estimation of the regression coefficients

$y = b_0 + b_1 x + e$

[Figure: regression line with axes x and y; b0 and b1 marked on the plot]
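The least-squares fit of the line can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's code: the function name `fit_line` and the data are made up for the example.

```python
def fit_line(x, y):
    """Least-squares estimates (b0, b1) for the model y = b0 + b1*x + e."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b0 = y_mean - b1 * x_mean
    return b0, b1


# Hypothetical data, constructed so that y = 1 + 2x exactly.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 9.0]
b0, b1 = fit_line(x, y)
print(b0, b1)
```

A new x is then predicted as `b0 + b1 * x_new`.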

Simple linear regression

[Diagram: Data (X, Y), Pre-processing, Outliers?, Regression analysis, Interpretation, Future X, Prediction]

A reason why multivariate methods are needed

Multiple linear regression

- Provides
  - predicted values
  - regression coefficients
  - diagnostics
- If there are many highly collinear variables
  - unstable regression equations
  - difficult to interpret coefficients: many and unstable

Collinearity, the problem of correlated X-variables

$y = b_0 + b_1 x_1 + b_2 x_2 + e$

Regression in this case is fitting a plane to the data (open circles)

The two x’s have high correlation

Leads to unstable equation/plane (in the direction with little variability)
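The instability can be made concrete with a small sketch. The data below are invented for illustration (x2 is almost a copy of x1), and `fit_plane` is a hypothetical helper that solves the two-variable least-squares problem directly.

```python
def fit_plane(x1, x2, y):
    """Least-squares (b0, b1, b2) for y = b0 + b1*x1 + b2*x2 + e."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    a = [v - m1 for v in x1]  # centred x1
    b = [v - m2 for v in x2]  # centred x2
    c = [v - my for v in y]   # centred y
    s11 = sum(u * u for u in a)
    s22 = sum(v * v for v in b)
    s12 = sum(u * v for u, v in zip(a, b))
    s1y = sum(u * w for u, w in zip(a, c))
    s2y = sum(v * w for v, w in zip(b, c))
    # The determinant is close to zero when x1 and x2 are nearly collinear.
    det = s11 * s22 - s12 * s12
    b1 = (s1y * s22 - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    b0 = my - b1 * m1 - b2 * m2
    return b0, b1, b2


# Hypothetical data: x2 is almost identical to x1.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.01, 1.99, 3.02, 3.98, 5.0]
y = [u + v for u, v in zip(x1, x2)]  # exactly y = x1 + x2

# A small made-up disturbance of y (at most 0.05 in size).
noise = [0.05, -0.05, 0.05, -0.05, 0.0]
y_noisy = [v + d for v, d in zip(y, noise)]

_, b1, b2 = fit_plane(x1, x2, y)
_, b1_noisy, b2_noisy = fit_plane(x1, x2, y_noisy)
print("clean:", b1, b2)
print("noisy:", b1_noisy, b2_noisy)
```

In this example the individual coefficients shift by a large amount under the small disturbance, while their sum barely changes: the plane is well determined along the direction where the data vary and poorly determined across it.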

Possible solutions

- Select the most important wavelengths/variables (stepwise methods)
- Compress the variables to the most dominating dimensions (PCR, PLS)
- We will concentrate on the latter (the two approaches can be combined)

Data compression

- We will first discuss the situation with one y-variable
- Focus on ideas and principles
- Provides regression equation (as above) and plots for interpretation

Model for data compression methods

$X = TP^{T} + E$

$y = Tq + f$

Centred X and y

T: scores, carrier of information from X to y

P, q: loadings

E, f: residuals (noise)
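To see how the scores carry information from X to y, here is a short worked derivation for the PCR case. It assumes the loadings P have orthonormal columns (as principal component loadings do) and that q is estimated by least squares; the symbols $\hat{q}$, $\hat{y}$ and $b$ are introduced here for illustration only.

$$
\hat{q} = (T^{T}T)^{-1}T^{T}y, \qquad T = XP, \qquad \hat{y} = T\hat{q} = XP\hat{q} = Xb \quad \text{with } b = P\hat{q}
$$

For PLS the scores are also linear combinations of the X-variables, so a regression vector exists there as well, but its expression involves the PLS weights and is not derived here.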

PCR and PLS

For each factor/component

- PCR
  - Maximize variance of linear combinations of X
- PLS
  - Maximize covariance between linear combinations of X and y

Each factor is subtracted before the next is computed
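The difference between the two criteria, and the subtraction of a factor, can be sketched as follows. This is an illustration under stated assumptions: X and y are already centred, there is a single y, the PCR direction is found by power iteration (a stand-in for a proper eigenvalue or SVD routine), and all function names and data are hypothetical.

```python
import math


def normalize(v):
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def x_times(X, w):
    """Return the scores t = X w (X is a list of rows)."""
    return [sum(r * c for r, c in zip(row, w)) for row in X]


def xt_times(X, v):
    """Return X^T v."""
    return [sum(row[j] * v[i] for i, row in enumerate(X))
            for j in range(len(X[0]))]


def pcr_weight(X, iterations=200):
    """Unit vector w maximizing the variance of X w (power iteration on X^T X)."""
    w = normalize([1.0] * len(X[0]))
    for _ in range(iterations):
        w = normalize(xt_times(X, x_times(X, w)))
    return w


def pls_weight(X, y):
    """Unit vector w maximizing the covariance between X w and y."""
    return normalize(xt_times(X, y))


def deflate(X, w):
    """Subtract the factor t p^T from X, with t = X w and p = X^T t / (t^T t)."""
    t = x_times(X, w)
    tt = sum(c * c for c in t)
    p = [c / tt for c in xt_times(X, t)]
    return [[xij - ti * pj for xij, pj in zip(row, p)]
            for row, ti in zip(X, t)]


# Hypothetical centred data: variable 1 varies most, variable 2 relates to y.
X = [[2.0, 0.0], [-2.0, 0.0], [0.0, 0.5], [0.0, -0.5]]
y = [0.1, -0.1, 1.0, -1.0]

w_pcr = pcr_weight(X)                 # dominated by variable 1
w_pls = pls_weight(X, y)              # leans towards variable 2
w_pcr2 = pcr_weight(deflate(X, w_pcr))  # next PCR direction after subtraction
print(w_pcr, w_pls, w_pcr2)
```

In this made-up example PCR picks the high-variance variable first, regardless of y, while the PLS weight is pulled towards the variable that co-varies with y.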

Principal component regression (PCR)

- Uses principal components
- Solves the collinearity problem, stable solutions
- Provides plots for interpretation (scores and loadings)
- Well understood
- Outlier diagnostics
- Easy to modify
- But uses only X to determine components

PLS-regression

- Easy to compute
- Stable solutions
- Provides scores and loadings
- Often fewer components than PCR
- Sometimes better predictions

PCR and PLS for several Y-variables

- PCR is computed for each Y. Each Y is regressed onto the principal components
- PLS: The algorithm is easily modified. Maximises covariance between linear combinations of X and Y.
- For both methods: Regression equations and plots

Validation is important

- Measure quality of the predictor
- Determine A – number of components
- Compare methods

Validation

- Compute the RMSEP (root mean square error of prediction)
- Plot RMSEP versus component
- Choose the number of components with best RMSEP properties
- Compare for different methods
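A minimal sketch of these steps, assuming RMSEP is the square root of the mean squared difference between measured and predicted y over the test samples. The selection rule (take the minimum) is a simplifying assumption, and the RMSEP values in the example are invented for illustration, not results from the presentation.

```python
import math


def rmsep(y_measured, y_predicted):
    """Root mean square error of prediction over a set of test samples."""
    n = len(y_measured)
    squared = sum((ym - yp) ** 2 for ym, yp in zip(y_measured, y_predicted))
    return math.sqrt(squared / n)


def best_number_of_components(rmsep_by_components):
    """Return the number of components with the lowest RMSEP.

    The argument maps a component count to its RMSEP value.
    """
    return min(rmsep_by_components, key=rmsep_by_components.get)


# Hypothetical RMSEP curve (made-up numbers).
curve = {1: 0.80, 2: 0.45, 3: 0.30, 4: 0.31, 5: 0.35}
print(best_number_of_components(curve))
```

In practice one often prefers the smallest number of components whose RMSEP is close to the minimum, which is why plotting the curve is recommended rather than relying on the minimum alone.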

MLR

NIR calibration of protein in wheat. 6 NIR wavelengths

12 calibration samples, 26 test samples

Prediction vs. cross-validation

- Prediction testing: Prediction ability of the predictor at hand. Requires much data.
- Cross-validation: Property of the method. Better suited to smaller data sets.
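Cross-validation can be sketched with the simplest variant, leave-one-out, applied to simple linear regression. This is one possible scheme, chosen here for brevity; the function names and data are hypothetical.

```python
import math


def fit_line(x, y):
    """Least-squares (b0, b1) for y = b0 + b1*x + e."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b1 = sxy / sxx
    return y_mean - b1 * x_mean, b1


def loo_cv_rmse(x, y):
    """Leave-one-out cross-validated root mean square error."""
    squared_errors = []
    for i in range(len(x)):
        # Refit the model without sample i, then predict sample i.
        x_train = x[:i] + x[i + 1:]
        y_train = y[:i] + y[i + 1:]
        b0, b1 = fit_line(x_train, y_train)
        squared_errors.append((y[i] - (b0 + b1 * x[i])) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical data: one exact line, one with made-up noise.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y_exact = [1.0, 3.0, 5.0, 7.0, 9.0]
y_noisy = [1.1, 2.9, 5.2, 6.8, 9.1]
print(loo_cv_rmse(x, y_exact), loo_cv_rmse(x, y_noisy))
```

Every sample is used both for fitting and for testing, but never at the same time, which is why the approach suits small data sets.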

Validation

- One should also plot measured versus predicted y-value
- Correlation can be computed, but can sometimes be misleading

Outlier detection

- Instrument error or noise
- Drift of signal (over time)
- Misprints
- Samples outside normal range (different population)

Outlier detection

- Outliers can be detected because
  - Model for spectral data ($X = TP^{T} + E$)
  - Model for relationship between X and y ($y = Tq + f$)

Outlier detection tools

- Residuals
  - X and y-residuals
  - X-residuals as before, y-residual is difference between measured and predicted y
- Leverage
  - $h_i$
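Leverage can be computed from the scores. The sketch below uses one common convention, h_i = 1/n + sum over components of t_ia^2 / (t_a^T t_a), which assumes centred data and mutually orthogonal score columns; some software omits the 1/n term. The function name and the score values are hypothetical.

```python
def leverages(T):
    """Leverage of each sample from a score matrix T (rows = samples).

    Assumes centred data and orthogonal score columns.
    """
    n = len(T)
    n_comp = len(T[0])
    # Sum of squared scores for each component, t_a^T t_a.
    col_ss = [sum(row[a] ** 2 for row in T) for a in range(n_comp)]
    return [1.0 / n + sum(row[a] ** 2 / col_ss[a] for a in range(n_comp))
            for row in T]


# Hypothetical scores for five samples and one component.
T = [[-3.0], [-1.0], [0.0], [1.0], [3.0]]
h = leverages(T)
print(h)
```

Samples far from the centre of the score space get a high leverage and therefore have a strong influence on the fitted model; they deserve a closer look alongside the residuals.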
