##### Regression analysis


**Regression analysis**
Relating two data matrices/tables (X-data and Y-data) to each other
Purpose: prediction and interpretation

**Typical examples**
• Spectroscopy: Predict chemistry from spectral measurements
• Product development: Relating sensory to chemistry data
• Marketing: Relating sensory data to consumer preferences

**Topics covered**
• Simple linear regression
• The selectivity problem: a reason why multivariate methods are needed
• The collinearity problem: a reason why data compression is needed
• The outlier problem: why and how to detect

**Simple linear regression**
• One y and one x. Use x to predict y.
• Use a linear model/equation and fit it by least squares

**Data structure**

| X-variable | Y-variable |
|---|---|
| 2 | 7 |
| 4 | 6 |
| 1 | 8 |
| ... | ... |

Rows are objects; the x and y columns contain the same number of objects.

**Least squares (LS) used**
for estimation of regression coefficients

$y = b_0 + b_1 x + e$

Figure: simple linear regression, a fitted line in the (x, y) plane with intercept $b_0$ and slope $b_1$.

**Model**
Figure: flow diagram linking Data (X, Y), Pre-processing, Outliers?, Regression analysis, Model, Interpretation, Future X and Prediction.

**The selectivity problem**
A reason why multivariate methods are needed

**Multiple linear regression**
• Provides
  • predicted values
  • regression coefficients
  • diagnostics
• If there are many highly collinear variables
  • unstable regression equations
  • difficult to interpret coefficients: many and unstable

**Collinearity, the problem of correlated X-variables**

$y = b_0 + b_1 x_1 + b_2 x_2 + e$

• Regression in this case is fitting a plane to the data (open circles in the figure)
• The two x's have high correlation
• This leads to an unstable equation/plane (in the direction with little variability)

**Possible solutions**
• Select the most important wavelengths/variables (stepwise methods)
• Compress the variables to the most dominating dimensions (PCR, PLS)
• We will concentrate on the latter (the two can be combined)

**Data compression**
• We will first discuss the situation with one y-variable
• Focus on ideas and principles
• Provides regression equation (as above) and plots for interpretation

**Model for data compression methods**

$X = TP^{T} + E$

$y = Tq + f$

• X and y are centred
• T: scores, carrier of information from X to y
• P, q: loadings
• E, f: residuals (noise)

**Regression by data compression**
Figure with two panels: PCA to compress data (axes x1, x2, x3, with PC1 and score $t_i$) and regression on scores (y against t-score, with q).

Figure: path diagrams for MLR (x1 to x4 connected directly to y), PCR and PLS (x1 to x4 connected to y through t1 and t2).

**PCR and PLS**
For each factor/component
• PCR
  • Maximize variance of linear combinations of X
• PLS
  • Maximize covariance between linear combinations of X and y

Each factor is subtracted before the next is computed

**Principal component regression (PCR)**
• Uses principal components
• Solves the collinearity problem, stable solutions
• Provides plots for interpretation (scores and loadings)
• Well understood
• Outlier diagnostics
• Easy to modify
• But uses only X to determine components

**PLS-regression**
• Easy to compute
• Stable solutions
• Provides scores and loadings
• Often fewer components than PCR
• Sometimes better predictions

**PCR and PLS for several Y-variables**
• PCR is computed for each Y. Each Y is regressed onto the principal components
• PLS: The algorithm is easily modified. Maximises covariance between linear combinations of X and Y.
• For both methods: Regression equations and plots

**Validation is important**
• Measure quality of the predictor
• Determine A, the number of components
• Compare methods

**Prediction testing**
• Calibration: estimate coefficients
• Testing/validation: predict y, use the coefficients

**Cross-validation**
• Calibrate: find $y = f(x)$, estimate coefficients
• Predict y, use the coefficients

**Validation**
• Compute RMSEP
• Plot RMSEP versus number of components
• Choose the number of components with best RMSEP properties
• Compare for different methods

Figure: RMSEP plot, with MLR marked. NIR calibration of protein in wheat: 6 NIR wavelengths, 12 calibration samples, 26 test samples.

Figure: conceptual illustration of important phenomena, showing estimation error and model error.

**Prediction vs. cross-validation**
• Prediction testing: Prediction ability of the predictor at hand. Requires much data.
• Cross-validation: Property of the method. Better for smaller data sets.

**Validation**
• One should also plot measured versus predicted y-value
• Correlation can be computed, but can sometimes be misleading

**Example, plot of y versus predicted y**
Figure: plot of measured and predicted protein, NIR calibration.

**Outlier detection**
• Instrument error or noise
• Drift of signal (over time)
• Misprints
• Samples outside normal range (different population)

**Outlier detection**
• Outliers can be detected because
  • Model for spectral data ($X = TP^{T} + E$)
  • Model for relationship between X and y ($y = Tq + f$)

**Outlier detection tools**
• Residuals
  • X and y-residuals
  • X-residuals as before, y-residual is difference between measured and predicted y
• Leverage
  • $h_i$
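
**Sketch: least squares, y-residuals and cross-validated RMSEP**
The ideas above (fit by least squares, take y-residuals as measured minus predicted y, validate by cross-validation) can be illustrated for the simple one-x, one-y case. This is a minimal sketch, not the presenter's code: the data values and the function names `fit_line`, `predict`, `y_residuals` and `rmsep_loo` are assumptions made for illustration, and RMSEP is taken in its usual sense of root mean squared error of prediction, which the slides do not spell out.

```python
import math


def fit_line(x, y):
    """Least-squares estimates (b0, b1) for the model y = b0 + b1*x + e."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_mean - b1 * x_mean
    return b0, b1


def predict(coef, x):
    """Predicted y for each value in x, using the fitted coefficients."""
    b0, b1 = coef
    return [b0 + b1 * xi for xi in x]


def y_residuals(coef, x, y):
    """y-residual: measured minus predicted y for each object."""
    return [yi - pi for yi, pi in zip(y, predict(coef, x))]


def rmsep_loo(x, y):
    """Leave-one-out cross-validation: each object is predicted from a
    model calibrated on all the other objects."""
    sq_errors = []
    for i in range(len(x)):
        x_cal = x[:i] + x[i + 1:]
        y_cal = y[:i] + y[i + 1:]
        coef = fit_line(x_cal, y_cal)
        y_hat = predict(coef, [x[i]])[0]
        sq_errors.append((y[i] - y_hat) ** 2)
    return math.sqrt(sum(sq_errors) / len(sq_errors))


# Made-up illustration data (not taken from the slides).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]

coef = fit_line(x, y)
residuals = y_residuals(coef, x, y)
print("b0, b1:", coef)
print("y-residuals:", residuals)
print("RMSEP (leave-one-out):", rmsep_loo(x, y))
```

An object with an unusually large y-residual would be a candidate outlier. For PCR or PLS, the same leave-one-out loop would be repeated for each number of components and the resulting RMSEP values plotted against that number.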