
Regression analysis



Presentation Transcript


  1. Regression analysis • Relating two data matrices/tables (X-data and Y-data) to each other • Purpose: prediction and interpretation

  2. Typical examples • Spectroscopy: Predict chemistry from spectral measurements • Product development: Relating sensory to chemistry data • Marketing: Relating sensory data to consumer preferences

  3. Topics covered • Simple linear regression • The selectivity problem: a reason why multivariate methods are needed • The collinearity problem: a reason why data compression is needed • The outlier problem: why and how to detect

  4. Simple linear regression • One y and one x. Use x to predict y. • Use a linear model/equation and fit it by least squares

  5. Data structure Two columns, one X-variable and one Y-variable, with the objects (rows) matched: the same number of objects in the x- and y-column.

  6. Least squares (LS) is used for estimation of the regression coefficients. Simple linear regression fits the line y = b0 + b1x + e, where b0 is the intercept and b1 the slope.
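
A minimal sketch of this least-squares fit in Python (NumPy assumed; the data values are purely illustrative):

```python
import numpy as np

# Illustrative data: one x and one y
x = np.array([2.0, 4.0, 1.0, 7.0, 6.0, 8.0])
y = np.array([1.1, 2.3, 0.4, 3.9, 3.1, 4.5])

# Least-squares estimates of slope b1 and intercept b0
b1, b0 = np.polyfit(x, y, deg=1)   # polyfit returns [slope, intercept]
y_hat = b0 + b1 * x                # fitted values
e = y - y_hat                      # residuals
print(b0, b1)
```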

  7. Workflow Data (X, Y) go through pre-processing and regression analysis to give a model. The model is used for prediction from future X and for interpretation. Check for outliers along the way.

  8. The selectivity problem A reason why multivariate methods are needed

  9. The same approach can also be used for several Y-variables

  10. Multiple linear regression (MLR) • Provides predicted values, regression coefficients and diagnostics • With many highly collinear variables: unstable regression equations, and coefficients that are many, unstable and difficult to interpret

  11. Collinearity: the problem of correlated X-variables y = b0 + b1x1 + b2x2 + e Regression in this case fits a plane to the data (open circles). The two x's are highly correlated, which leads to an unstable equation/plane (in the direction with little variability).
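
A small sketch of that instability (NumPy, synthetic data): two almost identical x-variables give coefficients that swing wildly under a tiny perturbation of y, even though the fitted values barely change.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)      # x2 almost equal to x1: high collinearity
y = x1 + x2 + 0.1 * rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])
b = np.linalg.lstsq(X, y, rcond=None)[0]

# Refit after a tiny perturbation of y: coefficients change drastically
b_pert = np.linalg.lstsq(X, y + 0.01 * rng.normal(size=n), rcond=None)[0]
print(b, b_pert)
```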

  12. Possible solutions • Select the most important wavelengths/variables (stepwise methods) • Compress the variables to the most dominating dimensions (PCR, PLS) • We will concentrate on the latter (the two approaches can be combined)

  13. Data compression • We will first discuss the situation with one y-variable • Focus on ideas and principles • Provides regression equation (as above) and plots for interpretation

  14. Model for data compression methods X = TPᵀ + E, y = Tq + f (centred X and y) T – scores, the carriers of information from X to y P, q – loadings E, f – residuals (noise)
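
A minimal sketch of this model in Python (NumPy only; function and variable names are illustrative): centre X and y, take the first A principal components as scores T with loadings P, and regress y on the scores to get q, which is the PCR case of the model above.

```python
import numpy as np

def pcr_fit(X, y, A):
    """Principal component regression with A components (sketch)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean            # centred X and y
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:A].T                               # loadings
    T = Xc @ P                                 # scores: carriers of information
    q = np.linalg.lstsq(T, yc, rcond=None)[0]  # regress y on the scores
    b = P @ q                                  # coefficients for the original variables
    return b, x_mean, y_mean

def pcr_predict(Xnew, b, x_mean, y_mean):
    return (Xnew - x_mean) @ b + y_mean
```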

  15. Regression by data compression: PCA compresses the data (x1, x2, x3) to scores ti along the principal components (PC1); y is then regressed on the scores via the loading q.

  16. Path diagrams: MLR regresses y directly on x1–x4; PCR and PLS first compress x1–x4 to scores t1, t2 and then regress y on the scores.

  17. PCR and PLS For each factor/component: • PCR maximizes the variance of linear combinations of X • PLS maximizes the covariance between linear combinations of X and y Each factor is subtracted (deflated) before the next is computed.
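
A sketch of the two criteria for the first component (centred X and y assumed; NumPy): the PCR direction is the leading right singular vector of X (maximum variance), while the first PLS weight is proportional to Xᵀy (maximum covariance with y), as in the NIPALS algorithm.

```python
import numpy as np

def first_directions(Xc, yc):
    # PCR: leading right singular vector of centred X (maximum variance)
    w_pcr = np.linalg.svd(Xc, full_matrices=False)[2][0]
    # PLS: normalized X^T y (maximum covariance with y)
    w_pls = Xc.T @ yc
    w_pls /= np.linalg.norm(w_pls)
    return w_pcr, w_pls
```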

  18. Principal component regression (PCR) • Uses principal components • Solves the collinearity problem, stable solutions • Provides plots for interpretation (scores and loadings) • Well understood • Outlier diagnostics • Easy to modify • But uses only X to determine components

  19. PLS-regression • Easy to compute • Stable solutions • Provides scores and loadings • Often needs fewer components than PCR • Sometimes gives better predictions

  20. PCR and PLS for several Y-variables • PCR is computed for each Y: each Y is regressed onto the principal components • PLS: the algorithm is easily modified; it maximises the covariance between linear combinations of X and Y • Both methods provide regression equations and plots
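
If scikit-learn is available, a minimal sketch of PLS with several Y-variables (PLS2); all data shapes and values here are illustrative:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 10))                                             # e.g. spectra
Y = X[:, :2] @ rng.normal(size=(2, 3)) + 0.1 * rng.normal(size=(30, 3))  # e.g. sensory data

pls = PLSRegression(n_components=2)    # A = 2 components
pls.fit(X, Y)
Y_hat = pls.predict(X)                 # predicted values for all Y-variables
T = pls.x_scores_                      # scores for plots and interpretation
```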

  21. Validation is important • Measure quality of the predictor • Determine A – number of components • Compare methods

  22. Prediction testing Calibration: estimate the coefficients on one data set. Testing/validation: predict y on a separate test set, using those coefficients.

  23. Cross-validation Repeatedly leave out part of the data: calibrate (find y = f(x), estimate the coefficients) on the rest, then predict y for the left-out samples using those coefficients.

  24. Validation • Compute RMSEP = sqrt( Σ(yi − ŷi)² / n ), the root mean square error of prediction • Plot RMSEP versus the number of components • Choose the number of components with the best RMSEP properties • Compare different methods
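
A sketch of computing the RMSEP curve by cross-validation (scikit-learn assumed; PLS is used as the regression method here, but PCR could be swapped in):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_predict

def rmsep_curve(X, y, max_A, cv=10):
    """Cross-validated RMSEP for A = 1..max_A components."""
    rmsep = []
    for A in range(1, max_A + 1):
        y_cv = cross_val_predict(PLSRegression(n_components=A), X, y, cv=cv)
        rmsep.append(np.sqrt(np.mean((y - y_cv.ravel()) ** 2)))
    return rmsep   # plot against component number 1..max_A
```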

  25. Example: RMSEP plot for NIR calibration of protein in wheat (6 NIR wavelengths; 12 calibration samples, 26 test samples; MLR shown for comparison).

  26. Conceptual illustration of important phenomena: the trade-off between estimation error and model error.

  27. Prediction vs. cross-validation • Prediction testing measures the prediction ability of the predictor at hand, but requires a lot of data • Cross-validation measures a property of the method and is better suited to smaller data sets

  28. Validation • One should also plot measured versus predicted y-values • A correlation can be computed, but it can sometimes be misleading

  29. Example: plot of measured versus predicted y (protein, NIR calibration).

  30. Outlier detection • Instrument error or noise • Drift of signal (over time) • Misprints • Samples outside normal range (different population)

  31. Outlier detection • Outliers can be detected because the methods provide a model for the spectral data (X = TPᵀ + E) and a model for the relationship between X and y (y = Tq + f); samples with large residuals stand out

  32. Outlier detection tools • Residuals: X- and y-residuals. X-residuals as before; the y-residual is the difference between measured and predicted y • Leverage hi: how far object i lies from the centre of the model
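
A sketch of these two diagnostics for a fitted compression model (NumPy; T and P assumed to come from centred data as in the model X = TPᵀ + E):

```python
import numpy as np

def leverages(T):
    """Leverage h_i = 1/n + t_i' (T'T)^{-1} t_i, from the score matrix T."""
    n = T.shape[0]
    G = np.linalg.inv(T.T @ T)
    return 1.0 / n + np.einsum('ij,jk,ik->i', T, G, T)

def x_residual_norms(Xc, T, P):
    """Row-wise X-residual norms for the model Xc = T P^T + E."""
    E = Xc - T @ P.T
    return np.sqrt((E ** 2).sum(axis=1))   # one residual value per object
```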
