multiway data analysis n.
Skip this Video
Loading SlideShow in 5 Seconds..
Multiway Data Analysis PowerPoint Presentation
Download Presentation
Multiway Data Analysis

Multiway Data Analysis

154 Views Download Presentation
Download Presentation

Multiway Data Analysis

- - - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

  1. Multiway Data Analysis Johan Westerhuis Biosystems Data Analysis Swammerdam Institute for Life Sciences Universiteit van Amsterdam

  2. The “future” science faculty of the Universiteit van Amsterdam

  3. The Biosystems Data Analysis group officially started in 2004 as a follow up of the process analysis group at the Universiteit van Amsterdam. Its aims are: Developing and validation of new data analysis methods for summarizing and visualizing complex structured biological data (Metabolomics / Proteomics).

  4. Three-way Data • Three-way Models • Three-way Applications

  5. Three-way Data

  6. Three-way data • Three-way data is a set of two-way matrices of the same objects and variables. • IR, Raman, NMR spectra of the same samples will not give a three-way data set, but a multi-block data set. IR Raman NMR

  7. Examples of three-way data UV Emission Time Chromato graphy Batch Process Fluorescence Samples Samples Batches Chromatogram Process variables Excitation RGB Judges Sensory Analysis Image Analysis Image Products Attributes Image

  8. From noway to multi-way 1 Scalar J J K K 1 J 1 1 4-way 1-way 1 L I I I J J K K J 5-way 2-way 1 L I I I J J J K K K 3-way M I I I

  9. Slabs and tubes Vertical tube Frontal slab Vertical slab Lateral tube Horizontal tube Horizontal slab

  10. Three slabs of fluorescence data5 Samples x 60 Excitation x 200 Emission Emission Fluorescence Samples Excitation

  11. time time batch process variable process variable Three-way batch process data • ‘Engineering’ process data i.e. temperature, pressure, flow rate • Spectroscopic process data i.e. NIR, Raman, UV-Vis One batch A series of batches X (JK) X (IJK)

  12. SBR batch process dataEngineering variables

  13. Spectroscopic three-way batch data 2 batch runs of a reaction followed with UV-Vis spectroscopy during 45 minutes

  14. Batch Fermentation in two steps: Threeway multiblock API Inoculum Batches Time Variables Fermentation Batches Time Variables

  15. Composition Conditions Composition What we measure Conditions What we want ... ... ... ... ... ... ... ... Four-way data in combinatorial catalysis

  16. Experiments Time Metabolites Experiments Time Gene expression Multiway data from the Omics age

  17. Three-way Models

  18. M.C. Escher: Some history Small problem with orthogonality

  19. More history • Psychometrics (1944-1980) • Catell 1944: Parallel Proportional profiles (Common factors fitted simultaneously to many data matrices). • Tucker 1964: Tucker models • Carroll & Chang 1970: Canonical Decomposition (CANDECOMP) • Harshman 1970: Parallel Factor Analysis (PARAFAC) • Chemistry • Ho 1978: Rank Annihilation (close to Parafac) on fluorescence data. • End 80’s beginning 90’s: Threeway methods to resolve LC-UV data.

  20. Multiway PCA:Unfolding of three-way data J K J JK K I I I J IK MacGregor Wold

  21. Two ways of unfoldingDifferent assumptions in MSPC • Wold • Nonlinear behavior in the data • Batch trajectories are monitored • Online monitoring • MacGregor • Nonlinearities removed • Whole batch is considered a measurement • Off-line monitoring

  22. Extension of SVD to Parafac VT X U v1T v2T = = + S u1 u2 b1 b2 B c1 c2 X A CT + = G = a1 a2

  23. Parafac / Candecomp • Parafac is not sequential • Need to re-estimate whole model when more components are calculated [no deflation]. • Parafac solution is unique • No rotational freedom • Changing parameters will reduce the fit. • NB! A PCA model is not unique • X = T*PT + E = T*R*R-1*PT + E = C*ST + E • Unique ≠ true

  24. Extension of Two Mode component Analysis (TMCA) P R CT G X A = P R B Q Q P CT X A G Tucker III P R = R

  25. Tucker models G • Tucker I, • Tucker II, • Tucker III Equals MPCA X A = CT G X A = B CT G X A =

  26. Tucker models • Core array can be fully filled • PxQxR triads (1,1,1 / 1,1,2 / 1,2,1 etc) • Not unique rotational freedom • Components can be rotated towards orthogonality. • Not sequential • Restricted Tucker models can be developed when using prior chemical knowledge

  27. Number of parameters • X(IxJxK) example I=50, J=9, K=100, • P = Q = R = 3 • Parafac: Rx(I + J + K) 477 • Tucker3: PxI + QxJ + RxK + PxQxR 504 • MPCA: Rx(I + JK) 2850 • Fit MPCA > Parafac (Overfit?)

  28. Soft models vs hard models • Two-way bilinear model: • Beer’s law • PCA • Trilinear model: • Parafac • Fluorescence No orthogonal constraints Orthogonal constraints No orthogonal constraints

  29. Multiway Regression I y Y X • Two step approach: Decomposition of X to A and model Regression of y on A Can be Parafac, Tucker, MPCA etc No information of Y is used in the decomposition Similar to PCR method

  30. Multiway Regression II y Y X • Direct approach Now X is decomposed with y in mind. This leads to a not optimal decomposition of X but an improved fit of y.

  31. Indicator variable Time When data are not exactly 3-way batch time process variable Time / Variable variable Indicator variable Time

  32. Alignment problems • Peakshifts in LCMS/GCMS • Warping methods to align the peaks • Dynamic Time Warping • Correlation optimized warping

  33. Three-way Applications

  34. Fluorescence data • 5 samples with varying concentration of tyrosine, tryptophan and phenylalanine dissolved in phosphate buffered water. • Excitation wavelength: 240 – 300 nm • Emission wavelength: 250 – 450 nm

  35. Unfold PCA model of Fluorescence data 99.97% explained with 3 PC’s Loadings refolded into Excitation / Emission form Overfit of data: Loading 2 has negative parts. This is not according fluorescence theory.

  36. Parafac model of Fluorescence data 99.93% explained variation: Good Fit Loadings are very well interpretable. Intensity in A mode can be related to concentration B and C mode A mode

  37. Fluorescence data Florescence data perfectly fits the trilinear model that is applied by Parafac Due to uniqueness property of Parafac, the loadings found will perfectly resemble the Emission spectra and Excitation spectra of the three compounds in de mixtures. This is a nice example of Mathematical chromatography

  38. Batch reaction monitoring • Pseudo-first-order reaction: A + BC D + E • UV-Vis spectrum (300-500nm) measured every 10 seconds. • Obeys Lambert-Beer law • 35 NOC batches. X (35  201  271) • In addition, some disturbed batches were measured • pH disturbance during the reaction • Temperature change • Impurity

  39. Aims and goals of research I • Data modelling: • Improve understanding of process by interpretation of model parameters • Analysis of historical batches: • Are the current process measurements able to distinguish between ‘good’ and ‘bad’ batches? • On-line monitoring: • Rapid fault detection • Easier fault diagnosis: what is the cause of the fault? • Prediction of batch duration

  40. Aims and goals of research II Which batch is different ?

  41. Unfold PCA model • Unfold keeping the batch direction (IxJK) X PT T E = +

  42. Unfold PCA model Many parameters estimated, likely to overfit the data

  43. Unrestricted Parafac model • The simplest three-way model is the PARAFAC model: C + = I B E X batch time A wavelengths

  44. Unrestricted Parafac model • Loadings are highly correlated - solution may be unstable. • Model is difficult to interpret. • 99.4% fit • Can external knowledge of the process be used to improve the model?

  45. Grey Modelling of batch data ‘Black-box’ or ‘soft’ models are empirical models which aim to fit the data as well as possible e.g. PCA, neural networks. ‘White’ or ‘hard’ models use known external knowledge of the process e.g. physicochemical model, mass-energy balances. + Easy to interpret Not always available Good fit Difficult to interpret Good fit ‘Grey’ or ‘hybrid’ models combine the two. University of Amsterdam

  46. Modelling batch data + + = white part black part E X Systematic variation due to known causes Systematic variation due to unknown causes Unsystematic variation Total variation

  47. Pure Spectra Reaction kinetics External information • Incorporating external information can • increase model interpretability • increase model stability

  48. Restricted ‘white’ model • External information is introduced in the form of parameter restrictions: KNOWN SPECTRA REACTION KINETICS C + = G B E X batch time A wavelengths LAMBERT-BEER LAW

  49. Restricted Tucker model • Model is stable. • 97.6% fit - lower than for black model • Some systematic variation in the data is left unexplained by this model.

  50. Grey model White components Black components describe known effects can be interpreted • 99.8% fit (corresponds well with estimated level of spectral noise of  0.13%)