Metabolomics: A Promising ‘omics Science. By Susan Simmons, University of North Carolina Wilmington. Collaborators: Dr. David Banks, Duke; Dr. Chris Beecher, University of Michigan; Dr. Xiaodong Lin, University of Cincinnati; Dr. Young Truong, UNC; Dr. Jackie Hughes-Oliver, NC State.
Genomics – 25,000 genes
Transcriptomics – 100,000 transcripts
Proteomics – 1,000,000 proteins
Metabolomics – 1,800 compounds
To obtain data, a tissue sample is taken from a patient. Then:
-grouping related ions
No Interpretation Interface
Sample preparation involves stabilizing the sample, adding spiked-in calibrants, and creating multiple aliquots (some of which are frozen) for QC purposes. This step is performed robotically.
Sources of error in this step include:
The result is a set of m/z ratios and timestamps for each ion, which can be viewed as a 2-D histogram in the m/z × time plane.
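As a rough illustration of the 2-D histogram view, the sketch below bins (m/z, time) pairs into a grid. The bin widths, the function name `bin_ions`, and the four ion records are hypothetical and chosen only for illustration; they are not taken from the instrument software.

```python
from collections import Counter

def bin_ions(ions, mz_width=1.0, time_width=10.0):
    """Count ions per (m/z bin, time bin) cell.

    ions: iterable of (mz, time) pairs; the bin widths are illustrative assumptions.
    """
    counts = Counter()
    for mz, t in ions:
        # Floor division maps each measurement to the index of its bin.
        cell = (int(mz // mz_width), int(t // time_width))
        counts[cell] += 1
    return counts

# Hypothetical ion records: (m/z ratio, timestamp in seconds).
ions = [(100.2, 3.0), (100.7, 8.5), (100.4, 12.0), (250.1, 3.3)]
hist = bin_ions(ions)
for cell, n in sorted(hist.items()):
    print(cell, n)
```

Each key of `hist` is one cell of the m/z × time plane, and its value is the number of ions that fell into that cell.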
One now estimates the amount of each metabolite. This entails normalization, which also introduces error.
The caveats pointed out in Baggerly et al. (Proteomics, 2003) apply.
Let $z$ be the vector of raw data, and let $x$ be the vector of estimates. Then the measurement equation is:
$$G(z) = x = \mu + \varepsilon$$
where $\mu$ is the vector of unknown true values and $\varepsilon$ is decomposable into separate components.
For metabolite $i$, the estimate $X_i$ is:
$$X_i = g_i(z) = \ln \sum_j w_{ij} \iint \left[ s_m(z) - c(m,t) \right] \, dm \, dt.$$
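The law of propagation of error (essentially the delta method) says that the variance of $X = g(z)$ is approximately
$$\mathrm{Var}[X] \approx \sum_{i=1}^{n} \left( \frac{\partial g}{\partial z_i} \right)^2 \mathrm{Var}[z_i] + 2 \sum_{i<k} \frac{\partial g}{\partial z_i} \frac{\partial g}{\partial z_k} \, \mathrm{Cov}[z_i, z_k],$$
where each pair $(i, k)$ is counted once in the covariance sum.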
The weights depend on the values of the spiked-in calibrants, so the calculation becomes complicated.
Cross-platform experiments are also crucial for medical use. This leads to key comparison designs, in which the same sample (or aliquots of a standard solution or sample) is sent to multiple labs. Each lab produces its own spectrogram.
It is impossible to decide which lab is best, but one can estimate how to adjust for interlab differences.
The Mandel bundle-of-lines model is what we suggest for interlaboratory comparisons. This assumes:
$$X_{ik} = \alpha_i + \beta_i \theta_k + \varepsilon_{ik}$$
where $X_{ik}$ is the estimate at lab $i$ for metabolite $k$, $\theta_k$ is the unknown true quantity of metabolite $k$, and
$$\varepsilon_{ik} \sim N(0, \sigma_{ik}^2).$$
To solve the equations given values from the labs, one must impose constraints. A Bayesian can put priors on the laboratory coefficients and the error variance.
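One non-Bayesian way to see how constraints make the model estimable is sketched below. It assumes the constraints that the lab intercepts average to 0 and the lab slopes average to 1 (a common choice, but an assumption here, not something the talk specifies). Under those constraints the column means estimate the true quantities, and each lab's line is fitted by ordinary least squares against them. The data are hypothetical and noise-free so the recovery can be checked exactly.

```python
def fit_bundle_of_lines(X):
    """Fit X[i][k] = alpha_i + beta_i * theta_k by regressing each lab on column means.

    Assumed constraints: mean(alpha) = 0 and mean(beta) = 1, so the column
    mean of X is used as the estimate of theta_k.
    """
    n_labs, n_met = len(X), len(X[0])
    theta_hat = [sum(X[i][k] for i in range(n_labs)) / n_labs
                 for k in range(n_met)]
    t_bar = sum(theta_hat) / n_met
    sxx = sum((t - t_bar) ** 2 for t in theta_hat)
    alphas, betas = [], []
    for row in X:
        x_bar = sum(row) / n_met
        sxy = sum((t - t_bar) * (x - x_bar) for t, x in zip(theta_hat, row))
        beta = sxy / sxx                     # OLS slope for this lab
        betas.append(beta)
        alphas.append(x_bar - beta * t_bar)  # OLS intercept for this lab
    return alphas, betas, theta_hat

# Hypothetical noise-free data: 3 labs, 4 metabolites.
true_theta = [1.0, 2.0, 4.0, 8.0]
true_alpha = [-0.5, 0.0, 0.5]   # averages to 0
true_beta = [0.9, 1.0, 1.1]     # averages to 1
X = [[a + b * t for t in true_theta] for a, b in zip(true_alpha, true_beta)]

alphas, betas, theta_hat = fit_bundle_of_lines(X)
print(alphas, betas, theta_hat)
```

With real data the error term would make the recovery approximate, and a Bayesian treatment would replace the hard constraints with priors on the laboratory coefficients.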
Metabolomics needs a multivariate version, with models for the rates at which compounds volatilize.
Dealing with missing values
Prediction and Classification
We had abundance data on 317 metabolites from 63 subjects. Of these, 32 were healthy, 22 had ALS but were not on medication, and 9 had ALS and were taking medication.
The goal was to classify the two ALS groups and the healthy group.
Here p > n (317 metabolites, 63 subjects). Also, some abundances were below the detection limit.
Using the Breiman-Cutler code for Random Forests, the out-of-bag error rate was 7.94% (5 of 63 subjects misclassified); 29 of the ALS patients and 29 of the healthy patients were correctly classified.
20 of the 317 metabolites were important in the classification, and three were dominant.
RF can detect outliers via proximity scores. It flagged four.
Several support vector machine approaches were tried on this data:
The SCAD SVM had the best leave-one-out (LOO) error rate, 14.3%.
Robust SVD (Liu et al., 2003) is used to simultaneously cluster patients (rows) and metabolites (columns). Given the patient by metabolite matrix X, one writes
$$X_{ik} = r_i c_k + \varepsilon_{ik}$$
where $r_i$ and $c_k$ are row and column effects. Then one can sort the array by the effect magnitudes.
To compute an rSVD, use alternating L1 regression, without an intercept, to estimate the row and column effects. First fit the row effects as a function of the column effects, then reverse the roles and repeat. Robustness stems from using L1 regression rather than OLS.
Repeating the procedure on the residuals gives the second singular component.
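The alternating scheme can be sketched in a few lines. This is a minimal illustration, not the Liu et al. implementation: it relies on the fact that L1 regression through the origin reduces to a weighted median of the ratios y/x with weights |x|, starts from column medians (an assumed initialization), and runs a fixed number of sweeps. The 4 × 3 matrix is hypothetical, with one planted outlier.

```python
from statistics import median

def weighted_median(values, weights):
    """Return a value at which the cumulative weight first reaches half the total."""
    pairs = sorted(zip(values, weights))
    half = sum(weights) / 2.0
    running = 0.0
    for value, weight in pairs:
        running += weight
        if running >= half:
            return value
    return pairs[-1][0]

def l1_slope_through_origin(y, x):
    """Minimize sum |y_j - b * x_j| over b: weighted median of y_j / x_j, weights |x_j|."""
    ratios = [yj / xj for yj, xj in zip(y, x) if xj != 0]
    weights = [abs(xj) for xj in x if xj != 0]
    return weighted_median(ratios, weights)

def robust_rank_one(X, n_iter=20):
    """Alternate L1 fits of row effects given column effects, then the reverse."""
    n_rows, n_cols = len(X), len(X[0])
    col = [median([X[i][k] for i in range(n_rows)]) for k in range(n_cols)]
    for _ in range(n_iter):
        row = [l1_slope_through_origin(X[i], col) for i in range(n_rows)]
        col = [l1_slope_through_origin([X[i][k] for i in range(n_rows)], row)
               for k in range(n_cols)]
    return row, col

# Hypothetical rank-one matrix (rows 1..4 times columns 2, 1, 3) with one outlier.
X = [[2.0 * r, 1.0 * r, 3.0 * r] for r in (1.0, 2.0, 3.0, 4.0)]
X[0][0] = 50.0  # planted outlier; the clean value would be 2.0

row, col = robust_rank_one(X)
fitted = [[ri * ck for ck in col] for ri in row]
residuals = [[X[i][k] - fitted[i][k] for k in range(3)] for i in range(4)]
print(fitted[0][0], residuals[0][0])
```

The row and column effects are only identified up to a scale factor, so the products `row[i] * col[k]` are the meaningful quantities. Applying `robust_rank_one` to `residuals` would give the second component, as described above.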