Statistics in Metabolomics
David Banks, ISDS, Duke University

1. Background
Metabolomics is the next step after genomics and proteomics.
There are about 25,000 genes, most of which have unknown functions.
There are about 1,000,000 proteins, most of which are unstudied.
Metabolites are low-molecular-weight compounds produced in the course of processing raw materials.
They are produced in metabolic pathways, such as the Krebs (citrate) cycle for the oxidation of glucose.
This gives metabolomics an edge.
[Figure: Biochemical Profile Map]
There is less raw information than for other *omics, but more context.
To obtain data, a tissue sample is taken from a patient. Then:
Sources of error in this step include:
Some laboratories use MALDI-TOF equipment, and the error sources are slightly different.
One now estimates the amount of each metabolite. This entails normalization, which also introduces error.
The caveats pointed out in Baggerly et al. (Proteomics, 2003) apply.
The classical NIST approach to this is to:
See Cameron, “Error Analysis,” ESS Vol. 9, 1982.
$G(z) = x = \mu + \varepsilon$
where $\mu$ is the vector of unknown true values and $\varepsilon$ is decomposable into separate components.
For metabolite $i$, the estimate $X_i$ is:
$g_i(z) = \ln \sum_j w_{ij} \iint [s_m(z) - c(m,t)] \, dm \, dt.$
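By first-order propagation of error, the variance of the estimate is approximately
$\mathrm{Var}[g(z)] \approx \sum_{i=1}^{n} (\partial g/\partial z_i)^2 \, \mathrm{Var}[z_i] + 2 \sum_{i<k} (\partial g/\partial z_i)(\partial g/\partial z_k) \, \mathrm{Cov}[z_i, z_k]$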
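The propagation-of-error formula can be checked numerically. The following is a minimal sketch, not the NIST procedure itself: the function `g`, the weights, the input values and the covariance matrix are all hypothetical, and the partial derivatives are approximated by central finite differences rather than derived analytically.

```python
import math

def numerical_gradient(g, z, h=1e-6):
    """Central-difference approximation to the partial derivatives of g at z."""
    grad = []
    for i in range(len(z)):
        up = list(z)
        up[i] += h
        down = list(z)
        down[i] -= h
        grad.append((g(up) - g(down)) / (2 * h))
    return grad

def propagated_variance(g, z, cov):
    """First-order (delta-method) approximation to Var[g(z)].

    cov[i][k] holds Cov[z_i, z_k]; the diagonal holds Var[z_i].
    Summing over all ordered pairs (i, k) counts each variance once
    and each distinct covariance pair twice.
    """
    grad = numerical_gradient(g, z)
    n = len(z)
    return sum(grad[i] * grad[k] * cov[i][k]
               for i in range(n) for k in range(n))

# Hypothetical example: g is the log of a weighted sum of two peak areas.
weights = [0.7, 0.3]

def g(z):
    return math.log(weights[0] * z[0] + weights[1] * z[1])

z0 = [10.0, 20.0]            # assumed measured inputs
cov = [[0.25, 0.05],         # assumed covariance matrix of the inputs
       [0.05, 1.00]]

var_g = propagated_variance(g, z0, cov)
print(f"approximate Var[g(z)] = {var_g:.6f}")
```

The approximation is only as good as the linearization of `g` near the measured values, so it is most trustworthy when the input variances are small relative to the curvature of `g`.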
The weights depend upon the values of the spiked-in calibrants, so this gets complicated.
It is impossible to decide which lab is best, but one can estimate how to adjust for interlab differences.
$X_{ik} = \alpha_i + \beta_i \theta_k + \varepsilon_{ik}$
where $X_{ik}$ is the estimate at lab $i$ for metabolite $k$, $\theta_k$ is the unknown true quantity of metabolite $k$, and
$\varepsilon_{ik} \sim N(0, \sigma_{ik}^2)$.
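To illustrate how the interlab model supports an adjustment, here is a minimal sketch under a strong simplifying assumption that the slides do not make: a set of calibrant samples whose true quantities are treated as known. Under that assumption each lab's offset and slope can be fit by ordinary least squares, and a new measurement can be mapped back to the common scale. The lab names, parameter values and the single shared error standard deviation are all hypothetical; the model above allows a separate variance for each lab and metabolite.

```python
import random

def fit_lab_line(theta, x):
    """Ordinary least squares for x = alpha + beta * theta (one lab)."""
    n = len(theta)
    mean_t = sum(theta) / n
    mean_x = sum(x) / n
    sxx = sum((t - mean_t) ** 2 for t in theta)
    sxy = sum((t - mean_t) * (v - mean_x) for t, v in zip(theta, x))
    beta = sxy / sxx
    alpha = mean_x - beta * mean_t
    return alpha, beta

def adjust(x, alpha, beta):
    """Map a lab's reading back to the common scale: (x - alpha) / beta."""
    return (x - alpha) / beta

random.seed(0)
theta = [1.0, 2.0, 4.0, 8.0, 16.0]                 # assumed known calibrant quantities
true_params = {"lab_A": (0.5, 1.10),               # hypothetical (alpha, beta) per lab
               "lab_B": (-0.2, 0.90)}
sigma = 0.05                                       # one shared error SD, for simplicity

fits = {}
for lab, (alpha, beta) in true_params.items():
    # Simulate readings from the model X = alpha + beta * theta + error.
    x = [alpha + beta * t + random.gauss(0.0, sigma) for t in theta]
    fits[lab] = fit_lab_line(theta, x)
    print(lab, "alpha_hat=%.3f beta_hat=%.3f" % fits[lab])
```

When the true quantities are not known, the offsets and slopes are identified only relative to a reference lab, so one lab would have to be fixed (for example with offset 0 and slope 1) before the others can be estimated.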
Metabolomics needs a multivariate version, with models for the rates at which compounds volatilize.
We plan to use this model to compare the Metabolon lab in RTP to Chris Newgard’s lab at Duke.
A classic problem in proteomics is to locate peaks and estimate their area or volume.
Unlike proteomics, metabolite peak location is mostly known. So Bayesian methods seem good (cf. Clyde and House). Metabolon uses proprietary software.
Different tools are appropriate for different kinds of metabolomic studies. The work we have done focuses on:
The goal was to classify the two ALS groups and the healthy group.
Here p>n. Also, some abundances were below detectability.
20 of the 317 metabolites were important in the classification, and three were dominant.
RF can detect outliers via proximity scores. There were four such.
The SCAD SVM had the best leave-one-out (LOO) error rate, 14.3%.
$\min_{b,w} \sum_{i=1}^{n} [1 - y_i (b + w^T x_i)]_+ + \lambda \sum_{k=1}^{p} |w_k|$
where the first sum is over the $n$ samples and the second is over the $p$ metabolites.
SCAD replaces the L1 penalty with a nonconvex penalty.
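The two penalties can be compared directly. The sketch below evaluates the hinge loss, the L1 penalty and the SCAD penalty in the form given by Fan and Li (2001), with their suggested default a = 3.7. It only evaluates the objective; it is not an SVM solver, and the tiny data set and coefficient values are hypothetical.

```python
def hinge_loss(b, w, xs, ys):
    """Sum over samples of [1 - y_i (b + w^T x_i)]_+ with labels in {-1, +1}."""
    total = 0.0
    for x, y in zip(xs, ys):
        margin = y * (b + sum(wk * xk for wk, xk in zip(w, x)))
        total += max(0.0, 1.0 - margin)
    return total

def l1_penalty(w, lam):
    """L1 penalty: lam * sum of |w_k|. Grows without bound in |w_k|."""
    return lam * sum(abs(wk) for wk in w)

def scad_penalty(w, lam, a=3.7):
    """SCAD penalty summed over coefficients (requires a > 2).

    Matches the L1 penalty for |w_k| <= lam, bends quadratically up to
    a * lam, and is constant beyond that, so large coefficients are not
    shrunk further.
    """
    total = 0.0
    for wk in w:
        t = abs(wk)
        if t <= lam:
            total += lam * t
        elif t <= a * lam:
            total += (2 * a * lam * t - t * t - lam * lam) / (2 * (a - 1))
        else:
            total += (a + 1) * lam * lam / 2
    return total

# Hypothetical two-sample, two-feature example.
xs = [[2.0, 0.0], [-1.0, 1.0]]
ys = [1, -1]
b, w, lam = 0.0, [0.5, 3.0], 0.5

loss = hinge_loss(b, w, xs, ys)
print("hinge loss      :", loss)
print("L1 objective    :", loss + l1_penalty(w, lam))
print("SCAD objective  :", loss + scad_penalty(w, lam))
```

Because the SCAD penalty levels off, the large coefficient contributes less than it does under L1, which is the motivation for using it when a few metabolites dominate the classification.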
A further multiple tree analysis with FIRMPlus™ software from the GoldenHelix Co. did not achieve good classification.
So Random Forests wins. And the selected metabolites make sense.
$X_{ik} = r_i c_k + \varepsilon_{ik}$
where $r_i$ and $c_k$ are row and column effects. Then one can sort the array by the effect magnitudes.
Doing similar work on the residuals gives the second singular value solution.
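The rank-one fit and the sorting step can be sketched as follows. This is an illustration rather than the analysis actually run: it uses alternating least squares, which for generic starting values converges to the leading singular vectors, in place of a library SVD routine, and the small data matrix is hypothetical.

```python
def rank_one_fit(X, iterations=200):
    """Alternating least squares for X_ik ~ r_i * c_k.

    Each half-step is an exact least-squares update of one factor given
    the other; the product r_i * c_k converges to the best rank-one
    approximation of X for generic starting values.
    """
    n, p = len(X), len(X[0])
    c = [1.0] * p
    r = [0.0] * n
    for _ in range(iterations):
        cc = sum(v * v for v in c)
        r = [sum(X[i][k] * c[k] for k in range(p)) / cc for i in range(n)]
        rr = sum(v * v for v in r)
        c = [sum(X[i][k] * r[i] for i in range(n)) / rr for k in range(p)]
    return r, c

def residuals(X, r, c):
    """Subtract the fitted rank-one layer from X."""
    return [[X[i][k] - r[i] * c[k] for k in range(len(c))]
            for i in range(len(r))]

# Hypothetical abundance matrix: rows are samples, columns are metabolites.
X = [[2.0, 4.0, 6.0],
     [1.0, 2.0, 3.0],
     [3.0, 6.0, 9.5]]

r, c = rank_one_fit(X)

# Sort rows and columns by the magnitude of their fitted effects.
row_order = sorted(range(len(r)), key=lambda i: abs(r[i]), reverse=True)
col_order = sorted(range(len(c)), key=lambda k: abs(c[k]), reverse=True)
sorted_X = [[X[i][k] for k in col_order] for i in row_order]

# Repeating the fit on the residuals gives the second layer.
R = residuals(X, r, c)
r2, c2 = rank_one_fit(R)

print("row order:", row_order)
print("column order:", col_order)
```

Only the product of the two factors is determined; the individual scales of the row and column effects depend on the starting values, which does not affect the sort order.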
The NIH wanted to decide whether amniotic fluid samples from women in preterm labor could support classification:
As before, Random Forests gave the best results. The various SVMs were about 5-10% less predictive.
The main information was contained in amino acids and carbohydrates.
                        Predicted
                  Term   Inflamm.   No Inf.
True   Term         39          1         0
       Inflamm.      7         32         1
       No Inf.       2          2        29
RF accuracy was 100/113 = 88.5%.
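The accuracy figure can be reproduced from the table. The sketch below reads the rows as true classes and the columns as predicted classes, which is an interpretation of the table layout, and also computes the per-class fraction correctly classified.

```python
labels = ["Term", "Inflamm.", "No Inf."]

# Rows: true class; columns: predicted class (assumed orientation).
confusion = [
    [39, 1, 0],
    [7, 32, 1],
    [2, 2, 29],
]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(labels)))
accuracy = correct / total

# Fraction of each true class that was classified correctly.
recall = {labels[i]: confusion[i][i] / sum(confusion[i])
          for i in range(len(labels))}

print(f"correct = {correct}, total = {total}, accuracy = {accuracy:.4f}")
for name, value in recall.items():
    print(f"{name}: {value:.3f}")
```

Most of the errors fall in the second row: seven of the 40 samples in that true class were assigned to the first class.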
For those who had preterm delivery without inflammation, both amino acids and carbohydrates were low.
For those who had inflammation, the carbohydrates were very low and the amino acids were high.