Biostatistics Case Studies 2010
Session 5: Microarray Statistics
Peter D. Christenson, Biostatistician
http://gcrc.LABioMed.org/Biostat
Case Study: A compound found in red grapes improves the health and lifespan of mice on a high calorie diet.
Treatment Groups
What statistical analyses were done here?
Fourteen Microarray “Experiments”: each of 5+5+4 mice had a separate array run for ~40,000 genes.
Data table: 38,348 rows, one per gene. The columns shown are the arrays for the first 2 SD mice; the other 12 arrays continue to the right.
HCR over-expressed, compared to HC.
How were results for (a) and (b) calculated?
HCR under-expressed, compared to HC.
Suppose we use a t-test to compare the mean of 5 appropriately scaled expression values for a gene in one group with the mean of 5 values in another group.
So, we need a difference of about 2 SD in gene expression to be fairly sure (80% power) of detecting this gene with only N=5+5.
This is a large effect – see next slide.
A 2 SD effect corresponds to moving from the 50th to about the 98th percentile (97.7%), about 2/5 of the normal range.
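The 80% figure and the percentile shift can be checked numerically. The sketch below is only an illustration, not the calculation used for the slides. It assumes a two-sided, pooled-variance t-test at alpha = 0.05, normally distributed expression values with the same SD in both groups, and SciPy's noncentral t distribution. The function name `power_two_sample_t` is hypothetical.

```python
import math
from scipy import stats

def power_two_sample_t(effect_sd, n_per_group, alpha=0.05):
    """Power of a two-sided, equal-variance two-sample t-test.

    effect_sd: true mean difference in SD units (common SD assumed).
    """
    df = 2 * n_per_group - 2
    ncp = effect_sd * math.sqrt(n_per_group / 2)    # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    # Probability that |T| exceeds the critical value when the effect is real
    return stats.nct.sf(t_crit, df, ncp) + stats.nct.cdf(-t_crit, df, ncp)

power = power_two_sample_t(effect_sd=2.0, n_per_group=5)
percentile = stats.norm.cdf(2.0) * 100              # percentile reached by a +2 SD shift
print(f"power = {power:.2f}, percentile = {percentile:.1f}")
```

Under these assumptions the power comes out close to 0.8. Calling the function with a smaller `effect_sd`, such as 1.0, shows how quickly power falls off with only 5 mice per group.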
So, how can we try to avoid missing genes that are important, but are not detected with p<0.05?
Recall that p<0.05 corresponds to approximately:
|t| = |effect/SE(effect)| = |Δ/SE(Δ)| = |signal/noise| > 2
where the noise is roughly proportional to SD/sqrt(N).
Here, SD is the SD among the expressions for 5 mice in a group.
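The "|t| > 2" rule is a large-sample approximation. A quick check, assuming a two-sided test at the 0.05 level and using SciPy's t distribution, suggests the exact threshold is somewhat higher for groups as small as those here. The dictionary name `crit` is just for illustration.

```python
from scipy import stats

# Two-sided 5% critical values of t for several degrees of freedom.
# df = 8 corresponds to two groups of 5; df = 7 to groups of 4 and 5.
crit = {df: stats.t.ppf(0.975, df) for df in (7, 8, 30, 1000)}

for df, t_crit in crit.items():
    print(f"df = {df:4d}: |t| must exceed {t_crit:.3f}")
```

With df = 8 the threshold is about 2.3 rather than 2, so the signal-to-noise ratio required with N=5+5 is a little larger than the rule of thumb suggests.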
How can we “reduce SD”? Isn’t it natural subject-to-subject heterogeneity, a characteristic of the population?
This SD is among measured expression values, so it includes both array-to-array error and subject-to-subject heterogeneity. (The two are confounded: there is no internal control.)
We try to statistically remove some of the inherent array-to-array error through normalization.
Raw expression is normalized within each array by z-scores on log(expression).
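A minimal sketch of this kind of within-array normalization follows, using NumPy on simulated data. The function name `zscore_log_expression`, the toy data, and the use of the sample SD (`ddof=1`) are assumptions for illustration; the slides do not specify these details.

```python
import numpy as np

def zscore_log_expression(raw):
    """Normalize each array (column) to z-scores of log expression.

    raw: 2-D array, rows = genes, columns = arrays; values must be > 0.
    """
    log_expr = np.log(raw)                        # the log base does not affect z-scores
    mean = log_expr.mean(axis=0, keepdims=True)   # per-array mean over genes
    sd = log_expr.std(axis=0, ddof=1, keepdims=True)
    return (log_expr - mean) / sd

# Toy data: 1,000 hypothetical genes on 3 arrays with different overall intensity
rng = np.random.default_rng(0)
raw = rng.lognormal(mean=[5.0, 6.0, 7.0], sigma=1.0, size=(1000, 3))
z = zscore_log_expression(raw)
print(z.mean(axis=0).round(3), z.std(axis=0, ddof=1).round(3))
```

After normalization every array has mean 0 and SD 1 on the log scale, so an overall brightness difference between arrays no longer shows up as a difference in expression.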
The Z-Ratio is the difference between the mean z-score of the 4 HCR mice and the mean z-score of the 5 HC mice (which is the numerator for the z-test), divided by the SD of these differences over all genes.
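The Z-Ratio definition can be sketched as follows on simulated z-scores. This is an illustration under assumptions: the function name `z_ratio` is hypothetical, the data are random rather than the study's arrays, and the sample SD (`ddof=1`) is used for the denominator.

```python
import numpy as np

def z_ratio(z_hcr, z_hc):
    """Z-ratio per gene from z-scored log expression.

    z_hcr, z_hc: 2-D arrays, rows = genes, columns = mice in each group.
    """
    diff = z_hcr.mean(axis=1) - z_hc.mean(axis=1)   # per-gene difference in mean z-score
    return diff / diff.std(ddof=1)                  # scale by the SD of differences over genes

# Simulated z-scores: 2,000 hypothetical genes, 4 HCR and 5 HC arrays
rng = np.random.default_rng(1)
z_hcr = rng.normal(size=(2000, 4))
z_hc = rng.normal(size=(2000, 5))
z_hcr[0] += 5.0                                     # one artificially up-regulated gene
ratios = z_ratio(z_hcr, z_hc)
print(ratios[0], ratios.std(ddof=1))
```

Because the denominator is the same for every gene, the Z-Ratio ranks genes exactly as the raw difference does. It only rescales the differences so that they are expressed in units of their gene-to-gene spread.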
Use raw data to generate results for the most up-regulated gene.
Two Sample T-Test for HCR vs. HC on Gene Hsd3b5
Group   N    Mean     SD      SE
HCR     4    1.136    0.634   0.32
HC      5   -0.555    0.107   0.048
95% CI for Diff: (1.020, 2.362)
T-Test T = 5.96 P = 0.0006
Antilog(1.691) = e^1.691 ≈ 5.42-fold greater HCR expression
“Z-Ratio” = Diff of logs/SD = 1.691/0.14 ≈ 11.99 (presumably computed with the unrounded SD; 1.691/0.14 itself gives 12.08)
Here, SD=0.14 is among these diffs over genes.
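The t-test results can be reproduced from the summary statistics alone. The sketch below uses SciPy's `ttest_ind_from_stats` and assumes a pooled-variance (equal-variance) test with df = 7, which is the version that matches the reported T = 5.96, and natural logs for the antilog. The variable names are illustrative.

```python
import math
from scipy import stats

# Summary statistics from the table above (gene Hsd3b5)
mean_hcr, sd_hcr, n_hcr = 1.136, 0.634, 4
mean_hc, sd_hc, n_hc = -0.555, 0.107, 5

# Pooled-variance two-sample t-test from summary statistics
t_stat, p_value = stats.ttest_ind_from_stats(
    mean_hcr, sd_hcr, n_hcr, mean_hc, sd_hc, n_hc, equal_var=True
)

# 95% confidence interval for the difference in means
diff = mean_hcr - mean_hc
df = n_hcr + n_hc - 2
sp = math.sqrt(((n_hcr - 1) * sd_hcr**2 + (n_hc - 1) * sd_hc**2) / df)  # pooled SD
se = sp * math.sqrt(1 / n_hcr + 1 / n_hc)
half_width = stats.t.ppf(0.975, df) * se
ci = (diff - half_width, diff + half_width)

fold = math.exp(diff)   # antilog, assuming natural logs
print(f"T = {t_stat:.2f}, P = {p_value:.4f}, "
      f"CI = ({ci[0]:.3f}, {ci[1]:.3f}), fold = {fold:.2f}")
```

Setting `equal_var=False` gives the Welch version of the test, which yields a smaller t statistic here because the two group SDs differ considerably.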
Suppose the decision rule is to declare a particular gene important if its mean expression in HCR mice differs enough from that for HC mice so that p<0.05:
Significantly less → down-regulated.
Significantly more → up-regulated.
Then the expected number of identified genes among, say, 38,000 that are not affected (false positives) is:
0.05*38,000 = 1900
Thus, confirmatory analyses such as PCR are needed.
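The expected number of false positives can be illustrated with a small simulation. This is a sketch under assumptions: 38,000 genes that are all unaffected, two groups of 5 mice drawn from the same normal distribution, an ordinary two-sample t-test per gene, and an arbitrary random seed.

```python
import numpy as np
from scipy import stats

alpha, n_genes = 0.05, 38_000
expected_false_positives = alpha * n_genes

# Simulate unaffected genes: both groups come from the same distribution
rng = np.random.default_rng(2010)
group_a = rng.normal(size=(n_genes, 5))
group_b = rng.normal(size=(n_genes, 5))

# One t-test per gene (row); count how many reach p < 0.05 by chance alone
_, p_values = stats.ttest_ind(group_a, group_b, axis=1)
observed_false_positives = int((p_values < alpha).sum())

print(expected_false_positives, observed_false_positives)
```

The observed count varies from run to run but stays near 1,900, even though no gene in the simulation is truly affected.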