The Impact of Microarray (Biochip) Data Analysis on Modern Statistical Methods

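Program for Enhancing University Basic Education, Biomathematics Workshop, Department of Mathematics, Tunghai University, 6/30/2004 15:30-17:30. Hung Chen (陳宏), Department of Mathematics, National Taiwan University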

Course Outline • Introduction to Microarray Experiment: • Impact on statistical science • Large p small n problem • Preprocessing • Handling the error terms that arise from different sources of variation in microarrays, and normalizing cDNA array data • Modeling • Residual analysis in regression • Gene selection (Multiple Hypothesis Testing) and classification (Support Vector Machine) • Multiple hypotheses testing: Challenge on Neyman-Pearson paradigm • Support Vector Machine: regularization

Literature • Bibliography on Microarray Data Analysis • http://www.nslij-genetics.org/microarray/ • BioConductor: open source software for bioinformatics • http://www.bioconductor.org/ • IPAM Long Programs: • Functional Genomics. 9/18 – 12/15, 2000. • Functional Genomics Tutorials. • Expression Arrays Technologies and Methods of Analysis. • Expression Arrays, Genetic Networks and Disease. • Mathematical and Statistical Challenges from Computational Biology. • http://www.ipam.ucla.edu/programs/yearly.aspx?year=2000 • Proteomics: Sequence, Structure, Function. 3/8 – 6/11, 2004. • IMA Workshop: • Probability and Statistics in Complex Systems: Genomics, Networks, and Financial Engineering, September 1, 2003 - June 30, 2004 • http://www.ima.umn.edu/complex/fall/c1.html

Statistical Issues in cDNA Microarray Analysis • Image Analysis: Identify spot area and extract intensities for each spot • Normalization: Normalizing dye effects, slide effects, etc. • Downstream Analysis: Clustering and classification • Assess Expression Level: Replicates and hierarchical models

The Central Dogma of Molecular Biology

The DNA sequence: the book of life that carries genetic information ...ATCGGTGCGTGCATGCAGTGCAGTGCATGCAACCGTATATTAATCCCACTGTTTAAAACTGGTTCATCAGAATTTATATTTTTTTCTTTCCTCCCTTTTGAATTTTACTTATGACAGAGGAAGTATTGACCCATGACTTTTTAAACATAATTTATATTTATACTGGTCAATAATGAAGGTTTTTTTTTATTATTAAA • Adenine (A) • Guanine (G) • Cytosine (C) • Thymine (T)

Gene Expression Studies • Puzzle: • Different cell types in a multicellular organism have the same genome. • Those cells can have drastically different shapes and structures. • Explanation: • The expression levels of their genes can be very different. • Within the same cell, tightly regulated gene expression is also essential for various processes such as proper response to intercellular signals, cell division and cell differentiation. • Most biological phenomena are caused by an ensemble of cooperating biochemical entities including mRNA, proteins, small molecules (such as hormones) or ions. • Genes encode proteins or RNAs.

“Big Picture” Biology • What are all the components and processes taking place in a cell? • How do these components and processes interact to sustain life? • One approach: What happens to the entire cell when one particular gene/process is perturbed? • Traditional approach: • Genes or gene products were classified by whether they had a dominant effect on a given endpoint in a certain phenotypic assay. • In cancer research, all experimental tools are biased towards the identification of dominant oncogenes, and the question of non-dominant cooperating oncogenes has been largely ignored. • Gene expression studies were carried out in a gene by gene manner. • Lack of understanding of expression profiles on a global (genome-wide) scale.

DNA-array Technology • Global expression studies: Use them to assign biological functions. • Allow massively parallel measurements in one experiment. • Monitor the expression levels of a large number of genes simultaneously. • It is manageable for a single scientist to measure within a few months the expression level of all genes of a given organism, such as the approximately 6000 genes of yeast, in a time-dependent manner during the cell cycle. • Idea: • It is an assay that uses the specificity of DNA/RNA hybridization to measure the transcript concentrations of a large number of genes simultaneously. • Use bait DNA that is immobilized on a solid surface (e.g. glass, nylon, composite) to “attract” its complementary mRNA molecule. • Generally, each gene corresponds to a tiny “spot” on the surface of the microarray.

DNA-array Technology • Idea: • Messenger RNAs (mRNA) are extracted from the cell culture. • Complementary DNAs (cDNA) are generated from the RNAs. • Amplified • Labeled • Hybridized to a large array of DNA probes • The array is scanned by a laser to obtain the fluorescent signals for each probe region. • From the signal strengths of the probes from a particular gene, one can infer the expression level of the gene in the cell type under study. • A gene codes for a protein which is assembled via mRNA. • Measuring the amount of a particular mRNA gives a measure of the amount of the corresponding protein.

Idea: The state of the cell is determined by proteins. 1. When a cell wants to use a gene, the code of that gene is copied into mRNA in a procedure called transcription. 2. Measure the amount of mRNA to see which genes are being expressed in (used by) the cell. Measuring protein might be better, but is currently harder.

Reverse transcription • Clone cDNA strands, complementary to the mRNA. [Figure: reverse transcriptase copies the mRNA strand G U A A U C C U C into many copies of the complementary cDNA strand C A T T A G G A G.]

Microarray Technology • Microarray technology allows us to measure the expression of thousands of genes at once. • Measure the expression of thousands of genes under different experimental conditions and ask what is different and why. • When equal amounts of the two labeled cDNA samples are added to the microarray, the sample cDNA hybridizes to the cDNA spots on the glass slide.

cDNA Microarray Experiments • mRNA levels compared in many different contexts • Different tissues, same organism: brain versus liver • Same tissue, same organism: tumor versus non-tumor • Same tissue, different organisms: wild type (wt) versus knockout (ko), transgenic (tg), or mutant • Time course experiments: development • Other special designs (e.g. to detect spatial patterns).

Types of Array Experiments • mRNA transcription analysis • Single experiment (control versus experimental) • Time course (multiple samples in same experiment) • Genomic DNA -- similarity of genomes • Genetic footprinting • Species cross hybridization (existence of a specific pathway in a related species)

What do we want to know? • Genes involved in a specific biological process (e.g. heat shock) • “Guilt by association”: the assumption that genes with the same pattern of changes in expression are involved in the same pathway • Tumor classification: predict outcome / prescribe appropriate treatment based on clustering with “known outcome” tumors

cDNA microarrays cDNA clones

cDNA microarrays • PRINT: cDNA from one gene on each spot • SAMPLES: cDNA labelled red/green • Compare the genetic expression in two samples of cells, e.g. treatment / control, normal / tumor tissue

Biological question (differentially expressed genes, sample class prediction, etc.) → Experimental design → Microarray experiment → 16-bit TIFF files → Image analysis: (Rfg, Rbg), (Gfg, Gbg) → Normalization: R, G → Estimation, Testing, Clustering, Discrimination → Biological verification and interpretation

Oligo vs cDNA arrays (Lockhart and Winzeler 2000)

Study lipid metabolism and atherosclerosis susceptibility in mice • Scientific goal: Identify transcriptional differences in liver cells from scavenger receptor BI (SR-BI) transgenic mice and the FVB control mice. • SR-BI, a high density lipoprotein (HDL) receptor expressed mostly in the liver, has been shown to be pivotal in cholesterol uptake from the blood. • HDL cholesterol level is extremely low in SR-BI transgenic mice but is increased in SR-BI knockout mice, suggesting that SR-BI may lower plasma HDL cholesterol concentrations by promoting liver uptake. • In this study there are 8 SR-BI transgenic mice as the treatment group and 8 FVB control mice as the control group. • For each of the 16 mice, target cDNA was labeled using a red-fluorescent dye (Cy5) and was compared to a common reference sample (labeled using green-fluorescent dye Cy3) prepared by pooling the cDNA from the 8 FVB control mice. • Target cDNA was hybridized to microarrays containing 6,384 cDNA probes, including 200 related to lipid metabolism. • Large p small n problem: p = 6384, n = 8 or 16

Gene Expression Data • Gene expression data on p genes (variables) for n mRNA samples (observations) • $x_{ij}$ = gene expression level of gene $i$ in mRNA sample $j$

Instrument is not Perfect! • There are multiple sources of variation in measurements besides just gene expression. • Question: Can we find ways to determine when the variation due to gene expression is significant? • We want to know when the variation in measurements is caused by varying levels of gene expression versus other factors. • Several sources of variation in the measurements in microarray experiments are considered. • Array effects • Dye effects • Variety effects • Gene effects • Combinations • Refer to papers by Kerr, Churchill and their associates.

ANOVA Model for Microarray Data • $y_{ijkg}$: the measurement from the $i$th array, $j$th dye, $k$th variety, and $g$th gene. • $\mu$ is the average measurement over all spots. • $A_i$ is the effect of the $i$th array. • $D_j$ is the effect of the $j$th dye. • $V_k$ is the effect of the $k$th variety. • $G_g$ is the effect of the $g$th gene. • $(AG)_{ig}$ is the effect of the $i$th array and $g$th gene. • $(VG)_{kg}$ is the effect of the $k$th variety and $g$th gene. • $e_{ijkg}$ are independent and identically distributed error terms. • Old technique • Write a function $f(A,D,V,G)$ as $f_1(A) + f_2(D) + f_3(V) + f_4(G) + f_{14}(A,G) + f_{34}(V,G)$.
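
The model equation itself did not survive in the transcript. Assuming the additive decomposition described on the slide, with each listed effect entering as one term, it can be written as follows; this is a reconstruction from the term definitions, not a quotation of the original slide.

```latex
y_{ijkg} = \mu + A_i + D_j + V_k + G_g + (AG)_{ig} + (VG)_{kg} + e_{ijkg}
```

The terms match $f_1(A) + f_2(D) + f_3(V) + f_4(G) + f_{14}(A,G) + f_{34}(V,G)$ one for one, with $(VG)_{kg}$ being the interaction of interest for differential expression.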

Pros and Cons of Reference Design • Reference: pooling the cDNA from the 8 FVB control mice • Advantages: • Extendable – you can add new samples to compare against control. • Samples all use the same dye color. • Disadvantages • Get the most data on the control you care the least about. • There are complicated confounding effects associated with reference design, for example varieties and dyes are confounded, variety and array effects are partially confounded. • Model • Notice that AG effects are missing to save degrees of freedom for error estimation. D effects are absent because they are confounded with V.
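
The reference-design model is also missing from the transcript. One reading consistent with the note above (no $(AG)$ term, and no $D$ term because dye is confounded with variety) is sketched below; the index set $ikg$ is an assumption made for illustration.

```latex
y_{ikg} = \mu + A_i + V_k + G_g + (VG)_{kg} + e_{ikg}
```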

Correct Dye or Print-Tip Effect • It is known that Cy3 and Cy5 are relatively unstable. • They are detected by the scanner with different efficiencies. • Different patterns in the M-A plots suggest that the normalization curve is slide dependent.
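
The M-A plot mentioned above is built from the red and green intensities of each spot. A minimal sketch of the transformation follows, assuming the usual definitions M = log2(R/G) and A = (1/2) log2(R·G); the slides define M this way later on, while the definition of A and the function name `ma_transform` are assumptions made here.

```python
import math


def ma_transform(red, green):
    """Return (M, A) lists for paired red/green spot intensities.

    M = log2(R/G) is the log ratio; A = 0.5 * log2(R*G) is the
    average log intensity (assumed definition, not from the slides).
    """
    m_values, a_values = [], []
    for r, g in zip(red, green):
        if r <= 0 or g <= 0:
            raise ValueError("intensities must be positive")
        m_values.append(math.log2(r / g))
        a_values.append(0.5 * math.log2(r * g))
    return m_values, a_values


# Two hypothetical spots: one with a fourfold red excess, one balanced.
m, a = ma_transform([1024.0, 400.0], [256.0, 400.0])
```

Plotting M against A turns the 45° line of the (log G, log R) plot into the horizontal line M = 0, which is easier to judge by eye.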

Normalization by controls identified a priori • Assume that some genes will not change under the treatment under investigation. • Identify those core genes in advance of the experiment. (housekeeping genes, extrinsic controls) • Normalize all genes against these genes assuming they do not change. • Limitations on using housekeeping genes: • They are biologically assumed to be non-differentially expressed genes in the experiments. • If the number of predetermined housekeeping genes is small or their intensities do not cover a range of different intensity levels this approach may not provide a good fit for nonlinear normalization curves.

Normalization by Self-Consistency • Assume that most genes will not change under the treatment under investigation. • Constant normalization factor • Need a robust procedure. • Use mean or median of each dye to normalize. • It forces the distribution of the intensity log ratios to have a median or mean of zero for each slide.
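
As a rough illustration of the constant normalization factor, the sketch below centers the log ratios of one slide at their median. The function name and the sample values are hypothetical; the median is used because it is less affected by the few differentially expressed genes than the mean.

```python
import statistics


def median_normalize(log_ratios):
    """Shift one slide's log ratios so that their median is zero."""
    center = statistics.median(log_ratios)
    return [m - center for m in log_ratios]


# Hypothetical slide with a constant dye bias of about 1 and one
# strongly differentially expressed gene (4.0).
slide = [0.9, 1.1, 1.0, 1.2, 4.0, 0.8]
normalized = median_normalize(slide)
```

The outlying gene keeps its large log ratio after centering, while the bulk of the genes moves to around zero.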

Normalization and Residual Analysis in Regression Analysis • In regression analysis, the adequacy of the fitted model is often validated through a residual analysis. • Standard assumptions on linear regression analysis are • The mean of the noise $e$ is 0. • $\mathrm{Cov}(X, e \mid X) = 0$ • The noises $e_i$ cannot be observed. • Find their surrogates: the residuals $\hat{e}_i$. • How to correct for dye effect in microarray data analysis? • Key assumption: Most genes are not differentially expressed. • Expectation: The data cloud $(R_i, G_i)$ should lie around the 45° line if there is no dye effect. • It is easier to check visually whether points fall around a horizontal line. • Dye bias depends on overall spot intensity. • Need a robust procedure, since the differentially expressed genes will not lie around the 45° line.
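
When dye bias depends on spot intensity, a single constant shift is not enough. The sketch below is a crude stand-in for the lowess fit named on the later slides: it sorts spots by average log intensity A, cuts them into bins, and subtracts each bin's median M. This is a binned-median approximation, not lowess, and the function name, the bin count, and the toy data are all assumptions.

```python
import statistics


def binned_median_normalize(m_values, a_values, n_bins=4):
    """Subtract from each M the median M of spots with similar A.

    A piecewise-constant, robust approximation to an
    intensity-dependent normalization curve (not a lowess fit).
    """
    order = sorted(range(len(a_values)), key=lambda i: a_values[i])
    bin_size = max(1, len(order) // n_bins)
    corrected = [0.0] * len(m_values)
    for start in range(0, len(order), bin_size):
        members = order[start:start + bin_size]
        center = statistics.median([m_values[i] for i in members])
        for i in members:
            corrected[i] = m_values[i] - center
    return corrected


# Toy data: bias grows linearly with intensity, and spot 5 is a
# genuinely differentially expressed gene.
a_values = [float(i) for i in range(12)]
m_values = [0.1 * x for x in a_values]
m_values[5] += 3.0
corrected = binned_median_normalize(m_values, a_values)
```

Because each bin is centered with a median, the differentially expressed spot does not drag the correction for its neighbours, which is the robustness the slide asks for.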

Intensity Plots [Figure panels: log R versus log G; R versus G]

Normalization - Median

Normalization - Lowess

Normalization - Print Tip Lowess

Predicting type of cancer from DNA chips • New feature selection SVM: Only 38 training examples, 7100 features • AML vs ALL: • 40 genes: 34/34 correct, 0 rejects. • 5 genes: 31/31 correct, 3 rejects of which 1 is an error.

Traditional Problem: Small p Large n • A two gene scenario, where everything works out fine. [Figure: patients of classes A and B plotted on two genes, with a new patient to be classified.]

Difficulties with Traditional Problem • Problem 1: No separating line. Fisher’s proposal: Use a density function with a few parameters and find the separating line based on the likelihood ratio. • Problem 2: Too many separating lines

And in a space of a few thousand dimensions ... [Figure: axes 1, 2, 3, ..., 7000]

New Problem: Small n Large p • Problem 1 never exists! • Problem 2 exists almost always! Spend a minute thinking about this in three dimensions: there are three genes, two patients with known diagnosis, one patient of unknown diagnosis, and separating planes instead of lines. If all points fall onto one line, a separating plane does not always exist. However, for measured values this is very unlikely and never happens in practice.

In summary: There is always a linear signature separating the entities ... a biological reason for this is not needed. Hence, if you find a separating signature, it does not mean (yet) that you have a nice publication ... ... in most cases it means nothing.
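
The claim that a separating signature always exists when n is much smaller than p can be checked numerically. The sketch below draws pure-noise expression values for a handful of hypothetical patients, assigns diagnoses at random, and runs a plain perceptron; the sizes, the seed, and the function name are illustrative assumptions, and the perceptron is used only as a simple way to find some separating hyperplane (it is not the SVM of the slides).

```python
import random


def perceptron_separates(points, labels, max_epochs=1000):
    """Return True if a perceptron reaches zero training errors."""
    dim = len(points[0])
    weights = [0.0] * dim
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            score = bias + sum(w * xi for w, xi in zip(weights, x))
            if y * score <= 0:
                # Misclassified (or on the boundary): move the plane.
                weights = [w + y * xi for w, xi in zip(weights, x)]
                bias += y
                mistakes += 1
        if mistakes == 0:
            return True
    return False


rng = random.Random(0)
n_patients, n_genes = 8, 200
# Expression values and diagnoses are independent noise by construction.
expression = [[rng.gauss(0.0, 1.0) for _ in range(n_genes)]
              for _ in range(n_patients)]
diagnosis = [rng.choice([-1, 1]) for _ in range(n_patients)]
separable = perceptron_separates(expression, diagnosis)
```

Since the labels carry no information about the data, a perfect separation here reflects only the geometry of few points in many dimensions, which is the slide's warning.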

Gene Selection: Multiple Hypotheses Testing

Apo AI experiment (Matt Callow, LBNL) • Goal: Identify genes with altered expression in the livers of Apo AI knock-out mice (T) compared to inbred C57Bl/6 control mice (C). • Data Collection: • 8 treatment mice and 8 control mice • 16 hybridizations: liver mRNA from each of the 16 mice (Ti,Ci) is labelled with Cy5, while pooled liver mRNA from the control mice (C*) is labelled with Cy3 • Probes: ~ 6,000 cDNAs (genes), including 200 related to lipid metabolism.

Which genes have changed? (When permutation testing is possible) 1. For each gene and each hybridisation (8 ko + 8 ctl), use $M = \log_2(R/G)$. 2. For each gene form the t statistic: $t = \dfrac{\bar{M}_{ko} - \bar{M}_{ctl}}{\sqrt{\tfrac{1}{8}\left(SD_{ko}^2 + SD_{ctl}^2\right)}}$, where $\bar{M}$ and $SD$ are the average and standard deviation of the 8 M values in each group. 3. Form a histogram of 6,000 t values. 4. Do a normal q-q plot; look for values “off the line”. 5. Permutation testing. 6. Adjust for multiple testing.
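
Step 2 can be sketched for a single gene as follows. The function name and the toy M values are hypothetical; with 8 values per group the standard error below equals the slide's sqrt(1/8 (SD_ko² + SD_ctl²)), and the sample standard deviation (n − 1 denominator) is assumed.

```python
import math
import statistics


def two_sample_t(ko, ctl):
    """t statistic: difference of group means over its standard error."""
    diff = statistics.mean(ko) - statistics.mean(ctl)
    # variance/n per group; with n = 8 in both groups this is the
    # slide's (1/8) * (SD_ko^2 + SD_ctl^2).
    se = math.sqrt(statistics.variance(ko) / len(ko)
                   + statistics.variance(ctl) / len(ctl))
    return diff / se


# Hypothetical M values for one gene (4 per group to keep it short).
t_value = two_sample_t([11.0, 12.0, 13.0, 14.0], [1.0, 2.0, 3.0, 4.0])
```

The call fails with a division by zero if both groups have no spread at all, a case that would need separate handling on real data.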

Replicate Slides or two group comparison: Permutation t-test: (Speed group)

Data Permutation • Advantage: It needs no distributional assumption (robustness). • Disadvantage: Robustness usually leads to a loss of efficiency. In addition, this method needs at least about five samples in each group to justify significance.
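
A permutation p-value for one gene can be sketched by recomputing the t statistic over every way of splitting the pooled values into two groups of the original sizes. The helper names, the two-sided comparison, and the toy data are assumptions; for the 8 versus 8 design there are 12,870 such splits, and the example uses 4 versus 4 (70 splits) to stay small.

```python
import itertools
import math
import statistics


def t_stat(a, b):
    """Two-sample t statistic (difference of means over standard error)."""
    diff = statistics.mean(a) - statistics.mean(b)
    se = math.sqrt(statistics.variance(a) / len(a)
                   + statistics.variance(b) / len(b))
    if se == 0.0:
        # Degenerate split with no spread (assumed convention).
        return math.copysign(math.inf, diff) if diff != 0 else 0.0
    return diff / se


def permutation_p_value(ko, ctl):
    """Two-sided p-value over all relabelings of the pooled samples."""
    pooled = list(ko) + list(ctl)
    observed = abs(t_stat(ko, ctl))
    indices = range(len(pooled))
    hits = total = 0
    for chosen in itertools.combinations(indices, len(ko)):
        chosen_set = set(chosen)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in indices if i not in chosen_set]
        # Small tolerance so the observed split always counts itself.
        if abs(t_stat(a, b)) >= observed - 1e-12:
            hits += 1
        total += 1
    return hits / total


p_value = permutation_p_value([11.0, 12.0, 13.0, 14.0], [1.0, 2.0, 3.0, 4.0])
```

The smallest attainable two-sided p-value is 2 divided by the number of splits, which is why very small groups cannot reach significance however large the difference.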

A Basic Problem SCIENTIFIC: To determine which genes are differentially expressed between two sources of mRNA (treatment, control). STATISTICAL: To assign appropriately adjusted p-values to thousands of genes. Univariate hypothesis testing

Adjust for Multiple Testing • Single-step adjustments of $p_i$ • Bonferroni: $\min(m p_i, 1)$, $m$ = number of genes • Šidák: $1 - (1 - p_i)^m$ • min P method of Westfall and Young: $\Pr\left(\min_{1 \le l \le m} P_l \le p_i \mid H_0^C\right)$ • max T method of Westfall and Young: $\Pr\left(\max_{1 \le l \le m} |T_l| \ge |t_i| \mid H_0^C\right)$
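
The two closed-form single-step adjustments can be sketched directly from the formulas above. The function names and the raw p-values are hypothetical; the Westfall and Young methods are not covered here because they require the permutation distribution of all genes jointly.

```python
def bonferroni(p_values):
    """Single-step Bonferroni adjustment: min(m * p, 1)."""
    m = len(p_values)
    return [min(m * p, 1.0) for p in p_values]


def sidak(p_values):
    """Single-step Sidak adjustment: 1 - (1 - p)^m."""
    m = len(p_values)
    return [1.0 - (1.0 - p) ** m for p in p_values]


raw = [0.01, 0.04, 0.5]  # hypothetical unadjusted p-values, m = 3
adj_bonferroni = bonferroni(raw)
adj_sidak = sidak(raw)
```

The Šidák value never exceeds the Bonferroni value for the same p, but with thousands of genes both adjustments are severe for small raw p-values.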
