Statistical Design and Analysis of Microarray Experiments Peng Liu 6/15/2010

Microarray Technology • Microarray technology allows measuring expression levels (abundance of mRNA transcripts) of thousands of genes simultaneously. • Two types of platforms: • Affymetrix (single-color) • Two-color microarray

Wild-type vs. Myostatin Knockout Mice • Belgian Blue cattle have a mutation in the myostatin gene. • Design of Affymetrix experiment: one sample → one chip

Designing 2-color microarray (3 layers) From Churchill, 2002, Nature Genetics

Example I: Sawers et al, 2007, BMC Bioinformatics [Figure: leaf section with cell labels M, B and V; panels show bundle sheath strands and mesophyll protoplasts]

Example I: Sawers et al, 2007, BMC Bioinformatics • The establishment of C4 photosynthesis in maize is associated with differential accumulation of gene transcripts and proteins between bundle sheath and mesophyll photosynthetic cell types. • Goal: To detect genes that are differentially expressed between bundle sheath (B) and mesophyll (M) cells.

Example I: Sawers et al, 2007, BMC Bioinformatics • A simple method: Isolate the cells and perform a microarray experiment to compare gene expression between the two cell types (treatments).

Example I: Sawers et al, 2007, BMC Bioinformatics • A complication: The procedures for extracting mRNA from the two cell types are different, and the one used for M cells introduces stress. • Solution: Add two more treatment groups: samples containing both M and B cells that go through mRNA extraction with and without stress. • This gives 4 treatment groups: B, M, Stress and Total.

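Direct comparison vs indirect comparison • Direct: comparison within slide • Indirect: comparison between slides • Suppose we want to compare gene expression levels between treatment 1 and treatment 2. [Diagram: Direct Comparison, with samples 1 and 2 hybridized to the same slide; Indirect Comparison, with samples 1 and 2 each hybridized against a reference R on separate slides]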

Comments about 2-color Microarray Designs • A unique and powerful feature of the 2-color microarray is that it allows direct comparison between two samples on the same slide. • Because the two samples are paired on a slide, the variation due to slide can be accounted for. • When possible, it is more efficient to use direct comparison. • However, it is sometimes not practical to make direct comparisons of all possible pairs.

Efficiency of comparison • The efficiency of a comparison between 2 samples is determined by the length and the number of paths connecting them. [Diagram: Direct Comparison (Dye-swap) between samples 1 and 2; Indirect Comparison of samples 1 and 2 through reference R]
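To make the efficiency statement concrete, here is a minimal sketch that compares the variance of the estimated treatment difference under the two layouts, using two slides in each case. It assumes that every slide yields one log-ratio with independent error of equal variance `sigma2`; that assumption, the function name `contrast_variance`, and the weights are illustrative and not taken from the slides.

```python
def contrast_variance(weights, sigma2=1.0):
    """Variance of sum(w_k * d_k) for independent slide log-ratios d_k,
    each with variance sigma2 (an assumption of this sketch)."""
    return sigma2 * sum(w * w for w in weights)

# Direct comparison with dye swap: slide 1 measures (trt1 - trt2),
# slide 2 measures (trt2 - trt1); the estimate is (d1 - d2) / 2.
direct_dye_swap = contrast_variance([0.5, -0.5])

# Indirect comparison via a reference R: slide 1 measures (trt1 - R),
# slide 2 measures (trt2 - R); the estimate is d1 - d2.
indirect_reference = contrast_variance([1.0, -1.0])

print(direct_dye_swap, indirect_reference)
```

Under these assumptions the shorter path (direct comparison) gives the smaller variance for the same number of slides, which is the sense in which it is more efficient.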

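Reference vs Loop design [Diagram: Reference Design, with samples 1, 2 and 3 each compared with reference R; Loop Design, with samples 1, 2 and 3 compared with one another in a loop]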

Designing experiment for example I • With 6 biological replicates [Diagram: 2-color design connecting the four treatments B, Total, Stress and M]

Performing the experiment (Nature Cell Biol. 2001 3:8)

After the bench work… [Images: Affymetrix GeneChip image; 2-color microarray image]

The data table looks like

Pre-normalization analysis • Image processing: obtain the intensity measurement of the signal • Background correction: remove local background that might be due to non-specific binding, to obtain the target sample intensity • Filtration: remove unreliable spots and reduce the dimension of the data • Transformation: convert the data into a format that makes data analysis valid or easier
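The steps after image processing can be sketched for a single spot as follows. This is only an illustration: subtracting the background, filtering on a minimum corrected signal, and taking log base 2 are common choices assumed here, and the function name `preprocess_spot`, the threshold `min_signal`, and the intensities are made up.

```python
import math

def preprocess_spot(foreground, background, min_signal=1.0):
    """Return the log2 background-corrected intensity of one spot,
    or None if the spot is filtered out as unreliable."""
    corrected = foreground - background      # background correction
    if corrected < min_signal:               # filtration
        return None
    return math.log2(corrected)              # transformation

# (foreground, background) pairs for three hypothetical spots
spots = [(1200.0, 200.0), (310.0, 300.0), (150.0, 180.0)]
log_signals = [preprocess_spot(fg, bg, min_signal=16.0) for fg, bg in spots]
print(log_signals)
```

Only the first spot survives the filter here; the other two have corrected signals too close to, or below, the local background.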

Normalization • Normalization describes the process of removing (or minimizing) non-biological variation in measured signal intensity levels so that biological differences in gene expression can be appropriately detected. • Aim: remove sources of systematic variation • Example of non-biological variation: dye difference for 2-color microarray

Figure from Dudoit et al, 2002, Statistica Sinica Self-self experiment

Normalization: M vs. A Plot (45° rotation) • M = Log Red − Log Green • A = (Log Green + Log Red)/2
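As a small worked example of the M and A definitions, the sketch below computes both values for one spot. The slide does not state the base of the logarithm; base 2 is assumed here, and the function name `ma_values` and the intensities are illustrative.

```python
import math

def ma_values(red, green):
    """Return (M, A) for one spot from its red and green intensities."""
    log_red = math.log2(red)      # base 2 is an assumption of this sketch
    log_green = math.log2(green)
    m = log_red - log_green           # M: log-ratio
    a = (log_green + log_red) / 2     # A: average log-intensity
    return m, a

m, a = ma_values(red=1024.0, green=256.0)
print(m, a)
```

With base-2 logs, M = 2 means the red signal is four times the green signal, and A places the spot on the overall intensity scale.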

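LOWESS Fit [Plot: M = Log Red − Log Green on the vertical axis against A = (Log Green + Log Red)/2 on the horizontal axis, with the fitted LOWESS curve]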
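The idea behind the fit can be sketched with a simplified local regression: at each A value, fit a tricube-weighted straight line to nearby points and subtract the fitted value from M. This is a stripped-down stand-in for LOWESS (it omits the robustness iterations), and the function names, the `span` parameter, and the data are assumptions made for illustration.

```python
def local_linear_fit(a_values, m_values, x0, span=0.4):
    """Tricube-weighted linear fit of M on A, evaluated at x0."""
    n = len(a_values)
    k = min(n, max(3, int(round(span * n))))   # neighbours used in the window
    dists = sorted(abs(a - x0) for a in a_values)
    h = dists[k - 1] or 1e-12                  # bandwidth: distance to k-th neighbour
    w = [(1 - min(abs(a - x0) / h, 1.0) ** 3) ** 3 for a in a_values]
    sw = sum(w)
    mean_a = sum(wi * a for wi, a in zip(w, a_values)) / sw
    mean_m = sum(wi * m for wi, m in zip(w, m_values)) / sw
    sxx = sum(wi * (a - mean_a) ** 2 for wi, a in zip(w, a_values))
    sxy = sum(wi * (a - mean_a) * (m - mean_m)
              for wi, a, m in zip(w, a_values, m_values))
    slope = sxy / sxx if sxx > 0 else 0.0
    return mean_m + slope * (x0 - mean_a)

def normalize_m(a_values, m_values, span=0.4):
    """Subtract the fitted intensity-dependent trend from each M value."""
    return [m - local_linear_fit(a_values, m_values, a, span)
            for a, m in zip(a_values, m_values)]

a_vals = [float(a) for a in range(6, 16)]
m_vals = [0.5 * a - 3.0 for a in a_vals]   # made-up intensity-dependent dye bias
normalized = normalize_m(a_vals, m_vals, span=0.5)
print(normalized)
```

In this toy data the dye bias is an exact straight line in A, so the normalized M values come out at essentially zero. Real data would leave the biological signal and noise scattered around zero.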

After normalization [Plot: normalized M against A]

Statistical Inference • Data notation for normalized signal intensities (NSI): Yijk for each gene (g), where i: treatment index, j: dye index, k: slide index • Examples: Y114 (treatment 1, dye 1, slide 4) and Y224 (treatment 2, dye 2, slide 4)

Fitting linear models to microarray data • After the normalization, we have one observation (normalized signal intensity) for each gene on each channel (a combination of dye and array). • Together, the data is an array with each row for one gene and each column for one channel or one chip. • We will fit a statistical model for each gene separately.

Mean expressions for 4 treatment groups • Treatment means: • M (M cell with stress): μ + v2 + δ • B (B cell without stress): μ + v1 • TO (both cells without stress): μ + c*v2 + (1-c)*v1 • ST (both cells with stress): μ + c*v2 + (1-c)*v1 + δ • Here δ denotes the stress effect, and c is the proportion of M cells in the total leaf sample with both cells. • We are interested in testing H0: v1 = v2, that is, whether a given gene is differentially expressed between M and B cells.

Fixed effects • The parameters on the previous slide (v1, v2, and δ) specify fixed effects. • Fixed effects are used to specify the mean of the response variable. • A factor is fixed if the levels of the factor were selected by the investigator with the purpose of comparing the effects of the levels to one another. • The fixed effects included in the model depend on the experimental design.
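The role of the two extra treatment groups can be checked numerically from the four treatment means. In the sketch below the stress effect is passed as a parameter named `stress` (a name chosen for this sketch), and all numeric values are invented. The naive contrast M − B mixes the cell-type difference with the stress effect, while ST − TO isolates the stress effect so it can be subtracted.

```python
def treatment_means(mu, v1, v2, stress, c):
    """Mean expression of one gene in each of the four treatment groups."""
    return {
        "M": mu + v2 + stress,                         # M cells, with stress
        "B": mu + v1,                                  # B cells, no stress
        "TO": mu + c * v2 + (1 - c) * v1,              # both cells, no stress
        "ST": mu + c * v2 + (1 - c) * v1 + stress,     # both cells, with stress
    }

means = treatment_means(mu=8.0, v1=1.0, v2=2.5, stress=0.75, c=0.25)

naive = means["M"] - means["B"]                # equals (v2 - v1) + stress
stress_estimate = means["ST"] - means["TO"]    # equals stress
adjusted = naive - stress_estimate             # equals v2 - v1

print(naive, stress_estimate, adjusted)
```

The adjusted contrast recovers v2 − v1 regardless of the value of c, which is why H0: v1 = v2 can be tested with this four-group design.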

Random effects • There are some random effects that are unknown: • slide effects • other effects introduced in the experiment (such as biological replicate effects) • residual random effects that include any sources of variation unaccounted for by other terms

Random effects • Random factors are used to specify the correlation structure among the response variable observations. • e.g., observations on the same slide are more correlated than observations from different slides. • The random effects included in the model also depend on the experimental design. • A model that has both fixed and random effects is called a mixed model.

Detecting differentially expressed genes • Construct a statistical test for the parameters of interest, e.g., what is the difference in gene expression (v1 - v2)? v1 - v2 ≠ 0 means differential expression. • Model the random effects and perform tests or construct confidence intervals. • Perform the test for each gene and obtain a p-value. • An empirical Bayes test that borrows information across genes is often used because of its higher power.
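For orientation, the simplest per-gene test statistic looks like the sketch below: a one-sample t-statistic on the log-ratios of one gene across slides from a direct comparison. This is a deliberately simpler stand-in, not the mixed-model or empirical Bayes test described on the slide, and the function name and data are made up.

```python
import math
import statistics

def one_sample_t(log_ratios):
    """t-statistic and degrees of freedom for H0: mean log-ratio = 0."""
    n = len(log_ratios)
    mean = statistics.mean(log_ratios)
    se = statistics.stdev(log_ratios) / math.sqrt(n)   # standard error of the mean
    return mean / se, n - 1

# Hypothetical log-ratios (treatment 1 minus treatment 2) for one gene on 6 slides
t_stat, df = one_sample_t([1.2, 0.8, 1.1, 0.9, 1.3, 0.7])
print(t_stat, df)
```

With few slides per gene the per-gene standard deviation is estimated poorly, which is the motivation for borrowing information across genes.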

Results from testing

2536 p-values below 0.05. We would expect around 0.05*40000 = 2000 p-values to be less than 0.05 by chance if no genes were differentially expressed.

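Possible Errors in Testing ONE gene • Type I error: false positive (declaring a gene differentially expressed when it is not) • Type II error: false negative (missing a gene that is differentially expressed); its probability is 1 − power • Power: probability of a true positive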

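Error Rate in Multiple Testing • Outcomes when testing m genes (Benjamini and Hochberg, 1995): V = number of false positives, R = total number of genes declared differentially expressed • Family-wise error rate: FWER = Pr(V > 0) • False discovery rate: FDR = E(V/R | R > 0) * Pr(R > 0)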
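The slide defines the FDR but does not spell out a procedure for controlling it. One widely used option is the Benjamini-Hochberg step-up rule, sketched below under the assumption that it is an acceptable choice here: sort the m p-values, find the largest rank k with p(k) ≤ k·q/m, and reject the k smallest. The function name and the p-values are made up for illustration.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the hypothesis is rejected
    by the Benjamini-Hochberg step-up rule at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= k * q / m.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            cutoff_rank = rank
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected

p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(p, q=0.05)
print(sum(rejected), "rejected by BH;", sum(x < 0.05 for x in p), "have p < 0.05")
```

In this toy example five p-values fall below 0.05, but only the two smallest are rejected once the FDR is controlled at 0.05.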

Results from testing for example I

Clustering • Grouping genes into different “clusters” based on their expression profiles → Clustering

Other analyses • Relating gene expression to biological functional categories → Gene Enrichment Test • Connecting microarray data with other kinds of data, such as survival data. • More …

Assigned References • Nettleton, D. (2006) A discussion of statistical methods for design and analysis of microarray experiments for plant scientists. The Plant Cell, 18, 2112–2121.
