Statistical Design and Analysis of Microarray Experiments Peng Liu 6/15/2010


- Microarray technology allows measuring expression levels (abundance of mRNA transcripts) of thousands of genes simultaneously.
- Two types of platforms:
- Affymetrix (single-color)
- Two-color microarray

Belgian Blue cattle have a mutation in the myostatin gene.

Design of Affymetrix experiment: one sample one chip

From Churchill, 2002, Nature Genetics

[Figure labels: M, B, V; bundle sheath strands; mesophyll protoplasts]

- The establishment of C4 photosynthesis in maize is associated with differential accumulation of gene transcripts and proteins between bundle sheath and mesophyll photosynthetic cell types.
- Goal: To detect genes that are differentially expressed in Bundle Sheath (B) and Mesophyll (M) cells.

- A simple method:
Isolate the cells and perform a microarray experiment to compare gene expression between the two cell types (treatments).

- A complication:
The procedures for extracting mRNA from the two cell types are different. The one used to extract mRNA from M cells introduces stress.

- Solution:
Add two more treatment groups: samples containing both M and B cells that go through mRNA extraction with and without stress.

B, M, Stress and Total (4 treatment groups)

- Direct: comparison within slide
- Indirect: comparison between slides
- Suppose we want to compare gene expression levels between treatment 1 and treatment 2.

[Figure: Direct Comparison and Indirect Comparison of samples 1 and 2, with reference R]

- A unique and powerful feature of 2-color microarrays is the ability to make a direct comparison between two samples on the same slide.
- When samples are paired on a slide, the variation due to slide can be accounted for.
- When possible, it is more efficient to use direct comparison.
- However, it is sometimes not practical to make direct comparisons of all possible pairs.

- The efficiency of comparisons between 2 samples is determined by the length and the number of paths connecting them.
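
The dependence on path length and path number can be illustrated with a small calculation. This is a minimal sketch under simplifying assumptions that are not stated in the slides: every slide yields one log-ratio with the same variance `sigma2`, slides are independent, and the paths between the two samples do not share slides. The function names are hypothetical.

```python
def path_variance(path_length, sigma2=1.0):
    """Variance of a contrast estimated along one path of `path_length` slides.

    Each slide contributes one independent log-ratio with variance sigma2,
    and the log-ratios along the path are summed.
    """
    return path_length * sigma2


def combine_paths(path_lengths, sigma2=1.0):
    """Variance after inverse-variance weighting of independent paths."""
    return 1.0 / sum(1.0 / path_variance(length, sigma2) for length in path_lengths)


# Two slides comparing samples 1 and 2 directly (e.g. a dye-swap pair):
# two paths of length 1.
direct_var = combine_paths([1, 1])

# Two slides, each comparing one sample with a reference R:
# a single path of length 2 (1 -> R -> 2).
indirect_var = combine_paths([2])

# Three slides in a loop 1 -> 2 -> 3 -> 1: samples 1 and 2 are joined by
# one path of length 1 and one path of length 2.
loop_var = combine_paths([1, 2])

print(direct_var, indirect_var, loop_var)
```

Under these assumptions, shorter paths and more paths both reduce the variance of the estimated difference, which is why the direct comparison is the most efficient use of two slides.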

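[Figure: Direct Comparison (Dye-swap) and Indirect Comparison of samples 1 and 2, with reference R]

[Figure: Reference Design and Loop Design for samples 1, 2, and 3, with reference R]

[Figure: design for B, Total, Stress, and M, with 6 biological replicates]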

Affymetrix Gene Chip image

2-color microarray image

- Image processing
- obtain the intensity measurement of the signal

- Background correction
- remove local background that might be due to non-specific binding, and obtain the target sample intensity

- Filtration
- remove unreliable spots and reduce the dimension of data

- Transformation
- convert data into a format that makes data analysis valid or easier
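
The preprocessing steps after image processing can be sketched as follows. This is an illustration only: the spot names, the intensity values, the `min_signal` cutoff, and the choice of a log2 transformation are all assumptions made for the example, and simple subtraction is only one possible background correction.

```python
import math


def preprocess(foreground, background, min_signal=1.0):
    """Background-correct, filter, and log2-transform raw spot intensities."""
    processed = {}
    for spot_id, fg in foreground.items():
        # Background correction: subtract the local background estimate.
        signal = fg - background[spot_id]
        # Filtration: drop spots whose corrected signal is too low to trust.
        if signal < min_signal:
            continue
        # Transformation: work on the log2 scale.
        processed[spot_id] = math.log2(signal)
    return processed


# Hypothetical raw intensities for three spots.
raw_foreground = {"gene_a": 1100.0, "gene_b": 60.0, "gene_c": 4196.0}
raw_background = {"gene_a": 76.0, "gene_b": 75.0, "gene_c": 100.0}

log_signal = preprocess(raw_foreground, raw_background)
print(log_signal)
```

Here `gene_b` is removed because its foreground is below its local background, so only two spots reach the analysis.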

- Normalization describes the process of removing (or minimizing) non-biological variation in measured signal intensity levels so that biological differences in gene expression can be appropriately detected.
- Aim: remove sources of systematic variation
- Example of non-biological variation: dye difference for 2-color microarray

Self-self experiment

M = Log Red - Log Green

A = (Log Green + Log Red)/2

[Figure: M plotted against A, before and after normalization (Normalized M vs. A)]
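
The M and A values can be computed directly from the two channel intensities. The sketch below assumes base-2 logarithms and uses made-up intensities for a self-self experiment in which the red channel reads twice as bright as the green one. The normalization shown is a plain median centering of M, chosen for brevity; it is a stand-in and not necessarily the method used for the plots in the slides.

```python
import math
from statistics import median


def ma_values(red, green):
    """Return (M, A) lists: log-ratio and mean log-intensity per spot."""
    m = [math.log2(r) - math.log2(g) for r, g in zip(red, green)]
    a = [(math.log2(g) + math.log2(r)) / 2 for r, g in zip(red, green)]
    return m, a


def median_center(m):
    """Shift M so that its median is zero (a very simple normalization)."""
    shift = median(m)
    return [value - shift for value in m]


# Hypothetical self-self data: same sample in both channels, so any
# non-zero M reflects dye bias rather than biology.
red = [800.0, 1600.0, 3200.0, 6400.0, 12800.0]
green = [400.0, 800.0, 1600.0, 3200.0, 6400.0]

m, a = ma_values(red, green)
normalized_m = median_center(m)
print(m)
print(normalized_m)
```

In this toy data the dye bias is constant, so centering removes it completely. When the bias changes with A, a constant shift is not enough.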

- Data notation for normalized signal intensities (NSI): Yijk for each gene (g)
- i: treatment index
- j: dye index
- k: slide index
- Examples: Y224, Y114

- After the normalization, we have one observation (normalized signal intensity) for each gene on each channel (a combination of dye and array).
- Together, the data form an array with one row per gene and one column per channel or chip.
- We will fit a statistical model for each gene separately.

Treatment means

- M (M cells with stress): μ + v2 + (stress effect)
- B (B cells without stress): μ + v1
- TO (both cell types without stress): μ + c*v2 + (1-c)*v1
- ST (both cell types with stress): μ + c*v2 + (1-c)*v1 + (stress effect)
- Note that c is the proportion of M cells in the total leaf sample containing both cell types.
- We are interested in testing H0: v1 = v2, that is, whether a given gene is differentially expressed between M and B cells.
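
One consequence of these treatment means is that the difference v2 - v1 can be recovered from the four means without knowing c. The sketch below checks this numerically. The extra term shared by the two "with stress" treatments is called `stress` here, a name chosen for this example, and all parameter values are invented.

```python
def treatment_means(mu, v1, v2, stress, c):
    """Mean of each treatment group under the model for one gene."""
    total = mu + c * v2 + (1 - c) * v1
    return {
        "M": mu + v2 + stress,   # M cells, extracted with stress
        "B": mu + v1,            # B cells, extracted without stress
        "TO": total,             # both cell types, without stress
        "ST": total + stress,    # both cell types, with stress
    }


def m_minus_b_contrast(means):
    """(M - ST) + (TO - B): the stress term and the c-weighted mix cancel."""
    return (means["M"] - means["ST"]) + (means["TO"] - means["B"])


# Hypothetical parameter values; the contrast should equal v2 - v1 = 0.75
# for any proportion c of M cells.
for c in (0.2, 0.5, 0.9):
    means = treatment_means(mu=8.0, v1=1.5, v2=2.25, stress=0.5, c=c)
    print(c, m_minus_b_contrast(means))
```

This is why the two extra treatment groups help: M - ST removes the stress term, and TO - B puts back the part of the cell-type difference that M - ST leaves out.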

- The parameters on the previous slide (v1, v2, and the stress effect) specify fixed effects.
- Fixed effects are used to specify the mean of the response variable.
- A factor is fixed if the levels of the factor were selected by the investigator with the purpose of comparing the effects of the levels to one another.
- The fixed effects included in the model depend on the experimental design.

- There are some random effects that are unknown:
- slide effects
- other effects introduced in the experiment (such as biological replicate effects)
- residual random effects that include any sources of variation unaccounted for by other terms

[Figure: design diagram with B, Total, Stress, and M]

- Random factors are used to specify the correlation structure among the response variable observations.
- e.g., observations on the same slide are more correlated than observations from different slides.

- The random effects included in the model also depend on the experimental design.
- A model that has both fixed and random effects is called a mixed model.
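
The way a random slide effect induces correlation can be seen in a small simulation. This is a sketch with invented standard deviations: each observation is a shared slide effect plus an independent residual, so two channels on the same slide share one random term while channels on different slides share none.

```python
import random


def pearson(x, y):
    """Sample Pearson correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def simulate_slides(n_slides, sd_slide, sd_resid, seed=0):
    """Simulate two channels per slide: slide effect + residual."""
    rng = random.Random(seed)
    channel_1, channel_2 = [], []
    for _ in range(n_slides):
        slide_effect = rng.gauss(0.0, sd_slide)  # shared by both channels
        channel_1.append(slide_effect + rng.gauss(0.0, sd_resid))
        channel_2.append(slide_effect + rng.gauss(0.0, sd_resid))
    return channel_1, channel_2


channel_1, channel_2 = simulate_slides(5000, sd_slide=1.0, sd_resid=0.5)

# Same slide: both observations contain the same slide effect.
same_slide_corr = pearson(channel_1, channel_2)
# Different slides: pair channel 1 of slide k with channel 2 of slide k+1.
different_slide_corr = pearson(channel_1[:-1], channel_2[1:])

print(same_slide_corr, different_slide_corr)
```

With these variances the same-slide correlation should be near 1.0 / (1.0 + 0.25) = 0.8, while the different-slide correlation should be near zero.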

- Construct statistical tests for the parameters of interest, e.g., what is the difference in gene expression (v1 - v2)?
v1 - v2 ≠ 0 means differential expression.

- Model the random effects and perform tests or construct confidence intervals.
- Perform a test for each gene and obtain a p-value.
- An empirical Bayes test that borrows information across genes is often used because of its higher power.
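
One simple way to picture "borrowing information across genes" is to shrink each gene's variance estimate toward a value shared by all genes before forming a t-like statistic. The slides do not say which empirical Bayes test is used, so the sketch below is only an illustration: the data are invented, the shared variance is taken as the plain average of the per-gene variances, and `prior_df` is a hypothetical weight rather than an estimated quantity.

```python
from statistics import mean, variance


def shrunken_variance(sample_var, df, prior_var, prior_df):
    """Weighted average of a gene's own variance and a shared prior variance."""
    return (prior_df * prior_var + df * sample_var) / (prior_df + df)


def t_like_statistic(diffs, var_estimate):
    """Mean difference divided by its standard error under var_estimate."""
    return mean(diffs) / (var_estimate / len(diffs)) ** 0.5


# Hypothetical per-slide estimates of v1 - v2 for two genes (4 slides each).
gene_diffs = {
    "gene_a": [0.50, 0.52, 0.48, 0.50],   # tiny variance, possibly by chance
    "gene_b": [1.00, -0.20, 0.90, 0.30],  # same mean, much noisier
}

sample_vars = {gene: variance(d) for gene, d in gene_diffs.items()}
prior_var = mean(sample_vars.values())  # crude shared variance (assumption)
prior_df = 4                            # hypothetical prior weight

ordinary_t = {}
moderated_t = {}
for gene, diffs in gene_diffs.items():
    df = len(diffs) - 1
    ordinary_t[gene] = t_like_statistic(diffs, sample_vars[gene])
    shrunk = shrunken_variance(sample_vars[gene], df, prior_var, prior_df)
    moderated_t[gene] = t_like_statistic(diffs, shrunk)

print(ordinary_t)
print(moderated_t)
```

With few slides per gene, a very small sample variance is often a chance event, and shrinking it pulls an inflated statistic such as that of `gene_a` back toward a more plausible size.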

2536 p-values below 0.05.

We would expect around 0.05*40000 = 2000 p-values to be less than 0.05 by chance if no genes were differentially expressed.
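
The expected count under "no differential expression" can be checked by simulation, assuming that null p-values are independent and uniformly distributed on (0, 1). The seed and helper name below are arbitrary choices for the example.

```python
import random


def count_below(p_values, threshold=0.05):
    """Number of p-values strictly below the threshold."""
    return sum(p < threshold for p in p_values)


n_genes = 40000
threshold = 0.05

# Expected number of p-values below the threshold if every gene is null.
expected_by_chance = threshold * n_genes

# One simulated experiment in which no gene is differentially expressed.
rng = random.Random(1)
null_p_values = [rng.random() for _ in range(n_genes)]
simulated_count = count_below(null_p_values, threshold)

print(expected_by_chance, simulated_count)
```

The simulated count lands close to 2000, so the 2536 small p-values observed in the experiment exceed what chance alone would produce, but most of them could still be false positives.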

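- Type I error: false positive
- Type II error: false negative (probability = 1 - power)
- Power: probability of detecting a gene that is truly differentially expressed

Outcomes when testing m genes (Benjamini and Hochberg, 1995): V is the number of false positives and R is the total number of genes declared differentially expressed.

Family-wise error rate: FWER = Pr(V > 0)

False discovery rate: FDR = E(V/R | R > 0) * Pr(R > 0)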
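
Benjamini and Hochberg (1995) also give a step-up procedure for controlling the FDR, which can be sketched as follows. The p-values are invented for illustration, and the function name is hypothetical. The procedure sorts the p-values, finds the largest rank k whose p-value is at most k * level / m, and rejects the k smallest.

```python
def benjamini_hochberg(p_values, fdr_level=0.05):
    """Return a list of booleans: True where the hypothesis is rejected."""
    m = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Largest rank k (1-based) with p_(k) <= k * fdr_level / m.
    largest_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank * fdr_level / m:
            largest_rank = rank

    # Reject every hypothesis ranked at or below that rank.
    rejected = [False] * m
    for index in order[:largest_rank]:
        rejected[index] = True
    return rejected


# Hypothetical p-values for 10 genes.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.9]

rejected = benjamini_hochberg(p_values, fdr_level=0.05)
n_unadjusted = sum(p < 0.05 for p in p_values)

print(sum(rejected), n_unadjusted)
```

In this example five p-values fall below 0.05, but only the two smallest are declared significant once the FDR is controlled at 0.05.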

- Clustering: grouping genes into different “clusters” based on their expression profiles

- Gene enrichment test: relating gene expression to biological functional categories
- Connecting microarray data with other kinds of data such as survival data.
- More …

- Nettleton, D. (2006) A discussion of statistical methods for design and analysis of microarray experiments for plant scientists. The Plant Cell, 18, 2112–2121.