Statistical Design and Analysis of Microarray Experiments

Peng Liu

6/15/2010


Microarray Technology

  • Microarray technology allows measuring expression levels (abundance of mRNA transcripts) of thousands of genes simultaneously.

  • Two types of platforms:

    • Affymetrix (single-color)

    • Two-color microarray

Wild-type vs. Myostatin Knockout Mice

Belgian Blue cattle have a mutation in the myostatin gene.

Design of Affymetrix experiment: one sample → one chip

Designing 2-color microarray (3 layers)

From Churchill, 2002, Nature Genetics




bundle sheath strands

mesophyll protoplasts

Example I: Sawers et al, 2007, BMC Bioinformatics

  • The establishment of C4 photosynthesis in maize is associated with differential accumulation of gene transcripts and proteins between bundle sheath and mesophyll photosynthetic cell types.

  • Goal: To detect genes that are differentially expressed in Bundle Sheath (B) and Mesophyll (M) cells.

Example I: Sawers et al, 2007, BMC Bioinformatics

  • A simple method:

    Isolate the cells and perform a microarray experiment to compare gene expression between the two cell types (treatments).

Example I: Sawers et al, 2007, BMC Bioinformatics

  • A complication:

    The procedures for extracting mRNA from the two cell types are different. The procedure used for M cells introduces stress.

  • Solution:

    Add two more treatment groups: samples containing both M and B cells that go through mRNA extraction with and without stress.

    B, M, Stress (ST) and Total (TO): 4 treatment groups

Direct comparison vs indirect comparison

  • Direct: comparison within slide

  • Indirect: comparison between slides

  • Suppose we want to compare gene expression levels between treatment 1 and treatment 2.








Figure panels: Direct Comparison; Indirect Comparison

Comments about 2-color Microarray Designs

  • A unique and powerful feature of 2-color microarrays is the ability to compare two samples directly on the same slide.

  • For paired samples, the variation due to slide can be accounted for.

  • When possible, it is more efficient to use direct comparison.

  • However, it is sometimes not practical to compare all possible pairs directly.

Efficiency of comparison

  • The efficiency of comparisons between 2 samples is determined by the length and the number of paths connecting them.
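One way to make the role of path length and path number concrete is the resistor analogy sketched below. It assumes that each slide yields one log ratio with the same independent error variance (taken as 1 unit) and that alternative paths share no slides. The function names are illustrative and do not come from the slides.

```python
from fractions import Fraction

def path_variance(length):
    """Variance of a contrast estimated along one path of `length` slides.

    The errors of consecutive slides add up, like resistors in series.
    """
    return Fraction(length)

def combine_paths(*variances):
    """Variance after combining independent path estimates.

    Inverse-variance weighting, like resistors in parallel.
    """
    return 1 / sum(1 / v for v in variances)

direct = path_variance(1)      # treatments 1 and 2 on the same slide
indirect = path_variance(2)    # each treatment on its own slide against a reference
dye_swap = combine_paths(path_variance(1), path_variance(1))  # two direct slides

print(direct, indirect, dye_swap)   # 1 2 1/2
```

Under these assumptions a longer path raises the variance of the comparison, and every additional independent path lowers it.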








Figure panels: Direct Comparison; Indirect Comparison

Reference vs Loop design








Figure panels: Reference Design; Loop Design
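The two designs can be compared with a small calculation. This is a sketch under stated assumptions: one slide per arrow, equal and independent error variance per slide (1 unit), and a simple loop in which each treatment appears on exactly two slides. Both designs then use n slides for n treatments. The function names are hypothetical.

```python
from fractions import Fraction

def reference_variance():
    """Variance of a comparison between any two treatments in a reference design.

    The two treatments are linked only through the reference sample,
    which is a single path of length 2.
    """
    return Fraction(2)

def loop_variance(n_treatments, distance):
    """Variance of a comparison between two treatments in a loop design.

    Treatments `distance` slides apart in a loop of `n_treatments` are
    linked by two paths (lengths d and n - d) that share no slides.
    Requires 1 <= distance <= n_treatments - 1.
    """
    d, n = distance, n_treatments
    return Fraction(d * (n - d), n)

for n in (3, 4, 8):
    best = loop_variance(n, 1)          # neighbouring treatments
    worst = loop_variance(n, n // 2)    # treatments farthest apart in the loop
    print(n, best, worst, reference_variance())
```

With these assumptions a small loop beats the reference design for every pair, while in a loop of 8 treatments the most distant pair is compared no more precisely than in the reference design.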





Designing experiment for example I

With 6 biological replicates

Performing the experiment (Nature Cell Biol. 2001 3:8)

After the bench work…

Affymetrix Gene Chip image

2-color microarray image

The data table looks like

Pre-normalization analysis

  • Image processing

    • obtain the intensity measurement of the signal

  • Background correction

    • remove local background that may be due to non-specific binding, and obtain the target sample intensity

  • Filtration

    • remove unreliable spots and reduce the dimension of data

  • Transformation

    • convert data into a format that makes data analysis valid or easier
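The last three steps can be sketched for a single channel as follows. Image processing is not covered. The input format, the subtraction rule, the filtering rule (drop spots whose corrected signal is not positive) and the log2 transform are assumptions chosen for illustration, not details given in the slides.

```python
import math

def preprocess(spots, min_signal=0.0):
    """Background-correct, filter, and log2-transform spot intensities.

    `spots` is a list of (foreground, background) pairs for one channel.
    Returns log2 intensities, with None marking filtered spots.
    """
    result = []
    for foreground, background in spots:
        signal = foreground - background        # background correction
        if signal <= min_signal:                # filtration (hypothetical rule)
            result.append(None)
        else:
            result.append(math.log2(signal))    # transformation
    return result

example = [(1124.0, 100.0), (90.0, 100.0), (356.0, 100.0)]
print(preprocess(example))   # [10.0, None, 8.0]
```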


  • Normalization describes the process of removing (or minimizing) non-biological variation in measured signal intensity levels so that biological differences in gene expression can be appropriately detected.

  • Aim: remove sources of systematic variation

  • Example of non-biological variation: dye difference for 2-color microarray

Figure from Dudoit et al, 2002, Statistica Sinica

Self-self experiment

Normalization: M vs. A Plot (45° rotation)

M = Log Red - Log Green

A = (Log Green + Log Red)/2

After normalization

Normalized M
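The M and A values can be computed directly from the two channel intensities. The slides do not state which normalization method produced the normalized M, so the sketch below substitutes the simplest option, global median centering of M, in place of an intensity-dependent fit. The base-2 logarithm and the toy intensities are also assumptions.

```python
import math
from statistics import median

def ma_values(red, green):
    """Return (M, A) lists from the red and green intensities of one slide."""
    m = [math.log2(r) - math.log2(g) for r, g in zip(red, green)]
    a = [(math.log2(r) + math.log2(g)) / 2 for r, g in zip(red, green)]
    return m, a

def median_normalize(m):
    """Shift M so that its median is zero (global normalization)."""
    centre = median(m)
    return [value - centre for value in m]

# Toy intensities in which red is mostly twice as bright as green.
red = [1024.0, 512.0, 2048.0, 256.0]
green = [512.0, 256.0, 1024.0, 256.0]
m, a = ma_values(red, green)
m_norm = median_normalize(m)
print(m, m_norm)   # [1.0, 1.0, 1.0, 0.0] [0.0, 0.0, 0.0, -1.0]
```

In a self-self experiment both channels carry the same sample, so any systematic departure of M from zero reflects dye or other technical effects that normalization is meant to remove.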







Statistical Inference

  • Data notation for normalized signal intensities (NSI):

    Yijk for each gene (g)

    i: treatment index

    j: dye index

    k: slide index

Fitting linear models to microarray data

  • After the normalization, we have one observation (normalized signal intensity) for each gene on each channel (a combination of dye and array).

  • Together, the data is an array with each row for one gene and each column for one channel or one chip.

  • We will fit a statistical model for each gene separately.

Mean expressions for 4 treatment groups

Treatment means

  • M (M cell with stress): μ + v2 + δ

  • B (B cell without stress): μ + v1

  • TO (both cells without stress): μ + c*v2 + (1-c)*v1

  • ST (both cells with stress): μ + c*v2 + (1-c)*v1 + δ

  • Note that c is the proportion of M cells in the total leaf sample with both cells, and δ is the effect of the stress introduced by the extraction procedure.

  • We are interested in testing H0: v1 = v2, that is, whether a given gene is differentially expressed between M and B cells.
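A short numerical check shows why the two extra treatment groups are useful. Comparing M with B directly mixes the cell-type difference with the stress effect, while ST minus TO isolates the stress effect so it can be subtracted. The parameter values are made up for illustration, and the stress effect is named `stress` in the code.

```python
def treatment_means(mu, v1, v2, stress, c):
    """Mean expression of one gene in the four treatment groups.

    v1, v2: B-cell and M-cell effects; stress: effect of the stressful
    extraction; c: proportion of M cells in the total leaf sample.
    """
    total = mu + c * v2 + (1 - c) * v1
    return {
        "M": mu + v2 + stress,
        "B": mu + v1,
        "TO": total,
        "ST": total + stress,
    }

# Hypothetical parameter values for a single gene.
means = treatment_means(mu=8.0, v1=1.0, v2=2.5, stress=0.75, c=0.5)
naive = means["M"] - means["B"]                    # v2 - v1 + stress
corrected = naive - (means["ST"] - means["TO"])    # v2 - v1
print(naive, corrected)   # 2.25 1.5
```

The corrected contrast (M - B) - (ST - TO) has mean v2 - v1 for any value of c, so it targets the hypothesis H0: v1 = v2.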

Fixed effects

  • The parameters on the previous slide (v1, v2, and the stress effect δ) specify fixed effects.

  • Fixed effects are used to specify the mean of the response variable.

  • A factor is fixed if the levels of the factor were selected by the investigator with the purpose of comparing the effects of the levels to one another.

  • The fixed effects included in the model depend on the experimental design.

Random effects

  • There are some random effects that are unknown:

    • slide effects

    • other effects introduced in the experiment (such as biological replicate effects)

    • residual random effects that include any sources of variation unaccounted for by other terms





Random effects

  • Random factors are used to specify the correlation structure among the response variable observations.

    • e.g., observations on the same slide are more correlated than observations from different slides.

  • The random effects included in the model also depend on the experimental design.

  • A model that has both fixed and random effects is called a mixed model.
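The effect of a random slide term on the correlation structure can be illustrated with a minimal sketch. It assumes that each observation is a mean plus an independent slide effect plus an independent residual, with the variances passed in as arguments. The numbers are hypothetical.

```python
def within_slide_correlation(var_slide, var_residual):
    """Correlation of two observations that share a random slide effect."""
    return var_slide / (var_slide + var_residual)

def difference_variance(var_slide, var_residual, same_slide):
    """Variance of the difference between two observations.

    On the same slide the shared slide effect cancels; on different
    slides both slide effects contribute.
    """
    if same_slide:
        return 2 * var_residual
    return 2 * (var_slide + var_residual)

rho = within_slide_correlation(var_slide=3.0, var_residual=1.0)
print(rho)                                        # 0.75
print(difference_variance(3.0, 1.0, True))        # 2.0
print(difference_variance(3.0, 1.0, False))       # 8.0
```

This is the same reason direct comparisons on one slide are more efficient: the slide effect drops out of the within-slide difference.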

Detecting differentially expressed genes

  • Construct a statistical test for the parameters of interest, e.g., what is the difference in gene expression (v1 - v2)?

    v1 - v2 ≠ 0 means differential expression.

  • Model the random effects and perform tests or construct confidence intervals.

  • Perform tests for each gene and obtain a p-value.

    • Empirical Bayes test that borrows information across genes is often used because of higher power.
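The idea of borrowing information across genes can be sketched with a variance-shrinkage t statistic. The slides do not say which empirical Bayes test was used, so this follows one common form, in which the gene's variance is averaged with a prior variance weighted by degrees of freedom. It is also simplified to a one-sample test on log ratios from direct comparisons, not the full mixed model. The prior values and data are hypothetical.

```python
import math
from statistics import mean, variance

def t_statistic(log_ratios):
    """Ordinary one-sample t statistic for H0: mean log ratio = 0."""
    n = len(log_ratios)
    return mean(log_ratios) / math.sqrt(variance(log_ratios) / n)

def moderated_t(log_ratios, prior_var, prior_df):
    """t statistic with the gene's variance shrunk toward a prior value.

    prior_var and prior_df would normally be estimated from all genes;
    here they are supplied by hand as hypothetical inputs.
    """
    n = len(log_ratios)
    df = n - 1
    shrunk = (prior_df * prior_var + df * variance(log_ratios)) / (prior_df + df)
    return mean(log_ratios) / math.sqrt(shrunk / n)

# A gene whose sample variance happens to be very small.
gene = [1.0, 1.1, 0.9, 1.0]
ordinary = t_statistic(gene)
moderated = moderated_t(gene, prior_var=0.25, prior_df=4)
print(round(ordinary, 1), round(moderated, 1))
```

With only a few replicates, a gene can get an extreme ordinary t statistic just because its variance estimate is small by chance. Shrinking the variance toward a value shared across genes tempers that.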

Results from testing

2536 p-values below 0.05.


We would expect around 0.05*40000=2000 p-values to be less than 0.05 by chance if no genes were differentially expressed.
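The arithmetic behind this comparison is short. The last line is a crude bound that assumes every gene is null, so it overstates the share of false positives if some genes are truly differentially expressed.

```python
n_genes = 40000
alpha = 0.05
observed = 2536

expected_by_chance = alpha * n_genes            # about 2000 if every null is true
excess = observed - expected_by_chance          # about 536 more than chance alone
rough_false_fraction = expected_by_chance / observed

print(round(expected_by_chance), round(excess), round(rough_false_fraction, 2))
# 2000 536 0.79
```

So calling all 2536 genes significant at the 0.05 level could leave roughly four out of five of them as false positives, which motivates the error rates defined next.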

Possible Errors in Testing ONE gene

  • Type I Error: false positives

  • Type II Error: false negatives (1-power)

  • Power: true positives

Error Rate in Multiple Testing

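Outcomes when testing m genes

(Benjamini and Hochberg, 1995)

Here V is the number of false positives (true null hypotheses that are rejected) and R is the total number of rejected hypotheses.

Family-wise error rate, FWER = Pr(V > 0)

False Discovery Rate, FDR = E(V/R | R > 0) * Pr(R > 0)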
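The two error rates lead to different rejection rules. The sketch below contrasts a Bonferroni cutoff, which controls the FWER, with the step-up procedure of Benjamini and Hochberg (1995), which controls the FDR for independent tests. The p-values are made up for illustration.

```python
def bonferroni(pvalues, alpha=0.05):
    """Reject when p <= alpha / m; controls the FWER at level alpha."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]

def benjamini_hochberg(pvalues, q=0.05):
    """Step-up procedure: reject the k smallest p-values, where k is the
    largest rank with p_(k) <= k * q / m."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff_rank = 0
    for rank, index in enumerate(order, start=1):
        if pvalues[index] <= rank * q / m:
            cutoff_rank = rank              # keep the largest passing rank
    rejected = [False] * m
    for index in order[:cutoff_rank]:
        rejected[index] = True
    return rejected

# Hypothetical p-values for 10 genes.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(sum(bonferroni(pvals)), sum(benjamini_hochberg(pvals)))   # 1 2
```

Controlling the FDR is less strict than controlling the FWER, so it usually yields more discoveries when thousands of genes are tested.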

Results from testing for example I


  • Grouping genes into different “clusters” based on their expression profile → Clustering

Other analyses

  • Relating the gene expressions with biological functional categories → Gene Enrichment Test

  • Connecting microarray data with other kinds of data such as survival data.

  • More …
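The slides do not specify how the gene enrichment test is carried out. One common choice, assumed here, is an upper-tail hypergeometric test that asks whether a functional category is over-represented among the genes declared differentially expressed. The counts in the example are hypothetical.

```python
from math import comb

def enrichment_pvalue(n_genes, n_in_category, n_selected, n_overlap):
    """Upper-tail hypergeometric probability of seeing at least
    `n_overlap` category genes among `n_selected` genes drawn at random."""
    total = comb(n_genes, n_selected)
    upper = min(n_in_category, n_selected)
    tail = sum(
        comb(n_in_category, k) * comb(n_genes - n_in_category, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Hypothetical counts: 1000 genes on the array, 50 in the category,
# 100 declared differentially expressed, 12 of those in the category.
p = enrichment_pvalue(n_genes=1000, n_in_category=50, n_selected=100, n_overlap=12)
print(p)
```

By chance about 5 of the 100 selected genes would fall in the category, so a small p-value here suggests the category is enriched.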

Assigned References

  • Nettleton, D. (2006) A discussion of statistical methods for design and analysis of microarray experiments for plant scientists. The Plant Cell, 18, 2112–2121.
