Statistical Design and Analysis

Statistical Design and Analysis of Microarray Experiments, Peng Liu, 6/15/2010



Statistical Design and Analysis of Microarray Experiments

Peng Liu

6/15/2010



Microarray Technology

  • Microarray technology allows measuring expression levels (abundance of mRNA transcripts) of thousands of genes simultaneously.

  • Two types of platforms:

    • Affymetrix (single-color)

    • Two-color microarray



Wild-type vs. Myostatin Knockout Mice

Belgian Blue cattle have a mutation in the myostatin gene.

Design of Affymetrix experiment: one sample → one chip



Designing 2-color microarray (3 layers)

From Churchill, 2002, Nature Genetics


Example I: Sawers et al, 2007, BMC Bioinformatics

[Figure: panels labeled M, B, and V, showing bundle sheath strands and mesophyll protoplasts]



Example I: Sawers et al, 2007, BMC Bioinformatics

  • The establishment of C4 photosynthesis in maize is associated with differential accumulation of gene transcripts and proteins between bundle sheath and mesophyll photosynthetic cell types.

  • Goal: To detect genes that are differentially expressed between bundle sheath (B) and mesophyll (M) cells.



Example I: Sawers et al, 2007, BMC Bioinformatics

  • A simple method:

    Isolate the cells and perform a microarray experiment to compare gene expression between the two cell types (treatments).



Example I: Sawers et al, 2007, BMC Bioinformatics

  • A complication:

    The procedures for extracting mRNA from the two cell types are different. The procedure used for M cells introduces stress.

  • Solution:

    Add two more treatment groups: total leaf samples containing both M and B cells, with mRNA extracted with and without stress.

    B, M, Stress, and Total (4 treatment groups)



Direct comparison vs indirect comparison

  • Direct: comparison within slide

  • Indirect: comparison between slides

  • Suppose we want to compare gene expression levels between treatment 1 and treatment 2.

[Figure: design diagrams. Direct comparison: treatments 1 and 2 on the same slide. Indirect comparison: treatments 1 and 2 linked through R.]



Comments about 2-color Microarray Designs

  • A unique and powerful feature of 2-color microarrays is that two samples can be compared directly on the same slide.

  • When samples are paired on a slide, the variation due to slide can be accounted for.

  • When possible, it is more efficient to use direct comparisons.

  • However, it is sometimes not practical to compare all possible pairs directly.



Efficiency of comparison

  • The efficiency of comparisons between 2 samples is determined by the length and the number of paths connecting them.

[Figure: design diagrams. Direct comparison (dye-swap): treatments 1 and 2 on the same slides. Indirect comparison: treatments 1 and 2 linked through R.]
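As a rough illustration of why path length and path count matter, the sketch below compares two slides used for a direct dye-swap comparison with two slides used for an indirect comparison through R. It assumes that every slide yields one log ratio with the same independent error variance `sigma2`; that assumption and the function names are introduced here for illustration and do not come from the slides.

```python
def direct_variance(sigma2, n_slides=2):
    """Variance of the average of n_slides independent log ratios (1 vs 2)."""
    return sigma2 / n_slides


def indirect_variance(sigma2):
    """Variance of (1 vs R) minus (2 vs R), one slide each: variances add."""
    return sigma2 + sigma2


sigma2 = 1.0  # assumed per-slide error variance of a log ratio
ratio = indirect_variance(sigma2) / direct_variance(sigma2)
print(ratio)  # 4.0: under these assumptions the indirect estimate is four times as variable
```

Under these assumptions, two short paths (the dye-swap pair) give a quarter of the variance of one path of length two, using the same number of slides.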



Reference vs Loop design

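[Figure: Reference design (treatments 1, 2, and 3 each compared with R) vs. loop design (treatments 1, 2, and 3 compared with one another in a loop)]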


Designing experiment for example I

[Figure: design diagram connecting the four treatment groups B, Total, Stress, and M]

With 6 biological replicates


Performing the experiment (Nature Cell Biol. 2001 3:8)



After the bench work…

Affymetrix Gene Chip image

2-color microarray image



The data table looks like



Pre-normalization analysis

  • Image processing

    • obtain the intensity measurement of the signal

  • Background correction

    • remove local background, which may be due to non-specific binding, to obtain the target sample intensity

  • Filtration

    • remove unreliable spots and reduce the dimension of data

  • Transformation

    • convert data into a format that makes data analysis valid or easier
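The last three steps above can be sketched for a single spot as follows. This is a minimal illustration, not the method used in the presentation: subtracting the background, filtering on a minimum corrected signal (`min_signal`), and using a log2 transformation are all assumptions, and the intensities are made up.

```python
import math


def preprocess_spot(foreground, background, min_signal=1.0):
    """Background-correct, filter, and log2-transform one spot intensity.

    Returns None when the spot is filtered out as unreliable.
    """
    corrected = foreground - background   # background correction (assumed: subtraction)
    if corrected < min_signal:            # filtration (assumed: minimum-signal rule)
        return None
    return math.log2(corrected)           # transformation (assumed: log base 2)


# Made-up (foreground, background) intensity pairs for three spots.
spots = [(1124.0, 100.0), (105.0, 100.0), (90.0, 100.0)]
values = [preprocess_spot(fg, bg, min_signal=16.0) for fg, bg in spots]
print(values)  # [10.0, None, None]
```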



Normalization

  • Normalization describes the process of removing (or minimizing) non-biological variation in measured signal intensity levels so that biological differences in gene expression can be appropriately detected.

  • Aim: remove sources of systematic variation

  • Example of non-biological variation: dye difference for 2-color microarray



Figure from Dudoit et al, 2002, Statistica Sinica

Self-self experiment


Normalization: M vs. A Plot (45° rotation)

M = Log Red - Log Green

A = (Log Green + Log Red)/2
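The two quantities plotted in an M vs. A plot can be computed per spot as in the sketch below. The slide writes only "Log"; base 2 is assumed here, and the function name and intensities are illustrative.

```python
import math


def ma_values(red, green):
    """Return (M, A) for one spot from its red and green intensities."""
    m = math.log2(red) - math.log2(green)        # log ratio
    a = (math.log2(green) + math.log2(red)) / 2  # average log intensity
    return m, a


m, a = ma_values(red=1024.0, green=256.0)
print(m, a)  # 2.0 9.0
```

With base 2, M = 2 means the red signal is four times the green signal, and A places the spot on the overall intensity scale.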



LOWESS Fit

[Figure: scatter plot with LOWESS fit; y-axis: Log Red - Log Green (M), x-axis: (Log Green + Log Red)/2 (A)]
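The idea behind the LOWESS fit can be sketched in plain Python as a tricube-weighted local linear regression of M on A, with the fitted curve then subtracted from M. This is a simplified stand-in, not a full LOWESS implementation: it has no robustness iterations, and the span `frac`, the function names, and the demo data are assumptions made for illustration.

```python
def local_linear_fit(a_values, m_values, frac=0.5):
    """Tricube-weighted local linear fit of M on A, evaluated at each A.

    Simplified LOWESS: a single pass, no robustness iterations.
    """
    n = len(a_values)
    k = min(n, max(2, int(round(frac * n))))   # points inside each local window
    fitted = []
    for a0 in a_values:
        dists = sorted(abs(a - a0) for a in a_values)
        h = dists[k - 1] or 1e-12              # bandwidth: distance to the k-th nearest point
        w = [(1 - min(abs(a - a0) / h, 1.0) ** 3) ** 3 for a in a_values]
        sw = sum(w)                            # at least 1: the point itself has weight 1
        a_bar = sum(wi * a for wi, a in zip(w, a_values)) / sw
        m_bar = sum(wi * m for wi, m in zip(w, m_values)) / sw
        sxx = sum(wi * (a - a_bar) ** 2 for wi, a in zip(w, a_values))
        sxy = sum(wi * (a - a_bar) * (m - m_bar)
                  for wi, a, m in zip(w, a_values, m_values))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(m_bar + slope * (a0 - a_bar))
    return fitted


def normalize_m(a_values, m_values, frac=0.5):
    """Subtract the fitted intensity-dependent trend from M."""
    fit = local_linear_fit(a_values, m_values, frac)
    return [m - f for m, f in zip(m_values, fit)]


# Made-up data: M depends only on A (pure dye bias, no biological signal).
a_demo = [6 + 0.5 * i for i in range(10)]
m_demo = [0.5 + 0.1 * a for a in a_demo]
normalized = normalize_m(a_demo, m_demo)  # every value is close to 0
```

In this demo the trend is exactly linear, so the local fit recovers it and the normalized M values are essentially zero, which is the pattern expected for a self-self experiment after normalization.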



After normalization

[Figure: scatter plot after normalization; y-axis: normalized M, x-axis: A]


Statistical Inference

[Figure: example observations Y224 and Y114, with the subscripts annotated as treatment, dye, and slide; e.g., Y224 is treatment 2, dye 2, slide 4]

  • Data notation for normalized signal intensities (NSI):

    Yijk for each gene (g)

    i: treatment index

    j: dye index

    k: slide index



Fitting linear models to microarray data

  • After the normalization, we have one observation (normalized signal intensity) for each gene on each channel (a combination of dye and array).

  • Together, the data form a matrix with one row per gene and one column per channel (or chip).

  • We will fit a statistical model for each gene separately.



Mean expressions for 4 treatment groups

Treatment means

  • M (M cells, with stress): μ + v2 + δ

  • B (B cells, without stress): μ + v1

  • TO (both cell types, without stress): μ + c*v2 + (1-c)*v1

  • ST (both cell types, with stress): μ + c*v2 + (1-c)*v1 + δ

  • Note that c is the proportion of M cells in the total leaf sample with both cell types, and δ is the effect of extraction stress.

  • We are interested in testing H0: v1 = v2, that is, whether a given gene is differentially expressed between M and B cells.
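A short numerical check shows how the two extra treatment groups let the stress effect be separated from the cell-type difference. In this sketch `delta` is a name assumed here for the stress effect, and all parameter values are made up; the point is only that (M - B) - (ST - TO) equals v2 - v1 whatever the values of `delta` and `c`.

```python
def treatment_means(mu, v1, v2, delta, c):
    """Mean expression of the four treatment groups for one gene.

    delta: stress effect (assumed name); c: proportion of M cells in total leaf.
    """
    return {
        "M": mu + v2 + delta,
        "B": mu + v1,
        "TO": mu + c * v2 + (1 - c) * v1,
        "ST": mu + c * v2 + (1 - c) * v1 + delta,
    }


means = treatment_means(mu=8.0, v1=1.0, v2=2.5, delta=0.75, c=0.25)

# M - B mixes the cell-type difference with stress; ST - TO isolates stress.
contrast = (means["M"] - means["B"]) - (means["ST"] - means["TO"])
print(contrast)  # 1.5, which equals v2 - v1
```

Testing H0: v1 = v2 therefore amounts to testing whether this contrast of the four treatment means is zero.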



Fixed effects

  • The parameters on the previous slide (v1, v2, and the stress effect δ) specify fixed effects.

  • Fixed effects are used to specify the mean of the response variable.

  • A factor is fixed if the levels of the factor were selected by the investigator with the purpose of comparing the effects of the levels to one another.

  • The fixed effects included in the model depend on the experimental design.



Random effects

  • There are some random effects that are unknown:

    • slide effects

    • other effects introduced in the experiment (such as biological replicate effects)

    • residual random effects that include any sources of variation unaccounted for by other terms

[Figure: design diagram for the four treatment groups B, Total, Stress, and M]



Random effects

  • Random factors are used to specify the correlation structure among the response variable observations.

    • e.g., observations on the same slide are more correlated than observations from different slides.

  • The random effects included in the model also depend on the experimental design.

  • A model that has both fixed and random effects is called a mixed model.
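One way such a mixed model might be written for a single gene is sketched below, using the Yijk notation introduced earlier. The specific terms are assumptions made for illustration rather than the model from the presentation: a fixed treatment mean, a fixed dye effect, a random slide effect, and a residual, with normal distributions for the random terms.

```latex
Y_{ijk} = \mu_i + d_j + s_k + e_{ijk}, \qquad
s_k \sim N(0, \sigma_s^2), \quad
e_{ijk} \sim N(0, \sigma_e^2)
```

Here \mu_i is the mean for treatment i and d_j is the effect of dye j (both fixed), s_k is the random effect of slide k, and e_{ijk} is the residual. Two observations on the same slide share s_k, so their covariance is \sigma_s^2, which is how the random slide effect encodes the within-slide correlation.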



Detecting differentially expressed genes

  • Construct statistical tests for the parameters of interest, e.g., the difference in gene expression, v1 - v2.

    v1 - v2 ≠ 0 means differential expression.

  • Model the random effects and perform tests or construct confidence intervals.

  • Perform tests for each gene and obtain a p-value.

    • An empirical Bayes test that borrows information across genes is often used because of its higher power.



Results from testing



2536 p-values are below 0.05.

We would expect around 0.05 × 40000 = 2000 p-values to be less than 0.05 by chance if no genes were differentially expressed.
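The arithmetic on this slide can be reproduced directly. The last line of the sketch goes one step further than the slide and is only a crude, conservative guide: it treats all 40000 genes as if they were not differentially expressed when estimating how many of the 2536 calls could be false.

```python
n_genes = 40000
alpha = 0.05
observed = 2536  # p-values below 0.05, from the slide

expected_by_chance = round(alpha * n_genes)  # 2000 if no gene is differentially expressed
excess = observed - expected_by_chance       # 536 more than chance alone would give
rough_false_fraction = expected_by_chance / observed

print(expected_by_chance, excess, round(rough_false_fraction, 2))  # 2000 536 0.79
```

Under that assumption, roughly four out of five genes with p < 0.05 could be false positives, which motivates the error rates for multiple testing.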



Possible Errors in Testing ONE gene

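  • Type I error: a false positive (a gene is declared differentially expressed when it is not)

  • Type II error: a false negative (a differentially expressed gene is not detected); its probability is 1 - power

  • Power: the probability of a true positive (a differentially expressed gene is detected)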



Error Rate in Multiple Testing

Outcomes when testing m genes

(Benjamini and Hochberg, 1995)

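Family-wise error rate: FWER = Pr(V > 0)

False discovery rate: FDR = E(V/R | R > 0) * Pr(R > 0)

Here V is the number of false positives (true null hypotheses that are rejected) and R is the total number of genes declared differentially expressed.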
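One widely used way to control the FDR is the step-up procedure from Benjamini and Hochberg (1995). The slides cite that paper but do not say which procedure was applied, so the sketch below is an illustration only; the function name and the ten p-values are made up.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure at FDR level q.

    Returns a list of booleans, True where the hypothesis is rejected.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p-value
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= q * rank / m:   # largest rank meeting the bound wins
            cutoff_rank = rank
    rejected = [False] * m
    for idx in order[:cutoff_rank]:         # reject everything up to that rank
        rejected[idx] = True
    return rejected


# Made-up p-values for ten genes.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
flags = benjamini_hochberg(p, q=0.05)
print(sum(flags))  # 2
```

In this example five p-values are below 0.05, but only two genes are declared differentially expressed once the FDR is controlled at 0.05.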



Results from testing for example I



Clustering

  • Grouping genes into different “clusters” based on their expression profiles → Clustering



Other analyses

  • Relating gene expression to biological functional categories → Gene Enrichment Test

  • Connecting microarray data with other kinds of data such as survival data.

  • More …
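The slides do not say which gene enrichment test is meant. One common form is a one-sided hypergeometric test, which asks whether a functional category is over-represented among the genes declared differentially expressed. The sketch below assumes that form; the function name and the counts are made up for illustration.

```python
from math import comb


def enrichment_p_value(n_genes, n_category, n_selected, n_overlap):
    """P(overlap >= n_overlap) under random selection (hypergeometric tail).

    n_genes: genes on the array; n_category: genes in the functional category;
    n_selected: genes declared differentially expressed;
    n_overlap: selected genes that belong to the category.
    """
    upper = min(n_category, n_selected)
    total = comb(n_genes, n_selected)
    tail = sum(
        comb(n_category, k) * comb(n_genes - n_category, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Made-up counts: all 5 selected genes fall in a 5-gene category out of 20 genes.
p_value = enrichment_p_value(n_genes=20, n_category=5, n_selected=5, n_overlap=5)
```

A small p-value suggests that the category contains more differentially expressed genes than random selection would explain.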



Assigned References

  • Nettleton, D. (2006). A discussion of statistical methods for design and analysis of microarray experiments for plant scientists. The Plant Cell, 18, 2112–2121.

