Statistical Analysis of
Download
1 / 26

Statistical Analysis of Microarray Data - PowerPoint PPT Presentation


  • 145 Views
  • Uploaded on

Statistical Analysis of Microarray Data. By H. Bjørn Nielsen & Hanne Jarmer. Question Experimental Design. The DNA Array Analysis Pipeline. Array design Probe design. Sample Preparation Hybridization. Buy Chip/Array. Image analysis. Normalization. Expression Index Calculation.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Statistical Analysis of Microarray Data' - edna


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

Statistical Analysis of Microarray Data

By

H. Bjørn Nielsen & Hanne Jarmer


The dna array analysis pipeline

Question

Experimental Design

The DNA Array Analysis Pipeline

Array design

Probe design

Sample Preparation

Hybridization

Buy Chip/Array

Image analysis

Normalization

Expression Index

Calculation

Comparable

Gene Expression Data

Statistical Analysis

Fit to Model (time series)

Advanced Data Analysis

Clustering PCA Classification Promoter Analysis

Meta analysis Survival analysis Regulatory Network


What's the question?

Typically we want to identify differentially expressed genes

Example:

alcohol dehydrogenase is expressed at a higher level when alcohol is added to the media

alcohol dehydrogenase

without alcohol

with alcohol


He’s going to say it

However, the measurements contain stochastic noise

There is no way around it

Statistics


statistics

Noisy

measurements

p-value

You can choose to think of statistics as a black box

But, you still need to understand how to interpret the results


The output of the statistics

P-value

The chance of rejecting the null hypothesis by coincidence

----------------------------

For gene expression analysis we can say:

the chance that a gene is categorized as differentially expressed by coincidence


The statistics gives us a p-value for each gene

We can rank the genes according to the p-value

But, we can’t really trust the p-value in a strict statistical way!


Why not!

For two reasons:

We are rarely fulfilling all the assumptions of the statistical test

We have to take multi-testing into account


The t-test Assumptions

1. The observations in the two categories must be independent

2. The observations should be normally distributed

3. The sample size must be ‘large’ (>30 replicates)


Multi-testing?

In a typical microarray analysis we test thousands of genes

If we use a significance level of 0.05

and we test 1000 genes. We expect 50 genes to be significant by chance

1000 x 0.05 = 50


Bonferroni:

P ≤

Confidence level of 99%

0.01

N

Benjamini-Hochberg:

P ≤

i

N

0.01

Correction for multiple testing

N = number of genes

i = number of accepted genes


But really, those methods are too strict

What we can trust is the ability of the statistical test to rank the genes according to their reliability

The number of genes that are needed or can be handled in downstream processes can be used to set the cutoff

If we permute the samples we can get an estimate of the False Discovery Rate (FDR) in this set


P-value

log2 fold change (M)

Volcano Plot



The t-test

Calculate T

Lookup T in a table


Density

Intensity

of gene x

The t-test II

The t-test tests for difference in means ()

mut mutant

wt wt


The t-test III

The t statistic is based on the sample mean and variance

t


ANOVA

ANalysis Of Variance

Very similar to the t-test, but can test multiple categories

Ex: is gene x differentially expressed between wt, mutant 1 and mutant 2

Advantage: it has more ‘power’ than the t-test


Variance between groups

Variance within groups

ANOVA II

Density

Intensity


Blocks and paired tests

Some undesired factors may influence the experiments, the effect of such can be greatly reduced if they are blocked out or if the experiment is paired.

Some possible blocks:

- Dye

- Patient

- Technician

- Batch

- Day of experiment

- Array


Paired t-test

Hypothesis:

There is no difference between the mean BLUE and RED

Block 1

Block 2

Block 3


A187 CreA

Glucose Ethanol Glucose Ethanol

3x

An example: (2-way ANOVA with blocks)

Experimental Design:

Everything was done in batches to capture the systematic noise


Example batch to batch variation
Example: Batch to batch variation

  • Within batch variation is lower than the between batch variation


Example data analysis blocking

Two-way ANOVA

Effect 1

Ethanol

Glucose

Effect 2

A187 CreA

Batch ABC

Example: Data analysis - Blocking

  • We can capture the batch variation by blocking


Ex result of the 2 way anova 3 p values

A genotype p-value

A growth media p-value

Two-way ANOVA

An interaction p-value

Media effect

Glucose

Ethanol

Genotype

effect

A187 CreA

Batch ABC

Ex: Result of the 2-way ANOVA(3 p-values)


Conclusion
Conclusion

  • Array data contains stochastic noise

    • Therefore statistics is needed to conclude on differential expression

  • We can’t really trust the p-value

  • But the statistics can rank genes

  • The capacity/needs of downstream processes can be used to set cutoff

  • FDR can be estimated

  • t-test is used for two category tests

  • ANOVA is used for multiple categories


ad