Introduction to Bioinformatics 5 . Statistical Analysis of Gene Expression Matrices I




### Introduction to Bioinformatics 5. Statistical Analysis of Gene Expression Matrices I

Course 341

Department of Computing

Imperial College, London

Moustafa Ghanem

Lecture Overview
• Motivation
• Identifying differentially expressed genes
• Calculating effect: fold ratio
• Calculating significance: p-values
• Statistical Analysis
• Paired and unpaired experiments
• Need for significance testing
• Hypothesis testing
• t-tests and p-values
• t-tests
• Paired and unpaired t-tests
• Formulae for t-test
• Single-tail vs. two tails t-tests
• Looking up p-values
Motivation: Large-scale Differential Gene Expression Analysis
• Consider a microarray experiment that measures gene expression in two groups of rat tissue (>5000 genes in each experiment).
• The rat tissues come from two groups:
• WT: Wild-Type rat tissue,
• KO: Knock Out Treatment rat tissue
• Gene expression for each group measured under similar conditions
• Question: Which genes are affected by the treatment? How significant is the effect? How big is the effect?
Calculating Expression Ratios
• In Differential Gene Expression Analysis, we are interested in identifying genes with different expression across two states, e.g.:
• Tumour cell lines vs. Normal cell lines
• Treated tissue vs. diseased tissue
• Different tissues, same organism
• Same tissue, different organisms
• Same tissue, same organism
• Time course experiments
• We can quantify the difference (effect) by taking a ratio
• i.e. for gene k, this is the ratio of its expression in state a to its expression in state b
• This provides a relative value of change (e.g. expression has doubled)
• If the expression level has not changed, the ratio is 1

A gene is up-regulated in state 2 compared to state 1 if it has a higher value in state 2

A gene is down-regulated in state 2 compared to state 1 if it has a lower value in state 2

Fold change (Fold ratio)
• Ratios are troublesome since
• Up-regulated & Down-regulated genes treated differently
• Genes up-regulated by a factor of 2 have a ratio of 2
• Genes down-regulated by same factor (2) have a ratio of 0.5
• As a result
• down regulated genes are compressed between 1 and 0
• up-regulated genes expand between 1 and infinity
• Taking the logarithm to base 2 of the ratio rectifies the problem: a doubling becomes +1 and a halving becomes -1. The log-transformed ratio is typically known as the fold change

[Figure: example fold-change table. Genes A, B and D are down-regulated; C is up-regulated; E has no change.]
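The asymmetry of raw ratios, and the symmetry of their base-2 logarithms, can be checked numerically. The sketch below is only an illustration: the expression values are invented, `log2_fold_change` is a hypothetical helper name, and the ratio is taken as state 2 over state 1 so that a positive result means up-regulation in state 2.

```python
import math

def log2_fold_change(state1: float, state2: float) -> float:
    """Return log2(state2 / state1); positive means up-regulated in state 2."""
    return math.log2(state2 / state1)

# Invented expression values, for illustration only.
doubled = log2_fold_change(100.0, 200.0)    # raw ratio 2
halved = log2_fold_change(100.0, 50.0)      # raw ratio 0.5
unchanged = log2_fold_change(100.0, 100.0)  # raw ratio 1

print(doubled, halved, unchanged)  # 1.0 -1.0 0.0
```

On the raw scale the two changes sit at 2 and 0.5; on the log scale they sit at +1 and -1, equally far from the "no change" value of 0.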

Examples of fold change
• You can calculate Fold change between pairs of expression values:
• e.g. Between State 1 vs State 2 for gene A
• Or Between mean values of all measurements for a gene in the WT/KO experiments
• mean(WT1..WT4) vs mean (KO1..KO4)
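As a rough sketch of the second option, the fold change of each gene can be computed from the group means of its row. Everything concrete here is an assumption made for illustration: the gene names, the expression values, the helper `mean_log2_fold_change`, and the choice of KO over WT as the direction of the ratio.

```python
import math
from statistics import mean

def mean_log2_fold_change(wt: list[float], ko: list[float]) -> float:
    """log2 of mean(KO) / mean(WT) for one gene (one row of the matrix)."""
    return math.log2(mean(ko) / mean(wt))

# Invented rows: gene -> (WT1..WT4, KO1..KO4).
expression = {
    "geneA": ([10.0, 12.0, 11.0, 11.0], [22.0, 20.0, 24.0, 22.0]),
    "geneB": ([40.0, 40.0, 40.0, 40.0], [10.0, 10.0, 10.0, 10.0]),
}

fold_changes = {
    gene: mean_log2_fold_change(wt, ko) for gene, (wt, ko) in expression.items()
}
print(fold_changes)  # geneA is doubled in KO, geneB is reduced four-fold
```

With these values geneA has a log2 fold change of about +1 and geneB of about -2. The fold change alone says nothing about how consistent the individual measurements are, which is why a significance test is still needed.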
Statistics: Back to our problems

[Figure: gene expression matrix. The 5000 rows represent genes and the columns represent samples: 4 Wild Type samples (blue) and 4 KO samples (red).]

Statistics: Significance of Fold Change
• For our problem we can calculate an average fold ratio for each gene (each row)
• This will give us an average effect value for each gene
• e.g. 2, 1.7, 10, 100, etc.
• Question: which of these values are significant?
• We could use a threshold, but what threshold value should we set?
• Instead, use statistical techniques based on the number of members in each group, the type of measurements, etc. -> significance testing.
Statistics: 5000 separate statistical problems
• Effectively:
• 5000 separate experiments, where each experiment measures the expression of one gene in two groups of 4 individuals
• For each experiment (gene), we want to establish whether there is a statistically significant difference between the reported values in the two groups
• We then want to identify those genes (across the 5000 genes) that have a significant change
• Each row in our table is similar to a traditional statistical analysis problem

[Figure: data table labelled with Condition, Group 1 members and Group 2 members.]

Statistics: Unpaired statistical experiments
• Overall setting: 2 groups of 4 individuals each
• Group1: Imperial students
• Group2: UCL students
• Experiment 1:
• We measure the height of all students
• We want to establish if members of one group are consistently (or on average) taller than members of the other, and if the measured difference is significant
• Experiment 2:
• We measure the weight of all students
• We want to establish if members of one group are consistently (or on average) heavier than the other, and if the measured difference is significant
• Experiment 3:
• ………


Statistics: Unpaired statistical experiments
• In unpaired experiments, you typically have two groups of people that are not related to one another, and you measure some property for each member of each group
• e.g. you want to test whether a new drug is effective or not, so you divide similar patients into two groups:
• One group takes the drug
• The other group takes a placebo
• You measure (quantify) the effect in both groups some time later
• You want to establish whether there is a significant difference between the two groups at that later point
• The WT/KO example is an unpaired experiment if the rats in the two groups are different!
Statistics: Unpaired statistical experiments
• How do we address the problem?
• Compare two sets of results (alternatively calculate mean for each group and compare means)
• Graphically:
• Scatter Plots
• Box plots, etc
• Compare Statistically
• Use unpaired t-test

Are these two series significantly different?

[Figure: data table labelled with Condition 1, Condition 2 and Group members.]

Statistics: Paired statistical experiments
• Overall setting: 1 group of 4 individuals
• Group1: Imperial students
• We make measurements for each student in two situations
• Experiment 1:
• We measure the height of all students before the Bioinformatics course and after the Bioinformatics course
• We want to establish if the Bioinformatics course consistently (or on average) affects students’ heights
• Experiment 2:
• We measure the weight of all students before the Bioinformatics course and after the Bioinformatics course
• We want to establish if the Bioinformatics course consistently (or on average) affects students’ weights
• Experiment 3:
• ………


Statistics: Paired statistical experiments
• In paired experiments, you typically have one group of people and measure some property for each member before and after a particular event (so measurements come in pairs of before and after)
• e.g. you want to test the effectiveness of a new cream for tanning
• You measure the tan in each individual before the cream is applied
• You measure the tan in each individual after the cream is applied
• You want to establish whether there is a significant difference between measurements before and after applying the cream for the group as a whole
Statistics: Paired statistical experiments
• The WT/KO example is a paired experiment if the rats in the experiments are the same!
Statistics: Paired statistical experiments
• How do we address the problem?
• Calculate the difference for each pair
• Compare the differences to zero
• Alternatively, compare the average difference to zero
• Graphically:
• Scatter plot of differences
• Box plots, etc
• Statistically
• Use paired t-test

Are differences close to Zero?

Statistics: Significance testing
• In both cases (paired and unpaired) you want to establish whether the difference is significant
• Significance testing is a statistical term and refers to estimating (numerically) the probability of a measurement occurring by chance.
• To do this, you need to review some basic statistics
• Normal distributions: mean, standard deviations, etc
• Hypothesis Testing
• t-distributions
• t-tests and p-values

[Figure: normal curve over x; 68% of the distribution lies within 1 s.d. on either side of the mean.]

Mean and standard deviation
• Mean and standard deviation tell you the basic features of a distribution
• mean = average value of all members of the group

$\mu = (x_1 + x_2 + x_3 + \dots + x_N)/N$

• standard deviation = a measure of how much the values of individual members vary in relation to the mean
• The normal distribution is symmetrical about the mean
• About 68% of the normal distribution lies within 1 s.d. of the mean
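Note on s.d. calculation
• Throughout the following slides and in the tutorials, I use the simple form for calculating the standard deviation, which divides by N: $\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$
• Some people use the unbiased form (for good reasons), which divides by N - 1: $s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(x_i - \mu)^2}$
• Please use the simple form if you want the answers to add up at the end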
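The difference between the two forms is easy to see numerically. The sketch below assumes the usual meaning of the terms (the simple form divides the summed squared deviations by N, the unbiased form by N - 1) and uses Python's standard `statistics` module on invented data.

```python
from statistics import pstdev, stdev

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # invented data

simple_sd = pstdev(values)   # simple form: divides by N
unbiased_sd = stdev(values)  # unbiased form: divides by N - 1

print(simple_sd)    # 2.0
print(unbiased_sd)  # slightly larger than the simple form
```

For small groups such as 4 samples the two forms differ noticeably, so results computed with general-purpose software (which often defaults to N - 1) may not match hand calculations done with the simple form.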


The Normal Distribution

Many continuous variables follow a normal distribution, and it plays a special role in the statistical tests we are interested in.

• The x-axis represents the values of a particular variable
• The y-axis represents the proportion of members of the population that have each value of the variable
• The area under the curve represents probability – i.e. area under the curve between two values on the x-axis represents the probability of an individual having a value in that range

[Figure: standard normal distribution (mean = 0, s.d. = 1), obtained using a simple transform. Annotation: 0.025 = p-value, the probability of a measurement value belonging to this distribution.]

Normal Distribution and Confidence Intervals

[Figure: standard normal curve with critical values -1.96 and 1.96. The central area is 1 - α = 0.95 and each tail has area α/2 = 0.025.]
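The values in this figure can be reproduced with `statistics.NormalDist` from the Python standard library. This is a minimal sketch: the helper `to_z` is a hypothetical name, and it assumes that the "simple transform" is the usual standardisation z = (x - mean) / s.d.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

alpha = 0.05
# Critical value that leaves alpha/2 of the area in the upper tail.
z_critical = standard_normal.inv_cdf(1 - alpha / 2)

# Area between -z_critical and +z_critical.
central_area = standard_normal.cdf(z_critical) - standard_normal.cdf(-z_critical)

def to_z(x: float, mean: float, sd: float) -> float:
    """Standardise x from a normal distribution with the given mean and s.d."""
    return (x - mean) / sd

print(round(z_critical, 2))    # 1.96
print(round(central_area, 2))  # 0.95
```

A measurement whose standardised value falls outside the interval from -1.96 to 1.96 lies in one of the two 0.025 tails.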

Hypothesis Testing (Unpaired): Are two data sets different?
• We use a z-test (normal distribution) if the standard deviations of the two populations from which the data sets came are known (and are the same)
• We pose a null hypothesis that the means are equal
• We try to refute the hypothesis by using the curves to calculate the probability (p) of seeing a difference at least as large as the one observed if the null hypothesis were true (both means are equal)
• If the probability is low (low p), reject the null hypothesis and accept the alternative hypothesis (the means are different)
• If the probability is high (high p), do not reject the null hypothesis (the data are consistent with equal means)

[Figure: distributions of Population 1 and Population 2 under Ho and under Ha.]

If the standard deviation is known, use a z-test; otherwise use a t-test.
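A two-sample z-test of this kind could be sketched as follows. The data and the known standard deviation are invented, `two_sample_z_test` is a hypothetical helper, and the two-tailed form is an assumption; the lecture does not give a z-test formula explicitly.

```python
from math import sqrt
from statistics import NormalDist, mean

def two_sample_z_test(x, y, sigma):
    """Two-tailed z-test for equal means, given a known common s.d. sigma."""
    # Standard error of the difference between the two sample means.
    standard_error = sigma * sqrt(1 / len(x) + 1 / len(y))
    z = (mean(x) - mean(y)) / standard_error
    # Probability of a |z| at least this large if the means are equal.
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

group1 = [5.1, 4.9, 5.3, 5.0]  # invented measurements
group2 = [5.0, 5.2, 4.8, 5.1]
z, p = two_sample_z_test(group1, group2, sigma=0.2)

print(f"z = {z:.3f}, p = {p:.3f}")
if p < 0.05:
    print("Reject the null hypothesis of equal means")
else:
    print("Not enough evidence to reject the null hypothesis")
```

In gene expression data the population standard deviation is rarely known, which is why the rest of the lecture moves to the t-test.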

Comparing Two Samples: Graphical interpretation
• To compare two groups, you can compare their means graphically.
• The graphical comparison lets you see the distributions of the two groups.
• If the p-value is low, chances are there will be little overlap between the two distributions. If the p-value is not low, there will be a fair amount of overlap between the two groups.
• We can set a critical value for the x-axis based on the threshold of p-value

In paired experiments, we compare the mean difference.

Hypothesis Testing (Paired): Are two data sets different?
• We use a z-test (normal distribution) if the standard deviations of the two populations from which the data sets came are known
• We pose a null hypothesis that the mean difference is zero
• We try to refute the hypothesis by using the curves to calculate the probability (p) of seeing a mean difference at least as large as the one observed if the null hypothesis were true (mean of difference is 0)
• If the probability is low (low p), reject the null hypothesis and accept the alternative hypothesis (mean of difference ≠ 0)
• If the probability is high (high p), do not reject the null hypothesis (the data are consistent with a mean difference of 0)

[Figure: distributions of Population 1 and Population 2 under Ho and under Ha.]

If the standard deviation is known, use a z-test; otherwise use a t-test (typically known as Student's t-test).

The t-test
• In most cases we use what is known as a t-test rather than the z-test when comparing samples.
• In particular when we have
• small data sets (fewer than 30 values each) and
• we don’t know the s.d. and have to calculate it from the small samples
• The same concepts as before apply, but we base the test on what is known as the t-distribution, which resembles the normal distribution but has heavier tails, and approaches it as the sample size grows
• We have to calculate what is known as a t-value!
The t-distribution
• In fact we have many t-distributions; each one is calculated in reference to the number of degrees of freedom (d.f.), also denoted v

[Figure: normal distribution curve compared with a t-distribution curve.]

t-test terminology
• t-test: Used to compare the mean of a sample to a known number (often 0).
• Assumptions: Subjects are randomly drawn from a population and the distribution of the mean being tested is normal.
• Test: The hypotheses for a single sample t-test are:
• Ho: $\mu = \mu_0$
• Ha: $\mu \ne \mu_0$
• (where $\mu_0$ denotes the hypothesized value to which you are comparing a population mean)
• p-value: probability of error in rejecting the hypothesis of no difference between the two groups.

t-test terminology: Single-tail vs. two-tail
• What am I testing for:
• Right Tail: (group1 > group2)
• Left Tail: (group1 < group2)
• Two Tail: Both groups are different but I don’t care how.

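• Right Tail: $H_0: \mu_1 \le \mu_2$, $H_1: \mu_1 > \mu_2$; or equivalently $H_0: \mu_1 - \mu_2 \le 0$, $H_1: \mu_1 - \mu_2 > 0$
• Left Tail: $H_0: \mu_1 \ge \mu_2$, $H_1: \mu_1 < \mu_2$; or equivalently $H_0: \mu_1 - \mu_2 \ge 0$, $H_1: \mu_1 - \mu_2 < 0$
• Two Tail: $H_0: \mu_1 = \mu_2$, $H_1: \mu_1 \ne \mu_2$; or equivalently $H_0: \mu_1 - \mu_2 = 0$, $H_1: \mu_1 - \mu_2 \ne 0$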

t-test terminology: Unpaired vs. paired t-test
• Same as before! It depends on your experiment
• Unpaired t-test: The hypotheses for the comparison of two independent groups are:
• Ho: $\mu_1 = \mu_2$ (means of the two groups are equal)
• Ha: $\mu_1 \ne \mu_2$ (means of the two groups are not equal)
• Paired t-test: The hypotheses for paired measurements in the same individuals are:
• Ho: D = 0 (the difference between the two observations is 0)
• Ha: D $\ne$ 0 (the difference is not 0)

Where d is calculated by

Remember these formulae !!

Calculating t-test (t statistic)
• First calculate t statistic value and then calculate p value

For the paired t-test, t is calculated using the following formula:

And n is the number of pairs being tested.

• For an unpaired (independent group) t-test, the following formula is used:

Where σ(x) is the standard deviation of x and n(x) is the number of elements in x.
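The formula images themselves are not present in this transcript, so the sketch below is an assumption rather than a copy of the slide. It uses common textbook forms that match the symbols named here: for the paired test, the mean of the per-pair differences divided by its standard error; for the unpaired test, the difference of the group means divided by the square root of σ(x)²/n(x) + σ(y)²/n(y). It also follows the lecture's note in using the simple (divide by N) standard deviation, whereas most statistics software uses N - 1. The function names and data are invented.

```python
from math import sqrt
from statistics import mean, pstdev

def paired_t(before, after):
    """t statistic for paired data: mean difference over its standard error."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)  # number of pairs
    # pstdev is the simple (divide by N) form; swap in stdev for N - 1.
    return mean(diffs) / (pstdev(diffs) / sqrt(n))

def unpaired_t(x, y):
    """t statistic for two independent groups (assumed unequal-variance form)."""
    standard_error = sqrt(pstdev(x) ** 2 / len(x) + pstdev(y) ** 2 / len(y))
    return (mean(x) - mean(y)) / standard_error

# Invented measurements for four individuals, before and after an event.
t_paired = paired_t([10.0, 12.0, 9.0, 11.0], [12.0, 15.0, 10.0, 14.0])

# Invented measurements for two independent groups of four.
t_unpaired = unpaired_t([1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0])

print(round(t_paired, 3), round(t_unpaired, 3))
```

The sign of t only records which group (or which condition) has the larger mean; its magnitude is what gets compared against the t-distribution.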

Calculating p-value for t-test
• When carrying out a test, a P-value can be calculated based on the t-value and the ‘Degrees of freedom’.
• There are three methods for calculating P:
• One Tailed >:
• One Tailed <:
• Two Tailed:
• Where p(t,v) is looked up from the t-distribution table
• The number of degrees of freedom (v) is calculated as:
• Unpaired: n(x) + n(y) - 2
• Paired: n - 1 (where n is the number of pairs)
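Where no table or statistics package is at hand, the tail probabilities can be approximated by integrating the t-distribution density numerically. The sketch below is one possible stand-in for the lookup function p(t,v), not the lecture's own method: the function names are hypothetical, Simpson's rule is my choice of integration scheme, and pairing the critical value 1.812 with v = 10 in the usage lines is an assumption based on standard t-tables.

```python
from math import gamma, pi, sqrt

def t_pdf(t: float, v: int) -> float:
    """Density of Student's t-distribution with v degrees of freedom."""
    coeff = gamma((v + 1) / 2) / (sqrt(v * pi) * gamma(v / 2))
    return coeff * (1 + t * t / v) ** (-(v + 1) / 2)

def upper_tail(t: float, v: int, steps: int = 2000) -> float:
    """P(T > t), via Simpson's rule on [0, |t|] and symmetry about 0."""
    a = abs(t)
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, v) + t_pdf(a, v)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, v)
    tail = 0.5 - total * h / 3  # area beyond |t| on one side
    return tail if t >= 0 else 1 - tail

def p_values(t: float, v: int) -> dict:
    """One-tailed (right, left) and two-tailed p-values for a t statistic."""
    right = upper_tail(t, v)
    return {"right": right, "left": 1 - right, "two": 2 * upper_tail(abs(t), v)}

def df_unpaired(n_x: int, n_y: int) -> int:
    return n_x + n_y - 2

def df_paired(n_pairs: int) -> int:
    return n_pairs - 1

print(df_unpaired(4, 4), df_paired(4))  # 6 3
print(round(upper_tail(1.812, 10), 3))  # 0.05
```

For real analyses a tested library routine is preferable; this version is only meant to show what the table lookup represents, and `gamma` will overflow for very large v.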
p-values
• Results of the t-test: If the p-value associated with the t-test is small (the threshold is usually set at p < 0.05), there is evidence to reject the null hypothesis in favour of the alternative.
• In other words, there is evidence that the mean is significantly different from the hypothesized value. If the p-value associated with the t-test is not small (p > 0.05), there is not enough evidence to reject the null hypothesis, and you cannot conclude that the mean differs from the hypothesized value.
Calculating t and p values
• You will usually use a piece of software to calculate t and p
• (Excel provides this.)
• In problems
• You can assume access to a function p(t,v) which calculates p for a given t value and v (number of degrees of freedom)
• or alternatively have a table indexed by critical t values and v
t-value and p-value
• Given a t-value and the degrees of freedom, you can look up a p-value
• Alternatively, if you know what p-value you need (e.g. 0.05) and the degrees of freedom, you can set the threshold for the critical t

[Figure: t-distribution with two-tailed rejection regions. Reject H0 if t < -2.0154 or t > 2.0154; each tail has area .025.]

t-test Interpretation

Note: as t increases, p decreases

For a significant result, the calculated t value must exceed the critical t value given in the table for the chosen P level

Finding a critical t
• The table provides the t values (tc) for which P(tx > tc) = α

[Figure: t-distribution with tail areas α = .05 marked beyond -tc = -1.812 and tc = 1.812; table columns t.100, t.05, t.025, t.01, t.005.]

Summary
• Differential analysis
• Uses fold ratio (fold change) for measuring effect
• Need some measure of significance of such effect.
• Statistical analysis
• Paired vs. unpaired experiments
• t-tests
• Calculating t for paired/un-paired experiments
• Deciding single tail vs. two-tail
• Calculating degrees of freedom
• Look-up p value