### Introduction to Bioinformatics 5. Statistical Analysis of Gene Expression Matrices I

Course 341

Department of Computing

Imperial College, London

Moustafa Ghanem

Lecture Overview

- Motivation
- Identifying differentially expressed genes
- Calculating effect: fold ratio
- Calculating significance: p-values
- Statistical Analysis
- Paired and unpaired experiments
- Need for significance testing
- Hypothesis testing
- t-tests and p-values
- t-tests
- Paired and unpaired t-tests
- Formulae for t-test
- Single-tail vs. two-tail t-tests
- Looking up p-values

Motivation: Large-scale Differential Gene Expression Analysis

- Consider a microarray experiment that measures gene expression in two groups of rat tissue (>5000 genes in each experiment).
- The rat tissues come from two groups:
- WT: Wild-Type rat tissue,
- KO: Knock Out Treatment rat tissue
- Gene expression for each group measured under similar conditions
- Question: Which genes are affected by the treatment? How significant is the effect? How big is the effect?

Calculating Expression Ratios

- In Differential Gene Expression Analysis, we are interested in identifying genes with different expression across two states, e.g.:
- Tumour cell lines vs. Normal cell lines
- Treated tissue vs. diseased tissue
- Different tissues, same organism
- Same tissue, different organisms
- Same tissue, same organism
- Time course experiments
- We can quantify the difference (effect) by taking a ratio
- i.e. for gene k, this is the ratio of its expression in state a to its expression in state b
- This provides a relative value of change (e.g. expression has doubled)
- If the expression level has not changed, the ratio is 1

A gene is up-regulated in state 2 compared to state 1 if it has a higher value in state 2

A gene is down-regulated in state 2 compared to state 1 if it has a lower value in state 2

Fold change (Fold ratio)

- Ratios are troublesome since
- Up-regulated and down-regulated genes are treated differently
- Genes up-regulated by a factor of 2 have a ratio of 2
- Genes down-regulated by the same factor (2) have a ratio of 0.5
- As a result
- down-regulated genes are compressed between 1 and 0
- up-regulated genes expand between 1 and infinity
- Taking the logarithm to base 2 of the ratio rectifies the problem; this is typically known as the fold change
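As a minimal sketch of why the log transform helps, the Python snippet below compares raw ratios with their base-2 logarithms. The function name `log2_fold_change` and the expression values are illustrative assumptions, not taken from the slides.

```python
import math

def log2_fold_change(expr_a: float, expr_b: float) -> float:
    """Return log2(expr_a / expr_b); assumes both values are positive."""
    return math.log2(expr_a / expr_b)

# Hypothetical expression values (state a, state b) for three genes.
examples = {
    "doubled": (200.0, 100.0),
    "halved": (50.0, 100.0),
    "unchanged": (100.0, 100.0),
}

for label, (a, b) in examples.items():
    ratio = a / b
    print(f"{label}: ratio={ratio:.2f}, log2 ratio={log2_fold_change(a, b):+.2f}")
```

Doubling and halving give ratios of 2 and 0.5, which sit at unequal distances from 1, whereas their base-2 logarithms are +1 and -1, symmetric about 0 (no change).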

[Table annotations: C is up-regulated; E has no change]

Examples of fold change

- You can calculate fold change between pairs of expression values:
- e.g. between State 1 and State 2 for gene A
- Or between the mean values of all measurements for a gene in the WT/KO experiments
- mean(WT1..WT4) vs mean(KO1..KO4)
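The mean-based variant can be sketched in Python as follows. The measurements are made up for illustration, and taking KO over WT as the direction of the ratio is an assumption; the slides do not fix the direction.

```python
import math
from statistics import mean

# Hypothetical measurements for one gene (one row of the expression matrix).
wt = [100.0, 120.0, 90.0, 110.0]   # WT1..WT4
ko = [210.0, 190.0, 220.0, 200.0]  # KO1..KO4

def mean_log2_fold_change(group_a, group_b):
    """log2 of mean(group_a) / mean(group_b); assumes both means are positive."""
    return math.log2(mean(group_a) / mean(group_b))

# Assumed direction: KO relative to WT, so a positive value means up-regulated in KO.
fc = mean_log2_fold_change(ko, wt)
print(f"mean(WT)={mean(wt):.1f}, mean(KO)={mean(ko):.1f}, log2(KO/WT)={fc:.3f}")
```

Swapping the two arguments only flips the sign of the result, which is one practical advantage of working on the log scale.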

Statistics: Back to our problem

[Figure: gene expression matrix. The 5000 rows represent genes and the columns represent samples: 4 Wild Type samples (blue) and 4 KO samples (red)]

Statistics: Significance of Fold Change

- For our problem we can calculate an average fold ratio for each gene (each row)
- This will give us an average effect value for each gene
- 2, 1.7, 10, 100, etc
- Question: which of these values are significant?
- Can use a threshold, but what threshold value should we set?
- Use statistical techniques based on number of members in each group, type of measurements, etc -> significance testing.

Statistics: 5000 separate statistical problems

- How do we think about this problem?
- Effectively:
- 5000 separate experiments where each experiment measures the expression of one gene in two groups of 4 individuals
- For each experiment (gene), want to establish if there is a statistical difference between the reported values in each group
- We then want to identify those genes (across the 5000 genes) that have a significant change
- Each row in our table is similar to a traditional statistical analysis problem

Statistics: Unpaired statistical experiments

[Figure labels: Condition; Group 1 members; Group 2 members]

- Overall setting: 2 groups of 4 individuals each
- Group1: Imperial students
- Group2: UCL students
- Experiment 1:
- We measure the height of all students
- We want to establish if members of one group are consistently (or on average) taller than members of the other, and if the measured difference is significant
- Experiment 2:
- We measure the weight of all students
- We want to establish if members of one group are consistently (or on average) heavier than the other, and if the measured difference is significant
- Experiment 3:
- ………

Statistics: Unpaired statistical experiments

- In unpaired experiments, you typically have two groups of people that are not related to one another, and measure some property for each member of each group
- e.g. you want to test whether a new drug is effective or not, so you divide similar patients into two groups:
- One group takes the drug
- Another group takes a placebo
- You measure (quantify) the effect in both groups some time later
- You want to establish whether there is a significant difference between the two groups at that later point
- The WT/KO example is an unpaired experiment if the rats in the experiments are different!

Statistics: Unpaired statistical experiments

- How do we address the problem?
- Compare two sets of results (alternatively calculate mean for each group and compare means)
- Graphically:
- Scatter Plots
- Box plots, etc
- Compare Statistically
- Use unpaired t-test

[Figure: two series of measurements. Are these two series significantly different?]

Statistics: Paired statistical experiments

[Figure labels: Condition 2; Group members]

- Overall setting: 1 group of 4 individuals
- Group1: Imperial students
- We make measurements for each student in two situations
- Experiment 1:
- We measure the height of all students before Bioinformatics course and after Bioinformatics course
- We want to establish if Bioinformatics course consistently (or on average) affects students’ heights
- Experiment 2:
- We measure the weight of all students before Bioinformatics course and after Bioinformatics course
- We want to establish if Bioinformatics course consistently (or on average) affects students’ weights
- Experiment 3:
- ………

Statistics: Paired statistical experiments

- In paired experiments, you typically have one group of people and measure some property for each member before and after a particular event (so measurements come in pairs of before and after)
- e.g. you want to test the effectiveness of a new cream for tanning
- You measure the tan in each individual before the cream is applied
- You measure the tan in each individual after the cream is applied
- You want to establish whether there is a significant difference between measurements before and after applying the cream for the group as a whole

Statistics: Paired statistical experiments

- The WT/KO example is a paired experiment if the rats in the experiments are the same!

Statistics: Paired statistical experiments

- How do we address the problem?
- Calculate difference for each pair
- Compare differences to zero
- Alternatively, compare the average difference to zero
- Graphically:
- Scatter Plot of difference
- Box plots, etc
- Statistically
- Use paired t-test

Are differences close to Zero?

Statistics: Significance testing

- In both cases (paired and unpaired) you want to establish whether the difference is significant
- Significance testing is a statistical term and refers to estimating (numerically) the probability of a measured effect arising by chance alone.
- To do this, you need to review some basic statistics
- Normal distributions: mean, standard deviations, etc
- Hypothesis Testing
- t-distributions
- t-tests and p-values

Mean and standard deviation

[Figure: normal curve over x, with 1 s.d. marked on either side of the mean]

- Mean and standard deviation tell you the basic features of a distribution
- mean = average value of all members of the group

$\mu = (x_1 + x_2 + x_3 + \dots + x_N)/N$

- standard deviation = a measure of how much the values of individual members vary in relation to the mean
- The normal distribution is symmetrical about the mean
- About 68% of the normal distribution lies within 1 s.d. of the mean

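Note on s.d. calculation

- Through the following slides and in the tutorials, I use the following simple formula for calculating standard deviation (where $\mu$ is the mean of the $N$ values):

$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$$

- Some people use the unbiased form below (for good reasons):

$$s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(x_i - \mu)^2}$$

- Please use the simple form if you want the answers to add up at the end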
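The difference between the two forms can be checked with Python's standard `statistics` module, where `pstdev` divides by N (the simple form) and `stdev` divides by N-1 (the unbiased form). The data values below are hypothetical.

```python
from statistics import mean, pstdev, stdev

# Hypothetical measurements for one group.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mu = mean(values)             # average value of all members
sd_simple = pstdev(values)    # simple form: divides by N
sd_unbiased = stdev(values)   # unbiased form: divides by N - 1

print(f"mean={mu}, simple s.d.={sd_simple:.4f}, unbiased s.d.={sd_unbiased:.4f}")
```

The unbiased value is always slightly larger, and the gap matters most for small groups such as the 4-sample groups in the WT/KO experiment.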

The Normal Distribution

Many continuous variables follow a normal distribution, and it plays a special role in the statistical tests we are interested in:

- The x-axis represents the values of a particular variable
- The y-axis represents the proportion of members of the population that have each value of the variable
- The area under the curve represents probability – i.e. area under the curve between two values on the x-axis represents the probability of an individual having a value in that range

Any normal distribution can be transformed to a standard normal distribution (mean 0, s.d. = 1) using a simple transform: $z = (x - \mu)/\sigma$

Normal Distribution and Confidence Intervals

[Figure: standard normal curve. The central area between -1.96 and 1.96 is 1 - α = 0.95; each tail has area α/2 = 0.025]

0.025 = p-value: probability of a measurement value belonging to this distribution
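The areas quoted for the normal curve can be reproduced with `statistics.NormalDist` from the Python standard library. This is only an illustrative check; the helper `z_score` and its example numbers are assumptions introduced here.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

def z_score(x, mu, sigma):
    """Transform a value from a normal(mu, sigma) distribution to the standard scale."""
    return (x - mu) / sigma

within_1sd = std_normal.cdf(1.0) - std_normal.cdf(-1.0)     # area within 1 s.d.
central_95 = std_normal.cdf(1.96) - std_normal.cdf(-1.96)   # area between -1.96 and 1.96
upper_tail = 1.0 - std_normal.cdf(1.96)                     # area above 1.96
z_crit = std_normal.inv_cdf(0.975)                          # z leaving 0.025 in the upper tail

print(f"within 1 s.d.: {within_1sd:.4f}")
print(f"between -1.96 and 1.96: {central_95:.4f}")
print(f"upper tail beyond 1.96: {upper_tail:.4f}")
print(f"critical z for 0.975: {z_crit:.4f}")
print(f"z for x=180 with mean 170, s.d. 5: {z_score(180.0, 170.0, 5.0)}")
```

The same object answers both directions of the lookup: `cdf` turns a z value into an area, and `inv_cdf` turns an area back into a z value.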

Hypothesis Testing (Unpaired): Are two data sets different?

In unpaired experiments, we compare the difference between the means.

- We use the z-test (normal distribution) if the standard deviations of the two populations from which the data sets came are known (and are the same)
- We pose a null hypothesis that the means are equal
- We try to refute the hypothesis, using the curves to calculate the probability (p) of observing a difference at least as large as the measured one if the null hypothesis were true (both means are equal)
- If the probability is low (low p), reject the null hypothesis and accept the alternative hypothesis (the means are different)
- If the probability is high (high p), do not reject the null hypothesis (the data are consistent with equal means)
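A two-sample z-test along these lines might look like the sketch below. It assumes a known, common population standard deviation `sigma` and a two-tailed alternative; the function name, the height data and the value of `sigma` are all hypothetical.

```python
import math
from statistics import NormalDist, mean

def two_sample_z_test(x, y, sigma):
    """Two-tailed z-test of equal means; sigma is the known common population s.d."""
    standard_error = sigma * math.sqrt(1 / len(x) + 1 / len(y))
    z = (mean(x) - mean(y)) / standard_error
    # Probability of a |z| at least this large if the null hypothesis were true.
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

# Hypothetical heights in cm for two unrelated groups of 4.
group1 = [178.0, 182.0, 175.0, 185.0]
group2 = [170.0, 174.0, 168.0, 172.0]

z, p = two_sample_z_test(group1, group2, sigma=5.0)
print(f"z={z:.3f}, p={p:.4f}")
print("reject null hypothesis" if p < 0.05 else "do not reject null hypothesis")
```

In practice the population standard deviation is rarely known for expression data, which is why the t-test is the usual choice.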

[Figure: Population 1 and Population 2 distributions under Ho and under Ha]

If the standard deviation is known, use the z-test; otherwise use the t-test

Comparing Two Samples: Graphical interpretation

- To compare two groups, you can compare their means graphically.
- The graphical comparison allows you to see the distribution of the two groups.
- If the p-value is low, chances are there will be little overlap between the two distributions. If the p-value is not low, there will be a fair amount of overlap between the two groups.
- We can set a critical value on the x-axis based on the p-value threshold

Hypothesis Testing (Paired): Are two data sets different?

In paired experiments, we compare the mean difference.

- We use the z-test (normal distribution) if the standard deviations of the two populations from which the data sets came are known
- We pose a null hypothesis that the mean difference is zero
- We try to refute the hypothesis, using the curves to calculate the probability (p) of observing a mean difference at least as large as the measured one if the null hypothesis were true (mean of difference is 0)
- If the probability is low (low p), reject the null hypothesis and accept the alternative hypothesis (mean of difference is not 0)
- If the probability is high (high p), do not reject the null hypothesis (the data are consistent with a mean difference of 0)

[Figure: Population 1 and Population 2 distributions under Ho and under Ha]

If the standard deviation is known, use the z-test; otherwise use the t-test

The t-test

- In most cases we use what is known as a t-test (typically known as Student's t-test) rather than the z-test when comparing samples.
- In particular when we have
- small data sets (fewer than 30 each) and
- we don't know the s.d. and have to calculate it from the small samples
- The same concepts as before apply, but we base the test on what is known as the t-distribution, which resembles the normal distribution but has heavier tails for small samples
- We have to calculate what is known as a t-value!
The t-distribution

- In fact we have many t-distributions; each one is determined by its number of degrees of freedom (d.f.), also written v
- We will see how we calculate the degrees of freedom in a short while

[Figure: a t-distribution curve compared with the normal distribution curve]

t-test terminology

- t-test: Used to compare the mean of a sample to a known number (often 0).
- Assumptions: Subjects are randomly drawn from a population and the distribution of the mean being tested is normal.
- Test: The hypotheses for a single sample t-test are:
- Ho: $\mu = \mu_0$
- Ha: $\mu \neq \mu_0$
- (where $\mu_0$ denotes the hypothesized value to which you are comparing a population mean)
- p-value: the probability of obtaining a result at least as extreme as the one observed if the hypothesis of no difference were true; informally, the risk of error in rejecting that hypothesis.

t-test terminology: Single-tail vs. two-tail

- What am I testing for:
- Right Tail: (group1 > group2)
- Left Tail: (group1 < group2)
- Two Tail: Both groups are different but I don’t care how.

Right Tail

H0: $\mu_1 \le \mu_2$, H1: $\mu_1 > \mu_2$

OR

H0: $\mu_1 - \mu_2 \le 0$, H1: $\mu_1 - \mu_2 > 0$

Left Tail

H0: $\mu_1 \ge \mu_2$, H1: $\mu_1 < \mu_2$

OR

H0: $\mu_1 - \mu_2 \ge 0$, H1: $\mu_1 - \mu_2 < 0$

Two Tail

H0: $\mu_1 = \mu_2$, H1: $\mu_1 \neq \mu_2$

OR

H0: $\mu_1 - \mu_2 = 0$, H1: $\mu_1 - \mu_2 \neq 0$

t-test terminology: Unpaired vs. paired t-test

- Same as before!! Depends on your experiment
- Unpaired t-test: The hypotheses for the comparison of two independent groups are:
- Ho: $\mu_1 = \mu_2$ (means of the two groups are equal)
- Ha: $\mu_1 \neq \mu_2$ (means of the two groups are not equal)
- Paired t-test: The hypotheses for paired measurements in the same individuals are:
- Ho: $D = 0$ (the difference between the two observations is 0)
- Ha: $D \neq 0$ (the difference is not 0)

Calculating t-test (t statistic)

Remember these formulae!!

- First calculate the t statistic value and then calculate the p value
- For the paired t-test, t is calculated from the mean and standard deviation of the pairwise differences, where n is the number of pairs being tested.
- For an unpaired (independent group) t-test, t is calculated from the two group means, where σ(x) is the standard deviation of x and n(x) is the number of elements in x.
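The formula images from the slides are not preserved in this transcript, so the sketch below falls back on standard textbook forms, which may differ in detail from the lecturer's own. It assumes the paired statistic $t = \bar{d} / (s_d / \sqrt{n})$ on the pairwise differences, and the pooled-variance (Student) form for the unpaired case, chosen because it matches n(x) + n(y) - 2 degrees of freedom. Both use the N-1 sample variance rather than the simple form. All function names and data are hypothetical.

```python
import math
from statistics import mean, stdev, variance

def paired_t(before, after):
    """t statistic for paired measurements (assumed textbook form)."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)                      # number of pairs
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))

def unpaired_t(x, y):
    """Pooled-variance t statistic for two independent groups (assumed textbook form)."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / math.sqrt(pooled_var * (1 / nx + 1 / ny))

# Hypothetical paired data: the same 4 individuals measured twice.
before = [10.0, 12.0, 9.0, 11.0]
after = [12.0, 15.0, 10.0, 13.0]

# Hypothetical unpaired data: two unrelated groups of 4.
x = [178.0, 182.0, 175.0, 185.0]
y = [170.0, 174.0, 168.0, 172.0]

print(f"paired t = {paired_t(before, after):.3f} with {len(before) - 1} d.f.")
print(f"unpaired t = {unpaired_t(x, y):.3f} with {len(x) + len(y) - 2} d.f.")
```

Applied to the expression matrix, one of these functions would be called once per row, giving one t value per gene.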

Calculating p-value for t-test

- When carrying out a test, a P-value can be calculated based on the t-value and the ‘Degrees of freedom’.
- There are three methods for calculating P:
- One Tailed >:
- One Tailed <:
- Two Tailed:
- Where p(t,v) is looked up from the t-distribution table
- The number of degrees of freedom (v) is calculated as:
- Unpaired: n(x) + n(y) - 2
- Paired: n - 1 (where n is the number of pairs)
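In place of a printed table, the tail areas can be approximated numerically. The sketch below is one possible approach using only the standard library: it integrates the t-distribution density with Simpson's rule. The names `t_pdf`, `upper_tail` and `p_value`, and the way the three cases are expressed, are assumptions introduced here; statistical packages provide equivalent, more robust functions.

```python
import math

def t_pdf(t, v):
    """Density of Student's t-distribution with v degrees of freedom."""
    # math.gamma overflows for very large v; fine for the small v used here.
    coeff = math.gamma((v + 1) / 2) / (math.sqrt(v * math.pi) * math.gamma(v / 2))
    return coeff * (1 + t * t / v) ** (-(v + 1) / 2)

def upper_tail(t, v, steps=2000):
    """P(T > t), via Simpson's rule on the density between 0 and |t| (steps must be even)."""
    a = abs(t)
    h = a / steps
    total = t_pdf(0.0, v) + t_pdf(a, v)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, v)
    area = total * h / 3            # P(0 < T < |t|)
    tail = 0.5 - area               # P(T > |t|), using symmetry about 0
    return tail if t >= 0 else 1.0 - tail

def p_value(t, v, tail="two"):
    """p for a t value and v degrees of freedom; tail is 'greater', 'less' or 'two'."""
    if tail == "greater":           # one-tailed >
        return upper_tail(t, v)
    if tail == "less":              # one-tailed <
        return 1.0 - upper_tail(t, v)
    return 2.0 * upper_tail(abs(t), v)   # two-tailed

print(f"two-tailed p for t=2.0154, v=44: {p_value(2.0154, 44):.4f}")
print(f"one-tailed (>) p for t=1.812, v=10: {p_value(1.812, 10, 'greater'):.4f}")
```

The two example calls use critical values that appear later in the slides, so both results should come out close to 0.05.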

p-values

- Results of the t-test: If the p-value associated with the t-test is small (the threshold is usually set at p < 0.05), there is evidence to reject the null hypothesis in favour of the alternative.
- In other words, there is evidence that the mean is significantly different from the hypothesized value. If the p-value associated with the t-test is not small (p > 0.05), there is not enough evidence to reject the null hypothesis, and you cannot conclude that the mean differs from the hypothesized value.

Calculating t and p values

- You will usually use a piece of software to calculate t and P
- (Excel provides that!)
- In problems
- You can assume access to a function p(t,v) which calculates p for a given t value and v (number of degrees of freedom)
- or alternatively have a table indexed by critical t values and v

t-value and p-value

- Given a t-value, and degrees of freedom, you can look-up a p-value
- Alternatively, if you know what p-value you need (e.g. 0.05) and degrees of freedom you can set the threshold for critical t
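Going the other way, from a chosen p-value to a critical t, can be sketched as a bisection search over the tail area. This self-contained example is an illustration under stated assumptions, not the method used to build published tables: `upper_tail` and `critical_t` are hypothetical names, and the search bracket of 0 to 100 is an arbitrary choice that suits moderate d.f. and p-values.

```python
import math

def upper_tail(t, v, steps=2000):
    """P(T > t) for t >= 0, by Simpson's rule on the t density (steps must be even)."""
    coeff = math.gamma((v + 1) / 2) / (math.sqrt(v * math.pi) * math.gamma(v / 2))

    def pdf(x):
        return coeff * (1 + x * x / v) ** (-(v + 1) / 2)

    h = t / steps
    total = pdf(0.0) + pdf(t)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return 0.5 - total * h / 3

def critical_t(alpha, v, tails=2):
    """t value whose tail area is alpha (split equally over both tails if tails == 2)."""
    target = alpha / tails
    lo, hi = 0.0, 100.0             # assumed search bracket
    for _ in range(60):             # bisection: the tail area shrinks as t grows
        mid = (lo + hi) / 2
        if upper_tail(mid, v) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(f"two-tailed, p=0.05, v=44: tc = {critical_t(0.05, 44):.4f}")
print(f"one-tailed, p=0.05, v=10: tc = {critical_t(0.05, 10, tails=1):.4f}")
```

A calculated t beyond the returned threshold (in absolute value, for the two-tailed case) corresponds to a p-value below the chosen level.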

[Figure: t-distribution with two rejection regions (Reject H0), each of area .025, beyond the critical values t = -2.0154 and t = 2.0154]

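t-test Interpretation

- Note: as t increases, p decreases
- To reject Ho, the calculated t value must exceed the critical t value from the table for the chosen P level

Finding a critical t

- The table provides the t values (tc) for which P(tx > tc) = A

[Figure: t-distribution with tail area A = .05 beyond the critical values -tc = -1.812 and tc = 1.812]

[Table: critical t values, with columns t.100, t.05, t.025, t.01, t.005]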

Summary

- Differential analysis
- Uses fold ratio (fold change) for measuring effect
- Need some measure of significance of such effect.
- Statistical analysis
- Paired vs. unpaired experiments
- t-tests
- Calculating t for paired/un-paired experiments
- Deciding single tail vs. two-tail
- Calculating degrees of freedom
- Look-up p value
