Introduction to Bioinformatics 5. Statistical Analysis of Gene Expression Matrices I


Course 341

Department of Computing

Imperial College, London

Moustafa Ghanem

Lecture Overview
  • Motivation
    • Identifying differentially expressed genes
    • Calculating effect: fold ratio
    • Calculating significance: p-values
  • Statistical Analysis
    • Paired and unpaired experiments
    • Need for significance testing
    • Hypothesis testing
    • t-tests and p-values
  • t-tests
    • Paired and unpaired t-tests
    • Formulae for t-test
    • Single-tail vs. two-tail t-tests
    • Looking up p-values
Motivation: Large-scale Differential Gene Expression Analysis
  • Consider a microarray experiment
    • that measures gene expression in two groups of rat tissue (>5000 genes in each experiment).
    • The rat tissues come from two groups:
      • WT: Wild-Type rat tissue,
      • KO: Knock Out Treatment rat tissue
    • Gene expression for each group measured under similar conditions
    • Question: Which genes are affected by the treatment? How significant is the effect? How big is the effect?
Calculating Expression Ratios
  • In Differential Gene Expression Analysis, we are interested in identifying genes with different expression across two states, e.g.:
    • Tumour cell lines vs. Normal cell lines
    • Treated tissue vs. diseased tissue
    • Different tissues, same organism
    • Same tissue, different organisms
    • Same tissue, same organism
    • Time course experiments
  • We can quantify the difference (effect) by taking a ratio:
    \[ R_k = \frac{E_k(a)}{E_k(b)} \]
  • i.e. for gene k, this is the ratio of its expression in state a, E_k(a), to its expression in state b, E_k(b)
    • This provides a relative value of change (e.g. expression has doubled)
    • If the expression level has not changed, the ratio is 1
Fold change (Fold ratio)
  • A gene is up-regulated in state 2 compared to state 1 if it has a higher value in state 2
  • A gene is down-regulated in state 2 compared to state 1 if it has a lower value in state 2
  • Ratios are troublesome since
    • Up-regulated & down-regulated genes are treated differently
      • Genes up-regulated by a factor of 2 have a ratio of 2
      • Genes down-regulated by the same factor (2) have a ratio of 0.5
    • As a result
      • down-regulated genes are compressed between 0 and 1
      • up-regulated genes expand between 1 and infinity
  • Taking the logarithm to base 2 of the ratio rectifies the problem; the result is typically known as the fold change:
    \[ \text{fold change} = \log_2(\text{ratio}) \]
    • A doubling then gives +1, a halving gives -1, and no change gives 0
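The asymmetry of raw ratios and the symmetry of the base-2 logarithm can be checked with a minimal Python sketch. The function names and the expression values below are illustrative assumptions, not taken from the lecture.

```python
import math


def fold_ratio(expr_a: float, expr_b: float) -> float:
    """Ratio of expression in state a to expression in state b."""
    return expr_a / expr_b


def log2_fold_change(expr_a: float, expr_b: float) -> float:
    """Base-2 logarithm of the ratio; up and down changes are symmetric about 0."""
    return math.log2(expr_a / expr_b)


# Hypothetical expression values for one gene in two states.
doubled = log2_fold_change(200.0, 100.0)    # ratio 2
halved = log2_fold_change(50.0, 100.0)      # ratio 0.5
unchanged = log2_fold_change(100.0, 100.0)  # ratio 1

print(fold_ratio(200.0, 100.0), fold_ratio(50.0, 100.0))
print(doubled, halved, unchanged)
```

On the raw ratio scale the two changes sit at 2 and 0.5; on the log scale they sit at equal distances either side of 0.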
Examples of fold change
[Table: example expression values for genes A to E; A, B and D are down-regulated, C is up-regulated, E has no change]
  • You can calculate fold change between pairs of expression values:
  • e.g. between State 1 and State 2 for gene A
  • Or between the mean values of all measurements for a gene in the WT/KO experiments
    • mean(WT1..WT4) vs mean(KO1..KO4)
Statistics: Back to our problem
  • 5000 rows represent genes
  • Columns represent samples
    • 4 Wild Type samples (Blue)
    • 4 KO samples (Red)

Statistics: Significance of Fold Change
  • For our problem we can calculate an average fold ratio for each gene (each row)
  • This will give us an average effect value for each gene
    • 2, 1.7, 10, 100, etc
  • Question: which of these values are significant?
    • Can use a threshold, but what threshold value should we set?
    • Use statistical techniques based on the number of members in each group, the type of measurements, etc. This is significance testing.
Statistics: 5000 separate statistical problems
  • How do we think about this problem?
  • Effectively:
    • 5000 separate experiments where each experiment measures the expression of one gene in two groups of 4 individuals
    • For each experiment (gene), we want to establish if there is a statistically significant difference between the reported values in each group
    • We then want to identify those genes (across the 5000 genes) that have a significant change
  • Each row in our table is like one traditional statistical analysis problem
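The row-by-row view can be sketched in Python as one small analysis per gene. This is only an illustration: the gene names and values are made up, the fold change is taken as KO relative to WT, and the t statistic uses an assumed unpaired form (difference of means over the square root of the summed variance/size terms, with the divide-by-N variance).

```python
import math
from statistics import mean, pvariance

# Hypothetical expression matrix: each row is one gene,
# holding four WT measurements and four KO measurements.
matrix = {
    "gene_A": ([100.0, 110.0, 90.0, 100.0], [200.0, 210.0, 190.0, 200.0]),
    "gene_B": ([50.0, 52.0, 48.0, 50.0], [50.0, 49.0, 51.0, 50.0]),
}


def analyse_gene(wt, ko):
    """Treat one row as its own two-group problem: return (log2 fold change, t)."""
    log2_fc = math.log2(mean(ko) / mean(wt))
    # Assumed unpaired t statistic; pvariance divides by N (the "simple" form).
    std_err = math.sqrt(pvariance(ko) / len(ko) + pvariance(wt) / len(wt))
    return log2_fc, (mean(ko) - mean(wt)) / std_err


# One independent result per row of the matrix.
results = {gene: analyse_gene(wt, ko) for gene, (wt, ko) in matrix.items()}
for gene, (log2_fc, t) in results.items():
    print(gene, round(log2_fc, 3), round(t, 3))
```

With 5000 genes the same function would simply be applied 5000 times, once per row.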
Statistics: Unpaired statistical experiments
[Figure: Group 1 members and Group 2 members, each measured under one condition]
  • Overall setting: 2 groups of 4 individuals each
    • Group1: Imperial students
    • Group2: UCL students
  • Experiment 1:
    • We measure the height of all students
    • We want to establish if members of one group are consistently (or on average) taller than members of the other, and if the measured difference is significant
  • Experiment 2:
    • We measure the weight of all students
    • We want to establish if members of one group are consistently (or on average) heavier than the other, and if the measured difference is significant
  • Experiment 3:
    • ………
Statistics: Unpaired statistical experiments
  • In unpaired experiments, you typically have two groups of people that are not related to one another, and measure some property for each member of each group
  • e.g. to test whether a new drug is effective, you divide similar patients into two groups:
    • One group takes the drug
    • Another group takes a placebo
    • You measure (quantify) the effect in both groups some time later
  • You want to establish whether there is a significant difference between both groups at that later point
  • The WT/KO example is an unpaired experiment if the rats in the experiments are different!
Statistics: Unpaired statistical experiments
  • How do we address the problem?
  • Compare two sets of results (alternatively calculate mean for each group and compare means)
  • Graphically:
    • Scatter Plots
    • Box plots, etc
  • Compare Statistically
    • Use unpaired t-test

[Figure: two plotted series. Are these two series significantly different?]

Statistics: Paired statistical experiments
[Figure: one set of group members, each measured under Condition 1 and Condition 2]
  • Overall setting: 1 group of 4 individuals
    • Group1: Imperial students
    • We make measurements for each student in two situations
  • Experiment 1:
    • We measure the height of all students before Bioinformatics course and after Bioinformatics course
    • We want to establish if Bioinformatics course consistently (or on average) affects students’ heights
  • Experiment 2:
    • We measure the weight of all students before Bioinformatics course and after Bioinformatics course
    • We want to establish if Bioinformatics course consistently (or on average) affects students’ weights
  • Experiment 3:
    • ………
Statistics: Paired statistical experiments
  • In paired experiments, you typically have one group of people and measure some property for each member before and after a particular event (so measurements come in pairs of before and after)
  • e.g. you want to test the effectiveness of a new cream for tanning
    • You measure the tan in each individual before the cream is applied
    • You measure the tan in each individual after the cream is applied
  • You want to establish whether there is a significant difference between measurements before and after applying the cream for the group as a whole
Statistics: Paired statistical experiments
  • The WT/KO example is a paired experiment if the rats in the experiments are the same!
Statistics: Paired statistical experiments
  • How do we address the problem?
  • Calculate the difference for each pair
  • Compare the differences to zero
  • Alternatively, compare the average difference to zero
  • Graphically:
    • Scatter plot of the differences
    • Box plots, etc
  • Statistically
    • Use paired t-test

Are differences close to Zero?

Statistics: Significance testing
  • In both cases (paired and unpaired) you want to establish whether the difference is significant
  • Significance testing is a statistical term and refers to estimating (numerically) the probability of a measurement occurring by chance.
  • To do this, you need to review some basic statistics
    • Normal distributions: mean, standard deviations, etc
    • Hypothesis Testing
    • t-distributions
    • t-tests and p-values
Mean and standard deviation
[Figure: normal curve over x; 68% of the distribution lies within 1 s.d. either side of the mean]
  • Mean and standard deviation tell you the basic features of a distribution
  • mean = average value of all members of the group

\[ u = \frac{x_1 + x_2 + x_3 + \dots + x_N}{N} \]

  • standard deviation = a measure of how much the values of individual members vary in relation to the mean
  • The normal distribution is symmetrical about the mean
  • About 68% of the normal distribution lies within 1 s.d. of the mean
Note on s.d. calculation
  • Throughout the following slides and in the tutorials, I use the following formula for calculating standard deviation:
    \[ \sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - u)^2} \]
  • Some people use the unbiased form below (for good reasons):
    \[ s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(x_i - u)^2} \]
  • Please use the simple form if you want the answers to add up at the end
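The difference between the two conventions can be seen in a short Python sketch. It assumes the simple form divides the summed squared deviations by N and the unbiased form divides by N - 1; the sample values are made up for illustration.

```python
import math


def mean(values):
    """Average value of all members of the group."""
    return sum(values) / len(values)


def sd_simple(values):
    """Standard deviation dividing by N (the 'simple' form)."""
    u = mean(values)
    return math.sqrt(sum((x - u) ** 2 for x in values) / len(values))


def sd_unbiased(values):
    """Standard deviation dividing by N - 1 (the 'unbiased' form)."""
    u = mean(values)
    return math.sqrt(sum((x - u) ** 2 for x in values) / (len(values) - 1))


sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical measurements
print(mean(sample), sd_simple(sample), sd_unbiased(sample))
```

The unbiased form is always slightly larger, and the gap matters most for small groups such as the four samples per group used in the WT/KO example.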
The Normal Distribution

Many continuous variables follow a normal distribution, and it plays a special role in the statistical tests we are interested in:

  • The x-axis represents the values of a particular variable
  • The y-axis represents the proportion of members of the population that have each value of the variable
  • The area under the curve represents probability – i.e. area under the curve between two values on the x-axis represents the probability of an individual having a value in that range
Normal Distribution and Confidence Intervals
  • Any normal distribution can be transformed to a standard distribution (mean 0, s.d. = 1) using a simple transform:
    \[ z = \frac{x - u}{\sigma} \]
  • The tail area 0.025 is a p-value: the probability of a value at least this extreme if the measurement belongs to this distribution
[Figure: standard normal curve; central area 1-a = 0.95 between -1.96 and 1.96, with a/2 = 0.025 in each tail]
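The 0.95 / 0.025 / 1.96 figures can be reproduced with Python's standard library. This is a sketch only: `z_score` is a hypothetical helper name, and the transform is assumed to be the usual one (subtract the mean, divide by the s.d.).

```python
from statistics import NormalDist


def z_score(x: float, mean: float, sd: float) -> float:
    """Map a value from a normal distribution onto the standard one (mean 0, s.d. 1)."""
    return (x - mean) / sd


std_normal = NormalDist(mu=0.0, sigma=1.0)

# Area between -1.96 and 1.96, and the area left in the upper tail.
central_area = std_normal.cdf(1.96) - std_normal.cdf(-1.96)
upper_tail = 1.0 - std_normal.cdf(1.96)

# Going the other way: the z value that leaves 0.025 in the upper tail.
z_crit = std_normal.inv_cdf(0.975)

print(round(central_area, 4), round(upper_tail, 4), round(z_crit, 4))
```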

Hypothesis Testing (Unpaired): Are two data sets different?
  • In unpaired experiments, we compare the difference between the means.
  • We use the z-test (normal distribution) if the standard deviations of the two populations from which the data sets came are known (and are the same); otherwise we use the t-test
  • We pose a null hypothesis (Ho) that the means are equal
  • We try to refute the hypothesis by using the curves to calculate the probability (p) of seeing a difference at least as large as the measured one if the null hypothesis were true (both means are equal)
    • If the probability is low (low p), reject the null hypothesis and accept the alternative hypothesis, Ha (the means are different)
    • If the probability is high (high p), do not reject the null hypothesis (the means may be equal)
[Figure: distributions of Population 1 and Population 2 under Ho and under Ha]

Comparing Two Samples: Graphical interpretation
  • To compare two groups, you can compare their means graphically.
  • The graphical comparison lets you see the distributions of the two groups.
  • If the p-value is low, chances are there will be little overlap between the two distributions. If the p-value is not low, there will be a fair amount of overlap between the two groups.
  • We can set a critical value on the x-axis based on the p-value threshold
Hypothesis Testing (Paired): Are two data sets different?
  • In paired experiments, we compare the mean difference.
  • We use the z-test (normal distribution) if the standard deviations of the two populations from which the data sets came are known; otherwise we use the t-test
  • We pose a null hypothesis (Ho) that the mean difference is zero
  • We try to refute the hypothesis by using the curves to calculate the probability (p) of seeing a mean difference at least as large as the measured one if the null hypothesis were true (mean of difference is 0)
    • If the probability is low (low p), reject the null hypothesis and accept the alternative hypothesis, Ha (mean of difference <> 0)
    • If the probability is high (high p), do not reject the null hypothesis (mean of difference may be 0)
[Figure: distributions of Population 1 and Population 2 under Ho and under Ha]

The t-test (typically known as Student's t-test)
  • In most cases we use what is known as a t-test rather than the z-test when comparing samples.
  • In particular when we have
    • small data sets (fewer than 30 each) and
    • we don’t know the s.d. and have to calculate it from the small samples
  • Same concepts as before apply, but we base the test on what is known as the t-distribution, which resembles the normal distribution but has heavier tails for small samples and approaches it as the sample size grows
  • We have to calculate what is known as a t-value!
The t-distribution
  • In fact we have many t-distributions; each one is calculated in reference to the number of degrees of freedom (d.f.), also known as variables (v)
  • We will see how we calculate the degrees of freedom in a short while
[Figure: normal distribution and t-distribution curves compared]

t-test terminology
  • t-test: Used to compare the mean of a sample to a known number (often 0).
  • Assumptions: Subjects are randomly drawn from a population and the distribution of the mean being tested is normal.
  • Test: The hypotheses for a single sample t-test are:
    • Ho: u = u0
    • Ha: u <> u0
    • (where u0 denotes the hypothesized value to which you are comparing a population mean)
  • p-value: probability of error in rejecting the hypothesis of no difference, i.e. the probability of seeing a result at least this extreme when there is in fact no difference.

t-Tests terminology: Single-tail vs. two-tail
  • What am I testing for:
    • Right Tail: (group1 > group2)
      • H0: u1 <= u2, H1: u1 > u2
      • OR H0: u1 - u2 <= 0, H1: u1 - u2 > 0
    • Left Tail: (group1 < group2)
      • H0: u1 >= u2, H1: u1 < u2
      • OR H0: u1 - u2 >= 0, H1: u1 - u2 < 0
    • Two Tail: Both groups are different but I don’t care how.
      • H0: u1 = u2, H1: u1 <> u2
      • OR H0: u1 - u2 = 0, H1: u1 - u2 <> 0

t-test terminology: Unpaired vs. paired t-test
  • Same as before!! Depends on your experiment
  • Unpaired t-test: The hypotheses for the comparison of two independent groups are:
    • Ho: u1 = u2 (means of the two groups are equal)
    • Ha: u1 <> u2 (means of the two groups are not equal)
  • Paired t-test: The hypotheses for paired measurements in the same individuals are:
    • Ho: D = 0 (the difference between the two observations is 0)
    • Ha: D <> 0 (the difference is not 0)
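Calculating t-test (t statistic)
  • First calculate the t statistic value and then calculate the p value
  • For the paired t-test, t is calculated using the following formula:
    \[ t = \frac{\bar{d}}{\sqrt{\sigma^2(d)/n}} \]
    where \bar{d}, the mean of the pair differences, is calculated by
    \[ \bar{d} = \frac{1}{n}\sum_{i=1}^{n}(x_i - y_i) \]
    σ(d) is the standard deviation of the differences, and n is the number of pairs being tested.
  • For an unpaired (independent group) t-test, the following formula is used:
    \[ t = \frac{\bar{x} - \bar{y}}{\sqrt{\sigma^2(x)/n(x) + \sigma^2(y)/n(y)}} \]
    where \bar{x} is the mean of x, σ(x) is the standard deviation of x and n(x) is the number of elements in x.
  • Remember these formulae!!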
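A minimal Python sketch of both statistics follows. It assumes the paired statistic is the mean difference divided by its standard error, and the unpaired statistic is the difference of means divided by the square root of the summed variance/size terms. The `ddof` argument is a hypothetical switch: 0 gives the simple divide-by-N variance, 1 the unbiased form. All measurements are made up.

```python
import math


def mean(values):
    return sum(values) / len(values)


def variance(values, ddof=0):
    """Squared s.d.; ddof=0 divides by N (simple form), ddof=1 by N - 1."""
    u = mean(values)
    return sum((x - u) ** 2 for x in values) / (len(values) - ddof)


def t_unpaired(x, y, ddof=0):
    """t for two independent groups: difference of means over its standard error."""
    std_err = math.sqrt(variance(x, ddof) / len(x) + variance(y, ddof) / len(y))
    return (mean(x) - mean(y)) / std_err


def t_paired(x, y, ddof=0):
    """t for paired measurements: mean of the pair differences over its standard error."""
    d = [a - b for a, b in zip(x, y)]
    return mean(d) / math.sqrt(variance(d, ddof) / len(d))


# Hypothetical measurements for four individuals.
before = [10.0, 12.0, 14.0, 16.0]
after = [12.0, 15.0, 15.0, 18.0]

print(t_paired(after, before))    # same individuals measured twice
print(t_unpaired(after, before))  # same numbers treated as two unrelated groups
```

The same eight numbers give a much larger t when treated as pairs, because pairing removes the variation between individuals from the comparison.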

Calculating p-value for t-test
  • When carrying out a test, a P-value can be calculated based on the t-value and the ‘Degrees of freedom’.
  • There are three methods for calculating P:
    • One Tailed >: P = p(t,v)
    • One Tailed <: P = 1 - p(t,v)
    • Two Tailed: P = 2 p(|t|,v)
  • Where p(t,v), taken here as the upper-tail probability of exceeding t, is looked up from the t-distribution table
  • The number of degrees of freedom (v) is calculated as:
    • Unpaired: n(x) + n(y) - 2
    • Paired: n - 1 (where n is the number of pairs)
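In place of a printed table, the look-up can be sketched in plain Python by integrating the t-distribution density numerically. This is an illustration under stated assumptions: `p(t, v)` is taken to mean the upper-tail probability P(T > t), the `kind` labels are hypothetical names, and Simpson's rule stands in for whatever routine a statistics package would use.

```python
import math


def t_pdf(t, v):
    """Density of Student's t-distribution with v degrees of freedom."""
    log_c = math.lgamma((v + 1) / 2) - math.lgamma(v / 2)
    return math.exp(log_c) / math.sqrt(v * math.pi) * (1 + t * t / v) ** (-(v + 1) / 2)


def p(t, v):
    """Assumed meaning of p(t, v): upper-tail probability P(T > t)."""
    a = abs(t)
    n = 2 * max(100, math.ceil(100 * a))  # even number of Simpson intervals
    h = a / n
    total = t_pdf(0.0, v) + t_pdf(a, v)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * t_pdf(i * h, v)
    tail = 0.5 - total * h / 3            # P(T > |t|), using symmetry about 0
    return tail if t >= 0 else 1.0 - tail


def p_value(t, v, kind="two-tailed"):
    """P for a one-tailed ('greater' or 'less') or two-tailed test."""
    if kind == "greater":
        return p(t, v)
    if kind == "less":
        return 1.0 - p(t, v)
    return 2.0 * p(abs(t), v)


def df_unpaired(n_x, n_y):
    return n_x + n_y - 2


def df_paired(n_pairs):
    return n_pairs - 1


v = df_unpaired(4, 4)  # two groups of 4, as in the WT/KO setting
print(v, p_value(2.5, v), p_value(2.5, v, "greater"))
```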
p-values
  • Results of the t-test: If the p-value associated with the t-test is small (the threshold is usually set at p < 0.05), there is evidence to reject the null hypothesis in favour of the alternative.
  • In other words, there is evidence that the mean is significantly different from the hypothesized value. If the p-value associated with the t-test is not small (p > 0.05), there is not enough evidence to reject the null hypothesis, and you conclude that there is no evidence that the mean differs from the hypothesized value.
Calculating t and p values
  • You will usually use a piece of software to calculate t and P
    • (Excel provides that!)
  • In problems
    • You can assume access to a function p(t,v) which calculates p for a given t value and v (number of degrees of freedom)
    • or alternatively have a table indexed by critical t values and v
t-value and p-value
  • Given a t-value and degrees of freedom, you can look up a p-value
  • Alternatively, if you know what p-value you need (e.g. 0.05) and the degrees of freedom, you can set the threshold for critical t
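Going from a required p-value back to a critical t can be sketched as a search over the tail probability. The code below is a self-contained illustration, not the method used in the lecture: the tail area comes from Simpson's rule and the threshold from bisection, and `critical_t` and its parameters are hypothetical names.

```python
import math


def t_pdf(t, v):
    """Density of Student's t-distribution with v degrees of freedom."""
    log_c = math.lgamma((v + 1) / 2) - math.lgamma(v / 2)
    return math.exp(log_c) / math.sqrt(v * math.pi) * (1 + t * t / v) ** (-(v + 1) / 2)


def upper_tail(t, v):
    """P(T > t) for t >= 0, by Simpson's rule on [0, t]."""
    n = 2 * max(100, math.ceil(100 * t))  # even number of intervals
    h = t / n
    total = t_pdf(0.0, v) + t_pdf(t, v)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * t_pdf(i * h, v)
    return 0.5 - total * h / 3


def critical_t(alpha, v, two_tailed=False):
    """Smallest t whose tail area is alpha (or alpha / 2 per tail if two-tailed)."""
    target = alpha / 2 if two_tailed else alpha
    lo, hi = 0.0, 1.0
    while upper_tail(hi, v) > target:     # widen the bracket until it contains the answer
        hi *= 2
    for _ in range(60):                   # bisection on the monotone tail area
        mid = (lo + hi) / 2
        if upper_tail(mid, v) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


print(round(critical_t(0.05, 10), 3))                   # one tail, 10 d.f.
print(round(critical_t(0.05, 44, two_tailed=True), 4))  # two tails, 44 d.f.
```

A measured t beyond the returned threshold corresponds to a p-value below the chosen level.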
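t-test Interpretation
[Figure: t-distribution centred on 0 with rejection regions (Reject H0) of area .025 in each tail, beyond t = -2.0154 and t = 2.0154]
  • Note: as t increases, p decreases
  • To reject H0, the calculated t value must exceed the critical t from the table at the chosen P level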

Finding a critical t
  • The table provides the t values (tc) for which P(tx > tc) = A
[Figure: t-distribution with tail area A = .05 beyond tc = 1.812, and likewise below -tc = -1.812; table columns t.100, t.05, t.025, t.01, t.005]

Summary
  • Differential analysis
    • Uses fold ratio (fold change) for measuring effect
    • Need some measure of significance of such effect.
  • Statistical analysis
    • Paired vs. unpaired experiments
  • t-tests
    • Calculating t for paired/un-paired experiments
    • Deciding single tail vs. two-tail
    • Calculating degrees of freedom
    • Look-up p value