Introduction to Bioinformatics 6. Statistical Analysis of Gene Expression Matrices II

Introduction to Bioinformatics 6. Statistical Analysis of Gene Expression Matrices II • Course 341 • Department of Computing, Imperial College, London • Moustafa Ghanem

Lecture Overview • Motivation • Get a feel for t-values and how they change • Volcano plots • Visual method for differential gene expression analysis • Meaning of x and y axes • Interpretation of results

Interpretation of t-test • The higher the absolute t-value, the lower the p-value, and the more confident you are that the observed difference is real

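Calculating t-test (t statistic) • First calculate the t statistic, then calculate the p-value from it. • For the paired t-test, t is calculated using the following formula:

$$t = \frac{\bar{d}}{\sigma(d)/\sqrt{n}}$$

where $\bar{d}$ is the mean of the pairwise differences, calculated by $\bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i$ with $d_i = x_i - y_i$, $\sigma(d)$ is the standard deviation of the differences, and $n$ is the number of pairs being tested. • For an unpaired (independent group) t-test, the following formula is used:

$$t = \frac{\bar{x} - \bar{y}}{\sqrt{\dfrac{\sigma(x)^2}{n(x)} + \dfrac{\sigma(y)^2}{n(y)}}}$$

where $\sigma(x)$ is the standard deviation of x and $n(x)$ is the number of elements in x. • Remember these formulae!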
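The two statistics can be sketched in a few lines of Python. This is a minimal illustration, not the lecturer's code: the function names `paired_t` and `unpaired_t` and the sample values are made up, and the sample standard deviation (dividing by n - 1) is an assumption, since the slides do not say which estimator they use.

```python
import math
from statistics import mean, stdev  # stdev is the sample S.D. (n - 1)


def paired_t(x, y):
    """Paired t: mean of the pairwise differences over its standard error."""
    d = [a - b for a, b in zip(x, y)]
    return mean(d) / (stdev(d) / math.sqrt(len(d)))


def unpaired_t(x, y):
    """Unpaired t: difference of group means over the combined standard error."""
    se = math.sqrt(stdev(x) ** 2 / len(x) + stdev(y) ** 2 / len(y))
    return (mean(x) - mean(y)) / se


# Hypothetical expression values for one gene in 4 WT and 4 KO samples.
wt = [10.0, 12.0, 11.0, 13.0]
ko = [8.0, 9.0, 10.0, 9.0]

print("paired t:  ", paired_t(wt, ko))
print("unpaired t:", unpaired_t(wt, ko))
```

For the same data the two statistics differ, because the paired version only sees the spread of the differences while the unpaired version sees the spread inside each group.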

Calculating and Interpreting t-values Consider the following examples, and assume a paired experiment:

High t-value • Take Gene A, assuming a paired test (the same holds for either type of test) • Average difference = 100, S.D. = 0 • t-value tends to infinity • p is extremely low

Consider Gene M for a paired experiment • Average difference = 0 • t-value is zero. What does this mean? The signal is zero, so there is no evidence that the gene has changed.

Consider Gene T for a paired experiment

Graphical Interpretation of t-test (Paired) • t = Mean of differences / (S.D. of differences / sqrt(n)) • t-value = Signal/Noise ratio • [Figure: differences d1 to d4 plotted by sample ID around their mean davg] • Case 1: Low variation around mean of differences • Case 2: Moderate variation around mean of differences • Case 3: Large variation around mean of differences
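The signal/noise reading can be checked numerically. The sketch below uses three made-up sets of four differences that share the same mean (the signal) but have increasing spread (the noise); the helper name `t_from_differences` and all values are hypothetical, and the sample standard deviation is assumed.

```python
import math
from statistics import mean, stdev


def t_from_differences(d):
    """Signal (mean difference) divided by noise (standard error of the mean)."""
    return mean(d) / (stdev(d) / math.sqrt(len(d)))


# Same mean difference (10) in every case, only the variation changes.
cases = {
    "low variation": [9.0, 10.0, 11.0, 10.0],
    "moderate variation": [5.0, 10.0, 15.0, 10.0],
    "large variation": [-10.0, 10.0, 30.0, 10.0],
}

t_values = {name: t_from_differences(d) for name, d in cases.items()}
for name, t in t_values.items():
    print(f"{name}: t = {t:.2f}")
```

The t-value falls as the variation grows, even though the average difference never changes.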

Back to our problem • 5000 rows represent genes • Columns represent samples • 4 KO samples (red) • 4 wild-type (WT) samples (blue)

Hypothesis Testing • Uses hypothesis testing methodology. • For each gene (>5,000): • Pose the null hypothesis (Ho) that the gene is not affected • Pose the alternative hypothesis (Ha) that the gene is affected • Use statistical techniques to calculate the p-value: the probability of seeing a difference at least this large if Ho were true • If p-value < some critical value, reject Ho and accept Ha • The issues: • Large number of genes (or experiments) • Need a quick way to filter out significant genes that have high fold change • Need also to sort genes by fold change and significance
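The per-gene procedure can be sketched as a loop over the rows of the matrix. To stay within the standard library, this sketch does not compute a p-value; instead it compares |t| with the tabulated two-sided 5% critical value for 3 degrees of freedom (3.182), which is equivalent to testing p < 0.05 for 4 pairs. The gene names, the values, and the choice of a 5% level are assumptions for illustration only.

```python
import math
from statistics import mean, stdev

T_CRIT = 3.182  # two-sided 5% critical value of t for 3 degrees of freedom


def paired_t(wt, ko):
    """Paired t statistic; a zero S.D. gives an infinite t (or 0 if no change)."""
    d = [a - b for a, b in zip(wt, ko)]
    avg, sd = mean(d), stdev(d)
    if sd == 0:
        return math.copysign(math.inf, avg) if avg != 0 else 0.0
    return avg / (sd / math.sqrt(len(d)))


# Hypothetical rows of the expression matrix: gene -> (4 WT values, 4 KO values).
genes = {
    "gene_1": ([10.0, 12.0, 11.0, 13.0], [8.0, 9.0, 10.0, 9.0]),
    "gene_2": ([10.0, 12.0, 11.0, 13.0], [11.0, 10.0, 12.0, 13.0]),
    "gene_3": ([150.0, 160.0, 170.0, 180.0], [50.0, 60.0, 70.0, 80.0]),
}

# Reject Ho (gene not affected) when |t| exceeds the critical value.
affected = [g for g, (wt, ko) in genes.items() if abs(paired_t(wt, ko)) > T_CRIT]
print(affected)
```

Here `gene_3` mirrors the Gene A case (constant difference of 100, S.D. of 0) and `gene_2` mirrors Gene M (average difference of 0).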

Volcano Plots • For each gene, compare the value of the effect between populations WT vs. KO (fold change) • For each gene, calculate the significance of the change (t-test, p-value) • Identify genes with high effect and high significance • Volcano plots are a graphical means of visualising the results of large numbers of t-tests, allowing us to plot both the effect and the significance of each test in an easy-to-interpret way

Volcano plots • In a volcano plot, the x-axis represents the effect, measured as fold change on a log2 scale: $\text{Effect} = \log_2(\mathrm{WT}) - \log_2(\mathrm{KO}) = \log_2(\mathrm{WT}/\mathrm{KO})$ • If WT = KO, Effect = 0 • If WT = 2 KO, Effect = 1 ...
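A minimal sketch of the x-axis calculation follows. The function name `effect` and the expression values are hypothetical; the formula is the log2 ratio given on the slide.

```python
import math


def effect(wt, ko):
    """Log2 fold change between the WT and KO expression values."""
    return math.log2(wt / ko)


print(effect(100.0, 100.0))  # equal expression
print(effect(200.0, 100.0))  # WT is double KO
print(effect(50.0, 100.0))   # WT is half of KO
```

The log scale makes doubling and halving symmetric around zero, which is why the plot can show positive and negative effects side by side.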

Numerical Interpretation (Effect) • Using log2 for the x-axis: • Effect = 1: the value has doubled, 2^1 (2 raised to the power of 1), a two-fold change • Effect = -1: the value has halved, 2^-1 = 0.5 (2 raised to the power of -1)

Volcano plots • In a volcano plot, the y-axis represents the number of zeroes in the p-value • Remember: with a p-value of 0.0001 you are more confident than with a p-value of 0.01 • This is just a trick so that higher values on the graph are more important • Calculate Significance as $-\log_{10}(p)$ • If p = 0.1, -log10(0.1) = 1 (1 decimal place) • If p = 0.01, -log10(0.01) = 2 (2 decimal places) ...
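The y-axis transform is equally short. The sketch below assumes the p-values have already been obtained from the t-test; the function name `significance` is made up for illustration.

```python
import math


def significance(p_value):
    """Volcano plot y-axis value: minus log10 of the p-value."""
    return -math.log10(p_value)


for p in (0.1, 0.01, 0.0001):
    print(p, significance(p))
```

Smaller p-values map to larger y values, so the most significant genes end up at the top of the plot.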

Numerical Interpretation (Significance) • Using log10 for the y-axis: • Significance > 1 corresponds to p < 0.1 (1 decimal place) • Significance > 2 corresponds to p < 0.01 (2 decimal places)

Visualise the Result: Volcano Plot • Effect vs. Significance • Choosing log scales is a matter of convenience • Effect can be either +ve or -ve • Items that have both a large effect and high significance can be identified easily • [Figure: volcano plot; the x-axis runs from -ve effect to +ve effect and the y-axis from low to high significance; the upper left and upper right regions hold the high effect and significance genes, the rest is "boring stuff"]
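Selecting the interesting regions of the plot amounts to thresholding both axes. The sketch below is one possible way to do it: the cut-offs (|effect| of at least 1, significance of at least 2), the function name `select_genes`, and the points are all assumptions, not values from the lecture.

```python
def select_genes(points, min_effect=1.0, min_significance=2.0):
    """Split (gene, effect, significance) points into the two upper regions."""
    selected = {"positive": [], "negative": []}
    for gene, effect, significance in points:
        if significance < min_significance:
            continue  # lower part of the plot: not significant enough
        if effect >= min_effect:
            selected["positive"].append(gene)
        elif effect <= -min_effect:
            selected["negative"].append(gene)
    return selected


# Hypothetical (gene, log2 effect, -log10 p) triples.
points = [
    ("g1", 2.1, 3.5),   # large +ve effect, significant
    ("g2", -1.8, 2.4),  # large -ve effect, significant
    ("g3", 0.2, 4.0),   # significant but tiny effect
    ("g4", 3.0, 0.5),   # large effect but not significant
]

selected = select_genes(points)
print(selected)
```

Sorting each selected list by effect or by significance would give the ranked gene lists mentioned under the hypothesis testing issues.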

Summary • The t-test is good for small samples (in our case 4 paired observations) • The t distribution approximates the normal distribution when degrees of freedom > 30 • Remember the formulae for paired/unpaired tests • A volcano plot is a simple method for visualising large sets of such observations • Remember the formula for the x-axis • Remember the formula for the y-axis
