Gene Expression Data Analyses (3)

Gene Expression Data Analyses (3). Trupti Joshi Computer Science Department 317 Engineering Building North E-mail: joshitr@missouri.edu 573-884-3528(O). Lecture Outline -1. Statistical significance vs. biological relevance Statistical methods Two sample statistical tests

Presentation Transcript


  1. Gene Expression Data Analyses (3) Trupti Joshi Computer Science Department 317 Engineering Building North E-mail: joshitr@missouri.edu 573-884-3528(O)

  2. Lecture Outline -1 • Statistical significance vs. biological relevance • Statistical methods • Two sample statistical tests • Parametric: T-test (paired and unpaired t test) • Non-parametric: • Mann-Whitney test for independent samples • Wilcoxon signed-rank test for paired data • Multivariate statistics • One-way vs Two-way analysis of variance (ANOVA) • Kruskal-Wallis • Multiple comparison corrections • Bonferroni Correction • False Discovery Rate

  3. Lecture Outline -2 • Data interpretation • Selection of software • Image analysis • Imagene • Statistical analysis • GeneSpring • SAM • ArrayStat

  5. Why Statistical Analysis? • Rank results by confidence using significance metrics (e.g. p-value) • Estimate the rates of false positives (Type I errors) and false negatives (Type II errors) • Achieve the desired balance of sensitivity and specificity • Leaves a certain amount of flexibility (and arbitrariness) in interpreting the significance metrics generated by a test
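
The balance between sensitivity and specificity mentioned above can be made concrete with a small sketch. The function name and the counts below are hypothetical and only illustrate how the two quantities follow from the numbers of Type I and Type II errors.

```python
def sensitivity_specificity(tp, fp, tn, fn):
    """Return (sensitivity, specificity) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # share of truly changed genes that were detected
    specificity = tn / (tn + fp)  # share of unchanged genes that were not called
    return sensitivity, specificity


# Hypothetical counts: 50 false positives are Type I errors,
# 20 false negatives are Type II errors.
sens, spec = sensitivity_specificity(tp=80, fp=50, tn=850, fn=20)
print(f"sensitivity={sens:.3f}, specificity={spec:.3f}")
```

A stricter p-value cutoff usually lowers the false positive count (higher specificity) at the cost of more false negatives (lower sensitivity).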

  6. Statistical Significance vs. Biological Relevance

  8. Normal Distribution • Central peak: mean • Symmetrical

  9. Parametric Analysis • Test the hypothesis that one or more treatments have no effect on the mean and variance of a chosen variable • Assume the measured data follow a normal distribution • Disadvantage: results are unreliable if the data are not normally distributed

  10. Non-parametric Analysis • Use ranks of numerical data rather than the data themselves • Use information about the relative sizes of observations, without making any assumptions about the means and variances of the populations being tested • Can be used for any data set • Disadvantage: if the data are normally distributed, it is less powerful than parametric analysis

  12. T test • Paired t test: • The two groups must be the same size, because each observation is matched with one in the other group • Compares the same organism before and after treatment (e.g. before and after heat shock) • Unpaired t test: • The two groups do not need to be the same size • Compares treated organisms with untreated organisms

  13. How to Perform T test Paired T-test Un-Paired T-test

  14. T-test example Unpaired T test Paired T test
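
The formulas and worked numbers on these two slides did not survive in the transcript. As a stand-in, here is a minimal sketch of the two t statistics using only the Python standard library. The function names and data are illustrative assumptions, and the unpaired version assumes equal variances (Student's pooled-variance form). Converting t to a p-value needs a t table or a statistics package.

```python
import math
from statistics import mean, stdev, variance


def paired_t(before, after):
    """t statistic and degrees of freedom for a paired t test."""
    if len(before) != len(after):
        raise ValueError("paired samples must have the same size")
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    # Mean difference divided by its standard error
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


def unpaired_t(x, y):
    """Pooled-variance t statistic and degrees of freedom (equal variances assumed)."""
    nx, ny = len(x), len(y)
    pooled = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    t = (mean(x) - mean(y)) / math.sqrt(pooled * (1 / nx + 1 / ny))
    return t, nx + ny - 2


# Hypothetical expression values
print(paired_t([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 5.0, 5.0]))
print(unpaired_t([1.0, 2.0, 3.0], [2.0, 4.0, 6.0, 8.0]))
```

The paired version works on the per-sample differences, which is why both groups must have the same size. The unpaired version only needs the mean, variance and size of each group.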

  15. Mann-Whitney Test • Use if the samples are not normally distributed • Similar to the unpaired t test but non-parametric • Uses the ranks of the numerical values instead of their means and variances

  16. Mann-Whitney Test--example
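
The example on this slide is missing from the transcript. The sketch below shows one common way to compute the Mann-Whitney U statistic from ranks. The helper names and the data are hypothetical, and the result still has to be compared against a U table (or a normal approximation) to get a p-value.

```python
def average_ranks(values):
    """1-based ranks; tied values share the average of the ranks they span."""
    return [sum(v < x for v in values) + (sum(v == x for v in values) + 1) / 2
            for x in values]


def mann_whitney_u(x, y):
    """Return the smaller of the two U statistics for independent samples x and y."""
    ranks = average_ranks(list(x) + list(y))  # rank the pooled data
    rank_sum_x = sum(ranks[:len(x)])
    u_x = rank_sum_x - len(x) * (len(x) + 1) / 2
    u_y = len(x) * len(y) - u_x
    return min(u_x, u_y)


# Hypothetical data: every value in the first group is below the second group
print(mann_whitney_u([1.1, 2.3, 2.9], [3.5, 4.0, 4.8, 5.2]))
```

A small U means the two groups barely overlap in rank order, which is evidence against the null hypothesis.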

  17. Wilcoxon Signed-Rank Test • Use if the samples are not normally distributed • Similar to the paired t test but non-parametric • Rank the absolute differences between paired arrays • If the difference within a pair is 0, that pair is not used • If two pairs have the same absolute difference, both receive the average of the tied ranks • Look up the result in a Wilcoxon table
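
The ranking rules above can be sketched as follows. This is an illustrative implementation with hypothetical names and data, not the procedure of any particular package. It returns the statistic W and the number of pairs actually used, which are the two values needed for a Wilcoxon table lookup.

```python
def average_ranks(values):
    """1-based ranks; tied values share the average of the ranks they span."""
    return [sum(v < x for v in values) + (sum(v == x for v in values) + 1) / 2
            for x in values]


def wilcoxon_signed_rank(before, after):
    """Return (W, n): the smaller signed-rank sum and the number of non-zero pairs."""
    # Pairs with a zero difference are dropped
    diffs = [a - b for a, b in zip(after, before) if a != b]
    ranks = average_ranks([abs(d) for d in diffs])  # ties get the average rank
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return min(w_plus, w_minus), len(diffs)


# Hypothetical paired measurements; the second pair has zero difference
print(wilcoxon_signed_rank([10, 12, 9, 11, 14], [12, 12, 8, 13, 17]))
```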

  19. ANOVA (Analysis of Variance) • A parametric test • Assumes a normal distribution • The variance in the groups must be equal • The data points in each group must be from independent samples • With only two groups, ANOVA is equivalent to the t test

  20. Perform ANOVA • Two estimates of variance are taken • Estimate the variance within the group based on the standard deviation of each group • Estimate the variance among groups based on the variability between means of each group

  21. One-Way ANOVA

  22. One-Way ANOVA-example
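
The worked example on this slide is not in the transcript. A minimal sketch of the one-way ANOVA F statistic, built from the two variance estimates described earlier (within groups and among group means), could look like this. The function name and data are hypothetical. The F value is then compared against an F table with the returned degrees of freedom.

```python
from statistics import mean


def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of groups of observations."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean([v for g in groups for v in g])
    # Variability among the group means
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variability of observations around their own group mean
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between, df_within = k - 1, n_total - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within


# Hypothetical expression values for three conditions
print(one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]]))
```

A large F means the group means differ more than the within-group scatter would explain.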

  23. Two-Way ANOVA

  24. Two-Way ANOVA--example

  25. Kruskal-Wallis • Non-parametric equivalent of one-way ANOVA • The test statistic is compared against the chi-square distribution with k-1 degrees of freedom, where k is the number of groups
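
As a rough illustration, the Kruskal-Wallis statistic H can be computed from the rank sums of the pooled data. The sketch below uses hypothetical names and data and omits the tie correction factor, so it is only exact when there are no tied values. H is then looked up in a chi-square table with k-1 degrees of freedom.

```python
def average_ranks(values):
    """1-based ranks; tied values share the average of the ranks they span."""
    return [sum(v < x for v in values) + (sum(v == x for v in values) + 1) / 2
            for x in values]


def kruskal_wallis(groups):
    """Return (H, df) without tie correction; df = number of groups - 1."""
    pooled = [v for g in groups for v in g]
    ranks = average_ranks(pooled)
    n_total = len(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])  # rank sum of this group
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12 / (n_total * (n_total + 1)) * total - 3 * (n_total + 1)
    return h, len(groups) - 1


# Hypothetical data for three conditions
print(kruskal_wallis([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
```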

  27. Multiple Comparison Corrections • As the number of tests increases, the number of results called significant increases • The number of false positives (Type I errors) may increase as well • To fix this problem, apply some sort of adjustment to the p-values or α-levels

  28. Let k = the number of groups and K = the number of pairwise comparisons that are necessary, K = k(k-1)/2. Each subsequent column of the table represents the chosen level of significance. The table shows the increased likelihood of generating a Type I error when performing multiple pair-wise comparisons.
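
The table itself is not in the transcript, but its entries can be approximated with a short calculation. The sketch below assumes the K pairwise comparisons are independent, which is a simplification. The function names are hypothetical.

```python
def n_pairwise(k):
    """Number of pairwise comparisons K among k groups."""
    return k * (k - 1) // 2


def familywise_error(k, alpha):
    """Chance of at least one Type I error across all K comparisons (independence assumed)."""
    return 1 - (1 - alpha) ** n_pairwise(k)


for k in (2, 3, 5, 10):
    print(k, n_pairwise(k), round(familywise_error(k, 0.05), 3))
```

Even with only 5 groups there are 10 comparisons, and the chance of at least one false positive at a per-test level of 0.05 is already about 40%.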

  29. Bonferroni Correction • The cut-off level of significance being used is divided by the number of comparisons being made • Instead of testing each hypothesis at level α, test each at level α/m, where m is the number of tests • Good for a small number of comparisons • May be too conservative
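
A minimal sketch of the correction, with a hypothetical function name and made-up p-values:

```python
def bonferroni(p_values, alpha=0.05):
    """Return one True/False per test: True if it is significant at level alpha/m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


# Hypothetical p-values for five genes; the per-test cutoff becomes 0.05 / 5 = 0.01
print(bonferroni([0.001, 0.012, 0.025, 0.045, 0.2]))
```

With thousands of genes on an array the per-test cutoff becomes extremely small, which is why the correction may be too conservative.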

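  30. False Discovery Rate • Classical multiple-testing procedures control Prob(V ≥ 1), where V is the number of falsely rejected hypotheses • When the number of tests m is huge, some false rejections (Type I errors) are likely to occur • Better to control the expected proportion of false rejections among all rejections • Intuitive definition of false discovery rate: FDR = E[V/R], where R is the total number of rejections • Compared to Bonferroni: • Bonferroni: the error rate is fixed and the rejection area is estimated • FDR: the rejection area is fixed and the error rate is estimated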

  31. Two Algorithms for FDR • Benjamini and Hochberg: • The rate at which false discoveries occur • Fix a cutoff α*, and then derive a decision rule that achieves FDR ≤ α* • Storey: • The rate at which discoveries are false • Fix a decision rule, and then estimate the FDR associated with using this decision rule • Estimate m0, the number of true null hypotheses
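
The Benjamini and Hochberg approach can be sketched as the usual step-up rule: sort the p-values, find the largest rank i with p(i) ≤ (i/m)·α*, and reject everything up to that rank. The function name and p-values below are hypothetical, and the sketch does not cover Storey's estimate of m0.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one True/False per test: True if rejected at FDR level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k_max = rank  # remember the largest rank that passes its threshold
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values for five genes
print(benjamini_hochberg([0.001, 0.012, 0.025, 0.045, 0.2]))
```

On these same five p-values a Bonferroni cutoff of 0.05/5 = 0.01 would keep only the first gene, whereas the step-up rule keeps the first three.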

  35. How to Interpret Expression Profiling Data • Overlay functional information and allow biological context to help decide what is of interest and what is not • Use computational methods (classification, clustering, promoter prediction, etc.) • Data mining tools • Public identifiers: GenBank, Swiss-Prot, Gene Ontology (GO) • Databases: LocusLink, HomoloGene, RefSeq, UniGene, etc. • GeneFAS (Digbio), GenePath (Digbio), NetAffx, etc.

  36. Gene Ontology (GO) • One of the most commonly used public-domain sources of gene classification • Provides controlled vocabulary hierarchies for • molecular function • biological process • cellular component

  37. GO

  38. Current GO annotation • http://www.geneontology.org/GO.current.annotations.shtml • More than 30 species are listed

  40. Image Analysis • More than 20 software packages are listed at http://ihome.cuhk.edu.hk/~b400559/arraysoft_image.html • Imagene (BioDiscovery, Inc.)

  41. Imagene Analysis

  42. Flagging Spot

  43. Defining Thresholds for Empty Spots

  45. GeneSpring • GeneSpring (Silicon Genetics) • Broadly used • Nice user interface • Data Normalization (Lowess, etc.) • Powerful ANOVA statistical analysis • t-test/1-way ANOVA test • 2-way ANOVA tests • 1-way post-hoc tests for reliably identifying differentially expressed genes • Incorporation of different analysis tools • Clustering • Visual filtering • Pathway viewing • Scripting

  46. ANOVA in GeneSpring (I) • Use Tools -> Statistical Analysis -> test type: "parametric, assume variance equal" or "parametric, don't assume variance equal" • Applies when technical replicates are on different slides, together with biological replicates (e.g. as in the case of one-color arrays) • GeneSpring does not distinguish between technical replicates and biological replicates

  47. ANOVA in GeneSpring (II) • Use Tools -> Statistical Analysis -> test type: "parametric, assume variance equal" or "parametric, don't assume variance equal" • The on-chip variance is ignored • Applies when technical replicates are spotted on a chip (i.e. on-chip replicates), together with biological replicates • Example: 3 sets of on-chip replicates × 2 biological replicates for group A, and the same setup for group B • GeneSpring first averages the on-chip replicates. This gives one average on-chip value for biological replicate #1 and another for biological replicate #2. GeneSpring then uses these two averages to compute the ANOVA. The df is 2-1.

  48. ANOVA in GeneSpring (III) • Use Tools -> Statistical Analysis -> test type: "parametric, use all available error measurements" • In this case, both the on-chip and the biological replicate information are used • Applies when technical replicates are spotted on a chip (i.e. on-chip replicates), together with biological replicates • Example: 3 sets of on-chip replicates × 2 biological replicates for group A, and the same setup for group B • GeneSpring takes both on-chip and biological variance into account when calculating the ANOVA. The degrees of freedom also account for both types of replicates. The equation for the degrees of freedom is quite complex, because GeneSpring takes the standard error of the on-chip and biological replicates into consideration. This is done so that different levels of variation between technical and biological replicates are accounted for.

  49. Error correction • P-value cutoff/false discovery rate: 0.05 • Multiple testing correction: too conservative, use None • Post-hoc testing: used for 3 or more conditions • Shows the pairs of conditions between which significant changes are detected

  50. Significance Analysis of Microarrays (SAM) • From Stanford (http://www-stat.stanford.edu/~tibs/SAM/) • Correlates gene expression data to a wide variety of clinical parameters including treatment, diagnosis categories, survival time and time trends • Provides an estimate of the False Discovery Rate for multiple testing • Automatic imputation of missing data via a nearest neighbor algorithm • Can deal with blocked designs, for example, when treatments are applied within different batches of arrays • Convenient Excel Add-in • Works with data from both cDNA and oligo microarrays. Can also be applied to protein expression data and SNP chip data • Genes are web-linked to the Stanford SOURCE database
