
Integrating Biology and Statistics: Gene Set Methods



Presentation Transcript


  1. Integrating Biology and Statistics: Gene Set Methods BIOS 691-003 Winter/Spring 2010

  2. Philosophical Overture • Integrating biology and statistics • Gene sets: genes whose protein products collaborate on a well-defined function • Vague! • Hard to define ‘function’ or draw boundary on ‘gene sets’ • Statistical methods often ad-hoc • Be skeptical... but optimistic

  3. Historical Motivations • Too many genes are significant • Researchers used to generate a list by p-value and comb for genes that work together • First pathway tools automated this process • Patterns may be more significant than any individual gene • e.g. if most genes in glycogen biosynthesis are up, but none is significant individually (after multiple-comparisons adjustment) • We can infer that glycogen is being made

  4. Goals of Current Practice • Characterize biological meaning of joint changes in gene expression • Organize expression (or other) changes into meaningful ‘chunks’ (themes) • Identify crucial points in process where intervention could make a difference

  5. Gene Sets • Gene Ontology • Biological Process • Molecular Function • Cellular Component • Pathway Databases • KEGG • BioCarta • MSigDB (Broad Institute)

  6. Approaches • Univariate (most of current practice): • Discrete methods based on counting • Continuous methods: summarize gene test statistics by set • Multivariate (promising but unclear): • Compare differences to normal covariation of genes in groups across individuals • Use known biological relationships to construct test statistics

  7. Univariate Approaches • Discrete tests: enrichment for groups in gene lists • Select genes differentially expressed at some cutoff • For each gene group cross-tabulate • Test for significance (Hypergeometric or Fisher test) • Continuous tests: from gene scores to group scores • Compare distribution of scores within each group to random selections • GSEA (Gene Set Enrichment Analysis) • PAGE (Parametric Analysis of Gene Expression)

  8. Discrete Approach – 2 x 2 Table • For each set in turn, construct a 2 x 2 table of significance vs. membership in the set • P = [formula not preserved in the transcript]
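The probability of a 2 x 2 table with fixed margins is usually written as a hypergeometric probability, which may be what the missing formula showed. The notation below is an assumption and need not match the slide's own n and k: N is the number of genes on the chip, m the number of those genes in the set, s the number of genes declared significant, and x the number of significant genes that fall in the set.

```latex
% Assumed notation (not from the slide): N genes on chip, m in the set,
% s declared significant, x in both the set and the significant list.
P(X = x) \;=\; \frac{\binom{m}{x}\,\binom{N-m}{\,s-x\,}}{\binom{N}{s}},
\qquad
p_{\text{enrich}} \;=\; \sum_{x' = x}^{\min(m,\,s)} P(X = x')
```

The second expression is the one-sided tail probability, i.e. the chance of seeing at least x set members among the significant genes when membership and significance are independent.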

  9. Significance Testing of Categories • Fisher’s Exact Test • Condition on margins fixed • Of all tables with same margins, how many have dependence as or more extreme? • Hard to compute when either n or k is large • Approximations • Binomial (when k/n is small) • Chi-square (when expected values > 5) • G² (log-likelihood ratio; compare to χ² on 1 df)
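A minimal sketch of the one-sided Fisher exact (hypergeometric) test described above, using only the Python standard library. The function names and the argument names (N genes on chip, m in the set, s significant, x overlap) are illustrative assumptions, not the slide's notation.

```python
from math import comb


def hypergeom_pmf(x, N, m, s):
    """P(X = x): exactly x of the s significant genes fall in a set of
    size m, when the chip carries N genes in total (margins fixed)."""
    return comb(m, x) * comb(N - m, s - x) / comb(N, s)


def enrichment_pvalue(x, N, m, s):
    """One-sided Fisher exact p-value: probability of an overlap of x or
    more, summed over all tables with the same margins."""
    upper = min(m, s)
    return sum(hypergeom_pmf(i, N, m, s) for i in range(x, upper + 1))


if __name__ == "__main__":
    # Toy numbers: 10 genes on chip, 4 in the set, 3 significant, 2 overlap.
    print(enrichment_pvalue(2, N=10, m=4, s=3))
```

Exact integer binomials keep the sum accurate, but the cost grows with the margins, which is why the approximations listed on the slide are used for large tables.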

  10. Practical Issues – I • What is appropriate Null Distribution? • Highly correlated because many overlaps • Must do permutation analysis • How to permute? • Random sets of genes? Or • Random assignments of samples? • P-value or FDR? • Heuristic method • More constrained by annotation than statistics
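One way to read "random assignments of samples" is to shuffle the sample labels and recompute a set-level score each time, which keeps the gene-gene correlations intact. The sketch below assumes a two-group design with labels 0 and 1 and uses a simple mean-difference score; the data layout (`expr` as a dict of per-gene value lists) and all function names are hypothetical.

```python
import random
from statistics import mean


def set_score(expr, labels, gene_set):
    """Average, over genes in the set, of (group-1 mean - group-0 mean)."""
    diffs = []
    for gene in gene_set:
        g1 = [v for v, lab in zip(expr[gene], labels) if lab == 1]
        g0 = [v for v, lab in zip(expr[gene], labels) if lab == 0]
        diffs.append(mean(g1) - mean(g0))
    return mean(diffs)


def sample_permutation_pvalue(expr, labels, gene_set, n_perm=1000, seed=0):
    """Two-sided p-value from shuffling sample labels (not genes)."""
    rng = random.Random(seed)
    observed = abs(set_score(expr, labels, gene_set))
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # group sizes stay the same
        if abs(set_score(expr, shuffled, gene_set)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    expr = {"a": [0, 0, 0, 0, 5, 5, 5, 5], "b": [1, 1, 1, 1, 6, 6, 6, 6]}
    labels = [0, 0, 0, 0, 1, 1, 1, 1]
    print(sample_permutation_pvalue(expr, labels, ["a", "b"]))
```

Permuting genes instead would break the correlation among set members and typically gives a null distribution that is too narrow.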

  11. Practical Issues – II • If a child category is declared significant, how to assess significance of parent category? • Include child category • Consider only genes external to child • In practice big categories are not useful • Small categories may not be well represented on chip • Select categories in middle range: 5-20 genes represented on chip

  12. Critiques of Discrete Approach • No use of information about size of change • Large t scores count like small t’s • Continuous procedures have more power than discrete procedures on discretized continuous data

  13. GSEA (Gene Set Enrichment Analysis) • Introduced in 2003 by Mootha to address a puzzle in a diabetes data set • No genes significant individually • But Oxidative Phosphorylation mostly up • GSEA tests rank of genes in a gene set against randomly distributed ranks • Kolmogorov-Smirnov test: • Maximum difference between ranks of genes in set and uniform distribution

  14. Kolmogorov-Smirnov Test • Based on statistics of the ‘Brownian bridge’: a random walk with a fixed end • Maximum difference is the test statistic • Null distribution is known • Reformulated by GSEA as the difference (CDF – uniform), measured from the axis
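The "CDF minus uniform" statistic can be sketched as follows: walk down the ranked gene list, track the fraction of set members seen so far, and subtract the fraction of the whole list seen so far. This is an unweighted illustration only; the function name and input format are assumptions, and it is not the weighted running sum used in later GSEA versions.

```python
def ks_enrichment(ranked_genes, gene_set):
    """Signed maximum deviation between the empirical CDF of set-member
    ranks and the uniform CDF over all ranks."""
    members = set(gene_set)
    total = len(ranked_genes)
    n_in_set = sum(1 for g in ranked_genes if g in members)
    if n_in_set == 0:
        raise ValueError("no genes from the set appear in the ranked list")

    hits = 0
    best = 0.0
    for position, gene in enumerate(ranked_genes, start=1):
        if gene in members:
            hits += 1
        # CDF of set members minus the uniform CDF at this position.
        deviation = hits / n_in_set - position / total
        if abs(deviation) > abs(best):
            best = deviation
    return best


if __name__ == "__main__":
    ranked = list("abcdefghij")  # most up-regulated first
    print(ks_enrichment(ranked, {"a", "b"}))  # set concentrated at the top
    print(ks_enrichment(ranked, {"i", "j"}))  # set concentrated at the bottom
```

A positive value means the set is concentrated near the top of the list and a negative value near the bottom; significance would still need a null distribution, for example from permuting sample labels.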

  15. GSEA

  16. K-S Test Finds Irrelevant Sets • Sometimes ranks concentrated in middle • K-S statistic high, but not meaningful for path change • Fix: ad-hoc weighting by actual t-scores emphasizes departures at extreme ends • No theory • Generate null distribution by permutation

  17. Group Z- or T- Scores • PAGE: log fold-changes over all genes follow ‘close to’ Normal distribution • Can estimate σ from overall distribution • T-profiler: under the Null Hypothesis, each gene’s t-score follows a t distribution ‘near’ the N(0,1) distribution • Hence the sum over genes in a specific set G: • PAGE: [formula not preserved in the transcript] • T-profiler: [formula not preserved in the transcript] • If most genes in a pathway are up-regulated then gene set scores will be significantly high
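The reasoning on this slide can be illustrated with a simple group score: if each gene's score is roughly N(0,1) and the genes are independent, the sum over a set of n genes divided by sqrt(n) is again roughly N(0,1). The sketch below shows only that generic idea; the function names are assumptions, and the exact PAGE and T-profiler formulas from the slide may differ (PAGE, for instance, works from fold-changes and an estimated σ).

```python
from math import erfc, sqrt


def group_z(scores, gene_set):
    """Sum of per-gene scores in the set, scaled by sqrt(set size).
    Assumes scores are approximately IID N(0,1) under the null."""
    values = [scores[g] for g in gene_set if g in scores]
    if not values:
        raise ValueError("no genes from the set have a score")
    return sum(values) / sqrt(len(values))


def two_sided_p(z):
    """Two-sided p-value for a standard normal statistic."""
    return erfc(abs(z) / sqrt(2))


if __name__ == "__main__":
    scores = {"a": 1.0, "b": 1.0, "c": 1.0, "d": 1.0, "e": -0.3}
    z = group_z(scores, ["a", "b", "c", "d"])
    print(z, two_sided_p(z))
```

Four genes that each score only 1.0 give a group score of 2.0, which shows how modest but consistent shifts add up. The IID assumption is the weak point, since genes in a pathway are usually correlated.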

  18. Issues and Critiques • Same issues as discrete approach • Null distribution by permuting samples • GSEA finally gets that right in 2005 • Null distribution for Z-test assumes IID • Methods assume all meaningful changes in same direction • Don’t use information about normal co-variation

  19. Why Is Covariation Important? • Most cellular processes are homeostatic: • They find a good functional set-point • Coping with variation in inputs … • … AND in specific regulatory couplings • Most of us have regulatory SNPs that vary expression by a factor of two or more • Other genes are expressed at somewhat different levels to accommodate key processes

  20. Multivariate Approaches • Classical multivariate methods • Multi-dimensional Scaling • Hotelling’s T² • Machine learning approaches • Topological score relative to network • Prediction by machine learning tool • e.g. ‘random forest’

  21. PCA • PC1 lies along the direction of maximal variation; PC2 lies at right angles to it, along the direction of next-highest variation • Figure: three correlated variables

  22. Multi-Dimensional Scaling • Aim: to represent graphically the most information about relationships among samples with multi-dimensional attributes in 2 (or 3) dimensions • Algorithm: • Transform distances into cross-product matrix • Initial PCA onto 2 (or 3) axes • Deform until better representation • Minimize ‘strain’ measure:
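The first step of the algorithm, turning distances into a cross-product matrix, is commonly done by double-centering the squared distances. The sketch below covers only that step under the assumption of a symmetric distance matrix; the function name is hypothetical, and the eigen-decomposition (the PCA step) and any iterative deformation are left out.

```python
def cross_product_matrix(dist):
    """Double-center squared distances: B = -1/2 * J * D^2 * J.
    `dist` is a symmetric n x n list of lists of pairwise distances."""
    n = len(dist)
    squared = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in squared]  # equals column means by symmetry
    grand_mean = sum(row_mean) / n
    return [
        [
            -0.5 * (squared[i][j] - row_mean[i] - row_mean[j] + grand_mean)
            for j in range(n)
        ]
        for i in range(n)
    ]


if __name__ == "__main__":
    # Three points on a line at positions 0, 1 and 3.
    dist = [[0, 1, 3], [1, 0, 2], [3, 2, 0]]
    for row in cross_product_matrix(dist):
        print(row)
```

For Euclidean distances the result equals the matrix of inner products of the mean-centered coordinates, so its leading eigenvectors give the initial 2 (or 3) axes.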

  23. Separating Using MDS Left: distributions of individual variables Right: MDS plot (in this case PCA)

  24. MDS for Pathways • BAD pathway: controlled cell death • Figure legend: Normal, IBC, Other BC • Clear separation between groups • Cancer samples don’t have coherent variation

  25. Hotelling’s T² • Compute distance between sample means using (common) metric of covariation • [formula and its definitions not preserved in the transcript] • Multidimensional analog of t (actually F) statistic
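For reference, the textbook two-sample form of Hotelling's statistic is given below. The slide's own formula did not survive, so the notation is an assumption: group sizes n_1 and n_2, mean vectors \bar{x}_1 and \bar{x}_2 over the p genes in the set, and pooled covariance S built from the within-group covariances S_1 and S_2.

```latex
% Standard two-sample Hotelling T^2 (assumed notation, not from the slide)
T^2 \;=\; \frac{n_1 n_2}{n_1 + n_2}\,
          (\bar{x}_1 - \bar{x}_2)^{\top} S^{-1} (\bar{x}_1 - \bar{x}_2),
\qquad
S \;=\; \frac{(n_1 - 1)\,S_1 + (n_2 - 1)\,S_2}{n_1 + n_2 - 2}

% Relation to the F distribution under the null hypothesis
\frac{n_1 + n_2 - p - 1}{(n_1 + n_2 - 2)\,p}\; T^2 \;\sim\; F_{\,p,\; n_1 + n_2 - p - 1}
```

The F relation requires n_1 + n_2 - p - 1 > 0, so S becomes unreliable or singular when the set has many genes relative to the number of samples.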

  26. Principles of Kong et al Method • Normal covariation generally acts to preserve homeostasis • The transcription of genes that participate in many processes will be changed • The joint changes in genes will be most distinctive for those genes active in pathways that are working differently

  27. Issues • Not robust to outliers • In practice this may not matter much (?) • Assumes same covariance in each sample • Small samples -> unreliable S estimates • Loss of power • Robust / Regularized Methods improve sensitivity by up to a factor of 10! • Yates & Reimers (in prep)

  28. Overall Assessment • Gene sets are somewhat arbitrary • Most ‘modules’ overlap extensively with others • Many ‘modules’ act by protein modification rather than gene expression • Current methods represent a first attempt to bring biological information to bear on the significance problem
