Statistical Machine Learning and Computational Biology

## Statistical Machine Learning and Computational Biology


**Statistical Machine Learning and Computational Biology**
• Michael I. Jordan
• University of California, Berkeley
• November 5, 2007

**Statistical Modeling in Biology**
• Motivated by underlying stochastic phenomena
  • thermodynamics
  • recombination
  • mutation
  • environment
• Motivated by our ignorance
  • evolution of molecular function
  • protein folding
  • molecular concentrations
  • incomplete fossil record
• Motivated by the need to fuse disparate sources of data

**Outline**
• Graphical models
  • phylogenomics
• Nonparametric Bayesian models
  • protein backbone modeling
  • multi-population haplotypes
• Sparse regression
  • protein folding

**Probabilistic Graphical Models**
[Figure: directed graph on nodes $X_1, \dots, X_6$, each annotated with its local conditional: $p(x_1)$, $p(x_2 \mid x_1)$, $p(x_3 \mid x_2)$, $p(x_4 \mid x_1)$, $p(x_5 \mid x_4)$, $p(x_6 \mid x_2, x_5)$]
• Given a graph $G = (V, E)$, where each node $v \in V$ is associated with a random variable $X_v$
• The joint distribution on $(X_1, X_2, \dots, X_N)$ factors according to the “parent-of” relation defined by the edges $E$:
• $p(x_1, x_2, x_3, x_4, x_5, x_6) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_2)\, p(x_4 \mid x_1)\, p(x_5 \mid x_4)\, p(x_6 \mid x_2, x_5)$

**Inference**
• Conditioning
• Marginalization
• Posterior probabilities

**Inference Algorithms**
• Exact algorithms
  • sum-product
  • junction tree
• Sampling algorithms
  • Metropolis-Hastings
  • Gibbs sampling
• Variational algorithms
  • mean-field
  • Bethe, Kikuchi
  • convex relaxations

**Hidden Markov Models**
• Widely used in computational biology to parse strings of various kinds (nucleotides, markers, amino acids)
• Sum-product algorithm yields

**Phylogenies**
• The shaded nodes represent the observed nucleotides at a given site for a set of organisms
• Site independence model (note the plate)
• The unshaded nodes represent putative ancestral nucleotides
• Computing the likelihood involves summing over the unshaded nodes

**Hidden Markov Phylogeny**
• This yields a gene finder that exploits evolutionary constraints
• Evolutionary rate is state-dependent
• (edges from state to nodes in phylogeny are omitted for simplicity)
• Based on sequence data from 12-15
primate species, we obtain a nucleotide sensitivity of 100%, with a specificity of 89%
• GENSCAN yields a sensitivity of 45%, with a specificity of 34%

**Annotating new genomes**
Species: Aspergillus nidulans (fungal organism)
>Q8X1T6 (hypothetical protein)
MCPPNTPYQSQWHAFLHSLPKCEHHVHLEGCLEPPLIFSMARKNNVSLPSPSSNPAYTSV
ETLSKRYGHFSSLDDFLSFYFIGMTVLKTQSDFAELAWTYFKRAHAEGVHHTEVFFDPQV
HMERGLEYRVIVDGYVDGCKRAEKELGISTRLIMCFLKHLPLESAQRLYDTALNEGDLGL
DGRNPVIHGLGASSSEVGPPKDLFRPIYLGAKEKSINLTAHAGEEGDASYIAAALDMGAT
RIDHGIRLGEDPELMERVAREEVLLTVCPVSNLQLKCVKSVAEVPIRKFLDAGVRFSINSDDPAYFGAYILECYCAVQEAFNLSVADWRLIAENGVKGSWIGEERKNELLWRIDECVKRF
• What molecular function does protein Q8X1T6 have?
• Images courtesy of Broad Institute, MIT

**Annotation Transfer**
BLAST Search: Q8X1T6 (Aspergillus nidulans)

| Species Name | Molecular Function | Score | E-value |
|---|---|---|---|
| Schizosaccharomyces pombe | adenosine deaminase | 390 | e-107 |
| Gibberella zeae | hypothetical protein FG01567.1 | 345 | 7e-94 |
| Saccharomyces cerevisiae | adenine deaminase | 308 | 1e-82 |
| Wolinella succinogenes | putative adenosine deaminase | 268 | 1e-70 |
| Rhodospirillum rubrum | adenosine deaminase | 266 | 6e-70 |
| Azotobacter vinelandii | adenosine deaminase | 260 | 4e-68 |
| Streptomyces coelicolor | probable adenosine deaminase | 254 | 2e-68 |
| Caulobacter crescentus CB1 | adenosine deaminase | 253 | 5e-66 |
| Streptomyces avermitilis | putative adenosine deaminase | 251 | 2e-65 |
| Ralstonia solanacearum | adenosine deaminase | 251 | 2e-65 |
| environmental sequence | unknown | 246 | 5e-64 |
| Pseudomonas aeruginosa | probable adenosine deaminase | 245 | 1e-63 |
| Pseudomonas aeruginosa | adenosine deaminase | 245 | 1e-63 |
| environmental sequence | unknown | 244 | 3e-63 |
| Pseudomonas fluorescens | adenosine deaminase | 243 | 7e-63 |
| Pseudomonas putida KT2440 | adenosine deaminase | 243 | 7e-63 |

**Methodology to System: SIFTER**
[Figure: SIFTER pipeline diagram with labels: set of homologous proteins (Pfam), gene tree, species tree, Gene Ontology, SIFTER; tree leaves labeled adenine or adenosine]

**Functional diversity problem**
• 1887 Pfam-A families with more than two experimentally characterized functions
[Figure: breakdown of families by number of functions; legend: 2-5, 6-10, 11-20, 21-50, and >51 different functions]

**Available methods for comparison**
• Sequence similarity methods
  • BLAST [Altschul 1990]: sequence similarity search, transfer annotation from sequence with most significant similarity
    • Runs against the largest curated protein database in the world
  • GOtcha [Martin 2004]: BLAST search on seven genomes with GO functional annotations
    • GOtcha runs use all available annotations
    • GOtcha-exp runs use only available experimental annotations
• Sequence similarity plus bootstrap orthology
  • Orthostrapper [Storm 2002]: transfer annotation when query protein is in a statistically supported orthologous cluster with an annotated protein

**AMP/adenosine deaminase**
• 251 member proteins in Pfam v.
18.0
• 13 proteins with experimental evidence from GOA
• 20 proteins with experimental annotations from manual literature search
• 129 proteins with electronic annotations from GOA
• Molecular function: remove amine group from base of substrate
• Alignment from Pfam family seed alignment
• Phylogeny built with PAUP* parsimony, BLOSUM50 matrix
[Image: mouse adenosine deaminase, courtesy of PDB]

**AMP/adenosine deaminase**
[Figure: SIFTER errors]
• SIFTER, leave-one-out cross-validation: 93.9% accuracy (31 of 33)
• BLAST: 66.7% accuracy (22 of 33)

**AMP/adenosine deaminase**
• Note: x-axis is on log scale
• Multifunction families: can choose numerical cutoff for posterior probability prediction using this type of plot

**Sulfotransferases: ROC curve**
• SIFTER (no truncation): 70.0% accuracy (21 of 30)
• BLAST: 50.0% accuracy (15 of 30)
• Note: x-axis is on log scale

**Nudix Protein Family**
• 3703 proteins in the family
• 97 proteins with molecular functions characterized
• 66 different candidate molecular functions

**Nudix: SIFTER vs BLAST**
• SIFTER truncation level 1: 47.4% accuracy (46 of 97)
• BLAST: 34.0% accuracy (33 of 97); 23.3% of terms at all in search results

**Trade specificity for accuracy**
• Leave-one-out cross-validation, truncation at 1: 47.4% accuracy (66 candidate functions)
• Leave-one-out cross-validation, truncation at 1,2: 78.4% accuracy (15 candidate functions)

**Fungal genomes**
[Figure labels: Euascomycota, Hemiascomycota, Archeascomycota, Basidiomycota, Zygomycota]
• Work with Jason Stajich; images courtesy of Broad Institute

**Fungal Genomes Methods**
• Gene finding in all 46 genomes
• hmmsearch for all 427,324 genes
• Aligned hits with hmmalign to 2,883 Pfam v. 20 families
• Built trees using PAUP* maximum parsimony for 2,883 Pfam v.
20 families; reconciled with Forester
• BLASTed each protein against Swiss-Prot/TrEMBL for exact match; used ID to search for GOA annotations
• Ran SIFTER with (a) experimental annotations only and (b) experimental and electronic annotations

**Clustering**
• There are many, many methodologies for clustering
• Heuristic methods
  • hierarchical clustering
  • M-estimation
  • K-means
  • spectral clustering
• Model-based methods
  • finite mixture models
  • Dirichlet process mixture models

**Nonparametric Bayesian Clustering**
• Dirichlet process mixture models are a nonparametric Bayesian approach to clustering
• They have the major advantage that we don’t have to assume that we know the number of clusters a priori

**Chinese Restaurant Process (CRP)**
• Customers sit down in a Chinese restaurant with an infinite number of tables
  • first customer sits at the first table
  • the $m$-th subsequent customer sits at a table drawn from the following distribution: $P(\text{previously occupied table } i) \propto n_i$ and $P(\text{next unoccupied table}) \propto \alpha$
  • where $n_i$ is the number of occupants of table $i$ and $\alpha > 0$ is a concentration parameter

**The CRP and Mixture Models**
• The customers around a table form a cluster
  • associate a mixture component with each table
  • the first customer at a table chooses the component's parameter $\theta$ from the prior
  • e.g., for Gaussian mixtures, choose the parameters of the Gaussian
• It turns out that the (marginal) distribution that this induces on the $\theta$'s is exchangeable

**Dirichlet Process**
• Exchangeability implies an underlying stochastic process; that process is known as a Dirichlet process

**Dirichlet Process Mixture Models**
• Given observations $x_1, \dots, x_n$, we model each $x_i$ with a latent factor $\theta_i$: $\theta_i \mid G \sim G$, $x_i \mid \theta_i \sim p(\cdot \mid \theta_i)$
• We put a Dirichlet process prior on $G$: $G \sim \mathrm{DP}(\alpha, G_0)$, with concentration parameter $\alpha$ and base measure $G_0$

**Connection to the Chinese Restaurant**
• The marginal distribution on the $\theta$'s obtained by marginalizing out the Dirichlet process is the Chinese restaurant process
• Let’s now consider how to build on these ideas and solve the multiple clustering problem

**Multiple Clustering Problems**
• In many statistical estimation problems, we have not one data analysis problem, but rather groups of related problems
• Naive
approaches either treat the problems separately, lump them together, or merge them in some ad hoc way; in statistics we have a better sense of how to proceed:
  • shrinkage
  • empirical Bayes
  • hierarchical Bayes
• Does this multiple group problem arise in clustering?
  • I’ll argue “yes!”
• If so, how do we “shrink” in clustering?

**Multiple Data Analysis Problems**
• Consider a set of data which is subdivided into groups, and where each group $i$ is characterized by a Gaussian distribution with unknown mean $\theta_i$
• Maximum likelihood estimates of the group means are obtained independently
• This often isn’t what we want (on theoretical and practical grounds)

**Hierarchical Bayesian Models**
• Multiple Gaussian distributions linked by a shared hyperparameter
• Yields shrinkage estimators for the group means

**Protein Backbone Modeling**
• An important contribution to the energy of a protein structure is the set of angles linking neighboring amino acids
• For each amino acid, it turns out that two angles suffice; traditionally called φ and ψ
• A plot of φ and ψ angles across some ensemble of amino acids is called a Ramachandran plot

**A Ramachandran Plot**
• This can be (usefully) approached as a mixture modeling problem
• Doing so is much better than the “state-of-the-art,” in which the plot is binned into a three-by-three grid

**Ramachandran Plots**
• But that plot is an overlay of 400 different plots, one for each combination of 20 amino acids on the left and 20 amino acids on the right
• Shouldn’t we be treating this as a multiple clustering problem?

**Haplotype Modeling**
• A haplotype is the pattern of alleles along a single chromosome
• Data comes in the form of genotypes, which lose the information as to which allele is associated with which member of a pair of homologous chromosomes
• Need to restore haplotypes from genotypes
• A genotype is well modeled as a mixture model, where a mixture component is a pair of haplotypes (the real difficulty is that we don’t know how many mixture components there are)

**Multiple Population Haplotype Modeling**
• When we have multiple populations (e.g., ethnic groups) we have multiple mixture models
• How should we analyze these data (which are now available, e.g., from the HapMap project)?
• Analyze them separately? Lump them together?

**Hidden Markov Models**
• An HMM is a discrete state space model
• The discrete state can be viewed as a cluster indicator
• We thus have a set of clustering problems, one for each value of the previous state (i.e., for each row of the transition matrix)

**Solving the Multiple Clustering Problem**
• It’s natural to take a hierarchical Bayesian approach
• It’s natural to take a nonparametric Bayesian approach in which the number of clusters is not known a priori
• How do we do this?

**Hierarchical Bayesian Models**
• Multiple Gaussian distributions linked by a shared hyperparameter
• Yields shrinkage estimators for the group means

**Hierarchical DP Mixture Model?**
• Let us try to model each group of data with a Dirichlet process mixture model
  • let the groups share an underlying hyperparameter
• But each group is generated independently
  • different groups cannot share the same components if the base measure is continuous
  • spikes do not match up
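
The last point, that independently generated groups cannot share components, can be checked numerically. The following is a minimal sketch, not taken from the slides. It assumes the standard truncated stick-breaking construction of a Dirichlet process draw, and the names `stick_breaking_dp`, `gaussian_base`, `three_point_base`, and `shared_atoms` are hypothetical helpers introduced only for illustration. Each group gets its own draw from a DP with the same base measure. With a continuous (Gaussian) base, the atoms ("spikes") of different groups should never coincide. A three-point discrete base is included purely for contrast.

```python
import random


def stick_breaking_dp(alpha, base_sampler, truncation, rng):
    """Truncated stick-breaking draw from DP(alpha, G0).

    Returns (atoms, weights): atoms are i.i.d. draws from the base measure,
    weights are the lengths of the pieces broken off a unit-length stick.
    """
    atoms, weights = [], []
    remaining = 1.0  # length of stick not yet assigned to any atom
    for _ in range(truncation):
        fraction = rng.betavariate(1.0, alpha)  # Beta(1, alpha) break point
        weights.append(remaining * fraction)
        atoms.append(base_sampler(rng))
        remaining *= 1.0 - fraction
    return atoms, weights


def gaussian_base(rng):
    """Continuous base measure (assumed here): standard normal."""
    return rng.gauss(0.0, 1.0)


def three_point_base(rng):
    """Discrete base measure, used only as a contrast case."""
    return rng.choice([-2.0, 0.0, 2.0])


def shared_atoms(groups):
    """Atoms that appear in every group's random measure."""
    common = set(groups[0][0])
    for atoms, _weights in groups[1:]:
        common &= set(atoms)
    return common


rng = random.Random(0)
# Three groups, each with its own independent DP draw.
continuous_groups = [stick_breaking_dp(1.0, gaussian_base, 50, rng) for _ in range(3)]
discrete_groups = [stick_breaking_dp(1.0, three_point_base, 50, rng) for _ in range(3)]

print("shared atoms, continuous base:", len(shared_atoms(continuous_groups)))
print("shared atoms, discrete base:", sorted(shared_atoms(discrete_groups)))
```

Under these assumptions the continuous case yields an empty intersection, since independent draws from a continuous distribution coincide with probability zero, while the discrete case shares atoms across groups. The truncation level of 50 and the concentration parameter 1.0 are arbitrary choices for the illustration.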