# Statistical Machine Learning and Computational Biology

##### Presentation Transcript

1. Statistical Machine Learning and Computational Biology • Michael I. Jordan • University of California, Berkeley • November 5, 2007

2. Statistical Modeling in Biology • Motivated by underlying stochastic phenomena • thermodynamics • recombination • mutation • environment • Motivated by our ignorance • evolution of molecular function • protein folding • molecular concentrations • incomplete fossil record • Motivated by the need to fuse disparate sources of data

3. Outline • Graphical models • phylogenomics • Nonparametric Bayesian models • protein backbone modeling • multi-population haplotypes • Sparse regression • protein folding

4. Part 1: Graphical Models

5. Probabilistic Graphical Models • Given a graph G = (V, E), where each node v ∈ V is associated with a random variable Xv • The joint distribution on (X1, X2, …, XN) factors according to the "parent-of" relation defined by the edges E: • p(x1, x2, x3, x4, x5, x6) = p(x1) p(x2 | x1) p(x3 | x2) p(x4 | x1) p(x5 | x4) p(x6 | x2, x5) • (Figure: a six-node directed graph with edges X1→X2, X2→X3, X1→X4, X4→X5, and X2, X5→X6, each node labeled with its conditional probability)
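As a hedged illustration of this factorization, here is a minimal Python sketch: the conditional probability tables are made-up values, not from the talk, but the joint is assembled exactly according to the parent-of relation on the slide.

```python
import itertools

# Hypothetical CPTs for six binary variables; values are illustrative only.
p1 = lambda x1: [0.6, 0.4][x1]                        # p(x1)
p2 = lambda x2, x1: [[0.7, 0.3], [0.2, 0.8]][x1][x2]  # p(x2 | x1)
p3 = lambda x3, x2: [[0.5, 0.5], [0.1, 0.9]][x2][x3]  # p(x3 | x2)
p4 = lambda x4, x1: [[0.8, 0.2], [0.4, 0.6]][x1][x4]  # p(x4 | x1)
p5 = lambda x5, x4: [[0.9, 0.1], [0.3, 0.7]][x4][x5]  # p(x5 | x4)
p6 = lambda x6, x2, x5: [[[0.6, 0.4], [0.2, 0.8]],
                         [[0.5, 0.5], [0.1, 0.9]]][x2][x5][x6]  # p(x6 | x2, x5)

def joint(x1, x2, x3, x4, x5, x6):
    """The factorization read directly off the graph's parent-of relation."""
    return (p1(x1) * p2(x2, x1) * p3(x3, x2) *
            p4(x4, x1) * p5(x5, x4) * p6(x6, x2, x5))

# Sanity check: the local factors define a proper joint distribution.
total = sum(joint(*xs) for xs in itertools.product([0, 1], repeat=6))
assert abs(total - 1.0) < 1e-9
```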

6. Inference • Conditioning • Marginalization • Posterior probabilities

7. Inference Algorithms • Exact algorithms • sum-product • junction tree • Sampling algorithms • Metropolis-Hastings • Gibbs sampling • Variational algorithms • mean-field • Bethe, Kikuchi • convex relaxations

8. Hidden Markov Models • Widely used in computational biology to parse strings of various kinds (nucleotides, markers, amino acids) • Sum-product algorithm yields the posterior marginals over the hidden states (the classical forward-backward algorithm)
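On an HMM chain, sum-product specializes to the forward-backward recursions. Below is a minimal NumPy sketch; the two-state parameters pi, A, B are hypothetical, not from the talk.

```python
import numpy as np

def forward_backward(pi, A, B, obs):
    """Sum-product on an HMM chain: posterior marginals p(z_t | x_1..x_T).

    pi:  (K,) initial state distribution
    A:   (K, K) transitions, A[i, j] = p(z_t = j | z_{t-1} = i)
    B:   (K, M) emissions,   B[k, m] = p(x_t = m | z_t = k)
    obs: sequence of observed symbol indices
    """
    T, K = len(obs), len(pi)
    alpha = np.zeros((T, K))             # rescaled forward messages
    beta = np.ones((T, K))               # rescaled backward messages
    c = np.zeros(T)                      # scaling constants

    alpha[0] = pi * B[:, obs[0]]
    c[0] = alpha[0].sum(); alpha[0] /= c[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        c[t] = alpha[t].sum(); alpha[t] /= c[t]
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[:, obs[t + 1]] * beta[t + 1])) / c[t + 1]

    gamma = alpha * beta                 # posterior marginals; rows sum to 1
    return gamma, np.log(c).sum()        # marginals and log-likelihood

# Toy two-state example with made-up parameters.
pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.7, 0.3], [0.1, 0.9]])
gamma, loglik = forward_backward(pi, A, B, [0, 0, 1, 1, 1])
```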

9. Hidden Markov Model Variations

10. Phylogenies • The shaded nodes represent the observed nucleotides at a given site for a set of organisms • Site independence model (note the plate) • The unshaded nodes represent putative ancestral nucleotides • Computing the likelihood involves summing over the unshaded nodes
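Summing over the unshaded ancestral nodes is Felsenstein's pruning algorithm. The sketch below computes a single-site likelihood on a toy three-species tree; the tree, the substitution matrix P, and the use of one shared P for every branch are simplifying assumptions for illustration.

```python
import numpy as np

NUC = {"A": 0, "C": 1, "G": 2, "T": 3}

def site_likelihood(node, P, leaves):
    """Return L[s] = p(observed leaves below `node` | node has state s).

    node:   a leaf name (string) or a tuple (left_child, right_child)
    P:      4x4 substitution matrix, P[s, t] = p(child = t | parent = s)
    leaves: dict mapping leaf name -> observed nucleotide
    """
    if isinstance(node, str):                 # observed (shaded) leaf
        L = np.zeros(4)
        L[NUC[leaves[node]]] = 1.0
        return L
    left, right = node                        # unobserved ancestral node
    Ll = site_likelihood(left, P, leaves)
    Lr = site_likelihood(right, P, leaves)
    return (P @ Ll) * (P @ Lr)                # sum over each child's state

# Toy example: ((human, chimp), mouse) with a Jukes-Cantor-like matrix.
P = np.full((4, 4), 0.1) + 0.6 * np.eye(4)    # rows sum to 1
tree = (("human", "chimp"), "mouse")
leaves = {"human": "A", "chimp": "A", "mouse": "G"}
root_prior = np.full(4, 0.25)
lik = root_prior @ site_likelihood(tree, P, leaves)
```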

11. Hidden Markov Phylogeny • This yields a gene finder that exploits evolutionary constraints • Evolutionary rate is state-dependent • (edges from state to nodes in phylogeny are omitted for simplicity) • Based on sequence data from 12-15 primate species, we obtain a nucleotide sensitivity of 100%, with a specificity of 89% • GENSCAN yields a sensitivity of 45%, with a specificity of 34%

12. Annotating new genomes Species: Aspergillus nidulans (Fungal organism) >Q8X1T6 (hypothetical protein) MCPPNTPYQSQWHAFLHSLPKCEHHVHLEGCLEPPLIFSMARKNNVSLPSPSSNPAYTSV ETLSKRYGHFSSLDDFLSFYFIGMTVLKTQSDFAELAWTYFKRAHAEGVHHTEVFFDPQV HMERGLEYRVIVDGYVDGCKRAEKELGISTRLIMCFLKHLPLESAQRLYDTALNEGDLGL DGRNPVIHGLGASSSEVGPPKDLFRPIYLGAKEKSINLTAHAGEEGDASYIAAALDMGAT RIDHGIRLGEDPELMERVAREEVLLTVCPVSNLQLKCVKSVAEVPIRKFLDAGVRFSINSDDPAYFGAYILECYCAVQEAFNLSVADWRLIAENGVKGSWIGEERKNELLWRIDECVKRF What molecular function does protein Q8X1T6 have? Images courtesy of Broad Institute, MIT

15. Functional diversity problem • 1887 Pfam-A families with more than two experimentally characterized functions • (Figure: distribution of families by number of distinct functions, in bins of 2-5, 6-10, 11-20, 21-50, and >51 different functions)

16. Available methods for comparison • Sequence similarity methods • BLAST [Altschul 1990]: sequence similarity search, transfer annotation from the sequence with the most significant similarity • Runs against the largest curated protein database in the world • GOtcha [Martin 2004]: BLAST search on seven genomes with GO functional annotations • GOtcha runs use all available annotations • GOtcha-exp runs use only available experimental annotations • Sequence similarity plus bootstrap orthology • Orthostrapper [Storm 2002]: transfer annotation when the query protein is in a statistically supported orthologous cluster with an annotated protein

17. AMP/adenosine deaminase • 251 member proteins in Pfam v. 18.0 • 13 proteins with experimental evidence in GOA • 20 proteins with experimental annotations from a manual literature search • 129 proteins with electronic annotations from GOA • Molecular function: remove an amine group from the base of the substrate • Alignment from the Pfam family seed alignment • Phylogeny built with PAUP* parsimony, BLOSUM50 matrix • (Image: mouse adenosine deaminase, courtesy of the PDB)

18. AMP/adenosine deaminase: SIFTER errors • Leave-one-out cross-validation: 93.9% accuracy (31 of 33) • BLAST: 66.7% accuracy (22 of 33)

19. AMP/adenosine deaminase • Multifunction families: one can choose a numerical cutoff for posterior probability prediction using this type of plot • Note: x-axis is on log scale

20. Sulfotransferases: ROC curve • SIFTER (no truncation): 70.0% accuracy (21 of 30) • BLAST: 50.0% accuracy (15 of 30) • Note: x-axis is on log scale

21. Nudix Protein Family • 3703 proteins in the family • 97 proteins with molecular functions characterized • 66 different candidate molecular functions

22. Nudix: SIFTER vs BLAST • SIFTER truncation level 1: 47.4% accuracy (46 of 97) • BLAST: 34.0% accuracy (33 of 97); only 23.3% of the terms appear at all in the search results

23. Trade specificity for accuracy • Leave-one-out cross-validation, truncation at 1: 47.4% accuracy (66 candidate functions) • Leave-one-out cross-validation, truncation at 1, 2: 78.4% accuracy (15 candidate functions)

24. Fungal genomes • (Figure: fungal phylogeny spanning Euascomycota, Hemiascomycota, Archeascomycota, Basidiomycota, and Zygomycota) • Work with Jason Stajich; images courtesy of the Broad Institute

25. Fungal Genomes Methods • Gene finding in all 46 genomes • hmmsearch for all 427,324 genes • Aligned hits with hmmalign to 2,883 Pfam v. 20 families • Built trees using PAUP* maximum parsimony for 2,883 Pfam v. 20 families; reconciled with Forester • BLASTed each protein against Swiss-Prot/TrEMBL for exact match; used ID to search for GOA annotations • Ran SIFTER with (a) experimental annotations only and (b) experimental and electronic annotations

26. SIFTER Predictions by Species

27. Part 2: Nonparametric Bayesian Models

28. Clustering • There are many, many methodologies for clustering • Heuristic methods • hierarchical clustering • M estimation • K means • spectral clustering • Model-based methods • finite mixture models • Dirichlet process mixture models

29. Nonparametric Bayesian Clustering • Dirichlet process mixture models are a nonparametric Bayesian approach to clustering • They have the major advantage that we don’t have to assume that we know the number of clusters a priori

30. Chinese Restaurant Process (CRP) • Customers sit down in a Chinese restaurant with an infinite number of tables • the first customer sits at the first table • the nth subsequent customer sits at a table drawn from the following distribution: occupied table k is chosen with probability n_k / (n − 1 + α), and a new table with probability α / (n − 1 + α) • where n_k is the number of occupants of table k
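A minimal sampler for this seating process (the function name and seed handling are ours, for illustration):

```python
import random

def crp_seating(n_customers, alpha, seed=0):
    """Sample table assignments from a Chinese restaurant process."""
    rng = random.Random(seed)
    counts = []                              # n_k: occupants of table k
    assignments = []
    for n in range(1, n_customers + 1):
        # Table k has weight n_k; a new table has weight alpha.
        weights = counts + [alpha]
        r = rng.uniform(0, n - 1 + alpha)    # total weight is (n - 1) + alpha
        k, acc = 0, 0.0
        for k, w in enumerate(weights):
            acc += w
            if r <= acc:
                break
        if k == len(counts):                 # customer starts a new table
            counts.append(0)
        counts[k] += 1
        assignments.append(k)
    return assignments, counts

tables, sizes = crp_seating(100, alpha=1.0)
```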

31. The CRP and Mixture Models • The customers around a table form a cluster • associate a mixture component with each table • the first customer at a table chooses the component's parameter from the prior • e.g., for Gaussian mixtures, choose a mean vector from the base distribution G0 • It turns out that the (marginal) distribution that this induces on the θ's is exchangeable

32. Example: Mixture of Gaussians

33. Dirichlet Process • Exchangeability implies (via de Finetti's theorem) an underlying stochastic process; that process is known as the Dirichlet process • (Figure: a draw from a Dirichlet process, shown as weighted spikes on the interval [0, 1])
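The Dirichlet process also admits an explicit stick-breaking construction, G = Σ_k π_k δ(θ_k) with θ_k ~ G0 and π_k = β_k Π_{j<k} (1 − β_j), β_k ~ Beta(1, α). A truncated sketch, with an illustrative Gaussian base distribution:

```python
import numpy as np

def stick_breaking(alpha, base_sampler, n_atoms=1000, seed=0):
    """Truncated stick-breaking construction of G ~ DP(alpha, G0)."""
    rng = np.random.default_rng(seed)
    betas = rng.beta(1.0, alpha, size=n_atoms)
    # remaining[k] = prod_{j<k} (1 - beta_j): the stick left before break k.
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas)[:-1]])
    weights = betas * remaining              # stick lengths pi_k
    atoms = base_sampler(n_atoms, rng)       # draws theta_k from the base G0
    return weights, atoms

# Example with base distribution G0 = N(0, 1); a draw from DP(1, G0) is a
# discrete distribution over these atoms.
weights, atoms = stick_breaking(1.0, lambda n, rng: rng.normal(0.0, 1.0, n))
assert abs(weights.sum() - 1.0) < 0.05       # truncation loses only tiny mass
```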

34. Dirichlet Process Mixture Models • Given observations x_1, …, x_n, we model each with a latent factor θ_i: x_i | θ_i ~ F(θ_i) • We put a Dirichlet process prior on the distribution G of the latent factors: θ_i | G ~ G, where G ~ DP(α, G0)

35. Connection to the Chinese Restaurant • The marginal distribution on the θ's obtained by marginalizing out the Dirichlet process is the Chinese restaurant process • Let's now consider how to build on these ideas and solve the multiple clustering problem
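That marginalization suggests a direct way to simulate from a DP mixture without ever representing G: seat customers by the CRP and attach a Gaussian mean to each table. A sketch with illustrative hyperparameters (G0 = N(0, 3²) and σ = 0.5 are our choices, not from the talk):

```python
import numpy as np

def dp_mixture_sample(n, alpha, sigma=0.5, seed=0):
    """Generate n observations from a DP mixture of Gaussians via the CRP.

    Each new table draws a mean theta_k from the base G0 = N(0, 3^2);
    customer i sits per the CRP and emits x_i ~ N(theta_{z_i}, sigma^2).
    """
    rng = np.random.default_rng(seed)
    counts, means, data = [], [], []
    for _ in range(n):
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()                 # CRP seating probabilities
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):                 # new cluster: draw from G0
            counts.append(0)
            means.append(rng.normal(0.0, 3.0))
        counts[k] += 1
        data.append(rng.normal(means[k], sigma))
    return np.array(data), counts

x, cluster_sizes = dp_mixture_sample(200, alpha=1.0)
```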

36. Multiple Clustering Problems • In many statistical estimation problems, we have not one data analysis problem, but rather groups of related problems • Naive approaches either treat the problems separately, lump them together, or merge them in some ad hoc way; in statistics we have a better sense of how to proceed: • shrinkage • empirical Bayes • hierarchical Bayes • Does this multiple group problem arise in clustering? • I'll argue "yes!" • If so, how do we "shrink" in clustering?

37. Multiple Data Analysis Problems • Consider a set of data subdivided into groups, where each group is characterized by a Gaussian distribution with unknown mean μ_j • Maximum likelihood estimates of the μ_j are obtained independently • This often isn't what we want (on theoretical and practical grounds)

38. Hierarchical Bayesian Models • Multiple Gaussian distributions linked by a shared hyperparameter • Yields shrinkage estimators for the group means μ_j
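A small numeric sketch of the shrinkage this yields, assuming known variances and a Gaussian hyperprior (all values illustrative): each group's posterior mean is a precision-weighted average of its own MLE and the shared hyperprior mean.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 8 groups with means mu_j ~ N(mu0, tau^2) and
# observations y ~ N(mu_j, sigma^2); hyperparameters treated as known.
sigma2, tau2, mu0 = 1.0, 0.25, 0.0
group_means = rng.normal(mu0, np.sqrt(tau2), size=8)
data = [rng.normal(m, np.sqrt(sigma2), size=5) for m in group_means]

for y in data:
    n = len(y)
    mle = y.mean()                            # independent per-group MLE
    # Precision-weighted average: the estimate shrinks toward mu0.
    w = (n / sigma2) / (n / sigma2 + 1.0 / tau2)
    shrunk = w * mle + (1 - w) * mu0
    print(f"MLE {mle:+.3f}  ->  shrunk {shrunk:+.3f}")
```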

39. Protein Backbone Modeling • An important contribution to the energy of a protein structure is the set of angles linking neighboring amino acids • For each amino acid, it turns out that two angles suffice; they are traditionally called φ and ψ • A plot of φ and ψ angles across some ensemble of amino acids is called a Ramachandran plot

40. A Ramachandran Plot • This can be (usefully) approached as a mixture modeling problem • Doing so is much better than the “state-of-the-art,” in which the plot is binned into a three-by-three grid
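To make "mixture modeling a Ramachandran plot" concrete, here is a sketch that fits a finite Gaussian mixture to synthetic (φ, ψ) pairs with scikit-learn. This is a stand-in, not the talk's model: the talk's approach is Bayesian and nonparametric, and a plain Gaussian mixture also ignores the fact that angles are circular.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic (phi, psi) pairs near rough alpha-helix and beta-sheet basins;
# the basin locations and spreads are illustrative, not fitted values.
rng = np.random.default_rng(0)
alpha_helix = rng.normal([-60, -45], 15, size=(300, 2))
beta_sheet = rng.normal([-120, 130], 20, size=(300, 2))
angles = np.vstack([alpha_helix, beta_sheet])

gmm = GaussianMixture(n_components=2, random_state=0).fit(angles)
labels = gmm.predict(angles)       # cluster assignments for each angle pair
```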

41. Ramachandran Plots • But that plot is an overlay of 400 different plots, one for each combination of 20 amino acids on the left and 20 amino acids on the right • Shouldn’t we be treating this as a multiple clustering problem?

42. Haplotype Modeling • A haplotype is the pattern of alleles along a single chromosome • Data come in the form of genotypes, which lose the information as to which allele is associated with which member of a pair of homologous chromosomes • We need to restore haplotypes from genotypes • A genotype is well modeled as a mixture model, where a mixture component is a pair of haplotypes (the real difficulty is that we don't know how many mixture components there are)

43. Multiple Population Haplotype Modeling • When we have multiple populations (e.g., ethnic groups) we have multiple mixture models • How should we analyze these data (which are now available, e.g., from the HapMap project)? • Analyze them separately? Lump them together?

44. Scenes, Objects, Parts and Features

45. Shared Parts

46. Hidden Markov Models • An HMM is a discrete state space model • The discrete state can be viewed as a cluster indicator • We thus have a set of clustering problems, one for each value of the previous state (i.e., for each row of the transition matrix)

47. Solving the Multiple Clustering Problem • It's natural to take a hierarchical Bayesian approach • It's natural to take a nonparametric Bayesian approach in which the number of clusters is not known a priori • How do we do this?

48. Hierarchical Bayesian Models • Multiple Gaussian distributions linked by a shared hyperparameter • Yields shrinkage estimators for the group means μ_j

49. Hierarchical DP Mixture Model? • Let us try to model each group of data with a Dirichlet process mixture model • let the groups share an underlying hyperparameter • But each group's random measure is then generated independently • different groups cannot share the same components if the base distribution G0 is continuous: the atoms ("spikes") of the independent draws do not match up

50. Hierarchical Dirichlet Process Mixtures