
Exploring Functional Landscapes of Proteins via Manifold Embeddings of the Gene Ontology

Gilad Lerman, Department of Mathematics, University of Minnesota. IMA, UMN, 11/14/07; IPAM, UCLA, 11/29/07.

Presentation Transcript


  1. Exploring Functional Landscapes of Proteins via Manifold Embeddings of the Gene Ontology. Gilad Lerman, Department of Mathematics, University of Minnesota. IMA, UMN, 11/14/07; IPAM, UCLA, 11/29/07

  2. Fundamental Problem in Molecular Evolution • How do we quantify the relationship between structure and function? • More specifically: given two protein domains, how similar are they in terms of function? (That is, form a functional distance for protein domains.)

  3. [Slide background: a long DNA nucleotide sequence] “Nothing in Biology makes sense except in the light of evolution” (Theodosius Dobzhansky, 1900-1975) • Guides the construction of our functional metric • Relevant in interpreting our results

  4. Evolution of This Talk… • Background • Framework to study Structure-Function • Functional distance between protein domains • Function-Structure correlation • Convergent and divergent evolution • What’s next?

  5. Background: Structure (Proteins) • Proteins are assembled spatially out of distinct structural units • These structural units are called protein domains • Protein domains fold independently • [Figure: Transferase (Methyltransferase), 1adm]

  6. Decomposing a Protein into its Domains Fibronectin protein–1fnf

  7. 3-D Structure Comparison DALI • Automated comparison of 3D protein structures by 2D distance matrices • Z-score – structure similarity score Holm L, Sander C., JMB 1993, 233: 123-138

  8. Function: Gene Ontology (GO) • GO goal: a controlled vocabulary of genes and gene products in any organism (since 1998). Reference: M. Ashburner et al. (The Gene Ontology Consortium), “Gene Ontology: tool for the unification of biology”, Nature Genet 25, 2000 • 3 structured vocabularies (species-independent) to describe gene products in terms of: 1) biological processes 2) cellular components 3) molecular functions • GO is friendly (google)

  9. GO Demonstration

  10. Sequence-Structure-Function • Structures: protein domains • Sequences: sequences of amino acids folding into domains • Molecular functions: Gene Ontology (GO) • References: Shakhnovich BE et al. BMC Bioinformatics 2003, 4:34; Shakhnovich BE. PLoS Comput Biol 2005 Jun;1(1):e9.

  11. Similarity Measures • Structure (protein domains): Z-scores (Holm L, Sander C. JMB 1993, 233: 123-138) • Sequences: BLAST (Altschul SF et al. JMB 1990 Oct 5;215(3):403-10) • Phylogenetic information: MI score (Pellegrini M et al. Proc Natl Acad Sci U S A. 1999 Apr 13;96(8):4285-8) • Function scores: ????

  12. Previous Functional “Distances”? 1. Similarity measures of ontologies (individual nodes): Lord PW et al. Bioinformatics 2003, 19(10): 1275-1283 • Assign local fractions $p(n)$ for each node • $p_{ms}(n_1, n_2) = \min\{p(n)\}$ among parents $n$ of $n_1$ and $n_2$ 2. “Distance” between protein domains (subgraphs): Shakhnovich BE. PLoS Comput Biol 2005 Jun;1(1):e9 • $p_{A,i}$ / $p_{B,i}$: percentage of sequences that fold into structure A/B and are annotated as function $i$
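The node-level measure on slide 12 can be illustrated with a small sketch. The toy ontology, the term names, and the annotation counts below are hypothetical. The sketch also assumes that p(n) is the fraction of all annotations attached to n or to any of its descendants, and that a term counts as its own "parent" when shared parents are collected; the slide does not spell out either convention.

```python
# Hypothetical toy ontology: each term maps to its list of parents (a DAG).
PARENTS = {
    "root": [],
    "binding": ["root"],
    "catalysis": ["root"],
    "dna_binding": ["binding"],
    "rna_binding": ["binding"],
}

# Hypothetical number of annotations attached directly to each term.
COUNTS = {"root": 0, "binding": 2, "catalysis": 4, "dna_binding": 3, "rna_binding": 1}


def ancestors(term):
    """Return the term itself together with all of its ancestors."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def term_fractions():
    """p(n): fraction of annotations attached to n or to any descendant of n."""
    total = sum(COUNTS.values())
    usage = {term: 0 for term in PARENTS}
    for term, count in COUNTS.items():
        # An annotation of a term also counts for every ancestor of that term.
        for ancestor in ancestors(term):
            usage[ancestor] += count
    return {term: usage[term] / total for term in PARENTS}


def p_ms(n1, n2, p):
    """Minimum of p(n) over the parents shared by n1 and n2."""
    shared = ancestors(n1) & ancestors(n2)
    return min(p[n] for n in shared)


p = term_fractions()
sibling_score = p_ms("dna_binding", "rna_binding", p)    # shared parents: binding, root
unrelated_score = p_ms("dna_binding", "catalysis", p)    # shared parent: root only
```

A smaller p_ms means the two terms share a more specific parent, so the sibling pair scores lower than the pair that only meets at the root.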

  13. Our Goal: Forming Distances • What’s given? GO graph & subgraphs of protein domains • Questions: • How to form meaningful similarities (between nodes)? • How to form distances from similarities (nodes)? • How to use these to form distances between domains (subgraphs)?

  14. Using Similarities to Create Distances for Nodes. Machine Learning Framework: • Given: points (nodes) $\{x_i\}_{i=1,\dots,N}$ and similarities $K(x_i,x_j)$, such that $K$ is symmetric and positive semidefinite • Distance: $d^2(x_i,x_j) = K(x_i,x_i) + K(x_j,x_j) - 2K(x_i,x_j)$ • Interpretation: if $K(x_i,x_j) = \langle \varphi(x_i), \varphi(x_j) \rangle$, then $d^2(x_i,x_j) = \|\varphi(x_i) - \varphi(x_j)\|^2$ • $\varphi$: embedding from input space to feature space ($\mathbb{R}^N$); $K$: the kernel
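The distance formula on slide 14 translates directly into a few lines of NumPy. This is a minimal sketch: the function name and the example points are hypothetical, and the kernel is built as K = X Xᵀ only so that the result can be compared with ordinary Euclidean distances.

```python
import numpy as np


def kernel_to_sq_distances(K):
    """Squared distances d^2(i, j) = K[i, i] + K[j, j] - 2 * K[i, j]."""
    K = np.asarray(K, dtype=float)
    diag = np.diag(K)
    d2 = diag[:, None] + diag[None, :] - 2.0 * K
    # Round-off can leave tiny negative entries; clip them to zero.
    return np.maximum(d2, 0.0)


# Hypothetical feature vectors, used only to build a valid kernel K = X X^T.
X = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 0.0]])
K = X @ X.T
D2 = kernel_to_sq_distances(K)
```

Because this K is a Gram matrix of explicit vectors, D2 coincides with the squared Euclidean distances between the rows of X, which matches the interpretation given on the slide.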

  15. The mapping φ. It can be obtained by either: 1. Finding the eigenpairs $(u_1, \lambda_1), \dots, (u_N, \lambda_N)$ of $K$ and building φ from them; note that $K(x_i,x_j) = \langle \varphi(x_i), \varphi(x_j) \rangle$ 2. Forming the RKHS induced by $K$
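The slide's formula for φ is not preserved in this transcript. A standard choice that satisfies the stated identity K(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩ is φ(x_i) = (√λ_1 u_1(i), …, √λ_N u_N(i)) with orthonormal eigenvectors u_k. The sketch below assumes that choice; the function name, the tolerance, and the example kernel are hypothetical.

```python
import numpy as np


def eigen_embedding(K, tol=1e-10):
    """Rows of the result are phi(x_i) = (sqrt(lambda_k) * u_k(i))_k."""
    K = np.asarray(K, dtype=float)
    # eigh returns ascending eigenvalues and orthonormal eigenvectors (as columns).
    eigvals, eigvecs = np.linalg.eigh(K)
    # Treat eigenvalues below the tolerance (round-off) as exactly zero.
    eigvals = np.where(eigvals > tol, eigvals, 0.0)
    # Broadcasting scales column k of eigvecs by sqrt(lambda_k).
    return eigvecs * np.sqrt(eigvals)


# Hypothetical symmetric positive definite kernel on three points.
K = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [0.0, 1.0, 2.0]])
Phi = eigen_embedding(K)
gram = Phi @ Phi.T                              # should reproduce K
sq_dist_01 = np.sum((Phi[0] - Phi[1]) ** 2)     # should equal K00 + K11 - 2*K01
```

Keeping only the columns with the largest eigenvalues gives a low-dimensional picture of the points, at the cost of reproducing K only approximately.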

  16. The “manifold embedding” • [Figures illustrating the map φ: one by Todd Wittman (mani), one by Coifman & Lafon] • Remark: we do not use φ, only the kernel K

  17. How to Assign Similarities? • Local/ad hoc similarity • Global similarity: obtained by propagating local similarities (diffusion on graph mimicking evolutionary process)

  18. Forming a Diffusion Kernel • $n_{ij}$: number of domains (subgraphs) shared by nodes i & j • $K_m$ is a diffusion kernel with parameter m • References: Fiedler M. Czech. Math. Journal 1975, 25: 619-633; Chung F. 1997 (book), AMS; Kondor R, Lafferty JD. ICML 2002: 315-322; Belkin M, Niyogi P. Tech Report 2002, U. Chicago; Ham J. et al. ICML 2004: 369-376; Coifman et al. PNAS 2005, 102 (21): 7426

  19. Forming a Diffusion Distance • Formally: • Interpretation: It describes the rate of connectivity between vertices according to paths of length m (Szummer M. Jaakkola T. NIPS 2001, 14) (Ham J. et al. ICML 2004: 369-376) Coifman et al. PNAS 2005, 102 (21): 7426.
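The formulas for the diffusion kernel and the diffusion distance on slides 18 and 19 are not preserved in this transcript, so the following is only a sketch of one common construction, not necessarily the one used in the talk. It assumes a hypothetical binary annotation-by-domain matrix B, takes n = B Bᵀ as the shared-domain counts, normalizes symmetrically by the row sums, and uses the m-th matrix power as K_m. The distance then follows the kernel-to-distance rule of slide 14.

```python
import numpy as np


def diffusion_kernel(B, m):
    """Assumed construction: K_m = (D^{-1/2} n D^{-1/2})^m with n = B B^T."""
    B = np.asarray(B, dtype=float)
    n = B @ B.T                              # n[i, j]: domains shared by annotations i and j
    deg = n.sum(axis=1)                      # row sums (positive if every annotation is used)
    A = n / np.sqrt(np.outer(deg, deg))      # symmetric normalization
    return np.linalg.matrix_power(A, m)


def diffusion_sq_distance(K):
    """d^2(i, j) = K[i, i] + K[j, j] - 2 * K[i, j]."""
    diag = np.diag(K)
    return np.maximum(diag[:, None] + diag[None, :] - 2.0 * K, 0.0)


# Hypothetical data: rows are annotations (nodes), columns are domains.
B = np.array([[1, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 0, 1, 1],
              [0, 0, 0, 1]])
K2 = diffusion_kernel(B, m=2)
D2 = diffusion_sq_distance(K2)
```

In this toy chain, annotations 0 and 1 share a domain while annotations 0 and 3 are linked only through intermediate annotations, so D2[0, 1] comes out smaller than D2[0, 3]. Because n = B Bᵀ is positive semidefinite, so is every power of the normalized matrix, which keeps the distance well defined.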

  20. Another Diffusion Distance • Another kernel… • The corresponding distance to this kernel is the expected time to travel from one vertex to another and then back again Coifman et al. PNAS 2005, 102 (21): 7426 Ham J. et al. ICML 2004: 369-376
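The kernel on slide 20 is not shown in this transcript. The round-trip description matches the commute time of a random walk, for which the usual kernel is the pseudoinverse of the graph Laplacian. The sketch below assumes that kernel; the function name and the three-node path graph are hypothetical.

```python
import numpy as np


def commute_times(W):
    """Expected round-trip times of a random walk on a weighted graph.

    Assumes the kernel is the pseudoinverse of the Laplacian L = D - W.
    """
    W = np.asarray(W, dtype=float)
    deg = W.sum(axis=1)
    laplacian = np.diag(deg) - W
    kernel = np.linalg.pinv(laplacian)
    diag = np.diag(kernel)
    # Kernel-induced squared distance (the effective resistance between nodes).
    resistance = diag[:, None] + diag[None, :] - 2.0 * kernel
    # Scaling by the total degree turns resistance into commute time.
    return deg.sum() * resistance


# Hypothetical path graph 0 - 1 - 2 with unit edge weights.
W = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
C = commute_times(W)
```

For this path graph the walk needs 4 expected steps to go between neighbouring nodes and back, and 8 between the two end nodes, which can be checked by hand from the hitting-time equations.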

  21. “Distances” Between Domains • Given: d(x,y), the diffusion distance between annotations x and y • Compute: d(x,A), the distance between annotation x and domain A, and d(A,B), the “distance” between domains A and B • References: Dubuisson MP, Jain AK. IAPR 1994, 566-568; Memoli F, Sapiro G. Found. Comput. Math. 2005, 313-347.
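The formulas for d(x,A) and d(A,B) are not preserved in this transcript. Since the slide cites Dubuisson and Jain, one plausible reading is their modified Hausdorff distance: d(x,A) is the distance from x to the nearest annotation in A, and d(A,B) is the larger of the two averaged directed distances. The sketch below assumes that reading; the function names and the toy distance matrix are hypothetical.

```python
import numpy as np


def point_to_set(D, x, A):
    """d(x, A): distance from annotation x to the closest annotation in A."""
    return D[x, A].min()


def directed_distance(D, A, B):
    """Average over a in A of d(a, B)."""
    return np.mean([point_to_set(D, a, B) for a in A])


def domain_distance(D, A, B):
    """Modified Hausdorff distance (assumed form): max of the two directed averages."""
    return max(directed_distance(D, A, B), directed_distance(D, B, A))


# Hypothetical node distances: three annotations placed at 0, 1 and 3 on a line.
positions = np.array([0.0, 1.0, 3.0])
D = np.abs(positions[:, None] - positions[None, :])

# Hypothetical domains, each given as a list of annotation indices.
A, B = [0, 1], [1, 2]
d_AB = domain_distance(D, A, B)
```

The modified Hausdorff distance does not satisfy the triangle inequality in general, which may be why the slide puts “distance” in quotation marks.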

  22. Quick Summary • Formed diffusion distance between functional annotation (nodes) • Formed functional distances between protein domains (subgraphs)

  23. What’s Next • We put those distances in the context of the geometric structure • We indicate how those distances can be used to infer evolutionary information

  24. Comparisons

  25. Functional Domain Universe Graph (FDUG) • Connect all pairs of nodes (domains) with functional distance < Fmax • Color the top nine commonly occurring folds (use SCOP) • Identify main functional domains, e.g. B: DNA Binding, C: RNA Binding, D: Exonucleases, E: Transcription Factors
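The thresholding step of slide 25 can be sketched as follows, assuming the functional distances between domains are available as a symmetric matrix. The function name, the matrix values, and the threshold are hypothetical; fold coloring with SCOP is not covered.

```python
import numpy as np


def threshold_graph(F, f_max):
    """Return the edges (i, j), i < j, whose functional distance is below f_max."""
    F = np.asarray(F, dtype=float)
    # Indices of the strict upper triangle, so each unordered pair appears once.
    rows, cols = np.triu_indices_from(F, k=1)
    keep = F[rows, cols] < f_max
    return list(zip(rows[keep].tolist(), cols[keep].tolist()))


# Hypothetical functional distances between three domains.
F = np.array([[0.0, 0.1, 0.9],
              [0.1, 0.0, 0.4],
              [0.9, 0.4, 0.0]])
edges = threshold_graph(F, f_max=0.5)
```

With this threshold, domains 0 and 2 are not joined directly but remain connected through domain 1, which is the kind of path a traversal of the graph would follow.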

  26. Observation • Domains sharing a fold classification form clusters with common functions • Domains with related functions are proximal • Figure labels: B: DNA Binding, C: RNA Binding, D: Exonucleases, E: Transcription Factors, H: Oxidoreductases, I: Dehydrogenases

  27. Traversing the FDUG 1hlv Centromere Binding Protein 1gdt Site Specific Resolvase 2hdd Engrailed Transcription Factor

  28. Divergent Evolution • Biological characteristics with a common evolutionary origin that have diverged over evolutionary time. • Previous example may indicate a case of divergent evolution (common ancestry)

  29. Convergent Evolution • In evolutionary biology: organisms acquiring similar characteristics while evolving in separate and sometimes varying ecosystems • Definition in molecular evolution: two proteins with no apparent homology performing the same function • We may identify such cases by searching for low F-scores (similar function) and low Z-scores (large structural distances) • Example: convergence of tRNA synthases: 1pys and 1a8h, F-score = .001, Z-score < 2 • This example is well-documented: Mosyak L. et al. Nat Struct Biol 1995, 2:537-47; Sugiura I. et al. Nucleic Acids Res 2004, 32, D189-92

  30. Summary • Defined a distance between protein functions (nodes) and functional distance between protein domains (subgraphs) • Shown correlation with structure, sequence and phylogeny • Explored structure-function relation via FDUG (functional domain universe graph) • Indicated examples of divergent and convergent evolution

  31. Some Future Projects • Extension to cellular components and processes and their use in quantitative research of convergent evolution • Infer function from structure (or vice versa) via supervised/semisupervised learning.

  32. Hybrid Linear Modeling Another Direction in Evolution Very Recent Interests • Study of evolution of transcriptional response to osmotic stress • Applying recent tools of knowledge discovery

  33. Thanks • Contact: lerman@umn.edu • Supplementary webpage: http://www.math.umn.edu/~lerman/supp/protein_distance/ • Collaborator: Borya Shakhnovich, O’Shea Lab, Harvard • Support: NSF • Thanks: IPAM (Mark Green) for 2003 proteomics workshop; R.R. Coifman (Yale), S. Lafon (Google), M. Maggioni (Duke); organizers of current workshop

  34. Embedding Annotations on top 2 coordinates

  35. Embedding Protein Domains
