Exploring Functional Landscapes of Proteins via Manifold Embeddings of the Gene Ontology Gilad Lerman Department of Mathematics University of Minnesota IMA, UMN, 11/14/07 IPAM, UCLA, 11/29/07
Fundamental Problem in Molecular Evolution How do we quantify the relationship between structure and function? More specifically: Given two protein domains, how similar are they in terms of function? (i.e., form a functional distance for protein domains)
[Slide background: DNA sequence] “Nothing in Biology makes sense except in the light of evolution” Theodosius Dobzhansky (1900-1975) • Guides the construction of our functional metric • Relevant in interpreting our results
Evolution of This Talk… • Background • Framework to study Structure-Function • Functional distance between protein domains • Function-Structure correlation • Convergent and divergent evolution • What’s next?
Background Structure (Proteins) • Proteins are assembled spatially out of distinct structural units • These structural units are called protein domains • Protein domains fold independently Transferase (Methyltransferase) 1adm
Decomposing a Protein into its Domains Fibronectin protein–1fnf
3-D Structure Comparison DALI • Automated comparison of 3D protein structures by 2D distance matrices • Z-score – structure similarity score Holm L, Sander C., JMB 1993, 233: 123-138
Function Function: Gene Ontology (GO) • GO Goal: controlled vocabulary of genes + products in any organism (since 1998) Gene Ontology: tool for unification of biology M. Ashburner et al. (the gene ontology consortium). Nature Genet 25, 2000 • 3 structured vocabularies (species-independent) to describe gene products in terms of: 1) biological processes 2) cellular components 3) molecular functions • GO is friendly (google )
Sequence-Structure-Function Structures Protein domains Sequences of amino acids folding into domains Molecular Functions Gene Ontology (GO) Shakhnovich BE et al. BMC Bioinformatics. 2003, 4:34 Shakhnovich BE. PLoS Comp. Biol. 2005 Jun;1(1):e9.
Similarity Measures • Structure (protein domains): Z-scores (Holm L, Sander C. JMB 1993, 233: 123-138) • Sequences: BLAST (Altschul SF et al. JMB 1990 Oct 5;215(3):403-10) • Phylogenetic Information: MI score (Pellegrini M et al. Proc Natl Acad Sci U S A. 1999 Apr 13;96(8):4285-8) • Function Scores ????
Previous Functional “Distances”? 1. Similarity measures of ontologies (individual nodes) Lord PW et al., Bioinformatics, 2003, 19(10): 1275-1283. • Assign local fractions p(n) for each node • p_ms(n_1, n_2) = min{p(n)} among the common parents n of n_1 and n_2 2. “Distance” between protein domains (subgraphs) Shakhnovich BE, PLoS Comput Biol, 2005 Jun;1(1):e9. p_{A,i} / p_{B,i} – percentage of sequences that fold into structure A / B and are annotated as function i
Our Goal: Forming Distances • What’s given? GO graph & subgraphs of protein domains • Questions: • How to form meaningful similarities (between nodes)? • How to form distances from similarities (nodes)? • How to use these to form distances between domains (subgraphs)?
Using Similarities to Create Distances for Nodes Machine Learning Framework: • Given: points (nodes) {x_i}, i = 1,…,N, and similarities K(x_i, x_j), where K is symmetric and positive semidefinite • Distance: d^2(x_i, x_j) = K(x_i, x_i) + K(x_j, x_j) - 2K(x_i, x_j) • Interpretation: if K(x_i, x_j) = ‹φ(x_i), φ(x_j)›, then d^2(x_i, x_j) = ||φ(x_i) - φ(x_j)||^2 • φ – embedding from input space to feature space (R^N); K – the kernel
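The kernel-to-distance formula above can be sketched in a few lines. This is a minimal illustration, not the talk's code: the toy kernel is the Gram matrix of three hand-picked 2-D points (an assumption made only so the result can be checked by eye), and `kernel_distance` is a hypothetical helper name.

```python
import math

def kernel_distance(K, i, j):
    """Distance induced by a symmetric positive semidefinite kernel matrix K."""
    d2 = K[i][i] + K[j][j] - 2.0 * K[i][j]
    return math.sqrt(max(d2, 0.0))  # clamp tiny negative values from round-off

# Toy kernel (illustrative only): Gram matrix of the points (0,0), (3,0), (0,4).
points = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]
K = [[px * qx + py * qy for (qx, qy) in points] for (px, py) in points]

# Recovers the Euclidean distance between (3,0) and (0,4).
print(kernel_distance(K, 1, 2))
```

Because this K is a plain inner product, the induced distance coincides with the Euclidean one; for a diffusion kernel the same function would return the corresponding diffusion distance.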
The mapping φ It can be obtained in either of two ways: 1. Find the eigenpairs (u_1, λ_1),…,(u_N, λ_N) of K and build φ from them, so that K(x_i, x_j) = ‹φ(x_i), φ(x_j)› 2. Form the RKHS induced by K
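One standard construction consistent with the identity K(x_i, x_j) = ‹φ(x_i), φ(x_j)› is the spectral embedding sketched below. It assumes orthonormal eigenvectors u_k and nonnegative eigenvalues λ_k, and it may differ in normalization from the formula shown on the original slide.

```latex
K = \sum_{k=1}^{N} \lambda_k\, u_k u_k^{\top}, \qquad \lambda_k \ge 0,
\qquad
\varphi(x_i) = \bigl(\sqrt{\lambda_1}\,u_1(i),\ \dots,\ \sqrt{\lambda_N}\,u_N(i)\bigr) \in \mathbb{R}^N,
\qquad
\langle \varphi(x_i), \varphi(x_j) \rangle
  = \sum_{k=1}^{N} \lambda_k\, u_k(i)\, u_k(j)
  = K(x_i, x_j).
```

Here u_k(i) denotes the i-th entry of the eigenvector u_k.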
The “manifold embedding” →“φ”→ Remark: we do not use “φ”, only the kernel K. Figure by Todd Wittman (mani) →“φ”→ Figure by Coifman & Lafon
How to Assign Similarities? • Local/ad hoc similarity • Global similarity: obtained by propagating local similarities (diffusion on graph mimicking evolutionary process)
Forming a Diffusion Kernel n_ij – number of domains (subgraphs) shared by nodes i & j K_m is a diffusion kernel with parameter m Fiedler M. 1975. Czech. Math. Journal, 25: 619-633 Chung F. 1997 (book). AMS Kondor R, Lafferty JD: ICML 2002: 315-322 Belkin M, Niyogi P. Tech Report 2002 U. Chicago Ham J. et al. ICML 2004: 369-376 Coifman et al. PNAS 2005, 102 (21): 7426
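The kernel formula itself is not preserved in this transcript, so the following is only a sketch of one common choice from the cited literature: take the counts n_ij as edge weights W, normalize symmetrically, and raise the result to the m-th power. The normalization, the function names, and the toy counts are all assumptions rather than the talk's definition.

```python
import math

def matmul(A, B):
    """Multiply two square matrices given as lists of lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def diffusion_kernel(W, m):
    """Return S^m with S = D^{-1/2} W D^{-1/2}, where D holds the row sums of W.

    W[i][j] plays the role of n_ij. Every node is assumed to have a nonzero
    row sum. This normalization is an assumption, not the talk's formula.
    """
    n = len(W)
    deg = [sum(row) for row in W]
    S = [[W[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
         for i in range(n)]
    K = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(m):  # m diffusion steps
        K = matmul(K, S)
    return K

# Hypothetical shared-domain counts for three GO nodes (illustrative only).
W = [[2, 1, 0],
     [1, 3, 1],
     [0, 1, 2]]
K2 = diffusion_kernel(W, 2)
print(K2[0][2])  # nodes 0 and 2 become similar only through node 1
```

With an even m, S^m is positive semidefinite, so it can be plugged into d^2(x_i, x_j) = K(x_i, x_i) + K(x_j, x_j) - 2K(x_i, x_j) to obtain a distance.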
Forming a Diffusion Distance • Formally: • Interpretation: It describes the rate of connectivity between vertices according to paths of length m (Szummer M. Jaakkola T. NIPS 2001, 14) (Ham J. et al. ICML 2004: 369-376) Coifman et al. PNAS 2005, 102 (21): 7426.
Another Diffusion Distance • Another kernel… • The distance corresponding to this kernel is the expected time to travel from one vertex to another and then back again (the commute time) Coifman et al. PNAS 2005, 102 (21): 7426 Ham J. et al. ICML 2004: 369-376
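The round-trip interpretation can be checked directly on a small graph. The sketch below computes expected hitting times of a random walk by solving the usual linear system and sums the two directions. It does not reproduce the kernel from the slide; the helper names and the three-node path graph are assumptions chosen for illustration.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def hitting_time(W, start, target):
    """Expected number of steps for a random walk on W to first reach target."""
    if start == target:
        return 0.0
    others = [v for v in range(len(W)) if v != target]
    deg = [sum(row) for row in W]
    # For v != target: h(v) - sum_u P[v][u] h(u) = 1, with h(target) = 0.
    A = [[(1.0 if v == u else 0.0) - W[v][u] / deg[v] for u in others]
         for v in others]
    h = solve(A, [1.0] * len(others))
    return h[others.index(start)]

def commute_time(W, i, j):
    """Expected time to go from i to j and then back to i."""
    return hitting_time(W, i, j) + hitting_time(W, j, i)

# Path graph 0 - 1 - 2 with unit weights (illustrative only).
W = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
print(commute_time(W, 0, 2))  # 8.0 on this path graph
```

The graph is assumed to be connected; otherwise the linear system is singular and the expected times are infinite.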
“Distances” Between Domains Given: d(x,y) – diffusion distance between annotations x and y Compute: d(x,A) – distance between annotation x and domain A d(A,B) – “distance” between domains A and B Dubuisson MP, Jain AK. IAPR 1994. 566-568. Memoli F, Sapiro G. Found. Comput. Math. 2005. 313-347.
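The exact formulas for d(x,A) and d(A,B) are not preserved here. Since the slide cites Dubuisson and Jain, one plausible reading is their modified Hausdorff distance: d(x,A) is the smallest node distance from x to any annotation in A, and d(A,B) is the larger of the two averaged directed distances. The sketch below assumes that reading; the function names and the toy distance matrix are illustrative.

```python
def point_to_set(d, x, A):
    """d(x, A): distance from annotation x to the nearest annotation in A."""
    return min(d[x][a] for a in A)

def modified_hausdorff(d, A, B):
    """Modified Hausdorff distance between annotation sets A and B."""
    a_to_b = sum(point_to_set(d, a, B) for a in A) / len(A)
    b_to_a = sum(point_to_set(d, b, A) for b in B) / len(B)
    return max(a_to_b, b_to_a)

# Toy node distances: four annotations placed on a line at 0, 1, 2 and 4
# (illustrative only; in the talk d would be a diffusion distance).
positions = [0.0, 1.0, 2.0, 4.0]
d = [[abs(p - q) for q in positions] for p in positions]

domain_A = [0, 1]  # hypothetical annotation sets of two domains
domain_B = [2, 3]
print(modified_hausdorff(d, domain_A, domain_B))
```

Averaging the point-to-set distances makes the measure less sensitive to a single outlying annotation than the classical Hausdorff distance. It is not a true metric in general, which is consistent with the quotation marks around “distance” on the slide.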
Quick Summary • Formed diffusion distance between functional annotation (nodes) • Formed functional distances between protein domains (subgraphs)
What’s Next • We put those distances in context with the geometric structure • We indicate how those distances can infer evolutionary information
Functional Domain Universe Graph • FDUG: • Connect all pairs of nodes (domains) with functional distance < Fmax • Color the top nine most commonly occurring folds (use SCOP) • Identify main functional domains, e.g. B: DNA Binding, C: RNA Binding, D: Exonucleases, E: Transcription Factors
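The thresholding step that defines the FDUG can be sketched as follows. The domain names, the distances, and the value of the cutoff are all made up for illustration; only the rule "connect two domains when their functional distance is below Fmax" comes from the slide.

```python
from itertools import combinations

def build_fdug(domains, dist, f_max):
    """Adjacency sets linking domain pairs with functional distance < f_max."""
    graph = {name: set() for name in domains}
    for a, b in combinations(domains, 2):
        if dist[frozenset((a, b))] < f_max:
            graph[a].add(b)
            graph[b].add(a)
    return graph

# Hypothetical domains and functional distances (illustrative only).
domains = ["domA", "domB", "domC", "domD"]
dist = {
    frozenset(("domA", "domB")): 0.02,
    frozenset(("domA", "domC")): 0.40,
    frozenset(("domA", "domD")): 0.90,
    frozenset(("domB", "domC")): 0.05,
    frozenset(("domB", "domD")): 0.80,
    frozenset(("domC", "domD")): 0.70,
}

fdug = build_fdug(domains, dist, f_max=0.1)
for name in domains:
    print(name, sorted(fdug[name]))
```

In this toy example domA and domC are not directly connected but are linked through domB, which mirrors how one can traverse the graph between functionally related domains.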
Observation • Domains sharing a fold classification form clusters with common functions • Domains with related functions are proximal H: Oxidoreductases, I: Dehydrogenases B: DNA Binding, C: RNA Binding, D: Exonucleases, E: Transcription Factors
Traversing the FDUG 1hlv Centromere Binding Protein 1gdt Site Specific Resolvase 2hdd Engrailed Transcription Factor
Divergent Evolution • Biological characteristics with a common evolutionary origin that have diverged over evolutionary time. • Previous example may indicate a case of divergent evolution (common ancestry)
Convergent Evolution • In evolutionary biology: organisms acquiring similar characteristics while evolving in separate and sometimes varying ecosystems • Definition in molecular evolution: two proteins with no apparent homology performing the same function • We may identify such cases by searching for low F-scores (small functional distance) together with low Z-scores (little structural similarity) • Example: convergence of tRNA synthetases • 1pys and 1a8h, F-score = .001, Z-score < 2 • This example is well-documented • Mosyak L. et al. Nat Struct Biol 1995, 2:537-47 • Sugiura I. et al. Nucleic Acids Res 2004 32, D189-92
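The search for convergent candidates amounts to a filter over domain pairs. The sketch below keeps pairs whose functional distance (F-score) and structural similarity (Z-score) are both low. The cutoff values and the example pairs are hypothetical and are not taken from the talk.

```python
def convergent_candidates(pairs, f_cut, z_cut):
    """Keep pairs that are functionally close (F < f_cut) yet structurally
    dissimilar (Z < z_cut)."""
    return [(a, b) for (a, b, f, z) in pairs if f < f_cut and z < z_cut]

# Hypothetical (domain, domain, F-score, Z-score) records, illustrative only.
pairs = [
    ("dom1", "dom2", 0.001, 1.5),   # same function, unrelated structure
    ("dom1", "dom3", 0.002, 12.0),  # same function, similar structure
    ("dom2", "dom3", 0.800, 1.0),   # unrelated function and structure
]

print(convergent_candidates(pairs, f_cut=0.01, z_cut=2.0))
```

Only the first pair passes both cutoffs. The second is more naturally explained by common ancestry, since its structures are similar.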
Summary • Defined a distance between protein functions (nodes) and functional distance between protein domains (subgraphs) • Shown correlation with structure, sequence and phylogeny • Explored structure-function relation via FDUG (functional domain universe graph) • Indicated examples of divergent and convergent evolution
Some Future Projects • Extension to cellular components and processes and their use in quantitative research of convergent evolution • Infer function from structure (or vice versa) via supervised/semisupervised learning.
Hybrid Linear Modeling Another Direction in Evolution Very Recent Interests • Study of evolution of transcriptional response to osmotic stress • Applying recent tools of knowledge discovery
Thanks Contact: lerman@umn.edu Supplementary webpage: http://www.math.umn.edu/~lerman/supp/protein_distance/ Collaborator: Borya Shakhnovich, O’Shea Lab, Harvard Support: NSF Thanks: • IPAM (Mark Green) for 2003 proteomics workshop • R.R. Coifman (Yale), S. Lafon (Google), M. Maggioni (Duke) • Organizers of current workshop