
Charting the Protein Space Structural and Functional Genomics

Charting the Protein Space: Structural and Functional Genomics. Michal Linial, The Hebrew University, Jerusalem. Structure is more conserved than sequence. Similar structures tend to have similar functions. Extracting structural information from sequence alone is the Holy Grail.

Presentation Transcript


  1. Charting the Protein Space: Structural and Functional Genomics. Michal Linial, The Hebrew University, Jerusalem. Belgium 10/03 Michal Linial

  2. A link between sequence, structure and function. Structure is more conserved than sequence. Similar structures tend to have similar functions. Extracting structural information from sequence alone is the Holy Grail. Sequences are ‘easy’; structures are ‘hard’; functions are yet to be defined. (Diagram: Sequence, Structure, Function.)

  3. The protein space: sequence, structure and function. Sequence space: dense, 1,000,000. Structure space: sparse, 2,000-20,000. Function space: ill-defined, ????? (20,000 by GO).

  4. The protein space: static vs dynamic. Protein sequences: 1,000,000 proteins (static). Protein variants: 10,000,000 proteins (dynamic), arising from exon combinations, post-translational modification, protein-protein interaction… Protein function: ????? There is an intrinsic difficulty in defining function (enzymes, structural proteins, sensors, signaling, catalytic, channels).

  5. The challenges of the ‘proteome’ in the genomic era. New genomes ---> accurate annotation. From sequence ---> predicting structure. From sequence ---> inferring function. Proteins in a cellular context (health): modification, localization, interactions, pathways; disease.

  6. Outline • Structural Genomics • What for - the challenge • How - classification & methodology • Tests - validation scheme • In practice - ProTarget • Functional Genomics • What for - the challenge • How - integration & methodology • Tests - examples • In practice - PANDORA

  7. Motivation: Structural Genomics Initiatives. Goal: cover the entire protein structural space. Modeling methods allow extending structural assignments to an unsolved protein if a solved protein is within ‘modeling distance’ (>30-35% sequence identity) of it. Finding a new fold = adding a new template to the ‘archive’ = allowing (many!) ‘unsolved’ proteins to be modeled.
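
The ‘modeling distance’ criterion can be illustrated with a minimal sketch. It assumes two sequences that have already been aligned (gaps written as `-`) and takes the lower end of the slide's 30-35% range as the cutoff; the function names and the choice to ignore gapped columns are illustrative assumptions, not the procedure used by any particular modeling tool.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns in which neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            continue  # assumption: gapped columns are not counted
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Assumption: use the lower bound of the 30-35% range as the threshold.
MODELING_THRESHOLD = 30.0


def within_modeling_distance(aligned_a: str, aligned_b: str,
                             threshold: float = MODELING_THRESHOLD) -> bool:
    """True if a solved template is close enough to model the unsolved protein."""
    return percent_identity(aligned_a, aligned_b) > threshold
```

For example, `within_modeling_distance("ACDE-G", "ACDFHG")` compares five ungapped columns, four of which match.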

  8. Motivation: Structural Genomics Initiatives. As stated by SG policy (1999): “Maximizing the impact on biology and on biomedical sciences by solving the ‘CORRECT’ pre-selected candidates”. What is the ‘CORRECT’ set of proteins? How should those be selected from all possible unsolved proteins?

  9. Current state: structural database. Number of new structures added each year (from the PDB).

  10. Current state: structural database. The fraction of new folds is constantly decreasing. During the last 5 years only 3-5% of all newly solved structures are new folds by SCOP definition (5-10% by CE).

  11. Current state: the structural space, some numbers. Currently: ~18,000 protein structures, ~45,000 protein domains. Hierarchy in structure, SCOP 1.59 (3/02) vs SCOP 1.61 (11/02): folds 690 to 700 (+10); superfamilies (SF) 1,070 to 1,110 (+40); families (Fam) 1,830 to 1,940 (+110); domains 39,900 to 44,300 (+4,400). (Image: myoglobin.)

  12. Current state: some numbers. Sequence-based: 130,000 SWP, 900K TrEMBL; total >1M. Structure-based (fold, SF, Fam), estimated numbers: 1,000-2,000 folds; 3,000-8,000 superfamilies; 10,000-20,000 families (25-35% sequence identity). (But many more ‘unique’ folds/superfamilies?) Reduction from sequence-based to structure-based: how??

  13. Challenge: from sequence to structure (reduction from sequence-based to structure-based: how??). Problem: most structurally similar pairs share <20% aa identity. Many structurally similar pairs share only a few key aa (5-8%, background level). Most (all) sequence search engines cannot find a ‘significant’ similarity below 35-40% aa identity. So, can we cross the line from the ‘Twilight Zone’ (20-35% aa identity) to the ‘midnight zone’ (<20% aa identity)?

  14. Outline • Structural Genomics • What for - the challenge • How - classification & methodology • Tests - validation scheme • In practice - ProTarget • Functional Genomics • What for - the challenge • How - integration & methodology • Tests - examples • In practice - PANDORA

  15. ProtoClass: a set of automatic classifications of all proteins. Seeking statistically significant regularities (clusters). Reconstructing the ‘geometry’ of the sequence space. Guiding principles: homologous proteins evolved from a common ancestor protein; homology is a transitive relation that can be deduced from statistical similarities.

  16. ProtoClass: global classifications of all proteins. ProtoMap: release May 1997. ProtoNet-A (arithmetic): release July 2002. ProtoNet-G (geometric): release July 2002. ProtoNet-H (harmonic): release July 2002. ProtoNet-A50: July 2003. Proto3D-A50: July 2003. Proto3D + ProtoNet-T: October 2003. ProtoClass systems generate graphs and maps that yield views at any level of granularity.

  17. Pre-computation • SwissProt release 40.28 database (ProtoNet 2.4) • 114,000 SWP proteins • 133,000 SWP + 850,000 TrEMBL sequences (ProtoNet 3.0) • All-against-all similarity scores by gapped BLAST • Using BLOSUM62 and filtering low-complexity regions (also other matrices: BLOSUM50, PAM250…) • BLAST identified >13M relations among the 114K SWP proteins • Every sequence similarity with an E-score of 100 (!!!) or less is collected
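
The collection step can be sketched as follows. This assumes the all-against-all results are available in the 12-column tab-separated BLAST tabular layout (query ID in column 1, subject ID in column 2, E-value in column 11); that input format, the function name and the choice to keep the best E-value per unordered pair are assumptions for illustration, not a description of the ProtoNet pipeline.

```python
def collect_relations(tabular_lines, max_evalue=100.0):
    """Keep the lowest E-value seen for each unordered protein pair.

    tabular_lines: iterable of tab-separated BLAST tabular lines.
    max_evalue:    permissive cutoff (the slide collects E-scores of 100 or less).
    """
    best = {}
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue = float(fields[10])
        if query == subject or evalue > max_evalue:
            continue  # ignore self hits and hits above the cutoff
        pair = frozenset((query, subject))
        if pair not in best or evalue < best[pair]:
            best[pair] = evalue
    return best
```

Keeping even very weak hits (E-values far above the usual significance cutoffs) leaves it to the later clustering stage to decide which relations are meaningful.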

  18. ProtoClass: ProtoNet main features. Pairwise distances (all-against-all BLAST search). Includes all SwissProt proteins (130K) and TrEMBL proteins (850K). Graph based. Unsupervised and automated. Hierarchical. Bottom-up clustering; the clustering algorithm is based on a ‘merging score’.
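
A minimal sketch of bottom-up clustering driven by a merging score is given below. The slides do not spell out the merging rule, so this assumes the score between two clusters is the arithmetic, geometric or harmonic mean of all pairwise E-scores between their members (matching the A/G/H naming of the ProtoNet releases), with lower scores merged first; `agglomerate`, `mean` and the `no_hit` default are hypothetical names and values.

```python
import math
from itertools import combinations


def mean(values, kind="arithmetic"):
    """Arithmetic, geometric or harmonic mean of strictly positive values."""
    values = list(values)
    if kind == "arithmetic":
        return sum(values) / len(values)
    if kind == "geometric":
        return math.exp(sum(math.log(v) for v in values) / len(values))
    if kind == "harmonic":
        return len(values) / sum(1.0 / v for v in values)
    raise ValueError(f"unknown averaging rule: {kind}")


def agglomerate(items, score, kind="arithmetic", no_hit=100.0):
    """Repeatedly merge the two clusters with the lowest merging score.

    score:  dict mapping frozenset({a, b}) to a positive E-score-like value
            (lower means more similar).
    no_hit: value assumed for pairs with no recorded similarity.
    Returns the list of merges as (cluster_1, cluster_2, merging_score).
    """
    clusters = [frozenset([item]) for item in items]
    merges = []
    while len(clusters) > 1:
        def merging_score(pair):
            c1, c2 = pair
            return mean(
                (score.get(frozenset((a, b)), no_hit) for a in c1 for b in c2),
                kind,
            )

        c1, c2 = min(combinations(clusters, 2), key=merging_score)
        merges.append((c1, c2, merging_score((c1, c2))))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return merges
```

The list of merges defines a hierarchy, so cutting it at different merging scores yields views at different levels of granularity. The quadratic search over cluster pairs is only practical for toy inputs.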

  19. ProtoClass: ProtoNet top 20. The 20 largest clusters in ProtoNet at a pre-selected horizontal level (7K). Added hypothetical proteins: 7-15%, 15-20%.

  20. Towards a functional map: roadmap of the Ig superfamily. Edges connect clusters that are neighbors but failed to merge at that level of the graph. Many pairs of proteins with <<20% aa identity. Yona G., Linial N., Linial M. Proteins 37:360-378 (1999).

  21. Goal: seeking missing folds. What is missing: a rational computational procedure for identifying ‘missing/hidden’ folds/SF. Our approach: constructing the protein sequence space as a guideline for the structural fold space; crossing the twilight zone.

  22. Seq-Str map: bridging structure & sequence. Hypothesis: distances in the graph (road-map) are consistent with distances between protein features, including their structure. Practically: unsolved clusters that are ‘remote’ (in the road-map) from an already solved structure have a higher chance of containing new folds or new superfamilies.

  23. Seq-Str map: distance measure from a structural perspective. Create Proto3D (all SWP + all PDB domains; 114K + 36K = 150K). (Figure labels: Clock-Pair, Time, In PDB, Good target?)

  24. Example: globins. Short (~120-160 aa). Oxygen transport in multicellular organisms. Single domain. Spread throughout evolution. Early evolutionary duplications. Sequence similarity <15%. SCOP identified 50 (!) different family members (neuronal, plant…).

  25. Seq-Str map: some biological road maps. SCOP fold: Globin-like. SF: A. Globin-like; B. α-helical ferredoxin. Fam A: 1. Globin (50) 2. 3. Neural globin (1) 4. Fam B: 1. 2. All 850 proteins are globin related. All belong to one SF (SCOP).

  26. Mapping SCOP structure on the sequence-based clusters (ProtoNet vs SCOP). Currently ~2000 fam. A very good correspondence between clusters and SCOP families. Sasson et al. (2003) Nucl. Acids Res. 31. Murzin A. G. et al. (1995) J. Mol. Biol. 247, 536-540.

  27. Seeking new folds. Our approach: structural information is embedded in the roadmap of ProtoClass (e.g., globins). We developed a navigation procedure that measures ‘distances’ among protein clusters in the graph from the point of view of proteins that were already solved (X-ray, NMR).

  28. Computational approach for target selection: adding structures to the map. ProtoNet (at a selected level): ~10,000 clusters; 2,000 clusters with >15 proteins each. SCOP 1.50 (2000): ~10,500 PDB structures, 24,000 domains (redundant). Each structural domain is mapped to its protein (and its cluster). ‘Occupied’ clusters are those with at least one solved structural domain.
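
The mapping of solved domains onto clusters can be sketched with two dictionaries. The data layout (`protein_to_cluster`, `domain_to_protein`) and the function names are assumptions made for illustration; the slides do not describe how the mapping is stored.

```python
def occupied_clusters(protein_to_cluster, domain_to_protein):
    """Clusters that contain at least one protein with a solved structural domain."""
    occupied = set()
    for protein in domain_to_protein.values():
        cluster = protein_to_cluster.get(protein)
        if cluster is not None:  # domains of proteins outside the map are ignored
            occupied.add(cluster)
    return occupied


def occupied_protein_fraction(protein_to_cluster, occupied):
    """Share of all mapped proteins that sit in an occupied cluster."""
    if not protein_to_cluster:
        return 0.0
    covered = sum(1 for c in protein_to_cluster.values() if c in occupied)
    return covered / len(protein_to_cluster)
```

A small number of occupied clusters can still cover a large share of proteins, because solved structures tend to fall in large clusters.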

  29. Mapping ‘structures’ on the protein graph. Databases used: structural, all PDB entries; sequence, ProtoClass (i.e. ProtoMap, ProtoNet). Only ~1,800 clusters are ‘occupied’. They account for ~50% of all proteins in the protein map.

  30. A distance measure in the graph: vacant surrounding volume (VSV). The vacant-surrounding-volume of a cluster is the number of clusters encountered before reaching an occupied cluster. Clusters are associated with a VSV (if at least one structure is in the local map). All clusters are sorted according to their VSV. (Figure example: cluster A, Steps = 3, VSV = 11.)
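
One way to read the VSV definition is as a breadth-first search over the cluster graph. The sketch below assumes an undirected adjacency-list graph and counts the vacant clusters in every layer that lies closer to the start than the nearest occupied cluster, excluding the start cluster itself. Whether the original method counts exactly this way is not stated in the slides, so treat the counting rule and the function name `vsv` as assumptions.

```python
def vsv(graph, start, occupied):
    """Return (steps, vacant_count) for a cluster, or None if no occupied
    cluster is reachable (no structure in the local map).

    graph:    dict mapping each cluster to an iterable of neighboring clusters.
    occupied: set of clusters with at least one solved structure.
    """
    if start in occupied:
        return (0, 0)
    seen = {start}
    frontier = [start]
    steps = 0
    vacant = 0
    while frontier:
        steps += 1
        layer = []
        for node in frontier:
            for neighbor in graph.get(node, ()):
                if neighbor not in seen:
                    seen.add(neighbor)
                    layer.append(neighbor)
        if any(node in occupied for node in layer):
            return (steps, vacant)  # first layer that touches a solved structure
        vacant += len(layer)  # every cluster in this layer is vacant
        frontier = layer
    return None
```

Ranking then amounts to sorting the clusters that have a defined VSV by their vacant count, largest first.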

  31. Prioritized target list. Higher VSV, higher chance for a NEW SUPERFAMILY?

  32. Outline • Structural Genomics • What for - the challenge • How - classification & methodology • Tests - validation scheme • In practice - ProTarget • Functional Genomics • What for - the challenge • How - integration & methodology • Tests - examples • In practice - PANDORA

  33. Testing the predictive power of the VSV navigation method. VSV & NEW SUPERFAMILY? The membranous protein test (all clusters vs membranous): most clusters with membranous proteins have a much higher VSV. This is in accord with the fact that only a very small number of membranous proteins have been solved (50 out of 20,000).

  34. Validation against new data. Base set: SCOP 1.37 (~12,000 records): ~800 families, ~570 superfamilies, ~410 folds. Test set: SCOP 1.50 (~23,800 records): ~1,300 families, ~820 superfamilies, ~550 folds.

  35. Validation against new data. Test the prediction made by the VSV method (on the BASE set) against the actual assignment of new SF in recent data (TEST set). BASE SET: ~570 superfamilies. TEST SET: ~820 superfamilies (250 additional new SF). The Base Set and the Test Set have no overlap.

  36. Testing the predictive power of the VSV navigation method. VSV & NEW SUPERFAMILY? Statistical test: the prediction is based on 13,000 domains (1999); the test is based on 11,000 newly added domains (2001).

  37. VSV according to the sets SCOP 1.37 to 1.50 (chart: VSV bins 3, 4, 5, 6, 7, 8, 10; known vs NEW). Our hypothesis is confirmed: the higher the VSV, the higher the chance that a protein belongs to a new SF.
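
The comparison behind this chart can be sketched as a per-VSV tally. The input layout, a list of `(vsv, is_new_superfamily)` pairs for proteins solved after the base set was frozen, and the function name are assumptions for illustration; no values from the actual study are reproduced here.

```python
from collections import defaultdict


def new_sf_fraction_by_vsv(records):
    """Fraction of newly solved proteins that founded a new superfamily,
    grouped by the VSV predicted from the base set.

    records: iterable of (vsv, is_new_superfamily) pairs.
    """
    totals = defaultdict(int)
    new = defaultdict(int)
    for vsv_value, is_new in records:
        totals[vsv_value] += 1
        if is_new:
            new[vsv_value] += 1
    return {v: new[v] / totals[v] for v in sorted(totals)}
```

If the hypothesis holds, the returned fractions should tend to rise with the VSV value.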

  38. Outline • Structural Genomics • What for - the challenge • How - classification & methodology • Tests - validation scheme • In practice - ProTarget • Functional Genomics • What for - the challenge • How - integration & methodology • Tests - examples • In practice - PANDORA

  39. Back from prediction to the experimentalists. ProTarget: a web site that assigns a ‘SCORE’ to proteins according to their probability of belonging to a new superfamily (or fold). We suggest a ranked list that is ‘BEST’ for SG projects. The user may select any subset.

  40. ProTarget

  41. Development in ProTarget. Dynamic view: proteins that have been solved affect the map and, of course, the VSV ranking. Can I find the group of proteins whose impact, once ‘solved’, is maximal (affecting the ranking of at least X proteins)? Other features: including domain composition in the VSV ranking method (coming).

  42. When a new structure is solved (or about to be solved), a new map is created with new VSVs and prioritization is done automatically. Using the dynamic option, redundancy in solving similar structures is reduced.

  43. Using ProTarget dynamically

  44. To the experimentalist (SG center): Structural Genomics projects. Targets, cloning, expression, solubility, crystallization.

  45. Considerations in solving structures: practical, biological, computational • Quantity - sources • Folding properties • Expression system • Intrinsic stability • Bad history, membranous • Glory, money and fame • … • Novel biological activity? • Selectivity? Specificity? • Ligand / drug binding? • Disease related? • Drug design relevance? • …

  46. Outline • Structural Genomics • What for - the challenge • How - classification & methodology • Tests - validation scheme • In practice - ProTarget • Functional Genomics • What for - the challenge • How - integration & methodology • Tests - examples • In practice - PANDORA

  47. The Subway, Tube, Underground, Metro, U-Bahn. (Figure labels: disease; evolution; genes, regulation; protein.)

  48. Sequence and function relationship, taking one example: enzymes. Well characterized; functionality is defined; conserved; essential, testable; tree-like classification.

  49. Relatively easy ‘function’: ENZYMES. DB: Enzyme, WIT, KEGG etc.
