Proteomics: Analyzing proteins space

Presentation Transcript


  1. Proteomics: Analyzing proteins space

  2. Protein families
     Why proteins?
     • Shift of interest from “Genomics” to “Proteomics”
     Classification of proteins into groups/families: what is it good for?
     • Explosion in biological sequence data => need to organize!
     • Understanding the relations/hierarchy of groups is interesting in itself, e.g. in evolutionary research.
     • For applied research:
       - Annotation of new proteins: predicting their function, structure, cellular localization etc.
       - Looking for new folds

  3. Sequence-based classification
     • By sequence similarity (domains, motifs or complete proteins): Pfam, PROSITE, SMART, InterPro etc.
     • InterPro synthesizes the data from Pfam, PROSITE, PRINTS, ProDom, and SMART. It is considered the “best” domain-based classification available

  4. Other kinds of classification
     • Global classification:
       - SYSTERS, ProtoMap, CluSTr
       - MetaFam synthesizes global classification data
     • By structure similarity: SCOP etc.
     • By function: Albumin, RetNet, TumorGenes etc.

  5. ProtoNet http://www.protonet.cs.huji.ac.il
     • A long-term project at HUJI led by Michal & Nati Linial.
     • Provides automatic global classification of the known proteins.
     • Performs hierarchical clustering on a sequence-based metric space of proteins.
     • Allows “placing” an external protein into the hierarchy.

  6. Why clustering?
     • We want to refine the “similarity” notion, compared to e.g. BLAST
     • Exploit transitivity to improve grouping
     • Can use a low threshold on similarity:
       - uses vast information from low similarities
       - allowable because clustering filters noise

  7. Why hierarchical?
     [Figure panels: Vertical Perspective; Horizontal Perspective]

  8. ProtoNet: Pre-Computation
     • All-against-all gapped BLAST using BLOSUM62
     • SwissProt release 40.28 database (114,033 proteins)
     • BLAST identified ~2×10^7 relations between these proteins with relatively high sequence similarity (E-score of 100 or less):
       - Don’t want to lose information => very permissive threshold!
       - But still far fewer than the ~6.5×10^9 possible pairs, which would be infeasible to handle
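The ~6.5×10^9 figure is consistent with counting every unordered pair among the 114,033 proteins. A minimal arithmetic check, assuming that is how the number was obtained (the constant names are illustrative, not ProtoNet's):

```python
from math import comb

N_PROTEINS = 114_033       # SwissProt release 40.28, as quoted on the slide
BLAST_RELATIONS = 2e7      # approximate number of retained relations, from the slide

# Every unordered pair of distinct proteins: n * (n - 1) / 2
all_pairs = comb(N_PROTEINS, 2)

# Share of all possible pairs that BLAST actually reports at E-score <= 100
fraction_kept = BLAST_RELATIONS / all_pairs

print(f"all possible pairs: {all_pairs:,}")
print(f"fraction retained:  {fraction_kept:.4f}")
```

Under this assumption, only about 0.3% of all possible pairs carry a reported similarity; the rest are the "missing data" that the clustering has to treat separately.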

  9. Clustering Method
     • First, each protein is considered a singleton cluster

  10. Clustering Method
     • Next, we iteratively merge pairs of clusters
     • At each step we merge the ‘most similar’ pair of clusters.

  13. Clustering Method
     • As we progress, the number of singletons drops

  14. Clustering Method
     • The clustering process gradually generates a tree of clusters
     • Stop whenever we like
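The procedure on the Clustering Method slides (start from singletons, repeatedly merge the most similar pair, record the tree, stop at any point) can be sketched as follows. This is a toy illustration, not ProtoNet's implementation: the `distance` function, the `stop_at` parameter and the use of the arithmetic mean as the merging score are assumptions made for the example, and the naive pair search would not scale to 114,033 proteins.

```python
from itertools import combinations

def agglomerate(items, distance, stop_at=1):
    """Bottom-up clustering; returns the remaining clusters and the merge history.

    `distance(a, b)` is a hypothetical pairwise dissimilarity (lower = more similar).
    """
    clusters = [frozenset([x]) for x in items]   # every item starts as a singleton
    history = []                                 # one (left, right, score) entry per merge

    def score(c1, c2):
        # Arithmetic mean over all inter-cluster pairs (one possible merging score)
        pairs = [(a, b) for a in c1 for b in c2]
        return sum(distance(a, b) for a, b in pairs) / len(pairs)

    while len(clusters) > stop_at:
        # Pick the 'most similar' pair, i.e. the one with the lowest score
        left, right = min(combinations(clusters, 2), key=lambda p: score(*p))
        history.append((left, right, score(left, right)))
        clusters = [c for c in clusters if c not in (left, right)]
        clusters.append(left | right)            # the merged cluster is the parent node
    return clusters, history

# Toy data: numbers on a line stand in for proteins
points = [1.0, 1.2, 5.0, 5.1, 20.0]
clusters, history = agglomerate(points, lambda a, b: abs(a - b), stop_at=3)
print(clusters)
```

The `history` list is the tree of clusters: each entry names the two children that were merged, so cutting the process at a different `stop_at` corresponds to looking at a different level of the tree.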

  15. How to merge?
     [Figure labels: m, n]
     • The potential merging score is calculated at each level for each pair of clusters that is a candidate for merging
     • At the bottom level it equals [formula missing]
     • At higher levels it is designed to reflect the similarity of the clusters.
     • It depends on the inter-cluster similarities of pairs of proteins, each from a different cluster.

  16. Potential Merging Score of a pair of clusters
     • Arithmetic Mean
     • Geometric Mean
     • Harmonic Mean
     (Arithmetic Mean ≥ Geometric Mean ≥ Harmonic Mean)
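To make the three candidate scores concrete, here is a small sketch that computes each mean over the inter-cluster pairwise scores of two toy clusters. The score table, the protein labels and the helper names are made up for illustration; the exact formulas used by ProtoNet are not preserved in this transcript.

```python
from math import exp, log

def inter_cluster_scores(cluster_a, cluster_b, pair_score):
    """Scores of all pairs with one member from each cluster."""
    return [pair_score(a, b) for a in cluster_a for b in cluster_b]

def arithmetic_mean(values):
    return sum(values) / len(values)

def geometric_mean(values):
    # Computed in log space; requires strictly positive values
    return exp(sum(log(v) for v in values) / len(values))

def harmonic_mean(values):
    return len(values) / sum(1.0 / v for v in values)

# Hypothetical E-score-like values (lower = more similar)
toy_scores = {
    frozenset(("a1", "b1")): 1e-30,
    frozenset(("a1", "b2")): 1e-10,
    frozenset(("a2", "b1")): 1.0,
    frozenset(("a2", "b2")): 1e-3,
}

def lookup(a, b):
    return toy_scores[frozenset((a, b))]

values = inter_cluster_scores(["a1", "a2"], ["b1", "b2"], lookup)
am, gm, hm = arithmetic_mean(values), geometric_mean(values), harmonic_mean(values)
print(f"arithmetic={am:.3g}  geometric={gm:.3g}  harmonic={hm:.3g}")
```

On values like these, the arithmetic mean is dominated by the weakest similarity (the largest value) and the harmonic mean by the strongest one (the smallest value), with the geometric mean in between. This is why the choice of mean changes which pairs of clusters look 'most similar'.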

  17. Missing Data Treatment
     • For a very-low-similarity pair (outside the ~2×10^7 BLAST relations), its length is defined as [value missing]
     • In practice, the merging process should stop when the weight of the “infinite” lengths in the calculation of the score between new clusters becomes very large (the signal is lost)
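One way to picture the missing-data rule: pairs without a BLAST relation receive a fixed default value, and each merging score is reported together with the share of pairs that had to be filled in. This is only a sketch; `DEFAULT_SCORE` and `MAX_MISSING_FRACTION` are hypothetical placeholders, since the actual default value and stopping rule are not preserved in the transcript.

```python
DEFAULT_SCORE = 100.0         # hypothetical stand-in for the value lost from the slide
MAX_MISSING_FRACTION = 0.9    # hypothetical cut-off for "the signal is lost"

def merge_score(cluster_a, cluster_b, known_scores):
    """Arithmetic-mean score plus the fraction of pairs filled with the default."""
    values, missing = [], 0
    for a in cluster_a:
        for b in cluster_b:
            s = known_scores.get(frozenset((a, b)))
            if s is None:              # pair absent from the sparse BLAST relations
                missing += 1
                s = DEFAULT_SCORE
            values.append(s)
    return sum(values) / len(values), missing / len(values)

# Only one of the four inter-cluster pairs has a known score
known = {frozenset(("p1", "p3")): 1e-5}
score, missing_fraction = merge_score(["p1", "p2"], ["p3", "p4"], known)
still_informative = missing_fraction <= MAX_MISSING_FRACTION
print(score, missing_fraction, still_informative)
```

Tracking `missing_fraction` alongside the score gives a simple handle on when merges near the top of the tree are driven mostly by default values rather than by measured similarities.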

  18. Why clustering at all?
     • We want to extend the range of “similarity”, compared to e.g. BLAST
     • Exploit transitivity to improve grouping
     • Can use a low threshold on similarity:
       - uses vast information from low similarities
       - allowable because clustering filters noise
     Results: ProtoNet top 20
     • The 20 largest clusters in the ProtoNet (Arithmetic) tree at a preselected level

  19. Problem of result assessment: what is a “good” cluster?
     • Contains all proteins in the family, and no proteins outside the family
     • But what is a family? Does any keyword define a family?
     • Stable as the merging events occur (long life-time)?

  20. Problem of result assessment: what is a “good” tree?
     • Should we trust the resulting forest?
     • Which clustering technique is better? Combined?
     • Bootstrap?
     • Do the clusters correspond to meaningful families of proteins?
       - Validation against InterPro, SCOP etc.
       - But we do not want to merely reconstruct them automatically!
     • What is the right level/cut at which to look at the forest?

  21. InterPro Validation
     • InterPro annotation allows systematic validation of the generated clustering
     • The ‘geometric’ method exhibits high cluster purity
       - Corresponds to a low false-positive (FP) rate
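Cluster purity can be made concrete with a small example. The slide does not define the measure, so the sketch below assumes one common definition: the share of annotated cluster members that carry the cluster's most frequent annotation. The protein IDs and labels are hypothetical, and each protein is given a single label for simplicity, whereas real InterPro annotation can assign several.

```python
from collections import Counter

def cluster_purity(cluster, annotation):
    """Return (dominant label, purity) over the annotated members of a cluster."""
    labels = [annotation[p] for p in cluster if p in annotation]
    if not labels:
        return None, 0.0                      # nothing to validate against
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)

# Hypothetical annotation table: protein ID -> family label
annotation = {"p1": "family_A", "p2": "family_A", "p3": "family_A", "p4": "family_B"}

label, purity = cluster_purity(["p1", "p2", "p3", "p4", "p5"], annotation)
false_positives = round((1 - purity) * 4)     # 4 annotated members in this toy cluster
print(label, purity, false_positives)
```

Under this definition, members that carry a different label count as false positives, so high purity and a low FP rate are two views of the same quantity. Unannotated members such as `p5` are simply left out of the calculation.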

  22. The Domain Problem
     • Many proteins are composed of several domains
     • The sequence similarity tools used are therefore local in nature:
       - The score for comparing two sequences is the edit distance between their most similar subsequences
     • This creates a false similarity problem:

  23. The Modular Nature of Proteins
     [Figure: domain architectures of K6A1_MOUSE, CSKP_HUMAN, DLG3_MOUSE and MPP3_HUMAN. Legend: Serine/Threonine protein kinase family active site; Protein kinase C-terminal domain; PDZ domain; SH3 domain; Guanylate kinase]

  24. False Transitivity of Local Alignment
     [Figure: K6A1_MOUSE, CSKP_HUMAN, DLG3_MOUSE and MPP3_HUMAN linked by pairwise BLAST E-scores 1e-42, 9e-41, 8e-78 and 2e-47]
     • We ran BLAST using default parameters: all these pairwise similarities have an E-score better than 1e-40
     • If we cluster these proteins, assuming transitivity of local alignment scores, we will cluster K6A1_MOUSE with MPP3_HUMAN
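The false-transitivity effect can be reproduced with a naive "link everything below a threshold" grouping. In the sketch below, the four E-scores come from the slide, but which pair each score belongs to is an assumption, because the figure layout is not preserved. The union-find helper is a generic technique, not ProtoNet's method.

```python
THRESHOLD = 1e-40

# E-scores from the slide; the assignment to specific pairs is ASSUMED
e_scores = {
    ("K6A1_MOUSE", "CSKP_HUMAN"): 1e-42,
    ("CSKP_HUMAN", "DLG3_MOUSE"): 9e-41,
    ("CSKP_HUMAN", "MPP3_HUMAN"): 8e-78,
    ("DLG3_MOUSE", "MPP3_HUMAN"): 2e-47,
}

def transitive_clusters(e_scores, threshold):
    """Group proteins connected by any chain of hits better than the threshold."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]     # path compression
            x = parent[x]
        return x

    for (a, b), e in e_scores.items():
        root_a, root_b = find(a), find(b)     # also registers both proteins
        if e < threshold and root_a != root_b:
            parent[root_a] = root_b           # union the two groups

    groups = {}
    for x in parent:
        groups.setdefault(find(x), set()).add(x)
    return list(groups.values())

clusters = transitive_clusters(e_scores, THRESHOLD)
has_direct_hit = ("K6A1_MOUSE", "MPP3_HUMAN") in e_scores
print(clusters, has_direct_hit)
```

In this toy table, K6A1_MOUSE and MPP3_HUMAN have no direct hit, yet they end up in the same group because each is linked to CSKP_HUMAN through a different local similarity.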

  25. Alternative methods
     • Different types of clustering
     • Non-binary
     • Goal-oriented => semi-guided
     • Graph theory insights
     • Non-clustering ways of exploring the space of proteins
     • Why BLAST E-score???
     • Enrichment of the metric using structure
