Proteomics: Analyzing protein space

Protein families

Why proteins? • Shift of interest from “Genomics” to “Proteomics”

Classification of proteins into groups/families: what is it good for? • Explosion in biological sequence data => need to organize! • Understanding the relations/hierarchy of groups is interesting in itself, e.g. in evolutionary research. • For applied research: - Annotation of new proteins: predicting their function, structure, cellular localization etc. - Looking for new folds

Sequence-based classification • By sequence similarity (domains, motifs or complete proteins): Pfam, PROSITE, SMART, InterPro etc. • InterPro: synthesizes the data from Pfam, PROSITE, PRINTS, ProDom and SMART. Considered the “best” domain-based classification available

Other kinds of classification • Global classification: - Systers, Protomap, CLUSTr - MetaFam synthesizes global classification data • By structure similarity: SCOP etc. • By function: Albumin, RetNet, TumorGenes etc.

ProtoNet http://www.protonet.cs.huji.ac.il • A long-term project at HUJI led by Michal & Nati Linial. • Provides automatic global classification of the known proteins. • Performs hierarchical clustering on a sequence-based metric space of proteins. • Allows an external protein to be “placed” into the hierarchy.

Why clustering? • We want to refine the “similarity” notion, compared to e.g. BLAST • Exploit transitivity to improve grouping • Can use a low threshold on similarity: - uses vast information from low similarities - allowable because clustering filters noise

Why hierarchical? [Figure panels: Vertical Perspective; Horizontal Perspective]

ProtoNet: Pre-Computation • All-against-all gapped BLAST using BLOSUM62 • SwissProt release 40.28 database (114,033 proteins) • BLAST identified ~2×10^7 relations between these proteins with relatively high sequence similarity (E-score of 100 or less): - Don’t want to lose information => very permissive! - But still far fewer than all ~6.5×10^9 possible pairs, which would be infeasible to handle
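The two orders of magnitude on this slide can be checked with a few lines of arithmetic. This is only a sketch, assuming the larger figure counts unordered pairs of distinct proteins; the variable names are illustrative.

```python
# Figures taken from the slide.
n_proteins = 114_033      # SwissProt release 40.28
n_relations = 2e7         # approximate number of BLAST relations kept

# Assumption: "all pairs" means unordered pairs of distinct proteins.
all_pairs = n_proteins * (n_proteins - 1) // 2
fraction_kept = n_relations / all_pairs

print(f"all possible pairs: {all_pairs:.2e}")
print(f"fraction of pairs with a stored relation: {fraction_kept:.2%}")
```

Under that assumption, even the very permissive threshold stores a relation for well under 1% of all pairs.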

Clustering Method • First, each protein is considered a singleton cluster

Clustering Method • Next, we iteratively merge pairs of clusters • At each step we merge the ‘most similar’ pair of clusters.

Clustering Method • As we progress, the number of singletons drops

Clustering Method • The clustering process gradually generates a tree of clusters • Stop whenever we like
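The bottom-up procedure described on these slides can be sketched as a naive agglomerative loop. This is a toy illustration, not ProtoNet's implementation: the function name `agglomerate`, the distance table, and the choice of the arithmetic mean as merging score are assumptions made for the example.

```python
from itertools import combinations

def agglomerate(items, dist, n_clusters=1):
    """Naive bottom-up clustering. Returns (clusters, merge history)."""
    clusters = [frozenset([x]) for x in items]   # start: every item is a singleton
    history = []

    def score(a, b):
        # Arithmetic mean of all inter-cluster pair distances (lower = more similar).
        pairs = [dist[frozenset((x, y))] for x in a for y in b]
        return sum(pairs) / len(pairs)

    while len(clusters) > n_clusters:            # "stop whenever we like"
        a, b = min(combinations(clusters, 2), key=lambda p: score(*p))
        history.append((a, b, score(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters, history

# Hypothetical distances between four items.
raw = {("A", "B"): 1, ("C", "D"): 2, ("A", "C"): 8,
       ("A", "D"): 9, ("B", "C"): 7, ("B", "D"): 10}
dist = {frozenset(k): v for k, v in raw.items()}

clusters, history = agglomerate("ABCD", dist, n_clusters=2)
for a, b, s in history:
    print(sorted(a), "+", sorted(b), "merged at score", s)
```

The merge history is the tree: each entry records which two clusters were joined and at what score, so any level of the hierarchy can be recovered by replaying a prefix of it.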

How to merge? [Figure: two clusters, m and n] • The potential merging score is calculated at each level for each pair of clusters that are candidates for merging • At the bottom equals • Higher up, it is designed to reflect the similarity of clusters. • Depends on the inter-cluster similarities of pairs of proteins, each from a different cluster.

Potential Merging Score of clusters m and n • Arithmetic Mean • Geometric Mean • Harmonic Mean • For positive values: Arithmetic Mean ≥ Geometric Mean ≥ Harmonic Mean
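The three averaging rules can be compared on a small example. The sketch below assumes the averaged quantity is the E-score of every inter-cluster protein pair (lower means more similar); the protein names and E-scores are made up for illustration.

```python
import math

def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def geometric_mean(xs):
    # Computed in log space to avoid underflow with very small E-scores.
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

def harmonic_mean(xs):
    return len(xs) / sum(1.0 / x for x in xs)

def merging_score(cluster_m, cluster_n, escore, mean):
    """Average the E-scores of all pairs (p, q) with p in m and q in n."""
    values = [escore[frozenset((p, q))] for p in cluster_m for q in cluster_n]
    return mean(values)

# Hypothetical inter-cluster E-scores.
escore = {
    frozenset(("p1", "q1")): 1e-10,
    frozenset(("p1", "q2")): 1e-2,
    frozenset(("p2", "q1")): 1e-6,
    frozenset(("p2", "q2")): 1.0,
}
m, n = ["p1", "p2"], ["q1", "q2"]

am = merging_score(m, n, escore, arithmetic_mean)
gm = merging_score(m, n, escore, geometric_mean)
hm = merging_score(m, n, escore, harmonic_mean)
print(f"arithmetic={am:.3g}  geometric={gm:.3g}  harmonic={hm:.3g}")
```

On E-scores the arithmetic mean is dominated by the weakest pair and the harmonic mean by the strongest pair, so the three rules can rank the same candidate merges very differently.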

Missing Data Treatment • For a very-low-similarity pair (outside the ~2×10^7 relations), its length is defined as • Practically, the merging process should finish when the weight of the “infinite” lengths in the calculation of the score between new clusters becomes very large (losing signal)
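One way to picture this treatment is a scoring function that substitutes a fixed default for every pair without a stored relation and reports how much of the score comes from such defaults. This is a sketch only: the value of `default` below is a hypothetical placeholder, not the value used by ProtoNet, and the arithmetic mean is chosen just for simplicity.

```python
def score_with_missing(cluster_m, cluster_n, escore, default):
    """Arithmetic-mean score where unrecorded pairs get `default`.

    Returns (score, fraction of pairs that were missing).
    """
    values = []
    n_missing = 0
    for p in cluster_m:
        for q in cluster_n:
            key = frozenset((p, q))
            if key in escore:
                values.append(escore[key])
            else:
                values.append(default)   # pair outside the stored relations
                n_missing += 1
    return sum(values) / len(values), n_missing / len(values)

# Hypothetical data: only one of the four inter-cluster pairs has a relation.
escore = {frozenset(("p1", "q1")): 1.0}
score, missing_fraction = score_with_missing(
    ["p1", "p2"], ["q1", "q2"], escore, default=1000.0  # placeholder value
)
print(score, missing_fraction)
```

A stopping rule in the spirit of the slide could then compare the missing fraction against a chosen cutoff and halt merging once most of a score is made up of defaults.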

Why clustering at all? • We want to extend the range of “similarity”, compared to e.g. BLAST • Exploit transitivity to improve grouping • Can use a low threshold on similarity: - uses vast information from low similarities - allowable because clustering filters noise

Results: ProtoNet top 20 • The 20 largest clusters in the ProtoNet (Arithmetic) tree at a preselected level

Problem of result assessment: what is a “good” cluster? • Contains all proteins in the family and no proteins outside it • But what is a family? Does any keyword define a family? • Stable as the merging events occur (long lifetime)?

Problem of result assessment: what is a “good” tree? • Should we trust the resulting forest? • Which clustering technique is better? Combined? • Bootstrap? • Do the clusters correspond to meaningful families of proteins? • Validation against InterPro, SCOP etc. • But we do not want merely to reconstruct them automatically! • What is the right level/cut at which to look at the forest?

InterPro Validation • InterPro annotation allows systematic validation of the generated clustering • The ‘geometric’ method exhibits high cluster purity • Corresponds to a low false-positive (FP) rate
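Cluster purity against an external annotation can be computed along these lines. This is a minimal sketch under stated assumptions: each annotated protein carries a single family label, unannotated proteins are ignored, and the labels `fam_A` and `fam_B` are hypothetical rather than real InterPro entries.

```python
from collections import Counter

def purity(cluster, annotation):
    """Return (dominant label, fraction of annotated members carrying it).

    Returns None when no member of the cluster is annotated.
    """
    labels = [annotation[p] for p in cluster if p in annotation]
    if not labels:
        return None
    [(label, count)] = Counter(labels).most_common(1)
    return label, count / len(labels)

# Hypothetical cluster and annotation; p5 has no annotation.
cluster = ["p1", "p2", "p3", "p4", "p5"]
annotation = {"p1": "fam_A", "p2": "fam_A", "p3": "fam_A", "p4": "fam_B"}

result = purity(cluster, annotation)
print(result)
```

In this reading, annotated members that do not carry the dominant label play the role of false positives, so high purity and a low FP rate are two views of the same count.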

The Domain Problem • Many proteins are composed of several domains • The sequence similarity tools used are therefore local in nature: - The score for comparing two sequences is the edit distance between their most similar subsequences • This creates a false similarity problem:

The Modular Nature of Proteins • [Figure: domain composition of K6A1_MOUSE, CSKP_HUMAN, DLG3_MOUSE and MPP3_HUMAN] • Domains shown: Serine/Threonine protein kinase family active site; Protein kinase C-terminal domain; PDZ domain; SH3 domain; Guanylate kinase
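The effect of local scoring on modular proteins can be reproduced with a toy Smith-Waterman scorer. This is an illustration only: BLAST is a heuristic that scores with BLOSUM62, whereas the sketch uses a simple match/mismatch/gap scheme, and the three "domains" are made-up strings.

```python
def local_alignment_score(s, t, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score (linear gap penalty)."""
    prev = [0] * (len(t) + 1)
    best = 0
    for i in range(1, len(s) + 1):
        cur = [0]
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best

# Three hypothetical domains with no letters in common.
domain_x = "AAAAAAAAAA"
domain_y = "MNPQRSTVWY"
domain_z = "GGGGGGGGGG"

protein_1 = domain_x + domain_y   # domains X and Y
protein_2 = domain_y + domain_z   # domains Y and Z

shared = local_alignment_score(protein_1, protein_2)
unrelated = local_alignment_score(domain_x, domain_z)
print(shared, unrelated)
```

The two proteins score as highly as two copies of the shared domain alone, even though half of each sequence has nothing in common with the other.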

False Transitivity of Local Alignment • [Figure: K6A1_MOUSE, CSKP_HUMAN, DLG3_MOUSE and MPP3_HUMAN linked by pairwise BLAST E-scores of 1e-42, 9e-41, 8e-78 and 2e-47] • We ran BLAST using default parameters: all these pairwise similarities have an E-score better than 1e-40 • If we cluster these proteins, assuming transitivity of local alignment scores, we will cluster K6A1_MOUSE with MPP3_HUMAN
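The transitivity effect can be made concrete by treating every significant pairwise hit as an edge and taking connected components. In the sketch below the chain of edges is an assumption based on the order in which the slide lists the proteins; which E-score belongs to which pair is not recoverable here, so the edges carry no scores.

```python
def connected_components(nodes, edges):
    """Group nodes linked directly or indirectly by edges (union-find)."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)

    groups = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    return list(groups.values())

proteins = ["K6A1_MOUSE", "CSKP_HUMAN", "DLG3_MOUSE", "MPP3_HUMAN"]
# Assumed topology: each protein has a significant hit with the next one listed.
edges = [
    ("K6A1_MOUSE", "CSKP_HUMAN"),
    ("CSKP_HUMAN", "DLG3_MOUSE"),
    ("DLG3_MOUSE", "MPP3_HUMAN"),
]

components = connected_components(proteins, edges)
print(components)
```

All four proteins land in one component although no edge joins K6A1_MOUSE and MPP3_HUMAN directly, which is the false grouping the slide warns about.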

Alternative methods • Different types of clustering • Non-binary • Goal-oriented => semi-guided • Graph theory insights • Non-clustering ways of exploring the space of proteins • Why BLAST E-score??? • Enrichment of the metric using structure
