Classifying the protein universe

Synapse-Associated Protein 97 Classifying the protein universe Wu et al, 2002. EMBO J 19:5740-5751

Domain Analysis and Protein Families • Introduction • What are protein families? • Protein families • Description & Definition • Motifs and Profiles • The modular architecture of proteins • Domain Properties and Classification

Protein family 1 Protein family 2 Protein Families • Protein families are defined by homology: • In a family, everyone is related to everyone • Everybody in a family shares a common ancestor:

1chg 1sgt 1chg 1sgt Homology versus Similarity • Homologous proteins have similar 3D structures and (usually) share common ancestry: • 1chg and 1sgt  31% identity, 43% similarity • We can infer homology from similarity! Superfamily: Trypsin-like Serine Proteases

1chg 1sgc 1chg 1sgc Homology versus Similarity • But Homologous proteins may not share sequence similarity: Superfamily: Trypsin-like Serine Proteases 1chg and 1sgc  15% identity, 25% similarity We cannot infer similarity from homology

1chg 2baa 1chg 2baa Homology versus Similarity • Similar sequences may not have structural similarity: 1chg and 2baa  30% similarity, 140/245 aa We cannot assume homology from similarity!

Homology versus Similarity • Summary • Sequences can be similar without being homologous • Sequences can be homologous without being similar Families ?? Evolution / Homology BLAST Similarity

Description of a Protein Family • Let’s assume we know some members of a protein family • What is common to them all? • Multiple alignment!

Techniques for searching sequence databases to Some common strategies to uncover common domains/motifs of biological significance that categorize a protein into a family • Pattern - a deterministic syntax that describes multiple combinations of possible residues within a protein string • Profile - probabilistic generalizations that assign to every segment position, a probability that each of the 20 aa will occur • Intermediate sequence search - link many profile searches

Motif Description of a Protein Family • Regular expressions: ........C.............S...L..I..DRY..I.......................W... I E W V / C x{13} S x{3} [LI] x{2} I x{2} [DE] R [YW] x{2} [IV] x{10} – x{12} W /

Automated Motif Discovery • Given a set of sequences: • GIBBS Sampler • http://bayesweb.wadsworth.org/cgi-bin/gibbs.8.pl?data_type=protein • MEME • http://meme.sdsc.edu/meme/ PRATT • http://www.ebi.ac.uk/pratt • TEIRESIAS • http://cbcsrv.watson.ibm.com/Tspd.html • Combinatorial output!

Automated Profile Generation • Any multiple alignment is a profile! • PSIBLAST • Algorithm: • Start from a single query sequence • Perform BLAST search • Build profile of neighbours • Repeat from 2 … • Very sensitive method for database search

Profile2 After n iterations Query Profile1 ... Threshold for inclusion in profile PSI-BLAST • Position Specific Iterative Blast • PSI-Blast profile models only positions in the query sequence

HMMs • Hidden Markov Models are Statistical methods that consider all the possible combinations of matches, mismatches, and gaps to generate a consensus (Higgins, 2000) • •Sequence ordering and alignments are not necessary at the onset (but in many cases alignments are recommended) • More the number of sequences better the models. • One can Generate a model (profile/PSSM), then search a database with it (Eg: PFAM)

HMM libraries • PFAM • http://www.sanger.ac.uk/Pfam • The Pfam database is a large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs). • Pfam-A entries are high quality, manually curated families. • Pfam-B entries are generated automatically.

GTG steps • Generate alignment trace graph • Nodes = residues • Edges = aligned in PSI-Blast library • Unweighted • Edge weighting • Using consistency • Clustering • Driven by consistency • Single site occupancy rule • Post-processing • Generate non-redundant set of inter-cluster edges • Identify sub-trees with conserved residues

Protein 1 Protein 2 Protein 3 Protein 4 Protein 5 Alignment trace graph Residues more residues • Graph representation of input pairwise alignment data • Vertices = residues • Edges = aligned in a pairwise alignment from input library

Consistency = neighbour overlap i j Weight = intersection / union

GTG – global trace graph • Input: PSI-Blast all versus all alignments in NRDB40 • Output: superalignment of all proteins • Applications • Pairwise alignment of query and target sequences • Transitive sequence database searching (fast) • Tracking conserved residues (feature space)

Protein 1 Protein 2 Protein 3 Protein 4 Protein 5 Protein 1 Protein 2 Protein 3 Protein 4 Protein 5 Edge weight = consistency (fraction of common neighbours) Cluster ≈ hypothetical column of multiple alignment (single site occupancy) Alignment trace graph Cluster 1 Cluster 2

consistency consistency consistency A H G A A A K K K K K K K K A ‘Motif tracking’ Each vertex is labelled with source protein and position in sequence. Motifs are subtrees enriched in one particular amino acid type.

Remote homolog detectionbased on GTG alignment score GTG clustering is informative; detect as many remote homologs as threading methods

Summary • Super-families form elongated clusters in “protein space” • Profile models fluctuations around an equilibrium point • Consistency ~ path model • Exploits multiple profile models • Discriminative in database searching • Global trace graph data structure • Feature space for pattern discovery http://ekhidna.biocenter.helsinki.fi/gtg/start

Relationships between families • Pfam clans • A clan is a collection of Pfam-A entries which are related by similarity of sequence, structure or profile-HMM. • Superfamily • http://supfam.mrc-lmb.cam.ac.uk/SUPERFAMILY/hmm.html • The sequence search method uses a library (covering all proteins of known structure) consisting of 1539 SCOP superfamilies from classes a to g. Each superfamily is represented by a group of hidden Markov models. • Pfam-squared • Based on GTG comparisons of representative sequences from each PFAM-A family against all PFAM-A families. • Rules of thumb: motif score>1000 means probably related, motif score >500 means possibly related, score <500 means dubious

Benchmarking a motif/profile • You have a description of a protein family, and you do a database search… • Are all hits truly members of your protein family? • Benchmarking: TP: true positive TN: true negative FP: false positive FN: false negative Result family member Dataset not a family member unknown

Benchmarking a motif/profile • Precision / Selectivity • Precision = TP / (TP + FP) • Sensitivity / Recall • Sensitivity = TP / (TP + FN) • Balancing both: • Precision ~ 1, Recall ~ 0: easy but useless • Precision ~ 0, Recall ~ 1: easy but useless • Precision ~ 1, Recall ~ 1: perfect but very difficult

Triosephosphate isomerase Phosphoglycerate kinase The Modular Architecture of Proteins • BLAST search of a multi-domain protein

What are domains? • Functional - from experiments: example: Decay Accelerating Factor (DAF) or CD55 • Has six domains (units): • 4x Sushi domain (complement regulation) • 1x ST-rich ‘stalk’ • 1x GPI anchor (membrane attachment) • PDB entry 1ojy (sushi domains only) P Williams et al (2003) Mapping CD55 Function. J Biol Chem 278(12): 10691-10696

There is only so much we can conclude… • Classifying domains [To aid structure prediction (predict structural domains, molecular function of the domain)] • Classifying complete sequences (predicting molecular function of proteins, large scale annotation) • Majority of proteins are multi-domain proteins.

What are domains? • Mobile – Sequence Domains: Protein 1 Protein 2 Protein 3 Protein 4 Mobile module

Domains are... • ...evolutionary building blocks: • Families of evolutionarily-related sequence segments • Domain assignment often coupled with classification • With one or more of the following properties: • Globular • Independently foldable • Recurrence in different contexts • To be precise, • we say: “protein family” • we mean: “protein domainfamily”

Example: global alignment • Phthalate dioxygenase reductase (PDR_BURCE) • Toluene - 4 -monooxygenase electron transfer component (TMOF_PSEME) Global alignment fails! Only aligns largest domain.

Sometimes even more complex! PGBM_HUMAN:“Basement membrane-specific heparan sulphate proteoglycan core protein precursor” 980 1960 2940 3920 4391 45 domains of 9 different type, according to PFam http://www.sanger.ac.uk/cgi-bin/Pfam/swisspfamget.pl?name=P98160 http://www.glycoforum.gr.jp/science/word/proteoglycan/PGA09E.html

Properties of domains • Most domains: size approx 75 – 200 residues

So, you have a sequence... • ...look it up in existing database • INTERPRO: http://www.ebi.ac.uk/interpro • ...search against existing family descriptions • PFAM: http://www.sanger.ac.uk/Software/Pfam • INTERPROSCAN: http://www.ebi.ac.uk/Tools/InterProScan/

Classifying the protein universe