Protein homology detection by HMM–HMM comparison Johannes Söding

Protein homology detectionby HMM–HMM comparisonJohannes Söding A topic in Sequence analysis Presented by: Giriprasad Sridhara giri@UDel.edu CISC 841 Spring 2006 APR 20 2006

Organization of presentation • Introduction • Theory • Results • Conclusion

Introduction • Paper Details: • Bioinformatics journal • Vol. 21 no. 7 2005, pages 951–960 • Author Details • Dr. Johannes Söding • Department of Protein Evolution, MaxPlanckInstitute for Developmental Biology, Spemannstrasse 35, D72076 Tübingen, Germany

Introduction • Tool Details: • A tool HHPred has been developed. • Described in Nucleic Acid Research, 2005, Vol 33 • A web server is available at http://www.protevo.eb.tuebingen.mpg.de/toolkit/index.php?view=hhpred

Introduction • Central theme in bioinformatics: • Homology and sequence alignment • Issues: • Problem of finding a close homolog with known function or structure which would allow to make inferences about the protein under observation. • New and highly sensitive methods could detect and align remotely homologous sequences that provide information about the protein’s function, structure or evolution.

Introduction • Methods (Tools) of homology detection: (In increasing order of sensitivity) • Sequence - Sequence • BLAST • FASTA • Profile - Sequence • PSIBLAST • More sensitive since it uses a sequence profile • Profile – Profile • COMPASS • PROF_SIM • Profile - HMM • HMMER • HMM-HMM • HHPred

Introduction • Sequence profiles • Built from a multiple alignment of homologous sequences • Contains more information about the sequence family than a single sequence. • Helps to distinguish between • conserved and non-conserved positions • Conserved are important for defining members of the family • Non-conserved are variable among the members of the family. • Describe exactly • what variation in amino acids is possible at each position • Done by recording the probability for the occurrence of each amino acid along the multiple alignment.

Introduction • Profile Hidden Markov Models (Profile HMMs) • Similar to simple sequence profiles • have amino acid frequencies as in the columns of a MSA • Also have position specific probabilities for inserts and deletions along the alignment • logarithms of these probabilities =position specific gap penalties • Perform better than sequence profiles in the detection of homologs and in the quality of alignments • Why higher sensitivity? • Position specific gap penalties penalize chance hits much more than true positives • which tend to have insertions or deletions at the same positions as the sequences from which the HMM was built.

Introduction • Pictorial representation of profile HMM With M, I and D states.

Theory • Align 2 HMM by maximizing a score • Score is log-sum-of-odds score. • What does a path through the 2 HMMs Represent? • A sequence co-emitted by both HMMs • How do we find this path? • Use dynamic programing (Viterbi) • Find path that maximizes log-sum-of-odds score

Theory • Advantages of HMM-HMM • Improves both sensitivity and alignment quality • Calibrate the score for additional sensitivity • Use scoring correlation function • Use secondary structure information • Even sequences that are distantly homologous will have similar secondary structures. • This can help distinguish real homologs from chance hits • Biologically, secondary structures diverge more slowly than sequences • This knowledge is utilized.

Theory

Theory • Additionally to enhance homology detection • Score secondary structure • Use other available additional information (like confidence – term covered later on in the slides) • Tool HHPred • Homology detection & structure prediction • Novelty • HMM-HMM comparison • Scores secondary structure • Reliability measured by • Probability of each match being a true positive • Used since e-values reported by most tools can be inaccurate

Theory (Log-sum-of-odds score) • Defined as • Numerator • probability that x1,…xL is co-emitted by both HMMs along the alignment path • Denominator • Standard null model probability • Summation • Runs over all sequences of L residues that can be emitted along the alignment path by both HMMs

Theory • How do we apply Viterbi algorithm? • Denote • 2 HMMs p and q • Probability of emitting amino acid a in match state i or j is qi(a) and pj(a) • Trans prob = qi(X, X’) and pj(Y,Y’) • X or Y can belong to {M, I or D} • f(a) = fixed background frequency • Let Xk and Yk be states in q and p in the k’th column of the alignment of q with p. • i(k) and j(k) be the corresponding columns from q and p. • qk(l)P (a) and pk(l)P(a) = emission prob from q and p.

Theory • Ρtr is the product of all transition probabilities for the path through p and q • qk(l)P (a) = qi(k) (a) for Xk = M • qk(l)P (a) = f (a) for Xk = I

Theory • Column score properties: • Positive when 2 distributions are similar • Negative otherwise • Insert states have vanishing column score • Completely non-conserved, pj(a) = f(a) • 1/f(a) • Weight factor to co-emission probability • For a rare amino acid • f(a) will be low  1/f(a) will be high •  Weight of rarer amino acids increases in the score calculation as compared to common amino acids.

Theory • Pair-wise alignment of HMMs • Allowed transitions • Dynamic programing matrices for Viterbi

Theory • We use 5 DP matrices S xy one for each pair state XY belonging to {MM, MI, IM, DG, GD} • SMM (i, j) = Saa(qi,pj) + max { SMM(i-1,j-1) + log[q i-1(M,M) p j-1(M,M)], SMI(i-1,j-1) + log[q i-1(M,M) p j-1(I,M)] SIM(i-1,j-1) + log[q i-1(I,M) p j-1(M,M)] SDG(i-1,j-1) + log[q i-1(D,M) p j-1(M,M)] SGD(i-1,j-1) + log[q i-1(M,M) p j-1(D,M)] }

Theory • SMI (i, j) = max { SMM(i-1,j) + log[q i-1(M,M) p j-1(M,I)], SMI(i-1,j) + log[q i-1(M,M) p j-1(I,I)] } • SDG (i, j) = max { SMM(i-1,j) + log[q i-1(M,D)], SDG(i-1,j) + log[q i-1(D,D) } • Initialize SMM(I,0) = 0 = SMM(0,j) • S LSO = max over last row, col of S MM • Trace back from this cell.

Theory • Scoring correlations • Clustering • In an alignment of 2 homologous HMMs • Expect high column scores in • Clusters along the sequence • In an alignment of non-homologous HMMs • Do not Expect any clustering. • The above can help • Differentiate homologous and non-homologous alignments • If l’th pair state of optimum path aligns columns i(l) of q and j(l) of p • Sl = SAA(qi(l), pj(l)) iff l’th pair state = MM, else 0. • Auto-correlation function

Theory • Scoring correlations • Auto-correlation function describes correlation of Sl at a fixed sequence separation d • Expect • if 2 HMMs are homologous • A Positive g(d) for small d. • Add a correction factor • wcorr is found empirically to be 0.1 • The correction factor is added after the best alignment is found.

Theory • Scoring secondary structure • Allows to score predicted secondary structure against • Another predicted secondary structure • Or a known secondary structure • Predicted secondary structure vs. known secondary structure. • DSSP used to assign 1 of 7 states of observed secondary structure • PSIPRED used to predict secondary structure states, H, E or C. • Predict secondary structure of all domains in SCOP (filtered to twilight zone) • Compare the PSIPRED predictions with DSSP • Get the count of combination of (σ;ρ,c). • σ belongs to {H,E,C,G,B,S,T} • ρ belongs to {H,E,C} • c belongs to {0,1,…,9}

Theory • Scoring secondary structure • Derive 10 3*7 substitution matrices (one for each confidence value) Mss(σ;ρ,c) = log (P (σ;ρ,c)/P(σ)P(ρ,c)) • Let • Column i of HMM q have pred sec struct ρiq and confidence value ciq • Column j of HMM p have known sec struct σjp (Note: known sec struct  secondary structure of seed seq of alignment) • Define • Sss(q I p j) = wss Mss(σjp;ρiq ciq) • Empirically Wss is 1/7. • This score is added to amino acid column score Saa(qi, pj) for use in the Viterbi algorithm.

Theory • Scoring secondary structure (predicted vs predicted) • The above matrix informs • How much more probable is it to get the predictions ρiq ciq and ρjp cjp for a pair of aligned homologous residues than to get them independently of each other. • Sss(q I p j) = wss Mss(ρiq ciq ρjp cjp) • Empirically Wss is 1/7. • This score is added to amino acid column score Saa(qi, pj) for use in the Viterbi algorithm.

Results and Discussion • All-against-All comparison with the following similarity search tools: • Sequence-Sequence • BLAST • Profile-Sequence • PSI-BLAST • HMM-Sequence • HMMER • Profile-Profile • COMPASS • PROF_SIM • Test • Input below the twilight zone • Ability to detect remote homologs • Ability to give high-quality alignments.

Results and Discussion • Different versions of tool used for better juxtaposition of results • HHSearch 0 • Simple profile-profile comparison • Gap opening penalty = -3.5, Gap Extension = -0.2 • Above used instead of transition prob log • HHSearch 1 • Basic HMM-HMM version • HHSearch 2 • Version 1 + inclusion of correlation score • HHSearch 3 • Version 2 + usage of predicted vs predicted secondary structure • HHSearch 4 • Version 3 + usage of predicted vs known secondary structure

Results and Discussion • SCOP (structural classification of proteins) database with filtering for twilight zone used. • Detection of homologs: • Domain in SCOP • Family or superfamily or fold or class • Pair of domains are homologous • If they are members of the same super family • Domains from different classes are classified as non-homologous • We present a chart of TP vs FP • TP  homologous pairs • FP  non-homologous pairs.

Results and Discussion • The figure shows classical sensitivity in the benchmark test.

Results and Discussion • Alternative definition of TP and FP • A pair is a TP • If the domains belong to same SCOP super-family • Or if the seq based alignment gives structural alignment with a “maxSub” score of at least 0.1 • A pair is a FP • If it is from different classes and has 0 MaxSub score • What is MaxSub score? • Informally • Defined such that a value > 0 occurs very rarely by chance • It tells what fraction of the query residues can be superposed structurally with the aligned residues from the other structure. • Formally • Weighted number of aligned pairs that can be superimposed with a maximum distance per pair of 3.5 Angstrom units/number of residues in the query sequence • Pairs with 0 Angstorm deviation  wieght 1 • Pairs with 3.5 Angstorm deviation  wieght 0.5

Results and Discussion • Plot of TP vs FP with new definition of TP and FP

Results and Discussion • Observation • More sensitive tools which use secondary structure (HHSearch 3, 4) improve • Reason • Reclassification of “harder to detect” ones as TP helps the more sensitive tools, since they would detect these.

Results and Discussion(Alignment quality) • Sequence alignment assessed by • Looking at the spatial distances between aligned pair of residues • upon superposition of the 3D structures • 2 scores used. • maxSub score • Drawback • Does not penalize over-prediction • Developer’s score • S Dev = N correct/min (Lq, Lp) • N Correct = No of residue pairs that are present in the max subset identified by maxSub • Lq and Lp = No of residues in the 2 sequences to be aligned. • Modeler’s score • S Mod = N correct / L ali • L Ali = No of aligned residue pairs in the seq alignment. • Does not penalize under-prediction. • Balanced score • S balanced = (S dev + S mod) / 2 • Penalizes both under and over prediction

Results and Discussion

Results and Discussion • HHSearch3 performs the best • Family level • Aligns 58% of all pairs with balanced score >= 0.3 • 1.23 times more than COMPASS • 1.28 times more than PROF_SIM • 1.34 times more than HMMER • 1.57 times more than PSI_BLAST • 4.4 times more than BLAST • Super family level • Aligns 27% of all pairs with balanced score >= 0.3 • 1.7 times more than COMPASS • 1.9 times more than PROF_SIM • 2.2 times more than HMMER • 2.9 times more than PSI_BLAST • 14 times more than BLAST

Results and Discussion • HHSearch3 performs the best • Fold level • Aligns 4.5% of all pairs with balanced score >= 0.3 • 3.3 times more than COMPASS • 6.0 times more than PROF_SIM • 7.3 times more than HMMER • 9.4 times more than PSI_BLAST • 63 times more than BLAST • Actually 4.5% at fold level is a lot • Pairs aligned at fold level are deemed non-homologous by SCOP • So we do not expect any good alignments at all

Conclusion • A generalization of HMM – Sequence alignment • Pairwise alignment of profile HMMs • Algorithm to maximize log-sum-of-odds score • Generalization of log-odds score • Increased sensitivity of 5-10% • Due to derivation of novel correlation score • Statistical methods for • Scoring predicted vs known secondary structure • Predicted vs predicted secondary structure • Uses confidence values of secondary structure prediction

Conclusion • HHPred • New tool based on the research paper • Benchmarking • With 5 other homology detection tools • Dataset in twilight zone • Results • Improvement in • Sensitivity • Alignment quality

Thank you. Have a nice day!

Protein homology detection by HMM–HMM comparison Johannes Söding

Protein homology detection by HMM–HMM comparison Johannes Söding

Presentation Transcript

Protein Homology Modelling

Ruminant Protein Nutrition

Addressing Housing and Food Insecurity with Program Income

Detection and Evasion of Web Application Attacks

Nuclear Magnetic Resonance (NMR) Data Protein–Protein Docking

Large-Scale Copy Detection

Apo calypse now !

Sequence Comparison

ecs236 Winter 2006: Intrusion Detection #2: Vulnerability Analysis

Ch. 6: Face detection

Protein interactions and Pathways

Protein – protein interaction

The Protein

Homology Modeling

Detection of Fraud

Protein Chemistry Basics

DockoMatic : Automated Tool for Homology Modeling and Docking Studies

Lesson 2