
Statistical Evaluation of Pairwise Protein Sequence Comparison with the Bayesian Bootstrap


Presentation Transcript


  1. Statistical Evaluation of Pairwise Protein Sequence Comparison with the Bayesian Bootstrap Gavin A. Price, Gavin E. Crooks, Richard E. Green and Steven E. Brenner

  2. Goals • Detail an improved statistical benchmark of pairwise protein sequence comparison algorithms • Apply the analysis to a comparative study of amino acid substitution matrix families

  3. Goals • Detail an improved statistical benchmark of pairwise protein sequence comparison algorithms • Apply the analysis to a comparative study of amino acid substitution matrix families

  4. Background • What is a superfamily? • Proteins that have low sequence identities, but whose structural and functional features suggest that a common evolutionary origin is probable, are placed together in superfamilies‡ • ‡ Introduction to Structural Classification of Proteins, http://scop.mrc-lmb.cam.ac.uk/scop/intro.html

  5. Building the Database • SCOP (version 1.61) • Evolutionary interrelations are known • Homologous domains placed in same superfamily • No two sequences have >40% pairwise identity • Focus on remote homologs • Subset is non-redundant • ASTRAL provides SCOP subsets • Further divided into training (2592 sequences) and test (2182 sequences) sets

  6. Alignment scoring • Collect alignment scores (E-values) from pair-wise alignment of all sequences in the database • E-value: number of alignments with scores greater than or equal to score S that are expected to occur by chance in a database search • Vary the alignment score threshold • Allow few false positive errors with stringent cutoff • Allow more errors but find more true relations with more permissive threshold • Plot the results in a coverage versus error per query (CVE) plot
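As an illustration of how such a curve might be computed (a minimal sketch, not the authors' code), assume each pairwise comparison has been reduced to an E-value plus a flag marking whether the pair is a true same-superfamily relation:

```python
# Minimal sketch of building a coverage-versus-error-per-query (CVE) curve.
# Assumes `hits` is a list of (evalue, is_true_relation) pairs for all pairwise
# comparisons, `n_true` is the total number of true relations, and `n_queries`
# is the number of query sequences; all names are illustrative.

def cve_curve(hits, n_true, n_queries, thresholds):
    """Return (errors_per_query, coverage) points for each E-value threshold."""
    points = []
    for t in thresholds:
        accepted = [h for h in hits if h[0] <= t]         # hits passing the cutoff
        true_found = sum(1 for _, ok in accepted if ok)   # true relations recovered
        false_pos = len(accepted) - true_found            # spurious matches
        coverage = true_found / n_true                     # fraction of true relations found
        epq = false_pos / n_queries                        # errors per query
        points.append((epq, coverage))
    return points
```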

  7. Example CVE plot (figure) • x-axis: number of false positive matches • y-axis: proportion of true relations found • Plot annotations: "Many false positives, few true homologs" vs. "Few false positives, many true homologs"

  8. Superfamily Size Normalization • Why Normalize? • Large superfamilies account for most of the true relations in the database • Their protein domains are more diverse, so interrelations are harder to discover • Unnormalized coverage is dominated by the largest SCOP superfamilies • As superfamily size grows, the number of relationships grows quadratically • Normalization prevents skewing of performance evaluations

  9. Superfamily Size Normalization • Unnormalized case • Coverage is the fraction of all true relations that are found • Each correctly deduced sequence relationship contributes 1/(n(n−1)) = 1/(n²−n) to the coverage • n – number of sequences in the database • n²−n – number of possible sequence relationships • 1/(n²−n) – weight of each sequence match

  10. Superfamily Size Normalization • Two possible types of Normalization • Linear • Quadratic • Normalization downweights larger superfamilies

  11. Superfamily Normalization • Linear Normalization • Average fraction of true relations found per sequence • Weight of each sequence match is divided by the number of true homologs of the query, giving 1/(n(s−1)) • (s−1) – no. of true homologs of the query • s – superfamily size

  12. Superfamily Normalization • Quadratic Normalization • Average fraction of true relations found per superfamily • Weight of each sequence match is divided by the total coverage of the superfamily, i.e. the number of true relations within it, giving 1/(S(s²−s)) • (s²−s) – number of true relations within the superfamily • s – size of the superfamily • S – number of superfamilies in the database
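Taken together, slides 9, 11 and 12 define three per-relation weights; the sketch below simply restates those formulas in code (unit sequence weights; n, s and S as defined on the slides):

```python
# Per-relation weights from slides 9, 11 and 12 (unit sequence weights).
# n = number of sequences in the database, s = size of the query's
# superfamily, S = number of superfamilies in the database.

def unnormalized_weight(n):
    # Every correct relation counts equally: 1 / (n^2 - n) possible relations.
    return 1.0 / (n * n - n)

def linear_weight(n, s):
    # Each match is down-weighted by the query's (s - 1) true homologs,
    # giving the average fraction of true relations found per sequence.
    return 1.0 / (n * (s - 1))

def quadratic_weight(S, s):
    # Each match is down-weighted by the (s^2 - s) true relations inside
    # the superfamily, giving the average fraction found per superfamily.
    return 1.0 / (S * (s * s - s))
```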

  13. Standard Bootstrapping • Tests if measured results are statistically significant, given finite size of datasets • Randomly sample N sequences with replacement • N – original database size • Collect many replica sets, analyze, and interpret

  14. Standard Bootstrapping • Problems • Replicas are biased relative to the original data, underestimating the true coverage • Why? • Each sequence is represented zero, one or more times • The chance of omitting any particular sequence is (1 − 1/N)^N ≈ 1/e ≈ 37% (sequence counts are approximately Poissonian) • Each replica has a ~60% chance of losing every true relation within a given size-2 superfamily, since both members must be drawn (probability ≈ (1 − 1/e)² ≈ 40%)
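A minimal sketch of the resampling step from slide 13 together with the two probabilities quoted above; the database size 2182 is taken from slide 5, everything else is illustrative:

```python
import math
import random

def standard_bootstrap_replica(sequences):
    """Draw N sequences with replacement from the original N-sequence database."""
    n = len(sequences)
    return [random.choice(sequences) for _ in range(n)]

# Probability that a particular sequence is absent from one replica:
# (1 - 1/N)^N -> 1/e ~ 0.37 as N grows (counts are approximately Poisson).
n = 2182  # size of the test set (slide 5)
p_missing = (1 - 1 / n) ** n
print(f"P(sequence absent) ~ {p_missing:.2f}")                 # ~0.37

# A size-2 superfamily keeps its single true relation only if *both*
# members are drawn, so the relation is lost with probability ~0.60.
p_lost = 1 - (1 - math.exp(-1)) ** 2
print(f"P(size-2 superfamily relation lost) ~ {p_lost:.2f}")   # ~0.60
```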

  15. Standard Bootstrapping • Problems • Sequences are preferentially selected from larger, more diverse superfamilies, whose correct sequence relationships are harder to discover • The resulting undersampling of small superfamilies leads to an unwarranted reduction in measured homology

  16. Standard Bootstrap (CVE plot) • Thick lines represent the original CVE curves • Lighter areas represent bootstrapped replicas

  17. Bayesian Bootstrap • Prior assumption: all sequences are equally probable • Sample sequences with replacement • Combine the prior with the sample likelihood via Bayes' theorem • The result is a Dirichlet posterior distribution on the fraction of the original population that each sampled sequence represents
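Under the uniform prior, the resulting replica weights are a draw from a flat Dirichlet distribution; a minimal sketch (assuming NumPy, not the authors' code):

```python
import numpy as np

def bayesian_bootstrap_weights(n, rng=None):
    """Draw one replica's sequence weights from a flat Dirichlet(1, ..., 1).

    Each weight is the posterior fraction of the original population that the
    corresponding sequence represents; all weights are positive and sum to 1,
    so no sequence (and hence no small superfamily) ever drops out entirely.
    """
    rng = rng or np.random.default_rng()
    return rng.dirichlet(np.ones(n))

weights = bayesian_bootstrap_weights(2182)  # one replica for the 2182-sequence test set
```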

  18. Bayesian Bootstrap • With non-integer weights, smaller superfamilies do not drop out • Bias is eliminated

  19. Comparison of Bootstrap Methods (figure) • Standard bootstrap replicas show overpredicted variance • Replicas are skewed left (toward lower coverage)

  20. Normalization Formulas (shown as figure) • wi – weight of the query sequence • wj – weight of the target sequence • S – number of superfamilies • *Sums over s include only the weights of sequences in the query's superfamily
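The formulas on this slide appear only as an image, so the sketch below is a hedged reconstruction from the legend: it assumes the linear normalization of slide 11 generalizes by dividing each correct match's weight wi·wj by the summed weight of the query's true homologs, which reduces to 1/(n(s−1)) when every weight is 1/n. Treat the exact form as an assumption:

```python
# Hedged reconstruction of weighted linear normalization (an assumption based
# on the slide's legend, not a transcription of the missing formulas).
# w[i]      -- Bayesian-bootstrap weight of sequence i
# family[i] -- superfamily label of sequence i
# found     -- set of (query, target) pairs accepted as homologs at a threshold

def weighted_linear_coverage(w, family, found):
    n = len(w)
    coverage = 0.0
    for i in range(n):
        # Summed weight of the query's true homologs (its superfamily, minus itself).
        homolog_weight = sum(w[k] for k in range(n)
                             if family[k] == family[i] and k != i)
        if homolog_weight == 0.0:
            continue  # singleton superfamily: no true relations to find
        for j in range(n):
            if j != i and family[j] == family[i] and (i, j) in found:
                coverage += w[i] * w[j] / homolog_weight
    return coverage
```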

  21. Goals • Detail an improved statistical benchmark of pairwise protein sequence comparison algorithms • Apply the analysis to a comparative study of amino acid substitution matrix families

  22. Example of Methodology • Compare the performance of four matrix families • Choose a single matrix from each family based on its ability to align remote homologs, since close homologs are matched relatively easily • Use the SSEARCH algorithm • An implementation of the Smith-Waterman algorithm

  23. Example of Methodology • Matrix families tested: • Classic PAM matrices (Dayhoff) • VTML (variable time maximum likelihood) matrices • Based on the Dayhoff model • Trained on a large set of diverse homologs • Standard BLOSUM matrices • Derived from the BLOCKS 5 database of protein sequence alignments • BLOCKS 13+ BLOSUM matrices • The BLOSUM family reparameterized using the BLOCKS 13+ database, which contains many more sequences than the standard BLOCKS 5

  24. Selecting Matrices • Vary the matrix number in each family • PAM: 10 → 310 • VTML: 50 → 350 • BLOSUM: 30 → 100 • BLOCKS 13+ BLOSUM: 40 → 100 • Vary gap open (5 → 20) and extension (1 → 3) penalties

  25. Coverage vs. Matrix Number & Gap Parameters (figure) • Global maxima marked

  26. Optimal Parameters • Parameters are optimized for coverage on the training dataset under linear normalization at 0.01 errors per query (EPQ)
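Slides 24–26 amount to a grid search over matrix number and gap penalties, keeping the combination with the highest training-set coverage at 0.01 EPQ. A minimal sketch of that loop; coverage_at_epq is a hypothetical helper standing in for running SSEARCH and scoring the resulting CVE curve:

```python
# Sketch of the parameter grid search described on slides 24-26.
# `coverage_at_epq(matrix, gap_open, gap_extend, epq)` is a hypothetical
# helper that would run SSEARCH on the training set with the given matrix
# and gap penalties and return coverage at the given error-per-query level.

def select_parameters(matrices, coverage_at_epq, epq=0.01):
    best = None
    for matrix in matrices:                   # e.g. BLOSUM30 ... BLOSUM100
        for gap_open in range(5, 21):         # gap-open penalty 5..20
            for gap_extend in range(1, 4):    # gap-extension penalty 1..3
                cov = coverage_at_epq(matrix, gap_open, gap_extend, epq)
                if best is None or cov > best[0]:
                    best = (cov, matrix, gap_open, gap_extend)
    return best
```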

  27. Coverage Results (table shown as figure) • *Standard deviation from bootstrapping the test datasets

  28. Composite CVE (figure) • Curves shown for BLOSUM, VTML, PAM and BLOCKS 13+ BLOSUM under the Bayesian bootstrap • Thick lines are the original curves • Thin lines are bootstrap replicas

  29. Bootstrap Distributions (figure) • Replica coverage distributions for BLOSUM, VTML, PAM and BLOCKS 13+ BLOSUM at 0.01 EPQ • PAM is worse than the others

  30. Statistic of Interest • The statistic of interest is the mean difference in coverage • This is different from the difference in mean coverage, and does not underestimate statistical significance • Each data replica is scored under every parameter setting, so results from a single replica are correlated across the methods being compared • This avoids generating independent replicas for each method

  31. Mean Difference in Coverage (figure) • Most extreme cases shown • VTML and BLOSUM are nearly equivalent • BLOCKS 13+ BLOSUM and PAM differ significantly

  32. Test for Statistical Significance • Compute the Z-statistic (mean difference in coverage divided by its standard deviation across replicas) • If |Z| > 1.96, reject the hypothesis that the methods possess equivalent performance at 95% confidence
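Slides 30–32 score every bootstrap replica under both methods being compared, take the per-replica (paired) difference in coverage, and test whether its mean is zero. A minimal sketch, assuming cov_a and cov_b hold coverage values computed on the same replicas:

```python
import numpy as np

def z_statistic(cov_a, cov_b):
    """Z-score for the mean difference in coverage between two methods.

    cov_a, cov_b -- arrays of coverage values, one per bootstrap replica,
    where element k of each array was computed on the *same* replica, so
    the per-replica differences preserve the correlation between methods.
    """
    diff = np.asarray(cov_a) - np.asarray(cov_b)   # paired differences
    return diff.mean() / diff.std(ddof=1)          # mean difference / its std. dev.

# |z| > 1.96 rejects equivalent performance at the 95% confidence level.
```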

  33. Z-Scores for Distribution Pairs (table shown as figure) • *Statistics are of the form (difference in mean coverage) / (mean difference in coverage); the latter is the statistic of interest

  34. Summation • PAM is least effective • It was trained on a small collection of relatively close homologs • VTML performs better than PAM but not as well as BLOSUM or BLOCKS 13+ BLOSUM • A result of training on larger sets of remote homologs

  35. Summation • Bayesian Bootstrap • Estimates statistics in database homology search • Lacks the anomalies introduced by the standard bootstrap • Why? • The Bayesian bootstrap does not underrepresent small superfamilies

  36. Alternative to Bayesian Bootstrap • Resample entire superfamilies • Because the largest superfamilies contain the majority of the inter-sequence relations, this leads to noisy replica ensembles • Assumes transitivity • Groups sequences into non-overlapping families • Many interesting datasets are not transitive (e.g. multi-domain protein sequences) • Not universally applicable

  37. Why Accept Bayesian Bootstrap? • Researchers predict most undiscovered superfamilies contain few sequences • Bias inherent to standard bootstrap will be exacerbated as sequence databases grow • Bayesian bootstrap provides better resampling without increasing computational or conceptual complexity
