
Testing sequence comparison methods with structure similarity




Presentation Transcript


  1. Testing sequence comparison methods with structure similarity @ Organon, Oss 2006-02-07 Tim Hulsen

  2. Introduction • Main goal: transfer the function of proteins in model organisms to proteins in humans • Make use of “orthology”: proteins in different species that evolved from a common ancestor (and therefore have very similar functions!) • Several ortholog identification methods exist, relying on: • Sequence comparisons • (Phylogenies)

  3. Introduction • Quality of ortholog identification depends on: 1.) Quality of the sequence comparison algorithm: - Smith-Waterman vs. BLAST, FASTA, etc. - Z-value vs. E-value 2.) Quality of the ortholog identification itself (phylogenies, clustering, etc.) • Point 2 was addressed in previous research; point 1 is the subject of this presentation

  4. Previous research • Comparison of several ortholog identification methods • Orthologs should have similar functions • Functional data of orthologs should behave similarly: • Gene expression data • Protein interaction data • Interpro IDs • Gene order

  5. Orthology method comparison • Compared methods: • BBH, Best Bidirectional Hit • INP, InParanoid • KOG, euKaryotic Orthologous Groups • MCL, OrthoMCL • PGT, PhyloGenetic Tree • Z1H, Z-value > 1 Hundred

  6. Orthology method comparison • e.g. correlation in expression profiles • Affymetrix human and mouse expression data, using the SNOMED tissue classification • Check whether the expression profile of a protein is similar to the expression profile of its ortholog (diagram: Hs and Mm expression profiles)

  7. Orthology method comparison

  8. Orthology method comparison • e.g. conservation of protein interaction • DIP (Database of Interacting Proteins) • Check whether the orthologs of two interacting proteins still interact in the other species -> calculate the conserved fraction (diagram: an interacting Hs protein pair mapped to its Mm ortholog pair)

  9. Orthology method comparison

  10. Orthology method comparison

  11. Orthology method comparison • Trade-off between sensitivity and selectivity • BBH and INP are the most sensitive but also the most selective • Results can differ depending on which sequence comparison algorithm is used: - BLAST, FASTA or Smith-Waterman? - E-value or Z-value?

  12. E-value or Z-value? • Smith-Waterman with Z-value statistics: 100 randomized shuffles to test the significance of the SW score • e.g. original sequence MFTGQEYHSV; shuffle 1: GQHMSVFTEY, shuffle 2: YMSHQFTVGE, etc. • (Diagram: histogram of SW scores (# seqs vs. SW score) for the randomized sequences; if the original score lies 5 standard deviations above the mean of the shuffled scores, Z = 5)

  13. E-value or Z-value? • Z-value calculation takes much time (2 x 100 randomizations) • Comet et al. (1999) and Bastien et al. (2004) argue that the Z-value is theoretically more sensitive and more selective than the E-value • However, the advantage of the Z-value has never been proven by experimental results

  14. How to compare? • Structural comparison is more reliable than sequence comparison • ASTRAL SCOP: Structural Classification Of Proteins • e.g. a.2.1.3, c.1.2.4; same numbers ~ same structure • Use the structural classification as a benchmark for the sequence comparison methods (see the sketch below)

  15. ASTRAL SCOP statistics

  16. Methods (1) • Smith-Waterman algorithms: dynamic programming; computationally intensive • Paracel with e-value (PC E): • SW implementation of Paracel • Biofacet with z-value (BF Z): • SW implementation of Gene-IT • ParAlign with e-value (PA E): • SW implementation of Sencel • SSEARCH with e-value (SS E): • SW implementation in the FASTA package (see next page)

  17. Methods (2) • Heuristic algorithms: • FASTA (FA E): • Pearson & Lipman, 1988 • Heuristic approximation; performs better than BLAST on strongly diverged proteins • BLAST (BL E): • Altschul et al., 1990 • Heuristic approximation; extends local alignments (HSPs) into a global alignment • Should be faster than FASTA

  18. Method parameters • all: • matrix: BLOSUM62 • gap open penalty: 12 • gap extension penalty: 1 • Biofacet with z-value: 100 randomizations

  19. Receiver Operating Characteristic • R.O.C.: statistical measure, mostly used in clinical medicine • Proposed by Gribskov & Robinson (1996) for use in sequence comparison analysis

  20. ROC50 Example • Take the 100 best hits • True positives: in the same SCOP family; false positives: not in the same family • For each of the first 50 false positives: count the number of true positives higher in the list (0,4,4,4,5,5,6,9,12,12,12,12,12) • Divide the sum of these numbers by the number of false positives (50) and by the total number of possible true positives (size of the family - 1) = ROC50 (0.167) • Take the average of the ROC50 scores over all entries

  21. ROC50 results

  22. Coverage vs. Error • C.V.E. = Coverage vs. Error (Brenner et al., 1998) • E.P.Q. = Errors Per Query: selectivity indicator (how many false positives?) • Coverage: sensitivity indicator (how many of the possible true positives are found?)

  23. CVE Example • Vary the threshold above which a hit is counted as a positive: e.g. e=10, e=1, e=0.1, e=0.01 • True positives: in the same SCOP family; false positives: not in the same family • For each threshold, calculate the coverage: the number of true positives divided by the total number of possible true positives • For each threshold, calculate the errors-per-query: the number of false positives divided by the number of queries • Plot coverage on the x-axis and errors-per-query on the y-axis; the bottom right is best (see the sketch below)

  24. CVE results (only PDB095)

  25. Mean Average Precision • A.P.: borrowed from information retrieval (Salton, 1991) • Recall: true positives divided by the number of homologs • Precision: true positives divided by the number of hits • A.P. = approximate integral of the area under the recall-precision curve

  26. Mean AP Example • Take the 100 best hits • True positives: in the same SCOP family; false positives: not in the same family • For each true positive: divide its true-positive rank (1,2,3,4,5,6,7,8,9,10,11,12) by its overall rank in the hit list (2,3,4,5,9,12,14,15,16,18,19,20) • Divide the sum of all these numbers by the total number of hits (100) = AP (0.140) • Take the average of the AP scores over all entries = mean AP

  27. Mean AP results

  28. Time consumption • PDB095 all-against-all comparison: • Biofacet: multiple days (Z-value calculation!) • BLAST: 2d 4h 16m • SSEARCH: 5h 49m • ParAlign: 47m • FASTA: 40m

  29. Preliminary conclusions • SSEARCH gives best results • When time is important, FASTA is a good alternative • Z-value seems to have no advantage over E-value

  30. Problems • Bias in PDB? • Sequence length • Amino acid composition • Difference in matrices? • Difference in SW implementations?

  31. Bias in PDB sequence length? -> Yes! Short sequences are over-represented in the ASTRAL SCOP PDB sets

  32. Bias in PDB aa distribution? -> No! The amino acid distribution is approximately equal across the ASTRAL SCOP PDB sets

  33. Difference in matrices?

  34. Difference in SW implementations?

  35. Conclusions • E-value better than Z-value! • The SW implementations (SSEARCH, ParAlign and Biofacet) perform more or less the same, but SSEARCH with e-value scores best of all • A larger structural comparison database is needed for better analysis

  36. Credits • NV Organon: • Peter Groenen • Wilco Fleuren • Wageningen UR: • Jack Leunissen
