
Using Traveling Salesman Problem Algorithms to Determine Multiple Sequence Alignment Orders

Weiwei Zhong. Topics: Background, Algorithm Design, Test Results.


Presentation Transcript


  1. Using Traveling Salesman Problem Algorithms to Determine Multiple Sequence Alignment Orders. Weiwei Zhong

  2. Topics • Background • Algorithm Design • Test Results

  3. Background: Definitions

  4. What is a Sequence Alignment? • Given: • 2 or more sequences • a scoring scheme (match score, mismatch score, gap penalty) • Insert gaps in each sequence so that: • all sequences have the same length • the total pairing score is maximized

  5. Scoring Matrix • Simplified scoring: match = 2, mismatch = -1, gap penalty = -2 • In practice: a scoring matrix
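The simplified scoring scheme can be applied to an alignment that has already been gapped. The sketch below is illustrative only: the function name `alignment_score` is hypothetical, and it assumes that any column containing a gap costs one gap penalty.

```python
def alignment_score(row_a, row_b, match=2, mismatch=-1, gap=-2):
    """Score two already-aligned rows column by column (simplified scheme)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            total += gap        # assumption: one penalty per gapped column
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


print(alignment_score("FGK-GKG", "FGKFGKG"))  # six matches and one gap
```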

  6. Global vs. Local Alignments • Global: entire lengths of sequences
       F G K - G K G
       F G K F G K G
     • Local: regions of sequences
       - - - F G K G K G
       F G K F G K G - -

  7. Pairwise Alignment vs. Multiple Sequence Alignment (MSA) • Pairwise: 2 sequences
       F G K - G K G
       F G K F G K G
     • MSA: more than 2 sequences
       F G K - G K G
       F G K F G K G
       - G K Q G K G
       - - K F G K G

  8. Background: Basic Dynamic Programming

  9. Dynamic Programming Algorithm for Pairwise Alignments • Two sequences: GAATTC and GGATC • Scoring scheme: match = 2, mismatch = -1, gap penalty = -2 • 1. Initialization [Table: G A A T T C across the top, G G A T C down the side]

  10. 2. Table fill • $M_{i,j} = \max\{\, M_{i-1,j-1} + S(c_i, c_j),\; M_{i,j-1} + g,\; M_{i-1,j} + g \,\}$ • Scoring scheme: match = 2, mismatch = -1, gap g = -2
            G  A  A  T  T  C
        G   2  0 -1 -1 -1 -1
        G   2  1 -1 -2 -2 -2
        A   0  4  3  1 -1 -3
        T  -1  2  3  5  3  1
        C  -1  0  1  3  4  5

  11. 3. Trace back
               G  A  A  T  T  C
            0  0  0  0  0  0  0
        G   0  2  0 -1 -1 -1 -1
        G   0  2  1 -1 -2 -2 -2
        A   0  0  4  3  1 -1 -3
        T   0 -1  2  3  5  3  1
        C   0 -1  0  1  3  4  5
      Resulting alignment:
        G A A T T C
        |   |   | |
        G G A - T C
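The three steps (initialization, table fill, trace back) can be sketched in a few lines of Python. This is a minimal illustration rather than the presenter's code: the name `align_pair` is hypothetical, the first row and column are initialized to 0 as in the table on the slide, and ties in the trace back are broken in favor of the diagonal move, which is an assumption.

```python
def align_pair(a, b, match=2, mismatch=-1, gap=-2):
    """Fill the DP table (a across the top, b down the side) and trace back."""
    rows, cols = len(b) + 1, len(a) + 1
    # 1. Initialization: first row and first column are 0, as on the slide.
    m = [[0] * cols for _ in range(rows)]
    # 2. Table fill: best of the diagonal, left and upper neighbours.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if b[i - 1] == a[j - 1] else mismatch
            m[i][j] = max(m[i - 1][j - 1] + s, m[i][j - 1] + gap, m[i - 1][j] + gap)
    # 3. Trace back from the bottom-right cell (diagonal preferred on ties).
    i, j = rows - 1, cols - 1
    top, bottom = [], []
    while i > 0 and j > 0:
        s = match if b[i - 1] == a[j - 1] else mismatch
        if m[i][j] == m[i - 1][j - 1] + s:
            top.append(a[j - 1]); bottom.append(b[i - 1]); i -= 1; j -= 1
        elif m[i][j] == m[i][j - 1] + gap:
            top.append(a[j - 1]); bottom.append("-"); j -= 1
        else:
            top.append("-"); bottom.append(b[i - 1]); i -= 1
    while j > 0:  # leftover prefix of a, padded with gaps
        top.append(a[j - 1]); bottom.append("-"); j -= 1
    while i > 0:  # leftover prefix of b, padded with gaps
        top.append("-"); bottom.append(b[i - 1]); i -= 1
    return m, "".join(reversed(top)), "".join(reversed(bottom))


table, row_a, row_b = align_pair("GAATTC", "GGATC")
print(table[-1][-1], row_a, row_b)
```

For the two example sequences this reproduces the table values and the alignment shown on the slide.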

  12. Multidimensional Dynamic Programming for MSA • For n strings of length L each, the running time is $O(L^n)$. • Impractical beyond about 5-7 proteins of 200-300 residues each.
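To get a feel for the growth, take illustrative values from the ranges on the slide, L = 250 and n = 6 (these specific numbers are chosen here, not given in the presentation), and compare the number of table cells with the pairwise case:

```latex
\[
L^n = 250^6 = (2.5 \times 10^2)^6 \approx 2.4 \times 10^{14}
\qquad\text{versus}\qquad
L^2 = 250^2 = 6.25 \times 10^4 .
\]
```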

  13. Topics • Background • Algorithm Design • Test Results

  14. Algorithm Design: An MSA Heuristic

  15. Feng-Doolittle Progressive Alignment • 1. Align 2 of the sequences, S_i and S_j • 2. Align a 3rd sequence S_k to the alignment of S_i and S_j • 3. Repeat 2 until all sequences are aligned • Scoring a residue against an aligned column (example: column c_j holds T and A, residue c_i is S): $S(c_i, c_j) = (S(T, S) + S(A, S)) / 2$ • Running time: $O(n L^2)$
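The column score on this slide is an average of pairwise scores. A minimal sketch, assuming the simplified match/mismatch scheme from earlier slides stands in for a real scoring matrix and that the column contains no gaps (the names `simple_score` and `column_score` are hypothetical):

```python
def simple_score(x, y, match=2, mismatch=-1):
    """Stand-in for a scoring-matrix lookup (assumption: simplified scheme)."""
    return match if x == y else mismatch


def column_score(residue, column, score=simple_score):
    """Average score of one residue against every residue in an aligned column."""
    if not column:
        raise ValueError("column must contain at least one residue")
    return sum(score(residue, other) for other in column) / len(column)


# The slide's example: residue S against a column holding T and A.
print(column_score("S", ["T", "A"]))
```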

  16. Features of Feng-Doolittle Algorithm • Once a gap, always a gap • Early mistakes cannot be corrected • Alignment order is important. Two possible results for the same sequences:
       x: G A A G T T
       y: G A - C T T
       z: G A A C T G

       x: G A A G T T
       y: G A C - T T
       z: G A A C T G

  17. Algorithm Design: TspMsa, First Version

  18. Traveling Salesman Problem (TSP) • Given: • n nodes • distances for each pair of nodes • Find a round trip that: • visits each node exactly once • has minimal total length • NP-hard (its decision version is NP-complete) • Well studied
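For a handful of nodes the definition can be checked directly by trying every round trip. The sketch below is an exhaustive search for illustration only; the presentation does not say which TSP algorithm TspMsa uses, and the names `tour_length` and `brute_force_tsp` are hypothetical.

```python
from itertools import permutations


def tour_length(tour, dist):
    """Total length of the round trip, including the edge back to the start."""
    return sum(dist[tour[k]][tour[(k + 1) % len(tour)]] for k in range(len(tour)))


def brute_force_tsp(dist):
    """Try every round trip starting at node 0; only feasible for very small n."""
    n = len(dist)
    if n == 0:
        return []
    candidates = ([0, *rest] for rest in permutations(range(1, n)))
    return min(candidates, key=lambda tour: tour_length(tour, dist))


# Four nodes on a square: neighbours are 1 apart, opposite corners 2 apart.
square = [
    [0, 1, 2, 1],
    [1, 0, 1, 2],
    [2, 1, 0, 1],
    [1, 2, 1, 0],
]
best = brute_force_tsp(square)
print(best, tour_length(best, square))
```

The search visits (n-1)! tours, which is why practical TSP solvers rely on heuristics or branch-and-bound methods instead.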

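  19. TspMsa: Algorithm Design • 1. Calculate pairwise distances between the sequences (nodes 0-4 in the example) • 2. Determine a TSP tour through the nodes • 3. Use the tour as the alignment order • 4. Run the Feng-Doolittle alignment in that order [Figure: five nodes, 0-4, shown first with their pairwise distances and then connected by a single tour]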

  20. Starting Point and Direction of TSP Tour [Figure: TSP tour over the 23 sequences (nodes 0-22) of data set kinase_ref3; the values labeled along the tour range from 0.603 to 0.772]
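A tour is a cycle, so turning it into a linear alignment order requires choosing where to start and which way to go. A small sketch of that enumeration (the function name `linear_orders` is hypothetical):

```python
def linear_orders(tour):
    """List every linear order obtained by cutting a cyclic tour at each
    starting node and walking it in either direction."""
    n = len(tour)
    orders = []
    for start in range(n):
        orders.append([tour[(start + k) % n] for k in range(n)])  # forward
        orders.append([tour[(start - k) % n] for k in range(n)])  # backward
    return orders


for order in linear_orders([0, 1, 2]):
    print(order)
```

A tour over n nodes gives 2n candidate orders, so the 23-sequence data set has 46 of them.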

  21. Algorithm Design: TspMsa, Modified Design

  22. TspMsa: Modified Algorithm Design • 1. Calculate pairwise distances • 2. Determine a TSP tour • 3. Align closest nodes • 4. One node left? If no, repeat; if yes, end [Figure: example with nodes 0-4 and labeled distances 1, 15, 24, 38, 67; the groups (1, 0), (2, 4), (3, 1, 0) and finally (3, 1, 0, 2, 4) are formed in turn]
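One possible reading of this flowchart is to repeatedly contract the shortest remaining edge of the tour, so that each contraction corresponds to aligning two neighbouring groups. The sketch below follows that reading and is not the presenter's implementation. It assumes tour edge lengths stay fixed after a merge, and the edge lengths in the example are arranged hypothetically so that the first merges mirror the groups in the slide's figure.

```python
def merge_along_tour(tour, edge_lengths):
    """Contract the shortest tour edge until one group is left.

    edge_lengths[k] is the length of the edge between tour[k] and
    tour[(k + 1) % len(tour)]. Returns the merge steps and the final group.
    """
    groups = [[node] for node in tour]
    edges = list(edge_lengths)
    steps = []
    while len(groups) > 1:
        k = min(range(len(edges)), key=edges.__getitem__)  # shortest edge
        nxt = (k + 1) % len(groups)
        left, right = groups[k], groups[nxt]
        steps.append((left, right))
        merged = left + right
        if nxt == 0:                       # wrap-around edge: last group + first
            groups = [merged] + groups[1:k]
            edges = edges[:k]
        else:
            groups[k:nxt + 1] = [merged]
            del edges[k]
    return steps, groups[0]


# Hypothetical tour 3-1-0-2-4-3 with edge lengths 24, 1, 67, 15, 38.
steps, final_group = merge_along_tour([3, 1, 0, 2, 4], [24, 1, 67, 15, 38])
for left, right in steps:
    print(left, "+", right)
```

The order of sequences inside the final group is an artifact of this sketch; only the sequence of merges is meant to be informative.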

  23. Modified Algorithm is Better • Alignment order for Kinase_ref3: 6 8 10 9 0 1 4 2 3 18 17 15 16 11 12 13 22 21 20 19 5 7 14 • Original TspMsa: 0.603 (worst) to 0.772 (best) • Modified TspMsa: 0.836

  24. Topics • Background • Algorithm Design • Test Results

  25. Test Results: What to Compare With?

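  26. Existing MSA Programs • Progressive and iterative programs: clustalw, saga, multal, prrp, multalign, pileup, poa, hmmt [Figure: programs placed along two axes, less computation time and better quality, with the extremes marked "Fast" and "best quality"]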

  27. CLUSTALW • 1. Calculate pairwise distances • 2. Derive a guide tree by the Neighbor Joining method: • choose the 2 closest nodes i and j, derive an internal node x • $r_i = (\sum_k d_{ik}) / (n-2)$ • $d_{ix} = (d_{ij} + r_i - r_j) / 2$ • $d_{jx} = d_{ij} - d_{ix}$ • $d_{xm} = (d_{im} + d_{jm} - d_{ij}) / 2$ • repeat until one node is left at the center [Figure: nine nodes, 1-9, joined step by step into a tree]
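A single joining step with these formulas can be sketched as follows. The slide says only "choose 2 closest nodes"; the sketch uses the standard Neighbor Joining selection rule, which picks the pair minimizing d_ij - r_i - r_j, and that choice is an assumption about what the slide abbreviates. The function name `nj_join_step` is hypothetical.

```python
def nj_join_step(dist):
    """Pick one pair (i, j) to join and compute the branch lengths to the new
    internal node x, plus the distances from x to every other node m."""
    n = len(dist)
    if n < 3:
        raise ValueError("need at least 3 nodes")
    # r_i = (sum_k d_ik) / (n - 2)
    r = [sum(row) / (n - 2) for row in dist]
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    # Assumption: standard NJ criterion rather than the raw smallest distance.
    i, j = min(pairs, key=lambda p: dist[p[0]][p[1]] - r[p[0]] - r[p[1]])
    d_ix = (dist[i][j] + r[i] - r[j]) / 2
    d_jx = dist[i][j] - d_ix
    d_xm = {m: (dist[i][m] + dist[j][m] - dist[i][j]) / 2
            for m in range(n) if m not in (i, j)}
    return (i, j), d_ix, d_jx, d_xm


# Distances read off a small tree: leaves 0 and 1 hang off one internal node
# (branches 1 and 2), leaves 2 and 3 off another (branches 3 and 4), and the
# two internal nodes are 5 apart.
dist = [
    [0, 3, 9, 10],
    [3, 0, 10, 11],
    [9, 10, 0, 7],
    [10, 11, 7, 0],
]
print(nj_join_step(dist))
```

On this tree-like input the step recovers the branch lengths 1 and 2 for the first joined pair.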

  28. CLUSTALW • 3. Progressively align all sequences following the guide tree • Weighted sequences, for example: • sequence 1: p e e k s a v t a l • sequence 2: g e e k a a v l a l • sequence 3: e g e w q l v l h v • Without weights: Score = [S(t,v) + S(l,v)] / 2 • With weights: Score = [S(t,v)*w1*w3 + S(l,v)*w2*w3] / 2 • 2 gap penalty values: opening, extension • Dynamically changes the gap penalty and the scoring matrix
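The weighted score generalizes to two aligned columns of any size: each pairwise score is multiplied by both sequence weights, and the sum is divided by the number of pairs. A minimal sketch, with the simplified match/mismatch scheme standing in for a real scoring matrix and made-up weight values (the names `simple_score` and `weighted_column_score` are hypothetical):

```python
def simple_score(x, y, match=2, mismatch=-1):
    """Stand-in for a scoring-matrix lookup (assumption: simplified scheme)."""
    return match if x == y else mismatch


def weighted_column_score(col_a, weights_a, col_b, weights_b, score=simple_score):
    """Average of S(x, y) * w_x * w_y over all residue pairs of two columns."""
    total = 0.0
    pairs = 0
    for x, wx in zip(col_a, weights_a):
        for y, wy in zip(col_b, weights_b):
            total += score(x, y) * wx * wy
            pairs += 1
    if pairs == 0:
        raise ValueError("both columns must be non-empty")
    return total / pairs


# The slide's column: t (sequence 1) and l (sequence 2) against v (sequence 3).
# The weights w1 = 0.5, w2 = 1.0, w3 = 2.0 are invented for illustration.
print(weighted_column_score("tl", [0.5, 1.0], "v", [2.0]))
```

With all weights set to 1 the function reduces to the unweighted formula on the slide.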

  29. POA • 1. Convert sequences to partial order graphs
       E T - - P K M I V R
       E T T H - K M L V R
     [Figure: partial order graphs built from example sequences]

  30. POA • 2. Align 2 sequences • 3. Align one sequence to the current group • 4. Repeat 3 until all sequences are aligned [Figure: a sequence being aligned to a partial order graph]

  31. Test Results: Quality Evaluation

  32. BAliBASE Benchmark • Reference 1: equidistant sequences with various levels of similarity: • < 25% sequence identity • 20-40% sequence identity • > 35% sequence identity • Reference 2: closely related sequences with a highly divergent “orphan” sequence. • Reference 3: subgroups with < 25% identity between groups. • Reference 4: sequences with N/C-terminal extensions. • Reference 5: sequences with internal insertions.

  33. Reference 1: Sequences with < 25% Identity [Charts: all test scores and average score for short, medium, and long sequences]

  34. Reference 1: Sequences with 20-40% Identity [Charts: all test scores and average score for short, medium, and long sequences]

  35. Reference 1: Sequences with > 35% Identity [Charts: all test scores and average score for short, medium, and long sequences]

  36. Reference 2 [Charts: all test scores and average score for short, medium, and long sequences]

  37. Reference 3 [Charts: all test scores and average score for short, medium, and long sequences]

  38. Reference 4 and Reference 5 [Charts: all test scores and average score for Reference 4 and Reference 5]

  39. Alignment Quality Comparison • TspMsa and POA: TspMsa better • TspMsa and CLUSTALW: comparable • Reference 1, < 25% identity: Similar * • Reference 1, 20-40% identity: Similar * • Reference 1, > 35% identity: Similar • Reference 2: Similar * • Reference 3: TspMsa better • Reference 4: CLUSTALW better • Reference 5: Similar • * CLUSTALW slightly better for short sequences.

  40. Test Results: Execution Time Evaluation

  41. Fast Mode TspMsa • Most time-consuming step: pairwise distance calculations • Slow mode: full dynamic programming (accurate) • Fast mode: a fast approximate method (heuristic)
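The slide does not name the fast heuristic. As a stand-in, the sketch below uses a k-mer (k-tuple) set comparison, a common family of alignment-free approximations; it is not claimed to be what TspMsa does, and the function name `kmer_distance` and the default `k=2` are assumptions.

```python
def kmer_distance(a, b, k=2):
    """Approximate distance: 1 minus the Jaccard similarity of the k-mer sets.

    Stand-in heuristic; it never builds a dynamic programming table."""
    kmers_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    kmers_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    union = kmers_a | kmers_b
    if not union:
        return 0.0  # both sequences shorter than k: treat as identical
    return 1.0 - len(kmers_a & kmers_b) / len(union)


print(kmer_distance("GAATTC", "GGATC"))
```

Each sequence is scanned once, so the cost per pair grows linearly with sequence length instead of with the product of the two lengths.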

  42. Quality Impact of the Fast Mode

  43. Execution Time Evaluation: CLUSTALW and TspMsa in fast mode

  44. Conclusions • Quality: • Slow mode: close to CLUSTALW (slow mode), better than POA • Fast mode (not as good as slow mode): comparable to CLUSTALW (fast mode), better than POA • Speed: • Fast mode: faster than CLUSTALW (fast mode), comparable to POA

  45. Acknowledgement • Dr. Robert Robinson • Dr. Russell Malmberg • Dr. Eileen Kraemer • Computer Science Department
