1 / 38

Precomputing Edit-Distance Specificity of Short Oligonucleotides

Precomputing Edit-Distance Specificity of Short Oligonucleotides. Nathan Edwards Center for Bioinformatics and Computational Biology University of Maryland, College Park. Polymerase Chain Reaction. Polymerase Chain Reaction. Primer Specificity.

temima
Download Presentation

Precomputing Edit-Distance Specificity of Short Oligonucleotides

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Precomputing Edit-Distance Specificity of Short Oligonucleotides Nathan Edwards Center for Bioinformatics and Computational Biology University of Maryland, College Park

  2. Polymerase Chain Reaction

  3. Polymerase Chain Reaction

  4. Primer Specificity • Need to ensure that primers hybridize to a specific (specified) locus only • Require exactly one occurrence of specified sequence • Require no (potential) mis-hybridization loci • Bottleneck computation in primer-design • Design / check iteration is problematic

  5. k-unique 20-mers • Edit-distance as a surrogate for mis-hybridization potential • k-unique loci: • All non-self genomic loci are require more than k edits in (global) alignment • Closest non-self genomic loci requires (k+1) edits in (global) alignment

  6. Find all k-unique 20-mers • Naïve algorithm: O(n2km) • Quadratic in size of genome. • 0-unique (exact match) 20-mers • (Expected) linear time algorithm • Achieve expected linear time using a hybrid approach (blastn): • Use partial exact match to “seed” expensive dynamic programming alignment • Large chunks ) Fast, but miss occurrences • Small chunks ) Slow, but correct

  7. Inexact sequence match Baeza-Yates Perleberg: • Correct and O(n) for small k • At least 1 chunk is observed with no error. • Small k → Large chunks → Fast and correct • Largest correct chunk: floor(m/(k+1)) g ≠ = ≠ q

  8. Example worst case alignments TCCCGC-TAGATTGAGATCT ||||||v||||||*|||||| TCCCGCCTAGATTTAGATCT ACTTGTCCACAGTGCTTAAG ||||||*||||||*|||||| ACTTGTGCACAGTCCTTAAG

  9. AA:18 AC:1,9 AG:11,19 CA:8,10 CC:14 CT:2,15 GC:7 GT:5,12 TA:17 TC:13 TG:4,6 TT:3,16 Brute-force approach ACTTGTGCACAGTCCTTAAG 2-mer position table

  10. Brute-force approach ACTTGTGCACAGTCCTTAAG ACTTGTGCACAGTCCTTAAG

  11. Brute-force approach ACTTGTGCACAGTCCTTAAG ACTTGTGCACAGTCCTTAAG

  12. Brute-force approach ACTTGTGCACAGTCCTTAAG ACTTGTGCACAGTCCTTAAG

  13. Brute-force approach ACTTGTGCACAGTCCTTAAG ACTTGTGCACAGTCCTTAAG

  14. Brute-force approach ACTTGTGCACAGTCCTTAAG ACTTGTGCACAGTCCTTAAG

  15. Brute-force approach ACTTGTGCACAGTCCTTAAG ACTTGTGCACAGTCCTTAAG

  16. Brute-force approach Divide the genome into 10 Mb blocks For all pairs of blocks: For all l-mer matches: Do all pair-wise DPs containing match If ≤ k edits, mark position non-unique 300 x 300 pairs of blocks For 20-mers: k=1 )l=10; k=2 )l=6; k=3 )l=5 ; k=4 )l=4.

  17. Brute-force approach Things are looking really, really, bad: • Seeds are too short • 90,000 pair-wise block comparisons Actually quite good (seed size 12): • Non-uniqueness certificates are dense • Almost all positions eliminated early • Behaves more like linear time than quadratic

  18. In practice (edit-dist 4)

  19. In practice (edit-dist 4)

  20. In practice (edit-dist 4)

  21. In practice (edit-dist 3)

  22. In practice (edit-dist 3)

  23. In practice (edit-dist 4,3,2)

  24. In practice (edit-dist 4,3,2)

  25. Edit distance 2 • After seed size 12 • ~ 27K (0.288%) positions have no match • After seed size 8 • ~ 3K (0.029%) positions have no match • Using seed size 6 is still too slow • Need a more sophisticated hashing strategy • 6-mers match in too many places!

  26. Spaced seed-set design problem • Given: • mer-size: m ( = 20 ) • # errors: k ( = 1,2,3) • # cares: l ( = 10,12,14 ) • Find the smallest set of spaced seeds that will find all alignments.

  27. Solution for (20,2,8) • 11111111, 111101111 TCCCGCGTAGATTGAGATCT ||||||*||||||*|||||| TCCCGCCTAGATTTAGATCT • How can we find these spaced seed set solutions?

  28. Spaced seed set design set-cover formulation • Set cover instance: • Ground set: all possible placements of the k errors (alignments) • Covering sets: all possible placements of the l care positions • For (m=20,k=2,l=10), • 190 elements, 184,756 sets! • Need to reduce the number of sets!

  29. Dirty secret of spaced seeds • Spaced seeds take O(# cares) to update! • Contiguous seeds are O(1) to update • 101010101010101 vs 11111111 • 8 steps to update vs 1 step to update • Constant time update for spaced seeds? • Yes, if they have a certain structure

  30. O(1) spaced seed update ACGTACGTACGTACGTACGT A G A G C T C T G A G A T C T C ... Spaced seed 1010101 can be updated in 1 step!

  31. O(1) spaced seed update • “Periodic” spaced seeds can be updated in “constant” time • 11011011011 2 steps • 11001100110011 2 steps • 1000010000100001 1 step • Need to minimize the number of update steps, not the number of templates • 11111111,111101111 has update cost 5.

  32. Edit-distance SS-SDP • Position of matching bases might shift! • Need 11111111 ↓ to get CCGCTAGA • Need 111101111 ↑ to get CCGCTAGA • Set cover formulation no longer works TCCCGC-TAGATTGAGATCT ||||||v||||||*|||||| TCCCGCCTAGATTTAGATCT

  33. Edit-Distance SS-SDP • Use a variation on set cover: • q:111101111,r:11111111 covers: • Pay for query & reference update costs separately • Control size of problem by only enumerating templates with small update cost r:TCCCGC-TAGATTGAGATCT ||||||v||||||*|||||| q:TCCCGCCTAGATTTAGATCT

  34. Solution for (20,2,10) Query Templates: 1: 11111111110000000000 Cost: 1 2: 11111011111000000000 Cost: 5 27: 11111000001111100000 Cost: 5 42: 11111000000001111100 Cost: 5 Text Templates: 1: 11111111110000000000 Cost: 1 2: 11111011111000000000 Cost: 5 32: 11111000000111110000 Cost: 5 37: 11111000000011111000 Cost: 5 42: 11111000000001111100 Cost: 5 Pairs of templates: 1: 11111111110000000000 1: 11111111110000000000 Covers: 1274 2: 11111011111000000000 1: 11111111110000000000 Covers: 260 2: 11111011111000000000 2: 11111011111000000000 Covers: 1218 1: 11111111110000000000 2: 11111011111000000000 Covers: 309 42: 11111000000001111100 32: 11111000000111110000 Covers: 42 27: 11111000001111100000 32: 11111000000111110000 Covers: 319 42: 11111000000001111100 37: 11111000000011111000 Covers: 186 27: 11111000001111100000 37: 11111000000011111000 Covers: 51 42: 11111000000001111100 42: 11111000000001111100 Covers: 287

  35. k-unique human 20-mers • No 4-unique 20-mers • No 3-unique 20-mers • 0. 038% of (forward) human 20-mers are 2-unique • 1088322 in total • about 1 every 2638 bases • Fast 2-uniquness oracle

  36. F. tularensis 20-mer signatures • Exact match in all six strains • No match to bacterial background at edit-distance k • No 3-unique 20-mer signatures • 263 2-unique 20-mer signatures • 0.013% • 1.3M 20-mer signatures (no background check) • 1.2M 0-unique 20-mer signatures • 580K 1-unique 20-mer signatures

  37. Conclusions • Precompute of human k-unique 20-mers is now feasible! • Faster for large edit-distance! • Need spaced seed-set designs • Constant time update for spaced seeds • Good integer programming formulation of SS-SDP • Limited template enumeration based on update cost • Work with integer programming experts to solve effectively

  38. Next Steps • Publish! • Adapt for Tm and/or hybridization model • Convert to native BOINC-application • Integrate with primer-design software

More Related