precomputing edit distance specificity of short oligonucleotides n.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
Precomputing Edit-Distance Specificity of Short Oligonucleotides PowerPoint Presentation
Download Presentation
Precomputing Edit-Distance Specificity of Short Oligonucleotides

Loading in 2 Seconds...

play fullscreen
1 / 42

Precomputing Edit-Distance Specificity of Short Oligonucleotides - PowerPoint PPT Presentation


  • 53 Views
  • Uploaded on

Precomputing Edit-Distance Specificity of Short Oligonucleotides. Nathan Edwards Center for Bioinformatics and Computational Biology University of Maryland, College Park. Polymerase Chain Reaction. Polymerase Chain Reaction. Primer Specificity.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Precomputing Edit-Distance Specificity of Short Oligonucleotides' - roanna-clark


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
precomputing edit distance specificity of short oligonucleotides

Precomputing Edit-Distance Specificity of Short Oligonucleotides

Nathan Edwards

Center for Bioinformatics and Computational Biology

University of Maryland, College Park

primer specificity
Primer Specificity
  • Need to ensure that primers hybridize to a specific (specified) locus only
    • Require exactly one occurrence of specified sequence
    • Require no (potential) mis-hybridization loci
  • Bottleneck computation in primer-design
    • Design / check iteration is problematic
k unique 20 mers
k-unique 20-mers
  • Edit-distance as a surrogate for mis-hybridization potential
  • k-unique loci:
    • All non-self genomic loci are require more than k edits in (global) alignment
    • Closest non-self genomic loci requires (k+1) edits in (global) alignment
find all k unique 20 mers
Find all k-unique 20-mers
  • Naïve algorithm: O(n2km)
    • Quadratic in size of genome.
  • 0-unique (exact match) 20-mers
    • (Expected) linear time algorithm
  • Achieve expected linear time using a hybrid approach (blastn):
    • Use partial exact match to “seed” expensive dynamic programming alignment
    • Large chunks ) Fast, but miss occurrences
    • Small chunks ) Slow, but correct
inexact sequence match
Inexact sequence match

Baeza-Yates Perleberg:

  • Correct and O(n) for small k
  • At least 1 chunk is observed with no error.
  • Small k → Large chunks → Fast and correct
  • Largest correct chunk: floor(m/(k+1))

g

≠ = ≠

q

example worst case alignments
Example worst case alignments

TCCCGC-TAGATTGAGATCT

||||||v||||||*||||||

TCCCGCCTAGATTTAGATCT

ACTTGTCCACAGTGCTTAAG

||||||*||||||*||||||

ACTTGTGCACAGTCCTTAAG

brute force approach

AA:18

AC:1,9

AG:11,19

CA:8,10

CC:14

CT:2,15

GC:7

GT:5,12

TA:17

TC:13

TG:4,6

TT:3,16

Brute-force approach

ACTTGTGCACAGTCCTTAAG

2-mer position table

brute force approach1
Brute-force approach

ACTTGTGCACAGTCCTTAAG

ACTTGTGCACAGTCCTTAAG

brute force approach2
Brute-force approach

ACTTGTGCACAGTCCTTAAG

ACTTGTGCACAGTCCTTAAG

brute force approach3
Brute-force approach

ACTTGTGCACAGTCCTTAAG

ACTTGTGCACAGTCCTTAAG

brute force approach4
Brute-force approach

ACTTGTGCACAGTCCTTAAG

ACTTGTGCACAGTCCTTAAG

brute force approach5
Brute-force approach

ACTTGTGCACAGTCCTTAAG

ACTTGTGCACAGTCCTTAAG

brute force approach6
Brute-force approach

ACTTGTGCACAGTCCTTAAG

ACTTGTGCACAGTCCTTAAG

brute force approach7
Brute-force approach

Divide the genome into 10 Mb blocks

For all pairs of blocks:

For all l-mer matches:

Do all pair-wise DPs containing match

If ≤ k edits, mark position non-unique

300 x 300 pairs of blocks

For 20-mers: k=1 )l=10; k=2 )l=6; k=3 )l=5 ; k=4 )l=4.

brute force approach8
Brute-force approach

Things are looking really, really, bad:

  • Seeds are too short
  • 90,000 pair-wise block comparisons

Actually quite good (seed size 12):

  • Non-uniqueness certificates are dense
  • Almost all positions eliminated early
  • Behaves more like linear time than quadratic
edit distance 2
Edit distance 2
  • After seed size 12
    • ~ 27K (0.288%) positions have no match
  • After seed size 8
    • ~ 3K (0.029%) positions have no match
  • Using seed size 6 is still too slow
    • Need a more sophisticated hashing strategy
    • 6-mers match in too many places!
spaced seed set design problem
Spaced seed-set design problem
  • Given:
    • mer-size: m ( = 20 )
    • # errors: k ( = 1,2,3)
    • # cares: l ( = 10,12,14 )
  • Find the smallest set of spaced seeds that will find all alignments.
solution for 20 2 8
Solution for (20,2,8)
  • 11111111, 111101111

TCCCGCGTAGATTGAGATCT

||||||*||||||*||||||

TCCCGCCTAGATTTAGATCT

  • How can we find these spaced seed set solutions?
    • One/two table? 2 x false positives
spaced seed set design set cover formulation
Spaced seed set design set-cover formulation
  • Set cover instance:
    • Ground set: all possible placements of the k errors (alignments)
    • Covering sets: all possible placements of the l care positions
  • For (m=20,k=2,l=10),
    • 190 elements, 184,756 sets!
    • Need to reduce the number of sets!
dirty secret of spaced seeds
Dirty secret of spaced seeds
  • Spaced seeds take O(# cares) to update!
    • Contiguous seeds are O(1) to update
  • 101010101010101 vs 11111111
    • 8 steps to update vs 1 step to update
  • Constant time update for spaced seeds?
    • Yes, if they have a certain structure(s)
    • Restrict spaced seeds to small update cost
o 1 spaced seed update
O(1) spaced seed update

ACGTACGTACGTACGTACGT

1: A G A G

2: C T C T

...

Spaced seed 1010101 can be updated in 1 step!

o 1 spaced seed update1
O(1) spaced seed update

ACGTACGTACGTACGTACGT

ACGTACG -> ACGACG

CGTACGT -> CGTCGT

...

Spaced seed 1110111 can be updated in 1 step!

o 1 spaced seed update2
O(1) spaced seed update
  • “Period” step update
    • 11001100110011 2 steps
    • 1000010000100001 1 step
  • “Runs” step update
    • 11100111111 1 step
    • 11101110111 2 steps
  • Minimize the number of update steps
    • Weighted set-cover instance…
edit distance ss sdp
Edit-distance SS-SDP
  • Position of matching bases might shift!
  • Need 11111111 ↓ to get CCGCTAGA
  • Need 111101111 ↑ to get CCGCTAGA
  • Set cover formulation no longer works

TCCCGC-TAGATTGAGATCT

||||||v||||||*||||||

TCCCGCCTAGATTTAGATCT

edit distance ss sdp1
Edit-Distance SS-SDP
  • Use a variation on set cover:
    • q:111101111,r:11111111 covers:
  • Pay for query & reference update costs separately
  • Control size of problem by only enumerating templates with small update cost

r:TCCCGC-TAGATTGAGATCT

||||||v||||||*||||||

q:TCCCGCCTAGATTTAGATCT

correct solutions for 1 unique 20 mers
Correct solutions for 1-unique 20-mers

107 random sequence x 107 random sequence

  • Seed: 1111111111 (Best single seed solution, weight 10)
    • ~ 9.5 expensive dynamic programs per locus
  • Seed set: 11111111111;111110111111 (weight 11)
    • ~ 8.9 DP/locus
  • Seed set (weight 11)
    • ~ 7.8 DP/locus
  • Seed set: 111111111111;1111101111111 (weight 12)
    • ~ 2.2 DP/locus

11111111111 ~ 111101111111111101111111 ~ 11111111111

111101111111 ~ 111101111111

correct solutions for 1 unique 20 mers1
Correct solutions for 1-unique 20-mers

107 random sequence x 107 random sequence

  • Seed set: 111111111111;1111101111111 (weight 12)
    • ~ 2.5 DP/locus
  • Seed set (weight 12)
    • ~ 1.8 DP/locus
  • Seed set: 1111111111111;11111101111111 (weight 13)
    • ~ 0.56 DP/locus (same specificity as contiguous seed weight 12)
  • Seed set (weight 14)
    • ~ 0.20

1111110111111 x 111111111111 1111110111111

111111001111111

11111111111111 ~ 11111111111111 111110111111111 ~ 11111111111111 111111111110111 ~ 11111111111111 111111111110111 ~ 111111111110111 11111111111111 ~ 111111101111111111110111111111 ~ 111110111111111

correct solutions for 2 unique 20 mers
Correct solutions for 2-unique 20-mers
  • Seed: 111111 (Best single seed solution, weight 6)
    • ~ 2439 DP/locus
  • Weight 10 – 73 DP/locus (specificity of 8/9 contig seed)
  • Weight 12 – 10 DP/locus (specificity of 10 contig seed)

1111111111 ~ 1111111111

11111011111 ~ 1111111111

11111011111 ~ 11111011111

1111100000011111 ~ 1111100000011111

11111000000011111 ~ 1111100000011111

111110000000011111 ~ 1111100000011111

111110000011111 ~ 1111100000011111

111110000000011111 ~ 11111000000000011111

k unique human 20 mers
k-unique human 20-mers
  • No 4-unique 20-mers
  • No 3-unique 20-mers
  • 0. 038% of (forward) human 20-mers are 2-unique
    • 1088322 in total
    • about 1 every 2638 bases
    • Fast 2-uniquness oracle
genome browser track
Genome Browser Track

Edit Distance:

  • UCSC Track: 1-unique 20-mers
  • UCSC Track: 2-unique 20-mers
integration with high performance computing
Integration with High Performance Computing
  • Break sequence into chunks of size 107.
    • Remember which positions have been eliminated.
  • Integrated with (UMIACS) Condor
    • Too unreliable for very large sequences
    • NFS filesystem is unreliable
    • Simultaneous jobs capped at ~ 300
  • Integrated with hadoop/map-reduce on 80 nodes (Edwards Lab)
    • Reliability improved, DFS solves (some) filesystem woes
    • Much better scalability (in theory, yet to be tested)
    • Explicit synchronization of reduce step is undesirable.
other improvements
Other improvements
  • Other groupings are possible:
    • Species designation on FASTA defline can be any suitable partition
  • Constraints on the position of edits:
    • False positive due to mishybridization at 3’ end is unlikely to be observed with some technologies
  • Constraint on valid Tm range:
    • Computed as in Primer3
    • Can eliminate undesirable mers early
conclusions
Conclusions
  • Precompute of human k-unique 20-mers is now feasible!
    • Faster for large edit-distance!
    • Need spaced seed-set designs
  • Constant time update for spaced seeds
  • Good integer programming formulation of SS-SDP
    • Limited template enumeration based on update cost
    • Work with integer programming experts to solve effectively