1 / 39

Heuristic Local Aligners

Heuristic Local Aligners. The basic indexing & extension technique Indexing: techniques to improve sensitivity Pairs of Words, Patterns Systems for local alignment. Indexing-based local alignment. ……. Dictionary: All words of length k (~10)

thane
Download Presentation

Heuristic Local Aligners

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Heuristic Local Aligners • The basic indexing & extension technique • Indexing: techniques to improve sensitivity • Pairs of Words, Patterns • Systems for local alignment

  2. Indexing-based local alignment …… Dictionary: All words of length k (~10) Alignment initiated between words of alignment score  T (typically T = k) Alignment: Ungapped extensions until score below statistical threshold Output: All local alignments with score > statistical threshold query …… scan DB query

  3. Indexing-based local alignment—Extensions A C G A A G T A A G G T C C A G T Gapped extensions until threshold • Extensions with gaps until score < C below best score so far Output: GTAAGGTCCAGT GTTAGGTC-AGT C T G A T C C T G G A T T G C G A

  4. Sensitivity-Speed Tradeoff X% Sens. Speed Kent WJ, Genome Research 2002

  5. Sensitivity-Speed Tradeoff Methods to improve sensitivity/speed • Using pairs of words • Using inexact words • Patterns—non consecutive positions ……ATAACGGACGACTGATTACACTGATTCTTAC…… ……GGCACGGACCAGTGACTACTCTGATTCCCAG…… ……ATAACGGACGACTGATTACACTGATTCTTAC…… ……GGCGCCGACGAGTGATTACACAGATTGCCAG…… TTTGATTACACAGAT T G TT CAC G

  6. Measured improvement Kent WJ, Genome Research 2002

  7. Non-consecutive words—Patterns Patterns increase the likelihood of at least one match within a long conserved region Consecutive Positions Non-Consecutive Positions 6 common 5 common 7 common 3 common On a 100-long 70% conserved region: ConsecutiveNon-consecutive Expected # hits: 1.07 0.97 Prob[at least one hit]: 0.30 0.47

  8. Advantage of Patterns 11 positions 11 positions 10 positions

  9. Multiple patterns TTTGATTACACAGAT T G TT CAC G T G T C CAG TTGATT A G • K patterns • Takes K times longer to scan • Patterns can complement one another • Computational problem: • Given: a model (prob distribution) for homology between two regions • Find: best set of K patterns that maximizes Prob(at least one match) How long does it take to search the query? Buhler et al. RECOMB 2003 Sun & Buhler RECOMB 2004

  10. Variants of BLAST • NCBI BLAST: search the universe http://www.ncbi.nlm.nih.gov/BLAST/ • MEGABLAST: http://genopole.toulouse.inra.fr/blast/megablast.html • Optimized to align very similar sequences • Works best when k = 4i  16 • Linear gap penalty • WU-BLAST: (Wash U BLAST) http://blast.wustl.edu/ • Very good optimizations • Good set of features & command line arguments • BLAT http://genome.ucsc.edu/cgi-bin/hgBlat • Faster, less sensitive than BLAST • Good for aligning huge numbers of queries • CHAOS http://www.cs.berkeley.edu/~brudno/chaos • Uses inexact k-mers, sensitive • PatternHunter http://www.bioinformaticssolutions.com/products/ph/index.php • Uses patterns instead of k-mers • BlastZ http://www.psc.edu/general/software/packages/blastz/ • Uses patterns, good for finding genes • Typhon http://typhon.stanford.edu • Uses multiple alignments to improve sensitivity/speed tradeoff

  11. DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA TATATATATACGTCGTCGT ACTGATGACTAGATTACAG ACTGATTTAGATACCTGAC TGATTTTAAAAAAATATT…

  12. Which representative of the species? Which human? Answer one: Answer two: it doesn’t matter Polymorphism rate: number of letter changes between two different members of a species Humans: ~1/1,000 Other organisms have much higher polymorphism rates • Population size!

  13. Why humans are so similar Out of Africa N A small population that interbred reduced the genetic variation Out of Africa ~ 40,000 years ago Heterozygosity: H H = 4Nu/(1 + 4Nu) u ~ 10-8, N ~ 104  H ~ 410-4

  14. DNA Sequencing Goal: Find the complete sequence of A, C, G, T’s in DNA Challenge: There is no machine that takes long DNA as an input, and gives the complete sequence as output Can only sequence ~150 letters at a time

  15. Method to sequence longer regions genomic segment cut many times at random (Shotgun) Get one or two reads from each segment ~100 bp ~100 bp

  16. Definition of Coverage C Length of genomic segment: G Number of reads: N Length of each read: L Definition:Coverage C = N L / G How much coverage is enough? Lander-Waterman model: Prob[ not covered bp ] = e-C Assuming uniform distribution of reads, C=10 results in 1 gapped region /1,000,000 nucleotides

  17. Repeats Bacterial genomes: 5% Mammals: 50% Repeat types: • Low-Complexity DNA (e.g. ATATATATACATA…) • Microsatellite repeats (a1…ak)N where k ~ 3-6 (e.g. CAGCAGTAGCAGCACCAG) • Transposons • SINE(Short Interspersed Nuclear Elements) e.g., ALU: ~300-long, 106 copies • LINE(Long Interspersed Nuclear Elements) ~4000-long, 200,000 copies • LTRretroposons(Long Terminal Repeats (~700 bp) at each end) cousins of HIV • Gene Families genes duplicate & then diverge (paralogs) • Recent duplications ~100,000-long, very similar copies

  18. Two main assembly problems • De Novo Assembly • Resequencing

  19. Human Genome Variation TGCTGAGA TGCCGAGA TGCTCGGAGA TGC - - - GAGA SNP Novel Sequence Mobile Element or Pseudogene Insertion Inversion Translocation Tandem Duplication TGC - - AGA TGCCGAGA Microdeletion Transposition TGC Novel Sequence at Breakpoint Large Deletion

  20. Read Mapping • Want ultra fast, highly similar alignment • Detection of genomic variation CATCGACCGAGCGCGATGCTAGCTAGGTGATCGT...... TGCCGCATCGACCGAGCGCGATGCTAGCTAGGTGATCGT... GCATGCCGCATCGACCGAGCGCGATGCTAGCTAGGTGATCGT GTGCATGCCGCATCGACCGAGCGCGATGCTAGCTAGGTGATC ......AGGTGCATGCCGCATCGATCGAGCGCGATGCTAGCTAGCTGATCGT......

  21. Read Mapping • Modern fast read aligners: BWA, Bowtie, SOAP • Based on Burrows-Wheeler transform CATCGACCGAGCGCGATGCTAGCTAGGTGATCGT...... TGCCGCATCGACCGAGCGCGATGCTAGCTAGGTGATCGT... GCATGCCGCATCGACCGAGCGCGATGCTAGCTAGGTGATCGT GTGCATGCCGCATCGACCGAGCGCGATGCTAGCTAGGTGATC ......AGGTGCATGCCGCATCGATCGAGCGCGATGCTAGCTAGCTGATCGT......

  22. Burrows-Wheeler Transform ANA BANANA ANANA NANA ANA NA A BANANA ANANA NANA ANA NA A X = BANANA suffixes of BANANA

  23. Burrows-Wheeler Transform ANA BANANA$ ANANA$ NANA$ ANA$ NA$ A$ $ BANANA$ ANANA$ NANA$ ANA$ NA$ A$ $ X = BANANA$

  24. Burrows-Wheeler Transform ANA BANANA$ ANANA$B NANA$BA ANA$BAN NA$BANA A$BANAN $BANANA BANANA$ ANANA$B NANA$BA ANA$BAN NA$BANA A$BANAN $BANANA BANANA$ ANANA$B NANA$BA ANA$BAN NA$BANA A$BANAN $BANANA X = BANANA$

  25. Burrows-Wheeler Transform ANA BANANA$ ANANA$B NANA$BA ANA$BAN NA$BANA A$BANAN $BANANA BANANA$ ANANA$B NANA$BA ANA$BAN NA$BANA A$BANAN $BANANA $BANANA A$BANAN ANA$BAN ANANA$B BANANA$ NA$BANA NANA$BA X = BANANA$

  26. Burrows-Wheeler Transform ANA BANANA$ ANANA$B NANA$BA ANA$BAN NA$BANA A$BANAN $BANANA BANANA$ ANANA$B NANA$BA ANA$BAN NA$BANA A$BANAN $BANANA $BANANA A$BANAN ANA$BAN ANANA$B BANANA$ NA$BANA NANA$BA X = BANANA$

  27. Burrows-Wheeler Transform ANA BANANA$ ANANA$B NANA$BA ANA$BAN NA$BANA A$BANAN BANANA$ ANANA$B NANA$BA ANA$BAN NA$BANA A$BANAN $BANANA A$BANAN ANA$BAN ANANA$B BANANA$ NA$BANA NANA$BA X = BANANA$ BWT matrix of string ‘BANANA’ BWT(BANANA) = ANNB$AA

  28. Suffix Arrays $BANANA A$BANAN ANA$BAN ANANA$B BANANA$ NA$BANA NANA$BA 1 $BANANA 2 A$BANAN 3 ANA$BAN 4 ANANA$B 5 BANANA$ 6 NA$BANA 7 NANA$BA Suffixes are sorted in the BWT matrix S(i) = j, where Xj…Xn is the i-th suffix lexicographically X BWT(X) constructed from S: At each position, take the letter to the left of the one pointed by S S BWT(X)

  29. Reconstructing BANANA A$ NA NA BA $B AN AN $B A$ AN AN BA NA NA A$B NA$ NAN BAN $BA ANA ANA $BANANA A$BANAN ANA$BAN ANANA$B BANANA$ NA$BANA NANA$BA A N N B $ A A $ A A A B N N $BA A$B ANA ANA BAN NA$ NAN BWT matrix of string ‘BANANA’ append BWT append BWT sort sort sort

  30. Reconstructing BANANA - faster Lemma. The i-th occurrence of character c in last column is the same text character as the i-th occurrence of c in the first column $BANANA A$BANAN ANA$BAN ANANA$B BANANA$ NA$BANA NANA$BA $BANANA A$BANAN ANA$BAN ANANA$B BANANA$ NA$BANA NANA$BA BWT matrix of string ‘BANANA’

  31. Reconstructing BANANA - faster Lemma. The i-th occurrence of character c in last column is the same text character as the i-th occurrence of c in the first column $BANANA A$BANAN ANA$BAN ANANA$B BANANA$ NA$BANA NANA$BA A N N B $ A A $BANAN A$BANA ANA$BA ANANA$ BANANA NA$BAN NANA$B BWT matrix of string ‘BANANA’

  32. Reconstructing BANANA - faster Lemma. The i-th occurrence of character c in last column is the same text character as the i-th occurrence of c in the first column $BANANA A$BANAN ANA$BAN ANANA$B BANANA$ NA$BANA NANA$BA A N N B $ A A $BANAN A$BANA ANA$BA ANANA$ BANANA NA$BAN NANA$B A$BANAN ANA$BAN ANANA$B Same words, same sorted order BWT matrix of string ‘BANANA’

  33. Reconstructing BANANA - faster Lemma. The i-th occurrence of character ‘a’ in last column is the same text character as the i-th occurrence of ‘a’ in the first column LF(): Map the i-th occurrence of character ‘a’ in last column to the first column LF(r): Let row r contain the i-th occurrence of ‘a’ in last column Then, LF(r) = r’; r’: i-th row starting with ‘a’ $BANANA A$BANAN ANA$BAN ANANA$B BANANA$ NA$BANA NANA$BA BWT matrix of string ‘BANANA’

  34. Reconstructing BANANA - faster LF(r): Let row r be the i-th occurrence of ‘a’ in last column Then, LF(r) = r’; r’: i-th row starting with ‘a’ $BANANA A$BANAN ANA$BAN ANANA$B BANANA$ NA$BANA NANA$BA $BANANA A$BANAN ANA$BAN ANANA$B BANANA$ NA$BANA NANA$BA $BANANA A$BANAN ANA$BAN ANANA$B BANANA$ NA$BANA NANA$BA BWT matrix of string ‘BANANA’ LF[] = [2, 6, 7, 5, 1, 3, 4] Row LF(r) is obtained by rotating row r one position to the right

  35. Reconstructing BANANA - faster $BANANA A$BANAN ANA$BAN ANANA$B BANANA$ NA$BANA NANA$BA $BANANA A$BANAN ANA$BAN ANANA$B BANANA$ NA$BANA NANA$BA $BANANA A$BANAN ANA$BAN ANANA$B BANANA$ NA$BANA NANA$BA LF[] = [2, 6, 7, 5, 1, 3, 4] • Computing LF() is easy: • Let C(a): # of characters smaller than ‘a’ • Example: C($) = 0; C(A) = 1; C(B) = 4; C(N) = 5 • Let row r end with the i-th occurrence of ‘a’ in last column • Then, LF(r) = C(a) + i(why?) BWT matrix of string ‘BANANA’

  36. Reconstructing BANANA - faster A N N B $ A A $BANANA A$BANAN ANA$BAN ANANA$B BANANA$ NA$BANA NANA$BA C() copied for convenience C() 1 554011 indicating this is i-th occurrence of ‘c’ index i 1 1 2 1 1 2 3 LF() 2 6 7 5 1 3 4 LF() = C() + i Reconstruct BANANA: S := “”; r := 1; c := BWT[r]; UNTIL c = ‘$’ { S := cS; r := LF(r); c := BWT(r); } BWT matrix of string ‘BANANA’ Credit: Ben Langmead thesis

  37. Searching for ANA L(W): lowest index in BWT matrix where W is prefix U(W): highest index in BWT matrix where W is prefix Example: L(“NA”) = 6 U(“NA”) = 7 Lemma (prove as exercise) L(aW) = C(a) + i +1, where i = # ‘a’s up to L(W) – 1 in BWT(X) U(aW) = C(a) + j, where j = # ‘a’s up to U(W) in BWT(X) Example: L(“ANA”) = C(‘A’) + # ‘A’s up to (L(“NA”) – 1) + 1 = 1 + (# ‘A’s up to 5) + 1 = 1 + 1 + 1 = 3 U(“ANA”) = 1 + # ‘A’s up to U(“NA”) = 1 + 3 = 4 $BANANA A$BANAN ANA$BAN ANANA$B BANANA$ NA$BANA NANA$BA BWT matrix of string ‘BANANA’

  38. Searching for ANA Let LFC(r, a) = C(a) + i, where i = #’a’s up to r in BWT ExactMatch(W[1…k]) { a := W[k]; low := C(a) +1; high := C(a+1); // a+1: lexicographically next char i := k – 1; while (low <= high && i >= 1) { a = W[i]; low = LFC(low – 1, a) + 1; high = LFC(high, a); i := i – 1; } return (low, high); } $BANANA A$BANAN ANA$BAN ANANA$B BANANA$ NA$BANA NANA$BA BWT matrix of string ‘BANANA’ Credit: Ben Langmead thesis

  39. Summary of BWT algorithm Suffix array of string X: S(i) = j, where Xj…Xn is the j-thsuffix lexicographically • BWT follows immediately from suffix array • Suffix array construction possible in O(n), many good O(n log n) algorithms • Reconstruct X from BWT(X) in time O(n) • Search for all exact occurrences of W in time O(|W|) • BWT(X) is easier to compress than X

More Related