Presentation Transcript

  1. Repeats!

  2. Introduction • A repeat family is a collection of repeats which appear multiple times in a genome. • Our objective is to identify all families of interspersed repeats within a single genome

  3. Challenges when identifying repeat families • Regions containing repeat occurrences are not known a priori • Repeat boundaries are not known a priori • Many repeat occurrences appear as partial copies

  4. Why are repeats important? • Repeats have been implicated in: • Genome rearrangements (Kazazian, 2004; Achaz et al., 2003) • Accelerated loss of gene order (Rocha et al., 2003) • Creation of novel biological functions (Lynch et al., 2002) • Increased rate of evolution under stress (Capy et al., 2000)

  5. Identifying repeats de novo • Given a new genome about which we know nothing, we can: • Use a database of known repeats (RepeatMasker/RepBase) • novel repeat elements may not be in the database • repetitive gene families are never in the database • Identify repeats de novo using sequence analysis

  6. Existing methods for detection of repeat families • Nearly all existing algorithms for de novo identification of repeat families rely on a set of pairwise similarities: • REPuter (Kurtz et al., 2000) • RepeatFinder (Volfovsky et al., 2001) • RECON (Bao and Eddy, 2002) • RepeatGluer (Pevzner et al., 2004) • PILER (Edgar and Myers, 2005) • RepeatScout (Price et al., 2005)

  7. Mutational forces at play • Over time, indels & substitutions will affect copies of repeat families: • AGGCTACCCCTTTAGGCTAGGGGGGAGGCTATCTCTCCTAGGCTATTTTTTAGCCTATT • AGGCTGCCCCTTTAGGCTDGGGGGGAGGCTATCTCTCCTAGGCTATTTTTTAGCCTATT • AGGCTGCCCCTTTAGGCTGGGGGGAGGCTCTCTCTCCTAGCCTATTTTTTAGCCTATT • AGGCTGCCCCTTTAGGCTGGGGGGAGGCTCTCTCTCCTAGCCTATTTTTTAGCDTATT • AGGCTGCCCCTTTAGGCTGGGGGGAGGCTCTCTCTCCTAGCCTATTTTTTAGCTATT • AGGCTGCCCCTTTAGGCTGGGGGGAGGCTCTCTCTCCTAGCCTATTTTTTAGCTATT • Require alignments (& gaps) to attempt to reconstruct true repeat boundaries

  8. De novo repeat detection • One approach: self-search with a pairwise local-alignment tool such as BLAST • The number of pairwise alignments grows as $O(r^2)$ in the copy number $r$ of the repeat • Inherent difficulty defining repeat boundaries among collections of pairwise alignments
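
The quadratic growth noted on this slide can be checked with a few lines of arithmetic. This is only an illustrative sketch: it assumes every pair of copies yields one pairwise alignment, and the function name is hypothetical.

```python
def pairwise_alignment_count(r: int) -> int:
    """Number of unordered pairs among r copies of a repeat (r choose 2)."""
    return r * (r - 1) // 2


# Copy numbers chosen only for illustration.
for copies in (10, 100, 1000):
    print(copies, pairwise_alignment_count(copies))
```

A tenfold increase in copy number gives roughly a hundredfold increase in the number of pairwise alignments to store and reconcile.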

  9. Alternative methods? • Local multiple alignment • An example local multiple alignment: • AACAAGCA-A-ACTTTTATCCATGGTCGTGGTACAGAGGGGTC • AACAAGCA-A-ACTTTTGTCCATGGTCGTGGTACAGAGTGGTC • AACATGCAGA-ACTTTTATCCATGGTCGTCGTACAGAGGGGT- • AACAAGCAGACACTTTTATCCATGGTCGTGGTAC--------- • AACAAGCA----CTTTTATCCATAGTCGTGGTA---------- • ------------CTTTTATCCATGGTCGTGGTACAGAGGGGTC • A single local multiple alignment uses $O(N)$ space for a genome of length $N$

  10. Local multiple alignment • Local multiple alignment has the inherent potential to avoid pitfalls associated with pairwise alignment. • But multiple alignment under the SP objective function remains intractable… • Progressive alignment heuristics offer excellent speed and accuracy (e.g. MUSCLE). • So why not directly construct a multiple alignment?

  11. Steps 1-3: Chaining seeds from the Input Sequence • The method incorporated three novel ideas: (1) palindromic spaced seed patterns to match both DNA strands simultaneously (2) seed extension (chaining) in order of decreasing multiplicity, and (3) procrastination when low multiplicity matches are encountered.
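
A minimal sketch of ideas (1) and (2), under stated assumptions: a seed pattern is written as a string of `1` (position must match) and `0` (position ignored), and a "palindromic" pattern is one that reads the same in both directions. The pattern `1101011`, the function names, and the canonical-key trick are illustrative choices, not taken from the slides. Procrastination (idea 3) is not shown.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def is_palindromic(pattern: str) -> bool:
    """A pattern is palindromic if it equals its own reverse."""
    return pattern == pattern[::-1]


def seed_key(window: str, pattern: str) -> str:
    """Keep only the bases at the '1' positions of the pattern."""
    return "".join(base for base, p in zip(window, pattern) if p == "1")


def canonical_key(window: str, pattern: str) -> str:
    """Same key for a window and its reverse complement.

    Because the pattern is palindromic, masking the reverse-complemented
    window yields the reverse complement of the masked forward window.
    """
    forward = seed_key(window, pattern)
    reverse = seed_key(reverse_complement(window), pattern)
    return min(forward, reverse)


def seed_matches(sequence: str, pattern: str) -> list:
    """Group window start positions by canonical key, highest multiplicity first."""
    if not is_palindromic(pattern):
        raise ValueError("pattern must be palindromic")
    index = defaultdict(list)
    width = len(pattern)
    for start in range(len(sequence) - width + 1):
        index[canonical_key(sequence[start:start + width], pattern)].append(start)
    return sorted(index.values(), key=len, reverse=True)
```

For example, `canonical_key("ACGTTGA", "1101011")` and `canonical_key(reverse_complement("ACGTTGA"), "1101011")` return the same key, so one index lookup covers both strands.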

  12. Step 4: Gapped Extension • After chaining a seed match, we must perform gapped extension to approximate the true repeat boundaries • This is an essential step to consider, assuming we would like to improve repeat boundary predictions • But how can this be done efficiently?

  13. Our approach to gapped extension TAGTTGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGACTAGCAGCCA GTTGCGGCCCCTGAGGCACTCTTTAACGT-TA-CTGGTCCTTTCCTTTAATTTGACATGA CTTAAGGCCCCTGAGGATCTCTTTAACGTTTCTCTGGTCCTTTCCAAGAGCCCCCGTAGC AGGTAGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGAGACAGGATGGACG CCCGAGCCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCCTTTCCTTAAAAAAATTAAAA TAAGCGGCCCCTGAGGCACTCTTTAACGTTTA-CTGGTCCTTTCCAATTTGCTCTATGTC ATTGGGGCCCCTGAGGATCTCTTTAACGTTTA-CTGGTCCTTTCCGGCCCTTATAGGTAC GGCCAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCCTTTCCAAAGAGCGCCCGCGG CCCACGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGACCGAATTAATTAA . . . GATTCGGCCCCGAGGCATCTCTTTAACGTTTA-CTGGTCCTTTCGTTTCCCCCCGGCCCC

  14. HMM approach to gapped extension TAGTTGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGACTAGCAGCCA GTTGCGGCCCCTGAGGCACTCTTTAACGT-TA-CTGGTCCTTTCCTTTAATTTGACATGA CTTAAGGCCCCTGAGGATCTCTTTAACGTTTCTCTGGTCCTTTCCAAGAGCCCCCGTAGC AGGTAGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGAGACAGGATGGACG CCCGAGCCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCCTTTCCTTAAAAAAATTAAAA TAAGCGGCCCCTGAGGCACTCTTTAACGTTTA-CTGGTCCTTTCCAATTTGCTCTATGTC ATTGGGGCCCCTGAGGATCTCTTTAACGTTTA-CTGGTCCTTTCCGGCCCTTATAGGTAC GGCCAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCCTTTCCAAAGAGCGCCCGCGG CCCACGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGACCGAATTAATTAA . . . GATTCGGCCCCGAGGCATCTCTTTAACGTTTA-CTGGTCCTTTCGTTTCCCCCCGGCCCC Dynamically calculate extension window $= 70\,e^{-0.01\,|M_i|}$; $|M_i| = 200$, $l = 10$
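
The window formula on this slide can be evaluated directly. In the sketch below, reading $l$ as the window length and $|M_i|$ as the multiplicity of the match is an assumption, as are the function and parameter names. Rounding up is also an assumption, chosen because it reproduces the slide's value of 10.

```python
import math


def extension_window(multiplicity: int, scale: float = 70.0, decay: float = 0.01) -> float:
    """Window length 70 * exp(-0.01 * |M_i|): higher multiplicity, shorter window."""
    return scale * math.exp(-decay * multiplicity)


raw = extension_window(200)   # about 9.47
window = math.ceil(raw)       # 10, matching the slide's example
```

The exponential decay keeps each alignment call small when many copies must be aligned at once.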

  15. HMM approach to gapped extension TAGTTGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGACTAGCAGCCA GTTGCGGCCCCTGAGGCACTCTTTAACGT-TA-CTGGTCCTTTCCTTTAATTTGACATGA CTTAAGGCCCCTGAGGATCTCTTTAACGTTTCTCTGGTCCTTTCCAAGAGCCCCCGTAGC AGGTAGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGAGACAGGATGGACG CCCGAGCCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCCTTTCCTTAAAAAAATTAAAA TAAGCGGCCCCTGAGGCACTCTTTAACGTTTA-CTGGTCCTTTCCAATTTGCTCTATGTC ATTGGGGCCCCTGAGGATCTCTTTAACGTTTA-CTGGTCCTTTCCGGCCCTTATAGGTAC GGCCAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCCTTTCCAAAGAGCGCCCGCGG CCCACGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGACCGAATTAATTAA . . . GATTCGGCCCCGAGGCATCTCTTTAACGTTTA-CTGGTCCTTTCGTTTCCCCCCGGCCCC Use MUSCLE to perform alignment of extension window

  16. HMM approach to gapped extension ACAAGGGCCC-TGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGACTAGCAGCCA TACGAGCCCCCTGAGGCA-CTCTTTAACGT-TA-CTGGTCCTTTCCTTTAATTTGACATGA TTCATCCCCCCTGAGG-ATCTCTTTAACGTTTCTCTGGTCCTTTCCAAGAGCCCCCGTAGC AGAACGGCCC-TCAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGAGACAGGATGGACG AGGCCGGCCC-TGAGGTATCTCTTTAACGTTTC-CTGGTCCTTTCCTTAAAAAAATTAAAA AACCCGCCCCCTGAGGCA-CTCTTTAACGTTTA-CTGGTCCTTTCCAATTTGCTCTATGTC ATTTTGCCCCCTGAGG-ATCTCTTTAACGTTTA-CTGGTCCTTTCCGGCCCTTATAGGTAC GGAAAGCCCC-TGAGGCATCTCTTTAACGTTTCTCTGGTCCTTTCCAAAGAGCGCCCGCGG ATTCCGCCCC-TGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGACCGAATTAATTAA . . . ATTCGGCCCC-CGAGGCATCTCTTTAACGTTTA-CTGGTCCTTTCGTTTCCCCCCGGCCCC Use HMM to detect & unalign unrelated sequence

  17. HMM approach to gapped extension ACAAGGGCCC-TGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGACTAGCAGCCA TACGAGCCCCCTGAGGCA-CTCTTTAACGT-TA-CTGGTCCTTTCCTTTAATTTGACATGA TTCATCCCCCCTGAGG-ATCTCTTTAACGTTTCTCTGGTCCTTTCCAAGAGCCCCCGTAGC AGAACGGCCC-TCAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGAGACAGGATGGACG AGGCCGGCCC-TGAGGTATCTCTTTAACGTTTC-CTGGTCCTTTCCTTAAAAAAATTAAAA AACCCGCCCCCTGAGGCA-CTCTTTAACGTTTA-CTGGTCCTTTCCAATTTGCTCTATGTC ATTTTGCCCCCTGAGG-ATCTCTTTAACGTTTA-CTGGTCCTTTCCGGCCCTTATAGGTAC GGAAAGCCCC-TGAGGCATCTCTTTAACGTTTCTCTGGTCCTTTCCAAAGAGCGCCCGCGG ATTCCGCCCC-TGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGACCGAATTAATTAA . . . ATTCGGCCCC-CGAGGCATCTCTTTAACGTTTA-CTGGTCCTTTCGTTTCCCCCCGGCCCC Extension successful, continue extending

  18. HMM approach to gapped extension ACAAGGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGACTAGCAGCCA ACGAGCCCCCTGAGGCA-CTCTTTAACGT-TA-CTGGTCCTTTCCTTTAATTTGACATGA TCATCCCCCCTGAGG-ATCTCTTTAACGTTTCTCTGGTCCTTTCCAAGAGCCCCCGTAGC AGAACGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGAGACAGGATGGACG AGGCCGGCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCCTTTCCTTAAAAAAATTAAAA ACCCGCCCCCTGAGGCA-CTCTTTAACGTTTA-CTGGTCCTTTCCAATTTGCTCTATGTC TTTTGCCCCCTGAGG-ATCTCTTTAACGTTTA-CTGGTCCTTTCCGGCCCTTATAGGTAC GGAAAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCCTTTCCAAAGAGCGCCCGCGG ATTCCGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGACCGAATTAATTAA . . . ATTCGGCCCC-GAGGCATCTCTTTAACGTTTA-CTGGTCCTTTCGTTTCCCCCCGGCCCC

  19. HMM approach to gapped extension ACAAGGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGACTAGCAGCCA ACGAGCCCCCTGAGGCA-CTCTTTAACGT-TA-CTGGTCCTTTCCTTTAATTTGACATGA TCATCCCCCCTGAGG-ATCTCTTTAACGTTTCTCTGGTCCTTTCCAAGAGCCCCCGTAGC AGAACGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGAGACAGGATGGACG AGGCCGGCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCCTTTCCTTAAAAAAATTAAAA ACCCGCCCCCTGAGGCA-CTCTTTAACGTTTA-CTGGTCCTTTCCAATTTGCTCTATGTC TTTTGCCCCCTGAGG-ATCTCTTTAACGTTTA-CTGGTCCTTTCCGGCCCTTATAGGTAC GGAAAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCCTTTCCAAAGAGCGCCCGCGG ATTCCGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGACCGAATTAATTAA . . . ATTCGGCCCC-GAGGCATCTCTTTAACGTTTA-CTGGTCCTTTCGTTTCCCCCCGGCCCC Use HMM to detect & unalign unrelated sequence

  20. HMM approach to gapped extension ACAAGGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGACTAGCAGCCA ACGAGCCCCCTGAGGCA-CTCTTTAACGT-TA-CTGGTCCTTTCCTTTAATTTGACATGA TCATCCCCCCTGAGG-ATCTCTTTAACGTTTCTCTGGTCCTTTCCAAGAGCCCCCGTAGC AGAACGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGAGACAGGATGGACG AGGCCGGCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCCTTTCCTTAAAAAAATTAAAA ACCCGCCCCCTGAGGCA-CTCTTTAACGTTTA-CTGGTCCTTTCCAATTTGCTCTATGTC TTTTGCCCCCTGAGG-ATCTCTTTAACGTTTA-CTGGTCCTTTCCGGCCCTTATAGGTAC GGAAAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCCTTTCCAAAGAGCGCCCGCGG ATTCCGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGACCGAATTAATTAA . . . ATTCGGCCCC-GAGGCATCTCTTTAACGTTTA-CTGGTCCTTTCGTTTCCCCCCGGCCCC Finished leftward extension, now to the right…

  21. HMM approach to gapped extension ACAAGGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGACTAGCAGCCA ACGAGCCCCCTGAGGCA-CTCTTTAACGT-TA-CTGGTCCTTTCCTTTAATTTGACATGA TCATCCCCCCTGAGG-ATCTCTTTAACGTTTCTCTGGTCCTTTCCAAGAGCCCCCGTAGC AGAACGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGAGACAGGATGGACG AGGCCGGCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCCTTTCCTTAAAAAAATTAAAA ACCCGCCCCCTGAGGCA-CTCTTTAACGTTTA-CTGGTCCTTTCCAATTTGCTCTATGTC TTTTGCCCCCTGAGG-ATCTCTTTAACGTTTA-CTGGTCCTTTCCGGCCCTTATAGGTAC GGAAAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCCTTTCCAAAGAGCGCCCGCGG ATTCCGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGACCGAATTAATTAA . . . ATTCGGCCCC-GAGGCATCTCTTTAACGTTTA-CTGGTCCTTTCGTTTCCCCCCGGCCCC

  22. HMM approach to gapped extension TAGTTGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGA---GCAGCCA TACGAGCCCCTGAGGCA-CTCTTTAACGT-TA-CTGGTCC---TTTCCTTTAATTTGACA TTCATGCCCCTGAGG-ATCTCTTTAACGTTTCTCTGGTCC---TTTCCAAGAGCCCCCGT AGAAAGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCC---TTTCGAGACTAGGATGG CCGATGCCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCC---TTTCCTTAAAAAAATTA AACCCGCCCCTGAGGCA-CTCTTTAACGTTTA-CTGGTCC---TTTCCAATTTGCTCTAT TTTTTGCCCCTGAGG-ATCTCTTTAACGTTTA-CTGGTCC---TTTCCGGCCCTTATAGG GGAAAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCC---TTTCCAAAGAGCGCCCG CCTATGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCC---TTTCCGACCGAATTAAT . . . -TTCGGCCCC-GAGGCATCTCTTTAACGTTTA-CTGGTCC---TTTCGTTTCCCCCCGGC Perform MUSCLE alignment on window

  23. HMM approach to gapped extension TAGTTGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGA---GCAGCCA TACGAGCCCCTGAGGCA-CTCTTTAACGT-TA-CTGGTCC---TTTCCTTTAATTTGACA TTCATGCCCCTGAGG-ATCTCTTTAACGTTTCTCTGGTCC---TTTCCAAGAGCCCCCGT AGAAAGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCC---TTTCGAGACTAGGATGG CCGATGCCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCC---TTTCCTTAAAAAAATTA AACCCGCCCCTGAGGCA-CTCTTTAACGTTTA-CTGGTCC---TTTCCAATTTGCTCTAT TTTTTGCCCCTGAGG-ATCTCTTTAACGTTTA-CTGGTCC---TTTCCGGCCCTTATAGG GGAAAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCC---TTTCCAAAGAGCGCCCG CCTATGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCC---TTTCCGACCGAATTAAT . . . -TTCGGCCCC-GAGGCATCTCTTTAACGTTTA-CTGGTCC---TTTCGTTTCCCCCCGGC Use HMM to detect & unalign unrelated sequence

  24. HMM approach to gapped extension TAGTTGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGA---GCAGCCA TACGAGCCCCTGAGGCA-CTCTTTAACGT-TA-CTGGTCC---TTTCCTTTAATTTGACA TTCATGCCCCTGAGG-ATCTCTTTAACGTTTCTCTGGTCC---TTTCCAAGAGCCCCCGT AGAAAGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCC---TTTCGAGACTAGGATGG CCGATGCCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCC---TTTCCTTAAAAAAATTA AACCCGCCCCTGAGGCA-CTCTTTAACGTTTA-CTGGTCC---TTTCCAATTTGCTCTAT TTTTTGCCCCTGAGG-ATCTCTTTAACGTTTA-CTGGTCC---TTTCCGGCCCTTATAGG GGAAAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCC---TTTCCAAAGAGCGCCCG CCTATGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCC---TTTCCGACCGAATTAAT . . . -TTCGGCCCC-GAGGCATCTCTTTAACGTTTA-CTGGTCC---TTTCGTTTCCCCCCGGC Extension successful, continue extending

  25. HMM approach to gapped extension TAGTTGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGAGCAGCCACCA TACGAGCCCCTGAGGCA-CTCTTTAACGT-TA-CTGGTCC---TTTCCTTTAATTTGACA TTCATGCCCCTGAGG-ATCTCTTTAACGTTTCTCTGGTCC---TTTCCAAGAGCCCCCGT AGAAAGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCC---TTTCGAGACTAGGATGG CCGATGCCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCC---TTTCCTTAAAAAAATTA AACCCGCCCCTGAGGCA-CTCTTTAACGTTTA-CTGGTCC---TTTCCAATTTGCTCTAT TTTTTGCCCCTGAGG-ATCTCTTTAACGTTTA-CTGGTCC---TTTCCGGCCCTTATAGG GGAAAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCC---TTTCCAAAGAGCGCCCG CCTATGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCC---TTTCCGACCGAATTAAT . . . -TTCGGCCCC-GAGGCATCTCTTTAACGTTTA-CTGGTCC---TTTCGTTTCCCCCCGGC Use MUSCLE to perform alignment of extension window

  26. HMM approach to gapped extension TAGTTGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCC---GAGCAGCCAC- TACGAGCCCCTGAGGCA-CTCTTTAACGT-TA-CTGGTCC---TTTCCTTTAATTTGA---- TTCATGCCCCTGAGG-ATCTCTTTAACGTTTCTCTGGTCC---TTTCC----AAGAGCCCCC AGAAAGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCC---TTTCG---AGACTAGGAT- CCGATGCCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCC---TTTCCTTAAAAAAAT---- AACCCGCCCCTGAGGCA-CTCTTTAACGTTTA-CTGGTCC---TTTCC---AATTTGCTCT- TTTTTGCCCCTGAGG-ATCTCTTTAACGTTTA-CTGGTCC---TTTCC----GGCCCTTATA GGAAAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCC---TTTCC---AAAGAGCGCC- CCTATGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCC---TTTCC----GACCGAATTA . . . -TTCGGCCCC-GAGGCATCTCTTTAACGTTTA-CTGGTCC---TTTCG----TTTCCCCCCG Use HMM to detect & unalign unrelated sequence

  27. HMM approach to gapped extension TAGTTGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCC---GAGCAGCCAC- TACGAGCCCCTGAGGCA-CTCTTTAACGT-TA-CTGGTCC---TTTCCTTTAATTTGA---- TTCATGCCCCTGAGG-ATCTCTTTAACGTTTCTCTGGTCC---TTTCC----AAGAGCCCCC AGAAAGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCC---TTTCG---AGACTAGGAT- CCGATGCCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCC---TTTCCTTAAAAAAAT---- AACCCGCCCCTGAGGCA-CTCTTTAACGTTTA-CTGGTCC---TTTCC---AATTTGCTCT- TTTTTGCCCCTGAGG-ATCTCTTTAACGTTTA-CTGGTCC---TTTCC----GGCCCTTATA GGAAAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCC---TTTCC---AAAGAGCGCC- CCTATGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCC---TTTCC----GACCGAATTA . . . -TTCGGCCCC-GAGGCATCTCTTTAACGTTTA-CTGGTCC---TTTCG----TTTCCCCCCG Extension failed, stop extending
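
The walk-through in slides 14 to 27 (take a window, align it, check it, continue or stop) can be summarized as a loop. The sketch below is a hypothetical skeleton for one direction only. `identity_align` and `leading_identical_columns` are toy stand-ins so the example runs; they are not MUSCLE and not the HMM, which a real implementation would plug in through the `align` and `check` parameters.

```python
from typing import Callable, List

Rows = List[str]


def identity_align(chunk: Rows) -> Rows:
    """Toy stand-in for a multiple aligner: pad rows to equal length with gaps."""
    width = max(len(row) for row in chunk)
    return [row.ljust(width, "-") for row in chunk]


def leading_identical_columns(aligned: Rows) -> int:
    """Toy stand-in for the homology check: count leading gap-free identical columns."""
    kept = 0
    for column in zip(*aligned):
        if "-" in column or len(set(column)) != 1:
            break
        kept += 1
    return kept


def extend(flanks: Rows, window: int,
           align: Callable[[Rows], Rows] = identity_align,
           check: Callable[[Rows], int] = leading_identical_columns) -> List[Rows]:
    """Extend rightward window by window until the check rejects a window."""
    offsets = [0] * len(flanks)
    accepted: List[Rows] = []
    while True:
        chunk = [seq[o:o + window] for seq, o in zip(flanks, offsets)]
        if not any(chunk):          # ran out of flanking sequence
            break
        aligned = align(chunk)
        kept = check(aligned)
        if kept == 0:               # extension failed, stop extending
            break
        block = [row[:kept] for row in aligned]
        accepted.append(block)
        # Advance each sequence by the number of bases it contributed.
        offsets = [o + len(row.replace("-", "")) for o, row in zip(offsets, block)]
        if kept < len(aligned[0]):  # boundary found inside this window
            break
    return accepted
```

Leftward extension can reuse the same loop on reversed flanking sequences.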

  28. Wait a moment... • The MUSCLE alignment software reports the highest scoring global multiple alignment of the input sequences, regardless of common ancestry. • As a result, it is likely that this method forcibly aligns unrelated sequence. • We therefore use an HMM to detect alignments of unrelated sequence.

  29. Step 5: Detecting unrelated sequence • The HMM consists of two hidden states, Homologous and Unrelated. • The observations are pairwise alignment columns: all possible pairs over {A,G,C,T,-}, with strand and species symmetry • i.e. AG=GA=TC=CT. • The emission probabilities for each possible pair of aligned nucleotides were extracted from the HOXD substitution matrix presented by Chiaromonte et al.
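
A two-state HMM of this kind is typically decoded with the Viterbi algorithm. The sketch below is a simplified illustration, not the parameterization described on the slide: observations are collapsed to `match`, `mismatch`, and `gap`, and every probability in `START`, `TRANS`, and `EMIT` is a made-up placeholder rather than a value derived from the HOXD matrix.

```python
import math

STATES = ("H", "U")  # Homologous, Unrelated

# Placeholder parameters for illustration only.
START = {"H": 0.5, "U": 0.5}
TRANS = {"H": {"H": 0.95, "U": 0.05}, "U": {"H": 0.05, "U": 0.95}}
EMIT = {
    "H": {"match": 0.80, "mismatch": 0.15, "gap": 0.05},
    "U": {"match": 0.25, "mismatch": 0.55, "gap": 0.20},
}


def classify_column(a: str, b: str) -> str:
    """Reduce a pairwise alignment column to match, mismatch, or gap."""
    if a == "-" or b == "-":
        return "gap"
    return "match" if a == b else "mismatch"


def viterbi(observations: list) -> str:
    """Most probable state path (string of H/U), computed in log space."""
    if not observations:
        return ""
    scores = {s: math.log(START[s]) + math.log(EMIT[s][observations[0]]) for s in STATES}
    backpointers = []
    for obs in observations[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            new_scores[s] = (scores[best_prev] + math.log(TRANS[best_prev][s])
                             + math.log(EMIT[s][obs]))
            pointers[s] = best_prev
        backpointers.append(pointers)
        scores = new_scores
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


def decode_pair(row_a: str, row_b: str) -> str:
    """Label each column of a pairwise alignment as H or U."""
    return viterbi([classify_column(a, b) for a, b in zip(row_a, row_b)])
```

Columns labeled `U` at the edge of an extension window would be the candidates for unaligning.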

  30. [Figure: HMM diagram with states H and U; label 0.5] • Compute emission frequencies for the Unrelated state of our HMM using the background frequencies of G/C and A/T, assuming strand and species symmetry: • $U_{AA} = U_{AT} = U_{TA} = U_{TT} = (f_{AT}/2)(f_{AT}/2)$ • $U_{CC} = U_{CG} = U_{GC} = U_{GG} = (f_{GC}/2)(f_{GC}/2)$ • $U_{AC} = U_{AG} = U_{TC} = U_{TG} = (f_{AT}/2)(f_{GC}/2)$ • $U_{CA} = U_{CT} = U_{GA} = U_{GT} = (f_{GC}/2)(f_{AT}/2)$
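
The rule on this slide treats the two aligned bases as independent draws from the background composition, with A and T each at half the A/T frequency and G and C each at half the G/C frequency. A short sketch, with a hypothetical function name and the 48% G+C figure used only as an example value:

```python
from itertools import product


def unrelated_emissions(gc_content: float) -> dict:
    """Emission frequency of every ungapped base pair under the Unrelated state."""
    f_gc = gc_content
    f_at = 1.0 - gc_content
    base_freq = {"A": f_at / 2, "T": f_at / 2, "G": f_gc / 2, "C": f_gc / 2}
    # Independent bases: pair frequency is the product of the two base frequencies.
    return {x + y: base_freq[x] * base_freq[y] for x, y in product("ACGT", repeat=2)}


emissions = unrelated_emissions(0.48)
```

The 16 pair frequencies sum to $(f_{AT} + f_{GC})^2 = 1$; gap columns are handled separately through the gap-open and gap-extend parameters.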

  31. [Figure: HMM diagram with states H and U; label 0.5] • To empirically estimate gap-open and gap-extend values for the Unrelated state, we aligned a 10-kb, 48% G+C content region taken from E. coli CFT073 (accession AF447814.1, coordinates 37,300-38,300) with an unrelated sequence.

  32. [Figure: HMM diagram with states H and U; label 0.5] • We aligned the unrelated sequences with MUSCLE and counted the number of gap-open and gap-extend columns in the resulting alignment.
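
Counting gap columns in a pairwise alignment can be sketched as follows. The definitions used here are assumptions: a gap-open column is the first column of a run of gaps in one row, a gap-extend column is any later column of the same run, and columns with gaps in both rows are ignored. The function name is hypothetical.

```python
from typing import Optional, Tuple


def count_gap_columns(row_a: str, row_b: str) -> Tuple[int, int]:
    """Return (gap_open, gap_extend) column counts for two aligned rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    gap_open = 0
    gap_extend = 0
    previous: Optional[str] = None  # which row was gapped in the previous column
    for a, b in zip(row_a, row_b):
        if a == "-" and b != "-":
            current: Optional[str] = "a"
        elif b == "-" and a != "-":
            current = "b"
        else:
            current = None
        if current is not None:
            if current == previous:
                gap_extend += 1
            else:
                gap_open += 1
        previous = current
    return gap_open, gap_extend


opens, extends = count_gap_columns("AC--GTA-C", "ACTTG-ACC")
```

Dividing each count by the number of alignment columns gives empirical frequencies that could serve as the gap parameters of a state.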

  33. [Figure: HMM diagram with states H and U; label 0.5] • Gap-open and gap-extend frequencies for the Homologous state were estimated by constructing an alignment of 10 kb of orthologous sequence shared among a pair of divergent organisms.

  34. [Figure: HMM diagram with states H and U; label 0.5; state path UUUUUUUUUUUUUUUUUUHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH]