1 / 60

De novo identification of repeat families in large genomes

De novo identification of repeat families in large genomes. Alkes L. Price, Neil C. Jones and Pavel A. Pevzner June 28, 2005. What is a repeat family?. A repeat family is a collection of similar sequences which appear many times in a genome.

coty
Download Presentation

De novo identification of repeat families in large genomes

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. De novo identification of repeat families in large genomes Alkes L. Price, Neil C. Jones and Pavel A. Pevzner June 28, 2005

  2. What is a repeat family? A repeat family is a collection of similar sequences which appear many times in a genome. For example, the Alu repeat family has over 1 million approximate occurrences in the human genome: Alu Alu Alu Alu Alu

  3. Identifying repeat families: problem formulation INPUT: Genome containing approximate Alu occurrences Alu Alu Alu Alu Alu OUTPUT: 282bp Alu consensus sequence GGCCGGGCGCGGTGGCTCACG………..GCGAGACTCCGTCTC + consensus sequences of all other repeat families in genome

  4. Identifying repeat families: an easy problem? Alu Alu Alu Alu Alu

  5. Identifying repeat families: an easy problem? Alu Alu Alu Alu Alu Alu Alu Alu Alu Alu

  6. Identifying repeat families: an easy problem? Alu Alu Alu Alu Alu Alu Alu Alu Alu Alu Alu Alu Alu Alu Alu Alu consensus

  7. Identifying repeat families: an easy problem? Alu Alu Alu Alu Alu Alu Alu Alu Alu Alu Alu Alu Alu Alu Alu Alu consensus Difficulties:

  8. Identifying repeat families: an easy problem? Alu Alu Alu Alu Alu Alu Alu Alu Alu Alu Alu Alu Alu Alu Alu Alu consensus • Difficulties: • Regions containing repeat occurrences are not known a priori

  9. Identifying repeat families: an easy problem? Alu Alu Alu Alu Alu Alu Alu Alu Alu Alu Alu Alu Alu Alu Alu Alu consensus • Difficulties: • Regions containing repeat occurrences are not known a priori • Repeat boundaries are not known a priori

  10. Identifying repeat families: an easy problem? Alu Alu Alu Alu Alu Alu Alu Alu Alu Alu Alu Alu Alu Alu Alu Alu consensus • Difficulties: • Regions containing repeat occurrences are not known a priori • Repeat boundaries are not known a priori • Many repeat occurrences appear as partial copies

  11. Identifying repeat families: a difficult problem “The problem of automated repeat sequence family classification is inherently messy and ill-defined and does not appear to be amenable to a clean algorithmic attack.” Bao and Eddy, 2002 In this talk, we present a simple and efficient algorithm for solving this problem.

  12. Why is identifying repeat families important? 1. Repeats are biologically meaningful Repeats are drivers of genome evolution (Kazazian, 2004) which can play a beneficial (rather than parasitic) role (Holmes, 2002). In particular, repeats have been implicated in • Genome rearrangements (Kazazian, 2004) • Drift to new biological function (Kidwell and Lisch, 2001) • Increased rate of evolution under stress (Capy et al, 2000)

  13. Why is identifying repeat families important? 2. Repeat masking • Repeats need to be masked prior to performing most single-species or multi-species analyses. “Every time we compare two species that are closer to each other than either is to humans, we get nearly killed by unmasked repeats.” Webb Miller (personal communication)

  14. Why is identifying repeat families important? • Repeats need to be masked prior to performing most single-species or multi-species analyses. GENE1 GENE2

  15. Why is identifying repeat families important? • If repeat families are known, repeats can be masked using RepeatMasker (http://www.repeatmasker.org). GENE1 GENE2

  16. Why is identifying repeat families important? • If repeat families are known … GENE1 GENE2

  17. Identifying repeat families: manual approaches • For widely studied genomes such as human and mouse, libraries of repeat families have been manually curated: • Repbase Update library (http://www.girinst.org) • RepeatMasker library (http://www.repeatmasker.org)

  18. Identifying repeat families: algorithmic approaches • Many, many new genomes are being assembled. How to identify the repeat families present in these genomes? Clearly, algorithmic approaches are needed.

  19. Identifying repeat families: algorithmic approaches All existing algorithms for de novo identification of repeat families rely on a set of pairwise similarities: • Single-linkage clustering (Agarwal and States, 1994) • REPuter (Kurtz et al., 2000) • RepeatFinder (Volfovsky et al., 2001) • RECON (Bao and Eddy, 2002) • RepeatGluer (Pevzner et al., 2004) • PILER (Edgar and Myers, 2005)

  20. Identifying repeat families: algorithmic approaches Disadvantages of using pairwise similarities: • Computational intractability • human genome: ~106 Alus => ~1012 pairwise alignments • Difficulty defining repeat boundaries • “Local sequence alignments do not usually correspond • to the biological boundaries … Difficulty in defining • element boundaries causes problems in clustering • related elements into families.” • Bao and Eddy, 2002

  21. Identifying repeat families: algorithmic approaches Disadvantages of using pairwise similarities: • Computational intractability • Difficulty defining repeat boundaries Our RepeatScout algorithm uses an efficient method of similarity search which enables a rigorous definition of repeat boundaries.

  22. RepeatScout: the main idea Consider a repeat family with many occurrences in a genome: Equivalently, we have: TAGCACCTTAGGGCGTCTCGCAACGTCTGCCCACGAACGTTAATCAGTAA GATTATCATGAAGCGCTTCGCAACGTCTGCAGCTGTCCAGACCGCTGTCA TATATCCGGTAATCGCCCCGCAACGTCTGCTAACGGGCGTACGGTCGAAT TGACCTGCTCAGGAGCCTTGCAACGTCTGCTCGCGGATGTGTATGCACGC ATCCATGCTCGGTATGAATCCAACGTCTGCTCATGGACATCTCATGACGT CGATCCTCTGAGGCACCTCACAACGTCTGCTCACTGACGCACGGTTGCTG

  23. RepeatScout: the main idea TAGCACCTTAGGGCGTCTCGCAACGTCTGCCCACGAACGTTAATCAGTAA GATTATCATGAAGCGCTTCGCAACGTCTGCAGCTGTCCAGACCGCTGTCA TATATCCGGTAATCGCCCCGCAACGTCTGCTAACGGGCGTACGGTCGAAT TGACCTGCTCAGGAGCCTTGCAACGTCTGCTCGCGGATGTGTATGCACGC ATCCATGCTCGGTATGAATCCAACGTCTGCTCATGGACATCTCATGACGT CGATCCTCTGAGGCACCTCACAACGTCTGCTCACTGACGCACGGTTGCTG Consensus: ?

  24. RepeatScout: the main idea TAGCACCTTAGGGCGTCTCGCAACGTCTGCCCACGAACGTTAATCAGTAA GATTATCATGAAGCGCTTCGCAACGTCTGCAGCTGTCCAGACCGCTGTCA TATATCCGGTAATCGCCCCGCAACGTCTGCTAACGGGCGTACGGTCGAAT TGACCTGCTCAGGAGCCTTGCAACGTCTGCTCGCGGATGTGTATGCACGC ATCCATGCTCGGTATGAATCCAACGTCTGCTCATGGACATCTCATGACGT CGATCCTCTGAGGCACCTCACAACGTCTGCTCACTGACGCACGGTTGCTG Consensus: ?

  25. RepeatScout: the main idea TAGCACCTTAGGGCGTCTCGCAACGTCTGCCCACGAACGTTAATCAGTAA GATTATCATGAAGCGCTTCGCAACGTCTGCAGCTGTCCAGACCGCTGTCA TATATCCGGTAATCGCCCCGCAACGTCTGCTAACGGGCGTACGGTCGAAT TGACCTGCTCAGGAGCCTTGCAACGTCTGCTCGCGGATGTGTATGCACGC ATCCATGCTCGGTATGAATCCAACGTCTGCTCATGGACATCTCATGACGT CGATCCTCTGAGGCACCTCACAACGTCTGCTCACTGACGCACGGTTGCTG Consensus: CAACGTCTGC Idea: greedily extend 1 bp at a time from short l-mer seed

  26. RepeatScout: the main idea TAGCACCTTAGGGCGTCTCGCAACGTCTGCCCACGAACGTTAATCAGTAA GATTATCATGAAGCGCTTCGCAACGTCTGCAGCTGTCCAGACCGCTGTCA TATATCCGGTAATCGCCCCGCAACGTCTGCTAACGGGCGTACGGTCGAAT TGACCTGCTCAGGAGCCTTGCAACGTCTGCTCGCGGATGTGTATGCACGC ATCCATGCTCGGTATGAATCCAACGTCTGCTCATGGACATCTCATGACGT CGATCCTCTGAGGCACCTCACAACGTCTGCTCACTGACGCACGGTTGCTG Consensus: CAACGTCTGCT Idea: greedily extend 1 bp at a time from short l-mer seed

  27. RepeatScout: the main idea TAGCACCTTAGGGCGTCTCGCAACGTCTGCCCACGAACGTTAATCAGTAA GATTATCATGAAGCGCTTCGCAACGTCTGCAGCTGTCCAGACCGCTGTCA TATATCCGGTAATCGCCCCGCAACGTCTGCTAACGGGCGTACGGTCGAAT TGACCTGCTCAGGAGCCTTGCAACGTCTGCTCGCGGATGTGTATGCACGC ATCCATGCTCGGTATGAATCCAACGTCTGCTCATGGACATCTCATGACGT CGATCCTCTGAGGCACCTCACAACGTCTGCTCACTGACGCACGGTTGCTG Consensus: CAACGTCTGCTC Idea: greedily extend 1 bp at a time from short l-mer seed

  28. RepeatScout: the main idea TAGCACCTTAGGGCGTCTCGCAACGTCTGCCCACGAACGTTAATCAGTAA GATTATCATGAAGCGCTTCGCAACGTCTGCAGCTGTCCAGACCGCTGTCA TATATCCGGTAATCGCCCCGCAACGTCTGCTAACGGGCGTACGGTCGAAT TGACCTGCTCAGGAGCCTTGCAACGTCTGCTCGCGGATGTGTATGCACGC ATCCATGCTCGGTATGAATCCAACGTCTGCTCATGGACATCTCATGACGT CGATCCTCTGAGGCACCTCACAACGTCTGCTCACTGACGCACGGTTGCTG Consensus: CAACGTCTGCTCA Idea: greedily extend 1 bp at a time from short l-mer seed Discard a sequence after it stops aligning to consensus

  29. RepeatScout: the main idea TAGCACCTTAGGGCGTCTCGCAACGTCTGCCCACGAACGTTAATCAGTAA GATTATCATGAAGCGCTTCGCAACGTCTGCAGCTGTCCAGACCGCTGTCA TATATCCGGTAATCGCCCCGCAACGTCTGCTAACGGGCGTACGGTCGAAT TGACCTGCTCAGGAGCCTTGCAACGTCTGCTCGCGGATGTGTATGCACGC ATCCATGCTCGGTATGAATCCAACGTCTGCTCATGGACATCTCATGACGT CGATCCTCTGAGGCACCTCACAACGTCTGCTCACTGACGCACGGTTGCTG Consensus: CAACGTCTGCTCAC Idea: greedily extend 1 bp at a time from short l-mer seed Discard a sequence after it stops aligning to consensus

  30. RepeatScout: the main idea TAGCACCTTAGGGCGTCTCGCAACGTCTGCCCACGAACGTTAATCAGTAA GATTATCATGAAGCGCTTCGCAACGTCTGCAGCTGTCCAGACCGCTGTCA TATATCCGGTAATCGCCCCGCAACGTCTGCTAACGGGCGTACGGTCGAAT TGACCTGCTCAGGAGCCTTGCAACGTCTGCTCGCGGATGTGTATGCACGC ATCCATGCTCGGTATGAATCCAACGTCTGCTCATGGACATCTCATGACGT CGATCCTCTGAGGCACCTCACAACGTCTGCTCACTGACGCACGGTTGCTG Consensus: CAACGTCTGCTCACG Idea: greedily extend 1 bp at a time from short l-mer seed Discard a sequence after it stops aligning to consensus

  31. RepeatScout: the main idea TAGCACCTTAGGGCGTCTCGCAACGTCTGCCCACGAACGTTAATCAGTAA GATTATCATGAAGCGCTTCGCAACGTCTGCAGCTGTCCAGACCGCTGTCA TATATCCGGTAATCGCCCCGCAACGTCTGCTAACGGGCGTACGGTCGAAT TGACCTGCTCAGGAGCCTTGCAACGTCTGCTCGCGGATGTGTATGCACGC ATCCATGCTCGGTATGAATCCAACGTCTGCTCATGGACATCTCATGACGT CGATCCTCTGAGGCACCTCACAACGTCTGCTCACTGACGCACGGTTGCTG Consensus: CAACGTCTGCTCACGG Idea: greedily extend 1 bp at a time from short l-mer seed Discard a sequence after it stops aligning to consensus

  32. RepeatScout: the main idea TAGCACCTTAGGGCGTCTCGCAACGTCTGCCCACGAACGTTAATCAGTAA GATTATCATGAAGCGCTTCGCAACGTCTGCAGCTGTCCAGACCGCTGTCA TATATCCGGTAATCGCCCCGCAACGTCTGCTAACGGGCGTACGGTCGAAT TGACCTGCTCAGGAGCCTTGCAACGTCTGCTCGCGGATGTGTATGCACGC ATCCATGCTCGGTATGAATCCAACGTCTGCTCATGGACATCTCATGACGT CGATCCTCTGAGGCACCTCACAACGTCTGCTCACTGACGCACGGTTGCTG Consensus: CAACGTCTGCTCACGGA Idea: greedily extend 1 bp at a time from short l-mer seed Discard a sequence after it stops aligning to consensus

  33. RepeatScout: the main idea TAGCACCTTAGGGCGTCTCGCAACGTCTGCCCACGAACGTTAATCAGTAA GATTATCATGAAGCGCTTCGCAACGTCTGCAGCTGTCCAGACCGCTGTCA TATATCCGGTAATCGCCCCGCAACGTCTGCTAACGGGCGTACGGTCGAAT TGACCTGCTCAGGAGCCTTGCAACGTCTGCTCGCGGATGTGTATGCACGC ATCCATGCTCGGTATGAATCCAACGTCTGCTCATGGACATCTCATGACGT CGATCCTCTGAGGCACCTCACAACGTCTGCTCACTGACGCACGGTTGCTG Consensus: CAACGTCTGCTCACGGAC Idea: greedily extend 1 bp at a time from short l-mer seed Discard a sequence after it stops aligning to consensus

  34. RepeatScout: the main idea TAGCACCTTAGGGCGTCTCGCAACGTCTGCCCACGAACGTTAATCAGTAA GATTATCATGAAGCGCTTCGCAACGTCTGCAGCTGTCCAGACCGCTGTCA TATATCCGGTAATCGCCCCGCAACGTCTGCTAACGGGCGTACGGTCGAAT TGACCTGCTCAGGAGCCTTGCAACGTCTGCTCGCGGATGTGTATGCACGC ATCCATGCTCGGTATGAATCCAACGTCTGCTCATGGACATCTCATGACGT CGATCCTCTGAGGCACCTCACAACGTCTGCTCACTGACGCACGGTTGCTG Consensus: CAACGTCTGCTCACGGACG Idea: greedily extend 1 bp at a time from short l-mer seed Discard a sequence after it stops aligning to consensus

  35. RepeatScout: the main idea TAGCACCTTAGGGCGTCTCGCAACGTCTGCCCACGAACGTTAATCAGTAA GATTATCATGAAGCGCTTCGCAACGTCTGCAGCTGTCCAGACCGCTGTCA TATATCCGGTAATCGCCCCGCAACGTCTGCTAACGGGCGTACGGTCGAAT TGACCTGCTCAGGAGCCTTGCAACGTCTGCTCGCGGATGTGTATGCACGC ATCCATGCTCGGTATGAATCCAACGTCTGCTCATGGACATCTCATGACGT CGATCCTCTGAGGCACCTCACAACGTCTGCTCACTGACGCACGGTTGCTG Consensus: CAACGTCTGCTCACGGACGT Idea: greedily extend 1 bp at a time from short l-mer seed Discard a sequence after it stops aligning to consensus

  36. RepeatScout: the main idea TAGCACCTTAGGGCGTCTCGCAACGTCTGCCCACGAACGTTAATCAGTAA GATTATCATGAAGCGCTTCGCAACGTCTGCAGCTGTCCAGACCGCTGTCA TATATCCGGTAATCGCCCCGCAACGTCTGCTAACGGGCGTACGGTCGAAT TGACCTGCTCAGGAGCCTTGCAACGTCTGCTCGCGGATGTGTATGCACGC ATCCATGCTCGGTATGAATCCAACGTCTGCTCATGGACATCTCATGACGT CGATCCTCTGAGGCACCTCACAACGTCTGCTCACTGACGCACGGTTGCTG Consensus: CAACGTCTGCTCACGGACGT Idea: greedily extend 1 bp at a time from short l-mer seed Discard a sequence after it stops aligning to consensus Stop extending when most sequences no longer align

  37. RepeatScout: the main idea TAGCACCTTAGGGCGTCTCGCAACGTCTGCCCACGAACGTTAATCAGTAA GATTATCATGAAGCGCTTCGCAACGTCTGCAGCTGTCCAGACCGCTGTCA TATATCCGGTAATCGCCCCGCAACGTCTGCTAACGGGCGTACGGTCGAAT TGACCTGCTCAGGAGCCTTGCAACGTCTGCTCGCGGATGTGTATGCACGC ATCCATGCTCGGTATGAATCCAACGTCTGCTCATGGACATCTCATGACGT CGATCCTCTGAGGCACCTCACAACGTCTGCTCACTGACGCACGGTTGCTG Consensus: CAACGTCTGCTCACGGACGTACGGT Idea: greedily extend 1 bp at a time from short l-mer seed Discard a sequence after it stops aligning to consensus Stop extending when most sequences no longer align Note: pairwise alignment is a poor boundary criteria.

  38. RepeatScout: the main idea TAGCACCTTAGGGCGTCTCGCAACGTCTGCCCACGAACGTTAATCAGTAA GATTATCATGAAGCGCTTCGCAACGTCTGCAGCTGTCCAGACCGCTGTCA TATATCCGGTAATCGCCCCGCAACGTCTGCTAACGGGCGTACGGTCGAAT TGACCTGCTCAGGAGCCTTGCAACGTCTGCTCGCGGATGTGTATGCACGC ATCCATGCTCGGTATGAATCCAACGTCTGCTCATGGACATCTCATGACGT CGATCCTCTGAGGCACCTCACAACGTCTGCTCACTGACGCACGGTTGCTG Consensus: AGGCGCCTCGCAACGTCTGCTCACGGACGT Idea: greedily extend 1 bp at a time from short l-mer seed Discard a sequence “after it stops aligning to consensus” Stop extending “when most sequences no longer align” First extend right, then extend left in similar manner

  39. Repeat boundaries: the objective function Let S1, …, Snbe strings containing occurrences of a repeat family which share a short l-mer seed. We define the consensus sequence Q of the repeat family to be the sequence which maximizes A(Q; S1, …, Sn) = ∑ka(Q, Sk) where a(Q, Sk) is a fit-preferred alignment score

  40. Repeat boundaries: the objective function Let S1, …, Snbe strings containing occurrences of a repeat family which share a short l-mer seed. We define the consensus sequence Q of the repeat family to be the sequence which maximizes A(Q; S1, …, Sn) = ∑ka(Q, Sk) – c |Q| where a(Q, Sk) is a fit-preferred alignment score c is a repeat frequency threshold

  41. Repeat boundaries: the objective function • A(Q; S1, …, Sn) = ∑ka(Q, Sk) – c |Q| Optimizing the objective function: • Start with Q = short l-mer seed • Greedily extend Q to the right (left) 1 bp at a time. Stop when + many consecutive iterations fail to improve upon the optimal Q. The optimal Q defines the consensus sequence of the repeat family. This provides a rigorous definition of repeat boundaries.

  42. Repeat boundaries: the objective function TAGCACCTTAGGGCGTCTCGCAACGTCTGCCCACGAACGTTAATCAGTAA GATTATCATGAAGCGCTTCGCAACGTCTGCAGCTGTCCAGACCGCTGTCA TATATCCGGTAATCGCCCCGCAACGTCTGCTAACGGGCGTACGGTCGAAT TGACCTGCTCAGGAGCCTTGCAACGTCTGCTCGCGGATGTGTATGCACGC ATCCATGCTCGGTATGAATCCAACGTCTGCTCATGGACATCTCATGACGT CGATCCTCTGAGGCACCTCACAACGTCTGCTCACTGACGCACGGTTGCTG Consensus: AGGCGCCTCGCAACGTCTGCTCACGGACGT Greedily extend right/left to optimize A(Q, S1, …, Sn)

  43. RepeatScout: finding all repeat families To find all repeat families in a genome, we could apply this procedure to extend all frequent l-mers.

  44. RepeatScout: finding all repeat families To find all repeat families in a genome, we could apply this procedure to extend all frequent l-mers. However, each repeat family spawns a large number of frequent l-mers and could be repeatedly rediscovered.

  45. RepeatScout: finding all repeat families To find all repeat families in a genome, we could apply this procedure to extend all frequent l-mers. However, each repeat family spawns a large number of frequent l-mers and could be repeatedly rediscovered. To address this, we dynamically adjust l-mer frequencies to exclude contributions from repeat families we have already identified.

  46. RepeatScout: postprocessing We discard very short “repeat families” arising from spurious frequent l-mers. We discard repeat families with less than 10 copies. We may further wish to distinguish between • Low-complexity repeat families • Tandem repeat families • Multicopy exon families • Segmental duplication units • Transposon families

  47. Results: the human Alu family Input: Genome containing approximate Alu occurrences Alu Alu Alu Alu Alu Desired Output: 282bp Alu consensus sequence GGCCGGGCGCGGTGGCTCACG………..GCGAGACTCCGTCTC

  48. Results: the human Alu family Input: Genome containing approximate Alu occurrences Alu Alu Alu Alu Alu Desired Output: 282bp Alu consensus sequence GGCCGGGCGCGGTGGCTCACG………..GCGAGACTCCGTCTC RepeatScout Output (on human X chr): 282bp sequence GGCCGGGCGCGGTGGCTCACG………..GCGAGACTCCGTCTC

  49. Results: C. briggsae We benchmarked RepeatScout using the 108Mb C. briggsae genome (Stein et al., 2003), which Stein et al. analyzed using the RECON algorithm (Bao and Eddy, 2002). We ran RepeatMasker (http://www.repeatmasker.org) using either the RECON repeat library or the RepeatScout library as input, and compared the results:

  50. Results: C. briggsae RECONRepeatScout library library 2.0 Mb 23.1 Mb 4.8 Mb

More Related