Transposable Elements (TE) in genomic sequence

Transposable Elements (TE) in genomic sequence. Mina Rho. Contents: Definition; De novo identification of repeat families in large genomes (RepeatScout), by Alkes L. Price, Neil C. Jones and Pavel A. Pevzner; Combined Evidence Annotation of Transposable Elements in Genome Sequences.

Contents
  • Definition
  • De novo identification of repeat families in large genomes (RepeatScout)
    • Alkes L. Price, Neil C. Jones and Pavel A. Pevzner
  • Combined Evidence Annotation of Transposable Elements in Genome Sequences
    • Hadi Quesneville, Casey M. Bergman, Olivier Andrieu, Delphine Autard, Danielle Nouaud, Michael Ashburner, Dominique Anxolabehere

Mobile element/Transposable element

Transposon

- a segment of DNA that can move around to different positions in the genome of a single cell.

- is cut out of its location and inserted into a new location (cut and paste).

- consists of DNA throughout (no RNA intermediate).

Retrotransposon

- is copied and pasted into a new location.

- the copy is made of RNA and is transcribed back into DNA by reverse transcriptase.

- many retrotransposons (the LTR retrotransposons) carry long terminal repeats (LTRs) at their ends.

=> Studying TEs is expected to yield information about evolution, mutation, and changes in the amount of DNA in the genome.

Definition
  • Repeat family: a collection of similar sequences which appear many times in a genome.
    • e.g. the Alu repeat family has over 1 million approximate occurrences in the human genome
    • repeats make up roughly 50% of the human genome
  • l-mer: a substring of length l
Background
  • Current methods for identifying repeat families
    • Library-based identification (given an existing library of repeat families)
      • RepeatMasker
    • De novo identification
      • REPuter (Kurtz et al., 2000)
      • RepeatFinder (Volfovsky et al., 2001)
      • RECON (Bao and Eddy, 2002)
      • RepeatGluer (Pevzner et al., 2004)
      • PILER (Edgar and Myers, 2005)
      • RepeatScout
Overview of RepeatScout
  • Method
    • Builds a table of high frequency l-mers as seeds
    • Extends each seed to a longer consensus sequence
  • Main advantage
    • an efficient method of similarity search which enables a rigorous definition of repeat boundaries.
How to create the l-mer table

[Figure: a window of length l slides along the sequence (positions i, i+1, i+2, ..., j, ..., k); each l-mer (l-mer1 to l-mer6) is stored in a hash table together with its frequency and the position of its last occurrence.]


Output of the l-mer table (columns: l-mer, frequency, position of last occurrence)

AAAAAAAAAAAGATA 8 2920943

AAAAAAAGGAAAGAA 5 2468525

AGGCTTGAACAATGG 3 1425014

AAAAAAAAGAAAGAA 62 3009663

GTTGGTTTCAAAGAA 7 2855871

AAAAAAAATTTTTTT 22 2992836

ATTCAAGTTAAATGG 4 1473342

ATTCAATGTAACCAC 3 1463008

ATGCATGCAATGCAT 9 1788944

ATGCATTTAAAAGAA 3 1464381

AAAAAACTCACTCCA 5 1489159
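
A minimal Python sketch of how such a table could be built: a dictionary keyed by l-mer that stores the frequency and the position of the last occurrence. The function name `build_lmer_table` and the toy sequence are illustrative assumptions; the sketch ignores reverse complements and overlapping occurrences, which a real tool would have to handle.

```python
def build_lmer_table(sequence, l):
    """Map each l-mer to [frequency, position of last occurrence].

    Hypothetical helper for illustration only; positions are 0-based.
    """
    table = {}
    for i in range(len(sequence) - l + 1):
        lmer = sequence[i:i + l]
        entry = table.setdefault(lmer, [0, -1])
        entry[0] += 1   # one more occurrence of this l-mer
        entry[1] = i    # remember where it was seen last
    return table


table = build_lmer_table("ACGTACGTAC", 4)
for lmer, (frequency, last_position) in sorted(table.items()):
    print(lmer, frequency, last_position)
```

Sorting the entries by frequency would then give the high-frequency l-mers that serve as seeds.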

How to build all positions of repeats

[Figure: the sequence is scanned again (positions i, i+1, i+2, ..., j, ..., k); for each l-mer in the hash table (l-mer1 to l-mer6), the positions at which it occurs (e.g. j, i, i+2, k) are recorded.]
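
The position lists can be sketched in Python as a dictionary from each l-mer to all of its start positions. This is only an illustration under assumptions: the name `index_lmer_positions` and the `min_count` cutoff are hypothetical, and holding every position in memory would not scale to a large genome without further care.

```python
from collections import defaultdict


def index_lmer_positions(sequence, l, min_count=2):
    """Return {l-mer: [start positions]} for l-mers seen at least min_count times.

    Hypothetical helper for illustration only; positions are 0-based.
    """
    positions = defaultdict(list)
    for i in range(len(sequence) - l + 1):
        positions[sequence[i:i + l]].append(i)
    # Keep only l-mers that are actually repeated.
    return {lmer: pos for lmer, pos in positions.items() if len(pos) >= min_count}


repeats = index_lmer_positions("ACGTACGTAC", 4)
for lmer, starts in sorted(repeats.items()):
    print(lmer, starts)
```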

[Figure: the genomic occurrences S1 to S5 of a high-frequency l-mer are aligned with the query sequence Q, which starts as l-mer1 and grows through Q1, Q2, Q3, Q4.]

Extending Q one nucleotide at a time, maximizing the objective function
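
The greedy extension can be sketched in Python under strong simplifications. This is not RepeatScout's implementation: it extends to the right only, scores each occurrence without gaps (in place of a fit-preferred alignment), and the names `extend_right`, `c`, and `patience` are assumptions made for illustration. The score being maximized is the sum of the non-negative per-occurrence scores minus `c` times the consensus length.

```python
def extend_right(sequence, starts, l, c=2, match=1, mismatch=-1,
                 max_extension=50, patience=5):
    """Greedily extend a seed consensus to the right, one nucleotide at a time.

    Simplified illustration: ungapped scoring, rightward extension only.
    `starts` holds the 0-based start positions of the seed l-mer.
    """
    def objective(per_occurrence, length):
        # Occurrences that no longer fit contribute 0 instead of a negative score.
        return sum(max(s, 0) for s in per_occurrence) - c * length

    consensus = sequence[starts[0]:starts[0] + l]
    scores = [l * match] * len(starts)  # every occurrence matches the seed exactly
    best_value = objective(scores, len(consensus))
    best_consensus = consensus

    for offset in range(l, l + max_extension):
        candidates = []
        for nt in "ACGT":
            new_scores = [
                s + (match
                     if (start + offset < len(sequence)
                         and sequence[start + offset] == nt)
                     else mismatch)
                for s, start in zip(scores, starts)
            ]
            value = objective(new_scores, len(consensus) + 1)
            candidates.append((value, nt, new_scores))
        # Pick the nucleotide that gives the highest objective value.
        value, nt, scores = max(candidates, key=lambda t: t[0])
        consensus += nt
        if value > best_value:
            best_value, best_consensus = value, consensus
        elif len(consensus) - len(best_consensus) >= patience:
            break  # no improvement for `patience` steps: stop extending
    return best_consensus


repeat = "ACGTGGATCCTA"
genome = "TTTT" + repeat + "CCCCCCCC" + repeat + "GAGAGAGA" + repeat + "TTTTTTTT"
print(extend_right(genome, [4, 24, 44], 4))
```

Because the per-length cost `c` must be outweighed by matching occurrences, the extension stops improving once fewer than about `c` copies still agree, which is what gives the consensus a defined boundary.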

Objective Function

|Q|: the length of the consensus sequence Q

C: minimum threshold on the number of repeat elements

a(Q, Sk): a pairwise fit-preferred alignment score between Q and the occurrence Sk

p: incomplete-fit penalty
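
With these symbols, the objective that the extension maximizes can be written as below. This is a reconstruction from the definitions above and from the form given in the RepeatScout paper, so it should be checked against that paper; n denotes the number of occurrences S_1, ..., S_n.

```latex
A(Q; S_1, \dots, S_n) \;=\; \sum_{k=1}^{n} \max\bigl(a(Q, S_k),\, 0\bigr) \;-\; C\,|Q|
```

The penalty p does not appear directly: it enters through a(Q, S_k), where it is charged to an alignment that stops before reaching the end of Q.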

Output of optimized Q

>R=0

GGCCGGGCGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGGCGGGCGGATCACTTGAGGTCAGGAGTTC

GAGACCAGCCTGGCCAACATGGTGAAACCCCGTCTCTACTAAAAATACAAAAATTAGCCGGGCGTGGTGGCGCGCGCCTG

TAATCCCAGCTACTCGGGAGGCTGAGGCAGGAGAATCGCTTGAACCCGGGAGGCGGAGGTTGCAGTGAGCCGAGATCGCG

CCACTGCACTCCAGCCTGGGCGACAGAGCGAGACTCCGTCTCAAAAAAAAAAAAAAAAAAAAAAAAAAA

>R=1

AAAAGGCAGCAGAAACCTCTGCAGACTTAAATGTCCCTGTCTGACAGCTTTGAAGAGAGTAGTGGTTCTCCCAGCACGCA

GCTGGAGATCTGAGAACGGACAGACTGCCTCCTCAAGTGGATCCCTGACCCCCGAGTAGCCTAACTGGGAGGCACCCCCC

AGTAGGGGCAGACTGACACCTCACACGGCCAGGTACTCCTCTGAGAAAAAACTTCCAGAGGAACAATCAGGCAGCAACAT

TTGCTGCTCACCAATATCCACTGTTCTGCAGCCTCTGCTGCTGATACCCAGGCAAACAGGGTCTGGAGTGGACCTCCAGC

AAACTCCAACAGACCTGCAGCTGAGGGTCCTGTCTGTTAGAAGGAAAACTAACAAACAGAAAGGACATCCACACCAAAAA

CCCATCTGTACGTCACCATCATCAAAGACCAAAAGTAGATAAAACCACAAAGATGGGGAAAAAACAGAGCAGAAAAACTG

GAAACTCTAAAAAGCAGAGCGCCTCTCCTCCTCCAAAGGAACGCAGCTCCTCACCAGCAACGGAACAAAGCTGGACGGAG

AATGACTTTGATGAGTTGAGAGAAGAAGGCTTCAGATGATCAAACTACTCCAAGCTAAAGGAGGAAATTCAAACCCATGG

CAAAGAAGTTAAAAACCTTGAAAAAAAATTAGACGAATGGATAACTAGAATAACCAATGCAGAGAAGTCCTTAAAGGAGC

TGATGGAGCTGAAAACCAAGGCTCGAGAACTACGTGAAGAATGCACAAGCCTCAGGAGCCGATGCGATCAACTGGAAGAA

AGGGTATCAGTGATGGAAGATCAAATGAATGAAATGAAGTGAGAAGAGAAGTTTAGAGAAAAAAGAATAAAAAGAAATGA

>R=2

TTTTTTTTTTTTTTTAGATGCGGGGTGTCACTGTGTTGCTCAGGCTGGTCTCAAACTCCTGGGCTCAAGTGATCCTCCCA

CCTCAGCCTCTTTAATAGATGCGATTA

>R=3

TTTTTATACATGCTGTAGACAATCAATTCACACCTGTACTTTTTTTTAAGGTTGTGTTATTGCACTTTTATACCTCTTGA

CTGGTAGCTGATTTCCTTGAATACCTGTAAGGTAATCACCGGCTCACCAATGAATGTGGTTTTAACAATGGCTCACAGTG

GCTTGGAAAGCCCTCATGGGAAGTATTTCTGAGGAAAAGTGGAGAGTGTGCAGGAATAGTTTTGAAAAACAGAGACAACC

GATGTCCTCCTTCCCTCCCTTGCCTCTCCTCATGTGCCAGGTTTTCTGTTTTCTCCACTATTACAGAATCACCATGTTGT

ATCCTGTGATGAAAAGTTTTTATCTCTTTAATCATCCCATTTCGTCCTCCAGACCTTTTTTTTTCTGGAAGGGTTGTAAG

CAGAAGGGACGAAACATCTTCAGAAAAACACATTATGATATAAACTTAGTGAAAAGATTCATCATATTTAAGAAATGGAC

AGGATGAAATCCTGAATTCATAAAAATTTTAAAAATCAGTTTACATAACATCCATCCCTTTTGTCTCTATCCCTTATCCA

Parameter setting and post processing
  • Parameter setting
    • Recommended minimum: l = 15
    • For the arbitrary length L,
    • Q is extended by up to 10,000 bp on each side
    • Remove repeat families with |Q| < 50 bp
  • Postprocessing
    • Tandem Repeats Finder, Nseg
      • Remove repeat families with >50% of their length annotated as low-complexity or tandem repeats
    • RepeatMasker
      • Mask the repeat families based on the library
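
The two removal rules above can be sketched as a small Python filter. The function names and the interval representation are assumptions for illustration; in practice the masked intervals would come from parsing the output of the tandem-repeat and low-complexity tools, which is not shown here.

```python
def masked_length(intervals):
    """Total length covered by half-open (start, end) intervals, merging overlaps."""
    total, covered_until = 0, 0
    for start, end in sorted(intervals):
        start = max(start, covered_until)  # skip the part already counted
        if end > start:
            total += end - start
            covered_until = end
    return total


def keep_family(consensus, masked_intervals, min_length=50, max_masked_fraction=0.5):
    """Apply the length and low-complexity filters to one consensus sequence.

    Hypothetical helper: masked_intervals are regions of the consensus flagged
    as low-complexity or tandem repeat by external tools.
    """
    if len(consensus) < min_length:
        return False
    return masked_length(masked_intervals) / len(consensus) <= max_masked_fraction


print(keep_family("A" * 100, [(0, 30), (20, 60)]))  # 60% masked
print(keep_family("A" * 100, [(0, 30)]))            # 30% masked
```

Merging overlapping intervals first avoids counting a region twice when two tools flag the same bases.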
Benchmark
  • C. briggsae genome (108 Mb)
  • Running time: 7 h on a single 0.5 GHz DEC Alpha processor
Overview

Query Sequences: Drosophila melanogaster (Fruit fly) Release 3, 4

Combined evidence model: pipeline of RepeatMasker, BLASTER, TBLASTX, all-by-all BLASTN, RECON, and TE-HMM

- Methods for the annotation of known TE families

- Methods for the annotation of anonymous TE families

Benchmark : FlyBase Release 3.1 annotation

Evaluation criteria: sensitivity, specificity, and characteristics of the predicted TE boundaries

Tools
  • BLASTER
    • Compares query sequences against a subject databank.
    • Launches one of the BLAST programs (BLASTN, TBLASTN, BLASTX, TBLASTX).
    • Cuts long sequences before launching BLAST and reassembles the results.
  • MATCHER
    • Maps match results onto query sequences by filtering overlapping hits.
    • Keeps the match results with E-value < 10^-10 and length > 20.
    • Chains the remaining matches by dynamic programming.
  • GROUPER
    • Gathers similar sequences into groups.
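
The MATCHER thresholds can be illustrated with a short Python filter. The `Hit` record and `filter_hits` function are hypothetical and do not reflect MATCHER's actual data model; the chaining step by dynamic programming is not shown.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical match record: half-open query coordinates and a BLAST E-value."""
    query_start: int
    query_end: int
    e_value: float


def filter_hits(hits, max_e_value=1e-10, min_length=20):
    """Keep hits with E-value below the cutoff and length above the minimum."""
    return [
        h for h in hits
        if h.e_value < max_e_value and (h.query_end - h.query_start) > min_length
    ]


hits = [Hit(0, 100, 1e-30), Hit(0, 15, 1e-30), Hit(0, 100, 1e-5)]
print(filter_hits(hits))
```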
Measures

For each nucleotide,

  • TP: correctly annotated as belonging to a TE
  • FP: falsely predicted as belonging to a TE
  • TN: correctly annotated as not belonging to a TE
  • FN: falsely predicted as not belonging to a TE
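
From these four counts, sensitivity and specificity can be computed per nucleotide. The sketch below uses the common definitions, sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP); the slides do not spell out the formulas, so this choice is an assumption, as are the function name and the boolean-mask representation.

```python
def nucleotide_metrics(predicted, reference):
    """Compare two per-nucleotide boolean masks (True = belongs to a TE).

    Hypothetical helper; assumes both masks have the same length and that
    each class (TE and non-TE) occurs at least once in the reference.
    """
    pairs = list(zip(predicted, reference))
    tp = sum(1 for p, r in pairs if p and r)
    fp = sum(1 for p, r in pairs if p and not r)
    tn = sum(1 for p, r in pairs if not p and not r)
    fn = sum(1 for p, r in pairs if not p and r)
    return {
        "TP": tp, "FP": fp, "TN": tn, "FN": fn,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


predicted = [True, True, False, False, True, False]
reference = [True, False, False, False, True, True]
print(nucleotide_metrics(predicted, reference))
```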
Method for the Annotation of known TE families
  • BLASTER using BLASTN and MATCHER (BLRn)
  • RepeatMasker (RM)
  • RepeatMasker with MATCHER (RMm)
  • RepeatMasker-BLASTER (RMBLR): combines the hits from both BLRn and RM and passes them to MATCHER
Method for the Annotation of anonymous TE families
  • all-by-all comparison with BLASTER using BLASTN, MATCHER, and GROUPER
  • RECON
  • BLASTER using TBLASTX and MATCHER
  • TE-HMM
What they (we) learned
  • Overall, BLRn outperforms RM with respect to the precise determination of TE boundaries.
  • RM is more sensitive for the detection of small and divergent TEs.
  • The differences between BLRn and RM make them complementary for TE annotation.
  • A combined-evidence framework can improve the quality and confidence of TE annotation.
Pipeline structure
  • TE detection software: BLASTER, RepeatMasker, TE-HMM, and RECON
  • Tandem repeat detection software: RepeatMasker, Tandem Repeats Finder (TRF), Mreps
  • Database: MySQL
  • Job scheduling: Open Portable Batch System
  • The whole genomic sequence was segmented into chunks of 200 kb overlapping by 10 kb.
  • The results from the different tools were stored in the database.
  • An XML file is generated from the stored results and loaded into the Apollo genome annotation tool.
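
The segmentation step can be sketched in Python as a function that yields chunk coordinates. It assumes that each chunk is 200 kb long including its 10 kb overlap with the previous chunk; the slides do not say how the overlap is counted, and the function name is hypothetical.

```python
def chunk_coordinates(genome_length, chunk_size=200_000, overlap=10_000):
    """Return half-open (start, end) coordinates of overlapping chunks.

    Illustrative sketch: consecutive chunks share `overlap` bases, and the
    last chunk is shortened to end at the end of the sequence.
    """
    if genome_length <= 0:
        return []
    step = chunk_size - overlap
    chunks = []
    start = 0
    while True:
        end = min(start + chunk_size, genome_length)
        chunks.append((start, end))
        if end == genome_length:
            break
        start += step
    return chunks


for start, end in chunk_coordinates(500_000):
    print(start, end)
```

The overlap ensures that a TE spanning a chunk boundary is seen whole in at least one chunk, provided it is shorter than the overlap.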