Assembly of Paired-end Solexa Reads by Kmer Extension using Base Qualities

Zemin Ning, The Wellcome Trust Sanger Institute

Presentation Transcript


  1. Assembly of Paired-end Solexa Reads by Kmer Extension using Base Qualities
     Zemin Ning, The Wellcome Trust Sanger Institute

  2. Outline of the Talk:
     • Euler Path and Sequence Reconstruction
     • Euler Hash Table
     • Read Extension
     • Using Base Qualities and Read Pairs
     • Repeat Junctions and Single Base Variation
     • Assembly Results
     • Future Work

  3. Sequence Reconstruction - Hamiltonian path approach
     S = (ATGCAGGTCC)
     ATG -> TGC -> GCA -> CAG -> AGG -> GGT -> GTC -> TCC
     • Vertices: k-tuples from the spectrum, shown in red (8): ATG, AGG, TGC, TCC, GTC, GGT, GCA, CAG;
     • Edges: overlapping k-tuples (7);
     • Path: visiting all vertices corresponding to the sequence.
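The Hamiltonian path view on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`spectrum`, `hamiltonian_paths`, `spell`) are hypothetical, the search is plain backtracking, and the k-tuples are assumed to be distinct, as they are for S.

```python
def spectrum(seq, k):
    """Return the k-tuples of seq in order of appearance."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def hamiltonian_paths(kmers):
    """Yield every ordering that visits each distinct k-tuple exactly once,
    where consecutive k-tuples overlap by k-1 bases."""
    def extend(path, remaining):
        if not remaining:
            yield path
            return
        for nxt in sorted(remaining):
            if path[-1][1:] == nxt[:-1]:  # suffix of last == prefix of next
                yield from extend(path + [nxt], remaining - {nxt})

    vertices = set(kmers)
    for start in sorted(vertices):
        yield from extend([start], vertices - {start})


def spell(path):
    """Rebuild the sequence spelled by a path of overlapping k-tuples."""
    return path[0] + "".join(kmer[-1] for kmer in path[1:])


kmers = spectrum("ATGCAGGTCC", 3)
paths = list(hamiltonian_paths(kmers))
sequences = sorted({spell(p) for p in paths})
print(len(kmers), "vertices;", "reconstructions:", sequences)
```

Backtracking over vertices is fine for eight k-tuples but grows exponentially in general, which is the usual motivation for moving to the Euler path formulation.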

  4. Sequence Reconstruction - Euler path approach
     ATG -> TGG -> GGC -> GCG -> CGT -> GTG -> TGC -> GCA
     Sequences: ATGCGTGGCA, ATGGCGTGCA
     • Vertices: correspond to (k-1)-tuples (7): CG, GT, GC, AT, TG, CA, GG;
     • Edges: correspond to k-tuples from the spectrum (8);
     • Path: visiting all EDGES corresponding to the sequence.
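A small sketch of the Euler path view, assuming the eight k-tuples listed on this slide as the spectrum. Each k-tuple is treated as an edge from its (k-1)-prefix to its (k-1)-suffix, and every path that uses each edge once is enumerated by backtracking. The helper names are hypothetical; production assemblers use linear-time Euler path algorithms rather than exhaustive search.

```python
def euler_paths(kmers):
    """Yield every ordering of the k-tuples in which each one (an edge from its
    (k-1)-prefix to its (k-1)-suffix) is used exactly once."""
    def walk(vertex, unused, path):
        if not unused:
            yield path
            return
        for i in sorted(unused):
            if kmers[i][:-1] == vertex:  # edge i leaves the current vertex
                yield from walk(kmers[i][1:], unused - {i}, path + [kmers[i]])

    all_edges = frozenset(range(len(kmers)))
    for start in sorted({kmer[:-1] for kmer in kmers}):
        yield from walk(start, all_edges, [])


def spell(path):
    """Rebuild the sequence spelled by a path of overlapping k-tuples."""
    return path[0] + "".join(kmer[-1] for kmer in path[1:])


SPECTRUM = ["ATG", "TGG", "GGC", "GCG", "CGT", "GTG", "TGC", "GCA"]
vertices = {kmer[:-1] for kmer in SPECTRUM} | {kmer[1:] for kmer in SPECTRUM}
reconstructions = sorted({spell(p) for p in euler_paths(SPECTRUM)})
print(len(vertices), "vertices;", len(SPECTRUM), "edges;", reconstructions)
```

Both sequences shown on the slide come out of the same spectrum, because the vertex TG has two outgoing edges (TGG and TGC) and either can be taken first.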

  5. SSAHA Type Hash Table
     S1 = (ATGCAGGTCC), S2 = (ATCAGGTCC), S3 = (ATGCTAGGTCC), S4 = (ATGCAGTTCC)

     E    k-tuple   Indices, offsets and links to the next
     7    ATG       1,1,28   3,1,28   4,1,28
     8    ATC       2,1,29
     10   AGT       4,5,38
     11   AGG       1,5,42   2,4,42   3,6,42
     19   TAG       3,5,11
     24   TTC       4,7,32
     28   TGC       1,2,45   3,2,46   4,2,45
     29   TCA       2,2,51
     32   TCC       1,8,-1   2,7,-1   3,9,-1   4,8,-1
     38   GTT       4,6,24
     40   GTC       1,7,32   2,6,32   3,8,32
     42   GGT       1,6,40   2,5,40   3,7,40
     45   GCA       1,3,51   4,3,51
     46   GCT       3,3,53
     51   CAG       1,4,11   2,3,11   4,4,10
     53   CTA       3,4,19
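The table on this slide can be rebuilt from the four sequences with a short script. The encoding used here (A=0, T=1, G=2, C=3, read as a base-4 number plus one) is an assumption inferred from the E column, not something the slide states, and `encode` and `build_table` are hypothetical names. Each entry holds the sequence index, the 1-based offset of the k-tuple, and the E value of the k-tuple that follows it in the same sequence (-1 at the end).

```python
CODE = {"A": 0, "T": 1, "G": 2, "C": 3}  # assumed: inferred from the E column
SEQS = ["ATGCAGGTCC", "ATCAGGTCC", "ATGCTAGGTCC", "ATGCAGTTCC"]  # S1..S4


def encode(kmer):
    """Map a k-tuple to its E value: base-4 number of the k-tuple, plus one."""
    value = 0
    for base in kmer:
        value = value * 4 + CODE[base]
    return value + 1


def build_table(seqs, k):
    """Return {E: (k-tuple, [(sequence index, offset, E of next k-tuple), ...])}."""
    table = {}
    for index, seq in enumerate(seqs, start=1):
        kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        for offset, kmer in enumerate(kmers, start=1):
            # kmers is 0-based, so kmers[offset] is the k-tuple after this one
            link = encode(kmers[offset]) if offset < len(kmers) else -1
            entry = table.setdefault(encode(kmer), (kmer, []))
            entry[1].append((index, offset, link))
    return table


table = build_table(SEQS, 3)
for e in sorted(table):
    kmer, entries = table[e]
    print(e, kmer, "  ".join(",".join(map(str, t)) for t in entries))
```

Under this encoding the k-tuple at offset 4 of S3 is CTA with E = 53, which is the value the GCT entry links to.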

  6. Point to the Next - Hash Table Links
     Same table and sequences as slide 5. The third number in each entry is the E value of the k-tuple that follows in the same sequence; -1 marks the end of the sequence.
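A minimal sketch of how the links might be followed, using a hand-copied subset of the table rows (those touched by S1). `walk` and `successors` are hypothetical helpers, and the walk assumes each k-tuple occurs at most once per sequence, which holds for these toy sequences.

```python
# Subset of the slide's table: E -> (k-tuple, [(sequence index, offset, link)])
TABLE = {
    7: ("ATG", [(1, 1, 28), (3, 1, 28), (4, 1, 28)]),
    11: ("AGG", [(1, 5, 42), (2, 4, 42), (3, 6, 42)]),
    28: ("TGC", [(1, 2, 45), (3, 2, 46), (4, 2, 45)]),
    32: ("TCC", [(1, 8, -1), (2, 7, -1), (3, 9, -1), (4, 8, -1)]),
    40: ("GTC", [(1, 7, 32), (2, 6, 32), (3, 8, 32)]),
    42: ("GGT", [(1, 6, 40), (2, 5, 40), (3, 7, 40)]),
    45: ("GCA", [(1, 3, 51), (4, 3, 51)]),
    51: ("CAG", [(1, 4, 11), (2, 3, 11), (4, 4, 10)]),
}


def walk(table, start, seq_index):
    """Follow the links of one sequence from k-tuple `start` until link == -1."""
    seq, e = "", start
    while e != -1:
        kmer, entries = table[e]
        seq = kmer if not seq else seq + kmer[-1]
        # pick the entry that belongs to the sequence being followed
        e = next(link for index, _, link in entries if index == seq_index)
    return seq


def successors(table, e):
    """Distinct next k-tuples seen after E; more than one marks a branch."""
    return sorted({link for _, _, link in table[e][1]})


print(walk(TABLE, 7, 1))
print(successors(TABLE, 28), successors(TABLE, 42))
```

A k-tuple whose entries point to more than one successor, such as TGC here, is where the sequences diverge; one with a single successor can be extended without ambiguity.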

  7. Repeat Graph
     [Figure; labels: Sequence, Repeat (three copies), reads]

  8. Assembly Strategy
     • Genome/Chromosome
     • Forward-reverse paired reads: 30-40 bp each, known dist ~500 bp
     • Extend Solexa reads to long reads of 1-2 Kb
     • Capillary reads assembler: Phrap/Phusion

  9. Kmer Extension & Walk

  10. Quality Filters on Junctions

  11. True Repeat Junctions

  12. All Low Base Quality Case

  13. Repetitive Contig and Read Pairs
      [Figure: depth along the contig; labels: contig start, current read position, pair read position, insert length]
      For each hit read in the contig, contig index and offset are stored.
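One way to picture the stored contig index and offset is as a lookup from read name to placement, which can then be compared with the placement of the paired read. The sketch below is an assumption about how such a check could look, not the talk's implementation: `Placement`, `pair_supports`, the read names and the `tolerance` parameter are all hypothetical.

```python
from typing import NamedTuple


class Placement(NamedTuple):
    contig: int  # contig index stored for the hit read
    offset: int  # offset of the read from the contig start


def pair_supports(placements, read, mate, insert_length, tolerance):
    """True if both reads of a pair sit in the same contig at a distance
    close to the known insert length."""
    a, b = placements.get(read), placements.get(mate)
    if a is None or b is None or a.contig != b.contig:
        return False
    distance = abs(b.offset - a.offset)
    return abs(distance - insert_length) <= tolerance


placements = {
    "r1/1": Placement(3, 1200), "r1/2": Placement(3, 1690),
    "r2/1": Placement(3, 1210), "r2/2": Placement(7, 40),
    "r3/1": Placement(3, 100), "r3/2": Placement(3, 2100),
}
for name in ("r1", "r2", "r3"):
    ok = pair_supports(placements, name + "/1", name + "/2", 500, 50)
    print(name, ok)
```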

  14. Read Pairs to Resolve Repeat Junctions

  15. Handling of Repeat Junctions
      A = A1 + A2
      B = B1 + B2

  16. Handling of Single Base Variations
      [Figure labels: A, B1; A, B2]
      B1 = B2
      S = A + B1

  17. S Suis P1/7 Solexa Assembly
      Solexa reads:
      • Number of reads: 3,084,185
      • Finished genome size: 2,007,491 bp
      • Read length: 39 and 36 bp
      • Estimated read coverage: ~40X
      • Estimated Kmer coverage: 14X
      • Number of vector reads: ?
      Assembly features - contig stats:
      • Total number of contigs: 362
      • Total bases of contigs: 1,938,732 bp
      • N50 contig size: 10,849
      • Largest contig: 33,388
      • Averaged contig size: 5,356
      • Contig coverage over the genome: ~97 %
      • Contig extension errors: 1
      • Mis-assembly errors: 3
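The contig statistics reported on the results slides follow standard definitions, sketched below. The N50 example uses made-up toy contig lengths, since the talk does not list individual contig sizes; the averaged contig size and genome coverage are recomputed from the totals on this slide. `n50` is a hypothetical helper name.

```python
def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    raise ValueError("no contigs given")


# Toy contig lengths (not from the talk) to illustrate the definition
toy_n50 = n50([2, 3, 4, 5, 6, 7, 8, 9, 10])

# Totals taken from this slide
total_bases = 1_938_732
n_contigs = 362
genome_size = 2_007_491

mean_contig = total_bases / n_contigs
genome_covered = 100 * total_bases / genome_size
print(toy_n50, round(mean_contig), round(genome_covered, 1))
```

The ratio of total contig bases to genome size is only an upper-level approximation of "contig coverage over the genome", since it ignores overlaps between contigs and misplaced bases.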

  18. S Suis P1/7 Shredded Read Assembly
      Shredded reads:
      • Number of reads: 1,338,161
      • Finished genome size: 2,007,491 bp
      • Read length: 36
      • Estimated read coverage: 24X
      • Insert size: 500 bp

      Assembly features:        Paired_Data   Not_Paired
      Number of contigs:        35            317
      Total assembled bases:    1.996 Mb      1.956 Mb
      N50 contig size:          243,039       13,929
      Largest contig:           474,070       33,460
      Averaged contig size:     57,043        6,168
      Contig coverage:          >99.0 %       >99.0 %
      Contig extension errors:  0             0
      Mis-assembly errors:      3             2
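The estimated read coverage on this slide is consistent with the usual formula, number of reads times read length divided by genome size. A short check using the slide's own figures (the function name is hypothetical):

```python
def read_coverage(n_reads, read_length, genome_size):
    """Average number of reads covering each base: N * L / G."""
    return n_reads * read_length / genome_size


coverage = read_coverage(1_338_161, 36, 2_007_491)
print(f"{coverage:.1f}X")
```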

  19. STyphi 6979 Solexa Assembly
      Solexa reads:
      • Number of reads: 5,142,190
      • Finished genome size: 4,809,037 bp
      • Read length: 41
      • Estimated read coverage: ~15X
      Assembly features - contig stats:
      • Total number of contigs: 3,126
      • Total bases of contigs: 4,633,241 bp
      • N50 contig size: 2,460
      • Largest contig: 15,325
      • Averaged contig size: 1,482
      • Contig coverage over the genome: ~97.5 %
      • Mis-assembly errors: 0

  20. STyphi CT18 Shredded Read Assembly
      Solexa reads:
      • Number of reads: 4,808,788
      • Finished genome size: 4,809,037 bp
      • Read length: 40
      • Estimated read coverage: 40X
      Assembly features - contig stats:
      • Total number of contigs: 65
      • Total bases of contigs: 4,800,992 bp
      • N50 contig size: 158,460
      • Largest contig: 489,849
      • Averaged contig size: 73,861
      • Contig coverage over the genome: ~99.0 %
      • Mis-assembly errors: 3

  21. PF_3D7 Shredded Read Assembly
      Solexa reads:
      • Number of reads: 11,630,428
      • Finished genome size: 23.5 Mb
      • Read length: 40
      • Estimated read coverage: 20X
      Assembly features - contig stats:
      • Total number of contigs: 29,313
      • Total bases of contigs: 17.17 Mb
      • N50 contig size: 1,355
      • Largest contig: 14,136
      • Averaged contig size: 585
      • Contig coverage over the genome: ~72.8 %
      • Mis-assembly errors: ?

  22. Clone Level Assembly with Shredded Error Free Reads
      • Genome/Chromosome
      • Shred reads with given coverage: forward-reverse paired reads, ~40 bp each, known dist ~500 bp
      • Organize reads into small groups covering clone (200 kb)
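Shredding a finished sequence into error-free paired reads at a given coverage could be simulated roughly as follows. This is a sketch under assumptions the slide does not spell out: fragment starts are drawn uniformly at random, every fragment has exactly the stated insert size (measured end to end), and the second read is the reverse complement of the fragment's far end. The function names and the random toy reference are hypothetical.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse-complement a DNA string over the alphabet ACGT."""
    return seq.translate(COMPLEMENT)[::-1]


def shred(reference, coverage, read_length=40, insert_size=500, seed=0):
    """Return (start, forward read, reverse read) tuples sampled from reference
    so that total read bases are about coverage * len(reference)."""
    rng = random.Random(seed)
    n_pairs = round(coverage * len(reference) / (2 * read_length))
    pairs = []
    for _ in range(n_pairs):
        start = rng.randrange(len(reference) - insert_size + 1)
        fragment = reference[start:start + insert_size]
        forward = fragment[:read_length]
        reverse = reverse_complement(fragment[-read_length:])
        pairs.append((start, forward, reverse))
    return pairs


# Toy 2 kb reference, generated only for the demonstration
reference = "".join(random.Random(1).choices("ACGT", k=2000))
pairs = shred(reference, coverage=40)
print(len(pairs), "pairs;", 2 * 40 * len(pairs) / len(reference), "X")
```

For clone level assembly, the same routine would be applied per clone rather than to the whole chromosome, so that each group of reads covers roughly one 200 kb region.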

  23. Human Chromosome X
      Shredded reads:
      • Number of reads: 156 million
      • Chromosome length: 156 Mb
      • Number of Clones: 774
      • Read length: 40
      • Estimated read coverage: 40X
      Assembly features - contig stats:
      • Total number of contigs: 28,204
      • Total bases of contigs: 148 Mb
      • N50 contig size: 30,968
      • Largest contig: 173,157
      • Averaged contig size: 5,254

  24. Zebrafish Chromosome 5
      Shredded reads:
      • Number of reads: 70.2 million
      • Chromosome length: 70.3 Mb
      • Number of Clones: 351
      • Read length: 40
      • Estimated read coverage: 40X
      Assembly features - contig stats:
      • Total number of contigs: 22,405
      • Total bases of contigs: 67.5 Mb
      • N50 contig size: 9,587
      • Largest contig: 70,757
      • Averaged contig size: 3,012

  25. Plasmodium Chr14
      Shredded reads:
      • Number of reads: 3.2 million
      • Chromosome length: 3.29 Mb
      • Number of Clones: 16
      • Read length: 40
      • Estimated read coverage: 40X
      Assembly features - Original data:
      • Total number of contigs: 1,960
      • Total bases of contigs: 2.86 Mb
      • N50 contig size: 2,924
      • Largest contig: 18,366
      • Averaged contig size: 1,461
      Assembly features - Replacing “TATATA…”:
      • Total number of contigs: 1,333
      • Total bases of contigs: 3.05 Mb
      • N50 contig size: 4,596
      • Largest contig: 23,345
      • Averaged contig size: 2,287

  26. Acknowledgements:
      • Ian Goodhead and Chris Clee
      • James Bonfield
      • Yong Gu and Adam Spargo
      • Daniel Zerbino (EBI)
      • Tony Cox
      • Richard Durbin
