Fuzzypath – Algorithms, Applications and Future Developments

  1. Fuzzypath – Algorithms, Applications and Future Developments. Zemin Ning, Sequence Assembly and Analysis

  2. Outline of the Talk:
     • Sequence Reconstruction and Euler Path
     • Assembly strategy
     • Sequence extension using read pairs, base qualities, fuzzy kmers or longer reads
     • Repeat junctions
     • Installation, data process and running
     • Gap5 - visual inspection for mis-assembly errors
     • Integration into the Phusion pipeline

  3. [Figure: sequences containing repeats and the corresponding repeat graph. Labels: Sequence, Repeat, Repeat Graph]

  4. Sequence Reconstruction - Hamiltonian path approach
     S = (ATGCAGGTCC)
     Spectrum (k = 3): ATG, AGG, TGC, TCC, GTC, GGT, GCA, CAG
     Path: ATG -> TGC -> GCA -> CAG -> AGG -> GGT -> GTC -> TCC
     • Vertices: k-tuples from the spectrum (8), shown in red on the slide;
     • Edges: overlapping k-tuples (7);
     • Path: visits all vertices, corresponding to the sequence.
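The Hamiltonian path view on slide 4 can be sketched in a few lines of Python. This is only an illustration, not code from Fuzzypath: the function names `spectrum` and `hamiltonian_reconstruct` are made up here, and the search is a brute-force pass over all orderings, which is workable for 8 k-tuples but shows why the approach does not scale.

```python
from itertools import permutations

def spectrum(seq, k):
    """Return the k-tuples of seq in order of occurrence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def hamiltonian_reconstruct(kmers):
    """Find an ordering that visits every k-tuple (vertex) exactly once,
    where consecutive k-tuples overlap by k-1 bases (an edge)."""
    for order in permutations(kmers):
        if all(a[1:] == b[:-1] for a, b in zip(order, order[1:])):
            # Spell the sequence: first k-tuple plus the last base of each later one.
            return order[0] + "".join(x[-1] for x in order[1:])
    return None

# Sort the spectrum so the input order carries no hint about the answer.
kmers = sorted(spectrum("ATGCAGGTCC", 3))
print(len(kmers))                      # 8 vertices
print(hamiltonian_reconstruct(kmers))  # ATGCAGGTCC
```

For this spectrum the ordering is unique, because every k-tuple has at most one possible successor.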

  5. Sequence Reconstruction - Euler path approach
     Vertices: AT, TG, GG, GC, CG, GT, CA
     Path: ATG -> TGG -> GGC -> GCG -> CGT -> GTG -> TGC -> GCA
     Sequences sharing this spectrum: ATGCGTGGCA, ATGGCGTGCA
     • Vertices: correspond to (k-1)-tuples (7);
     • Edges: correspond to k-tuples from the spectrum (8);
     • Path: visits all EDGES, corresponding to the sequence.
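The Euler path view on slide 5 can be sketched the same way. This is an illustrative sketch rather than Fuzzypath's implementation; `de_bruijn` and `euler_sequences` are hypothetical names, and the backtracking search enumerates every path that uses each k-tuple edge exactly once, which is only practical for toy inputs.

```python
from collections import defaultdict

def de_bruijn(kmers):
    """Build a graph whose vertices are (k-1)-tuples and whose edges are k-tuples."""
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)

def euler_sequences(kmers):
    """Return every sequence spelled by a path that uses each edge exactly once."""
    graph = de_bruijn(kmers)
    results = set()

    def walk(node, used, seq):
        if len(used) == len(kmers):
            results.add(seq)
            return
        for i, nxt in enumerate(graph.get(node, [])):
            if (node, i) not in used:  # each edge is identified by (source, index)
                walk(nxt, used | {(node, i)}, seq + nxt[-1])

    for start in graph:
        walk(start, frozenset(), start)
    return sorted(results)

kmers = ["ATG", "TGG", "GGC", "GCG", "CGT", "GTG", "TGC", "GCA"]
vertices = {k[:-1] for k in kmers} | {k[1:] for k in kmers}
print(len(vertices), len(kmers))  # 7 8
print(euler_sequences(kmers))     # ['ATGCGTGGCA', 'ATGGCGTGCA']
```

Both sequences from the slide come out, which illustrates that a spectrum alone can be ambiguous: the two paths differ only in the order in which the branches at TG and GC are taken.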

  6. Assembly Strategy
     • Forward-reverse paired reads: 30-75 bp each, known distance ~500 bp
     • Solexa read assembler to extend short reads to 1-2 kb long reads
     • Capillary reads assembler: Phrap/Phusion
     • Genome/Chromosome

  7. Kmer Extension & Walk

  8. Base Quality to Filter Base Errors

  9. Read Pairs in Repeat Junctions

  10. Kmer Extension & Repeat Junctions
     [Figure: pileup of other reads like 454, Sanger etc. at a repeat junction. Labels: A1, A2, Consensus]
     Means to handle repeats:
     - Base quality
     - Read pair
     - Fuzzy kmers
     - Closely related reference
     - 454 or Sanger reads
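To make the idea of a repeat junction concrete, here is a minimal sketch of greedy k-mer extension that stops when more than one next base is supported. It is a generic illustration under stated assumptions, not Fuzzypath's algorithm: the functions `kmer_counts` and `extend_right`, the toy reads, and the rule of stopping at any branch are all hypothetical, and none of the slide's tie-breakers (base quality, read pairs, fuzzy kmers, a reference, longer reads) are modelled.

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count every k-mer that occurs in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def extend_right(seed, counts, k, max_len=1000):
    """Extend seed (length >= k) one base at a time.
    Stops at a dead end (no candidate), a junction (several candidates),
    a revisited k-mer, or max_len. Returns the sequence and the last
    list of candidate bases."""
    seq = seed
    candidates = []
    seen = {seq[-k:]}
    while len(seq) < max_len:
        suffix = seq[-(k - 1):]
        candidates = [b for b in "ACGT" if counts[suffix + b] > 0]
        if len(candidates) != 1:
            break  # dead end or junction: extension is ambiguous here
        seq += candidates[0]
        if seq[-k:] in seen:
            break  # walked into a cycle
        seen.add(seq[-k:])
    return seq, candidates

# Toy reads (made up): two reads leave the shared segment TTGCA differently.
reads = ["ACGTTGCA", "TTGCAGGT", "TTGCATCC"]
counts = kmer_counts(reads, 4)
print(extend_right("ACGT", counts, 4))  # ('ACGTTGCA', ['G', 'T'])
```

The walk halts after TGCA because both GCAG and GCAT are present, which is the point where extra evidence such as the items listed on the slide would be needed to choose a branch.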

  11. Handling of Repeat Junctions
     • A = A1 + A2
     • B = B1 + B2
     [Figure labels: A1, A2, B1, B2]

  12. Handling of Single Base Variations
     • Branches: A -> B1 and A -> B2
     • B1 = B2
     • S = A + B1

  13. Fuzzypath Pipeline

  14. Fuzzypath Read File

  15. Fuzzypath Fastq File

  16. Salmonella seftenberg Solexa Assembly from Pair-End Reads
     Solexa reads:
     • Number of reads: 6,000,000
     • Finished genome size: ~4.8 Mbp
     • Read length: 2x37 bp
     • Estimated read coverage: ~92.5X
     • Insert size: 170/50-300 bp
     Assembly features - contig stats:

     Metric                         Solexa       454
     Total number of contigs        75           390
     Total bases of contigs         4.80 Mbp     4.77 Mb
     N50 contig size                139,353      25,702
     Largest contig                 395,600      62,040
     Averaged contig size           63,969       12,224
     Contig coverage on genome      ~99.8%       99.4%
     Contig extension errors        0            (not given)
     Mis-assembly errors            0            4
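The coverage figure on slide 16 can be checked with simple arithmetic, and N50 has a short standard definition. The sketch below assumes that the 6,000,000 figure counts read pairs (that reading reproduces ~92.5X; the slide does not say so explicitly). The helper names and the toy contig lengths are hypothetical and are not taken from the talk.

```python
def paired_coverage(n_pairs, read_len, genome_size):
    """Average depth from paired reads: two reads of read_len bases per pair."""
    return n_pairs * 2 * read_len / genome_size

def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0

# Slide 16 numbers, ASSUMING 6,000,000 is the number of pairs.
print(paired_coverage(6_000_000, 37, 4_800_000))  # 92.5

# Made-up contig lengths, only to show how N50 is computed.
print(n50([80, 70, 50, 40, 30, 20]))  # 70
```

In the toy example the contigs total 290 bases, and the two largest (80 + 70 = 150) already cover half of that, so N50 is 70.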

  17. maq ssaha2

  18. maq ssaha2

  19. maq ssaha2

  20. maq ssaha2

  21. New Phusion Assembler
     [Flowchart; box labels: Assembly, Data Process, Solexa Reads, Supercontig, Long Insert Reads, PRono, Contigs, Reads Group, Fuzzypath, 2x75 or 2x100, Phrap, Velvet]

  22. Human Assembly – COLO-829 Normal Cell
     Solexa reads:
     • Number of reads: 557 million
     • Finished genome size: 3.0 Gb
     • Read length: 2x75 bp
     • Estimated read coverage: ~25X
     • Insert size: 190/50-300 bp
     • Number of reads clustered: 458 million
     Assembly features - contig stats:
     • Total number of contigs: 1,040,582
     • Total bases of contigs: 2.703 Gb
     • N50 contig size: 6,484
     • Largest contig: 85,595
     • Averaged contig size: 2,597
     • Contig coverage over the genome: ~90%
     • Mis-assembly errors: ?

  23. Acknowledgements: • Yong Gu • James Bonfield • Heng Li • Hannes Ponstingl • Daniel Zerbino (EBI) • Helen Beasley • Siobhan Whitehead • Tony Cox
