
Hybrid error correction and de novo assembly of single-molecule sequencing reads

Hybrid error correction and de novo assembly of single-molecule sequencing reads. Presented by George Roberts III.


Presentation Transcript


  1. Hybrid error correction and de novo assembly of single-molecule sequencing reads • Presented by George Roberts III • Sergey Koren, Michael C Schatz, Brian P Walenz, Jeffrey Martin, Jason T Howard, Ganeshkumar Ganapathy, Zhong Wang, David A Rasko, W Richard McCombie, Erich D Jarvis & Adam M Phillippy • Nature Biotechnology, Volume 30, Number 7, July 2012

  2. Human Language • Double articulation • Complex expressions can be broken down into morphemes and words • Source code can be tokenized

  3. Vocalization in Chimpanzees • Limited vocalization • Pant hoot (excitement) • Food anticipation • Males, females – most common in α-males • Difficult and expensive to study http://www.cjclandandseaphoto.com/etanzania11.htm

  4. Cast of Characters • Erich Jarvis • Duke neuroscientist • Uses the zebra finch and budgie as a “simple” model of vocalization • Birds are small and easy to breed • Sergey Koren • Celera Assembler • AMOS, metAMOS: assembly • pacBioToCA (correction and assembly pipeline) • Adam Phillippy • Assemblathon

  5. A Genomic Panorama • Phage λ • Escherichia coli • Budgerigar • Yeast • Corn (RNA-seq) • Zebra Finch (RNA-seq) • Image credits: Bob Duda (University of Pittsburgh), Dreamtime, Dennis Kunkel, Britannica.com, Wikipedia

  6. Parakeets • Small to medium-sized parrots (order Psittaciformes) • One of few vocal species • Crows (Corvidae) also intelligent • Cavity nesters • Southern hemisphere & tropics • Image: Macaw (not a parakeet), credit: Luc Viatour

  7. Melopsittacus undulatus - Budgerigar • undulatus [L.] wavy pattern • Native to Australia • Little sexual dimorphism • Both parents care for young • Mating pairs allopreen • Males have blue ceres • 1.23 Gb (www.genomesize.com) • 2.8 Gb assembly??? (Pre!Ensembl – Jarvis) – database error!

  8. Taeniopygia guttata – Zebra Finch - Passeriformes • Taenio [L.] means striped, guttata [L.] means spotted or dappled • Jarvis Lab intramural volleyball team – TeanyPyggies • Introduced to Portugal, Puerto Rico, Brazil, US. • Sons learn their fathers' songs with little variation (females do not sing) • Songs may change during puberty, but are locked in place thereafter • 1.2 Gb Sanger assembly (Warren et al. 2010) • (Warren, Clayton, Ellegren & Arnold + Jarvis, Mardis)

  9. PBcR Read Correction and Assembly • Resolve repeats through careful alignment • Eliminate spurious mappings (white) • Use top alignments to correct errors in PacBio reads • Errors remain where short reads have the same error as PacBio reads

  10. Longer Read Length Improves Assembly • Simulated data based on: • Even coverage and average read lengths from actual data • Error correction rate of 99.9% (76-bp reads) • 10x PacBio coverage produces optimal assembly

  11. Illumina Paired-End Sequencing • Inserts of 200-500bp • Sequence with SP1 • Sequence with SP2

  12. PacBio Corrected Accuracy

  13. Circular Consensus Sequence (CCS) • Read length ∝ 1 / coverage (more passes over the insert leave a shorter consensus read) • Makes use of φ29 rolling-circle replication • http://smrt.med.cornell.edu/

  14. Contig Size vs. Sequencing Technology

  15. Error Correction of mRNA-Seq Improves Mapping

  16. PacBio Error Rate is Position Independent (S288C)

  17. S288C Coverage By Chromosome (I to XVI and mito.) • Top 10 alignments > 1 kb were mapped with BLASR • Depth tallied by 1 kb bins • Claim: spikes caused by mapping artifacts • PacBio removes amplification bias and reduces G+C bias.

  18. Sequencing Depth Histogram • Poisson λ = 12.5 • Fatter tails • Bias? • mapping artifact?
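A minimal sketch of the comparison this slide describes, assuming per-bin depths are available as a plain list of integers. The function names are illustrative and not taken from the presentation. Under a Poisson model the variance equals the mean, so a variance-to-mean ratio above 1 is one simple way to quantify "fatter tails".

```python
import math

def poisson_pmf(k, lam):
    """Probability of observing depth k under a Poisson model with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def dispersion_index(depths):
    """Variance-to-mean ratio of observed depths.

    A Poisson sample has a ratio near 1; values well above 1 indicate
    heavier tails than the model predicts (bias or mapping artifacts).
    """
    n = len(depths)
    mean = sum(depths) / n
    variance = sum((d - mean) ** 2 for d in depths) / n
    return variance / mean

LAMBDA = 12.5  # mean depth quoted on the slide

# Expected fraction of bins at each depth from 0 to 40 under the Poisson model.
expected_fractions = [poisson_pmf(k, LAMBDA) for k in range(41)]
```

Comparing `expected_fractions` with the observed histogram, bin by bin, shows where the excess sits; `dispersion_index` condenses the same comparison into one number.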

  19. Assembler Performance on Uncorrected PacBio

  20. k-mer Size to detect Overlaps: E. coli Simulated Uncorrected PacBio
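As a rough illustration of why k-mer size matters for overlap detection on uncorrected reads, the sketch below assumes errors occur independently at a fixed per-base rate. That independence model and the function names are assumptions made for illustration, not something stated on the slide. A k-mer survives in one read with probability (1 - e)^k, and an exact shared k-mer at an aligned position needs both reads to be error-free there.

```python
def error_free_kmer_prob(error_rate, k):
    """Probability that k consecutive bases contain no error (independent errors)."""
    return (1.0 - error_rate) ** k

def expected_shared_kmers(overlap_len, k, error_rate):
    """Expected number of exact k-mer matches across a true overlap.

    Both reads must be error-free over the same k bases, which is the same
    as 2k independent error-free bases.
    """
    if overlap_len < k:
        return 0.0
    positions = overlap_len - k + 1
    return positions * error_free_kmer_prob(error_rate, 2 * k)

# Survival probability of a single k-mer at a 16% error rate for a few sizes.
for k in (8, 10, 12, 14, 16):
    print(k, round(error_free_kmer_prob(0.16, k), 4))
```

Larger k gives more specific seeds but far fewer surviving ones at high error rates, which is the trade-off the slide title points to.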

  21. CDF of Correct Overlaps vs. Mismatch Tolerance • Axes: cumulative % overlaps vs. % overlap error • E = 0.16 + 0.16 − 0.16² ≈ 31.55%
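The expression on this slide has the form e1 + e2 - e1·e2, reading the slide's final term as 0.16 squared: a position in an overlap disagrees if either read is wrong there, assuming the two reads err independently. The sketch below evaluates it and inverts it; the function names are illustrative. Evaluated at exactly 0.16 the expression gives about 29.4%, whereas the 31.55% quoted on the slide corresponds to a per-read rate of roughly 17.3%, so the slide's inputs may have been rounded.

```python
import math

def overlap_error(e1, e2):
    """Expected mismatch rate between two reads with independent error rates."""
    return e1 + e2 - e1 * e2

def per_read_rate(overlap_rate):
    """Per-read error rate that yields the given overlap error for two equal reads.

    Solves E = 2e - e^2, i.e. e = 1 - sqrt(1 - E).
    """
    return 1.0 - math.sqrt(1.0 - overlap_rate)

print(overlap_error(0.16, 0.16))   # overlap error for two 16%-error reads
print(per_read_rate(0.3155))       # per-read rate implied by a 31.55% overlap error
```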

  22. 454 vs. PacBio Overlaps • Axes: cumulative % overlaps vs. % overlap error

  23. FLX+ • 1kb reads (700bp mode) • Consensus accuracy 99.997% (15x coverage) • 1,000,000 reads per run • GS FLX Titanium chemistry

  24. Sequence Data Used to Test the Pipeline

  25. Correction Algorithm Scales Linearly With Input Size

  26. Illumina Coverage vs. N50 • 200x Datapoint: • random Illumina errors are common enough to align with PacBio errors • 4.86% drop in uncorrected N50 corresponds to a 20% drop in corrected N50 • Not recommended! • Aggressive trimming (Quake) reduces the chimera rate to 1.86%, eliminating the drop • % Chimera increases with coverage “Sweet spot”
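Several slides compare assemblies by N50. For reference, a minimal sketch of the usual definition: the contig length at which the running total over length-sorted contigs first reaches half of the assembly size. The function name and the toy lengths are illustrative.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are sorted from longest to shortest; N50 is the length of the
    contig at which the cumulative sum first reaches half of the total.
    """
    if not lengths:
        raise ValueError("n50() needs at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length

# Toy example: total is 20, and 6 + 5 = 11 first reaches half of it.
print(n50([2, 3, 4, 5, 6]))
```

Because N50 depends only on the sorted lengths, a handful of chimeric joins or breaks among the longest contigs can move it sharply, which is one reason the corrected and uncorrected N50 values on this slide need not change by the same percentage.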

  27. Read Length, Coverage and Identity

  28. Perspective in the Search for Truth

  29. Contiguity is Correlated with Read Length • Axes: N50 normalized to genome size vs. average read length • Annotations: low complexity, low coverage

  30. PBcR Resolution of Repeats

  31. PacBio Coverage vs. Correction Methods • De Bruijn thrives on high coverage, OLC can be hindered by high coverage

  32. Melopsittacus undulatus assembly • A hybrid assembly of the 454 and Illumina data was not possible because Celera Assembler does not support high-coverage Illumina data and ALLPATHS-LG does not support 454. • ALLPATHS-LG assembles smaller contigs, but its scaffolds contain an additional 1-2% of transcript bases (makes excellent use of short reads)

  33. Melopsittacus undulatus assembly • 40% of [zebra finch] transcripts in the unstimulated auditory forebrain are noncoding and derive from intronic or intergenic loci • 92% of 454-PBcR-Illumina closed gaps are outside of coding regions • 18% within introns • 74% between “gene models”

  34. k-mer uniqueness in six genomes
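One simple way to compute a k-mer uniqueness figure is sketched below. It assumes uniqueness is defined as the fraction of k-mer positions in a sequence whose k-mer occurs exactly once; that definition is an assumption, since the slide does not spell one out. The sketch counts forward-strand k-mers only and does not merge reverse complements.

```python
from collections import Counter

def kmer_uniqueness(sequence, k):
    """Fraction of k-mer positions whose k-mer occurs exactly once in sequence."""
    n_positions = len(sequence) - k + 1
    if n_positions <= 0:
        raise ValueError("sequence is shorter than k")
    counts = Counter(sequence[i:i + k] for i in range(n_positions))
    unique_positions = sum(1 for c in counts.values() if c == 1)
    return unique_positions / n_positions

# Toy example: the 4-mer ACGT occurs twice, the other three 4-mers once each.
print(kmer_uniqueness("ACGTACGT", 4))
```

Plotting this value against k for each genome shows how long a read must be before most positions can be placed unambiguously.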

  35. Sequence is from opposite strands and in opposite directions

  36. 454-Corrected PacBio Assembled by 10kb Illumina mate pairs

  37. Illumina-Corrected PacBio Assembled by 10kb Illumina mate pairs

  38. PBcR Join Lengths agree with Scaffold Estimates • 33,881 scaffold gaps • 16,251 (48%) closed by 454-PBcR • 17,290 (51%) closed by 454-PBcR-Illumina • 11,804 (35%) closed by both • Half not closed by either!
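The counts on this slide can be combined by inclusion-exclusion to get the number of gaps closed by at least one data set and by neither. The sketch below assumes both closure counts are subsets of the same 33,881 scaffold gaps; the function name is illustrative.

```python
def gap_closure_summary(total, closed_a, closed_b, closed_both):
    """Combine two overlapping closure counts with inclusion-exclusion."""
    either = closed_a + closed_b - closed_both
    neither = total - either
    return {
        "either": either,
        "neither": neither,
        "either_fraction": either / total,
        "neither_fraction": neither / total,
    }

summary = gap_closure_summary(
    total=33881,        # scaffold gaps
    closed_a=16251,     # closed by 454-PBcR
    closed_b=17290,     # closed by 454-PBcR-Illumina
    closed_both=11804,  # closed by both
)
print(summary)
```

With these inputs, 21,737 gaps are closed by at least one correction and 12,144 (about 36%) by neither, somewhat fewer than half under the stated assumption.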

  39. Taeniopygia guttata mRNA-CDS Mapping • 15,275 zebra finch mRNA from NCBI • 81, 83, 86 and 85 hybrid mappings respectively

  40. mRNA-CDS Mapping - Tabular

  41. Assembly – Gap Statistics • Vast majority of gaps are outside of exons

  42. Avian Vocalization Regions • Area X: basal ganglia • RA: premotor nucleus • NXIIts: hypoglossal • HVC: high vocal center • DLM: dorsolateral division of the medial thalamus • LMAN: lateral part magnocellular nucleus

  43. Genomics of Vocalization • Large involvement of ncRNA (Mattick 2004, Warren 2010)

  44. Forkhead Box P2 - FOXP2 • DNA-binding protein • Poly [Q] (activation) • Zinc-finger (DNA-Binding) • Leucine-zipper (dimerization) • Required for proper brain and lung development

  45. Forkhead Box P2 - FOXP2 • Knockout mouse pups exhibit less vocalization • Abnormalities in Purkinje layer • Death ~ 21 days (lung development) • 400kb • Bat echolocation • Extremely diverse (conserved in all other mammals) • Upregulated in T. guttata vocalization regions • Mutations in humans cause severe speech disorders despite adequate intelligence • Underactivation of putamen & Broca’s area • http://vanat.cvm.umn.edu/neurHistAtls/pages/cns9.html

  46. Human Speech Areas

  47. FOXP2 Human - Chimp • N303 and S325 • N303 unique to humans • Relatively few intronic mutations (recent sweep) • Zhang et al. 2002 Genetics 162:1825-1835 positive selection

  48. FOXP2 • Knockdown impairs song imitation • Sequence differences do not affect learning • Canaries relearn their songs each year • Order Passeriformes (finches) • FoxP2 levels increase in late summer and early fall Fisher and Scharf (2009) TIGS 25:166-177

  49. Zebra Finch Genome - 2010 • 2nd avian genome (after chicken) • Erich Jarvis, Elaine Mardis (Warren 2010) • Singing-correlated gene expression
