
Computational Genomics: Genome assembly

Computational Genomics: Genome assembly. Andrey Kislyuk, 25 January 2010.


Presentation Transcript


  1. Computational Genomics: Genome assembly Andrey Kislyuk 25 January 2010

  2. Why do we need to assemble genomes? • DNA sequencing methods can’t sequence more than about 1000 nt at a time • Sanger method (1975) • chain termination with labeled ddNTPs • Maxam-Gilbert method (1976) • cleaving agents • Both require primers for DNA Polymerase, but we can’t make the primers until we know the sequence! • Both limited to ~1000 nt on the gel/capillary • Both require cloning and/or PCR amplification • Large-scale sequencing: primer walking • Slow, costly, and error-prone: not practical beyond ~10Kbp 5 June 2014·Computational Genomics

  3. Shotgun sequencing • Whole genome shotgun sequencing (1995) • Hydroshearing: prepare several libraries of random fragments approx. 2, 5, 10, 50… Kbp long • Cloning: use bacterial plasmids to grow DNA – problems arise if DNA contains a gene harmful to the host bacteria • Picking, amplification • Sanger sequencing, capillary electrophoresis, read out fluorescent dyes with a laser – 4 different colors • Result: lots of ~1000 nt Sanger reads • Assemble them with pairwise sequence alignment • Multiple coverage corrects errors • Seems straightforward now, but many did not believe it could be done!

  4. How much shotgun sequencing? • So, can we really sequence the whole genome with this? (No, we can’t.) • Lander and Waterman (1988): • Assuming random distribution of reads and ignoring repeat resolution issues, define: • G = genome length • L = length of a single read • N = number of reads sequenced • T = minimum overlap to align the reads together • Then overall coverage is $C = LN/G$ • Coverage for any given base obeys the Poisson distribution: $P(k) = \frac{C^k e^{-C}}{k!}$ • The expected number of bases with 0 coverage (gaps) is: $G \cdot P(0) = G e^{-C}$
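The quantities defined on slide 4 can be explored numerically. The sketch below is illustrative only: the function names and the example genome are assumptions, and the contig-count formula N·e^(-C(1-T/L)) is the standard Lander-Waterman estimate rather than something recovered from the slide transcript.

```python
import math

def coverage(read_len, num_reads, genome_len):
    """Overall coverage C = L * N / G."""
    return read_len * num_reads / genome_len

def prob_covered_k_times(c, k):
    """Poisson probability that a given base is covered by exactly k reads."""
    return (c ** k) * math.exp(-c) / math.factorial(k)

def expected_uncovered_bases(c, genome_len):
    """Expected number of bases with zero coverage: G * e^(-C)."""
    return genome_len * math.exp(-c)

def expected_contigs(c, num_reads, read_len, min_overlap):
    """Lander-Waterman estimate of the number of contigs: N * e^(-C * (1 - T/L))."""
    return num_reads * math.exp(-c * (1 - min_overlap / read_len))

# Hypothetical example: a 1 Mbp genome sequenced with 8000 reads of 1000 nt.
c = coverage(read_len=1000, num_reads=8000, genome_len=1_000_000)
print(c)                                       # coverage
print(expected_uncovered_bases(c, 1_000_000))  # bases never sequenced
print(expected_contigs(c, 8000, 1000, 40))     # islands left after assembly
```

Even at 8x coverage the model leaves a few hundred bases unsequenced, which is why the slide answers its own question with "No, we can't."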

  5. How much shotgun sequencing?

  6. How much shotgun sequencing?

  7. Pioneers of sequence assembly • J. Craig Venter’s group at TIGR (later JCVI) • Created TIGR Assembler, Celera Assembler, and associated tools • Jim Kent (UC Santa Cruz) • Created GigAssembler • Allowed the Human Genome Project to compete with Celera • Philip Green’s group at the University of Washington • Created Phred, Phrap and Consed tools • Sequencing centers: JCVI, Sanger Institute, Whitehead/MIT, DOE JGI, Baylor HGSC, WUSTL

  8. Why do we need to assemble genomes? • 2nd Generation sequencing methods • Cheaper and more processive (sequence more data), but shorter read length • 454 Pyrosequencing: 200-600 nt average read length • Illumina: 50-70 nt average read length • ABI SOLiD: 50 nt average read length • Same idea: randomly hydrosheared library • Random reads from across the genome form a big puzzle

  9. Next Generation Sequencing Technologies • Sequencing by synthesis • 2nd generation • 454 Pyrosequencing • Solexa/Illumina • SOLiD • 3rd generation • Single-molecule sequencing • Nanopore sequencing

  10. 454 Pyrosequencing • Mix DNA library & capture beads (limited dilution) • Add PCR reagents + emulsion oil • Create “water-in-oil” emulsion • Perform emulsion PCR • “Break micro-reactors” • Isolate DNA-containing beads

  11. 454 Pyrosequencing • Load enzyme beads • Load beads into PicoTiter™Plate • PicoTiter™Plate wells: Diameter = 44 μm, Depth = 55 μm, Well size = 75 pl, Well density = 480 wells mm⁻², 1.6 million wells per slide

  12. 454 Pyrosequencing • Sequencing by synthesis • Reagent flow • Photons generated are captured by CCD camera Margulies et al., 2005

  13. Raw sequencer output • Sanger: Trace (usually .ab1/.scf file format) • 454: Flowgram (.sff file format) • Measures the presence or absence of each nucleotide at any given position (figure: flowgram with flow order TACG, key sequence TCAG, and signal levels labeled 1-mer to 4-mer)

  14. Assembly algorithms • Paradigms • Overlap-Layout-Consensus • De Bruijn graphs

  15. Differences between an overlap graph and a de Bruijn graph Schatz M C et al. Genome Res. 2010;20:1165-1173 ©2010 by Cold Spring Harbor Laboratory Press
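To make the de Bruijn side of that comparison concrete, here is a minimal sketch of how such a graph can be built from reads: nodes are (k-1)-mers and each distinct k-mer contributes one edge from its prefix to its suffix. The function name and the toy reads are assumptions for illustration; real assemblers add error correction, reverse complements, and graph simplification.

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the (k-1)-mers that follow it in some k-mer."""
    graph = defaultdict(list)
    seen = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in seen:          # one edge per distinct k-mer
                continue
            seen.add(kmer)
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)

# Two overlapping toy reads collapse into a single path AC -> CG -> GT -> TA.
print(de_bruijn_graph(["ACGT", "CGTA"], k=3))
```

Unlike an overlap graph, no read-versus-read comparison is needed: overlaps are implied by shared k-mers.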

  16. Overlap-Layout-Consensus • Assemblers: ARACHNE, PHRAP, CAP, TIGR, CELERA • Overlap: find potentially overlapping reads • Layout: merge reads into contigs and contigs into supercontigs • Consensus: derive the DNA sequence and correct read errors (..ACGATTACAATAGGTT..) Credit: Serafim Batzoglou

  17. Overlap: A pairwise alignment problem • Find the best match between the suffix of one read and the prefix of another • Due to sequencing errors, need to use dynamic programming to find the optimal overlap alignment • Apply a filtration method to filter out pairs of fragments that do not share a significantly long common substring Credit: Serafim Batzoglou
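The suffix-prefix alignment on slide 17 can be sketched with a small dynamic program. This is a minimal illustration, not the method of any particular assembler: the scoring values (match=1, mismatch=-1, gap=-2) and the function name are assumptions. Skipping a prefix of the first read is free, and the alignment must run to the end of the first read.

```python
def overlap_alignment(a, b, match=1, mismatch=-1, gap=-2):
    """Best alignment of a suffix of `a` with a prefix of `b`.

    Returns (score, overlap_len), where overlap_len is the number of
    leading characters of `b` that take part in the overlap.
    """
    m = len(b)
    # Row 0: the overlap must start at b[0], so leading gaps in b are penalized.
    prev = [j * gap for j in range(m + 1)]
    for i in range(1, len(a) + 1):
        cur = [0] * (m + 1)  # cur[0] = 0: any prefix of a may be skipped for free
        for j in range(1, m + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur[j] = max(diag, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    # The overlap must reach the end of a; pick the best prefix length of b.
    best_j = max(range(m + 1), key=lambda j: prev[j])
    return prev[best_j], best_j

print(overlap_alignment("TAGATTACACAGATTAC", "CAGATTACTGA"))
```

The run time is proportional to the product of the read lengths, which is why the slide's filtration step matters before any pair reaches this stage.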

  18. Overlapping Reads • Sort all k-mers in reads (k ~ 24) • Find pairs of reads sharing a k-mer • Extend to full alignment – throw away if not >95% similar (figure: a rejected pair, T GA TACA against TAGA TAGT, and an accepted pair, TAGATTACACAGATTAC against TAGATTACACAGATTAC) Credit: Serafim Batzoglou
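The k-mer seeding step on slide 18 can be sketched as follows. Note the swap: the slide sorts all k-mers, while this illustration uses a hash index, which finds the same sharing pairs. The function name and the toy reads are assumptions, and k is kept small so the example is readable.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(reads, k):
    """Return index pairs (i, j), i < j, of reads that share at least one k-mer."""
    index = defaultdict(set)
    for read_id, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            index[read[i:i + k]].add(read_id)
    pairs = set()
    for read_ids in index.values():
        pairs.update(combinations(sorted(read_ids), 2))
    return pairs

reads = ["TAGATTACAC", "TTACACAGAT", "GGGGGGGGGG"]
print(candidate_pairs(reads, k=5))  # only reads 0 and 1 share a 5-mer
```

Only the returned pairs would go on to full alignment and the >95% similarity check.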

  19. Overlapping Reads and Repeats • A k-mer that appears N times initiates N^2 comparisons • For an Alu that appears 10^6 times → 10^12 comparisons – too much • Solution: Discard all k-mers that appear more than t × Coverage times (t ~ 10) Credit: Serafim Batzoglou
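The repeat filter on slide 19 amounts to counting k-mers and dropping the over-represented ones before seeding comparisons. A minimal sketch, with hypothetical function names and toy values for k, t, and coverage:

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count every k-mer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def usable_kmers(counts, coverage, t=10):
    """Keep k-mers seen at most t * coverage times; the rest are likely repeats."""
    limit = t * coverage
    return {kmer for kmer, n in counts.items() if n <= limit}

counts = kmer_counts(["AAAAAA", "ACGTAC"], k=3)
print(counts["AAA"])                          # the repetitive k-mer dominates
print(usable_kmers(counts, coverage=1, t=2))  # AAA is discarded
```

A k-mer seen n times would seed on the order of n^2 read comparisons, so removing a handful of very frequent k-mers removes most of the work.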

  20. Finding Overlapping Reads Create local multiple alignments from the overlapping reads TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAG TTACACAGATTATTGA TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAG TTACACAGATTATTGA TAGATTACACAGATTACTGA Credit: Serafim Batzoglou, Masahiro Kasahara

  21. Finding Overlapping Reads (cont’d) • Correct errors using multiple alignment (figure: aligned reads such as TAGATTACACAGATTACTGA and TAG TTACACAGATTATTGA, annotated with per-base quality scores, e.g. C: 20, C: 35, C: 40, T: 30, A: 15, A: 25, A: 40) • Score alignments • Accept alignments with good scores Credit: Serafim Batzoglou

  22. Layout • Repeats are a major challenge • Do two aligned fragments really overlap, or are they from two copies of a repeat? Credit: Serafim Batzoglou

  23. The k-mer uniqueness ratio Schatz M C et al. Genome Res. 2010;20:1165-1173 ©2010 by Cold Spring Harbor Laboratory Press

  24. Merge Reads into Contigs repeat region Merge reads up to potential repeat boundaries Credit: Serafim Batzoglou

  25. Merge Reads into Contigs (cont’d) repeat region • Ignore non-maximal reads • Merge only maximal reads into contigs Credit: Serafim Batzoglou
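One simple reading of "non-maximal" is a read that is wholly contained in another read. Under that assumption (and ignoring sequencing errors, so containment is an exact substring test), the filtering step might look like the sketch below; the function name is hypothetical.

```python
def maximal_reads(reads):
    """Drop duplicate reads and reads that are exact substrings of a longer read."""
    unique = sorted(set(reads), key=len, reverse=True)  # longest first
    kept = []
    for read in unique:
        if not any(read in longer for longer in kept):
            kept.append(read)
    return kept

reads = ["TAGATTACA", "GATTA", "ACAGATT", "TAGATTACA"]
print(maximal_reads(reads))  # GATTA is contained in TAGATTACA and is dropped
```

A real assembler would test containment with the overlap alignments it has already computed, so that reads with a few mismatches still count as contained.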

  26. Merge Reads into Contigs (cont’d) repeat boundary??? • Ignore “hanging” reads when detecting repeat boundaries sequencing error b a Credit: Serafim Batzoglou

  27. Merge Reads into Contigs (cont’d) ????? Unambiguous • Insert non-maximal reads whenever unambiguous Credit: Serafim Batzoglou

  28. Link Contigs into Supercontigs (aka scaffolds) • Normal density • Too dense: Overcollapsed? (Myers et al. 2000) • Inconsistent links: Overcollapsed? Credit: Serafim Batzoglou

  29. Link Contigs into Supercontigs (cont’d) • Find all links between unique contigs • Connect contigs incrementally, if ≥ 2 links Credit: Serafim Batzoglou
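The linking rule on slide 29 can be sketched as counting mate-pair links between unique contigs and keeping only the pairs supported by at least two of them. The data layout (a list of contig-name pairs, one per mate pair) and the function name are assumptions made for this illustration.

```python
from collections import Counter

def supported_links(mate_links, unique_contigs, min_links=2):
    """Return contig pairs joined by at least `min_links` mate pairs.

    mate_links: iterable of (contig_a, contig_b), one entry per mate pair
    whose two reads landed in those contigs.
    """
    counts = Counter()
    for a, b in mate_links:
        if a != b and a in unique_contigs and b in unique_contigs:
            counts[tuple(sorted((a, b)))] += 1
    return {pair for pair, n in counts.items() if n >= min_links}

links = [("A", "B"), ("B", "A"), ("A", "C"), ("B", "B")]
print(supported_links(links, unique_contigs={"A", "B", "C"}))
```

Requiring two independent links guards against a single chimeric clone or mis-mapped read joining unrelated contigs.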

  30. Link Contigs into Supercontigs (cont’d) Fill gaps in supercontigs with paths of overcollapsed contigs Credit: Serafim Batzoglou

  31. Link Contigs into Supercontigs (cont’d) • Define G = ( V, E ) • V := contigs • E := ( A, B ) such that d( A, B ) < C, where d( A, B ) is the distance between Contig A and Contig B • Reason to do so: Efficiency; full shortest paths cannot be computed Credit: Serafim Batzoglou

  32. Link Contigs into Supercontigs (cont’d) Contig A Contig B Define T: contigs linked to either A or B Fill gap between A and B if there is a path in G passing only from contigs in T Credit: Serafim Batzoglou

  33. Consensus • A consensus sequence is derived from a profile of the assembled fragments • A sufficient number of reads is required to ensure a statistically significant consensus • Reading errors are corrected Credit: Serafim Batzoglou

  34. Derive Consensus Sequence Derive multiple alignment from pairwise read alignments TAGATTACACAGATTACTGA TTGATGGCGTAA CTA TAGATTACACAGATTACTGACTTGATGGCGTAAACTA TAG TTACACAGATTATTGACTTCATGGCGTAA CTA TAGATTACACAGATTACTGACTTGATGGCGTAA CTA TAGATTACACAGATTACTGACTTGATGGGGTAA CTA TAGATTACACAGATTACTGACTTGATGGCGTAA CTA Derive each consensus base by weighted voting Credit: Serafim Batzoglou
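Weighted voting can be sketched column by column: each read votes for its base with a weight equal to that base's quality score. The input layout here is an assumption for illustration: rows are equal-length strings, "-" marks an alignment gap, a space marks a column the read does not cover, and qualities are one number per column.

```python
from collections import defaultdict

def weighted_consensus(rows, quals):
    """Derive a consensus string from aligned rows by quality-weighted voting."""
    consensus = []
    for col in range(len(rows[0])):
        votes = defaultdict(float)
        for row, qual in zip(rows, quals):
            base = row[col]
            if base != " ":            # space: read does not cover this column
                votes[base] += qual[col]
        if votes:
            best = max(votes, key=votes.get)
            if best != "-":            # a winning gap emits no base
                consensus.append(best)
    return "".join(consensus)

rows = ["TAGATTACA", "TAGCTTACA", "TAGATTACA"]
quals = [[30] * 9, [10] * 9, [30] * 9]
print(weighted_consensus(rows, quals))  # the low-quality C is outvoted
```

With uniform weights this reduces to a majority vote; quality weighting lets one confident read outweigh several doubtful ones.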

  35. Mate pairs and paired-end reads • Mate pairs: Circularize and trim size-selected fragments during library preparation. Inserts can be approx. 1, 5, 10, 20 Kbp long. • Paired-end reads: Sequence a short amplified fragment from both ends. Fragment length is more precise but limited to about 300 bp.

  36. Mate pairs/Paired-end reads

  37. Paired end reads (aka mate pairs) Credit: 454 Life Sciences

  38. Base Calling and Trimming • Base Calling: the process of translating the raw sequencer output into • The most likely nucleotide sequence • Confidence scores for each position • Trimming: the process of removing adapter, key, vector, and/or low quality sequence from a read
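As a small illustration of quality trimming, the sketch below assumes the confidence scores are Phred-style (Q = -10·log10 of the error probability) and trims low-quality bases from both ends of a read. The threshold of 20 and the function names are assumptions; adapter, key, and vector removal are not shown.

```python
def error_probability(q):
    """Convert a Phred-style quality score to an error probability."""
    return 10 ** (-q / 10)

def trim_low_quality_ends(seq, quals, min_q=20):
    """Strip bases with quality below min_q from both ends of the read."""
    start, end = 0, len(seq)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return seq[start:end], quals[start:end]

print(error_probability(20))  # roughly 1 error in 100 calls
print(trim_low_quality_ends("ACGTAC", [5, 30, 30, 30, 10, 8]))
```

Production trimmers typically use a sliding window or a running-sum criterion instead of this base-by-base test, which a single good base near the end can stop early.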

  39. Reference-based assembly • Replaces overlap detection with alignment against a similar genome • Also called mapping, mapped assembly Credit: M. Schatz

  40. Reference-based assembly • Use a related genome to ease the layout task • Much faster computationally • Arranges reads with more confidence, so a better assembly is possible • Allows other types of analysis: somatic mutations, organismal SNPs, structural variation, RNA-Seq, …
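The idea of replacing overlap detection with alignment to a reference can be sketched with a seed-and-verify mapper: index the reference by k-mer, look up the first k-mer of each read, and count mismatches at each candidate position. This is a toy under stated assumptions (exact seed at the read start, no indels, forward strand only, hypothetical function names), not how any production mapper works.

```python
def build_index(reference, k):
    """Map every k-mer in the reference to the list of positions where it starts."""
    index = {}
    for i in range(len(reference) - k + 1):
        index.setdefault(reference[i:i + k], []).append(i)
    return index

def map_read(read, reference, index, k, max_mismatches=1):
    """Return (position, mismatches) for each placement seeded by the read's first k-mer."""
    hits = []
    for pos in index.get(read[:k], []):
        window = reference[pos:pos + len(read)]
        if len(window) == len(read):
            mismatches = sum(1 for x, y in zip(read, window) if x != y)
            if mismatches <= max_mismatches:
                hits.append((pos, mismatches))
    return hits

reference = "TAGATTACACAGATTACTGA"
index = build_index(reference, k=4)
print(map_read("ACACAGTT", reference, index, k=4))  # one placement, one mismatch
```

Each read is placed independently of the others, which is where the computational speed-up on the slide comes from; the mismatches against the reference are the raw material for SNP calling.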

  41. Assembly quality control • QC/QA • Metrics: Size, number of contigs, N50 • Diagnostic procedures
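N50 is the contig length such that contigs of that length or longer contain at least half of the assembled bases. A minimal sketch of the usual computation (the function name and the example lengths are illustrative):

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly

print(n50([80, 70, 50, 40, 30, 20]))
```

N50 rewards contiguity but says nothing about correctness: a mis-joined assembly can have a higher N50 than a correct one, so it is best read alongside total size and contig count.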

  42. Genome size as predicted from the assembly

  43. Read length, paired-end reads, coverage • Read length and paired-end reads matter. • Long reads can span repeat regions • Paired-end reads can reach into repeat regions and bridge gaps • Combination of the two maximizes shotgun sequencing performance • Coverage also matters. • High coverage allows very high confidence in base calling • Can do repeat resolution based on coverage fitting • More likely that a read will span an ambiguous region

  44. Scaffolding • If paired end reads are available, scaffolding is already done. • If not (our case)… • Sequenced relatives may exist (our case) • Use reference-based assembly to predict scaffolding • No ready-made tools available for this • Can be inaccurate • Assemblers can get confused by repeats or overlaps that are too short • May be able to join by hand • Manual gap fill • Automated gap fill (no tools exist yet)

  45. Finishing • Finishing is the process of completing the chromosome sequence. • Close all gaps (usually by PCR, but large gaps in big genomes can be sent back to make BACs for resequencing) • Re-sequence areas with less than 2x, 3x, 5x coverage (depending on quality standard) – same procedure as gaps • Check and manually assemble unresolved repeat regions • Check for mis-assembly by analyzing the overlap graph • Lots of Consed work! • This is the most expensive and time-consuming part of sequencing. • Lots of small projects omit finishing and work with draft genomes

  46. Assemblers we used and our results

  47. Homopolymer errors • Specific to 454 pyrosequencing • Sequencing errors usually result in frameshifts! • The flowgram measures the presence or absence of each nucleotide at any given position (figure: flowgram with flow order TACG, key sequence TCAG, and signal levels labeled 1-mer to 4-mer)

  48. Visualization tools: Mauve

  49. More topics • Currently popular assemblers • Newbler (demo) • Velvet • ALLPATHS 2 • ABySS • SHRiMP • Celera/WGS • PHRAP • Other visualization tools (Consed, MAQ, Prospector 2, ABySS-Explorer…) • Microread assembly (Solexa and SOLiD) • de Bruijn graph assembly paradigm
