
Next-generation sequencing: informatics & software aspects

Gabor T. Marth, Boston College Biology Department.


Presentation Transcript


  1. Next-generation sequencing: informatics & software aspects • Gabor T. Marth, Boston College Biology Department

  2. Next-gen data

  3. Read length (figure: read length ranges; axis: read length [bp], ticks 0-400) • 20-60 (variable) • 25-50 (fixed) • 25-70 (fixed) • ~200-450 (variable)

  4. Paired fragment-end reads • fragment amplification: fragment length 100 - 600 bp • fragment length limited by amplification efficiency • circularization: 500bp - 10kb (sweet spot ~3kb) • fragment length limited by library complexity Korbel et al. Science 2007 • paired-end read can improve read mapping accuracy (if unique map positions are required for both ends) or efficiency (if fragment length constraint is used to rescue non-uniquely mapping ends) • instrumental for structural variation discovery
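
The rescue of a non-uniquely mapping end by a fragment length constraint can be sketched roughly as follows. This is a minimal illustration, not the method of any particular mapper: the function name `rescue_pair`, the coordinates, and the simplification of measuring the fragment as the distance between the two map positions (ignoring read length and orientation) are all assumptions. The 100 - 600 bp range is taken from the slide.

```python
from itertools import product

def rescue_pair(cands_a, cands_b, min_frag=100, max_frag=600):
    """Return the (pos_a, pos_b) placements whose separation fits the fragment range.

    cands_a, cands_b: candidate map positions of the two ends on one chromosome.
    Hypothetical helper; read length and strand orientation are ignored.
    """
    consistent = []
    for a, b in product(cands_a, cands_b):
        span = abs(b - a)
        if min_frag <= span <= max_frag:
            consistent.append((a, b))
    return consistent

# One end maps uniquely, the other hits three repeat copies (made-up coordinates).
unique_end = [10_000]
repeat_end = [2_500, 10_350, 48_000]
print(rescue_pair(unique_end, repeat_end))  # only the placement 350 bp apart survives
```

If exactly one placement survives, the repeat end has been rescued. If several survive, the pair remains ambiguous.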

  5. Representational biases • “dispersed” coverage distribution • this affects genome resequencing (deeper starting read coverage is needed) • will have a major impact on counting applications

  6. Amplification errors • an early amplification error gets propagated into every clonal copy • many reads come from clonal copies of a single fragment • early PCR errors in “clonal” read copies lead to false positive allele calls

  7. Read quality

  8. Error rate (Solexa)

  9. Error rate (454)

  10. Per-read errors (Solexa)

  11. Per read errors (454)

  12. Applications

  13. Genome resequencing for variation discovery • SNPs, short INDELs, structural variations • the most immediate application area

  14. Genome resequencing for mutational profiling Organismal reference sequence • likely to change “classical genetics” and mutational analysis

  15. De novo genome sequencing Lander et al. Nature 2001 • difficult problem with short reads • promising, especially as reads get longer

  16. Identification of protein-bound DNA • Chromatin structure (ChIP-seq) (Mikkelsen et al. Nature 2007) • Transcription factor binding sites (Robertson et al. Nature Methods 2007) • DNA methylation (Meissner et al. Nature 2008) • natural applications for next-gen sequencers

  17. Transcriptome sequencing: transcript discovery Mortazavi et al. Nature Methods 2008 Ruby et al. Cell, 2006 • high-throughput, but short reads pose challenges

  18. Transcriptome sequencing: expression profiling (Cloonan et al. Nature Methods 2008; Jones-Rhoades et al. PLoS Genetics 2007) • high-throughput, short-read sequencing should make a major impact, and potentially replace expression microarrays

  19. Analysis software (resequencing)

  20. Individual resequencing (diagram labels: REF, IND) • (i) base calling • (ii) read mapping • (iii) read assembly • (iv) SNP and short INDEL calling • (v) SV calling • (vi) data validation, hypothesis generation

  21. The variation discovery “toolbox” • base callers • read mappers • SNP callers • SV callers • assembly viewers

  22. 1. Base calling • diverse chemistry & sequencing error profiles • output: base sequence and base quality (Q-value) sequence
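
For readers unfamiliar with base quality values, the conversion between a Q-value and an error probability can be sketched as below. This assumes the conventional Phred scale, Q = -10 log10(p), which the slide does not spell out. The function names are illustrative.

```python
import math

def error_prob_to_q(p):
    """Phred-scaled quality for a base-call error probability p (0 < p <= 1)."""
    return -10.0 * math.log10(p)

def q_to_error_prob(q):
    """Error probability implied by a Phred-scaled quality value q."""
    return 10.0 ** (-q / 10.0)

for q in (10, 20, 30):
    # Each step of 10 in Q corresponds to a tenfold drop in error probability.
    print(q, q_to_error_prob(q))
```

On this scale Q20 corresponds to one expected error in 100 base calls and Q30 to one in 1000.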

  23. 454 pyrosequencer error profile • multiple bases in a homo-polymeric run are incorporated in a single incorporation test → the number of bases must be determined from a single scalar signal → the majority of errors are INDELs
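
To make the "single scalar signal" problem concrete, here is a toy sketch of inferring the homopolymer length from one incorporation signal. It is not the PYROBAYES model or the native 454 caller. It assumes a Gaussian signal centred on the true base count, a noise level that grows with run length, and a flat prior. All parameter values are made up for illustration.

```python
import math

def homopolymer_posterior(signal, max_n=8, sigma_base=0.15, sigma_per_base=0.05):
    """Posterior over run lengths 0..max_n given one scalar signal (flat prior).

    Assumed noise model: signal ~ Normal(n, (sigma_base + sigma_per_base * n)^2).
    """
    weights = []
    for n in range(max_n + 1):
        sigma = sigma_base + sigma_per_base * n
        z = (signal - n) / sigma
        weights.append(math.exp(-0.5 * z * z) / sigma)
    total = sum(weights)
    return [w / total for w in weights]

for signal in (2.1, 4.5):
    post = homopolymer_posterior(signal)
    best = max(range(len(post)), key=post.__getitem__)
    print(signal, best, round(post[best], 3))
```

Under these assumptions a signal near 2 is called confidently, while a signal of 4.5 splits its probability between lengths 4 and 5. A wrong choice there is an over-call or under-call, which shows up as an insertion or deletion.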

  24. 454 base quality values • the native 454 base caller assigns base quality values that are too low

  25. PYROBAYES: determine base number

  26. PYROBAYES: Performance • better correlation between assigned and measured quality values • higher fraction of high-quality bases

  27. Base quality value calibration Raw Illumina reads (1000G data)

  28. Recalibrated base quality values (Illumina) • Recalibrated Illumina reads (1000G data)
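
One simple way to think about base quality calibration is to compare each assigned quality value with the error rate actually observed for bases carrying that value. The sketch below is an assumption-laden illustration, not the procedure used for the 1000G data. It assumes mismatches against a reference at sites believed to be non-variant count as errors, bins only by assigned quality, and uses add-one smoothing. The function name is hypothetical.

```python
import math
from collections import defaultdict

def empirical_quality_table(observations):
    """Map each assigned Q to the Phred-scaled empirically observed error rate.

    observations: iterable of (assigned_q, is_mismatch) pairs.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for q, is_mismatch in observations:
        totals[q] += 1
        errors[q] += int(is_mismatch)
    table = {}
    for q in totals:
        rate = (errors[q] + 1) / (totals[q] + 2)  # add-one smoothing avoids log(0)
        table[q] = -10.0 * math.log10(rate)
    return table

# Made-up data: bases labelled Q30 that mismatch about 1% of the time.
obs = [(30, True)] * 9 + [(30, False)] * 989
print(empirical_quality_table(obs))  # assigned Q30 behaves like roughly Q20
```

A recalibration step would then replace each assigned value with its empirical counterpart.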

  29. 2. Read mapping • Read mapping is like doing a jigsaw puzzle… • …you get the pieces… • …and they give you the picture on the box • Unique pieces are easier to place than others…

  30. Non-uniqueness of reads confounds mapping • RepeatMasker does not capture all micro-repeats, i.e. repeats at the scale of the read length • Reads from repeats cannot be uniquely mapped back to their true region of origin

  31. Strategies to deal with non-unique mapping • mapping to multiple loci requires the assignment of alignment probabilities (mapping qualities) to each candidate location of a read (figure example: 0.8, 0.19, 0.01) • Non-unique read mapping: optionally either only report uniquely mapped reads or report all map locations for each read (mapping quality values for all mapped reads are being implemented)
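
The alignment probabilities 0.8, 0.19 and 0.01 shown on the slide can be turned into a single mapping quality for the best location. The sketch below assumes a Phred-style definition, -10 log10 of the probability that the best location is wrong, and an arbitrary cap of 60. Neither detail is stated in the slides, and the function names are illustrative.

```python
import math

def mapping_posteriors(likelihoods):
    """Normalize per-locus alignment likelihoods into probabilities that sum to 1."""
    total = sum(likelihoods)
    return [x / total for x in likelihoods]

def mapping_quality(posteriors, cap=60.0):
    """Phred-scaled probability that the most likely locus is NOT the true origin."""
    p_wrong = 1.0 - max(posteriors)
    if p_wrong <= 0.0:
        return cap  # a single candidate locus: report the cap
    return min(cap, -10.0 * math.log10(p_wrong))

post = mapping_posteriors([8.0, 1.9, 0.1])  # unnormalized, hypothetical likelihoods
print(post, round(mapping_quality(post), 2))
```

With a best-locus probability of 0.8 the mapping quality is only about 7, so a read like this would typically be treated as ambiguously placed.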

  32. Longer reads are easier to map 454 FLX (1000G data)

  33. Paired-end reads help unique read placement • PE (paired-end): fragment amplification, fragment length 100 - 600 bp; fragment length limited by amplification efficiency • MP (mate-pair): circularization, 500bp - 10kb (sweet spot ~3kb); fragment length limited by library complexity (Korbel et al. Science 2007) • PE reads are now the standard for genome resequencing

  34. MOSAIK

  35. INDEL alleles/errors – gapped alignments 454

  36. Aligning multiple read types together (figure panels: ABI/capillary, 454 FLX, 454 GS20, Illumina) • Alignment and co-assembly of multiple read types permits simultaneous analysis of data from multiple sources and error characteristics

  37. Aligner speed

  38. 3. Polymorphism / mutation detection (figure labels: sequencing error, polymorphism)

  39. Allele calling in “trad” sequences • capillary sequences: • either clonal • or diploid traces

  40. Allele calling in next-gen data (figure panels: SNP, INS) • New technologies are perfectly suitable for accurate SNP calling, and some also for short-INDEL detection

  41. Human genome polymorphism projects common SNPs

  42. Human genome polymorphism discovery

  43. The 1000 Genomes Project

  44. New challenges for SNP calling • deep alignments of 100s / 1000s of individuals • trio sequences

  45. Rare alleles in 100s / 1,000s of samples

  46. Allele discovery is a multi-step sampling process (figure labels: Population, Samples, Reads, Allele detection)

  47. Capturing the allele in the sample
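
The first sampling step, whether the allele is present among the sampled individuals at all, can be illustrated with a short calculation. It assumes chromosomes are drawn independently from a population in which the allele has frequency `freq`. This is a textbook simplification, not necessarily the model behind the slide, and the function name is hypothetical.

```python
def prob_allele_in_sample(freq, n_individuals, ploidy=2):
    """Probability that at least one sampled chromosome carries the allele."""
    n_chromosomes = ploidy * n_individuals
    return 1.0 - (1.0 - freq) ** n_chromosomes

for n in (10, 100, 1000):
    # A 1% allele is often absent from 10 diploid samples, rarely from 1000.
    print(n, round(prob_allele_in_sample(0.01, n), 3))
```

An allele that is not captured at this step cannot be recovered by any amount of read coverage.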

  48. Allele calling in deep sequence data
      aatgtagtaAgtacctac
      aatgtagtaAgtacctac
      aatgtagtaAgtacctac
      aatgtagtaAgtacctac
      aatgtagtaCgtacctac
      aatgtagtaCgtacctac
      aatgtagtaAgtacctac
      aatgtagtaAgtacctac
      aatgtagtaAgtacctac
      aatgtagtaAgtacctac
      aatgtagtaAgtacctac
      aatgtagtaAgtacctac
      aatgtagtaAgtacctac
      aatgtagtaAgtacctac
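
As a rough illustration of allele calling on a pileup like the one on this slide (12 reads with A and 2 with C at the variable position), the sketch below computes diploid genotype log-likelihoods. It is a generic textbook-style model, not the caller described in the talk. It assumes all reads come from one diploid individual, errors are independent, and every base has quality Q20. The slide gives no quality values, so these are invented.

```python
import math

def genotype_log_likelihoods(bases, quals, alleles=("A", "C")):
    """Log-likelihood of each diploid genotype given base calls and Phred qualities."""
    a, b = alleles
    result = {}
    for genotype in ((a, a), (a, b), (b, b)):
        ll = 0.0
        for base, q in zip(bases, quals):
            e = 10.0 ** (-q / 10.0)  # error probability of this base call
            # Each read samples one of the two chromosomes with probability 1/2.
            p = sum(0.5 * ((1.0 - e) if base == allele else e / 3.0)
                    for allele in genotype)
            ll += math.log(p)
        result["".join(genotype)] = ll
    return result

column = "AAAACCAAAAAAAA"      # the variable position, read by read
quals = [20] * len(column)     # hypothetical base qualities
lls = genotype_log_likelihoods(column, quals)
print(max(lls, key=lls.get), lls)
```

Under these assumptions the heterozygous genotype AC has the highest likelihood, because two independent Q20 errors producing the same C are less probable than sampling the C chromosome twice in 14 reads.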

  49. Allele calling in the reads (figure labels: base call, base quality, individual read coverage, sample size)

  50. More samples or deeper coverage / sample? • Shallower read coverage from more individuals … • …or deeper coverage from fewer samples? • simulation analysis by Aaron Quinlan
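
The trade-off can be explored with a small analytic toy model. This is not the simulation analysis referenced on the slide. The model assumes Hardy-Weinberg genotype frequencies, Poisson-distributed read counts, no sequencing errors, and a deliberately crude detection rule: the allele counts as detected if at least one individual shows `min_alt_reads` or more reads carrying it. All names and numbers are illustrative.

```python
import math

def poisson_tail(mean, k):
    """P(X >= k) for X ~ Poisson(mean)."""
    below = sum(math.exp(-mean) * mean ** i / math.factorial(i) for i in range(k))
    return 1.0 - below

def detection_probability(freq, n_individuals, coverage, min_alt_reads=2):
    """Chance that some individual shows >= min_alt_reads reads with the allele."""
    p_het = 2.0 * freq * (1.0 - freq)
    p_hom = freq * freq
    # A heterozygote contributes on average half of its reads to the allele.
    p_ind = (p_het * poisson_tail(coverage / 2.0, min_alt_reads)
             + p_hom * poisson_tail(coverage, min_alt_reads))
    return 1.0 - (1.0 - p_ind) ** n_individuals

# Same total budget (400x) split three ways, for a 1% allele.
for n, c in [(400, 1), (100, 4), (20, 20)]:
    print(n, c, round(detection_probability(0.01, n, c), 3))
```

In this toy setting the intermediate design (100 individuals at 4x) scores highest. Very shallow coverage rarely yields two supporting reads in one individual, and very few samples often miss the allele altogether. A caller that pools evidence across individuals would shift this balance.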
