
From calling bases to calling variants: Experiences with Illumina data


Presentation Transcript


  1. From calling bases to calling variants: Experiences with Illumina data. Gerton Lunter, Wellcome Trust Centre for Human Genetics

  2. This talk • Refresher: Illumina sequencing • QC • What can go wrong • Useful QC statistics • Read mapping • Comparison of popular read mappers • Stampy • Indel and SNP calling • (Some results: 1000 Genomes indel calls)

  3. 7x Illumina GA-II 2x Roche 454 1x Illumina HiSeq 2000

  4. 1. Refresher: Illumina sequencing

  5. Illumina sequencing

  6. Illumina sequencing • 8 lanes… x 120 tiles x 108 bp x 2 reads… = about 48 Gb raw bp

  7. 2. QC

  8. Quality issues • Bases are identified by their fluorescent tag • Overlapping emission spectra • Single base per cycle: reversible terminator chemistry • Not perfect: a fraction lags, a fraction runs ahead: “phasing” • Limits read length • Optimizing yield: cluster density • Higher densities mean more errors • Above an optimum, yield decreases • Partly a signal processing issue: software improvements • Low amounts of initial DNA • Linker-linker hybrids; duplicated reads

  9. Overlapping fluorescence spectra • C/A and G/T overlap • (Most common mutations are transitions, A-G and C-T) Rougemont et al., 2008

  10. Refresher: Phred scores • Phred score = −10 log10(probability of error) • 10: 10% error probability • 20: 1% error probability • 30: 0.1% error probability (one in 1,000) • 3 ≈ 50%, 7 ≈ 20% • 13 ≈ 5%, 17 ≈ 2% • 23 ≈ 0.5%, 27 ≈ 0.2% • 33 ≈ 0.05%, 37 ≈ 0.02%
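
For concreteness, a minimal Python sketch of the Phred conversion defined above (function names are ours, chosen for illustration):

```python
import math

def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score to an error probability: Q = -10*log10(p)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p: float) -> float:
    """Convert an error probability back to a Phred score."""
    return -10 * math.log10(p)

# Reproduce a few of the values from the slide:
for q in (10, 20, 30, 3, 13, 23, 33):
    print(f"Q{q}: error probability {phred_to_error_prob(q):.4f}")
```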

  11. Phasing [figure: phasing profiles compared, August 2009 vs. June 2010]

  12. Cluster density & other improvements [figure: August 2009 vs. June 2010 comparison]

  13. Library complexity, duplicate reads • Some sequences are read several times: • Low amount of initial material, many PCR copies • Optical duplicates; secondary cluster seeding • Problem for variant calling • Any PCR error will be seen twice: evidence for variant • Rate of duplicates is rarely >5% • Criterion: both ends of a PE read map to matching location • Can occur by chance, but low probability, except for very high coverage • Post processing: duplicate removal • Standard processing step (e.g. Samtools, Picard) • Useful statistic: • Duplicate fraction is approximately additive across lanes (same library) • 2x duplication fraction ≈ fraction of the library that was sequenced

  14. Library complexity, duplicate reads • A fraction α of all molecules is sequenced • Number of times a given molecule is sequenced: Poisson(α) • Expected fraction of duplicates: e^(−α) − 1 + α • As a fraction of all reads sequenced: (e^(−α) − 1 + α)/α = ½α + …
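
A small sketch of this Poisson model and the "2x" rule from the previous slide (function names are illustrative):

```python
import math

def expected_duplicate_fraction(alpha: float) -> float:
    """Expected fraction of sequenced reads that are duplicates when a fraction
    alpha of the library's distinct molecules is sequenced and each molecule is
    sampled Poisson(alpha) times: (e^-alpha - 1 + alpha) / alpha."""
    return (math.exp(-alpha) - 1 + alpha) / alpha

def estimate_library_fraction(dup_fraction: float) -> float:
    """Invert the first-order approximation dup_fraction ~ alpha/2 to estimate
    the fraction of the library that was sequenced (the '2x' rule)."""
    return 2 * dup_fraction

print(expected_duplicate_fraction(0.1))   # ~0.048, close to alpha/2 = 0.05
print(estimate_library_fraction(0.05))    # ~0.1 of the library sequenced
```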

  15. Sequencing QC

  16. QC statistics

  17. QC statistics - coverage

  18. QC statistics – quality scores

  19. GATK recalibration tool

  20. 3. Read mapping

  21. Read mapping • First processing step after sequencing: • Read mapping (most of the time) • Assembly (no reference sequence; specialized analyses) • Quality of mapping determines downstream results • Accessible genome • Biases (ref vs. variant) • Sensitivity (divergent reference; SNPs, indels, SVs) • Specificity (calibration of mapping quality)

  22. Read mapper comparison • Read mappers: • Maq • BWA • Eland • Novoalign • Stampy • Criteria: • Sensitivity (overall; divergent reference; variants) • Specificity (mapping quality calibration) • Speed

  23. Sensitivity

  24. Sensitivity - indels

  25. Sensitivity – Divergent reference

  26. Specificity – ROC curves ROC - indels

  27. Performance on real data Proportion mapped to within 10kb of mate

  28. Efficiency

  29. Stampy – first part of algorithm • Read split into 15 bp subsequences, each packed into a 29-bit word (reverse-complement symmetry removed) • Looked up in a 4-byte × 2^29-entry (2 GB) hash table • Open addressing, cache-friendly • Yields candidate mapping positions
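
A sketch of the canonical k-mer idea behind this slide: a 15 bp word is 30 bits at 2 bits per base, and since 15 is odd no k-mer equals its own reverse complement, so collapsing the two strands leaves exactly 2^29 distinct words. This is not Stampy's actual code; the min-of-strands canonicalization is our assumption, and the final packing of the canonical set into a dense 29-bit key is omitted.

```python
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}
K = 15

def encode(kmer: str) -> int:
    """Pack a 15 bp string into a 30-bit integer, 2 bits per base."""
    word = 0
    for base in kmer:
        word = (word << 2) | ENCODE[base]
    return word

def revcomp(word: int, k: int = K) -> int:
    """Reverse-complement a 2-bit-packed k-mer (complement code = 3 - code)."""
    rc = 0
    for _ in range(k):
        rc = (rc << 2) | (3 - (word & 3))
        word >>= 2
    return rc

def canonical(kmer: str) -> int:
    """One representative per strand pair: min of the word and its revcomp."""
    w = encode(kmer)
    return min(w, revcomp(w))

# Both strands of the same site map to the same hash-table slot:
assert canonical("ACGTACGTACGTACG") == canonical("CGTACGTACGTACGT")
```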

  30. Second part: fast candidate alignment • Single-instruction-multiple-data (SIMD), parallel execution • Affine gap penalties • Linear-time, constant-memory algorithm: DP table in registers • Maximum indel size 15 bp
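
A scalar Python sketch of affine-gap dynamic programming (Gotoh's recurrences), to make the candidate-alignment step concrete. Stampy's version runs these recurrences SIMD-parallel with the DP rows held in registers and a 15 bp indel band; both are omitted here, and the penalties are illustrative, not Stampy's.

```python
NEG = float("-inf")

def affine_align_score(read, ref, match=1, mismatch=-3, gap_open=-5, gap_extend=-1):
    """Global alignment score with affine gap penalties, two rows at a time.
    M: match/mismatch state; X: gap in ref (insertion); Y: gap in read (deletion)."""
    m, n = len(read), len(ref)
    M = [NEG] * (n + 1); X = [NEG] * (n + 1); Y = [NEG] * (n + 1)
    M[0] = 0
    for j in range(1, n + 1):
        Y[j] = gap_open + (j - 1) * gap_extend
    for i in range(1, m + 1):
        prev_M, prev_X, prev_Y = M[:], X[:], Y[:]
        M[0], Y[0] = NEG, NEG
        X[0] = gap_open + (i - 1) * gap_extend
        for j in range(1, n + 1):
            s = match if read[i - 1] == ref[j - 1] else mismatch
            M[j] = max(prev_M[j - 1], prev_X[j - 1], prev_Y[j - 1]) + s
            X[j] = max(prev_M[j] + gap_open, prev_X[j] + gap_extend)
            Y[j] = max(M[j - 1] + gap_open, Y[j - 1] + gap_extend)
    return max(M[n], X[n], Y[n])

# Read carries a 2 bp insertion: 8 matches + gap open + extend = 8 - 5 - 1 = 2
print(affine_align_score("ACGTTTACGT", "ACGTACGT"))
```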

  31. Third part: Modeling mapping failures • Pseudo-Bayesian posterior (using candidates, rather than all mapping positions) • Failure to find the correct candidate (2 or more mismatches in every 15 bp subsequence) • Sequence not in reference (is the sequence match better than the expected best random match?)
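
A toy illustration of a pseudo-Bayesian mapping quality computed over the candidates that were found, with some probability mass reserved for having missed the true location. The `p_missed` value is a placeholder for the failure model described above, and all names are ours:

```python
import math

def mapping_quality(candidate_log10_liks, p_missed=1e-4):
    """Posterior of the best candidate = its likelihood over the sum across
    candidates found, discounted by the chance the true location was missed.
    mapQ = -10*log10(P(wrong map)), capped at 60."""
    best = max(candidate_log10_liks)
    total = sum(10 ** (s - best) for s in candidate_log10_liks)  # rescaled for stability
    posterior = (1.0 / total) * (1 - p_missed)
    p_wrong = max(1 - posterior, 1e-25)                          # avoid log10(0)
    return min(int(round(-10 * math.log10(p_wrong))), 60)

print(mapping_quality([-2.0, -8.0]))   # one clear winner -> high mapQ (~40)
print(mapping_quality([-2.0, -2.1]))   # two near-equal candidates -> low mapQ (~4)
```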

  32. 4. SNP and indel calling

  33. SNP calling • General idea: • Works quite well! Some caveats: • Include mapping quality: P(read | g) = P(read | wrong map) P(wrong map) + P(read | g, correct map) P(correct map) • Mapping errors are dependent: don't include mapQ < 10 • Base errors are not uniform (A/C/G/T): assume worst case (all identical) • Assumes no anomalies (segdups; alignments; indel/SV; …) • Hard problem: be conservative • Expected SNP rate (human): 10^−3/nt; an FPR of 10^−5 is required for 1% FDR • Filtering is required to achieve a good FDR – or all data features must be adequately modeled
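
A sketch of the mixture formula on this slide, including the mapQ < 10 filter and the worst-case base-error assumption (a mismatching read base gets the full error probability, not a third of it). Function names and the example pileup are illustrative:

```python
def read_likelihood(base, qual, mapq, genotype_base):
    """P(read | g) = P(read | wrong map) P(wrong map)
                   + P(read | g, correct map) P(correct map).
    A wrongly mapped read is treated as uninformative (0.25 per base)."""
    p_wrong_map = 10 ** (-mapq / 10)
    p_base_err = 10 ** (-qual / 10)
    if base == genotype_base:
        p_correct = 1 - p_base_err
    else:
        p_correct = p_base_err   # worst case: all errors yield the same wrong base
    return 0.25 * p_wrong_map + p_correct * (1 - p_wrong_map)

def genotype_likelihood(pileup, allele1, allele2):
    """Diploid likelihood: average over the two alleles, product over reads."""
    L = 1.0
    for base, qual, mapq in pileup:
        if mapq < 10:            # mapping errors are dependent: drop low mapQ
            continue
        L *= 0.5 * (read_likelihood(base, qual, mapq, allele1)
                    + read_likelihood(base, qual, mapq, allele2))
    return L

reads = [("A", 30, 50), ("A", 28, 40), ("G", 25, 45), ("G", 30, 8)]
print(genotype_likelihood(reads, "A", "G"))  # het A/G: best fit to this pileup
print(genotype_likelihood(reads, "A", "A"))  # hom A: penalized by the G read
```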

  34. Indel calling • General idea: • Differences with SNP calling: • Pseudo-Bayes: cannot consider all possible variants/genotypes → generate a large set of candidates → filter using a goodness-of-fit test • Illumina reads do not have an explicit indel error model
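
A hypothetical sketch of the candidate-generation step: collect every indel seen in the gapped read alignments (via their CIGAR strings) and keep those with enough read support. Real callers then realign reads around each candidate and apply the goodness-of-fit filter mentioned above, which is omitted here; the input format and names are our assumptions:

```python
from collections import Counter
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def indel_candidates(alignments, min_support=2):
    """alignments: iterable of (reference_start, cigar_string) pairs.
    Returns {(ref_pos, 'I'|'D', length): supporting-read count}."""
    counts = Counter()
    for ref_start, cigar in alignments:
        pos = ref_start
        for length, op in ((int(n), c) for n, c in CIGAR_OP.findall(cigar)):
            if op in "M=X":
                pos += length                     # consumes read and reference
            elif op == "I":
                counts[(pos, "I", length)] += 1   # insertion before ref pos
            elif op in "DN":
                counts[(pos, "D", length)] += 1   # deletion starting at ref pos
                pos += length
            # S/H/P consume no reference, so pos is unchanged
    return {k: v for k, v in counts.items() if v >= min_support}

alns = [(100, "50M2D50M"), (120, "30M2D72M"), (200, "60M1I41M")]
print(indel_candidates(alns))  # the 2 bp deletion at 150 is seen twice and kept
```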

  35. Indel error model [figure: indel error model as a function of homopolymer run length]

  36. Wrap up • GA-II produces large amounts of good data • Artefacts do occur: keep an eye on the QC statistics • Choice of mapper influences yield and quality • Variant calling: • Bayesian approaches work well • Some assumptions (independence) are not met and are hard to model • Filtering remains necessary
