VARiD: A Variation Detection Framework for Color-space and Letter-space platforms

Presentation Transcript

  1. VARiD: A Variation Detection Framework for Color-space and Letter-space platforms. By A.V. Dalca, S.M. Rumble, S. Levy, M. Brudno. Presented by Velian Pandeliev.

  2. VARiD Overview • Purpose: Variation Detection (SNP, indel) • Pitch: First to use both colour-space and letter-space data • Principle: Hidden Markov Model with Forward-Backward algorithm • Platform: 454/Roche, Solexa, ABI SOLiD • Pros: Can work with unconverted sets of both formats simultaneously • Performance: linear in length of reference, great on mixed format data

  3. ABI SOLiD Basics • Reads bases two at a time • Outputs one of four colours based on transition state machine:

  4. ABI SOLiD Properties Read errors and SNPs present differently. Reference:

  5. ABI SOLiD Properties Read errors and SNPs present differently. Reference: Error:

  6. ABI SOLiD Properties Read errors and SNPs present differently. Reference: Error: SNP:

  7. ABI SOLiD Properties A read error propagates through the rest of the sequence on translation to letter-space
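
The difference between a read error and a SNP can be reproduced with a few lines of code. This is a sketch under the standard di-base encoding; the reference string and helper names are made up for illustration.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = "ACGT"


def encode(letters):
    """Letter space -> colour space (XOR of adjacent 2-bit base codes)."""
    return [BASE_CODE[a] ^ BASE_CODE[b] for a, b in zip(letters, letters[1:])]


def decode(first_base, colours):
    """Colour space -> letter space, starting from a known first base."""
    out = [first_base]
    for c in colours:
        out.append(CODE_BASE[BASE_CODE[out[-1]] ^ c])
    return "".join(out)


reference = "ACGGTCA"                      # hypothetical reference
ref_colours = encode(reference)            # [1, 3, 0, 1, 2, 1]

# Read error: a single colour is misread.
err_colours = list(ref_colours)
err_colours[2] ^= 1
err_letters = decode(reference[0], err_colours)
error_mismatches = [i for i, (a, b) in enumerate(zip(reference, err_letters)) if a != b]

# SNP: a single letter differs in the donor.
snp_letters = reference[:3] + "T" + reference[4:]
snp_colours = encode(snp_letters)
snp_colour_diffs = [i for i, (a, b) in enumerate(zip(ref_colours, snp_colours)) if a != b]

if __name__ == "__main__":
    print(err_letters, error_mismatches)   # every letter after the error changes
    print(snp_colours, snp_colour_diffs)   # exactly two adjacent colours change
```

One misread colour corrupts every downstream letter after translation, whereas a real SNP changes exactly two adjacent colours and leaves the rest of the read intact.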

  8. Consequences • Colour-space encoding is better suited to calling SNPs than letter-space encoding. • In letter-space data, errors do not propagate through to the rest of the read. • Wouldn’t it be great to have a SNP calling framework that could use both kinds of data?

  9. VARiD • A Hidden Markov Model for Variation Detection. In general, HMMs have the following elements: • States (hidden) • Transitions (probabilities of reaching any particular state from the previous one) • Emissions (observed outputs)

  10. Building a Basic HMM States: pairs of consecutive letter-space positions: S = {AA, AT, AC, AG, TT, TA, TC, TG, CC, CA, CT, CG, GG, GA, GT, GC}

  11. Building a Basic HMM Transitions: since consecutive states share a nucleotide, probabilities are defined as follows:
      $$P(WX \to YZ) = \begin{cases} \mathrm{frequency}(Z) & \text{if } X = Y \\ 0 & \text{if } X \neq Y \end{cases}$$
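
The transition rule can be sketched as a small lookup. The uniform nucleotide frequencies below are an assumption made for the example; the slides do not say which frequencies VARiD uses.

```python
from itertools import product

BASES = "ACGT"
# The 16 basic states: every ordered pair of nucleotides.
STATES = ["".join(p) for p in product(BASES, repeat=2)]
# Assumed uniform nucleotide frequencies (illustrative only).
FREQ = {b: 0.25 for b in BASES}


def transition(prev: str, nxt: str, freq=FREQ) -> float:
    """P(prev -> nxt): non-zero only if the states overlap by one nucleotide."""
    return freq[nxt[1]] if prev[1] == nxt[0] else 0.0


if __name__ == "__main__":
    print(transition("CA", "AG"))  # 0.25, since the shared nucleotide is A
    print(transition("CA", "GA"))  # 0.0, since A != G
```

Each state has exactly four reachable successors, so every row of the transition table sums to 1 whenever the frequencies do.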

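  12. Building a Basic HMM Emissions: a letter and a colour from donor reads at each state. E.g., for colour space:
      $$P(\text{emission} = c \mid \text{state} = CA) = q(c \mid CA) = \begin{cases} 1 - 3\varepsilon & \text{if } c = 1 \\ \varepsilon & \text{if } c \in \{0, 2, 3\} \end{cases}$$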

  13. Building a Basic HMM Emissions: a letter and a colour from donor reads at each state. E.g., for letter space:
      $$P(\text{emission} = n \mid \text{state} = CA) = q(n \mid CA) = \begin{cases} 1 - 3\xi & \text{if } n = A \\ \xi & \text{if } n \in \{C, G, T\} \end{cases}$$
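
The two emission rules can be written as functions of the state. This is a sketch: the default error rates of 0.01 are placeholders chosen for the example, not values from the paper, and the function names are hypothetical.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def q_colour(c: int, state: str, eps: float = 0.01) -> float:
    """P(colour c | state): the state's own colour is the XOR of its two bases."""
    expected = BASE_CODE[state[0]] ^ BASE_CODE[state[1]]
    return 1 - 3 * eps if c == expected else eps


def q_letter(n: str, state: str, xi: float = 0.01) -> float:
    """P(letter n | state): the state emits its second nucleotide."""
    return 1 - 3 * xi if n == state[1] else xi


if __name__ == "__main__":
    print([q_colour(c, "CA") for c in range(4)])   # colour 1 is the likely one
    print([q_letter(n, "CA") for n in "ACGT"])     # letter A is the likely one
```

Both distributions sum to 1 over their four possible outputs, with the error mass split evenly across the three wrong observations.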

  14. Building a Basic HMM Emission probabilities from all reads: P(emissions = E|state = s) = which combines colour and letter space data

  15. Building a Basic HMM Detecting variation is accomplished through finding the maximum likelihood state for each position in the genotype (the donor) and comparing it against the reference nucleotide.

  16. Building a Basic HMM By running the Forward-Backward algorithm on the HMM, a probability distribution over the possible states is obtained at each position and a base is called (in bold). Source: Dalca, A. & Brudno, M. (Poster)
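
A compact Forward-Backward pass over the basic 16-state model might look like the sketch below. Several things here are assumptions for illustration: uniform nucleotide frequencies, fixed error rates, a uniform prior on the first state, and reads treated as independent so that their emission probabilities multiply. It is not VARiD's implementation.

```python
from itertools import product

BASES = "ACGT"
CODE = {b: i for i, b in enumerate(BASES)}
STATES = ["".join(p) for p in product(BASES, repeat=2)]


def transition(prev, nxt):
    # Assumed uniform nucleotide frequency of 0.25.
    return 0.25 if prev[1] == nxt[0] else 0.0


def emission(obs, state, eps=0.01, xi=0.01):
    """obs = (colours, letters) observed at one position.

    Assumption: reads are independent, so per-read probabilities multiply.
    """
    colours, letters = obs
    expected = CODE[state[0]] ^ CODE[state[1]]
    p = 1.0
    for c in colours:
        p *= 1 - 3 * eps if c == expected else eps
    for n in letters:
        p *= 1 - 3 * xi if n == state[1] else xi
    return p


def normalise(row):
    total = sum(row.values())
    return {s: v / total for s, v in row.items()}


def forward_backward(observations):
    """Return the posterior distribution over states at every position."""
    fwd = []
    for t, obs in enumerate(observations):
        row = {}
        for s in STATES:
            if t == 0:
                prior = 1.0 / len(STATES)
            else:
                prior = sum(fwd[-1][r] * transition(r, s) for r in STATES)
            row[s] = prior * emission(obs, s)
        fwd.append(normalise(row))

    bwd = [None] * len(observations)
    bwd[-1] = {s: 1.0 for s in STATES}
    for t in range(len(observations) - 2, -1, -1):
        nxt = observations[t + 1]
        bwd[t] = normalise({
            s: sum(transition(s, r) * emission(nxt, r) * bwd[t + 1][r] for r in STATES)
            for s in STATES
        })

    return [normalise({s: f[s] * b[s] for s in STATES}) for f, b in zip(fwd, bwd)]


def call_states(observations):
    """Most probable state at each position."""
    return [max(post, key=post.get) for post in forward_backward(observations)]


if __name__ == "__main__":
    # Hypothetical pile-up for a donor "ACGT": colours 1, 3, 1 and letters C, G, T.
    pileup = [([1, 1], ["C"]), ([3], ["G", "G"]), ([1, 1], [])]
    print(call_states(pileup))
```

The last position in the example has colour reads only, yet it is still resolved because the transition structure forces it to start with the nucleotide called at the previous position.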

  17. Extensions The HMM described above is quite simple and only calls a single nucleotide for each position. VARiD extends the model to detect heterozygous SNPs, as well as to handle indels.

  18. Microindels To deal with microindels (<5 bp) in the sample, gap states are required: E.g. [A - - - G] (would emit colour 2) • 4 dummy ‘gap’ nucleotides are defined, one each for A, C, G, T • [A - - - G] = {(A, gap-A), (gap-A, gap-A), (gap-A, gap-A), (gap-A, G)} Colour 2

  19. Microindels Requires 24 more states: • (X, gapX) × 4 • (gapX, gapX) × 4 • (gapX, Y) × 16 • Total (incl. orig.) 40 states

  20. Heterozygous SNPs For diploid samples, each state has to account for heterozygous differences. Each state in VARiD’s HMM is a unique combination of two of the original 40 states (obtained by S × S). 40² = 1600 states!
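
The state counts for the microindel and diploid extensions can be checked by enumerating them. This sketch only reproduces the counting; the tuple representation and the `gapX` labels are illustrative.

```python
from itertools import product

BASES = "ACGT"
GAPS = [f"gap{b}" for b in BASES]           # one dummy gap nucleotide per base

base_states = list(product(BASES, repeat=2))                # 16 original states
open_gap = [(b, f"gap{b}") for b in BASES]                  # (X, gapX)     x 4
extend_gap = [(g, g) for g in GAPS]                         # (gapX, gapX)  x 4
close_gap = [(g, b) for g in GAPS for b in BASES]           # (gapX, Y)     x 16

haploid_states = base_states + open_gap + extend_gap + close_gap
diploid_states = list(product(haploid_states, repeat=2))    # S x S

if __name__ == "__main__":
    print(len(haploid_states))   # 40
    print(len(diploid_states))   # 1600
```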

  21. Features • Keeps track of quality scores and positions within a read to augment HMM error rates (ε, ξ) for greater accuracy • Post-processing ensures that all heterozygous SNP calls are supported by enough reads

  22. Features Source: Original paper

  23. Features First T in a read is NOT part of the sequence.

  24. Features First T is NOT part of the genotype! VARiD eliminates the linker remnant without having to translate the read fully.
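
One way the leading primer base could be dropped without translating the whole read is to decode only the first colour. This is a hedged sketch of the idea, assuming reads arrive as a primer letter followed by colour digits; `strip_primer` is a hypothetical helper, and the slides do not describe how VARiD actually does this.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = "ACGT"


def strip_primer(read: str) -> tuple[str, list[int]]:
    """Split a colour-space read into (first real base, remaining colours).

    The leading letter is the primer base, not part of the sequenced
    fragment. Only the first colour is decoded; the rest stay in colour space.
    """
    primer = read[0]
    colours = [int(c) for c in read[1:]]
    first_base = CODE_BASE[BASE_CODE[primer] ^ colours[0]]
    return first_base, colours[1:]


if __name__ == "__main__":
    print(strip_primer("T3201"))  # ('A', [2, 0, 1])
```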

  25. VALiDation • 260 kb from the human genome • Sequenced with ABI SOLiD and 454/Roche • Reference obtained through Sanger reads • Artificial datasets created with varying amounts of coverage • Tested in colour space alone (against Corona), in letter space alone (against gigaBayes) with various aligners, and with a combination of data

  26. VALiDation Measures: • True Positives (correctly identified SNPs) • False Positives (SNPs not in Sanger set) • Precision (TP as fraction of all predictions) • Recall (TP as fraction of Sanger set SNPs)
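
The four measures can be computed from two sets of SNP positions. The positions below are invented purely to exercise the function and are not results from the study.

```python
def evaluate(predicted: set, truth: set):
    """Return (TP, FP, precision, recall) for predicted vs. reference SNP sets."""
    tp = len(predicted & truth)            # calls confirmed by the reference set
    fp = len(predicted - truth)            # calls absent from the reference set
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    return tp, fp, precision, recall


if __name__ == "__main__":
    predicted = {101, 250, 377, 400}       # hypothetical caller output
    sanger = {101, 250, 512}               # hypothetical Sanger-confirmed SNPs
    print(evaluate(predicted, sanger))
```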

  27. VALiDation: Colour space only. In colour space, VARiD had slightly higher precision than the Corona caller on AB-mapped reads, with comparable but slightly lower recall. Using VARiD with SHRiMP produced higher recall but lower precision than VARiD + AB mapper. (No significance statistics were presented.)

  28. VALiDation: Letter space only. In letter space, gigaBayes + mosaik performed better than VARiD (using the same mosaik mapper) at low coverage, but fell behind at higher coverage. VARiD + SHRiMP did better than VARiD + mosaik at both low and high coverage, and clearly outperformed gigaBayes at 20x coverage.

  29. VALiDation: Mixed space. VARiD’s true strength lies in being able to combine colour- and letter-space reads and to perform better on them than on cost-equivalent letter-only or colour-only data:

  30. Issues • No statistical significance presented on performance improvement • Experimental size relatively small (260 kb) • Not ideal for low coverage data • Would be interesting to see how VARiD performs on more diverse data sets (more/fewer SNPs, indels, etc.)

  31. Issues • No statistical significance presented on performance improvement • Experimental size relatively small (260 kb) • Not ideal for low coverage data • Would be interesting to see how VARiD performs on more diverse data sets (more/fewer SNPs, indels, etc.) • Any more?

  32. The End.

  33. References • Dalca, A.V., Rumble, S.M., Levy, S., Brudno, M. VARiD: A Variation Detection Framework for Color-space and Letter-space platforms. 2010 (in progress). • Dalca, A.V. & Brudno, M. VARiD: Variation Detection in Color-space and Letter-space (poster). • Hidden Markov model. (2010, February 2). In Wikipedia, The Free Encyclopedia. Retrieved 13:24, February 10, 2010, from http://en.wikipedia.org/w/index.php?title=Hidden_Markov_model&oldid=341442380 • Rumble, S.M., Lacroute, P., Dalca, A.V., Fiume, M., Sidow, A. and Brudno, M. (2009) SHRiMP: Accurate mapping of short color-space reads. PLoS Comput Biol.