VARiD: A Variation Detection Framework for Color-space and Letter-space platforms

By A.V. Dalca, S.M. Rumble, S. Levy, M. Brudno
Presented by Velian Pandeliev

VARiD Overview
• Purpose: Variation detection (SNPs, indels)
• Pitch: First to use both colour-space and letter-space data
• Principle: Hidden Markov Model with the Forward-Backward algorithm
• Platform: 454/Roche, Solexa, ABI SOLiD
• Pros: Can work with unconverted sets of both formats simultaneously
• Performance: Linear in the length of the reference; great on mixed-format data

ABI SOLiD Basics
• Reads bases two at a time
• Outputs one of four colours (0-3), determined by the transition between the two bases (a four-state machine over A, C, G, T)
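The two-base encoding can be sketched in a few lines. This is a minimal illustration, assuming the commonly used 2-bit base codes (A=0, C=1, G=2, T=3), under which the colour of a dinucleotide is the XOR of its two codes. The `primer` argument (a known leading base, T by default) and the function names are assumptions made for this example.

```python
# Assumed 2-bit base codes; a dinucleotide's colour is the XOR of its two codes.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def colour(first, second):
    """Colour (0-3) emitted for the transition first -> second."""
    return BASE_CODE[first] ^ BASE_CODE[second]

def encode(seq, primer="T"):
    """Colour-space encoding of seq, starting from a known primer base."""
    prev = primer
    colours = []
    for base in seq:
        colours.append(colour(prev, base))
        prev = base
    return colours

print(encode("ACGT"))
```

Under this encoding the dinucleotide CA maps to colour 1 and AG to colour 2, which agrees with the examples used on the later HMM and microindel slides.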

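ABI SOLiD Properties
Read errors and SNPs present differently in colour space: a single read error changes one colour, whereas a SNP changes two adjacent colours.
[Figure: colour-space sequences for the reference, a read with an error, and a read with a SNP]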

ABI SOLiD Properties
A read error propagates through the rest of the sequence on translation to letter-space.
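A small worked example makes the difference concrete. It is a sketch under the assumed XOR encoding (A=0, C=1, G=2, T=3) with an assumed leading primer base T; the sequences are invented for illustration.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = "ACGT"

def encode(seq, primer="T"):
    """Letter space -> colour space (XOR of adjacent base codes)."""
    codes = [BASE_CODE[b] for b in primer + seq]
    return [a ^ b for a, b in zip(codes, codes[1:])]

def decode(colours, primer="T"):
    """Colour space -> letter space, walking forward from the primer base."""
    prev = BASE_CODE[primer]
    out = []
    for c in colours:
        prev ^= c
        out.append(CODE_BASE[prev])
    return "".join(out)

reference = "ACGGTCA"
ref_colours = encode(reference)

# A single colour read error at index 3 ...
err_colours = list(ref_colours)
err_colours[3] ^= 1
err_decoded = decode(err_colours)
# ... corrupts every translated base from that point on.
err_mismatches = [i for i, (a, b) in enumerate(zip(reference, err_decoded)) if a != b]

# A true SNP (G -> A at index 3) changes exactly two adjacent colours.
snp_colours = encode("ACGATCA")
snp_colour_diffs = [i for i, (a, b) in enumerate(zip(ref_colours, snp_colours)) if a != b]

print(err_decoded, err_mismatches, snp_colour_diffs)
```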

Consequences
• Colour-space encoding is better suited to calling SNPs than letter-space encoding
• In letter-space data, errors do not propagate through to the rest of the read
• Wouldn’t it be great to have a SNP calling framework that could use both kinds of data!?

VARiD
• A Hidden Markov Model for Variation Detection
In general, HMMs have the following elements:
• States (hidden)
• Transitions (probabilities of reaching any particular state from the previous one)
• Emissions (observed outputs)

Building a Basic HMM
States: pairs of consecutive letter-space positions:
S = {AA, AT, AC, AG, TT, TA, TC, TG, CC, CA, CT, CG, GG, GA, GT, GC}

Building a Basic HMM
Transitions: since consecutive states share a nucleotide, probabilities are defined as follows:
$$P(\text{transition } WX \to YZ) = \begin{cases} \text{frequency}(Z) & \text{if } X = Y \\ 0 & \text{if } X \neq Y \end{cases}$$
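The state set and transition rule can be sketched as follows. The uniform base frequencies in `FREQ` are an assumption for illustration; in practice frequency(Z) would be estimated from data.

```python
from itertools import product

BASES = "ACGT"
# The 16 states: every ordered pair of consecutive nucleotides.
STATES = ["".join(p) for p in product(BASES, repeat=2)]
# Assumed uniform nucleotide frequencies (illustrative only).
FREQ = {b: 0.25 for b in BASES}

def transition(prev_state, next_state, freq=FREQ):
    """P(WX -> YZ) = frequency(Z) if X == Y, else 0."""
    if prev_state[1] != next_state[0]:
        return 0.0
    return freq[next_state[1]]

# Full 16 x 16 transition table as nested dictionaries.
T = {s: {t: transition(s, t) for t in STATES} for s in STATES}

print(len(STATES), T["CA"]["AG"], T["CA"]["CG"])
```

Each row of `T` sums to 1 because exactly four successor states share the required nucleotide.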

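Building a Basic HMM
Emissions: a letter and a colour from donor reads at each state. E.g. for colour space:
$$P(\text{emission} = c \mid \text{state} = CA) = q(c \mid CA) = \begin{cases} 1 - 3\varepsilon & \text{if } c = 1 \\ \varepsilon & \text{if } c \in \{0, 2, 3\} \end{cases}$$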

Building a Basic HMM
Emissions: a letter and a colour from donor reads at each state. E.g. for letter space:
$$P(\text{emission} = n \mid \text{state} = CA) = q(n \mid CA) = \begin{cases} 1 - 3\xi & \text{if } n = A \\ \xi & \text{if } n \in \{C, G, T\} \end{cases}$$

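Building a Basic HMM
Emission probabilities from all reads, treating reads as independent given the state:
$$P(\text{emissions} = E \mid \text{state} = s) = \prod_{c \in E_{\text{colour}}} q(c \mid s) \prod_{n \in E_{\text{letter}}} q(n \mid s)$$
which combines colour- and letter-space data.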
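A minimal sketch of the combined emission term, assuming reads are independent given the state so that the per-read probabilities multiply. The default values of ε and ξ, the XOR colour encoding, and the function names are assumptions for illustration; logs are used only to avoid underflow at high coverage.

```python
import math

BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def q_colour(c, state, eps):
    """q(c | state): 1 - 3*eps for the expected colour, eps otherwise."""
    expected = BASE_CODE[state[0]] ^ BASE_CODE[state[1]]
    return 1 - 3 * eps if c == expected else eps

def q_letter(n, state, xi):
    """q(n | state): 1 - 3*xi for the state's second nucleotide, xi otherwise."""
    return 1 - 3 * xi if n == state[1] else xi

def log_emission(colours, letters, state, eps=0.01, xi=0.01):
    """Log-probability of all colour and letter observations at one position."""
    total = sum(math.log(q_colour(c, state, eps)) for c in colours)
    total += sum(math.log(q_letter(n, state, xi)) for n in letters)
    return total

# Three colour reads (one disagreeing) and two letter reads at one position.
print(log_emission([1, 1, 0], ["A", "A"], "CA"))
print(log_emission([1, 1, 0], ["A", "A"], "CC"))
```

With these observations the state CA scores higher than CC, since most colour reads and both letter reads agree with it.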

Building a Basic HMM
Variation is detected by finding the most likely state at each position of the donor genome and comparing the resulting nucleotide against the reference nucleotide.

Building a Basic HMM
Running the Forward-Backward algorithm on the HMM yields a probability distribution over the possible states at each position, from which a base is called (shown in bold in the figure).
Source: Dalca, A. & Brudno, M. (Poster)
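The basic 16-state model can be run end to end with a compact Forward-Backward pass. This is an illustrative sketch, not VARiD's implementation: it assumes uniform base frequencies, fixed error rates ε = ξ = 0.05, the XOR colour encoding, and toy observations (lists of colours and letters per position) invented for the example.

```python
from itertools import product

BASES = "ACGT"
CODE = {b: i for i, b in enumerate(BASES)}
STATES = ["".join(p) for p in product(BASES, repeat=2)]

def emission(state, colours, letters, eps=0.05, xi=0.05):
    """Product of per-read emission probabilities at one position."""
    p = 1.0
    expected = CODE[state[0]] ^ CODE[state[1]]
    for c in colours:
        p *= (1 - 3 * eps) if c == expected else eps
    for n in letters:
        p *= (1 - 3 * xi) if n == state[1] else xi
    return p

def transition(s, t):
    """Uniform frequency(Z) = 0.25 when states share a nucleotide, else 0."""
    return 0.25 if s[1] == t[0] else 0.0

def normalise(v):
    total = sum(v.values())
    return {k: x / total for k, x in v.items()}

def forward_backward(obs):
    """Posterior distribution over states at every position."""
    n = len(obs)
    e = [{s: emission(s, col, let) for s in STATES} for col, let in obs]
    # Forward pass, rescaled at each step to avoid underflow.
    fwd = [normalise({s: e[0][s] / len(STATES) for s in STATES})]
    for i in range(1, n):
        fwd.append(normalise({
            t: e[i][t] * sum(fwd[-1][s] * transition(s, t) for s in STATES)
            for t in STATES}))
    # Backward pass, also rescaled.
    bwd = [None] * n
    bwd[-1] = {s: 1.0 for s in STATES}
    for i in range(n - 2, -1, -1):
        bwd[i] = normalise({
            s: sum(transition(s, t) * e[i + 1][t] * bwd[i + 1][t] for t in STATES)
            for s in STATES})
    return [normalise({s: fwd[i][s] * bwd[i][s] for s in STATES}) for i in range(n)]

def call_bases(posteriors):
    """Call the second nucleotide of the most probable state at each position."""
    return "".join(max(p, key=p.get)[1] for p in posteriors)

# Toy donor: states TA, AC, CG, GT; one colour read at position 2 is wrong.
obs = [([3, 3], ["A"]), ([1, 1], ["C"]), ([3, 3, 0], ["G"]), ([1, 1], ["T"])]
posteriors = forward_backward(obs)
print(call_bases(posteriors))
```

The cost is linear in the number of positions (with a constant factor quadratic in the number of states), which matches the linear-in-reference-length behaviour noted in the overview.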

Extensions
The HMM described above is quite simple and only calls a single nucleotide for each position. VARiD extends the model to detect heterozygous SNPs, as well as to handle indels.

Microindels
To deal with microindels (<5 bp) in the sample, gap states are required, e.g. [A - - - G] (would emit colour 2).
• 4 dummy ‘gap’ nucleotides are defined, one each for A, C, G, T
• [A - - - G] = {(A, gap-A), (gap-A, gap-A), (gap-A, gap-A), (gap-A, G)}, emitting colour 2

Microindels
Requires 24 more states:
• (X, gapX) × 4
• (gapX, gapX) × 4
• (gapX, Y) × 16
• Total (incl. original 16): 40 states

Heterozygous SNPs
• For diploid samples, each state has to account for heterozygous differences
• Each state in VARiD’s HMM is a unique combination of two of the original 40 states (obtained by S × S)
• 40² = 1600 states!
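The state counts from the microindel and heterozygous-SNP slides can be checked by enumeration. The tuple representation and the `"gap" + base` naming are assumptions made for this sketch.

```python
from itertools import product

BASES = "ACGT"

base_pairs = [(x, y) for x in BASES for y in BASES]            # 16 original states
gap_open   = [(x, "gap" + x) for x in BASES]                   # (X, gapX)    x 4
gap_extend = [("gap" + x, "gap" + x) for x in BASES]           # (gapX, gapX) x 4
gap_close  = [("gap" + x, y) for x in BASES for y in BASES]    # (gapX, Y)    x 16

# Haploid state space including gap states.
HAPLOID = base_pairs + gap_open + gap_extend + gap_close
# Diploid state space: every ordered pair of haploid states (S x S).
DIPLOID = list(product(HAPLOID, repeat=2))

print(len(HAPLOID), len(DIPLOID))
```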

Features
• Keeps track of quality scores and positions within a read to augment HMM error rates (ε, ξ) for greater accuracy
• Post-processing ensures that all heterozygous SNP calls are supported by enough reads
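One plausible way a quality score could feed into the per-read error rates is sketched below. The Phred conversion is the standard definition; splitting the error evenly over the three wrong symbols is an assumption chosen to match the 1 − 3ε emission model, and `per_symbol_error` is a hypothetical helper. The slides do not state how VARiD actually combines quality and read position.

```python
def phred_to_error(q):
    """Standard Phred definition: error probability = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def per_symbol_error(q):
    """Hypothetical helper: spread the total error evenly over the three
    wrong symbols, so that 1 - 3*eps equals the probability of a correct call."""
    return phred_to_error(q) / 3

for q in (10, 20, 30):
    eps = per_symbol_error(q)
    print(q, eps, 1 - 3 * eps)
```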

Features
[Figure. Source: original paper]

Features
• The first T in a colour-space read is NOT part of the sequence or the genotype: it is a linker remnant
• VARiD eliminates the linker remnant without having to translate the read fully

VALiDation
• 260 kb from the human genome
• Sequenced with ABI SOLiD and 454/Roche
• Reference obtained through Sanger reads
• Artificial datasets created with varying amounts of coverage
• Tested in colour space alone (against Corona), in letter space alone (against gigaBayes) with various aligners, and with a combination of data

VALiDation
Measures:
• True Positives (correctly identified SNPs)
• False Positives (SNPs not in Sanger set)
• Precision (TP as fraction of all predictions)
• Recall (TP as fraction of Sanger set SNPs)
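The four measures can be computed from two sets of calls. A minimal sketch: the (position, allele) tuples below are made-up values for illustration, not data from the study.

```python
def precision_recall(called, truth):
    """Return (TP, FP, precision, recall) for predicted vs. reference SNP sets."""
    called, truth = set(called), set(truth)
    tp = len(called & truth)          # correctly identified SNPs
    fp = len(called - truth)          # predictions absent from the truth set
    precision = tp / len(called) if called else 0.0
    recall = tp / len(truth) if truth else 0.0
    return tp, fp, precision, recall

# Hypothetical calls as (position, allele) pairs.
called = {(101, "A"), (250, "G"), (300, "T")}
sanger = {(101, "A"), (250, "G"), (410, "C"), (500, "T")}

tp, fp, precision, recall = precision_recall(called, sanger)
print(tp, fp, precision, recall)
```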

VALiDation
Colour space only
In colour space, VARiD had slightly higher precision than the Corona caller on AB-mapped reads, with comparable (slightly lower) recall. Using VARiD with SHRiMP produced higher recall but lower precision than VARiD + AB mapper. (No significance statistics were presented.)

VALiDation
Letter space only
In letter space, gigaBayes + mosaik performed better than VARiD (using the same mosaik mapper) at low coverage, but fell behind at higher coverage. VARiD + SHRiMP did better than VARiD + mosaik at both low and high coverage, and clearly outperformed gigaBayes at 20x coverage.

VALiDation
Mixed space
VARiD’s true strength lies in being able to combine colour- and letter-space reads and to perform better on them than on cost-equivalent letter-only or colour-only data.

Issues
• No statistical significance presented on performance improvement
• Experimental size relatively small (260 kb)
• Not ideal for low coverage data
• Would be interesting to see how VARiD performs on more diverse data sets (more/fewer SNPs, indels, etc.)
• Any more?

The End.

References
• Dalca, A.V., Rumble, S.M., Levy, S., Brudno, M. VARiD: A Variation Detection Framework for Color-space and Letter-space platforms. 2010 (in progress)
• Dalca, A.V. & Brudno, M. VARiD: Variation Detection in Color-space and Letter-space (poster)
• Hidden Markov model. (2010, February 2). In Wikipedia, The Free Encyclopedia. Retrieved 13:24, February 10, 2010, from http://en.wikipedia.org/w/index.php?title=Hidden_Markov_model&oldid=341442380
• Rumble, S.M., Lacroute, P., Dalca, A.V., Fiume, M., Sidow, A. and Brudno, M. (2009) SHRiMP: Accurate mapping of short color-space reads. PLoS Comput Biol.
