Kmerstream
Download
1 / 21

KmerStream - PowerPoint PPT Presentation


  • 137 Views
  • Uploaded on

KmerStream. Streaming algorithms for k- mer abundance estimation P áll Melsted @ pmelsted joint work with Bjarni V. Halld órsson. Error rates vs. Quality values . What error rates can we expect from NGS Specifically whole genome sequencing with Illumina sequencing technology

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about ' KmerStream' - izzy


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
Kmerstream

KmerStream

Streaming algorithms for k-mer abundance estimation

Páll Melsted @pmelsted

joint work with Bjarni V. Halldórsson


Error rates vs quality values
Error rates vs. Quality values

  • What error rates can we expect from NGS

    • Specifically whole genome sequencing with Illumina sequencing technology

  • How informative are quality values

    • Rubbish?

    • Worth using for analysis?


Quality values
Quality values

  • A probability estimate that the basecall is correct

    • @SEQ_IDGATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACT+!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>C

  • Phred scale,

    • Pr[base call incorrect] ~ 10-Q/10

!=33, bad basecall


Error rates
Error rates

  • What percentage of basecalls are correct

  • How to estimate

    • Align reads to a reference

    • Count mismatches and non-alignments

    • Correct for snps and variants.

  • Reference free

    • Whole genome assembly?


K mer counting
K-mer counting

  • Count k-mers, want k large, say ~31.

GATTTGGGGTTCAAAGCAGTA

GAT

ATT

TTT

TTG

TGG

...

GAT 2

ATT 3

TTT 2

TTG 3

TGG 2

...

ATTTGGGGTTGATT

ATT

TTT

TTG

TGG

GGG

GGT

GTT

TTG

TGA

GAT

ATT


Errors and k mers
Errors and k-mers

  • Basecall errors impact many k-mers

GATTTGGGGTTCAAAGCAGTA

GATTT

ATTTG

TTTGG

TTGGG

TGGGG

GGGGT

...

AAGCAG AGCAGT

GCAGTA

GATTTGGGGTTCAAAGCAGTA

GATTT

ATTTG

TTTGG

TTGGG

TGGGG

GGGGT

GGGTT

GGTTC


Errors and k mers1
Errors and k-mers

  • Basecall errors are not independent

    • Multiple errors more likely

    • Ends of reads contain more errors

  • K-mer error rate underestimates true basecall error rate

    • Discounts reads with many errors or errors at the ends

    • Can be off by a factor of 2


Frequency histograms
Frequency histograms

  • Sequencing at normal coverage, ~30x, most true k-mers will have high coverage and most error k-mers will have coverage of 1


Na ve method
Naïve method

  • Assumptions:

    • Sampling from a genome of size G

    • Poisson distribution, Poi(λ), of coverage of each position

    • Each k-mer sampled is an error with probε independently.

    • When we sample an error k-mer, it is replaced by a single nucleotide substitution at random


Na ve model
Naïve model

Sample random position

  • Probability that a k-mer has coverage 1

    • εPr[error k-mer has cov 1]+ (1-ε) Pr[true k-mer has cov 1]

Genome length G

TGAC

Introduce one error

ε

1-ε

Produce correct k-mer

TGGC

TGAC


Frequency moments
Frequency moments

  • From the frequency histogram we define

  • fi = number of k-mers with coverage i

  • f1 = number of singletons

  • F0 = number of distinct k-mers = Σ fi

  • F1 = number of all k-mers = Σi fi


Fitting the model
Fitting the model

  • 3 unknown parameters G, λ, ε

  • 3 k-mer frequency statistics, f1, F1, F0


Computing the moments
Computing the moments

  • Count all k-mers? – very memory intensive

  • Sample k-mers (àla KmerGenie)

  • Streaming algorithm, KmerStream

    • Estimates f1, F0, F1 directly without storing any k-mers

    • Accuracy can be specified (default ~2%)


Kmerstream1
KmerStream

  • Very fast, 5-10s per million reads

  • Low memory overhead, ~11M

  • One pass over the dataset

    • Uses hashing to sample k-mers adaptively

    • Lossy counting similar to Bloom filter

    • Does not keep track of k-mers

  • 2-3x faster than KmerGenie, 10x better memory


Validation
Validation

  • Sampled reads from PhiX sequencing lane at 30x coverage, repeated 1000 times.

KmerStream estimates True kmer counts


Real data
Real data

  • Sequenced at deCODE genetics, 2656 individuals, sequenced at 10x to 30x coverage.

  • KmerStream run for all samples, model fit to estimate k-mer error rates for k=31


K mer error rates
K-mer error rates


Quality cutoff
Quality cutoff

  • Keep only k-mers in reads where quality is above q.

  • Run for q = 0, 13, 20, 30.

  • Should correspond to upper bound on error of1.0, 0.05, 0.01, 0.001


K mer error rates1
K-mer error rates

Moving from q0 to q13 huge improvement

q20 to q30 not recommended, 50% samples increased error rate


Wrap up
Wrap up

  • Quality values are informative

    • Can get speed up by prioritizing processing based on quality values e.g. alignment

  • Error rates are highly variable

  • Quality value cutoffs can be done on a case by case basis with minimal overhead.


Thank you
Thank you

  • Paper on bioRxiv

  • Code on github.com/pmelsted/KmerStream

  • Ph.D. position available “Streaming algorithms for whole genome assembly.”


ad