Kmerstream
This presentation is the property of its rightful owner.
Sponsored Links
1 / 21

KmerStream PowerPoint PPT Presentation


  • 96 Views
  • Uploaded on
  • Presentation posted in: General

KmerStream. Streaming algorithms for k- mer abundance estimation P áll Melsted @ pmelsted joint work with Bjarni V. Halld órsson. Error rates vs. Quality values. What error rates can we expect from NGS Specifically whole genome sequencing with Illumina sequencing technology

Download Presentation

KmerStream

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Kmerstream

KmerStream

Streaming algorithms for k-mer abundance estimation

Páll Melsted @pmelsted

joint work with Bjarni V. Halldórsson


Error rates vs quality values

Error rates vs. Quality values

  • What error rates can we expect from NGS

    • Specifically whole genome sequencing with Illumina sequencing technology

  • How informative are quality values

    • Rubbish?

    • Worth using for analysis?


Quality values

Quality values

  • A probability estimate that the basecall is correct

    • @SEQ_IDGATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACT+!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>C

  • Phred scale,

    • Pr[base call incorrect] ~ 10-Q/10

!=33, bad basecall


Error rates

Error rates

  • What percentage of basecalls are correct

  • How to estimate

    • Align reads to a reference

    • Count mismatches and non-alignments

    • Correct for snps and variants.

  • Reference free

    • Whole genome assembly?


K mer counting

K-mer counting

  • Count k-mers, want k large, say ~31.

GATTTGGGGTTCAAAGCAGTA

GAT

ATT

TTT

TTG

TGG

...

GAT 2

ATT 3

TTT 2

TTG 3

TGG 2

...

ATTTGGGGTTGATT

ATT

TTT

TTG

TGG

GGG

GGT

GTT

TTG

TGA

GAT

ATT


Errors and k mers

Errors and k-mers

  • Basecall errors impact many k-mers

GATTTGGGGTTCAAAGCAGTA

GATTT

ATTTG

TTTGG

TTGGG

TGGGG

GGGGT

...

AAGCAG AGCAGT

GCAGTA

GATTTGGGGTTCAAAGCAGTA

GATTT

ATTTG

TTTGG

TTGGG

TGGGG

GGGGT

GGGTT

GGTTC


Errors and k mers1

Errors and k-mers

  • Basecall errors are not independent

    • Multiple errors more likely

    • Ends of reads contain more errors

  • K-mer error rate underestimates true basecall error rate

    • Discounts reads with many errors or errors at the ends

    • Can be off by a factor of 2


Frequency histograms

Frequency histograms

  • Sequencing at normal coverage, ~30x, most true k-mers will have high coverage and most error k-mers will have coverage of 1


Na ve method

Naïve method

  • Assumptions:

    • Sampling from a genome of size G

    • Poisson distribution, Poi(λ), of coverage of each position

    • Each k-mer sampled is an error with probε independently.

    • When we sample an error k-mer, it is replaced by a single nucleotide substitution at random


Na ve model

Naïve model

Sample random position

  • Probability that a k-mer has coverage 1

    • εPr[error k-mer has cov 1]+ (1-ε) Pr[true k-mer has cov 1]

Genome length G

TGAC

Introduce one error

ε

1-ε

Produce correct k-mer

TGGC

TGAC


Frequency moments

Frequency moments

  • From the frequency histogram we define

  • fi = number of k-mers with coverage i

  • f1 = number of singletons

  • F0 = number of distinct k-mers = Σ fi

  • F1 = number of all k-mers = Σi fi


Fitting the model

Fitting the model

  • 3 unknown parameters G, λ, ε

  • 3 k-mer frequency statistics, f1, F1, F0


Computing the moments

Computing the moments

  • Count all k-mers? – very memory intensive

  • Sample k-mers (àla KmerGenie)

  • Streaming algorithm, KmerStream

    • Estimates f1, F0, F1 directly without storing any k-mers

    • Accuracy can be specified (default ~2%)


Kmerstream1

KmerStream

  • Very fast, 5-10s per million reads

  • Low memory overhead, ~11M

  • One pass over the dataset

    • Uses hashing to sample k-mers adaptively

    • Lossy counting similar to Bloom filter

    • Does not keep track of k-mers

  • 2-3x faster than KmerGenie, 10x better memory


Validation

Validation

  • Sampled reads from PhiX sequencing lane at 30x coverage, repeated 1000 times.

KmerStream estimates True kmer counts


Real data

Real data

  • Sequenced at deCODE genetics, 2656 individuals, sequenced at 10x to 30x coverage.

  • KmerStream run for all samples, model fit to estimate k-mer error rates for k=31


K mer error rates

K-mer error rates


Quality cutoff

Quality cutoff

  • Keep only k-mers in reads where quality is above q.

  • Run for q = 0, 13, 20, 30.

  • Should correspond to upper bound on error of1.0, 0.05, 0.01, 0.001


K mer error rates1

K-mer error rates

Moving from q0 to q13 huge improvement

q20 to q30 not recommended, 50% samples increased error rate


Wrap up

Wrap up

  • Quality values are informative

    • Can get speed up by prioritizing processing based on quality values e.g. alignment

  • Error rates are highly variable

  • Quality value cutoffs can be done on a case by case basis with minimal overhead.


Thank you

Thank you

  • Paper on bioRxiv

  • Code on github.com/pmelsted/KmerStream

  • Ph.D. position available “Streaming algorithms for whole genome assembly.”


  • Login