Better Filtering with Gapped q-grams




Better Filtering with Gapped q-grams

S. Burkhardt

J. Kärkkäinen

Center for Bioinformatics, Saarbrücken

Max-Planck-Institut für Informatik, Saarbrücken

Outline
• Motivation
• The "classic" q-gram Lemma
• q-shapes
• Measuring Filter quality/speed
• Experimental Results
• Conclusion

The Problem

The k-mismatches problem

For a pattern P, a string S, a value k:

find all occurrences of P in S with at most k character replacements.

A common Approach

Filter Algorithms

Filtration Stage:

Examine S with a filter criterion

Return areas with potential matches

Verification Stage:

Verify which areas have true matches

Problem Definition

String S

G C A T T C G A T G G A C T G G A C T A G T G A T T G A G T

Pattern P

A C T C

k = 1

• Find occurrences of P with at most k errors
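As a baseline for the problem just stated, the following minimal sketch scans every alignment and counts mismatches directly. The function name `k_mismatch_positions` is chosen here for illustration and is not taken from the slides; it is the naive method that a filter algorithm tries to avoid running over the whole of S.

```python
def k_mismatch_positions(s: str, p: str, k: int) -> list[int]:
    """Return all 0-based start positions where p matches s with at most k mismatches."""
    m = len(p)
    hits = []
    for i in range(len(s) - m + 1):
        mismatches = 0
        for a, b in zip(s[i:i + m], p):
            if a != b:
                mismatches += 1
                if mismatches > k:
                    break  # this alignment can no longer qualify
        if mismatches <= k:
            hits.append(i)
    return hits


# String S and pattern P from the slide, with k = 1.
S = "GCATTCGATGGACTGGACTAGTGATTGAGT"
P = "ACTC"
print(k_mismatch_positions(S, P, 1))  # [2, 11, 16]
```

The three reported positions correspond to the substrings ATTC, ACTG and ACTA, each one replacement away from ACTC.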

q-gram Lemma

The q-gram Lemma

For a pattern P, a string S, a value k:

Every match of P in S with at most k errors shares at least

|P| - q + 1 - kq

substrings of length q (q-grams) with P.
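The bound can be checked by brute force for small parameters. The sketch below is an illustration only, restricted to mismatches (no insertions or deletions), with function names invented here: it compares the lemma's value with the smallest number of q-gram windows that survive any placement of k mismatches.

```python
from itertools import combinations


def qgram_threshold(m: int, q: int, k: int) -> int:
    """Lemma bound |P| - q + 1 - kq, clamped at 0 (0 means the filter excludes nothing)."""
    return max(0, m - q + 1 - k * q)


def min_surviving_qgrams(m: int, q: int, k: int) -> int:
    """Brute force: fewest intact q-gram windows over all placements of k mismatches."""
    best = m - q + 1
    for errs in combinations(range(m), k):
        intact = sum(
            1
            for start in range(m - q + 1)
            if not any(start <= e < start + q for e in errs)
        )
        best = min(best, intact)
    return best


print(qgram_threshold(8, 3, 1), min_surviving_qgrams(8, 3, 1))  # 3 3
```

Each mismatch can destroy at most q of the |P| - q + 1 windows, which is where the kq term comes from.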

q-gram Lemma

Pattern: T C G A T T A C, |P| = 8, q = 3, k = 1

q-grams of the pattern: TCG, CGA, GAT, ATT, TTA, TAC

# of q-grams: |P| - q + 1 = 6

String S: G C A T T C G A T G G A C T G G A C T A G T G A A T C A G T

Error number k: at least t = |P| - q + 1 - (qk) common q-grams in |P| letters

=> t = 8 - 3 + 1 - (3 · 1) = 3

Some Definitions

In the DP matrix, one can count the number of matching q-grams per diagonal.
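A minimal sketch of per-diagonal counting for contiguous q-grams follows, assuming the k-mismatches setting, where an occurrence starting at position d of S lies entirely on diagonal d. The function names and the simplification of counting over the whole diagonal (rather than within a window of |P| letters) are assumptions made for this illustration.

```python
from collections import defaultdict


def qgram_hits_per_diagonal(s: str, p: str, q: int) -> dict[int, int]:
    """Count matching q-grams per diagonal d = i - j (position in s minus position in p)."""
    index = defaultdict(list)  # q-gram of p -> start positions in p
    for j in range(len(p) - q + 1):
        index[p[j:j + q]].append(j)
    counts = defaultdict(int)
    for i in range(len(s) - q + 1):
        for j in index.get(s[i:i + q], ()):
            counts[i - j] += 1
    return dict(counts)


def candidate_diagonals(s: str, p: str, q: int, k: int) -> list[int]:
    """Diagonals with at least t = |P| - q + 1 - kq matching q-grams."""
    t = len(p) - q + 1 - k * q
    if t <= 0:
        raise ValueError("threshold is not positive; q-grams cannot filter here")
    hits = qgram_hits_per_diagonal(s, p, q)
    return sorted(d for d, c in hits.items() if c >= t)


# The pattern with one replacement (T -> A) embedded at position 3 of a short text.
print(candidate_diagonals("GGGTCGAATACGGG", "TCGATTAC", 3, 1))  # [3]
```

Only the diagonals returned here would be passed on to the verification stage.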

q-shapes

General idea:

• Use substrings with gaps (q-shapes)
• compute correct threshold t
• total length s is called span

Example: |Q| = 11, k = 3 (O = match, X = mismatch)

3-gram ###: t = 0, no filter!

OOXOOXOOXOO
OOX OXO XOO OOX OXO XOO OOX OXO XOO

3-shape ##.# (s = 4, 1 gap): t = 1

OOOXXOOXOOO
OO.X OO.X OX.O XX.O XO.X OO.O OX.O XO.O
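The two thresholds in this example can be reproduced by exhaustive enumeration. The sketch below is a brute-force check, not the DP-based method the authors mention; it tries every placement of k mismatches in a query of length m and reports the fewest shape occurrences left intact. The function name is hypothetical.

```python
from itertools import combinations


def shape_threshold_bruteforce(shape: str, m: int, k: int) -> int:
    """Fewest intact occurrences of `shape` over all placements of k mismatches.

    `shape` uses '#' for compared positions and '.' for gap positions.
    Exponential in k, so only suitable for small m and k.
    """
    offsets = [i for i, c in enumerate(shape) if c == "#"]
    starts = range(m - len(shape) + 1)
    best = len(starts)
    for errs in combinations(range(m), k):
        bad = set(errs)
        intact = sum(
            1 for st in starts if all(st + o not in bad for o in offsets)
        )
        best = min(best, intact)
    return best


print(shape_threshold_bruteforce("###", 11, 3))   # 0
print(shape_threshold_bruteforce("##.#", 11, 3))  # 1
```

With 165 mismatch placements for m = 11 and k = 3 this finishes instantly; for the |Q| = 50, k = 5 setting used later in the slides, an enumeration like this would already be slow in pure Python.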

q-shapes

Judging the quality of q-shapes I

We developed a DP-based approach for computing the threshold t given a q-shape and a query length |P|

Observation: The threshold t is not the only factor that influences the behaviour of a q-shape

q-shapes

Judging the quality of q-shapes II

We define the minimum coverage as the minimum number of matching characters for any arrangement of t matching q-shapes in P and a substring of length |P| in S

For t = 2 and the 3-shape ##.#, the minimum coverage is 5:

##.#
 ##.#
-----
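Under the definition above, the minimum coverage can be found for small cases by trying every arrangement of t shape occurrences. This is a brute-force sketch with a hypothetical function name, offered only to make the definition concrete.

```python
from itertools import combinations


def minimum_coverage(shape: str, t: int, m: int) -> int:
    """Smallest number of distinct positions covered by t occurrences of `shape`
    placed at distinct start positions inside a window of length m.

    Raises ValueError if fewer than t start positions fit into the window.
    """
    offsets = [i for i, c in enumerate(shape) if c == "#"]
    starts = range(m - len(shape) + 1)
    return min(
        len({st + o for st in placement for o in offsets})
        for placement in combinations(starts, t)
    )


print(minimum_coverage("##.#", 2, 11))  # 5
print(minimum_coverage("###", 2, 11))   # 4
```

For the same t = 2, the contiguous 3-gram guarantees only 4 matching characters, while the gapped shape guarantees 5.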

q-shapes

Judging the quality of q-shapes III

The value q (i.e. the number of matching characters in a shape) determines the expected number of occurrences in a random string S.

3-shape: ##.#

Σ = {A,C,G,T}

Expected number of occurrences of a single 3-shape in S:

$\mathrm{occ} = |S| \cdot \frac{1}{|\Sigma|^q}$
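The estimate is simple enough to evaluate directly. The sketch below assumes a uniform random text and ignores boundary effects at the ends of S; the function name and the example text length are illustrative.

```python
def expected_occurrences(text_length: int, alphabet_size: int, q: int) -> float:
    """Expected hits of one fixed q-shape in a uniform random text: |S| / |alphabet|^q."""
    return text_length / alphabet_size ** q


# DNA alphabet (4 letters) and a text of 50 million characters.
for q in (3, 6, 9, 12):
    print(q, expected_occurrences(50_000_000, 4, q))
```

Each additional compared position divides the expected number of hits by the alphabet size, which is why a larger q makes the filter step faster.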

q-shapes

Judging the quality of q-shapes IV

The speed of the filter step is influenced by the expected number of matching q-shapes in S. The efficiency of the filtration correlates closely with the minimum coverage.

Speed: value of q

Efficiency: minimum coverage

q-shapes

Judging the quality of q-shapes V

Shapes with maximal minimum coverage for |Q| = 50, k = 5:

q=6 : ##......#..#..#.#

q=9 : ###..#..#.#...#.##

q=10: ###..#..#.#..###.#

q=11: #######.##.##

q=12: ###.#..###.#..###.#

Good shapes are not necessarily regular or predictable in their form.
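To read the shapes listed above, it can help to extract q and the span from each string and to compare against what the classic lemma gives for contiguous q-grams of the same q at |Q| = 50, k = 5. This sketch does not compute the thresholds of the gapped shapes themselves; the helper names are assumptions.

```python
SHAPES = [
    "##......#..#..#.#",
    "###..#..#.#...#.##",
    "###..#..#.#..###.#",
    "#######.##.##",
    "###.#..###.#..###.#",
]


def describe(shape: str) -> tuple[int, int]:
    """Return (q, span): number of '#' positions and total length of the shape."""
    return shape.count("#"), len(shape)


def classic_threshold(m: int, q: int, k: int) -> int:
    """Classic q-gram lemma bound for contiguous q-grams, clamped at 0."""
    return max(0, m - q + 1 - k * q)


for shape in SHAPES:
    q, span = describe(shape)
    print(f"{shape:<20} q={q:<3} span={span:<3} "
          f"contiguous {q}-gram threshold={classic_threshold(50, q, 5)}")
```

For q of 9 and above, the contiguous bound 50 - q + 1 - 5q is no longer positive, so ordinary q-grams of that size cannot filter at all in this setting.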

Experiments

Evaluating q-shapes

• Experimental setup for q-shapes:
• 50 million character random (Bernoulli) string S
• 1000 random queries of length 500
• queries have no approximate matches in S
• compute threshold for |Q|=50
• actual value of |Q| is 500! (to reduce runtime of tests)
• Experiments show 10x reduced filter efficiency; relative performance between shapes unaffected

Experiments

Evaluating q-shapes

What we measured for every shape and all queries:

A) The total number of occurrences of all shapes

Good indicator of the total work for the filter phase

B) The number of diagonals containing at least t shapes

Good indicator of the filter efficiency

The experiments show a good correlation between A and the predicted values as well as between B and the minimum coverage.
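The two measurements could be collected along the following lines. This is a simplified sketch and not the authors' experimental code: it indexes the shape occurrences of one query, scans S, and counts hits over whole diagonals rather than within a sliding window of |Q| characters. All names are hypothetical.

```python
from collections import defaultdict


def shape_key(text: str, start: int, offsets: list[int]) -> str:
    """Characters of `text` at the '#' positions of a shape placed at `start`."""
    return "".join(text[start + o] for o in offsets)


def measure_filter(s: str, p: str, shape: str, t: int) -> tuple[int, int]:
    """Return (A, B): total shape hits, and number of diagonals with at least t hits."""
    offsets = [i for i, c in enumerate(shape) if c == "#"]
    span = len(shape)
    index = defaultdict(list)  # shape content in p -> start positions in p
    for j in range(len(p) - span + 1):
        index[shape_key(p, j, offsets)].append(j)
    per_diagonal = defaultdict(int)
    total_hits = 0
    for i in range(len(s) - span + 1):
        for j in index.get(shape_key(s, i, offsets), ()):
            total_hits += 1
            per_diagonal[i - j] += 1
    candidates = sum(1 for c in per_diagonal.values() if c >= t)
    return total_hits, candidates


print(measure_filter("ACGTACGT", "ACGT", "#.#", 2))  # (4, 2)
```

In the small example the query occurs exactly at positions 0 and 4, so both diagonals reach the threshold; on random data without approximate matches, B counts the false positives that the verification stage would have to reject.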

Conclusion

Our work…

• An analysis of q-grams with gaps (q-shapes)
• Results include:
• experimental evidence for their superiority when compared to standard q-grams
• a method to roughly judge their quality, the minimum coverage
• a way to calculate the parameters required to use them in a filter algorithm

Conclusion

Todo…

• an algorithm to predict the best shapes
• improve the quality measure for q-grams
• extension to the k-differences problem (with insertions and deletions)
• a thorough analysis of filter behaviour for > k differences (use as a heuristic filter)