Better Filtering with Gapped q-grams. S. Burkhardt, J. Kärkkäinen. Center for Bioinformatics, Saarbrücken; Max-Planck Institut f. Informatik, Saarbrücken. Outline: Motivation; The `classic` q-gram Lemma; q-shapes; Measuring filter quality/speed; Experimental results; Conclusion.

Presentation Transcript

Better Filtering with Gapped q-grams

S. Burkhardt

J. Kärkkäinen

Center for Bioinformatics, Saarbrücken

Max-Planck Institut f. Informatik, Saarbrücken

Outline
  • Motivation
  • The `classic` q-gram Lemma
  • q-shapes
  • Measuring Filter quality/speed
  • Experimental Results
  • Conclusion

The Problem

The k-mismatches problem

For a pattern P, a string S, and a value k:

find all occurrences of P in S with at most k character replacements.


A common Approach

Filter Algorithms

Filtration stage:

  • Examine S with a filter criterion
  • Return areas with potential matches

Verification stage:

  • Verify which areas have true matches


Problem Definition

String S

G C A T T C G A T G G A C T G G A C T A G T G A T T G A G T

Pattern P

A C T C

k = 1

  • Find occurrences of P with at most k errors
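As a point of reference for this example, the k-mismatches problem can be solved without any filter by comparing P against every window of S. The sketch below is only a naive baseline (the function name `k_mismatch_occurrences` is hypothetical and not from the slides); a filter algorithm tries to avoid doing this work on most of S.

```python
def k_mismatch_occurrences(pattern, text, k):
    """Return 0-based start positions where pattern occurs in text
    with at most k character replacements (Hamming distance <= k)."""
    m = len(pattern)
    hits = []
    for start in range(len(text) - m + 1):
        mismatches = 0
        for a, b in zip(pattern, text[start:start + m]):
            if a != b:
                mismatches += 1
                if mismatches > k:
                    break  # this window can no longer qualify
        if mismatches <= k:
            hits.append(start)
    return hits


# String S and pattern P from the slide, with k = 1.
S = "GCATTCGATGGACTGGACTAGTGATTGAGT"
P = "ACTC"
print(k_mismatch_occurrences(P, S, 1))  # [2, 11, 16]
```

The three reported windows are ATTC, ACTG and ACTA, each one replacement away from ACTC.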

q-gram Lemma

The q-gram Lemma

For a pattern P, a string S, a value k:

Every match of P in S with at most k errors shares at least

t = |P| - q + 1 - kq

substrings of length q (q-grams) with P.


q-gram Lemma

Example: P = T C G A T T A C

|P| = 8, q = 3, k = 1

3-grams of P: TCG, CGA, GAT, ATT, TTA, TAC

# of q-grams: |P| - q + 1 = 8 - 3 + 1 = 6

Each error destroys at most q of these q-grams:

=> t = 8 - 3 + 1 - (3 · 1) = 3

S = G C A T T C G A T G G A C T G G A C T A G T G A A T C A G T

For error number k: at least t = |P| - q + 1 - (qk) common q-grams in |P| letters
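The arithmetic of the lemma can be checked with a few lines of Python. This is a minimal sketch; the helper names `qgrams` and `qgram_threshold` are chosen here for illustration. For |P| = 8, q = 3 and k = 1 the formula |P| - q + 1 - qk evaluates to 8 - 3 + 1 - 3 = 3.

```python
def qgrams(s, q):
    """All contiguous substrings of length q, in order of position."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def qgram_threshold(m, q, k):
    """Lower bound t = m - q + 1 - k*q from the q-gram lemma.
    A value <= 0 means the lemma gives no usable filter."""
    return m - q + 1 - k * q


P = "TCGATTAC"
print(qgrams(P, 3))                   # ['TCG', 'CGA', 'GAT', 'ATT', 'TTA', 'TAC']
print(len(qgrams(P, 3)))              # 6
print(qgram_threshold(len(P), 3, 1))  # 3
```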


Some Definitions

In the DP matrix, one can count the number of matching q-grams per diagonal.
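One way to picture the per-diagonal count: a q-gram at position i of P that also occurs at position j of S lies on diagonal j - i. The sketch below is an illustrative implementation under that assumption (the function name and the dictionary-based index are not from the slides).

```python
from collections import defaultdict


def qgram_hits_per_diagonal(pattern, text, q):
    """Map diagonal (text position minus pattern position) to the
    number of q-grams that pattern and text share on that diagonal."""
    # Index every q-gram of the pattern by its start positions.
    index = defaultdict(list)
    for i in range(len(pattern) - q + 1):
        index[pattern[i:i + q]].append(i)

    counts = defaultdict(int)
    for j in range(len(text) - q + 1):
        for i in index.get(text[j:j + q], ()):
            counts[j - i] += 1
    return dict(counts)


P = "TCGATTAC"
print(qgram_hits_per_diagonal(P, "GGTCGATTACAA", 3))  # {2: 6}
print(qgram_hits_per_diagonal(P, "GGTCGCTTACAA", 3))  # {2: 3}
```

In the second call one character of the embedded copy of P has been replaced, and the diagonal keeps exactly 3 of its 6 q-grams, which is the bound the lemma gives for q = 3, k = 1.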


q-shapes

General idea:

  • Use substrings with gaps (q-shapes)
  • compute correct threshold t
  • total length s is called span

Example: |Q| = 11, k = 3 (O = match, X = mismatch)

3-gram ###: t = 0, no filter!

OOXOOXOOXOO

OOX OXO XOO OOX OXO XOO OOX OXO XOO (every 3-gram contains a mismatch)

3-shape ##.#, span s = 4, 1 gap: t = 1

OOOXXOOXOOO

OO.X OO.X OX.O XX.O XO.X OO.O OX.O XO.O (only OO.O matches)
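The two match/mismatch strings of this example can be replayed in code. The sketch below assumes the slide's notation ('#' for a sampled position, '.' for a gap, 'O' for a match, 'X' for a mismatch); the function names are hypothetical.

```python
def shape_offsets(shape):
    """Positions of '#' in a shape such as '##.#'."""
    return [i for i, c in enumerate(shape) if c == "#"]


def matching_shape_positions(matches, shape):
    """Start positions at which every '#' of the shape falls on an 'O'."""
    offsets = shape_offsets(shape)
    span = len(shape)
    return [i for i in range(len(matches) - span + 1)
            if all(matches[i + o] == "O" for o in offsets)]


print(matching_shape_positions("OOXOOXOOXOO", "###"))   # []
print(matching_shape_positions("OOOXXOOXOOO", "##.#"))  # [5]
```

With three evenly spaced mismatches no contiguous 3-gram survives, while the gapped shape still has one surviving position in its worst-case arrangement.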


q-shapes

Judging the quality of q-shapes I

We developed a DP-based approach for computing the threshold t given a q-shape and a query length |P|.

Observation: The threshold t is not the only factor that influences the behaviour of a q-shape.
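The slides do not show the DP itself. For small inputs the threshold can be cross-checked by exhaustive search instead: try every placement of k mismatches and keep the smallest number of surviving shape positions. The sketch below is that brute-force check, not the authors' DP, and its running time grows with the number of k-subsets of the |P| positions.

```python
from itertools import combinations


def brute_force_threshold(shape, m, k):
    """Smallest number of matching shape positions over all ways of
    placing k mismatches in a pattern of length m (exhaustive search)."""
    offsets = [i for i, c in enumerate(shape) if c == "#"]
    starts = range(m - len(shape) + 1)
    best = len(starts)
    for mismatches in combinations(range(m), k):
        bad = set(mismatches)
        surviving = sum(
            1 for s in starts if all(s + o not in bad for o in offsets)
        )
        best = min(best, surviving)
    return best


print(brute_force_threshold("###", 11, 3))   # 0
print(brute_force_threshold("##.#", 11, 3))  # 1
```

Using exactly k mismatches is enough, because adding a mismatch can never increase the number of surviving positions.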


q-shapes

Judging the quality of q-shapes II

We define the minimum coverage as the minimum number of matching characters for any arrangement of t matching q-shapes in P and a substring of length |P| in S.

For t = 2 and the 3-shape ##.# the minimum coverage is 5, e.g. for two shapes offset by one position:

##.#
 ##.#
-----
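Read literally, this definition asks for the smallest number of distinct characters touched by t shape occurrences at different positions. The sketch below computes that by exhaustive search; it is an interpretation of the slide's definition for illustration, and it is only practical for small t and short patterns.

```python
from itertools import combinations


def min_coverage(shape, t, m):
    """Minimum number of distinct positions covered by t occurrences of
    the shape at different start positions in a pattern of length m.
    Assumes t <= m - len(shape) + 1."""
    offsets = [i for i, c in enumerate(shape) if c == "#"]
    starts = range(m - len(shape) + 1)
    return min(
        len({s + o for s in chosen for o in offsets})
        for chosen in combinations(starts, t)
    )


print(min_coverage("##.#", 2, 11))  # 5
```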


q-shapes

Judging the quality of q-shapes III

The value q (i.e. the number of matching characters in a shape) determines the expected number of occurrences in a random string S.

3-shape: ##.#

Σ = {A,C,G,T}

Expected number of occurrences of a single 3-shape in S:

occ = |S| · (1 / |Σ|^q)
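As a quick numeric illustration, the estimate is the text length divided by the alphabet size raised to the power q. The sketch below assumes a uniform random text and, like the slide, uses the text length rather than the exact number of start positions; the function name is hypothetical.

```python
def expected_occurrences(text_length, alphabet_size, q):
    """Expected number of positions in a uniform random text at which
    one fixed shape with q sampled characters matches."""
    return text_length / alphabet_size ** q


# A 3-shape over a 4-letter alphabet matches at 1 in 64 positions on average.
print(expected_occurrences(6400, 4, 3))  # 100.0
```

Each additional sampled character divides the expected number of hits by the alphabet size, which is why larger q means less work in the filter phase.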


q-shapes

Judging the quality of q-shapes IV

The speed of the filter step is influenced by the expected number of matching q-shapes in S. The efficiency of the filtration correlates closely with the minimum coverage.

Speed: value of q

Efficiency: minimum coverage


q-shapes

Judging the quality of q-shapes V

Shapes with maximal minimum coverage for |Q| = 50, k = 5:

q=6 : ##......#..#..#.#

q=9 : ###..#..#.#...#.##

q=10: ###..#..#.#..###.#

q=11: #######.##.##

q=12: ###.#..###.#..###.#

Good shapes are not necessarily regular or predictable in their form.
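The value q and the span of each listed shape can be read off the shape string directly: q is the number of '#' characters and the span is the total length. A small sketch (the helper name is chosen here for illustration):

```python
def shape_q_and_span(shape):
    """Return (q, span): number of '#' positions and total shape length."""
    return shape.count("#"), len(shape)


shapes = [
    "##......#..#..#.#",
    "###..#..#.#...#.##",
    "###..#..#.#..###.#",
    "#######.##.##",
    "###.#..###.#..###.#",
]
for shape in shapes:
    q, span = shape_q_and_span(shape)
    print(f"q={q:2d} span={span:2d} {shape}")
```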


Experiments

Evaluating q-shapes

  • Experimental setup for q-shapes:
    • 50 million character random (Bernoulli) string S
    • 1000 random queries of length 500
    • queries have no approximate matches in S
    • compute threshold for |Q|=50
    • actual value of |Q| is 500! (to reduce runtime of tests)
  • Experiments show 10x reduced filter efficiency; relative performance between shapes unaffected


Experiments

Evaluating q-shapes

What we measured for every shape and all queries:

A) The total number of occurrences of all shapes: a good indicator of the total work for the filter phase

B) The number of diagonals containing at least t shapes: a good indicator of the filter efficiency

The experiments show a good correlation between A and the predicted values, as well as between B and the minimum coverage.
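The two measurements A and B can be sketched for a single query as follows. This is an illustrative reimplementation and not the code used in the experiments: the function name, the dictionary-based index and the tiny random text are all assumptions made for the example.

```python
import random
from collections import defaultdict


def filter_statistics(pattern, text, shape, t):
    """Return (A, B): A = total number of shape hits between pattern and
    text, B = number of diagonals that collect at least t hits."""
    offsets = [i for i, c in enumerate(shape) if c == "#"]
    span = len(shape)

    def sample(s, start):
        # Characters of s under the '#' positions of the shape.
        return "".join(s[start + o] for o in offsets)

    index = defaultdict(list)
    for i in range(len(pattern) - span + 1):
        index[sample(pattern, i)].append(i)

    per_diagonal = defaultdict(int)
    total_hits = 0
    for j in range(len(text) - span + 1):
        for i in index.get(sample(text, j), ()):
            per_diagonal[j - i] += 1
            total_hits += 1
    candidates = sum(1 for c in per_diagonal.values() if c >= t)
    return total_hits, candidates


random.seed(0)  # fixed seed so the toy run is repeatable
text = "".join(random.choice("ACGT") for _ in range(10_000))
query = "".join(random.choice("ACGT") for _ in range(50))
print(filter_statistics(query, text, "##.#", 1))
```

Every diagonal counted in B would have to be passed to the verification stage, so a smaller B for the same threshold means a more efficient filter.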


Conclusion

Our work…

  • An analysis of q-grams with gaps (q-shapes)
  • Results include:
    • experimental evidence for their superiority when compared to standard q-grams
    • a method to roughly judge their quality, the minimum coverage
    • a way to calculate the parameters required to use them in a filter algorithm

Conclusion

Todo…

  • an algorithm to predict the best shapes
  • improve the quality measure for q-grams
  • extension to the k-differences problem (with insertions and deletions)
  • a thorough analysis of filter behaviour for > k differences (use as a heuristic filter)