combinatorial pattern matching
Download
Skip this Video
Download Presentation
Combinatorial Pattern Matching

Loading in 2 Seconds...

play fullscreen
1 / 102

combinatorial pattern matching - PowerPoint PPT Presentation


  • 227 Views
  • Uploaded on

Combinatorial Pattern Matching. Outline. Hash Tables Repeat Finding Exact Pattern Matching Keyword Trees Suffix Trees Heuristic Similarity Search Algorithms Approximate String Matching Filtration Comparing a Sequence Against a Database Algorithm behind BLAST Statistics behind BLAST

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'combinatorial pattern matching' - andrew


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
outline
Outline
  • Hash Tables
  • Repeat Finding
  • Exact Pattern Matching
  • Keyword Trees
  • Suffix Trees
  • Heuristic Similarity Search Algorithms
  • Approximate String Matching
  • Filtration
  • Comparing a Sequence Against a Database
  • Algorithm behind BLAST
  • Statistics behind BLAST
  • PatternHunter and BLAT
genomic repeats
Genomic Repeats
  • Example of repeats:
    • ATGGTCTAGGTCCTAGTGGTC
  • Motivation to find them:
    • Genomic rearrangements are often associated with repeats
    • Trace evolutionary secrets
    • Many tumors are characterized by an explosion of repeats
genomic repeats4
Genomic Repeats
  • The problem is often more difficult:
    • ATGGTCTAGGACCTAGTGTTC
  • Motivation to find them:
    • Genomic rearrangements are often associated with repeats
    • Trace evolutionary secrets
    • Many tumors are characterized by an explosion of repeats
l mer repeats
l-mer Repeats
  • Long repeats are difficult to find
  • Short repeats are easy to find (e.g., hashing)
  • Simple approach to finding long repeats:
    • Find exact repeats of short l-mers (l is usually 10 to 13)
    • Use l-mer repeats to potentially extend into longer, maximal repeats
l mer repeats cont d
l-mer Repeats (cont’d)
  • There are typically many locations where an l-mer is repeated:

GCTTACAGATTCAGTCTTACAGATGGT

  • The 4-mer TTAC starts at locations 3 and 17
extending l mer repeats
Extending l-mer Repeats

GCTTACAGATTCAGTCTTACAGATGGT

  • Extend these 4-mer matches:

GCTTACAGATTCAGTCTTACAGATGGT

  • Maximal repeat: TTACAGAT
maximal repeats
Maximal Repeats
  • To find maximal repeats in this way, we need ALL start locations of all l-mers in the genome
  • Hashing lets us find repeats quickly in this manner
hashing what is it
Hashing: What is it?
  • What does hashing do?
    • For different data, generate a unique integer
    • Store data in an array at the unique integer index generated from the data
  • Hashing is a very efficient way to store and retrieve data
hashing definitions
Hashing: Definitions
  • Hash table: array used in hashing
  • Records: data stored in a hash table
  • Keys: identifies sets of records
  • Hash function: uses a key to generate an index to insert at in hash table
  • Collision: when more than one record is mapped to the same index in the hash table
hashing example
Hashing: Example
  • Where do the animals eat?
  • Records: each animal
  • Keys: where each animal eats
hashing dna sequences
Hashing DNA sequences
  • Each l-mer can be translated into a binary string (A, T, C, G can be represented as 00, 01, 10, 11)
  • After assigning a unique integer per l-mer it is easy to get all start locations of each l-mer in a genome
hashing maximal repeats
Hashing: Maximal Repeats
  • To find repeats in a genome:
    • For all l-mers in the genome, note the start position and the sequence
    • Generate a hash table index for each unique l-mer sequence
    • In each index of the hash table, store all genome start locations of the l-mer which generated that index
    • Extend l-mer repeats to maximal repeats
hashing collisions
Hashing: Collisions
  • Dealing with collisions:
    • “Chain” all start locations of l-mers (linked list)
hashing summary
Hashing: Summary
  • When finding genomic repeats from l-mers:
    • Generate a hash table index for each l-mer sequence
    • In each index, store all genome start locations of the l-mer which generated that index
    • Extend l-mer repeats to maximal repeats
pattern matching
Pattern Matching
  • What if, instead of finding repeats in a genome, we want to find all sequences in a database that contain a given pattern?
  • This leads us to a different problem, the Pattern MatchingProblem
pattern matching problem
Pattern Matching Problem
  • Goal: Find all occurrences of a pattern in a text
  • Input: Pattern p = p1…pn and text t = t1…tm
  • Output: All positions 1<i< (m – n + 1) such that the n-letter substring of t starting at i matches p
  • Motivation: Searching database for a known pattern
exact pattern matching a brute force algorithm
Exact Pattern Matching: A Brute-Force Algorithm

PatternMatching(p,t)

  • n length of pattern p
  • m length of text t
  • for i  1 to (m – n + 1)
  • if ti…ti+n-1 = p
  • outputi
exact pattern matching an example
Exact Pattern Matching: An Example

GCAT

  • PatternMatching algorithm for:
    • Pattern GCAT
    • Text CGCATC

CGCATC

GCAT

CGCATC

GCAT

CGCATC

GCAT

CGCATC

GCAT

CGCATC

exact pattern matching running time
Exact Pattern Matching: Running Time
  • PatternMatching runtime: O(nm)
  • Probability-wise, it’s more like O(m)
    • Rarely will there be close to n comparisons in line 4
  • Better solution: suffix trees
    • Can solve problem in O(m) time
    • Conceptually related to keyword trees
keyword trees example
Keyword Trees: Example
  • Keyword tree:
    • Apple
keyword trees example cont d
Keyword Trees: Example (cont’d)
  • Keyword tree:
    • Apple
    • Apropos
keyword trees example cont d23
Keyword Trees: Example (cont’d)
  • Keyword tree:
    • Apple
    • Apropos
    • Banana
keyword trees example cont d24
Keyword Trees: Example (cont’d)
  • Keyword tree:
    • Apple
    • Apropos
    • Banana
    • Bandana
keyword trees example cont d25
Keyword Trees: Example (cont’d)
  • Keyword tree:
    • Apple
    • Apropos
    • Banana
    • Bandana
    • Orange
keyword trees properties
Keyword Trees: Properties
  • Stores a set of keywords in a rooted labeled tree
  • Each edge labeled with a letter from an alphabet
  • Any two edges coming out of the same vertex have distinct labels
  • Every keyword stored can be spelled on a path from root to some leaf
keyword trees threading cont d
Keyword Trees: Threading (cont’d)
  • Thread “appeal”
    • appeal
keyword trees threading cont d28
Keyword Trees: Threading (cont’d)
  • Thread “appeal”
    • appeal
keyword trees threading cont d29
Keyword Trees: Threading (cont’d)
  • Thread “appeal”
    • appeal
keyword trees threading cont d30
Keyword Trees: Threading (cont’d)
  • Thread “appeal”
    • appeal
multiple pattern matching problem
Multiple Pattern Matching Problem
  • Goal: Given a set of patterns and a text, find all occurrences of any of patterns in text
  • Input: k patterns p1,…,pk, and text t = t1…tm
  • Output: Positions 1 <i <m where substring of t starting at i matches pj for 1 <j<k
  • Motivation: Searching database for known multiple patterns
multiple pattern matching straightforward approach
Multiple Pattern Matching: Straightforward Approach
  • Can solve as k “Pattern Matching Problems”
    • Runtime:

O(kmn)

using the PatternMatching algorithm k times

    • m - length of the text
    • n - average length of the pattern
multiple pattern matching keyword tree approach
Multiple Pattern Matching: Keyword Tree Approach
  • Or, we could use keyword trees:
    • Build keyword tree in O(N) time; N is total length of all patterns
    • With naive threading: O(N + nm)
    • Aho-Corasick algorithm: O(N + m)
keyword trees threading
Keyword Trees: Threading
  • To match patterns in a text using a keyword tree:
    • Build keyword tree of patterns
    • “Thread” the text through the keyword tree
keyword trees threading cont d40
Keyword Trees: Threading (cont’d)
  • Threading is “complete” when we reach a leaf in the keyword tree
  • When threading is “complete,” we’ve found a pattern in the text
suffix trees collapsed keyword trees
Suffix Trees=Collapsed Keyword Trees
  • Similar to keyword trees, except edges that form paths are collapsed
    • Each edge is labeled with a substring of a text
    • All internal edges have at least two outgoing edges
    • Leaves labeled by the index of the pattern.
suffix tree of a text
Suffix Tree of a Text
  • Suffix trees of a text is constructed for all its suffixes

ATCATG TCATG CATG ATG TG G

Keyword Tree

Suffix Tree

suffix tree of a text43
Suffix Tree of a Text
  • Suffix trees of a text is constructed for all its suffixes

ATCATG TCATG CATG ATG TG G

Keyword Tree

Suffix Tree

How much time does it take?

suffix tree of a text44
Suffix Tree of a Text
  • Suffix trees of a text is constructed for all its suffixes

ATCATG TCATG CATG ATG TG G

Keyword Tree

Suffix Tree

quadratic

Time is linear in the total size of all suffixes, i.e., it is quadratic in the length of the text

suffix trees advantages
Suffix Trees: Advantages
  • Suffix trees of a text is constructed for all its suffixes
  • Suffix trees build faster than keyword trees

ATCATG TCATG CATG ATG TG G

Keyword Tree

Suffix Tree

quadratic

linear (Weiner suffix tree algorithm)

use of suffix trees
Use of Suffix Trees
  • Suffix trees hold all suffixes of a text
    • i.e., ATCGC: ATCGC, TCGC, CGC, GC, C
    • Builds in O(m) time for text of length m
  • To find any pattern of length n in a text:
    • Build suffix tree for text
    • Thread the pattern through the suffix tree
    • Can find pattern in text in O(n) time!
  • O(n + m) time for “Pattern Matching Problem”
    • Build suffix tree and lookup pattern
pattern matching with suffix trees
Pattern Matching with Suffix Trees

SuffixTreePatternMatching(p,t)

  • Build suffix tree for text t
  • Thread pattern p through suffix tree
  • if threading is complete
  • output positions of all p-matching leaves in the tree
  • else
  • output “Pattern does not appear in text”
multiple pattern matching summary
Multiple Pattern Matching: Summary
  • Keyword and suffix trees are used to find patterns in a text
  • Keyword trees:
    • Build keyword tree of patterns, and thread text through it
  • Suffix trees:
    • Build suffix tree of text, and thread patterns through it
approximate vs exact pattern matching
Approximate vs. Exact Pattern Matching
  • So far all we’ve seen exact pattern matching algorithms
  • Usually, because of mutations, it makes much more biological sense to find approximate pattern matches
  • Biologists often use fast heuristic approaches (rather than local alignment) to find approximate matches
heuristic similarity searches
Heuristic Similarity Searches
  • Genomes are huge: Smith-Waterman quadratic alignment algorithms are too slow
  • Alignment of two sequences usually has short identical or highly similar fragments
  • Many heuristic methods (i.e., FASTA) are based on the same idea of filtration
    • Find short exact matches, and use them as seeds for potential match extension
    • “Filter” out positions with no extendable matches
dot matrices
Dot Matrices
  • Dot matrices show similarities between two sequences
  • FASTA makes an implicit dot matrix from short exact matches, and tries to find long diagonals (allowing for some mismatches)
dot matrices cont d
Dot Matrices (cont’d)
  • Identify diagonals above a threshold length
  • Diagonals in the dot matrix indicate exact substring matching
diagonals in dot matrices
Diagonals in Dot Matrices
  • Extend diagonals and try to link them together, allowing for minimal mismatches/indels
  • Linking diagonals reveals approximate matches over longer substrings
approximate pattern matching problem
Approximate Pattern Matching Problem
  • Goal: Find all approximate occurrences of a pattern in a text
  • Input: A pattern p = p1…pn, text t = t1…tm, and k, the maximum number of mismatches
  • Output: All positions 1 <i< (m – n +1) such that ti…ti+n-1 and p1…pn have at most k mismatches (i.e., Hamming distance between ti…ti+n-1 and p <k)
approximate pattern matching a brute force algorithm
Approximate Pattern Matching: A Brute-Force Algorithm

ApproximatePatternMatching(p, t, k)

  • n length of pattern p
  • m  length of text t
  • fori  1 to m – n + 1
  • dist  0
  • forj  1 to n
  • ifti+j-1 != pj
  • dist dist + 1
  • ifdist<k
  • outputi
approximate pattern matching running time
Approximate Pattern Matching: Running Time
  • That algorithm runs in O(nm).
  • Landau-Vishkin algorithm: O(kn)
  • We can generalize the “Approximate Pattern Matching Problem” into a “Query Matching Problem”:
    • We want to match substrings in a query to substrings in a text with at most k mismatches
    • Motivation: we want to see similarities to some gene, but we may not know which parts of the gene to look for
query matching problem
Query Matching Problem
  • Goal: Find all substrings of the query that approximately match the text
  • Input: Query q = q1…qw,

text t = t1…tm,

n (length of matching substrings),

k (maximum number of mismatches)

  • Output: All pairs of positions (i, j) such that the

n-letter substring of q starting at iapproximately matches the

n-letter substring of t starting at j,

with at most k mismatches

query matching main idea
Query Matching: Main Idea
  • Approximately matching strings share some perfectly matching substrings.
  • Instead of searching for approximately matching strings (difficult) search for perfectly matching substrings (easy).
filtration in query matching
Filtration in Query Matching
  • We want all n-matches between a query and a text with up to k mismatches
  • “Filter” out positions we know do not match between text and query
  • Potential match detection: find all matches of l-tuples in query and text for some small l
  • Potential match verification: Verify each potential match by extending it to the left and right, until (k + 1) mismatches are found
filtration match detection
Filtration: Match Detection
  • If x1…xn and y1…yn match with at most k mismatches, they must share an l-tuple that is perfectly matched, with l = n/(k + 1)
  • Break string of length n into k+1 parts, each each of length n/(k + 1)
    • k mismatches can affect at most k of these k+1 parts
    • At least one of these k+1 parts is perfectly matched
filtration match detection cont d
Filtration: Match Detection(cont’d)
  • Suppose k = 3. We would then have l=n/(k+1)=n/4:
  • There are at most k mismatches in n, so at the very least there must be one out of the k+1 l–tuples without a mismatch

1…l

l+1…2l

2l+1…3l

3l+1…n

1

2

k

k+ 1

filtration match verification
Filtration: Match Verification
  • For each l -match we find, try to extend the match further to see if it is substantial

Extend perfect match of lengthl

until we find an approximate match of length n with k mismatches

query

text

filtration example
Filtration: Example

Shorter perfectmatches required

Performance decreases

local alignment is to slow
Local alignment is to slow…
  • Quadratic local alignment is too slow while looking for similarities between long strings (e.g. the entire GenBank database)
local alignment is to slow67
Local alignment is to slow…
  • Quadratic local alignment is too slow while looking for similarities between long strings (e.g. the entire GenBank database)
local alignment is to slow68
Local alignment is to slow…
  • Quadratic local alignment is too slow while looking for similarities between long strings (e.g. the entire GenBank database)
local alignment is to slow69
Local alignment is to slow…
  • Quadratic local alignment is too slow while looking for similarities between long strings (e.g. the entire GenBank database)
  • Guaranteed to find the optimal local alignment
  • Sets the standard for sensitivity
local alignment is to slow70
Local alignment is to slow…
  • Quadratic local alignment is too slow while looking for similarities between long strings (e.g. the entire GenBank database)
  • Basic Local Alignment Search Tool
    • Altschul, S., Gish, W., Miller, W., Myers, E. & Lipman, D.J.

Journal of Mol. Biol., 1990

  • Search sequence databases for local alignments to a query
blast
BLAST
  • Great improvement in speed, with a modest decrease in sensitivity
  • Minimizes search space instead of exploring entire search space between two sequences
  • Finds short exact matches (“seeds”), only explores locally around these “hits”
what similarity reveals
What Similarity Reveals
  • BLASTing a new gene
    • Evolutionary relationship
    • Similarity between protein function
  • BLASTing a genome
    • Potential genes
blast algorithm
BLAST algorithm
  • Keyword search of all words of length w from the query of length n in database of length m with score above threshold
    • w = 11 for DNA queries, w =3 for proteins
  • Local alignment extension for each found keyword
    • Extend result until longest match above threshold is achieved
  • Running time O(nm)
blast algorithm cont d
BLAST algorithm (cont’d)

keyword

Query: KRHRKVLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLKIFLENVIRD

GVK 18

GAK 16

GIK 16

GGK 14

GLK 13

GNK 12

GRK 11

GEK 11

GDK 11

Neighborhood

words

neighborhood

score threshold

(T = 13)

extension

Query: 22 VLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLK 60

+++DN +G + IR L G+K I+ L+ E+ RG++K

Sbjct: 226 IIKDNGRGFSGKQIRNLNYGIGLKVIADLV-EKHRGIIK 263

High-scoring Pair (HSP)

original blast
Original BLAST
  • Dictionary
    • All words of length w
  • Alignment
    • Ungapped extensions until score falls below some statistical threshold
  • Output
    • All local alignments with score > threshold
original blast example
Original BLAST: Example

A C G A A G T A A G G T C C A G T

  • w = 4
  • Exact keyword match of GGTC
  • Extend diagonals with mismatches until score is under 50%
  • Output result
    • GTAAGGTCC
    • GTTAGGTCC

C T G A T C C T G G A T T G C G A

From lectures by Serafim Batzoglou (Stanford)

gapped blast example
Gapped BLAST : Example

A C G A A G T A A G G T C C A G T

  • Original BLAST exact keyword search, THEN:
  • Extend with gaps around ends of exact matchuntil score < threshold
  • Output result

GTAAGGTCCAGT

GTTAGGTC-AGT

C T G A T C C T G G A T T G C G A

From lectures by Serafim Batzoglou (Stanford)

incarnations of blast
Incarnations of BLAST
  • blastn: Nucleotide-nucleotide
  • blastp: Protein-protein
  • blastx: Translated query vs. protein database
  • tblastn: Protein query vs. translated database
  • tblastx: Translated query vs. translated

database (6 frames each)

incarnations of blast cont d
Incarnations of BLAST (cont’d)
  • PSI-BLAST
    • Find members of a protein family or build a custom position-specific score matrix
  • Megablast:
    • Search longer sequences with fewer differences
  • WU-BLAST: (Wash U BLAST)
    • Optimized, added features
assessing sequence similarity
Assessing sequence similarity
  • Need to know how strong an alignment can be expected from chance alone
  • “Chance” relates to comparison of sequences that are generated randomly based upon a certain sequence model
  • Sequence models may take into account:
    • G+C content
    • Poly-A tails
    • “Junk” DNA
    • Codon bias
    • Etc.
blast segment score
BLAST: Segment Score
  • BLAST uses scoring matrices (d) to improve on efficiency of match detection
    • Some proteins may have very different amino acid sequences, but are still similar
  • For any two l-mers x1…xl and y1…yl :
    • Segment pair: pair of l-mers, one from each sequence
    • Segment score: Sli=1d(xi, yi)
blast locally maximal segment pairs
BLAST: Locally Maximal Segment Pairs
  • A segment pair is maximal if it has the best score over all segment pairs
  • A segment pair is locally maximal if its score can’t be improved by extending or shortening
  • Statistically significant locally maximal segment pairs are of biological interest
  • BLAST finds all locally maximal segment pairs with scores above some threshold
    • A significantly high threshold will filter out some statistically insignificant matches
blast statistics
BLAST: Statistics
  • Threshold: Altschul-Dembo-Karlin statistics
    • Identifies smallest segment score that is unlikely to happen by chance
  • # matches above q has mean E(q) = Kmne-lq; K is a constant, m and n are the lengths of the two compared sequences
    • Parameter l is positive root of:

Sx,y in A(pxpyed(x,y)) = 1, where px and py are frequenceies of amino acids x and y, and A is the twenty letter amino acid alphabet

p values
P-values
  • The probability of finding b HSPs with a score ≥S is given by:
    • (e-EEb)/b!
  • For b = 0, that chance is:
    • e-E
  • Thus the probability of finding at least one HSP with a score ≥S is:
    • P = 1 – e-E
sample blast output
Sample BLAST output
  • Blast of human beta globin protein against zebra fish

Score E

Sequences producing significant alignments: (bits) Value

gi|18858329|ref|NP_571095.1| ba1 globin [Danio rerio] >gi|147757... 171 3e-44

gi|18858331|ref|NP_571096.1| ba2 globin; SI:dZ118J2.3 [Danio rer... 170 7e-44

gi|37606100|emb|CAE48992.1| SI:bY187G17.6 (novel beta globin) [D... 170 7e-44

gi|31419195|gb|AAH53176.1| Ba1 protein [Danio rerio] 168 3e-43

ALIGNMENTS

>gi|18858329|ref|NP_571095.1| ba1 globin [Danio rerio]

Length = 148

Score = 171 bits (434), Expect = 3e-44

Identities = 76/148 (51%), Positives = 106/148 (71%), Gaps = 1/148 (0%)

Query: 1 MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPK 60

MV T E++A+ LWGK+N+DE+G +AL R L+VYPWTQR+F +FG+LS+P A+MGNPK

Sbjct: 1 MVEWTDAERTAILGLWGKLNIDEIGPQALSRCLIVYPWTQRYFATFGNLSSPAAIMGNPK 60

Query: 61 VKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFG 120

V AHG+ V+G + ++DN+K T+A LS +H +KLHVDP+NFRLL + + A FG

Sbjct: 61 VAAHGRTVMGGLERAIKNMDNVKNTYAALSVMHSEKLHVDPDNFRLLADCITVCAAMKFG 120

Query: 121 KE-FTPPVQAAYQKVVAGVANALAHKYH 147

+ F VQ A+QK +A V +AL +YH

Sbjct: 121 QAGFNADVQEAWQKFLAVVVSALCRQYH 148

sample blast output cont d
Sample BLAST output (cont’d)
  • Blast of human beta globin DNA against human DNA

Score E

Sequences producing significant alignments: (bits) Value

gi|19849266|gb|AF487523.1| Homo sapiens gamma A hemoglobin (HBG1... 289 1e-75

gi|183868|gb|M11427.1|HUMHBG3E Human gamma-globin mRNA, 3\' end 289 1e-75

gi|44887617|gb|AY534688.1| Homo sapiens A-gamma globin (HBG1) ge... 280 1e-72

gi|31726|emb|V00512.1|HSGGL1 Human messenger RNA for gamma-globin 260 1e-66

gi|38683401|ref|NR_001589.1| Homo sapiens hemoglobin, beta pseud... 151 7e-34

gi|18462073|gb|AF339400.1| Homo sapiens haplotype PB26 beta-glob... 149 3e-33

ALIGNMENTS

>gi|28380636|ref|NG_000007.3| Homo sapiens beta globin region ([email protected]) on chromosome 11

Length = 81706

Score = 149 bits (75), Expect = 3e-33

Identities = 183/219 (83%)

Strand = Plus / Plus

Query: 267 ttgggagatgccacaaagcacctggatgatctcaagggcacctttgcccagctgagtgaa 326

|| ||| | || | || | |||||| ||||| ||||||||||| ||||||||

Sbjct: 54409 ttcggaaaagctgttatgctcacggatgacctcaaaggcacctttgctacactgagtgac 54468

Query: 327 ctgcactgtgacaagctgcatgtggatcctgagaacttc 365

||||||||| |||||||||| ||||| ||||||||||||

Sbjct: 54469 ctgcactgtaacaagctgcacgtggaccctgagaacttc 54507

timeline
Timeline
  • 1970: Needleman-Wunsch global alignment algorithm
  • 1981: Smith-Waterman local alignment algorithm
  • 1985: FASTA
  • 1990: BLAST (basic local alignment search tool)
  • 2000s: BLAST has become too slow in “genome vs. genome” comparisons - new faster algorithms evolve!
    • Pattern Hunter
    • BLAT
patternhunter faster and even more sensitive
BLAST: matches short consecutive sequences (consecutive seed)

Length = k

Example (k = 11):

11111111111

Each 1 represents a “match”

PatternHunter: matches short non-consecutive sequences (spaced seed)

Increases sensitivity by locating homologies that would otherwise be missed

Example (a spaced seed of length 18 w/ 11 “matches”):

111010010100110111

Each 0 represents a “don’t care”, so there can be a match or a mismatch

PatternHunter: faster and even more sensitive
spaced seeds
Example of a hit using a spaced seed:Spaced seeds

How does this result in better sensitivity?

why is ph better
Why is PH better?
  • BLAST: redundant hits
  • PatternHunter

This results in > 1 hit and creates clusters of redundant hits

This results in very few redundant hits

why is ph better91
Why is PH better?

BLAST may also miss a hit

GAGTACTCAACACCAACATTAGTGGGCAATGGAAAAT

|| ||||||||| |||||| | |||||| ||||||

GAATACTCAACAGCAACATCAATGGGCAGCAGAAAAT

9 matches

In this example, despite a clear homology, there is no sequence of continuous matches longer than length 9. BLAST uses a length 11 and because of this, BLAST does not recognize this as a hit!

Resolving this would require reducing the seed length to 9, which would have a damaging effect on speed

advantage of gapped seeds
Advantage of Gapped Seeds

11 positions

11 positions

10 positions

why is ph better93
Why is PH better?
  • Higher hit probability
  • Lower expected number of random hits
use of multiple seeds
Use of Multiple Seeds

Basic Searching Algorithm

  • Select a group of spaced seed models
  • For each hit of each model, conduct extension to find a homology.
another method blat
Another method: BLAT
  • BLAT (BLAST-Like Alignment Tool)
  • Same idea as BLAST - locate short sequence hits and extend
blat vs blast differences
BLAT vs. BLAST: Differences
  • BLAT builds an index of the database and scans linearly through the query sequence, whereas BLAST builds an index of the query sequence and then scans linearly through the database
  • Index is stored in RAM which is memory intensive, but results in faster searches
blat fast cdna alignments
BLAT: Fast cDNA Alignments

Steps:

1. Break cDNA into 500 base chunks.

2. Use an index to find regions in genome similar to each chunk of cDNA.

3. Do a detailed alignment between genomic regions and cDNA chunk.

4. Use dynamic programming to stitch together detailed alignments of chunks into detailed alignment of whole.

blat indexing
BLAT: Indexing
  • An index is built that contains the positions of each k-mer in the genome
  • Each k-mer in the query sequence is compared to each k-mer in the index
  • A list of ‘hits’ is generated - positions in cDNA and in genome that match for k bases
indexing an example
Indexing: An Example

Here is an example with k = 3:

Genome:cacaattatcacgaccgc

3-mers (non-overlapping): cac aat tat cac gac cgc

Index: aat 3 gac 12

cac 0,9 tat 6

cgc 15

cDNA (query sequence): aattctcac

3-mers (overlapping): aat att ttc tct ctc tca cac

0 1 2 3 4 5 6

Hits: aat 0,3

cac 6,0

cac 6,9

clump: cacAATtatCACgaccgc

Multiple instances map to single index

Position of 3-mer in query, genome

however
However…
  • BLAT was designed to find sequences of 95% and greater similarity of length >40; may miss more divergent or shorter sequence alignments
patternhunter and blat vs blast
PatternHunter and BLAT vs. BLAST
  • PatternHunter is 5-100 times faster than Blastn, depending on data size, at the same sensitivity
  • BLAT is several times faster than BLAST, but best results are limited to closely related sequences
resources
Resources
  • tandem.bu.edu/classes/ 2004/papers/pathunter_grp_prsnt.ppt
  • http://www.jax.org/courses/archives/2004/gsa04_king_presentation.pdf
  • http://www.genomeblat.com/genomeblat/blatRapShow.pps
ad