Combinatorial Pattern Matching

1 / 102

# combinatorial pattern matching - PowerPoint PPT Presentation

Combinatorial Pattern Matching. Outline. Hash Tables Repeat Finding Exact Pattern Matching Keyword Trees Suffix Trees Heuristic Similarity Search Algorithms Approximate String Matching Filtration Comparing a Sequence Against a Database Algorithm behind BLAST Statistics behind BLAST

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.

## PowerPoint Slideshow about 'combinatorial pattern matching' - andrew

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

### Combinatorial Pattern Matching

Outline
• Hash Tables
• Repeat Finding
• Exact Pattern Matching
• Keyword Trees
• Suffix Trees
• Heuristic Similarity Search Algorithms
• Approximate String Matching
• Filtration
• Comparing a Sequence Against a Database
• Algorithm behind BLAST
• Statistics behind BLAST
• PatternHunter and BLAT
Genomic Repeats
• Example of repeats:
• ATGGTCTAGGTCCTAGTGGTC
• Motivation to find them:
• Genomic rearrangements are often associated with repeats
• Trace evolutionary secrets
• Many tumors are characterized by an explosion of repeats
Genomic Repeats
• The problem is often more difficult:
• ATGGTCTAGGACCTAGTGTTC
• Motivation to find them:
• Genomic rearrangements are often associated with repeats
• Trace evolutionary secrets
• Many tumors are characterized by an explosion of repeats
l-mer Repeats
• Long repeats are difficult to find
• Short repeats are easy to find (e.g., hashing)
• Simple approach to finding long repeats:
• Find exact repeats of short l-mers (l is usually 10 to 13)
• Use l-mer repeats to potentially extend into longer, maximal repeats
l-mer Repeats (cont’d)
• There are typically many locations where an l-mer is repeated:

GCTTACAGATTCAGTCTTACAGATGGT

• The 4-mer TTAC starts at locations 3 and 17
Extending l-mer Repeats

GCTTACAGATTCAGTCTTACAGATGGT

• Extend these 4-mer matches:

GCTTACAGATTCAGTCTTACAGATGGT

• Maximal repeat: TTACAGAT
Maximal Repeats
• To find maximal repeats in this way, we need ALL start locations of all l-mers in the genome
• Hashing lets us find repeats quickly in this manner
Hashing: What is it?
• What does hashing do?
• For different data, generate a unique integer
• Store data in an array at the unique integer index generated from the data
• Hashing is a very efficient way to store and retrieve data
Hashing: Definitions
• Hash table: array used in hashing
• Records: data stored in a hash table
• Keys: identifies sets of records
• Hash function: uses a key to generate an index to insert at in hash table
• Collision: when more than one record is mapped to the same index in the hash table
Hashing: Example
• Where do the animals eat?
• Records: each animal
• Keys: where each animal eats
Hashing DNA sequences
• Each l-mer can be translated into a binary string (A, T, C, G can be represented as 00, 01, 10, 11)
• After assigning a unique integer per l-mer it is easy to get all start locations of each l-mer in a genome
Hashing: Maximal Repeats
• To find repeats in a genome:
• For all l-mers in the genome, note the start position and the sequence
• Generate a hash table index for each unique l-mer sequence
• In each index of the hash table, store all genome start locations of the l-mer which generated that index
• Extend l-mer repeats to maximal repeats
Hashing: Collisions
• Dealing with collisions:
• “Chain” all start locations of l-mers (linked list)
Hashing: Summary
• When finding genomic repeats from l-mers:
• Generate a hash table index for each l-mer sequence
• In each index, store all genome start locations of the l-mer which generated that index
• Extend l-mer repeats to maximal repeats
Pattern Matching
• What if, instead of finding repeats in a genome, we want to find all sequences in a database that contain a given pattern?
• This leads us to a different problem, the Pattern MatchingProblem
Pattern Matching Problem
• Goal: Find all occurrences of a pattern in a text
• Input: Pattern p = p1…pn and text t = t1…tm
• Output: All positions 1<i< (m – n + 1) such that the n-letter substring of t starting at i matches p
• Motivation: Searching database for a known pattern
Exact Pattern Matching: A Brute-Force Algorithm

PatternMatching(p,t)

• n length of pattern p
• m length of text t
• for i  1 to (m – n + 1)
• if ti…ti+n-1 = p
• outputi
Exact Pattern Matching: An Example

GCAT

• PatternMatching algorithm for:
• Pattern GCAT
• Text CGCATC

CGCATC

GCAT

CGCATC

GCAT

CGCATC

GCAT

CGCATC

GCAT

CGCATC

Exact Pattern Matching: Running Time
• PatternMatching runtime: O(nm)
• Probability-wise, it’s more like O(m)
• Rarely will there be close to n comparisons in line 4
• Better solution: suffix trees
• Can solve problem in O(m) time
• Conceptually related to keyword trees
Keyword Trees: Example
• Keyword tree:
• Apple
Keyword Trees: Example (cont’d)
• Keyword tree:
• Apple
• Apropos
Keyword Trees: Example (cont’d)
• Keyword tree:
• Apple
• Apropos
• Banana
Keyword Trees: Example (cont’d)
• Keyword tree:
• Apple
• Apropos
• Banana
• Bandana
Keyword Trees: Example (cont’d)
• Keyword tree:
• Apple
• Apropos
• Banana
• Bandana
• Orange
Keyword Trees: Properties
• Stores a set of keywords in a rooted labeled tree
• Each edge labeled with a letter from an alphabet
• Any two edges coming out of the same vertex have distinct labels
• Every keyword stored can be spelled on a path from root to some leaf
• appeal
• appeal
• appeal
• appeal
Multiple Pattern Matching Problem
• Goal: Given a set of patterns and a text, find all occurrences of any of patterns in text
• Input: k patterns p1,…,pk, and text t = t1…tm
• Output: Positions 1 <i <m where substring of t starting at i matches pj for 1 <j<k
• Motivation: Searching database for known multiple patterns
Multiple Pattern Matching: Straightforward Approach
• Can solve as k “Pattern Matching Problems”
• Runtime:

O(kmn)

using the PatternMatching algorithm k times

• m - length of the text
• n - average length of the pattern
Multiple Pattern Matching: Keyword Tree Approach
• Or, we could use keyword trees:
• Build keyword tree in O(N) time; N is total length of all patterns
• With naive threading: O(N + nm)
• Aho-Corasick algorithm: O(N + m)
• To match patterns in a text using a keyword tree:
• Build keyword tree of patterns
• “Thread” the text through the keyword tree
• Threading is “complete” when we reach a leaf in the keyword tree
• When threading is “complete,” we’ve found a pattern in the text
Suffix Trees=Collapsed Keyword Trees
• Similar to keyword trees, except edges that form paths are collapsed
• Each edge is labeled with a substring of a text
• All internal edges have at least two outgoing edges
• Leaves labeled by the index of the pattern.
Suffix Tree of a Text
• Suffix trees of a text is constructed for all its suffixes

ATCATG TCATG CATG ATG TG G

Keyword Tree

Suffix Tree

Suffix Tree of a Text
• Suffix trees of a text is constructed for all its suffixes

ATCATG TCATG CATG ATG TG G

Keyword Tree

Suffix Tree

How much time does it take?

Suffix Tree of a Text
• Suffix trees of a text is constructed for all its suffixes

ATCATG TCATG CATG ATG TG G

Keyword Tree

Suffix Tree

Time is linear in the total size of all suffixes, i.e., it is quadratic in the length of the text

• Suffix trees of a text is constructed for all its suffixes
• Suffix trees build faster than keyword trees

ATCATG TCATG CATG ATG TG G

Keyword Tree

Suffix Tree

linear (Weiner suffix tree algorithm)

Use of Suffix Trees
• Suffix trees hold all suffixes of a text
• i.e., ATCGC: ATCGC, TCGC, CGC, GC, C
• Builds in O(m) time for text of length m
• To find any pattern of length n in a text:
• Build suffix tree for text
• Thread the pattern through the suffix tree
• Can find pattern in text in O(n) time!
• O(n + m) time for “Pattern Matching Problem”
• Build suffix tree and lookup pattern
Pattern Matching with Suffix Trees

SuffixTreePatternMatching(p,t)

• Build suffix tree for text t
• Thread pattern p through suffix tree
• output positions of all p-matching leaves in the tree
• else
• output “Pattern does not appear in text”
Multiple Pattern Matching: Summary
• Keyword and suffix trees are used to find patterns in a text
• Keyword trees:
• Build keyword tree of patterns, and thread text through it
• Suffix trees:
• Build suffix tree of text, and thread patterns through it
Approximate vs. Exact Pattern Matching
• So far all we’ve seen exact pattern matching algorithms
• Usually, because of mutations, it makes much more biological sense to find approximate pattern matches
• Biologists often use fast heuristic approaches (rather than local alignment) to find approximate matches
Heuristic Similarity Searches
• Genomes are huge: Smith-Waterman quadratic alignment algorithms are too slow
• Alignment of two sequences usually has short identical or highly similar fragments
• Many heuristic methods (i.e., FASTA) are based on the same idea of filtration
• Find short exact matches, and use them as seeds for potential match extension
• “Filter” out positions with no extendable matches
Dot Matrices
• Dot matrices show similarities between two sequences
• FASTA makes an implicit dot matrix from short exact matches, and tries to find long diagonals (allowing for some mismatches)
Dot Matrices (cont’d)
• Identify diagonals above a threshold length
• Diagonals in the dot matrix indicate exact substring matching
Diagonals in Dot Matrices
• Extend diagonals and try to link them together, allowing for minimal mismatches/indels
• Linking diagonals reveals approximate matches over longer substrings
Approximate Pattern Matching Problem
• Goal: Find all approximate occurrences of a pattern in a text
• Input: A pattern p = p1…pn, text t = t1…tm, and k, the maximum number of mismatches
• Output: All positions 1 <i< (m – n +1) such that ti…ti+n-1 and p1…pn have at most k mismatches (i.e., Hamming distance between ti…ti+n-1 and p <k)
Approximate Pattern Matching: A Brute-Force Algorithm

ApproximatePatternMatching(p, t, k)

• n length of pattern p
• m  length of text t
• fori  1 to m – n + 1
• dist  0
• forj  1 to n
• ifti+j-1 != pj
• dist dist + 1
• ifdist<k
• outputi
Approximate Pattern Matching: Running Time
• That algorithm runs in O(nm).
• Landau-Vishkin algorithm: O(kn)
• We can generalize the “Approximate Pattern Matching Problem” into a “Query Matching Problem”:
• We want to match substrings in a query to substrings in a text with at most k mismatches
• Motivation: we want to see similarities to some gene, but we may not know which parts of the gene to look for
Query Matching Problem
• Goal: Find all substrings of the query that approximately match the text
• Input: Query q = q1…qw,

text t = t1…tm,

n (length of matching substrings),

k (maximum number of mismatches)

• Output: All pairs of positions (i, j) such that the

n-letter substring of q starting at iapproximately matches the

n-letter substring of t starting at j,

with at most k mismatches

Query Matching: Main Idea
• Approximately matching strings share some perfectly matching substrings.
• Instead of searching for approximately matching strings (difficult) search for perfectly matching substrings (easy).
Filtration in Query Matching
• We want all n-matches between a query and a text with up to k mismatches
• “Filter” out positions we know do not match between text and query
• Potential match detection: find all matches of l-tuples in query and text for some small l
• Potential match verification: Verify each potential match by extending it to the left and right, until (k + 1) mismatches are found
Filtration: Match Detection
• If x1…xn and y1…yn match with at most k mismatches, they must share an l-tuple that is perfectly matched, with l = n/(k + 1)
• Break string of length n into k+1 parts, each each of length n/(k + 1)
• k mismatches can affect at most k of these k+1 parts
• At least one of these k+1 parts is perfectly matched
Filtration: Match Detection(cont’d)
• Suppose k = 3. We would then have l=n/(k+1)=n/4:
• There are at most k mismatches in n, so at the very least there must be one out of the k+1 l–tuples without a mismatch

1…l

l+1…2l

2l+1…3l

3l+1…n

1

2

k

k+ 1

Filtration: Match Verification
• For each l -match we find, try to extend the match further to see if it is substantial

Extend perfect match of lengthl

until we find an approximate match of length n with k mismatches

query

text

Filtration: Example

Shorter perfectmatches required

Performance decreases

Local alignment is to slow…
• Quadratic local alignment is too slow while looking for similarities between long strings (e.g. the entire GenBank database)
Local alignment is to slow…
• Quadratic local alignment is too slow while looking for similarities between long strings (e.g. the entire GenBank database)
Local alignment is to slow…
• Quadratic local alignment is too slow while looking for similarities between long strings (e.g. the entire GenBank database)
Local alignment is to slow…
• Quadratic local alignment is too slow while looking for similarities between long strings (e.g. the entire GenBank database)
• Guaranteed to find the optimal local alignment
• Sets the standard for sensitivity
Local alignment is to slow…
• Quadratic local alignment is too slow while looking for similarities between long strings (e.g. the entire GenBank database)
• Basic Local Alignment Search Tool
• Altschul, S., Gish, W., Miller, W., Myers, E. & Lipman, D.J.

Journal of Mol. Biol., 1990

• Search sequence databases for local alignments to a query
BLAST
• Great improvement in speed, with a modest decrease in sensitivity
• Minimizes search space instead of exploring entire search space between two sequences
• Finds short exact matches (“seeds”), only explores locally around these “hits”
What Similarity Reveals
• BLASTing a new gene
• Evolutionary relationship
• Similarity between protein function
• BLASTing a genome
• Potential genes
BLAST algorithm
• Keyword search of all words of length w from the query of length n in database of length m with score above threshold
• w = 11 for DNA queries, w =3 for proteins
• Local alignment extension for each found keyword
• Extend result until longest match above threshold is achieved
• Running time O(nm)
BLAST algorithm (cont’d)

keyword

Query: KRHRKVLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLKIFLENVIRD

GVK 18

GAK 16

GIK 16

GGK 14

GLK 13

GNK 12

GRK 11

GEK 11

GDK 11

Neighborhood

words

neighborhood

score threshold

(T = 13)

extension

Query: 22 VLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLK 60

+++DN +G + IR L G+K I+ L+ E+ RG++K

High-scoring Pair (HSP)

Original BLAST
• Dictionary
• All words of length w
• Alignment
• Ungapped extensions until score falls below some statistical threshold
• Output
• All local alignments with score > threshold
Original BLAST: Example

A C G A A G T A A G G T C C A G T

• w = 4
• Exact keyword match of GGTC
• Extend diagonals with mismatches until score is under 50%
• Output result
• GTAAGGTCC
• GTTAGGTCC

C T G A T C C T G G A T T G C G A

From lectures by Serafim Batzoglou (Stanford)

Gapped BLAST : Example

A C G A A G T A A G G T C C A G T

• Original BLAST exact keyword search, THEN:
• Extend with gaps around ends of exact matchuntil score < threshold
• Output result

GTAAGGTCCAGT

GTTAGGTC-AGT

C T G A T C C T G G A T T G C G A

From lectures by Serafim Batzoglou (Stanford)

Incarnations of BLAST
• blastn: Nucleotide-nucleotide
• blastp: Protein-protein
• blastx: Translated query vs. protein database
• tblastn: Protein query vs. translated database
• tblastx: Translated query vs. translated

database (6 frames each)

Incarnations of BLAST (cont’d)
• PSI-BLAST
• Find members of a protein family or build a custom position-specific score matrix
• Megablast:
• Search longer sequences with fewer differences
• WU-BLAST: (Wash U BLAST)
Assessing sequence similarity
• Need to know how strong an alignment can be expected from chance alone
• “Chance” relates to comparison of sequences that are generated randomly based upon a certain sequence model
• Sequence models may take into account:
• G+C content
• Poly-A tails
• “Junk” DNA
• Codon bias
• Etc.
BLAST: Segment Score
• BLAST uses scoring matrices (d) to improve on efficiency of match detection
• Some proteins may have very different amino acid sequences, but are still similar
• For any two l-mers x1…xl and y1…yl :
• Segment pair: pair of l-mers, one from each sequence
• Segment score: Sli=1d(xi, yi)
BLAST: Locally Maximal Segment Pairs
• A segment pair is maximal if it has the best score over all segment pairs
• A segment pair is locally maximal if its score can’t be improved by extending or shortening
• Statistically significant locally maximal segment pairs are of biological interest
• BLAST finds all locally maximal segment pairs with scores above some threshold
• A significantly high threshold will filter out some statistically insignificant matches
BLAST: Statistics
• Threshold: Altschul-Dembo-Karlin statistics
• Identifies smallest segment score that is unlikely to happen by chance
• # matches above q has mean E(q) = Kmne-lq; K is a constant, m and n are the lengths of the two compared sequences
• Parameter l is positive root of:

Sx,y in A(pxpyed(x,y)) = 1, where px and py are frequenceies of amino acids x and y, and A is the twenty letter amino acid alphabet

P-values
• The probability of finding b HSPs with a score ≥S is given by:
• (e-EEb)/b!
• For b = 0, that chance is:
• e-E
• Thus the probability of finding at least one HSP with a score ≥S is:
• P = 1 – e-E
Sample BLAST output
• Blast of human beta globin protein against zebra fish

Score E

Sequences producing significant alignments: (bits) Value

gi|18858329|ref|NP_571095.1| ba1 globin [Danio rerio] >gi|147757... 171 3e-44

gi|18858331|ref|NP_571096.1| ba2 globin; SI:dZ118J2.3 [Danio rer... 170 7e-44

gi|37606100|emb|CAE48992.1| SI:bY187G17.6 (novel beta globin) [D... 170 7e-44

gi|31419195|gb|AAH53176.1| Ba1 protein [Danio rerio] 168 3e-43

ALIGNMENTS

>gi|18858329|ref|NP_571095.1| ba1 globin [Danio rerio]

Length = 148

Score = 171 bits (434), Expect = 3e-44

Identities = 76/148 (51%), Positives = 106/148 (71%), Gaps = 1/148 (0%)

Query: 1 MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPK 60

MV T E++A+ LWGK+N+DE+G +AL R L+VYPWTQR+F +FG+LS+P A+MGNPK

Sbjct: 1 MVEWTDAERTAILGLWGKLNIDEIGPQALSRCLIVYPWTQRYFATFGNLSSPAAIMGNPK 60

Query: 61 VKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFG 120

V AHG+ V+G + ++DN+K T+A LS +H +KLHVDP+NFRLL + + A FG

Query: 121 KE-FTPPVQAAYQKVVAGVANALAHKYH 147

+ F VQ A+QK +A V +AL +YH

Sample BLAST output (cont’d)
• Blast of human beta globin DNA against human DNA

Score E

Sequences producing significant alignments: (bits) Value

gi|19849266|gb|AF487523.1| Homo sapiens gamma A hemoglobin (HBG1... 289 1e-75

gi|183868|gb|M11427.1|HUMHBG3E Human gamma-globin mRNA, 3\' end 289 1e-75

gi|44887617|gb|AY534688.1| Homo sapiens A-gamma globin (HBG1) ge... 280 1e-72

gi|31726|emb|V00512.1|HSGGL1 Human messenger RNA for gamma-globin 260 1e-66

gi|38683401|ref|NR_001589.1| Homo sapiens hemoglobin, beta pseud... 151 7e-34

gi|18462073|gb|AF339400.1| Homo sapiens haplotype PB26 beta-glob... 149 3e-33

ALIGNMENTS

>gi|28380636|ref|NG_000007.3| Homo sapiens beta globin region ([email protected]) on chromosome 11

Length = 81706

Score = 149 bits (75), Expect = 3e-33

Identities = 183/219 (83%)

Strand = Plus / Plus

Query: 267 ttgggagatgccacaaagcacctggatgatctcaagggcacctttgcccagctgagtgaa 326

|| ||| | || | || | |||||| ||||| ||||||||||| ||||||||

Sbjct: 54409 ttcggaaaagctgttatgctcacggatgacctcaaaggcacctttgctacactgagtgac 54468

Query: 327 ctgcactgtgacaagctgcatgtggatcctgagaacttc 365

||||||||| |||||||||| ||||| ||||||||||||

Sbjct: 54469 ctgcactgtaacaagctgcacgtggaccctgagaacttc 54507

Timeline
• 1970: Needleman-Wunsch global alignment algorithm
• 1981: Smith-Waterman local alignment algorithm
• 1985: FASTA
• 1990: BLAST (basic local alignment search tool)
• 2000s: BLAST has become too slow in “genome vs. genome” comparisons - new faster algorithms evolve!
• Pattern Hunter
• BLAT

Length = k

Example (k = 11):

11111111111

Each 1 represents a “match”

PatternHunter: matches short non-consecutive sequences (spaced seed)

Increases sensitivity by locating homologies that would otherwise be missed

Example (a spaced seed of length 18 w/ 11 “matches”):

111010010100110111

Each 0 represents a “don’t care”, so there can be a match or a mismatch

PatternHunter: faster and even more sensitive
Example of a hit using a spaced seed:Spaced seeds

How does this result in better sensitivity?

Why is PH better?
• BLAST: redundant hits
• PatternHunter

This results in > 1 hit and creates clusters of redundant hits

This results in very few redundant hits

Why is PH better?

BLAST may also miss a hit

GAGTACTCAACACCAACATTAGTGGGCAATGGAAAAT

|| ||||||||| |||||| | |||||| ||||||

GAATACTCAACAGCAACATCAATGGGCAGCAGAAAAT

9 matches

In this example, despite a clear homology, there is no sequence of continuous matches longer than length 9. BLAST uses a length 11 and because of this, BLAST does not recognize this as a hit!

Resolving this would require reducing the seed length to 9, which would have a damaging effect on speed

11 positions

11 positions

10 positions

Why is PH better?
• Higher hit probability
• Lower expected number of random hits
Use of Multiple Seeds

Basic Searching Algorithm

• Select a group of spaced seed models
• For each hit of each model, conduct extension to find a homology.
Another method: BLAT
• BLAT (BLAST-Like Alignment Tool)
• Same idea as BLAST - locate short sequence hits and extend
BLAT vs. BLAST: Differences
• BLAT builds an index of the database and scans linearly through the query sequence, whereas BLAST builds an index of the query sequence and then scans linearly through the database
• Index is stored in RAM which is memory intensive, but results in faster searches
BLAT: Fast cDNA Alignments

Steps:

1. Break cDNA into 500 base chunks.

2. Use an index to find regions in genome similar to each chunk of cDNA.

3. Do a detailed alignment between genomic regions and cDNA chunk.

4. Use dynamic programming to stitch together detailed alignments of chunks into detailed alignment of whole.

BLAT: Indexing
• An index is built that contains the positions of each k-mer in the genome
• Each k-mer in the query sequence is compared to each k-mer in the index
• A list of ‘hits’ is generated - positions in cDNA and in genome that match for k bases
Indexing: An Example

Here is an example with k = 3:

Genome:cacaattatcacgaccgc

3-mers (non-overlapping): cac aat tat cac gac cgc

Index: aat 3 gac 12

cac 0,9 tat 6

cgc 15

cDNA (query sequence): aattctcac

3-mers (overlapping): aat att ttc tct ctc tca cac

0 1 2 3 4 5 6

Hits: aat 0,3

cac 6,0

cac 6,9

clump: cacAATtatCACgaccgc

Multiple instances map to single index

Position of 3-mer in query, genome

However…
• BLAT was designed to find sequences of 95% and greater similarity of length >40; may miss more divergent or shorter sequence alignments
PatternHunter and BLAT vs. BLAST
• PatternHunter is 5-100 times faster than Blastn, depending on data size, at the same sensitivity
• BLAT is several times faster than BLAST, but best results are limited to closely related sequences
Resources
• tandem.bu.edu/classes/ 2004/papers/pathunter_grp_prsnt.ppt
• http://www.jax.org/courses/archives/2004/gsa04_king_presentation.pdf
• http://www.genomeblat.com/genomeblat/blatRapShow.pps