Combinatorial Pattern Matching

An Introduction to Bioinformatics Algorithms (Jones and Pevzner) www.bioalgorithms.info

- Example of repeats:
- ATGGTCTAGGTCCTAGTGGTC

- Motivation to find them:
- Genomic rearrangements are often associated with repeats
- Trace evolutionary secrets
- Many tumors are characterized by an explosion of repeats
- ~50% of human genome composed of repeats

- The problem is often more difficult because repeats may be inexact:
- ATGGTCTAGGACCTAGTGTTC

- Long repeats are difficult to find
- Short repeats are easy to find (e.g., hashing)
- Simple approach to finding long repeats:
- Find exact repeats of short l-mers (l is usually 10 to 13). For an n-element sequence, the table would need 4^l bins with about n/4^l elements per bin, because there are n - l + 1 l-mers in the sequence
- Use l-mer repeats to potentially extend into longer, maximal repeats

- There are typically many locations where an l-mer is repeated:
GCTTACAGATTCAGTCTTACAGATGGT

- The 4-mer TTAC starts at locations 3 and 17

- Extend these 4-mer matches in both directions:
GCTTACAGATTCAGTCTTACAGATGGT

- Extending to the right gives TTACAGAT; the preceding C also matches, so the maximal repeat is CTTACAGAT (starting at locations 2 and 16)

- To find maximal repeats in this way, we need ALL start locations of all l-mers in the genome
- Hashing lets us find repeats quickly in this manner

- What does hashing do?
- For each piece of data, compute an integer (ideally a different integer for different data)
- Store the data in an array at the index given by that integer

- Hashing is a very efficient way to store and retrieve data

- Hash table: array used in hashing
- Records: data stored in a hash table
- Key: identifies a set of records
- Hash function: uses a key to generate the index at which to insert into the hash table
- Collision: when more than one record is mapped to the same index in the hash table

- Each l-mer can be translated into a binary string (A, C, G, T can be represented as 00, 01, 10, 11)
- After assigning a unique integer per l-mer it is easy to get all start locations of each l-mer in a genome
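
The 2-bit encoding above can be sketched in a few lines of Python. This is only an illustration: the names `CODE`, `lmer_to_int`, and `lmer_indices` are made up for this example and do not come from the slides.

```python
# 2-bit code per nucleotide, as on the slide: A=00, C=01, G=10, T=11.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def lmer_to_int(lmer):
    """Pack an l-mer into an integer in the range 0 .. 4**l - 1."""
    value = 0
    for base in lmer:
        # Shift the bits collected so far and append the next 2-bit code.
        value = (value << 2) | CODE[base]
    return value

def lmer_indices(genome, l):
    """Yield (start, index) for every l-mer, with 1-based start positions."""
    for start in range(len(genome) - l + 1):
        yield start + 1, lmer_to_int(genome[start:start + l])
```

For example, `lmer_to_int("TTAC")` packs the bits 11 11 00 01 and returns 241. Because distinct l-mers always receive distinct integers, this particular encoding is collision-free as long as the table has 4^l bins.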

- To find repeats in a genome:
- For all l-mers in the genome, note the start position and the sequence
- Generate a hash table index for each unique l-mer sequence
- In each index of the hash table, store all genome start locations of the l-mer which generated that index
- Extend l-mer repeats to maximal repeats
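
The steps above can be sketched as follows. This is a minimal sketch, assuming a Python dict is an acceptable stand-in for the hash table (Python resolves collisions internally); the function names `index_lmers` and `extend_repeat` are hypothetical.

```python
from collections import defaultdict

def index_lmers(genome, l):
    """Map each l-mer to the list of its 1-based start positions."""
    table = defaultdict(list)
    for start in range(len(genome) - l + 1):
        table[genome[start:start + l]].append(start + 1)
    return table

def extend_repeat(genome, i, j, l):
    """Extend an exact l-mer match at 1-based positions i < j in both directions."""
    a, b = i - 1, j - 1  # 0-based starts of the two copies
    length = l
    # Extend to the left while the preceding characters agree.
    while a > 0 and genome[a - 1] == genome[b - 1]:
        a, b, length = a - 1, b - 1, length + 1
    # Extend to the right while the following characters agree.
    while b + length < len(genome) and genome[a + length] == genome[b + length]:
        length += 1
    return a + 1, b + 1, genome[a:a + length]
```

On the example sequence GCTTACAGATTCAGTCTTACAGATGGT, `index_lmers(genome, 4)["TTAC"]` is `[3, 17]`, and `extend_repeat(genome, 3, 17, 4)` returns `(2, 16, "CTTACAGAT")`: the seed grows by one base to the left and four to the right. The sketch does not stop the two copies from overlapping, which a fuller implementation would need to decide how to handle.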

- Dealing with collisions:
- “Chain” all start locations of l-mers (linked list)

- What if, instead of finding repeats in a genome, we want to find all sequences in a database that contain a given pattern?
- This leads us to a different problem, the Pattern Matching Problem

- Goal: Find all occurrences of a pattern in a text
- Input: Pattern p = p_1…p_n and text t = t_1…t_m
- Output: All positions 1 ≤ i ≤ (m – n + 1) such that the n-letter substring of t starting at i matches p
- Motivation: Searching database for a known pattern

PatternMatching(p, t)

- n ← length of pattern p
- m ← length of text t
- for i ← 1 to (m – n + 1)
- if t_i…t_{i+n-1} = p
- output i
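
A direct Python rendering of the pseudocode above, as a minimal sketch; the function name `pattern_matching` is chosen for this example, and positions are reported 1-based to match the slides.

```python
def pattern_matching(p, t):
    """Return all 1-based positions i where the n-letter substring of t at i equals p."""
    n, m = len(p), len(t)
    positions = []
    for i in range(m - n + 1):       # 0-based start of the window
        if t[i:i + n] == p:          # up to n character comparisons
            positions.append(i + 1)  # report 1-based, as in the pseudocode
    return positions
```

For instance, `pattern_matching("GCAT", "CGCATC")` returns `[2]`.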

- PatternMatching algorithm for:
- Pattern GCAT
- Text CGCATC
- The pattern is slid along the text one position at a time (i = 1, 2, 3); the only match is at i = 2

- PatternMatching runtime: O(nm)
- Probability-wise, it’s more like O(m)
- Rarely will there be close to n comparisons in line 4 (a worst case is looking for AAAAT in a string consisting of 10^6 A’s). Note: with x character types, the probability that the first character matches is 1/x; the probability that it doesn’t is (x-1)/x. The probability that the first matches and the second doesn’t is (x-1)/x^2. The probability that the first j characters match is 1/x^j.

- Better solution: suffix trees
- Can solve problem in O(m) time
- Conceptually related to keyword trees
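
The probabilistic argument for O(m) can be checked numerically. The sketch below assumes letters are independent and uniformly distributed over x character types (an assumption of this example, not a claim about real genomes); the name `expected_comparisons` is hypothetical.

```python
from fractions import Fraction

def expected_comparisons(x, n):
    """Expected character comparisons at one text position for a pattern of length n."""
    # The (j+1)-th comparison is made only if the first j characters matched,
    # which happens with probability 1 / x**j.
    return sum(Fraction(1, x ** j) for j in range(n))
```

For DNA (x = 4) and a pattern of length 5 this gives 341/256, about 1.33 comparisons per position, and the sum stays below x/(x-1) = 4/3 for any n. Under this model the total work is therefore proportional to m rather than nm.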

- Keyword tree for the keywords:
- Apple
- Apropos
- Banana
- Bandana
- Orange

- Stores a set of keywords in a rooted labeled tree
- Each edge labeled with a letter from an alphabet
- Any two edges coming out of the same vertex have distinct labels
- Every keyword stored can be spelled on a path from root to some leaf

- Search for “appeal”: threading follows a-p-p and then fails, because no edge labeled e leaves that vertex, so “appeal” is not in the tree
- Search for “apple”: threading follows a-p-p-l-e and reaches a leaf, so “apple” is in the tree

- Goal: Given a set of patterns and a text, find all occurrences of any of the patterns in the text
- Input: k patterns p_1,…,p_k, and text t = t_1…t_m
- Output: Positions 1 ≤ i ≤ m where the substring of t starting at i matches p_j for some 1 ≤ j ≤ k
- Motivation: Searching database for known multiple patterns

- Can solve as k “Pattern Matching Problems”
- Runtime: O(kmn), using the PatternMatching algorithm k times
- m - length of the text
- n - average length of the pattern

- Or, we could use keyword trees:
- Build keyword tree in O(N) time; N is total length of all patterns (i.e., k*n)
- With naive threading: O(N + nm)
- Aho-Corasick algorithm: O(N + m)

Example: Suppose m = 10^6, k = 100, and n = 1000

O(kmn): 10^2 * 10^6 * 10^3 = 10^11

O(N + nm): 10^2 * 10^3 + 10^3 * 10^6 = 10^5 + 10^9 ≈ 10^9

O(N + m): 10^2 * 10^3 + 10^6 = 10^5 + 10^6 ≈ 10^6

- To match patterns in a text using a keyword tree:
- Build keyword tree of patterns
- “Thread” the text through the keyword tree

- Threading is “complete” when we reach a leaf in the keyword tree
- When threading is “complete,” we’ve found a pattern in the text
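
Building a keyword tree and threading a text through it naively can be sketched as below. This is an illustration under stated assumptions: the tree is stored as nested dicts, the key `"$"` is a made-up end-of-pattern marker (so the alphabet is assumed not to contain `$`), and the names `build_keyword_tree` and `thread_text` are hypothetical. It is the naive O(N + nm) threading, not Aho-Corasick.

```python
def build_keyword_tree(patterns):
    """Build a keyword tree as nested dicts; the key '$' marks the end of a pattern."""
    root = {}
    for pattern in patterns:
        node = root
        for letter in pattern:
            # Reuse the existing edge for this letter, or create a new one.
            node = node.setdefault(letter, {})
        node["$"] = pattern
    return root

def thread_text(tree, text):
    """Naive threading: restart at the root for every start position of the text."""
    hits = []
    for start in range(len(text)):
        node = tree
        for i in range(start, len(text)):
            node = node.get(text[i])
            if node is None:          # no edge for this letter: threading fails
                break
            if "$" in node:           # threading is complete: a pattern ends here
                hits.append((start + 1, node["$"]))
    return hits
```

With the keywords apple, apropos, banana, bandana, and orange, threading "appeal" yields no hits, while threading "apple" yields `[(1, "apple")]`. The end marker is a small departure from the slides, which detect completion at a leaf; it also covers the case where one pattern is a prefix of another.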

- Suffix trees are similar to keyword trees, except that edges that form paths are collapsed
- Each edge is labeled with a substring of a text
- All internal vertices have at least two outgoing edges
- Leaves labeled by the index of the pattern.

- The suffix tree of a text is constructed from all of its suffixes, e.g., for ATCATG:

ATCATG TCATG CATG ATG TG G

- How much time does it take?
- Building the keyword tree of all suffixes and then collapsing it into a suffix tree takes time linear in the total size of all suffixes, i.e., quadratic in the length of the text
- A suffix tree can also be built directly in linear time (Weiner suffix tree algorithm)

- Suffix trees hold all suffixes of a text
- e.g., ATCGC: ATCGC, TCGC, CGC, GC, C
- Builds in O(m) time for text of length m

- To find any pattern of length n in a text:
- Build suffix tree for text
- Thread the pattern through the suffix tree
- Can find pattern in text in O(n) time!

- O(n + m) time for “Pattern Matching Problem”
- Build suffix tree and lookup pattern

SuffixTreePatternMatching(p,t)

- Build suffix tree for text t
- Thread pattern p through suffix tree
- if threading is complete
- output positions of all p-matching leaves in the tree
- else
- output “Pattern does not appear in text”
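
The threading step can be illustrated with a deliberately simple stand-in for a suffix tree: an uncompressed suffix trie built in quadratic time and space. This is not the linear-time Weiner construction mentioned in these slides; it is only a sketch to show how a pattern is threaded, and all names in it are hypothetical.

```python
def build_suffix_trie(text):
    """Naive, uncompressed suffix trie; each node records the suffixes passing through it."""
    root = {"children": {}, "starts": []}
    for start in range(len(text)):
        node = root
        for letter in text[start:]:
            node = node["children"].setdefault(letter, {"children": {}, "starts": []})
            # Remember the 1-based start of every suffix that reaches this node.
            node["starts"].append(start + 1)
    return root

def suffix_trie_pattern_matching(p, trie):
    """Thread p from the root; return 1-based positions of p, or [] if threading fails."""
    node = trie
    for letter in p:
        node = node["children"].get(letter)
        if node is None:
            return []  # pattern does not appear in text
    return node["starts"]
```

For the text ATCATG, threading AT returns `[1, 4]` and threading GG returns `[]`. Once the structure exists, each lookup takes time proportional to the pattern length, which is the property the suffix tree provides with only linear construction cost.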

- Keyword and suffix trees are used to find patterns in a text
- Keyword trees:
- Build keyword tree of patterns, and thread text through it

- Suffix trees:
- Build suffix tree of text, and thread patterns through it

- So far we have seen only exact pattern matching algorithms
- Usually, because of mutations, it makes much more biological sense to find approximate pattern matches
- Biologists often use fast heuristic approaches (rather than local alignment) to find approximate matches

- Genomes are huge: Smith-Waterman quadratic alignment algorithms are too slow
- Alignment of two sequences usually has short identical or highly similar fragments
- Many heuristic methods (e.g., FASTA) are based on the same idea of filtration
- Find short exact matches, and use them as seeds for potential match extension
- “Filter” out positions with no extendable matches

- Dot matrices show similarities between two sequences
- FASTA makes an implicit dot matrix from short exact matches, and tries to find long diagonals (allowing for some mismatches)

- Identify diagonals above a threshold length
- Diagonals in the dot matrix indicate exact substring matching

- Extend diagonals and try to link them together, allowing for minimal mismatches/indels
- Linking diagonals reveals approximate matches over longer substrings

- Goal: Find all approximate occurrences of a pattern in a text
- Input: A pattern p = p_1…p_n, text t = t_1…t_m, and k, the maximum number of mismatches
- Output: All positions 1 ≤ i ≤ (m – n + 1) such that t_i…t_{i+n-1} and p_1…p_n have at most k mismatches (i.e., Hamming distance between t_i…t_{i+n-1} and p ≤ k)

ApproximatePatternMatching(p, t, k)

- n ← length of pattern p
- m ← length of text t
- for i ← 1 to m – n + 1
- dist ← 0
- for j ← 1 to n
- if t_{i+j-1} ≠ p_j
- dist ← dist + 1
- if dist ≤ k
- output i
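
The pseudocode above translates almost line for line into Python. A minimal sketch, with the function name `approximate_pattern_matching` chosen for this example and positions reported 1-based:

```python
def approximate_pattern_matching(p, t, k):
    """Return 1-based positions where p matches t with at most k mismatches."""
    n, m = len(p), len(t)
    positions = []
    for i in range(m - n + 1):
        # Hamming distance between p and the n-letter window of t starting at i.
        dist = sum(1 for j in range(n) if t[i + j] != p[j])
        if dist <= k:
            positions.append(i + 1)
    return positions
```

For example, `approximate_pattern_matching("ATG", "ATCATGATT", 1)` returns `[1, 4, 7]`: the windows ATC and ATT each differ from ATG in one position, and ATG matches exactly.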

- That algorithm runs in O(nm).
- Landau-Vishkin algorithm: O(km), where m is the length of the text
- We can generalize the “Approximate Pattern Matching Problem” into a “Query Matching Problem”:
- We want to match substrings in a query to substrings in a text with at most k mismatches
- Motivation: we want to see similarities to some gene, but we may not know which parts of the gene to look for

- Goal: Find all substrings of the query that approximately match the text
- Input: Query q = q_1…q_r, text t = t_1…t_m, n (length of matching substrings), k (maximum number of mismatches)
- Output: All pairs of positions (i, j) such that the n-letter substring of q starting at i approximately matches the n-letter substring of t starting at j, with at most k mismatches

- Approximately matching strings share some perfectly matching substrings.
- Instead of searching for approximately matching strings (difficult) search for perfectly matching substrings (easy).

- We want all n-matches between a query and a text with up to k mismatches
- “Filter” out positions we know do not match between text and query
- Potential match detection: find all matches of l-tuples in query and text for some small l
- Potential match verification: Verify each potential match by extending it to the left and right, until (k + 1) mismatches are found

Theorem: If x_1…x_n and y_1…y_n match with at most k mismatches, then they share an l-mer that is perfectly matched, with l = ⌊n/(k + 1)⌋. That is, x_{i+1}…x_{i+l} = y_{i+1}…y_{i+l} for some 0 ≤ i ≤ n – l.

Proof: Partition the set of positions from 1 to n into k + 1 groups with at least ⌊n/(k + 1)⌋ consecutive positions in each group. The k mismatches can affect at most k of these k + 1 groups, so at least one group is perfectly matched.

- Suppose k = 3. We would then have l = n/(k+1) = n/4, and the positions split into k + 1 = 4 blocks: 1…l, l+1…2l, 2l+1…3l, 3l+1…n
- There are at most k mismatches in n positions, so at the very least one of the k + 1 l-tuples has no mismatch
- Note: this gives us a way to select a value for l (or at least an upper bound)

- Potential match detection: Find all matches of l-mers in both query and text for l= n/(k+1) (e.g., use either hashing or a suffix tree).
- Potential match verification: Verify each potential match by extending it to the left and to the right until the first k+ 1 mismatches are found (or the beginning or end of the query or the text is found).
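
The two filtration stages can be sketched as follows. This is a simplified illustration rather than the slides' exact procedure: detection uses a dict of text l-mers, and verification checks every n-letter window containing the seed directly instead of extending until k + 1 mismatches are found. The name `query_matching` and its return format are assumptions of this example.

```python
from collections import defaultdict

def query_matching(q, t, n, k):
    """Return sorted 1-based (i, j) pairs whose n-letter substrings differ in at most k positions."""
    l = n // (k + 1)  # seed length suggested by the theorem
    if l == 0:
        raise ValueError("n must be at least k + 1 for filtration to apply")
    # Potential match detection: index every l-mer of the text by 0-based start.
    index = defaultdict(list)
    for j in range(len(t) - l + 1):
        index[t[j:j + l]].append(j)
    matches = set()
    for i in range(len(q) - l + 1):
        for j in index.get(q[i:i + l], ()):
            # Potential match verification: try each n-window containing the seed.
            for shift in range(n - l + 1):
                a, b = i - shift, j - shift
                if a < 0 or b < 0 or a + n > len(q) or b + n > len(t):
                    continue
                mismatches = sum(1 for x, y in zip(q[a:a + n], t[b:b + n]) if x != y)
                if mismatches <= k:
                    matches.add((a + 1, b + 1))
    return sorted(matches)
```

For example, `query_matching("GCATTT", "AAGCTTTTAA", 6, 1)` returns `[(1, 3)]`: the shared 3-mer TTT seeds the match, and the verified windows GCATTT and GCTTTT differ in one position.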

- For each l-match we find, try to extend the match further to see if it is substantial
- Extend the perfect match of length l between query and text until we find an approximate match of length n with at most k mismatches

- As the allowed number of mismatches k grows, l = n/(k + 1) shrinks: shorter perfect matches are required, and performance decreases

- Quadratic local alignment is too slow while looking for similarities between long strings (e.g., the entire GenBank database)
- But it is guaranteed to find the optimal local alignment
- It sets the standard for sensitivity
- BLAST: Basic Local Alignment Search Tool
- Altschul, S., Gish, W., Miller, W., Myers, E. & Lipman, D.J., Journal of Mol. Biol., 1990
- Searches sequence databases for local alignments to a query

- Great improvement in speed, with a modest decrease in sensitivity
- Minimizes search space instead of exploring entire search space between two sequences
- Finds short exact matches (“seeds”), only explores locally around these “hits”

- BLASTing a new gene
- Evolutionary relationship
- Similarity between protein function

- BLASTing a genome
- Potential genes

- Keyword search of all words of length w from the query of length n in database of length m with score above threshold
- w = 11 for DNA queries, w =3 for proteins

- Local alignment extension for each found keyword
- Extend result until longest match above threshold is achieved

- Running time O(nm)

BLAST example (keyword, neighborhood words, extension):

Query: KRHRKVLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLKIFLENVIRD

Keyword: GVK

Neighborhood words and their scores (neighborhood score threshold T = 13):

GVK 18
GAK 16
GIK 16
GGK 14
GLK 13
GNK 12
GRK 11
GEK 11
GDK 11

Extension, giving a High-scoring Pair (HSP):

Query: 22 VLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLK 60

+++DN +G + IR L G+K I+ L+ E+ RG++K

Sbjct: 226 IIKDNGRGFSGKQIRNLNYGIGLKVIADLV-EKHRGIIK 264

- Dictionary
- All words of length w

- Alignment
- Ungapped extensions until score falls below some statistical threshold

- Output
- All local alignments with score > threshold

- w = 4
- Exact keyword match of GGTC
- Extend diagonals with mismatches until score is under 50%
- Output result
- GTAAGGTCC
- GTTAGGTCC

cf. Serafim Batzoglou lectures (Stanford)

- Original BLAST exact keyword search, THEN:
- Extend with gaps around ends of exact match until score < threshold
- Output result
- GTAAGGTCC-AGT
- GTTAGGTCCTAGT

cf. Serafim Batzoglou lectures (Stanford)

- blastn: Nucleotide-nucleotide
- blastp: Protein-protein
- blastx: Translated query vs. protein database
- tblastn: Protein query vs. translated database
- tblastx: Translated query vs. translated database

- PSI-BLAST
- Find members of a protein family or build a custom position-specific score matrix

- Megablast:
- Search longer sequences with fewer differences

- WU-BLAST: (Wash U BLAST)
- Optimized, added features

- Need to know how strong an alignment can be expected from chance alone
- “Chance” relates to comparison of sequences that are generated randomly based upon a certain sequence model
- Sequence models may take into account:
- G+C content
- Poly-A tails
- “Junk” DNA
- Codon bias
- Etc.

- Blast of human beta globin protein against zebra fish

Score E

Sequences producing significant alignments: (bits) Value

gi|18858329|ref|NP_571095.1| ba1 globin [Danio rerio] >gi|147757... 171 3e-44

gi|18858331|ref|NP_571096.1| ba2 globin; SI:dZ118J2.3 [Danio rer... 170 7e-44

gi|37606100|emb|CAE48992.1| SI:bY187G17.6 (novel beta globin) [D... 170 7e-44

gi|31419195|gb|AAH53176.1| Ba1 protein [Danio rerio] 168 3e-43

ALIGNMENTS

>gi|18858329|ref|NP_571095.1| ba1 globin [Danio rerio]

Length = 148

Score = 171 bits (434), Expect = 3e-44

Identities = 76/148 (51%), Positives = 106/148 (71%), Gaps = 1/148 (0%)

Query: 1 MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPK 60

MV T E++A+ LWGK+N+DE+G +AL R L+VYPWTQR+F +FG+LS+P A+MGNPK

Sbjct: 1 MVEWTDAERTAILGLWGKLNIDEIGPQALSRCLIVYPWTQRYFATFGNLSSPAAIMGNPK 60

Query: 61 VKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFG 120

V AHG+ V+G + ++DN+K T+A LS +H +KLHVDP+NFRLL + + A FG

Sbjct: 61 VAAHGRTVMGGLERAIKNMDNVKNTYAALSVMHSEKLHVDPDNFRLLADCITVCAAMKFG 120

Query: 121 KE-FTPPVQAAYQKVVAGVANALAHKYH 147

+ F VQ A+QK +A V +AL +YH

Sbjct: 121 QAGFNADVQEAWQKFLAVVVSALCRQYH 148

- Blast of human beta globin DNA against human DNA

Score E

Sequences producing significant alignments: (bits) Value

gi|19849266|gb|AF487523.1| Homo sapiens gamma A hemoglobin (HBG1... 289 1e-75

gi|183868|gb|M11427.1|HUMHBG3E Human gamma-globin mRNA, 3' end 289 1e-75

gi|44887617|gb|AY534688.1| Homo sapiens A-gamma globin (HBG1) ge... 280 1e-72

gi|31726|emb|V00512.1|HSGGL1 Human messenger RNA for gamma-globin 260 1e-66

gi|38683401|ref|NR_001589.1| Homo sapiens hemoglobin, beta pseud... 151 7e-34

gi|18462073|gb|AF339400.1| Homo sapiens haplotype PB26 beta-glob... 149 3e-33

ALIGNMENTS

>gi|28380636|ref|NG_000007.3| Homo sapiens beta globin region (HBB@) on chromosome 11

Length = 81706

Score = 149 bits (75), Expect = 3e-33

Identities = 183/219 (83%)

Strand = Plus / Plus

Query: 267 ttgggagatgccacaaagcacctggatgatctcaagggcacctttgcccagctgagtgaa 326

|| ||| | || | || | |||||| ||||| ||||||||||| ||||||||

Sbjct: 54409 ttcggaaaagctgttatgctcacggatgacctcaaaggcacctttgctacactgagtgac 54468

Query: 327 ctgcactgtgacaagctgcatgtggatcctgagaacttc 365

||||||||| |||||||||| ||||| ||||||||||||

Sbjct: 54469 ctgcactgtaacaagctgcacgtggaccctgagaacttc 54507

Expect (E) value

The Expect value (E) is a parameter that describes the number of hits one can "expect" to see by chance when searching a database of a particular size. It decreases exponentially as the Score (S) of the match increases. Essentially, the E value describes the random background noise. For example, an E value of 1 assigned to a hit can be interpreted as meaning that in a database of the current size one might expect to see 1 match with a similar score simply by chance.

The lower the E-value, or the closer it is to zero, the more "significant" the match is. However, keep in mind that virtually identical short alignments have relatively high E values. This is because the calculation of the E value takes into account the length of the query sequence. These high E values make sense because shorter sequences have a higher probability of occurring in the database purely by chance.

The Expect value can also be used as a convenient way to create a significance threshold for reporting results. You can change the Expect value threshold on most BLAST search pages. When the Expect value is increased from the default value of 10, a larger list with more low-scoring hits can be reported.

http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=FAQ#expect

E-values in BLAST results are often read as the probability of the alignment occurring by chance. Strictly, an E-value is the expected number of chance alignments with a score at least as good; for small values the two are nearly equal. It is a statistical calculation based on the quality of the alignment (the score) and the size of the database, so the exact same alignment obtained from a database of a different size will have a different E-value.

An E-value of 1e-3 says that about 0.001 alignments with a score this good are expected by chance in a search of the current database. The database size is already included in the calculation.

An E-value of 0 is actually a rounded-down value (maybe 1e-250 or something), and simply says that there is (almost) no chance that the alignment occurred by chance.

The score is the measure of similarity between two sequences, and is calculated from the alignment matrix.

http://www.protocol-online.org/biology-forums/posts/5426.html

Hits and Misses

Sensitivity: how good is the search at finding relationships (even distant ones)?

Selectivity: are the relationships reported actually true relationships?

Consider the tradeoffs…

Errors

The error rate of a classification system is one measure of how well the system solves the problem for which it was designed. Other measures are possible, including speed and expense.

Definitions

The classifier makes a classification error whenever it classifies the input object as class Ci when the true class is class Cj; i ≠ j and Ci ≠ Cr, the reject class.

The empirical error rate of a classification system is the number of errors made on independent test data divided by the number of classifications attempted.

The empirical reject rate of a classification system is the number of rejects made on independent test data divided by the number of classifications attempted.

Independent test data are sample objects with true class known, including objects from the reject class, that were not used in designing the feature extraction and classification algorithms.

Computer Vision (Shapiro and Stockman)

False Alarms and False Dismissals in Two-Class Problems

Assume that a system is attempting to identify which items belong to class A:

False alarm (false positive): placing an object in class A which does not belong to the class

False dismissal (false negative): failing to place an object in class A which does, in fact, belong to the class

Note that the cost of errors depends upon the problem. It may be necessary to bias the decision process so as to minimize false dismissals at the cost of increasing the number of false alarms.

Computer Vision (Shapiro and Stockman)

Precision Versus Recall

The precision of a retrieval system is the number of relevant items (true C1) retrieved divided by the total number of items retrieved (true C1 plus false alarms actually from C2).

The recall of a retrieval system is the number of relevant items retrieved by the system divided by the total number of relevant items in the database. Equivalently, this is the number of true C1 items retrieved divided by the total of the true C1 items retrieved and the false dismissals.

Example:

Consider a database with 10,000 items, 400 of which are of item A. A specific user query is intended to retrieve the item A entries.

If the system retrieves 300 instances of item A and 200 instances of other unrelated items the precision of the system for this query is 300/(300+200) = .6 (i.e., 60%) and the recall is 300/400 = .75 (i.e., 75%)

Note that 100% recall could be obtained by returning all items in the database, but the precision would be abysmal: 400/10000 = .04 (i.e., 4%). Similarly, higher precision could be achieved by tweaking the system for a low false alarm rate, but the recall would likely suffer.
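
The arithmetic in this example can be reproduced with a small helper. This is only a sketch; the function name `precision_recall` and its parameter names are made up for illustration.

```python
def precision_recall(relevant_retrieved, irrelevant_retrieved, total_relevant):
    """Return (precision, recall) for one query against a database."""
    retrieved = relevant_retrieved + irrelevant_retrieved
    precision = relevant_retrieved / retrieved    # true hits among everything returned
    recall = relevant_retrieved / total_relevant  # true hits among everything relevant
    return precision, recall
```

With 300 relevant and 200 unrelated items retrieved out of 400 relevant items, `precision_recall(300, 200, 400)` gives `(0.6, 0.75)`. Returning the whole database of 10,000 items corresponds to `precision_recall(400, 9600, 400)`, which gives `(0.04, 1.0)`.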

Computer Vision (Shapiro and Stockman)

- 1970: Needleman-Wunsch global alignment algorithm
- 1981: Smith-Waterman local alignment algorithm
- 1985: FASTA
- 1990: BLAST (basic local alignment search tool)
- 2000s: BLAST has become too slow in “genome vs. genome” comparisons - new faster algorithms evolve!
- Pattern Hunter
- BLAT

BLAST: matches short consecutive sequences (consecutive seed)

Length = k

Example (k = 11):

11111111111

Each 1 represents a “match”

PatternHunter: matches short non-consecutive sequences (spaced seed)

Increases sensitivity by locating homologies that would otherwise be missed

Example (a spaced seed of length 18 w/ 11 “matches”):

111010010100110111

Each 0 represents a “don’t care”, so there can be a match or a mismatch
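
A seed model can be applied to an ungapped pair of sequences as sketched below. This is an illustration only, not PatternHunter's implementation; the function name `seed_hits` is hypothetical and the sequences in the usage note are made up.

```python
def seed_hits(a, b, seed):
    """Return 1-based offsets where the seed hits the ungapped alignment of a and b."""
    # Offsets inside the seed where a match is required ('1'); '0' is "don't care".
    required = [k for k, c in enumerate(seed) if c == "1"]
    hits = []
    for start in range(min(len(a), len(b)) - len(seed) + 1):
        if all(a[start + k] == b[start + k] for k in required):
            hits.append(start + 1)
    return hits
```

For the made-up pair ACGTAC and ACTTAC, which differ only at the third position, the spaced seed 1101 hits at offset 1 because the mismatch falls on its "don't care" position, whereas the consecutive seed 1111 has no hit at all.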

Example of a hit using a spaced seed:

How does this result in better sensitivity?

- BLAST: a consecutive seed results in > 1 hit and creates clusters of redundant hits
- PatternHunter: a spaced seed results in very few redundant hits

BLAST may also miss a hit

GAGTACTCAACACCAACATTAGTGGGCAATGGAAAAT

|| ||||||||| |||||| | |||||| ||||||

GAATACTCAACAGCAACATCAATGGGCAGCAGAAAAT

9 matches

In this example, despite a clear homology, there is no sequence of continuous matches longer than length 9. BLAST uses a length 11 and because of this, BLAST does not recognize this as a hit!

Resolving this would require reducing the seed length to 9, which would have a damaging effect on speed
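
The claim about this example can be checked by measuring the longest run of consecutive matching positions. A minimal sketch; the function name `longest_match_run` is chosen for this example.

```python
def longest_match_run(a, b):
    """Length of the longest run of consecutive matching positions in an ungapped alignment."""
    best = run = 0
    for x, y in zip(a, b):
        run = run + 1 if x == y else 0  # a mismatch resets the current run
        best = max(best, run)
    return best

s1 = "GAGTACTCAACACCAACATTAGTGGGCAATGGAAAAT"
s2 = "GAATACTCAACAGCAACATCAATGGGCAGCAGAAAAT"
print(longest_match_run(s1, s2))  # 9
```

The two sequences agree at 30 of their 37 positions, yet the longest uninterrupted run is 9, so a consecutive seed of length 11 cannot produce a hit.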

- Advantages of spaced seeds:
- Higher hit probability
- Lower expected number of random hits

Basic Searching Algorithm

- Select a group of spaced seed models
- For each hit of each model, conduct extension to find a homology.

- BLAT (BLAST-Like Alignment Tool)
- Same idea as BLAST - locate short sequence hits and extend

- BLAT builds an index of the database and scans linearly through the query sequence, whereas BLAST builds an index of the query sequence and then scans linearly through the database
- Index is stored in RAM which is memory intensive, but results in faster searches

- BLAT was designed to find sequences of 95% and greater similarity of length >40; may miss more divergent or shorter sequence alignments

- PatternHunter is 5-100 times faster than Blastn, depending on data size, at the same sensitivity
- BLAT is several times faster than BLAST, but best results are limited to closely related sequences

- tandem.bu.edu/classes/ 2004/papers/pathunter_grp_prsnt.ppt
- http://www.jax.org/courses/archives/2004/gsa04_king_presentation.pdf
- http://www.genomeblat.com/genomeblat/blatRapShow.pps

References

- Simons, Robert W. Advanced Molecular Genetics Course, UCLA (2002). http://www.mimg.ucla.edu/bobs/C159/Presentations/Benzer.pdf
- Batzoglou, S. Computational Genomics Course, Stanford University (2004). http://www.stanford.edu/class/cs262/handouts.html
