CSE182-L4: Scoring matrices, Dictionary Matching

CSE182-L4: Scoring matrices, Dictionary Matching CSE 182

Class Mailing List • fa05_182@cs.ucsd.edu • To subscribe, send email to • fa05_182-subscribe@cs.ucsd.edu • You can subscribe from the course web page • Use the list for all course related queries, discussions,… CSE 182

Silly Quiz • Name a famous Bioinformatics Researcher • Name a famous Bioinformatics Researcher who is a woman CSE 182

Scoring DNA • DNA has structure. CSE 182

DNA scoring matrices • So far, we considered a simple match/mismatch criterion. • The nucleotides can be grouped into Purines (A,G) and Pyrimidines. • Nucleotide substitutions within a group (transitions) are more likely than those across a group (transversions) CSE 182

Scoring proteins • Scoring protein sequence alignments is a much more complex task than scoring DNA • Not all substitutions are equal • Problem was first worked on by Pauling and collaborators • In the 1970s, Margaret Dayhoff created the first similarity matrices. • “One size does not fit all” • Homologous proteins which are evolutionarily close should be scored differently than proteins that are evolutionarily distant • Different proteins might evolve at different rates and we need to normalize for that CSE 182

PAM 1 distance • Two sequences are 1 PAM apart if they differ in 1 % of the residues. 1% mismatch • PAM1(a,b) = Pr[residue b substitutes residue a, when the sequences are 1 PAM apart] CSE 182

PAM1 matrix • Align many proteins that are very similar • Is this a problem? • PAM1 distance is the probability of a substitution when 1% of the residues have changed • Estimate the frequency Pb|a of residue a being substituted by residue b. • S(a,b) = log10(Pab/PaPb) = log10(Pb|a/Pb) CSE 182

PAM 1 CSE 182

1 PAM 1 PAM PAM distance • Two sequences are 1 PAM apart when they differ in 1% of the residues. • When are 2 sequences 2 PAMs apart? 2 PAM CSE 182

Higher PAMs • PAM2(a,b) = ∑c PAM1(a,c). PAM1 (c,b) • PAM2 = PAM1 * PAM1 (Matrix multiplication) • PAM250 • = PAM1*PAM249 • = PAM1250 CSE 182

Note: This is not the score matrix: What happens as you keep increasing the power? CSE 182

Scoring using PAM matrices • Suppose we know that two sequences are 250 PAMs apart. • S(a,b) = log10(Pab/PaPb)= log10(Pb|a/Pb) = log10(PAM250(a,b)/Pb) CSE 182

BLOSUM series of Matrices • Henikoff & Henikoff: Sequence substitutions in evolutionarily distant proteins do not seem to follow the PAM distributions • A more direct method based on hand-curated multiple alignments of distantly related proteins from the BLOCKS database. • BLOSUM60 Merge all proteins that have greater than 60%. Then, compute the substitution probability. • In practice BLOSUM62 seems to work very well. CSE 182

PAM vs. BLOSUM • What is the correspondence? • PAM1 Blosum1 • PAM2 Blosum2 • Blosum62 • PAM250 Blosum100 CSE 182

Dictionary Matching, R.E. matching, and position specific scoring CSE 182

Dictionary Matching 1:POTATO 2:POTASSIUM 3:TASTE P O T A S T P O T A T O • Q: Given k words (si has length li), and a database of size n, find all matches to these words in the database string. • How fast can this be done? database dictionary CSE 182

Dict. Matching & string matching • How fast can you do it, if you only had one word of length m? • Trivial algorithm O(nm) time • Pre-processing O(m), Search O(n) time. • Dictionary matching • Trivial algorithm (l1+l2+l3…)n • Using a keyword tree, lpn (lp is the length of the longest pattern) • Aho-Corasick: O(n) after preprocessing O(l1+l2..) • We will consider the most general case CSE 182

Direct Algorithm P O P O P O T A S T P O T A T O P O T A T O P O T A T O P O T A T O P O T A T O P O T A T O Observations: • When we mismatch, we (should) know something about where the next match will be. • When there is a mismatch, we (should) know something about other patterns in the dictionary as well. CSE 182

O A P M S T T T O T S I U A E The Trie Automaton • Construct an automaton A from the dictionary • A[v,x] describes the transition from node v to a node w upon reading x. • A[u,’T’] = v, and A[u,’S’] = w • Special root node r • Some nodes are terminal, and labeled with the index of the dictionary word. 1:POTATO 2:POTASSIUM 3:TASTE u v 1 r S 2 w 3 CSE 182

Start with the first position in the db, and the root node. If successful transition Increment current pointer Move to a new node If terminal node “success” Else Retract ‘current’ pointer Increment ‘start’ pointer Move to root & repeat An O(lpn) algorithm for keyword matching CSE 182

c l O A T P S M T T O T S I U A E Illustration: P O T A S T P O T A T O v 1 S CSE 182

Idea for improving the time • Suppose we have partially matched pattern i (indicated by l, and c), but fail subsequently. If some other pattern j is to match • Then prefix(pattern j) = suffix [ first c-l characters of pattern(i)) c l P O T A S T P O T A T O P O T A S S I U M Pattern i T A S T E 1:POTATO 2:POTASSIUM 3:TASTE Pattern j CSE 182

O A S T M P T T O T S I U A E Improving speed of dictionary matching • Every node v corresponds to a string sv that is a prefix of some pattern. • Define F[v] to be the node u such that su is the longest suffix of sv • If we fail to match at v, we should jump to F[v], and commence matching from there • Let lp[v] = |su| 2 3 4 5 1 S 11 6 7 9 10 8 CSE 182

An O(n) alg. For keyword matching • Start with the first position in the db, and the root node. • If successful transition • Increment current pointer • Move to a new node • If terminal node “success” • Else (if at root) • Increment ‘current’ pointer • Mv ‘start’ pointer • Move to root • Else • Move ‘start’ pointer forward • Move to failure node CSE 182

Illustration P O T A S T P O T A T O l c 1 P O T A T O v T S S I U M A S T E CSE 182

Time analysis • In each step, either c is incremented, or l is incremented • Neither pointer is ever decremented (lp[v] < c-l). • l and c do not exceed n • Total time <= 2n l c P O T A S T P O T A T O CSE 182

Blast: Putting it all together • Input: Query of length m, database of size n • Select word-size, scoring matrix, gap penalties, E-value cutoff CSE 182

Blast Steps • Generate an automaton of all query keywords. • Scan database using a “Dictionary Matching” algorithm (O(n) time). Identify all hits. • Extend each hit using a variant of “local alignment” algorithm. Use the scoring matrix and gap penalties. • For each alignment with score S, compute the bit-score, E-value, and the P-value. Sort according to increasing E-value until the cut-off is reached. • Output results. CSE 182

Protein Sequence Analysis • What can you do if BLAST does not return a hit? • Sometimes, homology (evolutionary similarity) exists at very low levels of sequence similarity. • A: Accept hits at higher P-value. • This increases the probability that the sequence similarity is a chance event. • How can we get around this paradox? • Reformulated Q: suppose two sequences B,C have the same level of sequence similarity to sequence A. If A& B are related in function, can we assume that A& C are? If not, how can we distinguish? CSE 182

CSE182-L4: Scoring matrices, Dictionary Matching

CSE182-L4: Scoring matrices, Dictionary Matching

Presentation Transcript

Matrices

Modified from Stanford CS276 slides Chap. 6: Scoring, Term Weighting and the Vector Space Model

Pattern Matching Algorithms: An Overview

Matrices

SCORING, TERM WEIGHTING AND THE VECTOR SPACE MODEL

Unabridged Dictionary and Abridged Dictionary

Matching

Holistic Scoring

BLAST, PSI-BLAST and position-specific scoring matrices

CSE182-L11

Compressed Index for Dictionary Matching

CSE182-L6

CSE182-L10

CSE182-L6

CSE182-L6

CSE 538 MRS BOOK – CHAPTER VI Scoring, Term Weighting and the Vector Space Model

Scoring Matrices

Alignment and Algorithms

Property Matching and Weighted Matching

Hidden Markov Models What are the good for?

Special Matrices