Algorithms for Pairwise Sequence Alignment Craig A. Struble, Ph.D. Marquette University
Overview • Pairwise Sequence Alignment • Dynamic Programming Solution • Global Alignment • Local Alignment • BLAST and FASTA MSCS 230: Bioinformatics I - Pairwise Sequence Alignment
Pairwise Sequence Alignment • As we’ve seen, sequence similarity is an indicator of homology • There are other uses for sequence similarity • Database queries • Comparative genomics • …
Pairwise Sequence Alignment • Example: align HEAGAWGHEE with PAWHEAE • Which one is better?
HEAGAWGHE-E
P-A--W-HEAE
or
HEAGAWGHE-E
--P-AW-HEAE
Scoring • To compare two sequence alignments, calculate a score • PAM or BLOSUM matrices • Matches and mismatches • Gap penalty • Initiating a gap • Gap extension penalty • Extending a gap
Example • Gap penalty: -8 • Gap extension: -8
HEAGAWGHE-E
--P-AW-HEAE
(-8) + (-8) + (-1) + (-8) + 5 + 15 + (-8) + 10 + 6 + (-8) + 6 = 1
Exercise: Calculate the score for
HEAGAWGHE-E
P-A--W-HEAE
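The column-by-column scoring in this example can be sketched in a few lines of Python. This is only an illustration: the function names and the tiny substitution table are assumptions, and the table holds just the residue pairs that occur in the alignment above, with the values that appear as terms in the sum. The sketch counts all eleven columns, including the G aligned to a gap in the fourth column.

```python
# Substitution scores for only the residue pairs needed below
# (assumed values, taken from the terms of the slide's sum).
SUBST = {
    ("A", "P"): -1,
    ("A", "A"): 5,
    ("W", "W"): 15,
    ("H", "H"): 10,
    ("E", "E"): 6,
}


def pair_score(a, b):
    """Look up a substitution score, ignoring the order of the pair."""
    return SUBST[(a, b)] if (a, b) in SUBST else SUBST[(b, a)]


def score_alignment(top, bottom, d=-8, e=-8):
    """Score an alignment given as two equal-length gapped strings.

    d is charged for the first column of a gap, e for each further
    column of the same gap (here both are -8, as on the slide).
    """
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have equal length")
    total = 0
    prev_gap = None  # row that had a gap in the previous column, if any
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            row = "top" if a == "-" else "bottom"
            total += e if prev_gap == row else d
            prev_gap = row
        else:
            total += pair_score(a, b)
            prev_gap = None
    return total


print(score_alignment("HEAGAWGHE-E", "--P-AW-HEAE"))
```

Scoring the exercise alignment the same way would need one more table entry (for the H/P pair), which is left out here on purpose.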
Formal Description • Problem: PairSeqAlign • Input: Two sequences x, y; scoring matrix s; gap penalty d; gap extension penalty e • Output: The optimal sequence alignment
How Difficult Is This? • Consider two sequences of length n • There are $\binom{2n}{n} = \frac{(2n)!}{(n!)^2} \approx \frac{2^{2n}}{\sqrt{\pi n}}$ possible global alignments, and we need to find an optimal one from amongst those!
So what? • So at n = 20, we have over 120 billion possible alignments • We want to be able to align much, much longer sequences • Some proteins have 1000 amino acids • Genes can have several thousand base pairs
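To get a feel for the growth, the count can be evaluated directly. The sketch below takes the number of global alignments of two length-n sequences to be the central binomial coefficient C(2n, n), which is consistent with the "over 120 billion" figure at n = 20. The function name is an assumption made for illustration.

```python
import math


def count_global_alignments(n):
    """Central binomial coefficient C(2n, n), used here as the alignment count."""
    return math.comb(2 * n, n)


for n in (5, 10, 20):
    print(n, count_global_alignments(n))
```

Even at n = 20 the count is far too large to enumerate, which is the motivation for the dynamic programming approach that follows.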
Dynamic Programming • General algorithmic development technique • Reuses the results of previous computations • Store intermediate results in a table for reuse • Look up in table for earlier result to build from
Global Alignment • Needleman-Wunsch (1970) • Idea: Build up optimal alignment from optimal alignments of subsequences • Start from the alignment below, with score -25, and add the score from the table for one of three extensions
HEAG
--P-    score -25
HEAG-
--P-A   score -33 (gap with top)
HEAGA
--P-A   score -20 (top and bottom)
HEAGA
--P--   score -33 (gap with bottom)
Global Alignment • Notation • x_i – ith letter of string x • y_j – jth letter of string y • x_{1..i} – Prefix of x from letters 1 through i • F – matrix of optimal scores • F(i,j) represents optimal score lining up x_{1..i} with y_{1..j} • d – gap penalty • s – scoring matrix
Global Alignment • The work is to build up F • Initialize: F(0,0) = 0, F(i,0) = i·d, F(0,j) = j·d (with d negative, e.g. -8) • Fill from top left to bottom right using the recursive relation $F(i,j) = \max\{F(i-1,j-1) + s(x_i,y_j),\ F(i-1,j) + d,\ F(i,j-1) + d\}$
Global Alignment • Three ways to reach F(i,j) • From F(i-1,j-1), adding s(x_i,y_j): move ahead in both • From F(i-1,j), adding d: x_i aligned to gap • From F(i,j-1), adding d: y_j aligned to gap • While building the table, keep track of where each optimal score came from (the arrows), so that they can be followed in reverse
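A minimal sketch of the table fill, assuming a linear gap penalty d (negative, as in the example) and x indexing the rows of F. The simple match/mismatch function is a stand-in for a PAM or BLOSUM lookup and is not part of the slides; all names are illustrative.

```python
def simple_score(a, b, match=2, mismatch=-1):
    """Stand-in for a scoring matrix s: +2 for a match, -1 otherwise (assumed)."""
    return match if a == b else mismatch


def fill_table(x, y, s, d):
    """Fill the global alignment table F for sequences x and y.

    F[i][j] is the best score for aligning x[:i] with y[:j].
    """
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * d  # prefix of x aligned entirely to gaps
    for j in range(1, m + 1):
        F[0][j] = j * d  # prefix of y aligned entirely to gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),  # move ahead in both
                F[i - 1][j] + d,                          # x_i aligned to gap
                F[i][j - 1] + d,                          # y_j aligned to gap
            )
    return F


F = fill_table("ACGT", "AGT", simple_score, d=-2)
for row in F:
    print(row)
```

The bottom-right cell holds the optimal global score. Each cell depends only on its three upper-left neighbours, which is why filling from top left to bottom right works.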
Example
Completed Table
Traceback • Trace arrows back from the lower right to top left • Diagonal – both • Up – upper gap • Left – lower gap
HEAGAWGHE-E
--P-AW-HEAE
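The traceback can be sketched by storing an arrow in each cell while the table is filled and then following the arrows from the lower right corner. This is an illustrative sketch, not the lecture's code: the match/mismatch scoring is an assumed stand-in for a real matrix, and because x indexes the rows here, an "up" arrow puts the gap in the bottom sequence. Which arrow means which gap depends on how the table is oriented.

```python
def simple_score(a, b, match=2, mismatch=-1):
    """Stand-in for a scoring matrix s (assumed values)."""
    return match if a == b else mismatch


def needleman_wunsch(x, y, s, d):
    """Global alignment with a linear gap penalty d; returns (score, top, bottom)."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    arrow = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0], arrow[i][0] = i * d, "up"
    for j in range(1, m + 1):
        F[0][j], arrow[0][j] = j * d, "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            choices = [
                (F[i - 1][j - 1] + s(x[i - 1], y[j - 1]), "diag"),
                (F[i - 1][j] + d, "up"),
                (F[i][j - 1] + d, "left"),
            ]
            # On ties, max() keeps the first option, so diagonal is preferred.
            F[i][j], arrow[i][j] = max(choices, key=lambda c: c[0])

    # Follow the arrows back from the lower right to the top left.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        move = arrow[i][j]
        if move == "diag":
            top.append(x[i - 1]); bottom.append(y[j - 1])
            i, j = i - 1, j - 1
        elif move == "up":
            top.append(x[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(y[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = needleman_wunsch("ACGT", "AGT", simple_score, d=-2)
print(score)
print(top)
print(bottom)
```

When several alignments tie for the best score, this sketch reports only one of them; the choice is fixed by the order of the options in `choices`.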
Summary • Uses a recurrence to fill in a table of intermediate results • Uses O(nm) space and time • An O(n^2) algorithm when both sequences have length about n • Feasible for moderate-sized sequences, but not for aligning whole genomes.
Local Alignment • Smith-Waterman (1981) • Another dynamic programming solution
Example
Traceback • Start at the highest score and trace back to the first 0
AWGHE
AW-HE
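A minimal sketch of the local variant, under the same assumptions as a linear-gap global alignment: the only changes are that a cell is never allowed to drop below 0, and the traceback starts at the highest-scoring cell and stops at the first 0. The match/mismatch scoring is an assumed stand-in for a real matrix, and the function names are illustrative.

```python
def simple_score(a, b, match=2, mismatch=-1):
    """Stand-in for a scoring matrix s (assumed values)."""
    return match if a == b else mismatch


def smith_waterman(x, y, s, d):
    """Local alignment with a linear gap penalty d; returns (score, top, bottom)."""
    n, m = len(x), len(y)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, bi, bj = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = max(
                0,                                        # start a new alignment
                H[i - 1][j - 1] + s(x[i - 1], y[j - 1]),  # extend with a pair
                H[i - 1][j] + d,                          # x_i aligned to gap
                H[i][j - 1] + d,                          # y_j aligned to gap
            )
            if H[i][j] > best:
                best, bi, bj = H[i][j], i, j

    # Trace back from the highest score until a 0 is reached.
    top, bottom = [], []
    i, j = bi, bj
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            top.append(x[i - 1]); bottom.append(y[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + d:
            top.append(x[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(y[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("GGACGTCC", "TTACGTAA", simple_score, d=-2))
```

The 0 option only helps if unrelated regions tend to score below 0, which is why the expected score against random sequence has to be negative.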
Summary • Similar to global alignment algorithm • For this to work, the expected score of a match against a random sequence must be negative • Behavior is like global alignment otherwise • Similar extensions for repeated and overlap matching • Care must be given to gap penalties to maintain O(nm) time complexity
Repeat and Overlap Matches • Repeat matches allow for sections of a sequence to match repeatedly • Repeated domain or motif • Overlap matches • Matching when the two sequences overlap • Does not penalize overhanging ends
BLAST • O(n^2) algorithms are too slow for large-scale searches • BLAST was developed by Altschul et al. (1990) • Uses a probabilistic approach to searching • Idea: True alignments will have a short stretch of identities (perfect matches)
BLAST Overview • Make a list of neighborhood words • Length 3 for proteins, 11 for nucleic acids • Words that match the query with a score higher than some threshold • Usually 2 bits per residue • Scan the database for these words • When a hit is obtained, extend the match in both directions as an ungapped alignment
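The seed-and-extend idea can be illustrated with a toy sketch. This is not BLAST itself: it seeds on exact word matches rather than neighborhood words scored against a threshold, it uses assumed match/mismatch scores instead of a scoring matrix, and the `x_drop` stopping rule and all function names are assumptions made for the example.

```python
def extend(q, s, i, j, w, match, mismatch, x_drop):
    """Extend an exact word hit q[i:i+w] == s[j:j+w] in both directions, ungapped."""
    best = cur = w * match
    best_right = k = 0
    while i + w + k < len(q) and j + w + k < len(s):
        cur += match if q[i + w + k] == s[j + w + k] else mismatch
        k += 1
        if cur > best:
            best, best_right = cur, k
        elif best - cur > x_drop:
            break  # score has fallen too far below the best seen
    cur = best
    best_left = k = 0
    while i - k - 1 >= 0 and j - k - 1 >= 0:
        cur += match if q[i - k - 1] == s[j - k - 1] else mismatch
        k += 1
        if cur > best:
            best, best_left = cur, k
        elif best - cur > x_drop:
            break
    # (score, start in query, start in subject, length of ungapped segment)
    return best, i - best_left, j - best_left, w + best_left + best_right


def seed_and_extend(query, subject, w=3, match=1, mismatch=-1, x_drop=2):
    """Find exact word hits, extend each one, and return the distinct segments."""
    words = {}
    for i in range(len(query) - w + 1):
        words.setdefault(query[i:i + w], []).append(i)
    segments = set()
    for j in range(len(subject) - w + 1):  # scan the "database" sequence
        for i in words.get(subject[j:j + w], []):
            segments.add(extend(query, subject, i, j, w, match, mismatch, x_drop))
    return sorted(segments, key=lambda seg: -seg[0])


print(seed_and_extend("TTGACCA", "GGGACCAGG"))
```

Only the neighbourhood of a word hit is ever examined, so the cost is driven by the number of hits rather than by the full product of the sequence lengths.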
FASTA • Pearson & Lipman (1988) • Find all matching words of length ktup • 1 or 2 for proteins, 4 or 6 for DNA • Look for diagonals supporting word matches • Extend with ungapped alignment • Join ungapped regions with gaps
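The first two FASTA steps (finding ktup word matches and looking for well-supported diagonals) can be sketched as follows. This is a toy illustration with assumed names, not the FASTA program: it only counts word matches per diagonal and leaves out the ungapped extension and the joining of regions.

```python
from collections import Counter


def diagonal_hits(query, subject, ktup=2):
    """Count exact ktup word matches on each diagonal (query index - subject index)."""
    positions = {}
    for i in range(len(query) - ktup + 1):
        positions.setdefault(query[i:i + ktup], []).append(i)
    diagonals = Counter()
    for j in range(len(subject) - ktup + 1):
        for i in positions.get(subject[j:j + ktup], []):
            diagonals[i - j] += 1
    return diagonals


hits = diagonal_hits("TTGACCA", "GACCAGG", ktup=2)
print(hits.most_common(1))
```

A diagonal with many word matches suggests a region where the two sequences line up without gaps, so it is a natural place to start the ungapped extension.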