Algorithms for Pairwise Sequence Alignment

Craig A. Struble, Ph.D., Marquette University

Overview
• Pairwise Sequence Alignment
• Dynamic Programming Solution
• Global Alignment
• Local Alignment
• BLAST and FASTA
MSCS 230: Bioinformatics I - Pairwise Sequence Alignment

Pairwise Sequence Alignment
• As we’ve seen, sequence similarity is an indicator of homology
• There are other uses for sequence similarity
  • Database queries
  • Comparative genomics
  • …

Pairwise Sequence Alignment
• Example: HEAGAWGHEE and PAWHEAE
• Which one is better?
    HEAGAWGHE-E        HEAGAWGHE-E
    P-A--W-HEAE        --P-AW-HEAE

Scoring
• To compare two sequence alignments, calculate a score
• PAM or BLOSUM matrices
  • Matches and mismatches
• Gap penalty
  • Initiating a gap
• Gap extension penalty
  • Extending a gap

Example
• Gap penalty: -8
• Gap extension: -8
    HEAGAWGHE-E
    --P-AW-HEAE
  (-8) + (-8) + (-1) + (-8) + 5 + 15 + (-8) + 10 + 6 + (-8) + 6 = 1
• Exercise: Calculate the score for
    HEAGAWGHE-E
    P-A--W-HEAE
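
The column-by-column sum can be checked with a short sketch. The function name `score_alignment` and the dictionary `SUBST` are illustrative assumptions, not part of the slides; `SUBST` holds only the residue pairs that occur in this alignment, with the substitution values that appear in the sum.

```python
def score_alignment(top, bottom, subst, gap_open=-8, gap_extend=-8):
    """Sum the column scores of an alignment given as two gapped strings."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have equal length")
    total = 0
    prev_gap = None  # row that held a gap in the previous column, if any
    for a, b in zip(top, bottom):
        if a == "-" and b == "-":
            raise ValueError("a column cannot contain two gaps")
        if a == "-" or b == "-":
            row = "top" if a == "-" else "bottom"
            # First gap of a run pays gap_open, later ones pay gap_extend.
            total += gap_extend if row == prev_gap else gap_open
            prev_gap = row
        else:
            total += subst[a, b] if (a, b) in subst else subst[b, a]
            prev_gap = None
    return total

# Hypothetical partial matrix: only the pairs needed for this alignment.
SUBST = {("A", "P"): -1, ("A", "A"): 5, ("W", "W"): 15,
         ("H", "H"): 10, ("E", "E"): 6}

print(score_alignment("HEAGAWGHE-E", "--P-AW-HEAE", SUBST))  # prints 1
```

With both penalties at -8 every gap column costs the same; lowering `gap_extend` makes only the second column of the leading two-column gap cheaper.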

Formal Description
• Problem: PairSeqAlign
• Input:
  • Two sequences x, y
  • Scoring matrix s
  • Gap penalty d
  • Gap extension penalty e
• Output: The optimal sequence alignment

How Difficult Is This?
• Consider two sequences of length n
• There are $\binom{2n}{n} = \frac{(2n)!}{(n!)^2} \approx \frac{2^{2n}}{\sqrt{\pi n}}$ possible global alignments, and we need to find an optimal one from amongst those!

So what?
• So at n = 20, we have over 120 billion possible alignments
• We want to be able to align much, much longer sequences
  • Some proteins have 1000 amino acids
  • Genes can have several thousand base pairs
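
The n = 20 figure can be reproduced with a small sketch. It assumes the number of global alignments of two length-n sequences is the binomial coefficient C(2n, n), the standard count; the function name is illustrative.

```python
import math

def count_global_alignments(n):
    """Number of global alignments of two length-n sequences, assuming C(2n, n)."""
    return math.comb(2 * n, n)

print(count_global_alignments(20))  # 137846528820
```

Each additional residue multiplies the count by roughly four, which is why enumerating alignments is hopeless for proteins with hundreds of residues.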

Dynamic Programming
• General algorithmic development technique
• Reuses the results of previous computations
• Store intermediate results in a table for reuse
• Look up earlier results in the table to build on

Global Alignment
• Needleman-Wunsch (1970)
• Idea: Build up optimal alignment from optimal alignments of subsequences
• Extend HEAG / --P- (score -25) by one column, adding the score from the table:
  • Gap with bottom: HEAG- / --P-A, score -33
  • Top and bottom: HEAGA / --P-A, score -20
  • Gap with top: HEAGA / --P--, score -33

Global Alignment
• Notation
  • x_i – ith letter of string x
  • y_j – jth letter of string y
  • x_{1..i} – prefix of x from letters 1 through i
  • F – matrix of optimal scores
  • F(i,j) represents optimal score lining up x_{1..i} with y_{1..j}
  • d – gap penalty
  • s – scoring matrix

Global Alignment
• The work is to build up F
• Initialize: F(0,0) = 0, F(i,0) = id, F(0,j) = jd
• Fill from top left to bottom right using the recurrence relation

Global Alignment
$$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i, y_j) & \text{move ahead in both} \\ F(i-1,j) + d & x_i \text{ aligned to gap} \\ F(i,j-1) + d & y_j \text{ aligned to gap} \end{cases}$$
• While building the table, keep track of where each optimal score came from (reverse arrows)

Example

Completed Table

Traceback
• Trace arrows back from the lower right to top left
  • Diagonal – both
  • Up – upper gap
  • Left – lower gap
    HEAGAWGHE-E
    --P-AW-HEAE
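
A minimal sketch of the fill and traceback steps follows, under stated assumptions. The gap penalty d is negative and added, matching F(i,0) = id. The substitution scores in `_PAIRS` are assumed to be BLOSUM50 values for the six residues in the example, since the slides do not name the matrix. Instead of storing arrows, the traceback re-derives which move produced each cell, preferring the diagonal on ties. The names `needleman_wunsch`, `s`, and `_PAIRS` are illustrative.

```python
# Assumed BLOSUM50 scores for the residues A, E, G, H, P, W only.
_PAIRS = {
    "AA": 5, "AE": -1, "AG": 0, "AH": -2, "AP": -1, "AW": -3,
    "EE": 6, "EG": -3, "EH": 0, "EP": -1, "EW": -3,
    "GG": 8, "GH": -2, "GP": -2, "GW": -3,
    "HH": 10, "HP": -2, "HW": -3,
    "PP": 10, "PW": -4, "WW": 15,
}

def s(a, b):
    """Symmetric lookup of the substitution score for residues a and b."""
    return _PAIRS[a + b] if a + b in _PAIRS else _PAIRS[b + a]

def needleman_wunsch(x, y, score=s, d=-8):
    """Global alignment with a linear gap penalty d (negative)."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * d
    for j in range(1, m + 1):
        F[0][j] = j * d
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),  # both letters
                F[i - 1][j] + d,                              # x_i vs gap
                F[i][j - 1] + d,                              # y_j vs gap
            )
    # Traceback from the lower right corner to the top left corner.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + score(x[i - 1], y[j - 1]):
            top.append(x[i - 1]); bottom.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + d:
            top.append(x[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(y[j - 1]); j -= 1
    return F[n][m], "".join(reversed(top)), "".join(reversed(bottom))

print(needleman_wunsch("HEAGAWGHEE", "PAWHEAE"))
# (1, 'HEAGAWGHE-E', '--P-AW-HEAE')
```

Several cells on the optimal path are ties, so a different tie-breaking order can return a different alignment with the same score.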

Summary
• Uses recursion to fill in intermediate results table
• Uses O(nm) space and time
  • O(n^2) algorithm when both sequences have length about n
• Feasible for moderately sized sequences, but not for aligning whole genomes.

Local Alignment
• Smith-Waterman (1981)
• Another dynamic programming solution

Example

Traceback
• Start at highest score and trace back to first 0
    AWGHE
    AW-HE
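
The local variant can be sketched the same way. It differs from the global fill in two places: every cell is floored at 0, and the traceback starts at the highest-scoring cell and stops at the first 0. As assumptions, the substitution scores are taken to be BLOSUM50 values for the residues in the example, the gap penalty is a linear -8, and the names `smith_waterman`, `s`, and `_PAIRS` are illustrative.

```python
# Assumed BLOSUM50 scores for the residues A, E, G, H, P, W only.
_PAIRS = {
    "AA": 5, "AE": -1, "AG": 0, "AH": -2, "AP": -1, "AW": -3,
    "EE": 6, "EG": -3, "EH": 0, "EP": -1, "EW": -3,
    "GG": 8, "GH": -2, "GP": -2, "GW": -3,
    "HH": 10, "HP": -2, "HW": -3,
    "PP": 10, "PW": -4, "WW": 15,
}

def s(a, b):
    """Symmetric lookup of the substitution score for residues a and b."""
    return _PAIRS[a + b] if a + b in _PAIRS else _PAIRS[b + a]

def smith_waterman(x, y, score=s, d=-8):
    """Best local alignment with a linear gap penalty d (negative)."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]  # first row and column stay 0
    best, bi, bj = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                0,                                            # start afresh
                F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                F[i - 1][j] + d,
                F[i][j - 1] + d,
            )
            if F[i][j] > best:
                best, bi, bj = F[i][j], i, j
    # Traceback from the highest score until a cell holding 0 is reached.
    top, bottom = [], []
    i, j = bi, bj
    while i > 0 and j > 0 and F[i][j] > 0:
        if F[i][j] == F[i - 1][j - 1] + score(x[i - 1], y[j - 1]):
            top.append(x[i - 1]); bottom.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] + d:
            top.append(x[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(y[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))

print(smith_waterman("HEAGAWGHEE", "PAWHEAE"))
# (28, 'AWGHE', 'AW-HE')
```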

Summary
• Similar to global alignment algorithm
• For this to work, the expected score of a match against a random sequence must be negative
  • Behavior is like global alignment otherwise
• Similar extensions for repeated and overlap matching
• Care must be given to gap penalties to maintain O(nm) time complexity

Repeat and Overlap Matches
• Repeat matches allow for sections of a sequence to match repeatedly
  • Repeated domain or motif
• Overlap matches
  • Matching when the two sequences overlap
  • Does not penalize overhanging ends
(Figure: overlap configurations of sequences x and y.)
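
One common way to leave overhanging ends unpenalized is sketched below; it is an assumption about the method, not taken from the slides. The first row and column are initialized to 0 so that leading overhangs are free, and the result is the best score on the last row or last column so that trailing overhangs are free. The toy scoring function `simple` and the name `overlap_score` are hypothetical.

```python
def overlap_score(x, y, score, d):
    """Best overlap-match score with a linear gap penalty d (negative)."""
    n, m = len(x), len(y)
    # Zero boundaries: skipping a prefix of either sequence costs nothing.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                F[i - 1][j] + d,
                F[i][j - 1] + d,
            )
    # The alignment may end anywhere on the last row or last column.
    return max(max(F[n]), max(row[m] for row in F))

def simple(a, b):
    """Toy scoring for illustration: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1

print(overlap_score("AAACGT", "CGTTTT", simple, d=-2))  # prints 3
```

Here the suffix CGT of the first sequence overlaps the prefix CGT of the second, and the unmatched AAA and TTT ends contribute nothing.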

BLAST
• O(n^2) algorithms are too slow for large-scale searches
• BLAST developed by Altschul et al. (1990)
• Uses a probabilistic approach to searching
• Idea: True alignments will have a short stretch of identities (perfect match)

BLAST Overview
• Make a list of neighborhood words
  • Length 3 for proteins, 11 for nucleic acids
  • Words that match the query with a score higher than some threshold
  • Usually 2 bits per residue
• Scans database for words
• When a hit is obtained, extends the match in both directions as an ungapped alignment
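
The first step, building the neighborhood word list, can be sketched as a brute-force enumeration. This is an illustration under assumptions rather than BLAST's actual implementation: the function name, the toy two-letter alphabet, the toy scoring function, and the inclusive comparison against the threshold are all hypothetical choices.

```python
from itertools import product

def neighborhood_words(query, k, alphabet, score, threshold):
    """Map each k-letter word to the query positions it would seed.

    A word is kept for a position if its ungapped score against the query
    word at that position reaches the threshold (inclusive, by assumption).
    """
    seeds = {}
    for pos in range(len(query) - k + 1):
        qword = query[pos:pos + k]
        for letters in product(alphabet, repeat=k):
            total = sum(score(a, b) for a, b in zip(qword, letters))
            if total >= threshold:
                seeds.setdefault("".join(letters), []).append(pos)
    return seeds

def toy_score(a, b):
    """Toy scoring for illustration: +2 for a match, -1 for a mismatch."""
    return 2 if a == b else -1

print(neighborhood_words("AAC", 2, "AC", toy_score, threshold=1))
# {'AA': [0, 1], 'AC': [0, 1], 'CA': [0], 'CC': [1]}
```

Enumerating every word costs |alphabet|^k per query position, which is manageable for protein words of length 3 (8000 candidates) but is the reason real tools avoid brute force for longer words.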

FASTA
• Pearson & Lipman (1988)
• Find all matching words of length ktup
  • 1 or 2 for proteins, 4 or 6 for DNA
• Look for diagonals supporting word matches
• Extend with ungapped alignment
• Join ungapped regions with gaps
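
The first two steps, finding matching ktup words and tallying the diagonals that support them, can be sketched as follows. This is a simplified illustration, not FASTA's implementation; the function name `diagonal_hits` and the convention that a diagonal is labeled by i - j (position in x minus position in y) are assumptions.

```python
from collections import Counter, defaultdict

def diagonal_hits(x, y, ktup=2):
    """Count exact ktup word matches between x and y on each diagonal i - j."""
    # Index every word of length ktup in y by its start position.
    index = defaultdict(list)
    for j in range(len(y) - ktup + 1):
        index[y[j:j + ktup]].append(j)
    # Look up each word of x and record the diagonal of every match.
    hits = Counter()
    for i in range(len(x) - ktup + 1):
        for j in index.get(x[i:i + ktup], ()):
            hits[i - j] += 1
    return hits

print(dict(diagonal_hits("HEAGAWGHEE", "PAWHEAE", ktup=2)))
# {-3: 2, 3: 1, 4: 1}
```

Diagonals with many word matches are the candidates for ungapped extension; neighboring diagonals such as 3 and 4 here are the kind of regions a later step would join with a gap.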
