COT 6930 HPC and Bioinformatics Pairwise Sequence Alignment (PSA)

COT 6930HPC and BioinformaticsPairwise Sequence Alignment (PSA) Xingquan Zhu Dept. of Computer Science and Engineering

Outline • Dynamic Programming • Dot Matrix Plot • FASTA • BLAST

Difficulty of PSA • Consider two sequences of length n • Number of possible global alignments • For n = 40, there are over 1023 possible alignments • Approaches • “Optimal” solution • Dynamic programming • Heuristic solutions • Dot matrix plot • FASTA • BLAST

Dynamic Programming (DP) • Alignment • Problem can be subdivided • Optimal alignment of n bases • 1. Incorporates optimal alignment of 1…n-1 bases • 2. Combine with best alignment for nth base • Bioinformatics application • Global alignment [Needleman-Wunsch 1970] • Local alignment [Smith-Waterman 1981] • Complexity • O(n2) or O(n × m) algorithm (sequences of length n, m) • Feasible for moderate sized sequences, not entire genomes

Dynamic Programming (DP) • Dynamic programming usually consists of three components. • Recursive relation • Tabular computation • Trace back • This efficient recursive method is used to search through all possible alignments and finding the one with the optimal score.

How DP is used for alignment? • An alignment is represented as a path through a matrix. To search through the matrix of all possible paths and find the optimal path DP is used.

Four possible outcomes in aligning two sequences [1] identity (stay along a diagonal) [2] mismatch (stay along a diagonal) [3] gap in one sequence (move vertically!) [4] gap in the other sequence (move horizontally!)

Intuition of Dynamic Programming If we already have the optimal solution to: XY AB then we know the next pair of characters will either be: XYZ or XY- or XYZ ABCABC AB- (where “-” indicates a gap). So we can extend the match by determining which of these has the highest score.

Intuition of Dynamic Programming Considering sequences: “HEAGA” and “PA”

DP (Global Alignment) • Recursive formula • Notation • xi ith letter of string x • yj jth letter of string y • x1..i prefix of x from letters 1 through i • F matrix of optimal scores (DP Matrix) • F(i, j) represents optimal score lining up x 1..i with y1..j • d gap penalty • s scoring matrix

Constant vs. Non-constant Gap Penalty • Constant gap penalty • Non-constant gap penalty

DP (Global Alignment) • Algorithm • Initialize: F(0,0) = 0, F(i,0) = i × d, F(0, j) = j × d (gap penalties) • Fill from top left to bottom right using recursive formula

DP Example • Sequences • Seq1: HEAGAWGHEE • Seq2: PAWHEAE • Gap Penalty: -8 (Constant) • Scoring Matrix (Blosum 50)

DP Example Algorithm Initialize F(0,0) = 0, F(i,0) = i × d, F(0, j) = j × d (gap penalties) Fill from top left to bottom right using recursive formula Blosum 50

DP Example (Traceback) • Trace arrows from bottom right to top left • Diagonal – both match • Up – left sequence match a gap • Or insert a gap to top sequence • Left – top sequence match a gap • Or insert a gap to left sequence

Global alignment vs. local alignment • Global alignment: the entire sequence of each protein or DNA sequence is contained in the alignment. • Local alignment: only regions of greatest similarity between two sequences are aligned percent identity: ~26% RBP: 26 RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWD- 84 + K++ + + +GTW++MA + L + A V T + +L+ W+ glycodelin: 23 QTKQDLELPKLAGTWHSMAMA-TNNISLMATLKAPLRVHITSLLPTPEDNLEIVLHRWEN 81

Global vs. local alignment • Global alignment are often not effective for highly diverged sequences - do not reflect the biological reality that two sequences may only share limited regions of conserved sequence. • Global methods are useful when you want to force two sequences to align over their entire length • Local alignment is almost always used for database searches such as BLAST. It is useful to find domains (or limited regions of homology) within sequences.

DP (local alignment) • Make 0 minimal score (i.e., start new alignment) • Alignment can start / end anywhere • Start at highest score(s) • End when 0 reached

DP (local alignment) Blosum 50

DP (Local Alignment) • Traceback • Start at highest score and trace arrows back to first 0

Multiple, Repeat, & Overlap Matches • Multiple alignments • “Optimal” global alignment may be best by small margin • Can report several high-scoring alignments • Repeat matches • Allow sections to match repeatedly • Find repeated patterns (domain / motif) • Overlap matches • Matching when the two sequences overlap • Does not penalize overhanging ends

Overlap Alignment Consider the following problem: • Find the most significant overlap between two sequences S,T ? • Possible overlap relations: a. b. Difference from local alignment: require alignment between the endpoints of the two sequences.

Overlap Alignment Formally: given S[1..n] , T[1..m] find i,j such that: d=max{D(S[1..i],T[j..m]) , D(S[i..n],T[1..j]) , D(S[1..n],T[i..j]) , D(S[i..j],T[1..m]) } is maximal. Solution: Same asGlobal alignment except we don’t not penalise overhanging ends.

Overlap Alignment • Initialization:F(i,0)=0,F(0,j)=0 • Recurrence:as in global alignment • Score:maximum value at the bottom line and rightmost line

Overlap Alignment (Example) S =PAWHEAE T =HEAGAWGHEE • Scoring scheme : • Match: +4 • Mismatch: -1 • Indel: -5

Overlap Alignment (Example) S =PAWHEAE T =HEAGAWGHEE • Scoring scheme: • Match: +4 • Mismatch: -1 • Indel: -5

Scoring scheme : • Match: +4 • Mismatch: -1 • Indel: -5 -2 Overlap Alignment (Example) The best overlap is: PAWHEAE------ ---HEAGAWGHEE Pay attention! A different scoring scheme could yield a different result, such as: ---PAW-HEAE HEAGAWGHEE-

Dot Matrix Plot • Method • Put two sequences on vertical and horizontal axes of graph • Put dots wherever there is a match • Diagonal line is region of identity (local alignment) • Filter non-diagonals (minimal length, identity threshold)

Simple Dot Plot Example

The Dotplot is a Table for overview of the similarities between two sequences The palindromic sequence: MAXISTAYAWAYATSIXAM The stretches of similar residues show up as diagonals The palindromic sequence is very important part of DNA sequence. Many enzyme recognize these sequences

Dot Matrix Plot Example • Chimpanzee DNA aligned with itself • Filters needed to eliminate random matches filtered 6 of 10 filtered 8 of 10

Dot Matrix Plot • Advantage • Extremely simple • Can use scoring matrix • Can use multiple filters to rank matches • Example • Red > 50 bases, 95% identical (very close) • Blue > 25 bases, 75% identical (close) • Green > 10 bases, 50% identical (distant) • Limitations • O(n2) algorithm • Relies on visual analysis • Difficult to find “best” alignment • Despite multiple filters, difficult to compare alignments

Dot Matrix Plot • Demo • http://www.vivo.colostate.edu/molkit/dnadot/

G A A T T C A G T T A G 1 1 1 1 1 1 1 1 1 1 1 G 1 1 1 1 1 1 1 2 2 2 2 A 1 2 2 2 2 2 2 2 2 2 2 T 1 2 2 3 3 3 3 3 3 3 3 C 1 2 2 3 3 4 4 4 4 4 4 G 1 2 2 3 3 4 4 5 5 5 5 A 1 2 3 3 3 4 5 5 5 5 6 FASTA - Idea - • Problem of Dynamic Programming DP compute the score in a lot of useless area for optimal sequence • FASTA focuses on diagonal area

FASTA - Heuristic - • Heuristic Good local alignment should have some exact match subsequence. FASTA focus on this area

FASTA High Level Description Let q be a query max  0 For each sequence, s in DB compare q with s and compute a score, y if max < y max  y; bestSequence  s ; Return bestSequence

FASTA - Algorithm - • Step 1 Find all hot-spots // Hot spots is pairs of words of length k that exactly match k=4-6 for DNA, k=1-2 for Amino acid sequences Sequence 1 Hot Spots Sequence 2

G A A T T C A G T T A G * * Q Location G * * A 2,3,7,11 A * * * * C 6 T * * * * G 1,8 C * G * * T 4,5,9,10 A * * * * FASTA - Algorithm - • Step 1 in detail Use look-up Table Query : G A A T T C A G T T A Sequence: G G A T C G A Dot—Matrix Look-up Table

FASTA - Algorithm - • Step 2 Score the Hot-spot and locate the ten best diagonal run. // There is some scoring system; ex. PAM250

FASTA - Algorithm - • Step 3 Combine sub-alignments into one alignment with GAP GAP One of local alignment

FASTA - Algorithm - • Step 4 # Consider weighted direct graph. # Let node be a sub-alignment found in step 1 # Let u and v be nodes # Edge (u,v) exists if alignment u is before in the sequence. # Each edge has gap penalty (negative) # Find the maximum weight path Sub-sequence Edge One Sequence

One of Sequence FASTA - Algorithm - GAP • Step 4 in detail Sub-alignment Gap -5 -3 -3 Max Weight Path

FASTA - Algorithm - • Step 5 Use the dynamic programming in restricted area around the best-score alignment to find out the higher-score alignment than the best-score alignment Width of this band is a parameter

FASTA - Algorithm - • Summary of Algorithm 1: Find all hot-spots // Hot spots is pairs of words of length k that exactly match 2: Score the Hot-spot and locate the ten best diagonal run. 3: Combine sub-alignments into one alignment 4: Score Each alignment with gap penalty and pick up the best-score alignment 5: Use the dynamic programming in restricted area around the best-score alignment to find out the alignment greater than the best-score alignment.

FASTA - Complexity - • Complexity # Step 1 and 2 // select the best 10 diagonal run Let n be a sequence from DB O(n+m) because Step 1 just uses look up the table O(n+m) << O(mn) m,n = 100 to 200

COT 6930 HPC and Bioinformatics Pairwise Sequence Alignment (PSA)

COT 6930 HPC and Bioinformatics Pairwise Sequence Alignment (PSA)

Presentation Transcript

COT 6930 HPC and Bioinformatics Introduction to Molecular Biology

Pairwise Sequence Alignment

Pairwise Sequence Alignment

COT 6930 HPC and Bioinformatics Multiple Sequence Alignment

COT 6930 HPC and Bioinformatics Bioinformatics Resources and Databases

Pairwise Sequence Alignment

Pairwise sequence Alignment

Pairwise Sequence Alignment

Pairwise Sequence Alignment

Pairwise sequence Alignment

Pairwise sequence alignment

Pairwise Sequence Alignment

Pairwise sequence alignment

Pairwise Sequence Alignment

COT 6930 HPC and Bioinformatics Introduction to Molecular Biology

Pairwise sequence Alignment

COT 6930 HPC and Bioinformatics Protein Structure Prediction

COT 6930 HPC and Bioinformatics Sequence Alignment

Pairwise and multiple sequence alignment

Pairwise Sequence Alignment

Pairwise Sequence Alignment

Pairwise sequence alignment