Algorithms for Pairwise Sequence Alignment

Algorithms for Pairwise Sequence Alignment. Craig A. Struble, Ph.D. Marquette University. Overview. Pairwise Sequence Alignment Dynamic Programming Solution Global Alignment Local Alignment BLAST and FASTA. Pairwise Sequence Alignment.

Presentation Transcript


  1. Algorithms for Pairwise Sequence Alignment Craig A. Struble, Ph.D. Marquette University

  2. Overview • Pairwise Sequence Alignment • Dynamic Programming Solution • Global Alignment • Local Alignment • BLAST and FASTA MSCS 230: Bioinformatics I - Pairwise Sequence Alignment

  3. Pairwise Sequence Alignment • As we’ve seen, sequence similarity is an indicator of homology • There are other uses for sequence similarity • Database queries • Comparative genomics • …

  4. Pairwise Sequence Alignment • Example: two sequences, HEAGAWGHEE and PAWHEAE • Which alignment is better?
       HEAGAWGHE-E        HEAGAWGHE-E
       P-A--W-HEAE        --P-AW-HEAE

  5. Scoring • To compare two sequence alignments, calculate a score for each • Matches and mismatches: scored with PAM or BLOSUM matrices • Gap penalty: charged for initiating a gap • Gap extension penalty: charged for extending a gap

  6. Example • Gap penalty: -8 • Gap extension: -8
       HEAGAWGHE-E
       --P-AW-HEAE
     One term per column: (-8) + (-8) + (-1) + (-8) + 5 + 15 + (-8) + 10 + 6 + (-8) + 6 = 1
     • Exercise: calculate the score for
       HEAGAWGHE-E
       P-A--W-HEAE
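The column-by-column sum on this slide can be checked with a few lines of Python. This is only a sketch: `PAIR_SCORES` and `score_alignment` are hypothetical names, and the table holds just the residue pairs that occur in this one alignment, with values read off the slide's sum on the assumption that the terms follow the column order. It is not a full PAM or BLOSUM matrix.

```python
GAP = -8  # gap penalty from the slide; the extension penalty is also -8

# Assumption: scores for the residue pairs in this alignment only,
# taken from the terms of the slide's sum (not a complete matrix).
PAIR_SCORES = {
    ("A", "P"): -1,
    ("A", "A"): 5,
    ("W", "W"): 15,
    ("H", "H"): 10,
    ("E", "E"): 6,
}


def score_alignment(top, bottom, pair_scores=PAIR_SCORES, gap=GAP):
    """Sum one score per alignment column; '-' marks a gap."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap                   # residue aligned to a gap
        else:
            total += pair_scores[(a, b)]   # residue aligned to residue
    return total


example_score = score_alignment("HEAGAWGHE-E", "--P-AW-HEAE")
print(example_score)
```

The alignment has eleven columns, five of which contain a gap, so five gap penalties enter the total. Scoring the exercise alignment would additionally need the score for the H/P pair, which the slide does not give.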

  7. Formal Description • Problem: PairSeqAlign • Input: two sequences x, y; scoring matrix s; gap penalty d; gap extension penalty e • Output: the optimal sequence alignment

  8. How Difficult Is This? • Consider two sequences of length n • There are $\binom{2n}{n} = \frac{(2n)!}{(n!)^2} \approx \frac{2^{2n}}{\sqrt{\pi n}}$ possible global alignments, and we need to find an optimal one from amongst those!

  9. So what? • So at n = 20, we have over 120 billion possible alignments • We want to be able to align much, much longer sequences • Some proteins have 1000 amino acids • Genes can have several thousand base pairs
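The growth in the number of alignments can be made concrete with a short calculation. The sketch below assumes the count meant here is the central binomial coefficient, binomial(2n, n), which is consistent with "over 120 billion" at n = 20. The function names are hypothetical.

```python
import math


def count_alignments(n):
    """Assumed count of global alignments of two length-n sequences."""
    return math.comb(2 * n, n)


def stirling_estimate(n):
    """Approximation 2^(2n) / sqrt(pi * n) of the same quantity."""
    return 2 ** (2 * n) / math.sqrt(math.pi * n)


for n in (5, 10, 20):
    print(n, count_alignments(n), round(stirling_estimate(n)))
```

Each step up in n multiplies the count by roughly four, which is why enumerating alignments is hopeless for sequences of realistic length.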

  10. Dynamic Programming • General algorithmic development technique • Reuses the results of previous computations • Store intermediate results in a table for reuse • Look up in table for earlier result to build from

  11. Global Alignment • Needleman-Wunsch (1970) • Idea: Build up optimal alignment from optimal alignments of subsequences • Start from an aligned prefix and add the score from the table for each of the three possible extensions:
        HEAG      -25   (aligned prefix)
        --P-
        HEAG-     -33   (gap with bottom)
        --P-A
        HEAGA     -20   (top and bottom)
        --P-A
        HEAGA     -33   (gap with top)
        --P--

  12. Global Alignment • Notation • x_i – ith letter of string x • y_j – jth letter of string y • x_{1..i} – prefix of x consisting of letters 1 through i • F – matrix of optimal scores • F(i,j) – optimal score for aligning x_{1..i} with y_{1..j} • d – gap penalty • s – scoring matrix

  13. Global Alignment • The work is to build up F • Initialize: F(0,0) = 0, F(i,0) = i·d, F(0,j) = j·d, where d is the (negative) gap penalty • Fill from top left to bottom right using the recursive relation

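  14. Global Alignment • Recursive relation:
        $F(i,j) = \max \{\, F(i-1,j-1) + s(x_i,y_j),\ F(i-1,j) + d,\ F(i,j-1) + d \,\}$
      • $F(i-1,j-1) + s(x_i,y_j)$: move ahead in both • $F(i-1,j) + d$: $x_i$ aligned to gap • $F(i,j-1) + d$: $y_j$ aligned to gap • While building the table, keep track of where each optimal score came from, using arrows that point back to the source cell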
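The table-filling step can be sketched in Python as follows. Everything named here is an assumption made for illustration: `simple_score` is a toy +1/-1 scoring function standing in for a PAM or BLOSUM matrix, the gap penalty d is passed as a negative number, and the arrows are stored as labels ("diag", "x_to_gap", "y_to_gap") rather than directions, since which way "up" and "left" point depends on how the table is laid out.

```python
def simple_score(a, b):
    """Toy scoring function (assumption): +1 for a match, -1 otherwise."""
    return 1 if a == b else -1


def needleman_wunsch_fill(x, y, s, d):
    """Fill F and the back-pointer table for global alignment.

    x, y: sequences; s: function s(a, b) -> score; d: negative gap penalty.
    F[i][j] is the best score for aligning x[:i] with y[:j].
    """
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # first column: x prefix against gaps
        F[i][0] = i * d
        ptr[i][0] = "x_to_gap"
    for j in range(1, m + 1):          # first row: y prefix against gaps
        F[0][j] = j * d
        ptr[0][j] = "y_to_gap"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            choices = [
                (F[i - 1][j - 1] + s(x[i - 1], y[j - 1]), "diag"),
                (F[i - 1][j] + d, "x_to_gap"),   # x_i aligned to a gap
                (F[i][j - 1] + d, "y_to_gap"),   # y_j aligned to a gap
            ]
            F[i][j], ptr[i][j] = max(choices, key=lambda c: c[0])
    return F, ptr


F, ptr = needleman_wunsch_fill("ACGT", "AGT", simple_score, -2)
for row in F:
    print(row)
```

The bottom-right entry `F[-1][-1]` is the optimal global score. When several choices tie, this sketch keeps the first one in the list, which is an arbitrary tie-breaking rule.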

  15. Example

  16. Completed Table

  17. Traceback • Trace arrows back from the lower right to top left • Diagonal – both • Up – upper gap • Left – lower gap
        HEAGAWGHE-E
        --P-AW-HEAE
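Traceback can be sketched by storing one arrow per cell while filling the table and then following the arrows from the lower-right cell back to the origin. As before, this is a minimal illustration under assumptions: `simple_score` is a toy scoring function, d is a negative gap penalty, and `global_align` is a hypothetical name.

```python
def simple_score(a, b):
    """Toy scoring function (assumption): +1 for a match, -1 otherwise."""
    return 1 if a == b else -1


def global_align(x, y, s, d):
    """Return (aligned x, aligned y, score) for one optimal global alignment."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    move = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0], move[i][0] = i * d, "x_to_gap"
    for j in range(1, m + 1):
        F[0][j], move[0][j] = j * d, "y_to_gap"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j], move[i][j] = max(
                (F[i - 1][j - 1] + s(x[i - 1], y[j - 1]), "diag"),
                (F[i - 1][j] + d, "x_to_gap"),
                (F[i][j - 1] + d, "y_to_gap"),
                key=lambda c: c[0],
            )
    # Follow the stored arrows from the last cell back to (0, 0).
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        step = move[i][j]
        if step == "diag":             # both letters consumed
            top.append(x[i - 1])
            bottom.append(y[j - 1])
            i, j = i - 1, j - 1
        elif step == "x_to_gap":       # gap in the y row
            top.append(x[i - 1])
            bottom.append("-")
            i -= 1
        else:                          # "y_to_gap": gap in the x row
            top.append("-")
            bottom.append(y[j - 1])
            j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom)), F[n][m]


aligned_x, aligned_y, score = global_align("ACGT", "AGT", simple_score, -2)
print(aligned_x)
print(aligned_y)
print(score)
```

The alignment is built back to front, so both rows are reversed at the end. Only one optimal alignment is reported even when several exist.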

  18. Summary • Uses a recurrence to fill in a table of intermediate results • Uses O(nm) space and time • O(n^2) algorithm when both sequences have length n • Feasible for moderate sized sequences, but not for aligning whole genomes.
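A rough size estimate shows why the table is manageable for proteins but not for genomes. The numbers below are illustrative assumptions only: one 4-byte score per cell, and 3 billion bases as a round figure of roughly human-genome scale.

```python
def table_cells(n, m):
    """Number of entries in the (n+1) x (m+1) score table F."""
    return (n + 1) * (m + 1)


BYTES_PER_CELL = 4  # assumption: one 32-bit integer score per cell

cases = [
    ("two 1000-residue proteins", 1_000, 1_000),
    ("two 3-billion-base genomes", 3_000_000_000, 3_000_000_000),
]
for label, n, m in cases:
    cells = table_cells(n, m)
    gigabytes = cells * BYTES_PER_CELL / 1e9
    print(f"{label}: {cells:.2e} cells, about {gigabytes:.2e} GB")
```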

  19. Local Alignment • Smith-Waterman (1981) • Another dynamic programming solution

  20. Example

  21. Traceback • Start at the highest score and trace back to the first 0
        AWGHE
        AW-HE
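A local alignment in the Smith-Waterman style differs from the global version in two ways: every cell may also take the value 0, and traceback starts at the highest-scoring cell and stops at the first 0. The sketch below assumes a toy +1/-1 scoring function and a negative gap penalty d; `local_align` is a hypothetical name.

```python
def simple_score(a, b):
    """Toy scoring function (assumption): +1 for a match, -1 otherwise."""
    return 1 if a == b else -1


def local_align(x, y, s, d):
    """Return (aligned x segment, aligned y segment, score)."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    move = [[None] * (m + 1) for _ in range(n + 1)]
    best, best_i, best_j = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j], move[i][j] = max(
                (0, None),   # start a fresh alignment here
                (F[i - 1][j - 1] + s(x[i - 1], y[j - 1]), "diag"),
                (F[i - 1][j] + d, "x_to_gap"),
                (F[i][j - 1] + d, "y_to_gap"),
                key=lambda c: c[0],
            )
            if F[i][j] > best:
                best, best_i, best_j = F[i][j], i, j
    # Trace back from the best cell until a cell with no arrow (score 0).
    top, bottom = [], []
    i, j = best_i, best_j
    while move[i][j] is not None:
        step = move[i][j]
        if step == "diag":
            top.append(x[i - 1])
            bottom.append(y[j - 1])
            i, j = i - 1, j - 1
        elif step == "x_to_gap":
            top.append(x[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(y[j - 1])
            j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom)), best


local_x, local_y, local_score = local_align(
    "XXACGTYY", "ZZACGTWW", simple_score, -2
)
print(local_x, local_y, local_score)
```

In this toy input only the shared ACGT stretch matches, so the flanking letters are left out of the reported alignment.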

  22. Summary • Similar to global alignment algorithm • For this to work, the expected score of a match with a random sequence must be negative • Behavior is like global alignment otherwise • Similar extensions for repeated and overlap matching • Care must be taken with gap penalties to maintain O(nm) time complexity

  23. Repeat and Overlap Matches • Repeat matches allow for sections of a sequence to match repeatedly • Repeated domain or motif • Overlap matches • Matching when the two sequences overlap • Does not penalize overhanging ends
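One common way to leave overhanging ends unpenalized is to initialize the first row and column of F to 0 and to take the final score from the best cell in the last row or last column. The sketch below follows that formulation and returns the score only. The scoring function and gap penalty are toy assumptions, and `overlap_score` is a hypothetical name.

```python
def simple_score(a, b):
    """Toy scoring function (assumption): +1 for a match, -1 otherwise."""
    return 1 if a == b else -1


def overlap_score(x, y, s, d):
    """Best overlap-match score; leading and trailing overhangs cost nothing."""
    n, m = len(x), len(y)
    # First row and column stay 0: an alignment may start anywhere on them.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                F[i - 1][j] + d,
                F[i][j - 1] + d,
            )
    # An alignment may end anywhere on the last row or last column.
    return max(max(F[n]), max(row[m] for row in F))


best_overlap = overlap_score("AAACGT", "CGTTTT", simple_score, -2)
print(best_overlap)
```

Here the end of the first sequence (CGT) overlaps the start of the second, and the unmatched AAA and TTT overhangs do not reduce the score.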

  24. BLAST • O(n^2) algorithms are too slow for large scale searches • BLAST developed by Altschul et al. (1990) • Uses a probabilistic approach to searching • Idea: True alignments will contain a short stretch of identities (a perfect match)

  25. BLAST Overview • Make a list of neighborhood words • Length 3 for proteins, 11 for nucleic acids • Words match the query with a score higher than some threshold • Usually 2 bits per residue • Scans the database for these words • When a hit is obtained, extends the match in both directions as an ungapped alignment
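The seed-and-extend idea can be illustrated with a heavily simplified sketch. It is not BLAST itself: it uses exact word matches instead of neighborhood words scored against a threshold, a toy +1/-1 scoring function, and a simple drop-off rule (`x_drop`) chosen for this sketch to decide when to stop extending. All function and parameter names are hypothetical.

```python
def simple_score(a, b):
    """Toy scoring function (assumption): +1 for a match, -1 otherwise."""
    return 1 if a == b else -1


def find_seeds(query, subject, w):
    """Return (query_pos, subject_pos) for every exact shared word of length w."""
    index = {}
    for i in range(len(query) - w + 1):
        index.setdefault(query[i:i + w], []).append(i)
    hits = []
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], []):
            hits.append((i, j))
    return hits


def _extend(query, subject, qi, sj, step, score, x_drop):
    """Walk in one direction; return (best gain, columns kept)."""
    best_gain, best_len = 0, 0
    gain, length = 0, 0
    while 0 <= qi < len(query) and 0 <= sj < len(subject):
        gain += score(query[qi], subject[sj])
        length += 1
        if gain > best_gain:
            best_gain, best_len = gain, length
        elif best_gain - gain > x_drop:
            break                      # fell too far below the best seen
        qi += step
        sj += step
    return best_gain, best_len


def extend_ungapped(query, subject, i, j, w, score, x_drop=2):
    """Extend a seed in both directions without gaps.

    Returns (query_start, subject_start, length, score).
    """
    seed = sum(score(query[i + k], subject[j + k]) for k in range(w))
    right_gain, right_len = _extend(query, subject, i + w, j + w, 1, score, x_drop)
    left_gain, left_len = _extend(query, subject, i - 1, j - 1, -1, score, x_drop)
    return (i - left_len, j - left_len, w + left_len + right_len,
            seed + left_gain + right_gain)


query, subject = "AAACGTGCA", "TTACGTGTT"
seeds = find_seeds(query, subject, 3)
hit = extend_ungapped(query, subject, seeds[0][0], seeds[0][1], 3, simple_score)
print(seeds)
print(hit)
```

Indexing the query words once and scanning the database against that index is what avoids filling a full dynamic programming table for every database sequence.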

  26. FASTA • Pearson & Lipman (1988) • Find all matching words of length ktup • 1 or 2 for proteins, 4 or 6 for DNA • Look for diagonals supporting word matches • Extend with ungapped alignment • Join ungapped regions with gaps
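The first two FASTA steps, finding word matches and looking for diagonals that support them, can be sketched as below. A word match at query position i and subject position j lies on diagonal i - j, so counting matches per diagonal highlights candidate regions. This covers only those two steps, not the later extension and joining, and `diagonal_hits` is a hypothetical name.

```python
from collections import Counter, defaultdict


def diagonal_hits(query, subject, ktup):
    """Count exact ktup-word matches on each diagonal (i - j)."""
    positions = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        positions[query[i:i + ktup]].append(i)
    diagonals = Counter()
    for j in range(len(subject) - ktup + 1):
        for i in positions.get(subject[j:j + ktup], ()):
            diagonals[i - j] += 1
    return diagonals


diagonals = diagonal_hits("ACGTGG", "TTACGTGG", 2)
print(diagonals.most_common(1))
```

In this toy input the query occurs in the subject shifted by two positions, so every word match falls on diagonal -2.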
