1 / 49

Sequence Alignments and Dynamic Programming

BIO/CS 471 – Algorithms for Bioinformatics. Sequence Alignments and Dynamic Programming. Module II: Sequence Alignments. Problem: SequenceAlignment Input: Two or more strings of characters

sandro
Download Presentation

Sequence Alignments and Dynamic Programming

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. BIO/CS 471 – Algorithms for Bioinformatics Sequence Alignmentsand Dynamic Programming

  2. Module II: Sequence Alignments • Problem: SequenceAlignment • Input: Two or more strings of characters • Output: The optimal alignment of the input strings, possibly including gaps, and an alignment score. • Example • Input: The calico cat was very sleepy. The cat was sleepy. • Output: The calico cat was very sleepy. The ------ cat was ---- sleepy. Sequence Alignments

  3. Biological Sequence Alignment • Input: DNA or Amino Acid sequence • Output: Optimal alignment • Example • Input: ACTATAGCTATAGCTATAGCGTATCG ACGATTACAGGTCGTGTCG • Output: ACTATAGCTATAGCTATAGCGTATCG || || ||*|| | |||*||| ACGAT---TACAGGT----CGTGTCG • Score: +8 Sequence Alignments

  4. Why align sequences? • Why would we want to align two sequences? • Compare two genes/proteins, e.g. to infer function • Database searches • Protein structure prediction • Why would we want to align multiple sequences? • Conservation analysis • Molecular evolution Sequence Alignments

  5. How do we decide on a scoring function? • Objective: Sequences that are homologous or evolutionarily related, should align with high scores. Less related sequences should score lower. • Question: How do homologous sequences arise? Sequence Alignments

  6. Transcription The Central Dogma DNA transcription  RNA translation  Proteins Sequence Alignments

  7. DNA Replication • Prior to cell division, all the genetic instructions must be “copied” so that each new cell will have a complete set • DNA polymerase is the enzyme that copies DNA • Reads the old strand in the 3´ to 5´ direction Sequence Alignments

  8. Over time, genes accumulate mutations • Environmental factors • Radiation • Oxidation • Mistakes in replication or repair • Deletions, Duplications • Insertions • Inversions • Point mutations

  9. Deletions • Codon deletion:ACG ATA GCG TAT GTA TAG CCG… • Effect depends on the protein, position, etc. • Almost always deleterious • Sometimes lethal • Frame shift mutation:ACG ATA GCG TAT GTA TAG CCG…ACG ATA GCG ATG TAT AGC CG?… • Almost always lethal Sequence Alignments

  10. Indels • Comparing two genes it is generally impossible to tell if an indel is an insertion in one gene, or a deletion in another, unless ancestry is known:ACGTCTGATACGCCGTATCGTCTATCTACGTCTGAT---CCGTATCGTCTATCT Sequence Alignments

  11. The Genetic Code Substitutions are mutations accepted by natural selection. Synonymous: CGC CGA Non-synonymous: GAU  GAA Sequence Alignments

  12. Two kinds of homologs • Orthologs • Species A has a particular gene • Species A diverges into species B and C, each with the same gene • The two genes accumulate substitutions independently Sequence Alignments

  13. Orthologs and Paralogs • Paralogs • A gene duplication event within a species results in two copies of a gene • One of the copies is now under less selective constraint • The copies can accumulate substitutions independently, before and after speciation Sequence Alignments

  14. Scoring a sequence alignment • Match score: +1 • Mismatch score: +0 • Gap penalty: –1ACGTCTGATACGCCGTATAGTCTATCT ||||| ||| || ||||||||----CTGATTCGC---ATCGTCTATCT • Matches: 18 × (+1) • Mismatches: 2 × 0 • Gaps: 7 × (– 1) Score = +11 Sequence Alignments

  15. Origination and length penalties • We want to find alignments that are evolutionarily likely. • Which of the following alignments seems more likely to you?ACGTCTGATACGCCGTATAGTCTATCTACGTCTGAT-------ATAGTCTATCTACGTCTGATACGCCGTATAGTCTATCTAC-T-TGA--CG-CGT-TA-TCTATCT • We can achieve this by penalizing more for a new gap, than for extending an existing gap   Sequence Alignments

  16. Scoring a sequence alignment (2) • Match/mismatch score: +1/+0 • Origination/length penalty: –2/–1ACGTCTGATACGCCGTATAGTCTATCT ||||| ||| || ||||||||----CTGATTCGC---ATCGTCTATCT • Matches: 18 × (+1) • Mismatches: 2 × 0 • Origination: 2 × (–2) • Length: 7 × (–1) Score = +7 Sequence Alignments

  17. Optimal Substructure in Alignments • Consider the alignment:ACGTCTGATACGCCGTATAGTCTATCT ||||| ||| || ||||||||----CTGATTCGC---ATCGTCTATCT • Is it true that the alignment in the boxed region must be optimal? Sequence Alignments

  18. A Greedy Strategy • Consider this pair of sequencesGAGCCAGC • Greedy Approach:G or G or -C - G • Leads toGAGC--- Better: GACG---CAGC CACG GAP = 1 Match = +1 Mismatch = 2 Sequence Alignments

  19. A +1 CTCGA CAGTAGA -1 CTCG- ACAGTAG- -1 ACTCGA CAGTAG Breaking apart the problem • Suppose we are aligning:ACTCGACAGTAG • First position choices: Sequence Alignments

  20. A Recursive Approach to Alignment • Choose the best alignment based on these three possibilities: align(seq1, seq2) { if (both sequences empty) {return 0;} if (one string empty) { return(gapscore * num chars in nonempty seq); else { score1 = score(firstchar(seq1),firstchar(seq2)) + align(tail(seq1), tail(seq2)); score2 = align(tail(seq1), seq2) + gapscore; score3 = align(seq1, tail(seq2) + gapscore; return(min(score1, score2, score3)); } } } Sequence Alignments

  21. Time Complexity of RecurseAlign • What is the recurrence equation for the time needed by RecurseAlign? 3 3 n 3 3 3 9 … 27 3 3 3n Sequence Alignments

  22. RecurseAlign repeats its work Sequence Alignments

  23. How can we find an optimal alignment? • Finding the alignment is computationally hard:ACGTCTGATACGCCGTATAGTCTATCTCTGAT---TCG—CATCGTC--T-ATCT • C(27,7) gap positions = ~888,000 possibilities • It’s possible, as long as we don’t repeat our work! • Dynamic programming: The Needleman & Wunsch algorithm Sequence Alignments

  24. What is the optimal alignment? • ACTCGACAGTAG • Match: +1 • Mismatch: 0 • Gap: –1 Sequence Alignments

  25. Needleman-Wunsch: Step 1 • Each sequence along one axis • Mismatch penalty multiples in first row/column • 0 in [1,1] (or [0,0] for the CS-minded) Sequence Alignments

  26. Needleman-Wunsch: Step 2 • Vertical/Horiz. move: Score + (simple) gap penalty • Diagonal move: Score + match/mismatch score • Take the MAX of the three possibilities Sequence Alignments

  27. Needleman-Wunsch: Step 2 (cont’d) • Fill out the rest of the table likewise… Sequence Alignments

  28. Needleman-Wunsch: Step 2 (cont’d) • Fill out the rest of the table likewise… • The optimal alignment score is calculated in the lower-right corner Sequence Alignments

  29. But what is the optimal alignment • To reconstruct the optimal alignment, we must determine of where the MAX at each step came from… Sequence Alignments

  30. A path corresponds to an alignment • = GAP in top sequence • = GAP in left sequence • = ALIGN both positions • One path from the previous table: • Corresponding alignment (start at the end):AC--TCG ACAGTAG Score = +2 Sequence Alignments

  31. Practice Problem • Find an optimal alignment for these two sequences: GCGGTT GCGT • Match: +1 • Mismatch: 0 • Gap: –1 Sequence Alignments

  32. Practice Problem • Find an optimal alignment for these two sequences: GCGGTT GCGT GCGGTTGCG-T- Score = +2 Sequence Alignments

  33. Semi-global alignment • Suppose we are aligning:GCGGGCG • Which do you prefer?G-CG -GCGGGCG GGCG • Semi-global alignment allows gaps at the ends for free. Sequence Alignments

  34. Semi-global alignment • Semi-global alignment allows gaps at the ends for free. • Initialize first row and column to all 0’s • Allow free horizontal/vertical moves in last row and column Sequence Alignments

  35. Local alignment • Global alignments – score the entire alignment • Semi-global alignments – allow unscored gaps at the beginning or end of either sequence • Local alignment – find the best matching subsequence • CGATGAAATGGA • This is achieved by allowing a 4th alternative at each position in the table: zero. Sequence Alignments

  36. Local alignment • Mismatch = –1 this time CGATGAAATGGA Sequence Alignments

  37. Food for thought… • What is the asymptotic time complexity of the Needleman & Wunsch algorithm? • Is this good or bad? • How about the space complexity? • Why might this be a problem? Sequence Alignments

  38. Saving Space • Note that we can throw away the previous rows of the table as we fill it in: This row is based only on this one Sequence Alignments

  39. Saving Space (2) • Each row of the table contains the scores for aligning a prefix of the left-hand sequence with all prefixes of the top sequence: Scores for aligning aca with all prefixes of actcg Sequence Alignments

  40. Divide and Conquer • By using a recursive approach, we can use only two rows of the matrix at a time: • Choose the middle character of the top sequence, i • Find out where i aligns to the bottom sequence • Needs two vectors of scores • Recursively align the sequences before and after the fixed positions i ACGCTATGCTCATAG CGACGCTCATCG Sequence Alignments

  41. Finding where i lines up • Find out where i aligns to the bottom sequence • Needs two vectors of scores • Assuming i lines up with a character:alignscore = align(ACGCTAT, prefix(t)) + score(G, char from t) + align(CTCATAG, suffix(t)) • Which character is best? • Can quickly find out the score for aligning ACGCTAT with every prefix of t. i s: ACGCTATGCTCATAG t: CGACGCTCATCG Sequence Alignments

  42. Finding where i lines up • But, i may also line up with a gap • Assuming i lines up with a gap:alignscore = align(ACGCTAT, prefix(t)) + gapscore + align(CTCATAG, suffix(t)) i s: ACGCTATGCTCATAG t: CGACGCTCATCG Sequence Alignments

  43. Recursive Call • Fix the best position for I • Call align recursively for the prefixes and suffixes: i s: ACGCTATGCTCATAG t: CGACGCTCATCG Sequence Alignments

  44. Complexity • Let len(s) = m and len(t) = n • Space: 2m • Time: • Each call to build similarity vector = m´n´ • First call + recursive call: i s: ACGCTATGCTCATAG t: CGACGCTCATCG j Sequence Alignments

  45. G - - G or General Gap Penalties • Suppose we are no longer using simple gap penalties: • Origination = −2 • Length = −1 • Consider the last position of the alignment for ACGTA with ACG • We can’t determine the score forunless we know the previous positions! Sequence Alignments

  46. A A C --- A TATCCG A C T AC A C T ACC T ------ C G C -- Scoring Blocks • Now we must score a block at a time • A block is a pair of characters, or a maximal group of gaps paired with characters • To score a position, we need to either start a new block or add it to a previous block Sequence Alignments

  47. The Algorithm • Three tables • a – scores for alignments ending in char-char blocks • b – scores for alignments ending in gaps in the top sequence (s) • c – scores for alignments ending in gaps in the left sequence (t) • Scores no longer depend on only three positions, because we can put any number of gaps into the last block Sequence Alignments

  48. The Recurrences Sequence Alignments

  49. The Optimal Alignment • The optimal alignment is found by looking at the maximal value in the lower right of all three arrays • The algorithm runs in O(n3) time • Uses O(n2) space Sequence Alignments

More Related