
Sequence Alignment


monikad

Presentation Transcript


  1. Sequence Alignment Bioinformatics

  2. Sequence Comparison • Problem: Given two sequences S & T, are S and T similar? • Need to establish some notion of similarity • Edit distance (transforming S to T) • Scoring mechanism • Related Problem: Given a target sequence, obtain sequences in a database that are similar to the target

  3. Edit Distance • Sequences S and T are strings over an alphabet (e.g.,{a,c,t,g}) • Edit operations (indels) • Insertion of a character • Deletion of a character • Example: need 3 indels to transform attc to tttac
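The indel count in the example above can be checked with a small dynamic-programming sketch. The function name `indel_distance` is a hypothetical helper, not something from the slides, and it allows only insertions and deletions (no substitutions), as the slide does.

```python
def indel_distance(s: str, t: str) -> int:
    """Minimum number of single-character insertions and deletions turning s into t."""
    # dist[i][j] = indels needed to turn the prefix s[:i] into the prefix t[:j]
    dist = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(len(s) + 1):
        dist[i][0] = i  # delete all i characters of s[:i]
    for j in range(len(t) + 1):
        dist[0][j] = j  # insert all j characters of t[:j]
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            if s[i - 1] == t[j - 1]:
                # Equal characters can be kept at no cost
                dist[i][j] = dist[i - 1][j - 1]
            else:
                # Otherwise delete from s or insert from t
                dist[i][j] = 1 + min(dist[i - 1][j], dist[i][j - 1])
    return dist[len(s)][len(t)]


print(indel_distance("attc", "tttac"))  # 3
```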

  4. Alignment • We can model edit distance by aligning the two strings: S’ = -att-c over T’ = t-ttac • An alignment of strings S and T is described by two strings S’ and T’ of the same length such that • S’ (T’) contains the characters of S (T) in order, interspersed with spaces (-) • No position contains a space in both S’ and T’
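The three conditions in this definition translate directly into a check. The sketch below uses a hypothetical helper name, `is_valid_alignment`, and assumes "-" is the space character and never occurs in S or T themselves.

```python
def is_valid_alignment(s: str, t: str, s_aligned: str, t_aligned: str) -> bool:
    """Check that (s_aligned, t_aligned) is an alignment of s and t."""
    # Both aligned strings must have the same length
    if len(s_aligned) != len(t_aligned):
        return False
    # Removing the spaces must give back the original strings, in order
    if s_aligned.replace("-", "") != s or t_aligned.replace("-", "") != t:
        return False
    # No column may hold a space in both strings
    return all(not (a == "-" and b == "-") for a, b in zip(s_aligned, t_aligned))


print(is_valid_alignment("attc", "tttac", "-att-c", "t-ttac"))  # True
```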

  5. Gaps, Matches, and Mismatches • When comparing characters that occur in the same positions in S’ and T’, four possibilities arise • - in S’ -> insertion (gap) • - in T’ -> deletion (gap) • Characters match -> match • Characters don’t match -> mismatch • Can assign weights to each possibility (usually a positive number for matches, a negative number for gaps and mismatches)

  6. Scoring and Optimal Alignments • Given strings S and T, and an alignment (S’,T’), a score can be computed based on pre-established weights for gaps, matches, and mismatches • Add all the weights for each position in S’ and T’ • Note that there are many possible alignments for S and T • An optimal alignment for S and T is the alignment that yields the maximum score
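Scoring a given alignment is a single pass over its columns. The sketch below is one way to write it; the function name `score_alignment` is hypothetical, and the default weights (5 for a match, -2 for a mismatch, -2 for a gap) are assumptions borrowed from the edit-graph example later in the deck.

```python
def score_alignment(s_aligned: str, t_aligned: str,
                    match: int = 5, mismatch: int = -2, gap: int = -2) -> int:
    """Sum the weights of every column of an alignment (assumed weights)."""
    if len(s_aligned) != len(t_aligned):
        raise ValueError("aligned strings must have the same length")
    total = 0
    for a, b in zip(s_aligned, t_aligned):
        if a == "-" and b == "-":
            raise ValueError("a column may not contain two spaces")
        if a == "-" or b == "-":
            total += gap        # insertion or deletion
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


# Three gaps and three matches: 3 * (-2) + 3 * 5
print(score_alignment("-att-c", "t-ttac"))  # 9
```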

  7. Problem Formulations for Sequence Comparison • Original Formulation: Given two sequences S & T, are S and T similar? • Revised Formulation: Given two sequences S & T, and weights for matches, gaps, and mismatches, determine the score of an optimal alignment of S & T

  8. Brute-force Algorithm
       Compare(S, T)
         generate all possible alignments for S and T
         for each alignment
           determine score
         return maximum score
     Note: This is an exponential algorithm due to the number of possible alignments for S and T
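A minimal sketch of the brute-force idea, assuming the same illustrative weights as elsewhere (5, -2, -2): rather than materializing every alignment, the recursion tries each possible first column and keeps the best total, which explores the same exponential set of alignments. Both function names are hypothetical. `count_alignments` shows how quickly that set grows.

```python
def brute_force_score(s: str, t: str,
                      match: int = 5, mismatch: int = -2, gap: int = -2) -> int:
    """Best alignment score found by trying every possible first column."""
    if not s and not t:
        return 0
    scores = []
    if s and t:
        # First column pairs s[0] with t[0]
        pair = match if s[0] == t[0] else mismatch
        scores.append(pair + brute_force_score(s[1:], t[1:], match, mismatch, gap))
    if s:
        # First column pairs s[0] with a space
        scores.append(gap + brute_force_score(s[1:], t, match, mismatch, gap))
    if t:
        # First column pairs a space with t[0]
        scores.append(gap + brute_force_score(s, t[1:], match, mismatch, gap))
    return max(scores)


def count_alignments(n: int, m: int) -> int:
    """Number of distinct alignments of strings with lengths n and m."""
    if n == 0 or m == 0:
        return 1
    return (count_alignments(n - 1, m - 1)
            + count_alignments(n - 1, m)
            + count_alignments(n, m - 1))


print(brute_force_score("attc", "tttac"))  # 11
print(count_alignments(2, 2))              # 13
```

This is only practical for very short strings, which is the point of the slide's note.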

  9. An Edit Graph

  10. Edit Graphs are Alignments • Path from upper left corner to lower right corner represents an alignment • Vertical arrow: gap (deletion) • Horizontal arrow: gap (insertion) • Diagonal: match or mismatch • Alignment: AT-C-TGAT over -TGCAT-A- • Score (assume 5 for a match, -2 for a gap or mismatch): -2 + 5 - 2 + 5 - 2 + 5 - 2 + 5 - 2 = 10
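The correspondence between a path and an alignment can be sketched as follows. Several things here are assumptions, since the figure itself is not in the transcript: S = ATCTGAT is taken to run along the vertical axis and T = TGCATA along the horizontal axis, the slide's alignment is read as the pair AT-C-TGAT over -TGCAT-A-, and the move letters "V", "H", "D" and the function name are hypothetical.

```python
def path_to_alignment(s: str, t: str, path: str) -> tuple[str, str]:
    """Turn a path of moves (V, H, D) through the edit graph into an alignment."""
    i = j = 0
    s_aligned, t_aligned = [], []
    for move in path:
        if move == "D":      # diagonal: match or mismatch
            s_aligned.append(s[i]); t_aligned.append(t[j])
            i += 1; j += 1
        elif move == "V":    # vertical: deletion, space in T'
            s_aligned.append(s[i]); t_aligned.append("-")
            i += 1
        elif move == "H":    # horizontal: insertion, space in S'
            s_aligned.append("-"); t_aligned.append(t[j])
            j += 1
        else:
            raise ValueError(f"unknown move {move!r}")
    if i != len(s) or j != len(t):
        raise ValueError("path does not end at the lower right corner")
    return "".join(s_aligned), "".join(t_aligned)


print(path_to_alignment("ATCTGAT", "TGCATA", "VDHDHDVDV"))
# ('AT-C-TGAT', '-TGCAT-A-')
```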

  11. Entries in an Edit Graph • Strategy: Fill up the intersections (green circles) with (running) scores based on the path traversed so far • Each circle x can be computed from at most three other values: a (the diagonal neighbor of x) and b and c (the two other adjacent entries, one above x and one to its left) • $x \in \{\, a + \text{match/mismatch weight},\; b + \text{gap weight},\; c + \text{gap weight} \,\}$

  12. Dynamic Programming Algorithm • Start with upper left corner (score 0) • Fill up top row and leftmost column • Fill up succeeding rows using the formula • Resulting value on the lower right corner is the optimal score • $x = \max(a + \text{match/mismatch weight},\; b + \text{gap weight},\; c + \text{gap weight})$
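The four steps on this slide can be sketched as a table fill. The function name `optimal_score` and the default weights (5, -2, -2) are assumptions for illustration; rows are indexed by S and columns by T.

```python
def optimal_score(s: str, t: str,
                  match: int = 5, mismatch: int = -2, gap: int = -2) -> int:
    """Score of an optimal alignment of s and t, computed by dynamic programming."""
    rows, cols = len(s) + 1, len(t) + 1
    table = [[0] * cols for _ in range(rows)]   # upper left corner starts at 0
    for i in range(1, rows):                    # leftmost column: only gaps
        table[i][0] = table[i - 1][0] + gap
    for j in range(1, cols):                    # top row: only gaps
        table[0][j] = table[0][j - 1] + gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            table[i][j] = max(
                table[i - 1][j - 1] + pair,     # diagonal neighbor
                table[i - 1][j] + gap,          # neighbor above
                table[i][j - 1] + gap,          # neighbor to the left
            )
    return table[-1][-1]                        # lower right corner


print(optimal_score("attc", "tttac"))  # 11
```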

  13. Algorithm Analysis • Let N be the length of both S and T • Need to compute (N+1)(N+1) entries • $O(N^2)$ algorithm

  14. Determining the Actual Alignment • Need to remember which neighbor contributed to the computation of each entry (which of the candidate values was the maximum) • Perform a back-trace from lower right corner back to the upper left corner • Multiple optimal alignments possible because of ties
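A sketch of the back-trace, under the same assumed weights (5, -2, -2): alongside each score, the table records which neighbor produced the maximum. The function name `align` and the move letters are hypothetical, and ties are broken arbitrarily here (diagonal first), so this returns only one of possibly several optimal alignments.

```python
def align(s: str, t: str,
          match: int = 5, mismatch: int = -2, gap: int = -2) -> tuple[int, str, str]:
    """Return (optimal score, S', T') for one optimal alignment of s and t."""
    rows, cols = len(s) + 1, len(t) + 1
    score = [[0] * cols for _ in range(rows)]
    move = [[""] * cols for _ in range(rows)]   # which neighbor gave the maximum
    for i in range(1, rows):
        score[i][0], move[i][0] = score[i - 1][0] + gap, "V"
    for j in range(1, cols):
        score[0][j], move[0][j] = score[0][j - 1] + gap, "H"
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            candidates = [
                (score[i - 1][j - 1] + pair, "D"),
                (score[i - 1][j] + gap, "V"),
                (score[i][j - 1] + gap, "H"),
            ]
            # max keeps the first candidate on ties, so diagonals are preferred
            score[i][j], move[i][j] = max(candidates, key=lambda c: c[0])
    # Back-trace from the lower right corner to the upper left corner
    i, j = len(s), len(t)
    s_aligned, t_aligned = [], []
    while i > 0 or j > 0:
        if move[i][j] == "D":
            s_aligned.append(s[i - 1]); t_aligned.append(t[j - 1])
            i -= 1; j -= 1
        elif move[i][j] == "V":
            s_aligned.append(s[i - 1]); t_aligned.append("-")
            i -= 1
        else:
            s_aligned.append("-"); t_aligned.append(t[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(s_aligned)), "".join(reversed(t_aligned))


best, s_al, t_al = align("attc", "tttac")
print(best)
print(s_al)
print(t_al)
```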

  15. Other Complexity Issues • When performing a search on a database, time complexity depends on the size D of the database, since you run the algorithm on each sequence in the database: $O(DN^2)$ • Space requirement: an (N+1)(N+1) table • Can improve to 4N if we fill up the table by “inverted Ls”: topmost row and leftmost column first, then the next inner row and column, one stage at a time
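The sketch below illustrates both points, with one substitution stated plainly: instead of the slide's "inverted L" order, it fills the table row by row and keeps only the previous and current rows, which is a different but common way to get linear space when only the score is needed. The function names, the default weights (5, -2, -2), and the toy database are all assumptions.

```python
def optimal_score_linear_space(s: str, t: str,
                               match: int = 5, mismatch: int = -2, gap: int = -2) -> int:
    """Optimal alignment score using two rows of the table instead of all of it."""
    prev = [j * gap for j in range(len(t) + 1)]     # top row
    for i in range(1, len(s) + 1):
        curr = [i * gap] + [0] * len(t)             # leftmost entry of row i
        for j in range(1, len(t) + 1):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + pair, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[-1]


def search_database(target: str, database: list[str]) -> list[tuple[int, str]]:
    """Score the target against every database sequence, best first."""
    # One full alignment per sequence, hence the factor D in the running time
    scored = [(optimal_score_linear_space(target, seq), seq) for seq in database]
    return sorted(scored, reverse=True)


print(search_database("attc", ["tttac", "attc", "gggg"]))
# [(20, 'attc'), (11, 'tttac'), (-8, 'gggg')]
```

Keeping only two rows gives the score but not the alignment itself; recovering the alignment in linear space needs a more involved method.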

  16. Variations • Scoring mechanism is driven by the weights for gaps, matches and mismatches • Can have different weights for starting a gap versus extending a gap (e.g., blastp and blastn) • Can have a table that allows different match/mismatch scores (e.g., BLOSUM)
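Both variations can be illustrated by scoring a given alignment. Everything specific here is an assumption: the function name, the convention that the first column of a gap costs `gap_open` and each further column costs `gap_extend`, the weight values, and `TOY_TABLE`, which is a made-up nucleotide table and not BLOSUM or any table used by BLAST.

```python
BASES = "acgt"
TRANSITIONS = {("a", "g"), ("g", "a"), ("c", "t"), ("t", "c")}
# Hypothetical table: 5 for identical bases, -1 for a<->g or c<->t, -2 otherwise
TOY_TABLE = {
    (x, y): 5 if x == y else (-1 if (x, y) in TRANSITIONS else -2)
    for x in BASES for y in BASES
}


def score_with_affine_gaps(s_aligned: str, t_aligned: str, substitution: dict,
                           gap_open: int = -5, gap_extend: int = -1) -> int:
    """Score an alignment with a substitution table and open/extend gap weights."""
    if len(s_aligned) != len(t_aligned):
        raise ValueError("aligned strings must have the same length")
    total = 0
    prev_gap_in = None   # string that held the space in the previous column
    for a, b in zip(s_aligned, t_aligned):
        if a == "-" and b == "-":
            raise ValueError("a column may not contain two spaces")
        if a == "-" or b == "-":
            gap_in = "s" if a == "-" else "t"
            # Continuing a gap in the same string is cheaper than starting one
            total += gap_extend if gap_in == prev_gap_in else gap_open
            prev_gap_in = gap_in
        else:
            total += substitution[(a, b)]
            prev_gap_in = None
    return total


# 5 + 5 + (-5) + (-1) + 5
print(score_with_affine_gaps("at--c", "atggc", TOY_TABLE))  # 9
```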
