1 / 31

An Introduction to Sequence Alignment

An Introduction to Sequence Alignment. July 16, 2003 Sining Chen Expressionist Meeting. What?. DNA/RNA sequences: strings composed of an alphabet of 4 letters Protein sequences: alphabet of 20 letters. Why?. Identify a gene Find clues to gene function (ortholog?)

tacita
Download Presentation

An Introduction to Sequence Alignment

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. An Introduction to Sequence Alignment July 16, 2003 Sining Chen Expressionist Meeting

  2. What? • DNA/RNA sequences: strings composed of an alphabet of 4 letters • Protein sequences: alphabet of 20 letters

  3. Why? • Identify a gene • Find clues to gene function (ortholog?) • Find other organisms with this gene (homology) • Gather info for an evolutionary model • …

  4. An Example • GCGCATGGATTGAGCGA • TGCGCCATTGATGACCA A possible alignment: -GCGC-ATGGATTGAGCGA TGCGCCATTGAT-GACC-A

  5. Alignments • -GCGC-ATGGATTGAGCGA • TGCGCCATTGAT-GACC-A • Three elements: • Perfect matches • Mismatches • Insertions & deletions (indel)

  6. Significant Similarity in an Alignment • Ho: the current alignment is a result of random line-up (the 2 sequences are unrelated) • Ha: the sequences diverge from a common ancestor (related) • Test statistic: Ymax = length of the longest running perfect match subsequence

  7. Exact Matching Subsequences • In DNA alignment, the matching probability • Under Ho lengths of exact match subseq Y should follow a geometric distribution

  8. Well-matching Subsequences • Evolution may cause small differences to even sequences with a reasonably recent common ancestor. • We consider Ymax to be the longest subseq with up to k mismatches. • Y follow hyper-geometric distribution • P-value: exact/simulated/approximate (independence among Y does not hold any more)

  9. Choosing Alignments There are many possible alignments For example, compare: -GCGC-ATGGATTGAGCGA TGCGCCATTGAT-GACC-A to ------GCGCATGGATTGAGCGA TGCGCC----ATTGATGACCA-- Which one is better?

  10. Scoring Rule • Example Score = (# matches) – (# mismatches) – (# indels) x 2

  11. Example Example: -GCGC-ATGGATTGAGCGA TGCGCCATTGAT-GACC-A Score: (+1x13) + (-1x2) + (-2x4) = 3 ------GCGCATGGATTGAGCGA TGCGCC----ATTGATGACCA-- Score: (+1x5) + (-1x6) + (-2x11) = -23

  12. Edit Distance • The edit distance between two sequences is the “cost” of the “cheapest” set of edit operations needed to transform one sequence into the other • Computing edit distance between two sequences almost equivalent to finding the alignment that minimizes the distance

  13. Computing Edit Distance • How can we compute the edit distance?? • If |s| = n and |t| = m, there are more than alignments • 2 sequences each of length 1000: > 10^600 • The additive form of the score allows to perform dynamic programming to compute edit distance efficiently

  14. Recursive Argument • Suppose we have two sequences:s[1..n+1] and t[1..m+1] The best alignment must be in one of three cases: 1. Last position is (s[n+1],t[m +1] ) 2. Last position is (s[n +1],-) 3. Last position is (-, t[m +1] )

  15. Recursive Argument • Suppose we have two sequences:s[1..n+1] and t[1..m+1] The best alignment must be in one of three cases: 1. Last position is (s[n+1],t[m +1] ) 2. Last position is (s[n +1],-) 3. Last position is (-, t[m +1] )

  16. Recursive Argument • Suppose we have two sequences:s[1..n+1] and t[1..m+1] The best alignment must be in one of three cases: 1. Last position is (s[n+1],t[m +1] ) 2. Last position is (s[n +1],-) 3. Last position is (-, t[m +1] )

  17. Recursive Argument Define the notation: • Using the recursive argument, we get the following recurrence for V:

  18. Recursive Argument • Of course, we also need to handle the base cases in the recursion:

  19. Dynamic ProgrammingAlgorithm We fill the matrix using the recurrence rule

  20. Conclusion: d(AAAC,AGC) = -1 DynamicProgramming Algorithm

  21. Reconstructing the Best Alignment • We now trace back the path the corresponds to the best alignment AAAC AG-C

  22. Reconstructing the Best Alignment • Sometimes, more than one alignment has the best score AAAC A-GC

  23. Complexity Space: O(mn) Time: O(mn) • Filling the matrix O(mn) • Backtrace O(m+n)

  24. Other scoring schemes • Needleman and Wunsch: 1 for identical amino acid, 0 otherwise • Dayhoff PAM scoring matrix: variations include BLOSUM matrices(Henikoff and Henikoff 1992, Proc. Nat. Acad. Sci. 89, 10915-10919). • … • Different Gap Cost Function

  25. Scoringmatrix for protein sequences

  26. Discussed above are global alignment algorithms • For local alignment algorithm: use score 0 as threshold as in: Smith, Waterman (1981), J. Mol. Biol., 147, 195-197

  27. There exist other algorithms faster than dynamic programming under specific assumptions, e.g. BLAST, MegaBLAST • Zhang Z, Schwartz S, Wagner L, Miller W. 2000. A greedy algorithm for aligning DNA sequences. J Comp Biol. 7:203-214. Assume only sequencing error to search for very similar sequences. MegaBLAST behind NCBI HomoloGene.

  28. Multiple Global Alignment • CLUSTAL W (Thompson, Higgins, Gibson 1994, Nucleic Acids Res, 22, 4673-4680) • Ungapped local alignments: Lawrence et.al. 1993, Science 262, 208-214 • Gapped local alignments: Zhu et.al., 1998, Bayesian adaptive sequence alignment algorithms. Bioinformatics 14, 25-39

  29. BLAST • Search a sequence database for fragments similar to the query sequence • Compile a list of high-scoring short words shared by the query sequence and the database; • Scan the database for “hits” • Expand the “hits” to MSP (maximum segment pair = a pair of equal-length/no-gap segments with the highest alignment score)

  30. BLAST • Altschul, et. al. (1990) "Basic local alignment search tool." J. Mol. Biol. 215:403-410. • Variations of BLAST designed for specific purposes • http://www.ncbi.nlm.nih.gov/BLAST/

  31. Advertisement…

More Related