310 likes | 445 Views
This document provides an overview of sequence alignment, a fundamental technique used in bioinformatics for comparing DNA, RNA, and protein sequences. It discusses the importance of alignment in gene identification, functional clues, and evolutionary studies. Examples illustrate alignment techniques with matching, mismatches, insertions, and deletions. The document covers scoring methods, edit distances, and the dynamic programming algorithm used to compute alignments efficiently. It concludes by mentioning local and global alignment algorithms and modern tools like BLAST for sequence analysis.
E N D
An Introduction to Sequence Alignment July 16, 2003 Sining Chen Expressionist Meeting
What? • DNA/RNA sequences: strings composed of an alphabet of 4 letters • Protein sequences: alphabet of 20 letters
Why? • Identify a gene • Find clues to gene function (ortholog?) • Find other organisms with this gene (homology) • Gather info for an evolutionary model • …
An Example • GCGCATGGATTGAGCGA • TGCGCCATTGATGACCA A possible alignment: -GCGC-ATGGATTGAGCGA TGCGCCATTGAT-GACC-A
Alignments • -GCGC-ATGGATTGAGCGA • TGCGCCATTGAT-GACC-A • Three elements: • Perfect matches • Mismatches • Insertions & deletions (indel)
Significant Similarity in an Alignment • Ho: the current alignment is a result of random line-up (the 2 sequences are unrelated) • Ha: the sequences diverge from a common ancestor (related) • Test statistic: Ymax = length of the longest running perfect match subsequence
Exact Matching Subsequences • In DNA alignment, the matching probability • Under Ho lengths of exact match subseq Y should follow a geometric distribution
Well-matching Subsequences • Evolution may cause small differences to even sequences with a reasonably recent common ancestor. • We consider Ymax to be the longest subseq with up to k mismatches. • Y follow hyper-geometric distribution • P-value: exact/simulated/approximate (independence among Y does not hold any more)
Choosing Alignments There are many possible alignments For example, compare: -GCGC-ATGGATTGAGCGA TGCGCCATTGAT-GACC-A to ------GCGCATGGATTGAGCGA TGCGCC----ATTGATGACCA-- Which one is better?
Scoring Rule • Example Score = (# matches) – (# mismatches) – (# indels) x 2
Example Example: -GCGC-ATGGATTGAGCGA TGCGCCATTGAT-GACC-A Score: (+1x13) + (-1x2) + (-2x4) = 3 ------GCGCATGGATTGAGCGA TGCGCC----ATTGATGACCA-- Score: (+1x5) + (-1x6) + (-2x11) = -23
Edit Distance • The edit distance between two sequences is the “cost” of the “cheapest” set of edit operations needed to transform one sequence into the other • Computing edit distance between two sequences almost equivalent to finding the alignment that minimizes the distance
Computing Edit Distance • How can we compute the edit distance?? • If |s| = n and |t| = m, there are more than alignments • 2 sequences each of length 1000: > 10^600 • The additive form of the score allows to perform dynamic programming to compute edit distance efficiently
Recursive Argument • Suppose we have two sequences:s[1..n+1] and t[1..m+1] The best alignment must be in one of three cases: 1. Last position is (s[n+1],t[m +1] ) 2. Last position is (s[n +1],-) 3. Last position is (-, t[m +1] )
Recursive Argument • Suppose we have two sequences:s[1..n+1] and t[1..m+1] The best alignment must be in one of three cases: 1. Last position is (s[n+1],t[m +1] ) 2. Last position is (s[n +1],-) 3. Last position is (-, t[m +1] )
Recursive Argument • Suppose we have two sequences:s[1..n+1] and t[1..m+1] The best alignment must be in one of three cases: 1. Last position is (s[n+1],t[m +1] ) 2. Last position is (s[n +1],-) 3. Last position is (-, t[m +1] )
Recursive Argument Define the notation: • Using the recursive argument, we get the following recurrence for V:
Recursive Argument • Of course, we also need to handle the base cases in the recursion:
Dynamic ProgrammingAlgorithm We fill the matrix using the recurrence rule
Conclusion: d(AAAC,AGC) = -1 DynamicProgramming Algorithm
Reconstructing the Best Alignment • We now trace back the path the corresponds to the best alignment AAAC AG-C
Reconstructing the Best Alignment • Sometimes, more than one alignment has the best score AAAC A-GC
Complexity Space: O(mn) Time: O(mn) • Filling the matrix O(mn) • Backtrace O(m+n)
Other scoring schemes • Needleman and Wunsch: 1 for identical amino acid, 0 otherwise • Dayhoff PAM scoring matrix: variations include BLOSUM matrices(Henikoff and Henikoff 1992, Proc. Nat. Acad. Sci. 89, 10915-10919). • … • Different Gap Cost Function
Discussed above are global alignment algorithms • For local alignment algorithm: use score 0 as threshold as in: Smith, Waterman (1981), J. Mol. Biol., 147, 195-197
There exist other algorithms faster than dynamic programming under specific assumptions, e.g. BLAST, MegaBLAST • Zhang Z, Schwartz S, Wagner L, Miller W. 2000. A greedy algorithm for aligning DNA sequences. J Comp Biol. 7:203-214. Assume only sequencing error to search for very similar sequences. MegaBLAST behind NCBI HomoloGene.
Multiple Global Alignment • CLUSTAL W (Thompson, Higgins, Gibson 1994, Nucleic Acids Res, 22, 4673-4680) • Ungapped local alignments: Lawrence et.al. 1993, Science 262, 208-214 • Gapped local alignments: Zhu et.al., 1998, Bayesian adaptive sequence alignment algorithms. Bioinformatics 14, 25-39
BLAST • Search a sequence database for fragments similar to the query sequence • Compile a list of high-scoring short words shared by the query sequence and the database; • Scan the database for “hits” • Expand the “hits” to MSP (maximum segment pair = a pair of equal-length/no-gap segments with the highest alignment score)
BLAST • Altschul, et. al. (1990) "Basic local alignment search tool." J. Mol. Biol. 215:403-410. • Variations of BLAST designed for specific purposes • http://www.ncbi.nlm.nih.gov/BLAST/