An Introduction to Sequence Alignment

An Introduction to Sequence Alignment July 16, 2003 Sining Chen Expressionist Meeting

What? • DNA/RNA sequences: strings composed of an alphabet of 4 letters • Protein sequences: alphabet of 20 letters

Why? • Identify a gene • Find clues to gene function (ortholog?) • Find other organisms with this gene (homology) • Gather info for an evolutionary model • …

An Example • GCGCATGGATTGAGCGA • TGCGCCATTGATGACCA A possible alignment: -GCGC-ATGGATTGAGCGA TGCGCCATTGAT-GACC-A

Alignments • -GCGC-ATGGATTGAGCGA • TGCGCCATTGAT-GACC-A • Three elements: • Perfect matches • Mismatches • Insertions & deletions (indel)

Significant Similarity in an Alignment • Ho: the current alignment is a result of random line-up (the 2 sequences are unrelated) • Ha: the sequences diverge from a common ancestor (related) • Test statistic: Ymax = length of the longest running perfect match subsequence

Exact Matching Subsequences • In DNA alignment, the matching probability • Under Ho lengths of exact match subseq Y should follow a geometric distribution

Well-matching Subsequences • Evolution may cause small differences to even sequences with a reasonably recent common ancestor. • We consider Ymax to be the longest subseq with up to k mismatches. • Y follow hyper-geometric distribution • P-value: exact/simulated/approximate (independence among Y does not hold any more)

Choosing Alignments There are many possible alignments For example, compare: -GCGC-ATGGATTGAGCGA TGCGCCATTGAT-GACC-A to ------GCGCATGGATTGAGCGA TGCGCC----ATTGATGACCA-- Which one is better?

Scoring Rule • Example Score = (# matches) – (# mismatches) – (# indels) x 2

Example Example: -GCGC-ATGGATTGAGCGA TGCGCCATTGAT-GACC-A Score: (+1x13) + (-1x2) + (-2x4) = 3 ------GCGCATGGATTGAGCGA TGCGCC----ATTGATGACCA-- Score: (+1x5) + (-1x6) + (-2x11) = -23

Edit Distance • The edit distance between two sequences is the “cost” of the “cheapest” set of edit operations needed to transform one sequence into the other • Computing edit distance between two sequences almost equivalent to finding the alignment that minimizes the distance

Computing Edit Distance • How can we compute the edit distance?? • If |s| = n and |t| = m, there are more than alignments • 2 sequences each of length 1000: > 10^600 • The additive form of the score allows to perform dynamic programming to compute edit distance efficiently

Recursive Argument • Suppose we have two sequences:s[1..n+1] and t[1..m+1] The best alignment must be in one of three cases: 1. Last position is (s[n+1],t[m +1] ) 2. Last position is (s[n +1],-) 3. Last position is (-, t[m +1] )

Recursive Argument Define the notation: • Using the recursive argument, we get the following recurrence for V:

Recursive Argument • Of course, we also need to handle the base cases in the recursion:

Dynamic ProgrammingAlgorithm We fill the matrix using the recurrence rule

Conclusion: d(AAAC,AGC) = -1 DynamicProgramming Algorithm

Reconstructing the Best Alignment • We now trace back the path the corresponds to the best alignment AAAC AG-C

Reconstructing the Best Alignment • Sometimes, more than one alignment has the best score AAAC A-GC

Complexity Space: O(mn) Time: O(mn) • Filling the matrix O(mn) • Backtrace O(m+n)

Other scoring schemes • Needleman and Wunsch: 1 for identical amino acid, 0 otherwise • Dayhoff PAM scoring matrix: variations include BLOSUM matrices(Henikoff and Henikoff 1992, Proc. Nat. Acad. Sci. 89, 10915-10919). • … • Different Gap Cost Function

Scoringmatrix for protein sequences

Discussed above are global alignment algorithms • For local alignment algorithm: use score 0 as threshold as in: Smith, Waterman (1981), J. Mol. Biol., 147, 195-197

There exist other algorithms faster than dynamic programming under specific assumptions, e.g. BLAST, MegaBLAST • Zhang Z, Schwartz S, Wagner L, Miller W. 2000. A greedy algorithm for aligning DNA sequences. J Comp Biol. 7:203-214. Assume only sequencing error to search for very similar sequences. MegaBLAST behind NCBI HomoloGene.

Multiple Global Alignment • CLUSTAL W (Thompson, Higgins, Gibson 1994, Nucleic Acids Res, 22, 4673-4680) • Ungapped local alignments: Lawrence et.al. 1993, Science 262, 208-214 • Gapped local alignments: Zhu et.al., 1998, Bayesian adaptive sequence alignment algorithms. Bioinformatics 14, 25-39

BLAST • Search a sequence database for fragments similar to the query sequence • Compile a list of high-scoring short words shared by the query sequence and the database; • Scan the database for “hits” • Expand the “hits” to MSP (maximum segment pair = a pair of equal-length/no-gap segments with the highest alignment score)

BLAST • Altschul, et. al. (1990) "Basic local alignment search tool." J. Mol. Biol. 215:403-410. • Variations of BLAST designed for specific purposes • http://www.ncbi.nlm.nih.gov/BLAST/

Advertisement…

An Introduction to Sequence Alignment

An Introduction to Sequence Alignment

Presentation Transcript

Introduction to Sequence Alignment

Sequence Alignment

Sequence Alignment

Sequence Alignment

Sequence alignment:

A Brief Introduction to Biological Sequence Alignment

Sequence Alignment

Sequence Alignment

Sequence Alignment

Introduction to the theory of sequence alignment

Sequence Alignment

Sequence Alignment

Sequence Alignment

Sequence alignment

Sequence Alignment

Introduction to the theory of sequence alignment

Sequence Alignment

Sequence Alignment

Sequence Alignment

Sequence Alignment

Sequence alignment

Sequence Alignment