Sequence Alignment: Scoring Functions, N-W and S-W, Affine Gap Penalties
Saurabh Sinha, Department of Computer Science, University of Illinois Urbana-Champaign
02/05/2008
Scribed by: Chandrasekar Ramachandran
Contents
• Introduction
• Interpretations
• Types of Alignments
• Techniques for Solving
  • Dynamic Programming
  • Probabilistic Methods
• Scoring Functions
• N-W and S-W
• Affine Gap Penalties
Introduction
• Sequence alignment: a way of arranging one sequence (DNA, RNA, or protein) against another to determine whether a region has been conserved in evolution or shares a common evolutionary origin
• Sequences are treated as strings of letters
• Matrix representation (one sequence per row, "-" marks a gap):
  - G G C C A G G A T T
  G G G C C - G G - T T
Interpretations
• Mismatches?
  • Point mutations: replacement of a single nucleotide base
  • Categorized as transitions and transversions
• Gaps?
  • Indels, i.e. insertion or deletion mutations
  • Can produce frameshift mutations unless the gap length is a multiple of 3
  • Introduced in one or both lineages
Interpretations (contd.)
• What about amino acids?
  • The degree of similarity between aligned residues gives an estimate of conservation
  • If conservation is high: indicates a region of high structural or functional importance
• Estimating similar functional roles in nucleotide sequences: by assessing the conservation of base pairing
Solving Sequence Alignment Problems
• Dynamic Programming
  • Initialization
  • Matrix fill (scoring)
  • Traceback
• Probabilistic Methods
  • Bayesian methods for HMMs
  • Likelihood derivatives and Fisher scores
  • Training and model comparison
Needleman-Wunsch Algorithm (Global Alignment)
• Scores for aligned characters are specified by a similarity matrix
• Example:
  Sequence 1: -CCGCTTACCTA
  Sequence 2: TTCCGCTTATTA
• Possible alignment:
  Sequence 1: -CCGCTTACCTA
  Sequence 2: -CCGCTTA----
• Matches, mismatches, and gaps (indels) are scored separately
Global Alignment (contd.)
• The scoring matrix is called the F-matrix
• Each (i, j) entry is denoted by F(i,j)
• Running time: for sequences of size a and b, O(ab)
• Summary:
  • Initialization: fill in the base cases in the topmost row and leftmost column
  • Filling partial alignments: compute each F(i,j) from its three neighbours F(i-1,j-1), F(i-1,j) and F(i,j-1)
  • Traceback: follow the pointer matrix from the final cell back to the initial cell to recover the best solution
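The three steps summarized above can be sketched in Python as follows. This is a minimal illustration, not the lecture's own code: the function name `needleman_wunsch`, the scores (match +1, mismatch -1) and the linear gap score of -2 per position are assumptions chosen for the example. Instead of storing a separate pointer matrix, the traceback re-derives which neighbour produced each entry.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, gap=-2):
    """Global alignment with a linear gap score (all scores are assumed values)."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]

    # Initialization: leftmost column and topmost row hold pure-gap prefixes.
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap

    # Matrix fill: each entry depends on its diagonal, upper and left neighbours.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] + gap, F[i][j - 1] + gap)

    # Traceback: walk from the bottom-right cell back to the origin.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and x[i - 1] == y[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))


score, row1, row2 = needleman_wunsch("ACGT", "AGT")
print(score, row1, row2)
```

The two nested loops over the F-matrix are where the O(ab) running time comes from.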
Smith-Waterman Algorithm (Local Alignment)
• Aligns stretches shorter than the entire sequence length
• Generally used for sequences that are significantly dissimilar overall but may contain similar regions
• Scoring matrix cells that would be negative are set to zero
• Traceback starts at the highest-scoring cell and continues until a cell with zero score is reached
• Prerequisite: the expected score of aligning random characters must be negative
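A minimal Python sketch of the local alignment procedure described above. The function name `smith_waterman` and the scores (match +2, mismatch -1, gap -2 per position) are assumptions for illustration; they are chosen so that a random pairing of characters has a negative expected score.

```python
def smith_waterman(x, y, match=2, mismatch=-1, gap=-2):
    """Local alignment with a linear gap score (all scores are assumed values)."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)

    # Matrix fill: same recurrence as global alignment, but floored at zero.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(0, F[i - 1][j - 1] + s, F[i - 1][j] + gap, F[i][j - 1] + gap)
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)

    # Traceback: start at the highest-scoring cell, stop at a zero cell.
    ax, ay = [], []
    i, j = best_pos
    while i > 0 and j > 0 and F[i][j] > 0:
        s = match if x[i - 1] == y[j - 1] else mismatch
        if F[i][j] == F[i - 1][j - 1] + s:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


print(smith_waterman("TTTACGTAAA", "GGGACGTCCC"))
```

Only the shared stretch is reported; the dissimilar flanking regions never enter the alignment because their cells are reset to zero.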
Scoring Functions - Overview
• Given two sequences, a number is associated with each possible alignment
• E.g. matches: $+x$, mismatches: $-y$, gaps: $-z$
• Scoring function: $(x \times \#\text{matches}) - (y \times \#\text{mismatches}) - (z \times \#\text{gaps})$
• Alignment score: the sum of substitution scores and gap penalties
• Residue-based substitution matrices: built from models of protein evolution
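As a small worked example of the scoring function above, the sketch below scores an alignment that is already given as two equal-length rows. The function name `alignment_score` and the default values x = 1, y = 1, z = 2 are assumptions for illustration; the rows are the ones from the matrix representation on the Introduction slide.

```python
def alignment_score(row1, row2, x=1, y=1, z=2):
    """Return x * #matches - y * #mismatches - z * #gaps for an aligned pair."""
    matches = mismatches = gaps = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            gaps += 1          # a gap in either row counts as one gap position
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return x * matches - y * mismatches - z * gaps


# 8 matches, 0 mismatches, 3 gap positions: 8*1 - 0*1 - 3*2 = 2
print(alignment_score("-GGCCAGGATT", "GGGCC-GG-TT"))
```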
Simple Substitution Matrices
• Express how likely one character in a sequence is to change into each other character state
• N x N matrix, where N = 4 for DNA and N = 20 for amino acids
• Another way is to group A and G as purines and T and C as pyrimidines
• Changes between a purine and a pyrimidine (transversions) are less likely than changes within the same class (transitions)
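A sketch of a 4 x 4 DNA substitution matrix that uses the purine/pyrimidine grouping. The function name `build_dna_matrix` and the three score values are assumptions for illustration; the only property carried over from the slide is that substitutions within a class are penalized less than substitutions across classes.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def build_dna_matrix(match=1, transition=-1, transversion=-2):
    """Build an N x N (N = 4) substitution matrix keyed by (base, base)."""
    bases = "ACGT"
    matrix = {}
    for a in bases:
        for b in bases:
            if a == b:
                score = match
            elif {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
                score = transition      # same chemical class
            else:
                score = transversion    # purine <-> pyrimidine
            matrix[(a, b)] = score
    return matrix


S = build_dna_matrix()
print(S[("A", "G")], S[("A", "C")])
```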
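Minimum Entropy Scoring Function
• Minimum entropy score: the sum of entropy scores computed for each column
• Column score: $S(m_i) = -\sum_a c_{ia} \log p_{ia}$
• Here $i$ is a column, $c_{ia}$ is the count of letter $a$ in column $i$, and $p_{ia}$ is the inferred probability of letter $a$ in column $i$
• Gap characters: can be treated as additional residue symbols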
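The per-column entropy score can be sketched as below, using the usual form $-\sum_a c_{ia} \log p_{ia}$. Two assumptions are made here that the slide does not state: the probability $p_{ia}$ is inferred as the observed frequency $c_{ia}$ divided by the column height (no pseudocounts), and logarithms are taken base 2. The function names are hypothetical.

```python
import math
from collections import Counter


def column_entropy_score(column):
    """Return -sum_a c_a * log2(p_a) for one alignment column."""
    counts = Counter(column)            # c_a: count of each letter in the column
    total = len(column)
    # p_a is estimated as c_a / total (assumption: plain frequencies).
    return -sum(c * math.log2(c / total) for c in counts.values())


def minimum_entropy_score(rows):
    """Sum the column scores over all columns of a multiple alignment."""
    return sum(column_entropy_score(col) for col in zip(*rows))


rows = ["AA", "AC", "AA", "AC"]
print(minimum_entropy_score(rows))
```

A fully conserved column contributes zero, so lower totals correspond to better conserved alignments. A gap character in a column is simply counted like any other symbol in this sketch.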
Gap Functions
• Gaps are more likely to occur in groups
• Examples:
  • Convex gap scoring functions
  • Affine gap functions
• Convex gap scoring functions:
  • The penalty for each additional gap position does not increase as the gap gets longer
  • $\gamma(n)$: for all $n$, $\gamma(n+1) - \gamma(n) \le \gamma(n) - \gamma(n-1)$
  • Now
    $$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i, y_j) \\ \max_{k=0,\dots,i-1} F(k,j) - \gamma(i-k) \\ \max_{k=0,\dots,j-1} F(i,k) - \gamma(j-k) \end{cases}$$
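The recurrence for a general gap function $\gamma$ can be transcribed almost directly into Python. This is a sketch under stated assumptions: the function name `align_general_gap`, the match/mismatch scores, and the example penalty $\gamma(n) = 3 + \log_2 n$ (whose increments shrink as $n$ grows) are all chosen for illustration. Only the optimal score is returned, with no traceback.

```python
import math


def align_general_gap(x, y, score, gamma):
    """Global alignment score with an arbitrary gap penalty gamma(length)."""
    n, m = len(x), len(y)
    F = [[-math.inf] * (m + 1) for _ in range(n + 1)]
    F[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0 and j > 0:
                candidates.append(F[i - 1][j - 1] + score(x[i - 1], y[j - 1]))
            # A gap of length i - k ending at row i (scan the whole column).
            candidates.extend(F[k][j] - gamma(i - k) for k in range(i))
            # A gap of length j - k ending at column j (scan the whole row).
            candidates.extend(F[i][k] - gamma(j - k) for k in range(j))
            F[i][j] = max(candidates)
    return F[n][m]


def simple_score(a, b):
    return 1 if a == b else -1      # assumed substitution scores


def log_gap(length):
    return 3 + math.log2(length)    # assumed penalty with shrinking increments


# Two matches plus one gap of length 2: 2 - (3 + 1) = -2
print(align_general_gap("AAAA", "AA", simple_score, log_gap))
```

The inner scans over k are what make every cell cost linear rather than constant time.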
Affine Gap Functions
• Shortcomings of a general gap penalty function:
  • Each additional gap position can carry a different penalty
  • Updating the entries takes cubic time overall
• Example:
  • The first gap position is penalized differently; subsequent gap positions are penalized linearly
  • 3 matrices are computed simultaneously
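A sketch of the three-matrix computation for affine gaps, in the spirit of Gotoh (1982). The penalty form $\gamma(n) = d + (n-1)e$, the names `M`, `Ix`, `Iy`, `gap_open` (d) and `gap_extend` (e), and all numeric values are assumptions for illustration. Only the optimal score is returned.

```python
import math


def affine_gap_score(x, y, match=1, mismatch=-1, gap_open=3, gap_extend=1):
    """Global alignment score with gap penalty gap_open + (n - 1) * gap_extend."""
    n, m = len(x), len(y)
    NEG = -math.inf
    M = [[NEG] * (m + 1) for _ in range(n + 1)]    # x[i] aligned to y[j]
    Ix = [[NEG] * (m + 1) for _ in range(n + 1)]   # x[i] aligned to a gap
    Iy = [[NEG] * (m + 1) for _ in range(n + 1)]   # y[j] aligned to a gap
    M[0][0] = 0
    for i in range(1, n + 1):
        Ix[i][0] = -gap_open - (i - 1) * gap_extend
    for j in range(1, m + 1):
        Iy[0][j] = -gap_open - (j - 1) * gap_extend

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], Ix[i - 1][j - 1], Iy[i - 1][j - 1]) + s
            # Open a new gap (pay gap_open) or extend the current one (pay gap_extend).
            Ix[i][j] = max(M[i - 1][j] - gap_open,
                           Ix[i - 1][j] - gap_extend,
                           Iy[i - 1][j] - gap_open)
            Iy[i][j] = max(M[i][j - 1] - gap_open,
                           Iy[i][j - 1] - gap_extend,
                           Ix[i][j - 1] - gap_open)
    return max(M[n][m], Ix[n][m], Iy[n][m])


# Two matches plus one gap of length 2: 2 - (3 + 1) = -2
print(affine_gap_score("AAAA", "AA"))
```

Each cell now looks at a constant number of neighbours, so the running time returns to the product of the two sequence lengths.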
References
• http://webcourse.cs.technion.ac.il/236522/Winter2005-2006/ho/WCFiles/tutorial03.ppt
• http://engr.smu.edu/~saad/courses/cse8354/lectures/lecture6.pdf
• http://www.bioinfo.org.cn/lectures/index-13.html
• Needleman, S.B. and Wunsch, C.D. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol., 48, 443-453.
• Smith, T.F. and Waterman, M.S. (1981) Comparison of Biosequences. Adv. Appl. Math., 2, 482-489.
• Dayhoff, M.O., Barker, W.C. and Hunt, L.T. (1983) Establishing Homologies in Protein Sequences. Methods Enzymol., 91, 524-545.
• Gotoh, O. (1982) An Improved Algorithm for Matching Biological Sequences. J. Mol. Biol., 162, 705-708.