Probability Theory and Basic Alignment of String Sequences

Presentation Transcript

  1. Probability Theory and Basic Alignment of String Sequences
     Chapter 1.1-2.3
     S. Maarschalkerweerd & A. Tjhang

  2. Overview
     • Probability Theory
       - Maximum Likelihood
       - Bayes' Theorem
     • Pairwise Alignment
       - The Scoring Model
       - Alignment Algorithms

  3. Probability Theory

  4. Probability Theory
     • What is a probabilistic model?
     • Simple example: what is the probability of a base sequence $x_1 x_2 \dots x_n$?
       $P(x_1 x_2 \dots x_n) = \prod_{i=1}^{n} p(x_i)$, with $p(x_1), p(x_2), \dots, p(x_n)$ independent of each other
     • If $p_C = 0.3$, $p_T = 0.2$ and the sequence is CTC: $P(\mathrm{CTC}) = 0.3 \cdot 0.2 \cdot 0.3 = 0.018$
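
The independence assumption on this slide can be checked with a few lines of Python. This is a minimal sketch: the function name `sequence_probability` is made up for the example, and only the two base probabilities given on the slide are used.

```python
from math import prod

def sequence_probability(seq, p):
    """Probability of seq when every base is drawn independently with probability p[base]."""
    return prod(p[base] for base in seq)

# Values from the slide; probabilities for A and G are not given, so they are left out.
p = {"C": 0.3, "T": 0.2}
print(sequence_probability("CTC", p))  # approximately 0.018
```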

  5. Maximum Likelihood Estimation
     • Estimate the parameters of the model from large sets of examples (training set)
     • For example: P(T) and P(C) are estimated from their frequency in a database of residues
     • Avoid overfitting
       - If the database is too small, the model also fits the noise in the training set
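
For the independent-residue model, estimating P(T) and P(C) from their frequency amounts to counting. The sketch below assumes a toy training set invented for illustration; `estimate_frequencies` is a hypothetical helper name, not something from the slides.

```python
from collections import Counter

def estimate_frequencies(training_sequences):
    """Estimate residue probabilities as relative frequencies in the training set."""
    counts = Counter()
    for seq in training_sequences:
        counts.update(seq)  # count every residue in the sequence
    total = sum(counts.values())
    return {residue: n / total for residue, n in counts.items()}

# Made-up toy data; a set this small is exactly the overfitting risk the slide warns about.
training = ["ACGT", "CCTA", "TTCG"]
print(estimate_frequencies(training))
```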

  6. Probability Theory
     • Conditional probability
       - $P(X,Y) = P(X \mid Y)\,P(Y)$ (joint probability)
       - $P(X) = \sum_Y P(X,Y) = \sum_Y P(X \mid Y)\,P(Y)$ (marginal probability)

  7. Bayes’ Theorem
     • $P(X \mid Y) = \frac{P(Y \mid X)\,P(X)}{P(Y)}$
       - Posterior probability
     • Example:
       P(X) = probability that a tumor is visible on the x-ray
       P(C) = probability of breast cancer = 0.01
       P(X|C) = 0.9; P(X|¬C) = 0.05
       - A tumor is seen on the x-ray. What is the probability that the woman has breast cancer?
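
The question in the example can be worked through numerically. The sketch below obtains P(X) by marginalising over C and ¬C and then applies Bayes' theorem; the function and parameter names are assumptions chosen for readability.

```python
def posterior(prior, likelihood, likelihood_given_not):
    """P(C|X) from P(C), P(X|C) and P(X|not C), using Bayes' theorem."""
    # Marginal probability: P(X) = P(X|C) P(C) + P(X|not C) P(not C)
    evidence = likelihood * prior + likelihood_given_not * (1 - prior)
    return likelihood * prior / evidence

p_cancer_given_tumor = posterior(prior=0.01, likelihood=0.9, likelihood_given_not=0.05)
print(round(p_cancer_given_tumor, 3))  # 0.154
```

With these numbers the posterior is about 15%, far below P(X|C) = 0.9, because the prior P(C) is so small.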

  8. Pairwise Alignment

  9. Pairwise Alignment
     • Goal: determine whether two sequences are related (homologous).
     • Issues regarding pairwise alignment:
       - What sorts of alignment should be considered?
       - The scoring system used to rank alignments.
       - The algorithm used to find optimal (or good) scoring alignments.
       - The statistical methods used to evaluate the significance of an alignment score.

  10. Example
      • You need a ‘smart’ scoring model to distinguish b from c.

  11. The Scoring Model

  12. The Scoring Model
      • If two sequences are related, both descend from a common ancestor.
      • Sequences change over time through mutation:
        - Substitutions
        - Gaps (insertions or deletions)
      • Natural selection ensures that some mutations are seen more often than others (survival of the fittest).

  13. The Scoring Model
      • The total score of an alignment is the sum of:
        - a term for each aligned pair of residues
        - a term for each gap
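
As a rough illustration of this sum, the sketch below scores an alignment that is already given as two gapped strings of equal length. The names `alignment_score` and `match_mismatch`, the +1/-1 pair scores and the fixed cost `d` per gap position are assumptions made for the example; the slide does not specify them.

```python
def match_mismatch(a, b):
    """Hypothetical pair score: +1 for identical residues, -1 otherwise."""
    return 1 if a == b else -1

def alignment_score(aligned_x, aligned_y, score, d):
    """Sum the pair terms and the gap terms of an existing alignment ('-' marks a gap)."""
    total = 0
    for a, b in zip(aligned_x, aligned_y):
        if a == "-" or b == "-":
            total -= d           # gap term
        else:
            total += score(a, b)  # aligned-pair term
    return total

print(alignment_score("GATTACA", "GAT--CA", match_mismatch, d=1))  # 3
```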

  14. Substitution Matrices
      • We need a matrix with the scores for every possible pair of residues (e.g. bases or amino acids)
      • We can compute these scores by: $s(a,b) = \log\left(\frac{p_{ab}}{q_a q_b}\right)$
        - $p_{ab}$ = probability that residues a and b have been derived independently from some unknown original residue c
        - $q_a$ = frequency of a
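
A small sketch of the log-odds formula follows. The slide does not state the base of the logarithm, so the natural logarithm is assumed here (another base only rescales all scores), and the probabilities in the example are made-up numbers rather than values from a real substitution matrix.

```python
from math import log

def log_odds_score(p_ab, q_a, q_b):
    """s(a,b) = log(p_ab / (q_a * q_b)); natural log is an assumption of this sketch."""
    return log(p_ab / (q_a * q_b))

# Hypothetical values: a pair seen more often than chance scores positive,
# a pair seen less often than chance scores negative.
print(log_odds_score(p_ab=0.02, q_a=0.1, q_b=0.1) > 0)   # True
print(log_odds_score(p_ab=0.005, q_a=0.1, q_b=0.1) < 0)  # True
```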

  15. BLOSUM50

  16. Gap Penalties
      • $\gamma(g) = -gd$ (linear score)
      • $\gamma(g) = -d - (g-1)e$ (affine score)
        - d = gap-open penalty
        - e = gap-extension penalty
        - g = gap length
      • $P(\text{gap}) = f(g) \prod_{i \in \text{gap}} q_{x_i}$
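
The two gap scores can be compared directly. In the sketch below the function names and the numeric penalties (d = 8 for the linear score, d = 12 and e = 2 for the affine score) are illustrative assumptions, not values taken from the slide.

```python
def linear_gap(g, d):
    """Linear gap score: every gap position costs d."""
    return -g * d

def affine_gap(g, d, e):
    """Affine gap score: d to open the gap, e for each further position."""
    return -d - (g - 1) * e

for g in (1, 3, 10):
    print(g, linear_gap(g, d=8), affine_gap(g, d=12, e=2))
```

With these numbers a single-position gap is cheaper under the linear score, while long gaps are much cheaper under the affine score.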

  17. Alignment Algorithms

  18. Alignment Algorithms
      • Needleman-Wunsch (global alignment)
      • Smith-Waterman (local alignment)
      • Repeated matches
      • Overlap matches
      • Hybrid match conditions

  19. Dynamic Programming
      • The number of possible alignments is enormous
      • Algorithm for finding the optimal alignment: use dynamic programming
      • Save sub-results for later reuse, so that the same subproblem is not computed twice

  20. Needleman-Wunsch Algorithm
      • Global alignment
      • For sequences of size n and m, make an (n+1)x(m+1) matrix
      • Fill in from top left to bottom right:
        $F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i,y_j) \\ F(i-1,j) - d \\ F(i,j-1) - d \end{cases}$
      • Keep a pointer to the cell that is used to derive F(i,j)
      • Takes O(nm) time and memory
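
The recurrence and the pointer bookkeeping can be sketched as follows. Two things here are assumptions that the slide does not state: the boundary values F(i,0) = -id and F(0,j) = -jd (the usual choice for global alignment), and the +1/-1 `match_mismatch` scorer standing in for a real substitution matrix.

```python
def match_mismatch(a, b):
    """Hypothetical pair score: +1 for identical residues, -1 otherwise."""
    return 1 if a == b else -1

def needleman_wunsch(x, y, score, d):
    """Global alignment with linear gap penalty d; returns (score, aligned_x, aligned_y)."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]
    # Assumed boundary conditions: leading gaps cost d each.
    for i in range(1, n + 1):
        F[i][0], ptr[i][0] = -i * d, "up"
    for j in range(1, m + 1):
        F[0][j], ptr[0][j] = -j * d, "left"
    # Fill from top left to bottom right, remembering which cell each value came from.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j], ptr[i][j] = max(
                (F[i - 1][j - 1] + score(x[i - 1], y[j - 1]), "diag"),
                (F[i - 1][j] - d, "up"),
                (F[i][j - 1] - d, "left"),
                key=lambda option: option[0],
            )
    # Traceback from the bottom right cell to the top left cell.
    aligned_x, aligned_y = [], []
    i, j = n, m
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "diag":
            aligned_x.append(x[i - 1]); aligned_y.append(y[j - 1]); i -= 1; j -= 1
        elif move == "up":
            aligned_x.append(x[i - 1]); aligned_y.append("-"); i -= 1
        else:
            aligned_x.append("-"); aligned_y.append(y[j - 1]); j -= 1
    return F[n][m], "".join(reversed(aligned_x)), "".join(reversed(aligned_y))

best, top, bottom = needleman_wunsch("GATTACA", "GATCA", match_mismatch, d=1)
print(best)  # 3
print(top)
print(bottom)
```

Several alignments can share the optimal score; which one the traceback returns depends on how ties are broken in `max`.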

  21. Matrix [figure: dynamic programming matrix]

  22. Matrix Traceback

  23. Smith-Waterman Algorithm
      • Local alignment
      • Two differences with Needleman-Wunsch:
        1. $F(i,j) = \max \begin{cases} 0 \\ F(i-1,j-1) + s(x_i,y_j) \\ F(i-1,j) - d \\ F(i,j-1) - d \end{cases}$
        2. A local alignment can end anywhere, so the traceback starts from the highest value in the matrix (not necessarily the bottom right cell)
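
Both differences show up in a short sketch: the extra 0 in the maximum, and a traceback that starts at the highest cell and stops when it reaches a 0. The +2/-1 scorer, the gap penalty and the example sequences are assumptions for illustration; integer scores are assumed so the traceback can compare cell values exactly.

```python
def match_mismatch(a, b):
    """Hypothetical pair score: +2 for identical residues, -1 otherwise."""
    return 2 if a == b else -1

def smith_waterman(x, y, score, d):
    """Local alignment with linear gap penalty d; returns (score, aligned_x, aligned_y)."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]  # first row and column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                0,  # difference 1: a local alignment may start afresh here
                F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                F[i - 1][j] - d,
                F[i][j - 1] - d,
            )
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)
    # Difference 2: trace back from the highest cell until a 0 is reached.
    aligned_x, aligned_y = [], []
    i, j = best_pos
    while i > 0 and j > 0 and F[i][j] > 0:
        if F[i][j] == F[i - 1][j - 1] + score(x[i - 1], y[j - 1]):
            aligned_x.append(x[i - 1]); aligned_y.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - d:
            aligned_x.append(x[i - 1]); aligned_y.append("-"); i -= 1
        else:
            aligned_x.append("-"); aligned_y.append(y[j - 1]); j -= 1
    return best, "".join(reversed(aligned_x)), "".join(reversed(aligned_y))

print(smith_waterman("TTTACGTTT", "GGACGGG", match_mismatch, d=2))  # (6, 'ACG', 'ACG')
```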

  24. Matrix

  25. Smith-Waterman Algorithm
      • The expected score s(a,b) for a random match must be negative
      • There must be some s(a,b) greater than 0, or no alignment is found
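
Both conditions can be checked for a given scoring scheme. The sketch below reads "expected score for a random match" as the sum of q_a q_b s(a,b) over all residue pairs, which is an interpretation rather than something the slide spells out; the uniform DNA frequencies and the +1/-1 scorer are assumptions.

```python
from itertools import product

def match_mismatch(a, b):
    """Hypothetical pair score: +1 for identical residues, -1 otherwise."""
    return 1 if a == b else -1

def expected_random_score(q, score):
    """Expected pair score when both residues are drawn independently from q."""
    return sum(q[a] * q[b] * score(a, b) for a, b in product(q, repeat=2))

q = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # assumed uniform background
print(expected_random_score(q, match_mismatch))                        # -0.5
print(any(match_mismatch(a, b) > 0 for a, b in product(q, repeat=2)))  # True
```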

  26. Repeated Matches
      • Many local alignments are possible if one or both sequences are long; Smith-Waterman finds only one of them
      • Find parts of one sequence in the other sequence
      • Not every alignment is useful: use a threshold

  27. Repeated Matches
      $F(i,0) = \max \begin{cases} F(i-1,0) \\ F(i-1,j) - T, & j = 1,\dots,m \end{cases}$
      $F(i,j) = \max \begin{cases} F(i,0) \\ F(i-1,j-1) + s(x_i,y_j) \\ F(i-1,j) - d \\ F(i,j-1) - d \end{cases}$
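
A sketch of the matrix fill for these two recurrences follows. Several details are assumptions that the slide does not give: F(0,j) = 0 for the first row, an extra row n+1 whose cell F(n+1,0) is returned as the total score, and the +5/-4 scorer with d = 8 and T = 10 in the example. Only the score is computed; the traceback is omitted.

```python
def match_mismatch(a, b):
    """Hypothetical pair score: +5 for identical residues, -4 otherwise."""
    return 5 if a == b else -4

def repeated_matches_score(x, y, score, d, T):
    """Total score of all matches of (parts of) y in x that score above threshold T."""
    n, m = len(x), len(y)  # y is assumed to be non-empty
    # Rows 0..n+1; the extra row n+1 only holds the final value F(n+1, 0).
    F = [[0] * (m + 1) for _ in range(n + 2)]
    for i in range(1, n + 2):
        # F(i,0): stay unmatched, or end a match from row i-1 and pay the threshold T.
        F[i][0] = max(F[i - 1][0], max(F[i - 1][1:]) - T)
        if i == n + 1:
            break
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i][0],  # start a new match from the running total
                F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                F[i - 1][j] - d,
                F[i][j - 1] - d,
            )
    return F[n + 1][0]

# Two copies of ACGT in x: each match scores 20 and contributes 20 - T = 10.
print(repeated_matches_score("ACGTTTACGT", "ACGT", match_mismatch, d=8, T=10))  # 20
```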

  28. Matrix
      Threshold T = 20

  29. Overlap Matches
      • Find a match between the start of a sequence and the end of a sequence (can be the same sequence)
      • The alignment begins on the left-hand or top border of the matrix and ends on the right-hand or bottom border

  30. Overlap Matches
      • F(0,j) = 0, for j = 1,…,m
      • F(i,0) = 0, for i = 1,…,n
      • $F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i,y_j) \\ F(i-1,j) - d \\ F(i,j-1) - d \end{cases}$
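
With the zero borders above, overhanging ends cost nothing, and the best overlap score is the maximum over the bottom row and the right-hand column. The sketch below computes only that score; the +1/-1 scorer, the gap penalty and the example sequences are assumptions for illustration.

```python
def match_mismatch(a, b):
    """Hypothetical pair score: +1 for identical residues, -1 otherwise."""
    return 1 if a == b else -1

def overlap_score(x, y, score, d):
    """Best overlap-match score with linear gap penalty d."""
    n, m = len(x), len(y)
    # Top row and left column are 0: starting anywhere on those borders is free.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                F[i - 1][j] - d,
                F[i][j - 1] - d,
            )
    # The alignment must end on the bottom row or the right-hand column.
    return max(max(F[n]), max(row[m] for row in F))

# The end of x ("ACGT") overlaps the start of y ("ACGT").
print(overlap_score("CCCACGT", "ACGTGGG", match_mismatch, d=2))  # 4
```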

  31. Matrix

  32. Hybrid Match Conditions
      • Different types of alignment can be created by
        - adjusting the right-hand side of this formula: F(i,j) = max {….
        - adjusting the traceback
      • Example:
        - We want to align two sequences from the beginning of both sequences until a local alignment has been found.

  33. Summary
      • Probability theory is important for sequence analysis
      • Goal: determine whether two sequences are related
      • For that, we need to find an optimal alignment between those sequences using algorithms
      • A scoring model is required to rank different alignments
      • Different algorithms for different types of alignments
        - use dynamic programming