## Probability Theory and Basic Alignment of String Sequences

Chapter 1.1-2.3. S. Maarschalkerweerd & A. Tjhang

**Overview**

- Probability Theory
  - Maximum Likelihood
  - Bayes' Theorem
- Pairwise Alignment
  - The Scoring Model
  - Alignment Algorithms

**Probability Theory**

- What is a probabilistic model?
- Simple example: what is the probability of a base sequence $x_1 x_2 \ldots x_n$?
  - The probabilities $p(x_1), p(x_2), \ldots, p(x_n)$ are independent of each other, so the probability of the sequence is their product.
  - If $p_C = 0.3$, $p_T = 0.2$ and the sequence is CTC: $P(\mathrm{CTC}) = 0.3 \times 0.2 \times 0.3 = 0.018$

**Maximum Likelihood Estimation**

- Estimate the parameters of the model from large sets of examples (the training set).
- For example, $P(T)$ and $P(C)$ are estimated from their frequency in a database of residues.
- Avoid overfitting: if the database is too small, the model also fits the noise in the training set.

**Probability Theory**

- Conditional probability
  - $P(X,Y) = P(X|Y)\,P(Y)$ (joint probability)
  - $P(X) = \sum_Y P(X,Y) = \sum_Y P(X|Y)\,P(Y)$ (marginal probability)

**Bayes' Theorem**

- $P(X|Y) = \dfrac{P(Y|X)\,P(X)}{P(Y)}$
- $P(X|Y)$ is the posterior probability.
- Example:
  - $P(X)$ = probability that a tumor is visible on the x-ray
  - $P(C)$ = probability of breast cancer = 0.01
  - $P(X|C) = 0.9$; $P(X|\neg C) = 0.05$
  - A tumor is seen on the x-ray. What is the probability that the woman has breast cancer?

**Pairwise Alignment**

- Goal: determine whether two sequences are related (homologous).
- Issues regarding pairwise alignment:
  - What sorts of alignment should be considered?
  - The scoring system used to rank alignments.
  - The algorithm used to find optimal (or good) scoring alignments.
  - The statistical methods used to evaluate the significance of an alignment score.

**Example**

- You need a 'smart' scoring model to distinguish b from c.
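The probability slides above can be checked numerically. The following is a minimal Python sketch, not part of the original deck: the function names `sequence_probability`, `estimate_frequencies` and `posterior`, and the toy training string, are hypothetical and chosen only for illustration. It reproduces the CTC product, estimates residue frequencies by counting, and applies Bayes' theorem to the x-ray example, using the marginal probability to obtain $P(X)$.

```python
from collections import Counter


def sequence_probability(seq, p):
    """P(x1...xn) as a product of independent per-residue probabilities."""
    prob = 1.0
    for residue in seq:
        prob *= p[residue]
    return prob


def estimate_frequencies(training):
    """Maximum likelihood estimate: the relative frequency of each residue."""
    counts = Counter(training)
    total = sum(counts.values())
    return {residue: n / total for residue, n in counts.items()}


def posterior(prior, likelihood, likelihood_given_not):
    """Bayes' theorem for a yes/no hypothesis C given an observation X.

    prior                = P(C)
    likelihood           = P(X|C)
    likelihood_given_not = P(X|not C)
    """
    # Marginal probability: P(X) = P(X|C) P(C) + P(X|not C) P(not C)
    evidence = likelihood * prior + likelihood_given_not * (1.0 - prior)
    return likelihood * prior / evidence


# CTC example from the slide: 0.3 * 0.2 * 0.3
print(sequence_probability("CTC", {"C": 0.3, "T": 0.2}))

# Toy training set (an assumption): 3 C, 2 T, 3 A, 2 G out of 10 residues
print(estimate_frequencies("CCCTTAAAGG"))

# X-ray example: P(C) = 0.01, P(X|C) = 0.9, P(X|not C) = 0.05
print(posterior(0.01, 0.9, 0.05))
```

With the slide's numbers the posterior is 0.009 / 0.0585, roughly 0.154, so a visible tumor raises the probability of cancer from 1% to about 15%. A ten-residue training set is exactly the kind of small database the overfitting bullet warns about; it is used here only to keep the counts easy to verify by hand.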
**The Scoring Model**

- If two sequences are related, both descend from a common ancestor.
- Sequences change through mutation:
  - Substitutions
  - Gaps (insertions or deletions)
- Natural selection ensures that some mutations are seen more often than others (survival of the fittest).

**The Scoring Model**

- Total score of an alignment:
  - A term for each aligned pair of residues
  - A term for each gap
  - The total is the sum of those terms

**Substitution Matrices**

- We need a matrix with the scores for every possible pair of residues (e.g. bases or amino acids).
- We can compute these scores by:

$$s(a,b) = \log\left(\frac{p_{ab}}{q_a q_b}\right)$$

- $p_{ab}$ = probability that residues $a$ and $b$ have been derived independently from some unknown original residue $c$
- $q_a$ = frequency of $a$

**BLOSUM50**

[Figure: the BLOSUM50 substitution matrix]

**Gap Penalties**

- $\gamma(g) = -gd$ (linear score)
- $\gamma(g) = -d - (g-1)e$ (affine score)
  - $d$ = gap-open penalty
  - $e$ = gap-extension penalty
  - $g$ = gap length
- $P(\text{gap}) = f(g) \prod_{i \in \text{gap}} q_{x_i}$

**Alignment Algorithms**

- Needleman-Wunsch (global alignment)
- Smith-Waterman (local alignment)
- Repeated matches
- Overlap matches
- Hybrid match conditions

**Dynamic Programming**

- The number of possible alignments is enormous.
- The optimal alignment is found with dynamic programming.
- Sub-results are saved for later reuse, so the same subproblem is never calculated twice.

**Needleman-Wunsch Algorithm**

- Global alignment
- For sequences of size $n$ and $m$, make an $(n+1) \times (m+1)$ matrix.
- Fill it in from top left to bottom right:

$$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i,y_j) \\ F(i-1,j) - d \\ F(i,j-1) - d \end{cases}$$

- Keep a pointer to the cell that was used to derive $F(i,j)$.
- Takes $O(nm)$ time and memory.

**Matrix**

[Figure: example Needleman-Wunsch matrix, partially filled in]
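The Needleman-Wunsch recurrence above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the presenters' code: the function name `needleman_wunsch`, the toy scoring function `simple_score` (+2 for a match, -1 for a mismatch) and the gap penalty `d = 2` are all hypothetical; a real run would take $s(a,b)$ from a substitution matrix such as BLOSUM50. The border cells are initialised to $-id$ and $-jd$, which is the usual choice for a linear gap penalty.

```python
def needleman_wunsch(x, y, s, d):
    """Global alignment with a linear gap penalty d and scoring function s."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]

    # Borders: aligning a prefix against nothing costs one gap per residue.
    for i in range(1, n + 1):
        F[i][0], ptr[i][0] = -i * d, "up"
    for j in range(1, m + 1):
        F[0][j], ptr[0][j] = -j * d, "left"

    # Fill from top left to bottom right, remembering where each value came from.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = [
                (F[i - 1][j - 1] + s(x[i - 1], y[j - 1]), "diag"),
                (F[i - 1][j] - d, "up"),
                (F[i][j - 1] - d, "left"),
            ]
            F[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])

    # Traceback from the bottom-right cell to the top-left cell.
    aligned_x, aligned_y = [], []
    i, j = n, m
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "diag":
            aligned_x.append(x[i - 1])
            aligned_y.append(y[j - 1])
            i, j = i - 1, j - 1
        elif move == "up":
            aligned_x.append(x[i - 1])
            aligned_y.append("-")
            i -= 1
        else:
            aligned_x.append("-")
            aligned_y.append(y[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(aligned_x)), "".join(reversed(aligned_y))


def simple_score(a, b):
    """Toy substitution score (an assumption, not BLOSUM50)."""
    return 2 if a == b else -1


print(needleman_wunsch("ACGT", "AGT", simple_score, d=2))
# → (4, 'ACGT', 'A-GT')
```

Both `F` and `ptr` hold $(n+1)(m+1)$ cells and each cell is computed once from three neighbours, which is where the $O(nm)$ time and memory bound comes from.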
**Matrix**

[Figure: Needleman-Wunsch matrix with traceback]

**Smith-Waterman Algorithm**

- Local alignment
- Two differences with Needleman-Wunsch:
  1. The recurrence has an extra option, 0:

$$F(i,j) = \max \begin{cases} 0 \\ F(i-1,j-1) + s(x_i,y_j) \\ F(i-1,j) - d \\ F(i,j-1) - d \end{cases}$$

  2. A local alignment can end anywhere, so the traceback starts from the highest value in the matrix (not necessarily the bottom-right cell).

**Matrix**

[Figure: example Smith-Waterman matrix]

**Smith-Waterman Algorithm**

- The expected score $s(a,b)$ for a random match must be negative.
- There must be some $s(a,b)$ greater than 0, or no alignment is found.

**Repeated Matches**

- Many local alignments are possible if one or both sequences are long. Smith-Waterman finds only one of them.
- Find parts of one sequence in the other sequence.
- Not every alignment is useful, so a threshold $T$ is applied.

**Repeated Matches**

$$F(i,0) = \max \begin{cases} F(i-1,0) \\ F(i-1,j) - T, & j = 1,\ldots,m \end{cases}$$

$$F(i,j) = \max \begin{cases} F(i,0) \\ F(i-1,j-1) + s(x_i,y_j) \\ F(i-1,j) - d \\ F(i,j-1) - d \end{cases}$$

**Matrix**

[Figure: example repeated-matches matrix, threshold $T = 20$]

**Overlap Matches**

- Find a match between the start of a sequence and the end of a sequence (which can be the same sequence).
- The alignment begins on the left-hand or top border of the matrix and ends on the right-hand or bottom border.

**Overlap Matches**

- $F(0,j) = 0$, for $j = 1,\ldots,m$
- $F(i,0) = 0$, for $i = 1,\ldots,n$

$$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i,y_j) \\ F(i-1,j) - d \\ F(i,j-1) - d \end{cases}$$

**Matrix**

[Figure: example overlap-matches matrix]

**Hybrid Match Conditions**

- Different types of alignment can be created by:
  - adjusting the right-hand side of this formula: F(i,j) = max {….
  - adjusting the traceback
- Example: we want to align two sequences from the beginning of both sequences until a local alignment has been found.
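The two Smith-Waterman changes described above (the extra 0 in the recurrence, and a traceback that starts at the highest cell) can be sketched in Python as follows. This is a hedged illustration rather than the presenters' implementation: `smith_waterman`, `simple_score` (+2 match, -1 mismatch) and `d = 2` are hypothetical names and values. Only the plain local-alignment variant is shown; the repeated, overlap and hybrid variants change the border conditions and the traceback start as listed on their slides.

```python
def smith_waterman(x, y, s, d):
    """Local alignment with a linear gap penalty d and scoring function s."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = [
                (0, None),  # difference 1: an alignment may restart here
                (F[i - 1][j - 1] + s(x[i - 1], y[j - 1]), "diag"),
                (F[i - 1][j] - d, "up"),
                (F[i][j - 1] - d, "left"),
            ]
            F[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)

    # Difference 2: trace back from the highest cell until a 0 cell is reached.
    aligned_x, aligned_y = [], []
    i, j = best_pos
    while ptr[i][j] is not None:
        move = ptr[i][j]
        if move == "diag":
            aligned_x.append(x[i - 1])
            aligned_y.append(y[j - 1])
            i, j = i - 1, j - 1
        elif move == "up":
            aligned_x.append(x[i - 1])
            aligned_y.append("-")
            i -= 1
        else:
            aligned_x.append("-")
            aligned_y.append(y[j - 1])
            j -= 1
    return best, "".join(reversed(aligned_x)), "".join(reversed(aligned_y))


def simple_score(a, b):
    """Toy substitution score (an assumption, not BLOSUM50)."""
    return 2 if a == b else -1


print(smith_waterman("TTACGTT", "GGACGGG", simple_score, d=2))
# → (6, 'ACG', 'ACG')
```

The toy scores satisfy both conditions from the slide: mismatches are negative, so random stretches drift back to 0, and matches are positive, so some alignment can be found.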
**Summary**

- Probability theory is important for sequence analysis.
- Goal: determine whether two sequences are related.
- For that, we need to find an optimal alignment between those sequences using algorithms.
- A scoring model is required to rank different alignments.
- There are different algorithms for different types of alignments.
  - They all use dynamic programming.