# Sequence Alignment - III

## Sequence Alignment - III

Chitta Baral

### Scoring Model

• When comparing sequences, we look for evidence that they have diverged from a common ancestor by a process of mutation and selection

• Basic mutational processes

• Substitutions

• Insertions and deletions (together referred to as gaps)

• Total Score

• Sum of a term for each aligned pair plus a term for each gap

• Corresponds to the logarithm of the relative likelihood that the sequences are related, compared to being unrelated

• Identities and conservative substitutions are expected to be more likely in real alignments than by chance: they contribute positive score terms

• Non-conservative changes are observed less frequently in real alignments than expected by chance: they contribute negative score terms

• Additive scoring scheme: based on the assumption that mutations at different sites in a sequence occur independently

• Reasonable for DNA and protein sequences

• Inaccurate for structural RNAs
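
The additive scheme above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function names, the `-` gap symbol, the toy scores (+2 for an identity, -1 for a mismatch), and the per-position gap cost `d` are all assumptions chosen for the example.

```python
def alignment_score(a, b, s, d):
    """Additive score of a gapped alignment given as two equal-length strings.

    s(ca, cb) is the substitution score for an aligned pair;
    d is a (hypothetical) cost charged for every gap position.
    """
    if len(a) != len(b):
        raise ValueError("aligned strings must have equal length")
    total = 0
    for ca, cb in zip(a, b):
        if ca == "-" or cb == "-":
            total -= d          # term for a gap position
        else:
            total += s(ca, cb)  # term for an aligned pair
    return total


def simple_score(ca, cb):
    # Assumed toy scores: +2 for an identity, -1 for any mismatch
    return 2 if ca == cb else -1


print(alignment_score("GATTACA", "GA-TCCA", simple_score, d=3))  # prints 6
```

Here five identities contribute +10, one mismatch contributes -1, and one gap position contributes -3. Because each column is scored on its own, the total is just the sum, which is exactly the independence assumption stated above.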

### Substitution Matrices

• Notation: a pair of sequences $x[1..n]$ and $y[1..m]$

• Let $x_i$ be the $i$th symbol in $x$

• And $y_j$ be the $j$th symbol in $y$

• Let $p_{x_i y_i}$ be the probability that $x_i$ and $y_i$ occur as an aligned pair in related sequences

• Let $q_{x_i}$ be the probability that we have $x_i$ by chance

• Frequency of occurrence of $x_i$

• Score: $\log \dfrac{P(x, y \mid \text{related})}{P(x, y \mid \text{unrelated})}$

• $P(x, y \mid \text{related}) = p_{x_1 y_1}\, p_{x_2 y_2} \cdots$

• $P(x, y \mid \text{unrelated}) = q_{x_1} q_{x_2} \cdots \times q_{y_1} q_{y_2} \cdots$

• Odds ratio: $\dfrac{p_{x_1 y_1}}{q_{x_1} q_{y_1}} \times \dfrac{p_{x_2 y_2}}{q_{x_2} q_{y_2}} \times \cdots$

• Log-odds ratio: $s(x_1, y_1) + s(x_2, y_2) + \cdots$

• Where $s(a,b) = \log \dfrac{p_{ab}}{q_a q_b}$

• The s(a,b) table is known as the score matrix or substitution matrix
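
As a rough illustration of the log-odds definition, the sketch below estimates a score table from a list of observed aligned pairs. The estimation procedure (plain frequency counts, with each pair counted in both orders) and the toy data are assumptions made for this example; the slides only define the score itself. Pairs that never occur get no entry, since their log-odds would be undefined.

```python
import math
from collections import Counter


def log_odds_matrix(aligned_pairs):
    """Estimate s(a, b) = log(p_ab / (q_a * q_b)) from observed aligned pairs."""
    pair_counts = Counter()
    symbol_counts = Counter()
    for a, b in aligned_pairs:
        # Count each pair in both orders so the matrix comes out symmetric
        pair_counts[(a, b)] += 1
        pair_counts[(b, a)] += 1
        symbol_counts[a] += 1
        symbol_counts[b] += 1
    total_pairs = sum(pair_counts.values())
    total_symbols = sum(symbol_counts.values())
    # q_a: background frequency of symbol a
    q = {a: c / total_symbols for a, c in symbol_counts.items()}
    # p_ab: frequency of the aligned pair (a, b); score is the log-odds
    return {
        (a, b): math.log((c / total_pairs) / (q[a] * q[b]))
        for (a, b), c in pair_counts.items()
    }


# Hypothetical toy data: 10 aligned pairs over the alphabet {A, C}
toy_pairs = [("A", "A")] * 6 + [("C", "C")] * 2 + [("A", "C")] * 2
s = log_odds_matrix(toy_pairs)
print(s[("A", "A")] > 0, s[("A", "C")] < 0)  # prints True True
```

In the toy data, identities are seen more often than chance would predict and so receive positive scores, while the A/C mismatch is seen less often and receives a negative score.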

### Gap Penalties

• Also based on a probabilistic model of alignment

• Less widely recognized than the probabilistic basis of substitution matrices

• Gap of length $g$ due to insertion of $a_1 \ldots a_g$

• $P(\text{gap because of mutation}) = f(g)\, q_{a_1} \cdots q_{a_g}$

• $P(\text{having } a_1 \ldots a_g \text{ by chance}) = q_{a_1} \cdots q_{a_g}$

• Ratio $= f(g)$

• Log of ratio $= \log f(g)$

• Geometric distribution: $f(g) = k \exp(-xg)$

• Suppose $f(g) = \exp(-gd)$; then log of ratio $= -gd$ (linear score)

• Suppose $f(g) = k \exp(-ge)$, where $e$ is a parameter and not the base of the exponential; then log of ratio $= -ge + \log k = -ge + e + (\log k - e) = -(e - \log k) - (g-1)e = -d - (g-1)e$, where $d = e - \log k$ (affine score)
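
The derivation above can be checked numerically. The following sketch defines the linear and affine gap scores and confirms that the affine form equals $\log f(g)$ when $d = e - \log k$. The function names and the values of `k` and `e` are hypothetical, picked only to exercise the identity.

```python
import math


def linear_gap(g, d):
    """Linear gap score: -g * d."""
    return -g * d


def affine_gap(g, d, e):
    """Affine gap score: -d for opening the gap, -e for each extension."""
    return -d - (g - 1) * e


def log_f(g, k, e):
    """log f(g) for f(g) = k * exp(-g * e)."""
    return math.log(k) - g * e


k, e = 0.05, 0.5          # assumed example parameters
d = e - math.log(k)       # gap-open penalty implied by the derivation
for g in range(1, 6):
    # The two expressions agree for every gap length
    assert math.isclose(log_f(g, k, e), affine_gap(g, d, e))
```

Since $k < 1$ here, $\log k$ is negative and $d > e$: opening a gap costs more than extending one, which is the usual motivation for the affine form.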

### Repeated matches

• A big string x[1..n] and smaller string y[1..m]

• Asymmetric: looking for multiple matches of y in x.

• As we do the matching and fill the table, we need to decide when to stop going further in y, and start over from the beginning of y.

• $F(i,0)$: assuming $x_i$ is in an unmatched region, the best total score so far.

• $F(i,j)$, $j \ge 1$: assuming $x_i$ is in a matched region and the last match ends at $x_i$ and $y_j$, the best total score so far.

• $F(0,0) = 0$.

• $F(i,j) = \max \{\, F(i,0);\ F(i-1,j-1) + s(x_i, y_j);\ F(i-1,j) - d;\ F(i,j-1) - d \,\}$

• $F(i,0)$ corresponds to the start-over option (but now we store the total score so far)

• $F(i,0) = \max \{\, F(i-1,0);\ F(i-1,j) - T,\ j = 1, \ldots, m \,\}$

• T is a threshold and we are only interested in matches scoring higher than the threshold. (Important: because there are always short local alignments with small positive scores even between entirely unrelated sequences.)
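
The recurrences above can be turned into a short table-filling routine. This is a sketch under stated assumptions rather than the slides' own implementation: row 0 is initialised to 0 throughout (the slides give only $F(0,0) = 0$), the final total is read off by applying the $F(i,0)$ rule once more past the end of $x$, traceback is omitted, and the toy scores (+2 match, -1 mismatch) and the values of `d` and `T` are hypothetical.

```python
def repeated_match_table(x, y, s, d, T):
    """Fill the repeated-match table F for a long x and a short, non-empty y.

    Returns (F, final), where final is the best total score over all of x.
    """
    n, m = len(x), len(y)
    # Assumption: the whole of row 0 is 0, so a match may begin anywhere in y
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        # Unmatched state: stay unmatched, or close a match that ended at
        # x_{i-1} and keep only what it scored above the threshold T
        F[i][0] = max(F[i - 1][0],
                      max(F[i - 1][j] - T for j in range(1, m + 1)))
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i][0],                                   # start over
                F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),   # align x_i, y_j
                F[i - 1][j] - d,                           # gap in y
                F[i][j - 1] - d,                           # gap in x
            )
    # Apply the F(i,0) rule one more time to close a match ending at x_n
    final = max(F[n][0], max(F[n][j] - T for j in range(1, m + 1)))
    return F, final


def simple_score(a, b):
    # Assumed toy scores: +2 for an identity, -1 for any mismatch
    return 2 if a == b else -1


F, final = repeated_match_table("TTACGTTTACGTT", "ACG", simple_score, d=2, T=3)
print(final)  # prints 6
```

In this example `ACG` occurs twice in `x`. Each occurrence scores 6, and each contributes 6 - T = 3 once it is closed, giving a total of 6. Note that $F(i,0)$ must be computed before the rest of row $i$, because every $F(i,j)$ uses it as the start-over option.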

### Next

• Alignment with affine gap scores.

• Heuristic based approach.