### Sequence Alignment - III

Chitta Baral

Scoring Model

- When comparing sequences
- Looking for evidence that they have diverged from a common ancestor by a process of mutation and selection
- Basic mutational processes
- Substitutions;
- insertions; deletions (together referred to as gaps)
- Total Score
- sum for each aligned pair + terms for each gap
- Corresponds to: the logarithm of the relative likelihood that the sequences are related, compared to being unrelated.
- Identities and conservative substitutions are expected to be more likely in real alignments than by chance: they contribute positive score terms
- Non-conservative changes are observed less frequently in real alignments than we expect by chance: they contribute negative score terms
- Additive scoring scheme: based on the assumption that mutations at different sites in a sequence occur independently
- Reasonable for DNA and protein sequences
- Inaccurate for structural RNAs
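
The additive scheme above can be sketched in a few lines of Python. This is only an illustration: the function names, the +1/-1 substitution scores, and the per-position gap cost `d` are assumptions chosen for the example, not values from the slides.

```python
def alignment_score(a, b, s, d):
    """Score two aligned (gapped) strings of equal length; '-' marks a gap."""
    if len(a) != len(b):
        raise ValueError("aligned strings must have equal length")
    total = 0
    for ca, cb in zip(a, b):
        if ca == "-" or cb == "-":
            total -= d           # one gap term per gapped position (assumed linear)
        else:
            total += s(ca, cb)   # one substitution term per aligned pair
    return total


def simple_s(a, b):
    """Hypothetical substitution score: +1 for identity, -1 otherwise."""
    return 1 if a == b else -1


print(alignment_score("GATTACA", "GA-TGCA", simple_s, d=2))  # prints 2
```

Each column contributes independently to the total, which is exactly the independence assumption stated in the list above.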

Substitution Matrices

- Notation: a pair of sequences x[1..n] and y[1..m]
- Let $x_i$ be the ith symbol in x
- And $y_j$ be the jth symbol in y
- Let $p_{x_i y_i}$ be the probability of seeing $x_i$ aligned with $y_i$ when the sequences are related
- Let $q_{x_i}$ be the probability that we have $x_i$ by chance
- Frequency of occurrence of $x_i$
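- Score: $\log \dfrac{P(x \text{ and } y \text{ supposing they are related})}{P(x \text{ and } y \text{ supposing they are unrelated})}$
- $P(x \text{ and } y \text{ supposing they are related}) = p_{x_1 y_1}\, p_{x_2 y_2} \cdots$
- $P(x \text{ and } y \text{ supposing they are unrelated}) = q_{x_1} q_{x_2} \cdots \times q_{y_1} q_{y_2} \cdots$
- Odds ratio: $\dfrac{p_{x_1 y_1}}{q_{x_1} q_{y_1}} \times \dfrac{p_{x_2 y_2}}{q_{x_2} q_{y_2}} \times \cdots$
- Log-odds ratio: $s(x_1, y_1) + s(x_2, y_2) + \cdots$
- Where $s(a,b) = \log \dfrac{p_{ab}}{q_a q_b}$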
- The s(a,b) table is known as the score matrix or substitution matrix
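
As a rough illustration of the log-odds construction, the sketch below builds a score matrix from pair probabilities and background frequencies. The two-letter alphabet and all probability values are invented for the example; real matrices are estimated from curated alignments.

```python
import math


def log_odds_matrix(p, q):
    """Return s(a, b) = log(p_ab / (q_a * q_b)) for every pair listed in p."""
    return {(a, b): math.log(p_ab / (q[a] * q[b])) for (a, b), p_ab in p.items()}


# Toy two-letter alphabet; these numbers are assumptions, not real data.
q = {"A": 0.5, "B": 0.5}
p = {("A", "A"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1, ("B", "B"): 0.4}
s = log_odds_matrix(p, q)

# The sum of per-position scores equals the log of the product of odds ratios.
x, y = "AAB", "ABB"
score = sum(s[(a, b)] for a, b in zip(x, y))
odds = math.prod(p[(a, b)] / (q[a] * q[b]) for a, b in zip(x, y))

print(s[("A", "A")] > 0, s[("A", "B")] < 0)   # identities positive, mismatches negative
print(math.isclose(score, math.log(odds)))
```

With these toy numbers, identities are more likely than chance and score positively, while mismatches are less likely than chance and score negatively.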

Gap Penalties

- Also based on a probabilistic model of alignment
- Less widely recognized than the probabilistic basis of substitution matrices
- Gap of length g due to insertion of $a_1 \ldots a_g$
- $P(\text{gap because of mutation}) = f(g)\,(q_{a_1} \cdots q_{a_g})$
- $P(\text{having } a_1 \ldots a_g \text{ by chance}) = q_{a_1} \cdots q_{a_g}$
- Ratio $= f(g)$
- Log of ratio $= \log f(g)$
- Geometric distribution: $f(g) = k \exp(-xg)$
- Suppose $f(g) = \exp(-gd)$; then log of ratio $= -gd$ (linear score)
- Suppose $f(g) = k \exp(-ge)$; then log of ratio $= -ge + \log k$

$= -ge + e + (\log k - e) = -(e - \log k) - (g - 1)e$

$= -d - (g - 1)e$, where $d = e - \log k$ (affine score)
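
The two gap scores can be written as small functions, and the affine derivation checked numerically. This is a sketch only: the function names and the values of `k` and `e` are assumptions picked for the check.

```python
import math


def linear_gap(g, d):
    """Linear gap score: -g * d for a gap of length g."""
    return -g * d


def affine_gap(g, d, e):
    """Affine gap score: -d to open the gap, -e for each further position."""
    return -d - (g - 1) * e


# Assumed illustrative parameters for f(g) = k * exp(-g * e).
k, e = 0.05, 0.5
d = e - math.log(k)          # gap-open penalty implied by the derivation

for g in range(1, 6):
    log_ratio = math.log(k * math.exp(-g * e))
    # log f(g) should match the affine form for every gap length
    assert math.isclose(log_ratio, affine_gap(g, d, e))
```

Because `k` is less than 1 here, `d` comes out larger than `e`, so opening a gap costs more than extending one.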

Repeated matches

- A big string x[1..n] and smaller string y[1..m]
- Asymmetric: looking for multiple matches of y in x.
- As we do the matching and fill the table, we need to decide when to stop going further in y, and start over from the beginning of y.
- $F(i,0)$: assuming $x_i$ is in an unmatched region, the best total score so far.
- $F(i,j)$, $j \ge 1$: assuming $x_i$ is in a matched region and the last match ends at $x_i$ and $y_j$, the best total score so far.
- $F(0,0) = 0$.
- $F(i,j) = \max \{\, F(i,0);\ F(i-1,j-1) + s(x_i,y_j);\ F(i-1,j) - d;\ F(i,j-1) - d \,\}$
- F(i,0) corresponds to the start-over option (but now we store the total score so far)
- $F(i,0) = \max \{\, F(i-1,0);\ F(i-1,j) - T,\ j = 1, \ldots, m \,\}$
- T is a threshold and we are only interested in matches scoring higher than the threshold. (Important: because there are always short local alignments with small positive scores even between entirely unrelated sequences.)
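
A minimal sketch of filling this table is shown below. Several details are assumptions not stated on the slide: row 0 is initialised to zeros, the total is read from one extra application of the F(i,0) rule after the last symbol of x, y is non-empty, and the scores (+2 match, -2 mismatch, d = 3, T = 3) are invented for the example.

```python
def repeated_matches(x, y, s, d, T):
    """Fill the repeated-match table F and return (total score, F)."""
    n, m = len(x), len(y)
    # Rows 0..n+1, columns 0..m. Row 0 is all zeros (assumed boundary condition).
    F = [[0] * (m + 1) for _ in range(n + 2)]
    for i in range(1, n + 2):
        # F(i,0): stay unmatched, or close a match ending at row i-1 and pay T.
        F[i][0] = max(F[i - 1][0], max(F[i - 1][j] - T for j in range(1, m + 1)))
        if i > n:
            break  # the extra row only holds the final total
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i][0],                                   # start over in y
                F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),   # align x_i with y_j
                F[i - 1][j] - d,                           # gap: x_i unaligned
                F[i][j - 1] - d,                           # gap: y_j unaligned
            )
    return F[n + 1][0], F


def match_score(a, b):
    """Hypothetical substitution score: +2 for identity, -2 otherwise."""
    return 2 if a == b else -2


total, _ = repeated_matches("ACGTACG", "ACG", match_score, d=3, T=3)
print(total)  # prints 6: two matches of score 6, each reduced by T = 3
```

Traceback pointers are omitted to keep the sketch short; recovering the matched regions would require recording which option produced each maximum.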

Next

- Alignment with affine gap scores.
- Heuristic based approach.
