Sequence alignment iii





Chitta Baral


Scoring model


  • When comparing sequences

    • Looking for evidence that they have diverged from a common ancestor by a process of mutation and selection

  • Basic mutational processes

    • Substitutions

    • Insertions and deletions (together referred to as gaps)

  • Total Score

    • Sum of a term for each aligned pair plus a term for each gap

    • Corresponds to the logarithm of the relative likelihood that the sequences are related, compared to being unrelated

    • Identities and conservative substitutions are expected to be more likely in alignments than by chance: they contribute positive score terms

    • Non-conservative changes are observed less frequently in real alignments than expected by chance: they contribute negative score terms

    • Additive scoring scheme: based on the assumption that mutations at different sites in a sequence occur independently

      • Reasonable for DNA and protein sequences

      • Inaccurate for structural RNAs
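
The additive scheme above can be sketched in a few lines of Python. This is only an illustration: the function names, the `-` gap symbol, the toy match/mismatch scores, and the per-symbol gap cost `d` are assumptions made for the example, not values from the slides.

```python
def alignment_score(a, b, s, d):
    """Additive score of a gapped alignment given as two equal-length strings.

    s(ca, cb) scores an aligned pair; each gap symbol '-' costs d
    (a linear gap term, assumed here for simplicity).
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for ca, cb in zip(a, b):
        if ca == "-" or cb == "-":
            total -= d          # term for a gap position
        else:
            total += s(ca, cb)  # term for an aligned pair
    return total


def match_mismatch(ca, cb):
    # Toy scores (assumed): +1 for an identity, -1 for a substitution.
    return 1 if ca == cb else -1


score = alignment_score("GATT-ACA", "GACTTACA", match_mismatch, d=2)
```

Because every column contributes its own term, the total is just a sum over columns, which is what the independence assumption buys.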


Substitution matrices


  • Notation: a pair of sequences x[1..n] and y[1..m]

    • Let $x_i$ be the ith symbol in x

    • And $y_j$ be the jth symbol in y

    • Let $p_{x_i y_i}$ be the probability that $x_i$ and $y_i$ occur as an aligned pair, supposing the sequences are related

    • Let $q_{x_i}$ be the probability that we have $x_i$ by chance

      • Frequency of occurrence of $x_i$

  • Score: $\log \dfrac{P(x, y \mid \text{related})}{P(x, y \mid \text{unrelated})}$

  • $P(x, y \mid \text{related}) = p_{x_1 y_1}\, p_{x_2 y_2} \cdots$

  • $P(x, y \mid \text{unrelated}) = q_{x_1} q_{x_2} \cdots \times q_{y_1} q_{y_2} \cdots$

  • Odds ratio: $\dfrac{p_{x_1 y_1}}{q_{x_1} q_{y_1}} \times \dfrac{p_{x_2 y_2}}{q_{x_2} q_{y_2}} \times \cdots$

  • Log-odds ratio: $s(x_1, y_1) + s(x_2, y_2) + \cdots$

    • Where $s(a, b) = \log \dfrac{p_{ab}}{q_a q_b}$

    • The $s(a, b)$ table is known as the score matrix or substitution matrix
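
As a rough sketch of the log-odds construction, the snippet below builds a score table from pair probabilities and background frequencies and then scores an ungapped alignment. The two-letter alphabet and all probability values are made up for illustration; `log_odds_matrix` and `ungapped_score` are hypothetical helper names.

```python
import math


def log_odds_matrix(p, q):
    """Return s(a, b) = log(p_ab / (q_a * q_b)) for every pair in p."""
    return {(a, b): math.log(p_ab / (q[a] * q[b])) for (a, b), p_ab in p.items()}


def ungapped_score(x, y, s):
    """Log-odds score of an ungapped alignment: the sum of s(x_i, y_i)."""
    return sum(s[(a, b)] for a, b in zip(x, y))


# Toy model (assumed values): background frequencies q and pair probabilities p.
q = {"A": 0.5, "B": 0.5}
p = {("A", "A"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1, ("B", "B"): 0.4}

s = log_odds_matrix(p, q)
total = ungapped_score("AAB", "ABB", s)
```

In this toy model identities occur more often than chance (0.4 versus 0.25), so they get positive scores, while the mismatched pair (0.1 versus 0.25) gets a negative one.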


Gap penalties


  • Also based on a probabilistic model of alignment

    • Less widely recognized than the probabilistic basis of substitution matrices

  • Gap of length g due to insertion of $a_1 \ldots a_g$

    • $P(\text{gap because of mutation}) = f(g)\,(q_{a_1} \cdots q_{a_g})$

    • $P(\text{having } a_1 \ldots a_g \text{ by chance}) = q_{a_1} \cdots q_{a_g}$

    • Ratio $= f(g)$

    • Log of ratio $= \log f(g)$

    • Geometric distribution: $f(g) = k \exp(-xg)$

    • Suppose $f(g) = \exp(-gd)$; then log of ratio $= -gd$ (linear score)

    • Suppose $f(g) = k \exp(-ge)$, where e is a gap-extension parameter (not the base of the logarithm); then log of ratio $= -ge + \log k$

      $= -ge + e + (\log k - e) = -(e - \log k) - (g - 1)e$

      $= -d - (g - 1)e$, where $d = e - \log k$ (affine score)
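
The derivation above can be checked numerically. The sketch below assumes arbitrary values for `k` and `e` (they are not from the slides) and confirms that the log of $f(g) = k \exp(-ge)$ equals the affine score $-d - (g-1)e$ with $d = e - \log k$; the function names are hypothetical.

```python
import math


def linear_gap(g, d):
    """Linear gap score: -g * d."""
    return -g * d


def affine_gap(g, d, e):
    """Affine gap score: -d to open the gap, -e for each further position."""
    return -d - (g - 1) * e


def log_f(g, k, e):
    """Log of f(g) = k * exp(-g * e)."""
    return math.log(k) - g * e


k, e = 0.05, 1.0       # assumed illustrative values
d = e - math.log(k)    # gap-open penalty implied by the derivation

checks = [math.isclose(log_f(g, k, e), affine_gap(g, d, e)) for g in range(1, 6)]
```

With k < 1 the implied gap-open penalty d exceeds e, so opening a gap costs more than extending one.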


Repeated matches


  • A big string x[1..n] and smaller string y[1..m]

  • Asymmetric: looking for multiple matches of y in x.

  • As we do the matching and fill the table, we need to decide when to stop going further in y, and start over from the beginning of y.

  • $F(i,0)$: assuming $x_i$ is in an unmatched region, the best total score so far.

  • $F(i,j)$, $j \ge 1$: assuming $x_i$ is in a matched region and the last match ends at $x_i$ and $y_j$, the best total score so far.

  • $F(0,0) = 0$.

  • $F(i,j) = \max\{\, F(i,0),\ F(i-1,j-1) + s(x_i, y_j),\ F(i-1,j) - d,\ F(i,j-1) - d \,\}$

    • $F(i,0)$ corresponds to the start-over option (but now we store the total score so far)

  • $F(i,0)$ = maximum of

    • $F(i-1,0)$

    • $F(i-1,j) - T$, for $j = 1, \ldots, m$

  • T is a threshold: we are only interested in matches scoring higher than T. This matters because there are always short local alignments with small positive scores, even between entirely unrelated sequences.
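
A minimal sketch of the table fill for the recurrences above, in Python. Several details are assumptions not stated on the slide: the function name `repeated_matches`, the boundary row F(0,j) = 0, the final step that closes any match still open at the end of x, and the toy scoring values. No traceback is included.

```python
def repeated_matches(x, y, s, d, T):
    """Fill the repeated-matches table F for text x and pattern y.

    s(a, b): substitution score; d: linear gap penalty; T: match threshold.
    Returns (F, total), where total is the best overall score.
    """
    n, m = len(x), len(y)
    # Assumed boundary: F(0, j) = 0 for all j.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        # Unmatched state: stay unmatched, or close a match ending at x_{i-1}.
        F[i][0] = max(F[i - 1][0], max(F[i - 1][1:]) - T)
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i][0],                                   # start over
                F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),   # aligned pair
                F[i - 1][j] - d,                           # gap in y
                F[i][j - 1] - d,                           # gap in x
            )
    # Assumed final step: apply the F(i,0) rule once more after the last symbol.
    total = max(F[n][0], max(F[n][1:]) - T)
    return F, total


def toy_score(a, b):
    # Assumed toy scores: +3 for an identity, -3 for a substitution.
    return 3 if a == b else -3


F, total = repeated_matches("ABXXAB", "AB", toy_score, d=4, T=2)
```

In this toy run each of the two copies of `AB` in x scores 6 and pays the threshold T = 2 once, so the total is 8. Raising T above 6 would leave no match worth reporting and the total would fall to 0.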


Illustration of repeated matches


Next

  • Alignment with affine gap scores.

  • Heuristic based approach.

