Sequence Alignment - III

Chitta Baral

Scoring Model
  • When comparing sequences
    • Looking for evidence that they have diverged from a common ancestor by a process of mutation and selection
  • Basic mutational processes
    • Substitutions;
    • insertions; deletions (together referred to as gaps)
  • Total Score
    • Sum of a term for each aligned pair plus a term for each gap
    • Corresponds to the logarithm of the relative likelihood that the sequences are related, compared to being unrelated
    • Identities and conservative substitutions are expected to be more likely in real alignments than by chance: they contribute positive score terms
    • Non-conservative changes are observed less frequently in real alignments than expected by chance: they contribute negative score terms
    • Additive scoring scheme: based on the assumption that mutations at different sites in a sequence occur independently
      • Reasonable for DNA and protein sequences
      • Inaccurate for structural RNAs
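
A minimal Python sketch of the additive scheme above: one term per aligned pair plus one term per gap. The names `alignment_score`, `toy_s`, and `toy_gap`, the match/mismatch values, and the per-gap penalty are assumptions chosen only for illustration; they are not taken from the slides.

```python
def alignment_score(aligned_x, aligned_y, s, gap_penalty):
    """Add s(a, b) per aligned pair; subtract gap_penalty(g) per gap of length g."""
    if len(aligned_x) != len(aligned_y):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    run_len = 0      # length of the gap currently being read
    run_side = None  # which sequence holds the '-' characters of that gap

    for a, b in zip(aligned_x, aligned_y):
        side = "x" if a == "-" else "y" if b == "-" else None
        # A gap ends when the column type changes; charge it once, as a whole.
        if run_len and side != run_side:
            total -= gap_penalty(run_len)
            run_len = 0
        if side is None:
            total += s(a, b)   # term for an aligned pair
        else:
            run_len += 1       # extend the current gap
        run_side = side

    if run_len:                # gap that reaches the end of the alignment
        total -= gap_penalty(run_len)
    return total


def toy_s(a, b):
    """Hypothetical substitution score: +1 for identity, -1 otherwise."""
    return 1 if a == b else -1


def toy_gap(g):
    """Hypothetical gap penalty: 2 per gapped position."""
    return 2 * g


example_score = alignment_score("GATTACA", "GA--ACT", toy_s, toy_gap)
print(example_score)
```

Because every term is simply added, the score of an alignment is the sum of the scores of its parts, which is what makes dynamic programming applicable.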
Substitution Matrices
  • Notation: a pair of sequences x[1..n] and y[1..m]
    • Let $x_i$ be the ith symbol in x
    • And $y_j$ be the jth symbol in y
    • Let $p_{x_i y_i}$ be the probability that $x_i$ and $y_i$ occur as an aligned pair in related sequences
    • Let $q_{x_i}$ be the probability that we have $x_i$ by chance
      • Frequency of occurrence of $x_i$
  • Score: $\log \left[ P(x \text{ and } y \mid \text{related}) \,/\, P(x \text{ and } y \mid \text{unrelated}) \right]$
  • $P(x \text{ and } y \mid \text{related}) = p_{x_1 y_1}\, p_{x_2 y_2} \cdots$ (ungapped alignment, with $x_i$ aligned to $y_i$)
  • $P(x \text{ and } y \mid \text{unrelated}) = q_{x_1} q_{x_2} \cdots \times q_{y_1} q_{y_2} \cdots$
  • Odds ratio: $\dfrac{p_{x_1 y_1}}{q_{x_1} q_{y_1}} \times \dfrac{p_{x_2 y_2}}{q_{x_2} q_{y_2}} \times \cdots$
  • Log-odds ratio: $s(x_1, y_1) + s(x_2, y_2) + \cdots$
    • Where $s(a,b) = \log \dfrac{p_{ab}}{q_a q_b}$
    • The s(a,b) table is known as the score matrix or substitution matrix
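
The log-odds construction above can be sketched in Python as follows. The two-letter alphabet and the probabilities in `toy_p` and `toy_q` are made-up values for illustration only, and the function names are hypothetical.

```python
import math


def log_odds_matrix(p, q):
    """Build s(a, b) = log(p_ab / (q_a * q_b)) for every pair listed in p."""
    return {(a, b): math.log(p_ab / (q[a] * q[b])) for (a, b), p_ab in p.items()}


def ungapped_score(x, y, s):
    """Log-odds score of an ungapped alignment: sum of s(x_i, y_i)."""
    if len(x) != len(y):
        raise ValueError("an ungapped alignment needs sequences of equal length")
    return sum(s[(a, b)] for a, b in zip(x, y))


# Hypothetical background frequencies q_a over a two-letter alphabet.
toy_q = {"A": 0.5, "B": 0.5}
# Hypothetical joint probabilities p_ab of aligned pairs in related sequences.
toy_p = {("A", "A"): 0.4, ("B", "B"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1}

toy_s = log_odds_matrix(toy_p, toy_q)
print(toy_s[("A", "A")] > 0)   # identities are more likely than chance
print(toy_s[("A", "B")] < 0)   # this substitution is less likely than chance
print(ungapped_score("AAB", "ABB", toy_s))
```

Taking logarithms turns the product of per-position odds ratios into a sum, so the alignment score is the sum of table lookups.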
Gap Penalties
  • Also based on a probabilistic model of alignment
    • Less widely recognized than the probabilistic basis of substitution matrices
  • Gap of length g due to insertion of $a_1 \ldots a_g$
    • $p(\text{gap because of mutation}) = f(g)\,(q_{a_1} \cdots q_{a_g})$
    • $p(\text{having } a_1 \ldots a_g \text{ by chance}) = q_{a_1} \cdots q_{a_g}$
    • Ratio $= f(g)$
    • Log of ratio $= \log f(g)$
    • Geometric distribution: $f(g) = k \exp(-xg)$
    • Suppose $f(g) = \exp(-gd)$; then log of ratio $= -gd$ (linear score)
    • Suppose $f(g) = k \exp(-ge)$, where e in the exponent is a gap parameter and not the base of the natural logarithm; then
      $\log f(g) = -ge + \log k = -ge + e + (\log k - e) = -(e - \log k) - (g-1)e = -d - (g-1)e$, where $d = e - \log k$ (affine score)
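
A short numerical check of the derivation above, as a sketch: it confirms that log f(g) for f(g) = k exp(-ge) equals the affine score -d - (g-1)e with d = e - log k. The values of `k` and `e` are arbitrary assumptions, and the function names are hypothetical.

```python
import math


def linear_gap_score(g, d):
    """Linear score: log of f(g) = exp(-g*d)."""
    return -g * d


def affine_gap_score(g, d, e):
    """Affine score: -d for opening the gap, -e for each of the g-1 extensions."""
    return -d - (g - 1) * e


def log_f(g, k, e):
    """log f(g) for the geometric model f(g) = k * exp(-g*e)."""
    return math.log(k * math.exp(-g * e))


k, e = 0.05, 0.5           # assumed values, for illustration only
d = e - math.log(k)        # gap-open penalty implied by the derivation

for g in range(1, 6):
    print(g, log_f(g, k, e), affine_gap_score(g, d, e))
```

With k < 1 the opening penalty d exceeds the extension penalty e, so a single long gap costs less than several short gaps of the same total length.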

Repeated matches
  • A big string x[1..n] and smaller string y[1..m]
  • Asymmetric: looking for multiple matches of y in x.
  • As we do the matching and fill the table, we need to decide when to stop going further in y, and start over from the beginning of y.
  • F(i,0): Assuming xi is in an unmatched region, what is the best total score so far.
  • F(i,j), j >= 1: Assuming xi is in a matched region and the last matching ends at xi and yj, the best total score so far.
  • F(0,0) = 0.
  • F(i,j) = maximum of { F(i,0) ; F(i-1,j-1) + s(xi,yj) ; F(i-1,j) - d ; F(i,j-1) - d }
    • F(i,0) corresponds to start over option (but now we store the total score so far)
  • F(i,0) = maximum of
    • F(i-1,0)
    • F(i-1,j) - T, for j = 1, …, m
    • T is a threshold: we are only interested in matches scoring higher than T. (Important, because there are always short local alignments with small positive scores even between entirely unrelated sequences.)
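
The recurrences above can be sketched as a table-filling routine in Python. Two details are assumptions not stated on the slide: row 0 is initialised to 0 for every j, and the final total is read from one extra cell computed with the same rule as F(i,0). The scoring function `toy_s` and the parameter values are also hypothetical.

```python
def repeated_match_score(x, y, s, d, T):
    """Fill F for repeated matches of y in x; return (final total score, F)."""
    n, m = len(x), len(y)
    if m == 0:
        raise ValueError("y must be non-empty")
    # F[i][j] uses 1-based positions; row 0 is all zeros (assumed initialisation).
    F = [[0] * (m + 1) for _ in range(n + 1)]

    for i in range(1, n + 1):
        # Unmatched state: stay unmatched, or close a match that ended at x[i-1],
        # paying the threshold T.
        F[i][0] = max(F[i - 1][0], max(F[i - 1][j] - T for j in range(1, m + 1)))
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i][0],                                    # start over in y
                F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),    # aligned pair
                F[i - 1][j] - d,                            # gap in y
                F[i][j - 1] - d,                            # gap in x
            )

    # Extra final cell (assumption): close any match still open at the end of x.
    final = max(F[n][0], max(F[n][j] - T for j in range(1, m + 1)))
    return final, F


def toy_s(a, b):
    """Hypothetical substitution score: +2 for identity, -1 otherwise."""
    return 2 if a == b else -1


total, table = repeated_match_score("ACGTTTACGT", "ACGT", toy_s, d=2, T=3)
print(total)
```

In this example y occurs twice in x; each copy scores 8 and pays the threshold 3 once, so the two matches together contribute 10. Raising T above any achievable match score leaves the total at 0.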
Next
  • Alignment with affine gap scores.
  • Heuristic based approach.