Sequence Alignment - III

Presentation Transcript


  1. Sequence Alignment - III Chitta Baral

  2. Scoring Model
     • When comparing sequences, we look for evidence that they have diverged from a common ancestor by a process of mutation and selection.
     • Basic mutational processes:
       • substitutions;
       • insertions and deletions (together referred to as gaps).
     • Total score: the sum of a term for each aligned pair plus a term for each gap.
     • This corresponds to the logarithm of the relative likelihood that the sequences are related, compared to being unrelated.
     • Identities and conservative substitutions are expected to be more likely in real alignments than by chance, so they contribute positive score terms.
     • Non-conservative changes are observed less frequently in real alignments than we would expect by chance, so they contribute negative score terms.
     • Additive scoring scheme: based on the assumption that mutations at different sites in a sequence occur independently.
       • Reasonable for DNA and protein sequences.
       • Inaccurate for structural RNAs.
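
The additive scheme on this slide can be sketched in a few lines of Python. This is only an illustration: the function names, the "-" gap symbol, the toy match/mismatch scores and the per-position gap cost `d` are assumptions made for the example, not values from the slides.

```python
def alignment_score(a, b, s, d):
    """Additive score of a given alignment.

    a, b : aligned strings of equal length, with "-" marking a gap (assumed convention)
    s    : function s(p, q) giving the score of an aligned pair
    d    : cost charged for every gap position (a simple linear gap term)
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for ca, cb in zip(a, b):
        if ca == "-" or cb == "-":
            total -= d           # gap term
        else:
            total += s(ca, cb)   # aligned-pair term
    return total


def simple_score(p, q):
    """Hypothetical toy scores: +2 for an identity, -1 for any mismatch."""
    return 2 if p == q else -1


print(alignment_score("GATTACA", "GA-TTCA", simple_score, d=3))
```

Because every column contributes independently, the total is just a sum over columns, which is what makes dynamic programming over such scores possible.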

  3. Substitution Matrices
     • Notation: a pair of sequences $x[1..n]$ and $y[1..m]$.
     • Let $x_i$ be the $i$th symbol in $x$ and $y_j$ the $j$th symbol in $y$.
     • Let $p_{x_i y_i}$ be the probability that $x_i$ and $y_i$ are related (an ungapped alignment is assumed here, so $x_i$ is paired with $y_i$).
     • Let $q_{x_i}$ be the probability that we have $x_i$ by chance, i.e. the frequency of occurrence of $x_i$.
     • Score: $\log \dfrac{P(x, y \mid \text{related})}{P(x, y \mid \text{unrelated})}$
     • $P(x, y \mid \text{related}) = p_{x_1 y_1}\, p_{x_2 y_2} \cdots$
     • $P(x, y \mid \text{unrelated}) = q_{x_1} q_{x_2} \cdots \times q_{y_1} q_{y_2} \cdots$
     • Odds ratio: $\dfrac{p_{x_1 y_1}}{q_{x_1} q_{y_1}} \times \dfrac{p_{x_2 y_2}}{q_{x_2} q_{y_2}} \times \cdots$
     • Log-odds ratio: $s(x_1, y_1) + s(x_2, y_2) + \cdots$, where $s(a, b) = \log \dfrac{p_{ab}}{q_a q_b}$.
     • The table of $s(a, b)$ values is known as the score matrix or substitution matrix.
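
As a rough sketch of the log-odds definition above, the following Python builds a tiny substitution matrix and scores an ungapped alignment. The two-letter alphabet and all probabilities in `p` and `q` are made-up numbers chosen for illustration, and the function names are hypothetical.

```python
import math


def log_odds_score(p_ab, q_a, q_b):
    """s(a, b) = log(p_ab / (q_a * q_b)), using the natural logarithm."""
    return math.log(p_ab / (q_a * q_b))


# Hypothetical background frequencies q_a and pair probabilities p_ab.
q = {"A": 0.5, "B": 0.5}
p = {("A", "A"): 0.4, ("B", "B"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1}

# The substitution matrix as a dictionary keyed by residue pair.
s = {pair: log_odds_score(p_ab, q[pair[0]], q[pair[1]]) for pair, p_ab in p.items()}


def ungapped_log_odds(x, y, s):
    """Sum of s(x_i, y_i) over an ungapped alignment of equal-length strings."""
    if len(x) != len(y):
        raise ValueError("ungapped alignment needs equal-length sequences")
    return sum(s[(a, b)] for a, b in zip(x, y))


print(s[("A", "A")], s[("A", "B")])
print(ungapped_log_odds("AAB", "ABB", s))
```

With these numbers identities get a positive score and the mismatch a negative one, since identities are assumed to occur more often in related pairs than chance would predict.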

  4. Gap Penalties
     • Also based on a probabilistic model of alignment, although this basis is less widely recognized than the probabilistic basis of substitution matrices.
     • Consider a gap of length $g$ due to the insertion of $a_1 \ldots a_g$:
       • $P(\text{gap because of mutation}) = f(g)\, q_{a_1} \cdots q_{a_g}$
       • $P(\text{having } a_1 \ldots a_g \text{ by chance}) = q_{a_1} \cdots q_{a_g}$
       • Ratio $= f(g)$, so the log of the ratio is $\log f(g)$.
     • Geometric distribution: $f(g) = k \exp(-x g)$.
     • Linear score: suppose $f(g) = \exp(-g d)$; then the log of the ratio is $-g d$.
     • Affine score: suppose $f(g) = k \exp(-g e)$, where $e$ is a penalty parameter and not the base of the natural logarithm; then
       $\log f(g) = -g e + \log k = -g e + e + (\log k - e) = -(e - \log k) - (g - 1) e = -d - (g - 1) e$, where $d = e - \log k$.
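
The two gap scores derived on this slide can be written down directly. The sketch below also checks numerically that $\log f(g)$ for $f(g) = k \exp(-g e)$ agrees with the affine form when $d = e - \log k$. The function names and the values of `k` and `e` are arbitrary assumptions for illustration.

```python
import math


def linear_gap_score(g, d):
    """Linear gap score: -g * d for a gap of length g."""
    return -g * d


def affine_gap_score(g, d, e):
    """Affine gap score: -d for the first gap position, -e for each further one."""
    return -d - (g - 1) * e


def log_gap_ratio(g, k, e):
    """log f(g) for the assumed form f(g) = k * exp(-g * e)."""
    return math.log(k) - g * e


k, e = 0.1, 0.5          # hypothetical parameter values
d = e - math.log(k)      # as defined in the derivation above
for g in range(1, 6):
    print(g, log_gap_ratio(g, k, e), affine_gap_score(g, d, e))
```

The two printed columns should agree up to floating-point rounding, which is the content of the rearrangement on the slide.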

  5. Repeated matches
     • A big string $x[1..n]$ and a smaller string $y[1..m]$.
     • Asymmetric: we are looking for multiple matches of $y$ in $x$.
     • As we do the matching and fill the table, we need to decide when to stop going further in $y$ and start over from the beginning of $y$.
     • $F(i, 0)$: assuming $x_i$ is in an unmatched region, the best total score so far.
     • $F(i, j)$, $j \ge 1$: assuming $x_i$ is in a matched region and the last match ends at $x_i$ and $y_j$, the best total score so far.
     • $F(0, 0) = 0$.
     • $F(i, j) = \max \{\, F(i, 0);\ F(i-1, j-1) + s(x_i, y_j);\ F(i-1, j) - d;\ F(i, j-1) - d \,\}$
       • The $F(i, 0)$ term corresponds to the start-over option (but now we store the total score so far).
     • $F(i, 0) = \max \{\, F(i-1, 0);\ F(i-1, j) - T \text{ for } j = 1, \ldots, m \,\}$
     • $T$ is a threshold: we are only interested in matches scoring higher than $T$. This is important because there are always short local alignments with small positive scores, even between entirely unrelated sequences.
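
A minimal Python sketch of the table fill for the recurrences above follows. Several details are assumptions not stated on the slide: row 0 is initialised to zero for every $j$, the final total is read off by applying the $F(i, 0)$ rule once more after the last row, there is no traceback, and the toy scores (+3 match, -3 mismatch, $d = 4$, $T = 4$) are invented for the example.

```python
def repeated_matches(x, y, s, d, T):
    """Fill the repeated-match table F and return (F, total score).

    x, y : sequences (x is the long one, y the short one)
    s    : function s(a, b) scoring an aligned pair
    d    : linear gap penalty
    T    : threshold subtracted each time a match is closed
    """
    n, m = len(x), len(y)
    if m == 0:
        raise ValueError("y must be non-empty")
    # Assumption: F(0, j) = 0 for all j (the slide only states F(0, 0) = 0).
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        # Unmatched state: stay unmatched, or close a match ending in row i-1.
        F[i][0] = max(F[i - 1][0], max(F[i - 1][j] - T for j in range(1, m + 1)))
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i][0],                                    # start a new match here
                F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),    # extend with an aligned pair
                F[i - 1][j] - d,                            # gap in y
                F[i][j - 1] - d,                            # gap in x
            )
    # Assumption: close any match that ends at the last symbol of x.
    total = max(F[n][0], max(F[n][j] - T for j in range(1, m + 1)))
    return F, total


def toy_score(a, b):
    """Hypothetical scores: +3 for an identity, -3 for a mismatch."""
    return 3 if a == b else -3


F, total = repeated_matches("TGCTGC", "GC", toy_score, d=4, T=4)
print(total)
```

In this toy run each copy of "GC" scores 6 and pays the threshold 4 when it is closed, so two copies contribute 2 each to the total.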

  6. Illustration of repeated matches

  7. Next
     • Alignment with affine gap scores.
     • Heuristic-based approach.