
CS 5263 Bioinformatics


Presentation Transcript


  1. CS 5263 Bioinformatics Lecture 6: Sequence Alignment Statistics

  2. Review of last lecture • How to map gaps more accurately?

     GACGCCGAACG                GACGCCGAACG
     |||||   |||                |||| | | ||
     GACGC---ACG                GACG-C-A-CG
     Score = 8 x m – 3 x d      Score = 8 x m – 3 x d

     • Gaps usually occur in bunches • During evolution, chunks of DNA may be lost or inserted entirely • Aligning genomic sequences vs. cDNAs: cDNAs are spliced versions of the genomic seqs

  3. Model gaps more accurately • Previous model: a gap of length n incurs penalty n·d • General: a convex function γ(n), e.g. γ(n) = c · sqrt(n)

     F(i, j) = max { F(i-1, j-1) + s(xi, yj),
                     max_{k=0..i-1} F(k, j) – γ(i-k),
                     max_{k=0..j-1} F(i, k) – γ(j-k) }

     • Running time: O((M+N)MN) (cubic) • Space: O(NM)
     [Figure: a convex gap penalty γ(n) plotted against gap length n]
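
A minimal Python sketch (not from the slides) of the general-gap recurrence above; the function name, the boundary handling and the example scoring values are illustrative assumptions:

    import math

    def general_gap_align(x, y, sub, gamma):
        # Global alignment score with an arbitrary gap penalty gamma(n):
        #   F(i,j) = max( F(i-1,j-1) + s(x_i, y_j),
        #                 max_{k=0..i-1} F(k,j) - gamma(i-k),
        #                 max_{k=0..j-1} F(i,k) - gamma(j-k) )
        # Runs in O((M+N)*M*N) time and O(M*N) space, as stated on the slide.
        M, N = len(x), len(y)
        NEG = float("-inf")
        F = [[NEG] * (N + 1) for _ in range(M + 1)]
        F[0][0] = 0.0
        for i in range(1, M + 1):          # leading gap (x prefix unmatched)
            F[i][0] = -gamma(i)
        for j in range(1, N + 1):          # leading gap (y prefix unmatched)
            F[0][j] = -gamma(j)
        for i in range(1, M + 1):
            for j in range(1, N + 1):
                best = F[i - 1][j - 1] + sub(x[i - 1], y[j - 1])
                for k in range(i):         # x_{k+1}..x_i aligned to a gap of length i-k
                    best = max(best, F[k][j] - gamma(i - k))
                for k in range(j):         # y_{k+1}..y_j aligned to a gap of length j-k
                    best = max(best, F[i][k] - gamma(j - k))
                F[i][j] = best
        return F[M][N]

    # Illustrative call with the convex penalty gamma(n) = c * sqrt(n) from the slide
    # (match/mismatch values here are assumptions, not the lecture's):
    score = general_gap_align("GACGCCGAACG", "GACGCACG",
                              sub=lambda a, b: 2 if a == b else -2,
                              gamma=lambda n: 3 * math.sqrt(n))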

  4. Compromise: affine gaps • γ(n) = d + (n – 1)e, where d is the gap-open penalty and e is the gap-extension penalty • Match: 2, Gap open: -5, Gap extension: -1

     GACGCCGAACG            GACGCCGAACG
     |||||   |||            |||| | | ||
     GACGC---ACG            GACG-C-A-CG
     8x2 - 5 - 2 = 9        8x2 - 3x5 = 1

     • We want to find the optimal alignment with affine gap penalty in O(MN) time and O(MN), or better O(M+N), memory
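
A minimal sketch that re-derives the two example scores above by scoring an explicit alignment under γ(n) = d + (n – 1)e; the function name is my own, and the mismatch value is an assumption (the two examples contain no mismatches):

    def affine_alignment_score(ax, ay, match=2, mismatch=-2, d=-5, e=-1):
        # Score an explicit alignment (two equal-length gapped strings) under the
        # affine gap penalty gamma(n) = d + (n - 1)e, with the slide's values:
        # match = 2, gap open d = -5, gap extension e = -1.
        assert len(ax) == len(ay)
        score, in_gap = 0, False
        for a, b in zip(ax, ay):
            if a == "-" or b == "-":
                score += e if in_gap else d   # extend the current gap run, or open a new one
                in_gap = True                 # (consecutive gap columns are treated as one run)
            else:
                score += match if a == b else mismatch
                in_gap = False
        return score

    print(affine_alignment_score("GACGCCGAACG", "GACGC---ACG"))  # 8*2 - 5 - 2*1 = 9
    print(affine_alignment_score("GACGCCGAACG", "GACG-C-A-CG"))  # 8*2 - 3*5 = 1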

  5. Dynamic programming • Consider three sub-problems when aligning x1..xi and y1..yj • F(i,j): best alignment (score) of x1..xi & y1..yj if xi aligns to yj • Ix(i,j): best alignment of x1..xi & y1..yj if yj aligns to a gap • Iy(i,j): best alignment of x1..xi & y1..yj if xi aligns to a gap
     [Figure: the three ways an alignment can end — xi over yj (F), a gap over yj (Ix), xi over a gap (Iy)]

  6. [Figure: a finite state machine over the states F, Ix and Iy. Each transition reads an input pair and emits an output score: transitions into F read (xi, yj) and output σ(xi, yj); the transition F→Ix reads (-, yj) and outputs d, and the Ix self-loop reads (-, yj) and outputs e; likewise F→Iy reads (xi, -) with output d and the Iy self-loop reads (xi, -) with output e. A start state is also marked.]

  7. Given a pair of sequences, an alignment (not necessarily optimal) corresponds to a state path in the FSM. The optimal alignment is the state path that reads the two sequences with the highest total output score.
     [Figure: the same FSM, together with three example alignments of AAC and ACT — AAC/ACT, AAC-/A-CT and AAC-/-ACT — and their state paths F-F-F-F, F-F-Iy-F-Ix and F-Iy-F-F-Ix]
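
A small sketch of the FSM view: recover the state path of an explicit alignment and sum the emitted scores. Taking AAC as x and ACT as y is my own illustrative choice, as are the σ, d and e values (borrowed from the worked example a few slides later):

    def state_path(ax, ay):
        # Map an explicit alignment onto the FSM states:
        #   F  : x_i aligned to y_j
        #   Ix : y_j aligned to a gap (a '-' in x)
        #   Iy : x_i aligned to a gap (a '-' in y)
        path = []
        for a, b in zip(ax, ay):
            if a != "-" and b != "-":
                path.append("F")
            elif a == "-":
                path.append("Ix")
            else:
                path.append("Iy")
        return path

    def path_score(ax, ay, sigma, d=-5, e=-1):
        # Total output score along the state path: sigma(x_i, y_j) in F,
        # d when entering a gap state, e when staying in the same gap state.
        total, prev = 0, "F"
        for a, b, st in zip(ax, ay, state_path(ax, ay)):
            if st == "F":
                total += sigma(a, b)
            else:
                total += e if st == prev else d
            prev = st
        return total

    sigma = lambda a, b: 2 if a == b else -2        # assumed match/mismatch scores
    print(state_path("AAC-", "A-CT"))               # ['F', 'Iy', 'F', 'Ix']
    print(path_score("AAC-", "A-CT", sigma))        # 2 - 5 + 2 - 5 = -6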

  8. F(i, j) = max { F(i-1, j-1) + σ(xi, yj),
                     Ix(i-1, j-1) + σ(xi, yj),
                     Iy(i-1, j-1) + σ(xi, yj) }
     [Figure: the FSM, highlighting the transitions into state F, which read (xi, yj) and output σ(xi, yj)]

  9. Ix(i, j) = max { F(i, j-1) + d,
                      Ix(i, j-1) + e }
     [Figure: the FSM, highlighting the transitions into state Ix: open a gap from F with score d, or extend within Ix with score e; yj is aligned to a gap]

  10. Iy(i, j) = max { F(i-1, j) + d,
                       Iy(i-1, j) + e }
      [Figure: the FSM, highlighting the transitions into state Iy: open a gap from F with score d, or extend within Iy with score e; xi is aligned to a gap]

  11. F(i, j)  = σ(xi, yj) + max { F(i – 1, j – 1),     continuing the alignment
                                   Ix(i – 1, j – 1),    closing a gap in x
                                   Iy(i – 1, j – 1) }   closing a gap in y

      Ix(i, j) = max { F(i, j – 1) + d,                 opening a gap in x
                       Ix(i, j – 1) + e }               gap extension in x

      Iy(i, j) = max { F(i – 1, j) + d,                 opening a gap in y
                       Iy(i – 1, j) + e }               gap extension in y
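
A minimal Python sketch of these three recurrences (score only, no traceback); the function name, default parameters and boundary initialisation are my own choices, picked to match the worked example on the following slides:

    NEG_INF = float("-inf")

    def affine_gap_score(x, y, m=2, s=-2, d=-5, e=-1):
        # Global alignment score with affine gaps, using the three matrices
        # F (x_i aligned to y_j), Ix (y_j aligned to a gap), Iy (x_i aligned to a gap).
        # d is the gap-open score, e the gap-extension score (both negative here).
        # O(M*N) time and space.
        M, N = len(x), len(y)
        F  = [[NEG_INF] * (N + 1) for _ in range(M + 1)]
        Ix = [[NEG_INF] * (N + 1) for _ in range(M + 1)]
        Iy = [[NEG_INF] * (N + 1) for _ in range(M + 1)]
        F[0][0] = 0
        for j in range(1, N + 1):           # leading gap: y_1..y_j aligned to gaps
            Ix[0][j] = d + (j - 1) * e
        for i in range(1, M + 1):           # leading gap: x_1..x_i aligned to gaps
            Iy[i][0] = d + (i - 1) * e
        for i in range(1, M + 1):
            for j in range(1, N + 1):
                sigma = m if x[i - 1] == y[j - 1] else s
                # F: x_i aligned to y_j, coming from any of the three states
                F[i][j] = sigma + max(F[i - 1][j - 1], Ix[i - 1][j - 1], Iy[i - 1][j - 1])
                # Ix: y_j aligned to a gap (open from F, or extend a previous gap)
                Ix[i][j] = max(F[i][j - 1] + d, Ix[i][j - 1] + e)
                # Iy: x_i aligned to a gap
                Iy[i][j] = max(F[i - 1][j] + d, Iy[i - 1][j] + e)
        return max(F[M][N], Ix[M][N], Iy[M][N])

The optimal score is the best of the three matrices at the bottom-right corner; a traceback through the same matrices recovers the alignment itself.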

  12. Worked example: x = GCAC, y = GCC, with m = 2, s = -2, d = -5, e = -1. Three matrices are filled: F (aligned on both), Ix (insertion on x), Iy (insertion on y).
      [Figure: the three DP matrices, with arrows showing the cells each entry depends on: F(i-1,j-1), Ix(i-1,j-1), Iy(i-1,j-1) plus σ(xi,yj) for F; F(i,j-1) + d and Ix(i,j-1) + e for Ix; F(i-1,j) + d and Iy(i-1,j) + e for Iy]

  13. [Figure: filling a cell of F from its diagonal predecessors F(i-1,j-1), Ix(i-1,j-1) and Iy(i-1,j-1), adding σ(xi, yj) = 2 for a match]

  14. [Figure: filling another cell of F, this time adding σ(xi, yj) = -2 for a mismatch]

  15. [Figure: filling a cell of Ix from F(i,j-1) with gap open d = -5, or from Ix(i,j-1) with gap extension e = -1]

  16. [Figure: filling a further cell of F with σ(xi, yj) = -2 (mismatch)]

  17. [Figure: filling a further cell of F with σ(xi, yj) = 2 (match)]

  18. [Figure: filling another cell of Ix from F(i,j-1) with d = -5, or from Ix(i,j-1) with e = -1]

  19. [Figure: filling a cell of Iy from F(i-1,j) with gap open d = -5, or from Iy(i-1,j) with gap extension e = -1]

  20. [Figure: recap of the three matrices and all their dependencies, as on slide 12]

  21. The resulting optimal alignment:
      x: GCAC
         || |
      y: GC-C
      [Figure: the filled F, Ix and Iy matrices]
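
Running the affine_gap_score sketch inserted after slide 11 (the name is my own) on this example reproduces the result:

    print(affine_gap_score("GCAC", "GCC", m=2, s=-2, d=-5, e=-1))   # 1
    # The traced-back alignment GCAC / GC-C scores 2 + 2 - 5 + 2 = 1
    # (three matches and one gap of length 1).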

  22. Today: statistics of alignment • Where does σ(xi, yj) come from? • Are two aligned sequences actually related?

  23. Probabilistic model of alignments • We’ll first focus on protein alignments without gaps • Given an alignment, we can consider two possible models • R: the sequences are related by evolution • U: the sequences are unrelated • How can we distinguish these two models? • How is this view related to the amino-acid substitution matrix?

  24. Model for unrelated sequences • Assume each position of the alignment is independently sampled from some distribution of amino acids • ps: probability of amino acid s in the sequences • Probability of seeing an amino acid s aligned to an amino acid t by chance is Pr(s, t | U) = ps pt • Probability of seeing an ungapped alignment between x = x1…xn and y = y1…yn randomly is Pr(x, y | U) = Πi pxi pyi

  25. Model for related sequences • Assume each pair of aligned amino acids evolved from a common ancestor • Let qst be the probability that amino acid s in one sequence is related to t in another sequence • The probability of an alignment of x and y is given by Pr(x, y | R) = Πi qxiyi

  26. Probabilistic model of alignments • How can we decide which model (U or R) is more likely? • One principled way is to consider the relative likelihood of the two models (the odds ratio): Pr(x, y | R) / Pr(x, y | U) = Πi qxiyi / (pxi pyi) • A higher ratio means that R is more likely than U

  27. Log odds ratio • Taking the logarithm, we get log [ Pr(x, y | R) / Pr(x, y | U) ] = Σi log [ qxiyi / (pxi pyi) ] • Recall that the score of an alignment is given by S = Σi σ(xi, yi)

  28. Therefore, if we define σ(s, t) = log [ qst / (ps pt) ], we are actually defining the alignment score as the log odds ratio between the two models R and U
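
A minimal sketch of this definition in Python; the function name, the toy two-letter alphabet and its numbers are illustrative assumptions (the optional lam argument anticipates the scaling factor on slide 32, where lam = ½ ln 2 gives scores in half-bits):

    import math

    def log_odds_score(q, p, lam=1.0):
        # sigma(s, t) = (1/lam) * log( q[s,t] / (p[s] * p[t]) ).
        # With lam = 1 this is exactly the log odds score defined on the slide.
        return {(s, t): (1.0 / lam) * math.log(q[s, t] / (p[s] * p[t]))
                for (s, t) in q}

    # Toy 2-letter alphabet, purely illustrative numbers:
    p = {"A": 0.6, "B": 0.4}
    q = {("A", "A"): 0.5, ("A", "B"): 0.05, ("B", "A"): 0.05, ("B", "B"): 0.4}
    sigma = log_odds_score(q, p)
    # sigma[("A","A")] > 0: seen together more often than expected by chance;
    # sigma[("A","B")] < 0: seen together less often than expected by chance.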

  29. How to get the probabilities? • ps can be counted from the available protein sequences • But how do we get qst? (the probability that s and t have a common ancestor) • Counted from trusted alignments of related sequences
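
A counting sketch of that last bullet (estimating ps and qst from trusted ungapped alignments); the function name and the example alignments are illustrative, and real matrices such as BLOSUM additionally cluster similar sequences to avoid over-counting conserved positions:

    from collections import Counter

    def estimate_probs(alignments):
        # alignments: list of (x, y) pairs of equal-length, ungapped aligned strings.
        single, pairs = Counter(), Counter()
        for x, y in alignments:
            for a, b in zip(x, y):
                single[a] += 1
                single[b] += 1
                pairs[tuple(sorted((a, b)))] += 1   # unordered pairs, so q is symmetric
        n_single, n_pairs = sum(single.values()), sum(pairs.values())
        p = {s: c / n_single for s, c in single.items()}
        q = {st: c / n_pairs for st, c in pairs.items()}
        return p, q

    p, q = estimate_probs([("GATTACA", "GACTACA"), ("CATGCA", "CATGGA")])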

  30. Protein substitution matrices • Two popular sets of matrices for protein sequences • PAM matrices [Dayhoff et al, 1978] • Better for aligning closely related sequences • BLOSUM matrices [Henikoff & Henikoff, 1992] • For both closely and remotely related sequences

  31. BLOSUM-N matrices • Constructed from a database called BLOCKS • Contain many closely related sequences • Conserved amino acids may be over-counted • N = 62: the probabilities qst were computed using trusted alignments with no more than 62% identity • identity: % of matched columns • Using this matrix, the Smith-Waterman algorithm is most effective in detecting real alignments with a similar identity level (i.e. ~62%)

  32. σ(s, t) = (1/λ) log [ qst / (ps pt) ] • λ: scaling factor used to convert the scores to integers • Important: when you are told that a scoring matrix is in half-bits, then λ = ½ ln 2 • Entries are positive for chemically similar substitutions • Common amino acids get low weights; rare amino acids get high weights

  33. BLOSUM-N matrices
      [Scale: BLOSUM-45 (weak homology) … BLOSUM-62 … BLOSUM-90 (strong homology)]
      • If you want to detect homologous genes with high identity, you may want a BLOSUM matrix with a higher N, say BLOSUM75 • On the other hand, if you want to detect remote homology, you may want to use a lower N, say BLOSUM50 • BLOSUM-62: good for most purposes

  34. For DNAs • No database of trusted alignments to start with • Specify the percentage identity you would like to detect • You can then get the substitution matrix by some calculation

  35. For example • Suppose pA = pC = pT = pG = 0.25 • We want 88% identity • qAA = qCC = qTT = qGG = 0.88/4 = 0.22; the remaining 0.12 is split among the 12 mismatch pairs, so qst = 0.12/12 = 0.01 for s ≠ t • σ(A, A) = σ(C, C) = σ(G, G) = σ(T, T) = log (0.22 / (0.25 × 0.25)) = 1.26 • σ(s, t) = log (0.01 / (0.25 × 0.25)) = -1.83 for s ≠ t
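
A minimal sketch that reproduces these numbers for any target identity; the function name is illustrative, and the integer values at the end are simply what multiplying by 4 and rounding gives (slide 37), not figures quoted from the slides:

    import math

    def dna_matrix(identity, bases="ACGT"):
        # Log-odds DNA matrix for a target fraction identity, assuming uniform
        # base frequencies p = 0.25 as on the slide.
        p = 1.0 / len(bases)
        q_match = identity / len(bases)                                  # 0.88 / 4  = 0.22
        q_mismatch = (1.0 - identity) / (len(bases) ** 2 - len(bases))   # 0.12 / 12 = 0.01
        match = math.log(q_match / (p * p))                              # log(0.22 / 0.0625) =  1.26
        mismatch = math.log(q_mismatch / (p * p))                        # log(0.01 / 0.0625) = -1.83
        return {(s, t): (match if s == t else mismatch) for s in bases for t in bases}

    m = dna_matrix(0.88)
    print(round(m[("A", "A")], 2), round(m[("A", "C")], 2))    # 1.26 -1.83
    # Scaling does not change the optimal alignment (slide 37): multiplying by 4
    # and rounding yields integer scores of +5 / -7.
    print(round(4 * m[("A", "A")]), round(4 * m[("A", "C")]))  # 5 -7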

  36. Substitution matrix

  37. Scale won’t change the alignment • Multiply by 4 and then round off to get integers

  38. Arbitrary substitution matrix • Say you have a substitution matrix provided by someone • It’s important to know what you are actually looking for when you use the matrix

  39. NCBI-BLAST WU-BLAST • What’s the difference? • Which one should I use for my sequences?

  40. We had σ(s, t) = log [ qst / (ps pt) ] • Scale it, so that σ(s, t) = (1/λ) log [ qst / (ps pt) ] • Reorganize: qst = ps pt e^(λ σ(s, t))

  41. Since all probabilities must sum to 1, we have Σs,t qst = Σs,t ps pt e^(λ σ(s, t)) = 1 • Suppose again ps = 0.25 for any s • We know σ(s, t) from the substitution matrix • We can solve the equation for λ • Plug λ into qst = ps pt e^(λ σ(s, t)) to get qst

  42. NCBI-BLAST: λ = 1.33; qst = 0.24 for s = t and 0.004 for s ≠ t. Translation: 95% identity
      WU-BLAST:   λ = 0.19; qst = 0.16 for s = t and 0.03 for s ≠ t. Translation: 65% identity

  43. Details for solving λ • Known: σ(s, t) = 1 for s = t, and σ(s, t) = -2 for s ≠ t • Since qst = ps pt e^(λ σ(s, t)) and Σs,t qst = 1, we have 12 · ¼ · ¼ · e^(-2λ) + 4 · ¼ · ¼ · e^λ = 1 • Let e^λ = x; then ¾ x^(-2) + ¼ x = 1, hence x³ – 4x² + 3 = 0 • x has three solutions: 3.8, 1, -0.8 • Only the first leads to a positive λ • λ = ln(3.8) = 1.33
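
The same λ can be found numerically for any scoring scheme; here is a minimal bisection sketch (the function name and bracketing interval are my own choices), which reproduces λ ≈ 1.33 for the +1/-2 scheme derived above:

    import math

    def solve_lambda(sigma, p, lo=1e-6, hi=10.0, iters=100):
        # Solve sum_{s,t} p_s * p_t * exp(lambda * sigma(s,t)) = 1 for lambda > 0
        # by bisection. Requires the expected score sum p_s p_t sigma(s,t) < 0,
        # so that the sum is below 1 just above lambda = 0 and grows past 1 later.
        def f(lam):
            return sum(p[s] * p[t] * math.exp(lam * sigma[(s, t)])
                       for s in p for t in p) - 1.0
        for _ in range(iters):
            mid = (lo + hi) / 2.0
            if f(mid) < 0:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2.0

    bases = "ACGT"
    p = {b: 0.25 for b in bases}
    sigma = {(s, t): (1 if s == t else -2) for s in bases for t in bases}
    lam = solve_lambda(sigma, p)
    print(round(lam, 2))                          # 1.33, i.e. ln(3.8)
    q_match = 0.25 * 0.25 * math.exp(lam * 1)     # ~0.24, as on slide 42

Running the same solver with a +5/-4 match/mismatch scheme (an inference about WU-BLAST's defaults, not stated on the slides) gives λ ≈ 0.19 and qst ≈ 0.16 for s = t, consistent with the WU-BLAST column on slide 42.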

  44. Today: statistics of alignment • Where does σ(xi, yj) come from? • Are two aligned sequences actually related?

  45. Statistics of alignment scores • Q: How do we assess whether an alignment provides good evidence for homology (i.e., that the two sequences are evolutionarily related)? • Is a score of 82 good? What about 180? • A: determine how likely it is that such an alignment score could arise by chance

  46. P-value of alignment • p-value: the probability that the alignment score could be obtained by aligning random sequences • A small p-value means the score is unlikely to happen by chance • The most common thresholds are 0.01 and 0.05 • The choice also depends on the purpose of the comparison and the cost of a false claim

  47. Statistics of global seq alignment • Theory only applies to local alignment • For global alignment, your best bet is to do Monte-Carlo simulation • What’s the chance you can get a score as high as the real alignment by aligning two random sequences? • Procedure • Given sequences X, Y • Compute a global alignment (score = S) • Randomly shuffle sequence X (or Y) N times, obtaining X1, X2, …, XN • Align each Xi with Y (score = Ri) • P-value: the fraction of the Ri that are >= S
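
A minimal sketch of this shuffling procedure; the function name and the choice of scorer are assumptions (any global alignment scorer will do, e.g. the affine-gap sketch from slide 11 or a call into an alignment library):

    import random

    def shuffle_pvalue(x, y, align_score, n_shuffles=200, seed=0):
        # Monte-Carlo p-value for a global alignment score: shuffle x, re-align
        # to y, and report the fraction of shuffled scores >= the real score.
        rng = random.Random(seed)
        real = align_score(x, y)
        hits = 0
        for _ in range(n_shuffles):
            shuffled = list(x)
            rng.shuffle(shuffled)            # preserves length and composition
            if align_score("".join(shuffled), y) >= real:
                hits += 1
        return real, hits / n_shuffles

    # For example, with 200 shuffles, 88 scores >= the real score gives
    # p = 88 / 200 = 0.44, as computed on slide 49.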

  48. Human HEXA vs. fly HEXO1: global alignment score = -74
      [Figure: the alignment]

  49. [Histogram: distribution of the alignment scores between fly HEXO1 and 200 randomly shuffled human HEXA sequences, with the real score of -74 marked] There are 88 random sequences with alignment score >= -74, so p-value = 88 / 200 = 0.44 => the alignment is not significant

  50. Mouse HEXA vs. human HEXA: global alignment score = 732
      [Figure: the alignment]
