1 / 27

Pairwise a lignment

Computational Genomics and Proteomics. Pairwise a lignment. Genetic code. Searching for similarities. What is the function of the new gene? The “lazy” investigation: Find the set of similar proteins Identify similarities and differences For long proteins: identify domains.

elani
Download Presentation

Pairwise a lignment

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Computational Genomics and Proteomics Pairwise alignment

  2. Genetic code

  3. Searching for similarities What is the function of the new gene? The “lazy” investigation: Find the set of similar proteins Identify similarities and differences For long proteins: identifydomains

  4. Is similarity really interesting?

  5. Is similarity really interesting? Common ancestry is more interesting Makes it more likely that genes share the same function Homology: sharing a common ancestor a binary property (yes/no)‏ It’s a nice tool:When (a known gene) G is homologous to (an unknown) X it means that we gain a lot of information on X Z X Y

  6. How to determine similarity Frequent evolutionary events: Substitution Insertion, deletion Duplication Inversion Evolution at work Common ancestor, usually extinct Z X Y available We’ll use only these

  7. Alignment - the problem Operations: substitution, Insertion deletion GTACT--C GGA-TGAC Algorithmsforgenomes ||||||| algorithms4genomes algorithmsforgenomes ||||||||||algorithms4genomes algorithmsforgenomes ||||||||||? |||||||algorithms4--genomes

  8. Many ways to align 2)‏ 1)‏ 3)‏ GTA--- GTA GT-A ---GCA GCA G-CA • Which alignment is better? • Reflecting evolution ~= • Most probable ~= • Simplest

  9. Scoring Should give reasonable alignments And have to assign scores to: Substitution (or match/mismatch)‏ Gap penalty Linear: g(k)=ak Affine: g(k)=b+ak Concave, e.g.: g(k)=log(k) The score for an alignment is the sum of scores of all columns GTACT--C GGA-TGAC GTA-- G-T-A --ATG -ATG-

  10. Substitution matrices Define a score for match/mismatch of letters DNA - Simple: - Used in genome alignments

  11. Substitution matrices for aa Amino acids are not equal: Some are easily substituted, similar: biochemical properties structure Some mutations occur more often due to similar codons The two above give us substitution matrices http://www.people.virginia.edu/~rjh9u/aminacid.html orange: nonpolar and hydrophobic. green: polar and hydrophilic magenta box are acidic light blue box are basic http://www.cimr.cam.ac.uk/links/codon.htm

  12. BLOSUM 62 substitution matrix http://www.carverlab.org/testing/epp.html

  13. Constant vs. affine gap scoring Gap Scoring intro extension Constant -2 -2 Affine -3 -1 …and +1 for match Seq1 G T A - - G - T - ASeq2 - - A T G - A T G - Const -2 –2 1 –2 –2 (SUM = -7) -2 –2 1 -2 –2 (SUM = -7) Affine –4 1 -4 (SUM = -7) -3 –3 1 -3 –3 (SUM = -11)‏

  14. Global alignment.The Needleman-Wunsch algorithm Goal: find the maximal scoring alignment Scores: m match, -s mismatch, -g for insertion/deletion Dynamic programming Solve smaller subproblem(s)‏ Iteratively extend the solution The best alignment for X[1…i] and Y[1…j]is called M[i, j] X1 … Xi-1 -- Xi Y1 … - Yj-1 Yj -

  15. The algorithm • Goal: find the maximal scoring alignment • Scores: m match, -s mismatch, -g for insertion/deletion • The best alignment for X[1…i]and Y[1…j]is called M[i, j] • 3 ways to extend the alignment X[1…i-1]X[i] X[1…i] - X[1…i-1] X[i] Y[1…j-1]Y[j] Y[1…j-1]Y[j] Y[1…j] - + m M[i,j] = M[i-1, j-1] - s M[i, j-1] - g M[i-1, j] - g

  16. The algorithm – final equation Corresponds to: * X1…Xi-1 XiY1…Yj-1 Yj M[i-1, j-1] + score(X[i],Y[j])‏ M[i, j] = max M[i, j-1] – g X1…Xi -Y1…Yj-1 Yj M[i-1, j] – g X1…Xi-1 XiY1…Yj - j-1 j i-1 * -g i -g

  17. Example: global alignment of two sequences Align two DNA sequences: GAGTGA GAGGCGA (note the length difference)‏ Parameters of the algorithm: Match:score(A,A) = 1 Mismatch: score(A,T) = -1 Gap: g = -2 M[i-1, j-1] ± 1 M[i, j] = max M[i, j-1] – 2 M[i-1, j] – 2

  18. The algorithm. Step 1: init Create the matrix Initiation 0 at [0,0] Apply the equation… j 0 1 2 3 4 5 6 i - G A G T G A 0 - 0 1 G 2 A 3 G 4 G 5 C 6 G 7 A M[i-1, j-1] ± 1 M[i, j] = max M[i, j-1] – 2 M[i-1, j] – 2

  19. The algorithm. Step 1: init Initiation of the matrix: 0 at pos [0,0] Fill in the first row using the “” rule Fill in the first column using “” - G A G T G A - 0 -2 -4 -6 -8 -10 -12 G -2 A -4 G -6 G -8 C -10 G -12 A -14 M[i-1, j-1] ± 1 M[i, j] = max M[i, j-1] – 2 M[i-1, j] – 2 j i

  20. The algorithm. Step 2: fill in Continue filling in of the matrix, remembering from which cell the result comes (arrows)‏ - G A G T G A - 0 -2 -4 -6 -8 -10 -12 G -2 1 -1 -3 A -4 -1 2 G -6 G -8 C -10 G -12 A -14 M[i-1, j-1] ± 1 M[i, j] = max M[i, j-1] – 2 M[i-1, j] – 2 j i

  21. The algorithm. Step 2: fill in We are done… Where’s the result? - G A G T G A - 0 -2 -4 -6 -8 -10 -12 G -2 1 -1 -3 -5 -7 -9 A -4 -1 2 0 -2 -4 -6 G -6 -3 0 3 1 -1 -3 G -8 -5 -2 1 2 2 0 C -10 -7 -4 -1 0 1 1 G -12 -9 -6 -3 -2 1 0 A -14 -11 -8 -5 -4 -1 2 M[i-1, j-1] ± 1 M[i, j] = max M[i, j-1] – 2 M[i-1, j] – 2 j i

  22. The algorithm. Step 3:traceback Start at the last cell of the matrix Go against the direction of arrows Sometimes the value may be obtained from more than one cell (which one?)‏ - G A G T G A - 0 -2 -4 -6 -8 -10 -12 G -2 1 -1 -3 -5 -7 -9 A -4 -1 2 0 -2 -4 -6 G -6 -3 0 3 1 -1 -3 G -8 -5 -2 1 2 2 0 C -10 -7 -4 -1 0 1 1 G -12 -9 -6 -3 -2 1 0 A -14 -11 -8 -5 -4 -1 2 M[i-1, j-1] ± 1 M[i, j] = max M[i, j-1] – 2 M[i-1, j] – 2 j i

  23. The algorithm. Step 3: traceback Extract the alignments - G A G T G A - 0 -2 -4 -6 -8 -10 -12 G -2 1 -1 -3 -5 -7 -9 A -4 -1 2 0 -2 -4 -6 G -6 -3 0 3 1 -1 -3 G -8 -5 -2 1 2 2 0 C -10 -7 -4 -1 0 1 1 G -12 -9 -6 -3 -2 1 0 A -14 -11 -8 -5 -4 -1 2 j a a)‏ b GAGT-GA GAGGCGA b)‏ i GA-GTGA GAGGCGA

  24. Affine gaps X1…Xi -Y1…Yj-1 Yj • Penalties: • go - gap opening (e.g. -8)‏ • ge - gap extension (e.g. -2)‏ M[i-1, j-1] X1…XiY1…Yj M[i, j] = max score(X[i], Y[j]) + Ix[i-1, j-1] Iy[i-1, j-1] M[i, j-1] + go X1…-Y1… Yj Ix[i, j] = max Ix[i, j-1] + ge X1… - -Y1…Yj-1 Yj M[i-1, j] + go Iy[i, j] = max Iy[i-1, j] + gx • @ home: think of boundary values M[*,0], I[*,0] etc.

  25. Variation on global alignment Globalalignment: previous algorithms is called global alignment, because it uses all letters from both sequences.CAGCACTTGGATTCTCGG CAGC-----G-T----GG Semi-globalalignment: don’t penalize for start/end gaps (omit the start/end gaps of sequences).CAGCA-CTTGGATTCTCGG ---CAGCGTGG-------- Applications of semi-global: Finding a gene in genome Placing marker onto a chromosome One sequence much longer than the other Danger! – really bad alignments for divergent seqs seq X: seq Y:

  26. Semi-global alignment Ignore start gaps First row/column set to 0 Ignore end gaps Read the result from last row/column - G A G T G - 0 0 0 0 0 0 A 0 -1 1 -1 -1 -1 G 0 1 -2 2 0 -2 T 0 -1 0 0 3 1

  27. a heuristics Heuristics: A rule of thumb that often helps in solving a certain class of problems, but makes no guarantees.Perkins, DN (1981) The Mind's Best Work

More Related