Computational Genomics: Pairwise Alignment Techniques for Protein Similarity Analysis

Computational Genomics and Proteomics Pairwise alignment

Genetic code

Searching for similarities What is the function of the new gene? The “lazy” investigation: Find the set of similar proteins Identify similarities and differences For long proteins: identifydomains

Is similarity really interesting?

Is similarity really interesting? Common ancestry is more interesting Makes it more likely that genes share the same function Homology: sharing a common ancestor a binary property (yes/no)‏ It’s a nice tool:When (a known gene) G is homologous to (an unknown) X it means that we gain a lot of information on X Z X Y

How to determine similarity Frequent evolutionary events: Substitution Insertion, deletion Duplication Inversion Evolution at work Common ancestor, usually extinct Z X Y available We’ll use only these

Alignment - the problem Operations: substitution, Insertion deletion GTACT--C GGA-TGAC Algorithmsforgenomes ||||||| algorithms4genomes algorithmsforgenomes ||||||||||algorithms4genomes algorithmsforgenomes ||||||||||? |||||||algorithms4--genomes

Many ways to align 2)‏ 1)‏ 3)‏ GTA--- GTA GT-A ---GCA GCA G-CA • Which alignment is better? • Reflecting evolution ~= • Most probable ~= • Simplest

Scoring Should give reasonable alignments And have to assign scores to: Substitution (or match/mismatch)‏ Gap penalty Linear: g(k)=ak Affine: g(k)=b+ak Concave, e.g.: g(k)=log(k) The score for an alignment is the sum of scores of all columns GTACT--C GGA-TGAC GTA-- G-T-A --ATG -ATG-

Substitution matrices Define a score for match/mismatch of letters DNA - Simple: - Used in genome alignments

Substitution matrices for aa Amino acids are not equal: Some are easily substituted, similar: biochemical properties structure Some mutations occur more often due to similar codons The two above give us substitution matrices http://www.people.virginia.edu/~rjh9u/aminacid.html orange: nonpolar and hydrophobic. green: polar and hydrophilic magenta box are acidic light blue box are basic http://www.cimr.cam.ac.uk/links/codon.htm

BLOSUM 62 substitution matrix http://www.carverlab.org/testing/epp.html

Constant vs. affine gap scoring Gap Scoring intro extension Constant -2 -2 Affine -3 -1 …and +1 for match Seq1 G T A - - G - T - ASeq2 - - A T G - A T G - Const -2 –2 1 –2 –2 (SUM = -7) -2 –2 1 -2 –2 (SUM = -7) Affine –4 1 -4 (SUM = -7) -3 –3 1 -3 –3 (SUM = -11)‏

Global alignment.The Needleman-Wunsch algorithm Goal: find the maximal scoring alignment Scores: m match, -s mismatch, -g for insertion/deletion Dynamic programming Solve smaller subproblem(s)‏ Iteratively extend the solution The best alignment for X[1…i] and Y[1…j]is called M[i, j] X1 … Xi-1 -- Xi Y1 … - Yj-1 Yj -

The algorithm • Goal: find the maximal scoring alignment • Scores: m match, -s mismatch, -g for insertion/deletion • The best alignment for X[1…i]and Y[1…j]is called M[i, j] • 3 ways to extend the alignment X[1…i-1]X[i] X[1…i] - X[1…i-1] X[i] Y[1…j-1]Y[j] Y[1…j-1]Y[j] Y[1…j] - + m M[i,j] = M[i-1, j-1] - s M[i, j-1] - g M[i-1, j] - g

The algorithm – final equation Corresponds to: * X1…Xi-1 XiY1…Yj-1 Yj M[i-1, j-1] + score(X[i],Y[j])‏ M[i, j] = max M[i, j-1] – g X1…Xi -Y1…Yj-1 Yj M[i-1, j] – g X1…Xi-1 XiY1…Yj - j-1 j i-1 * -g i -g

Example: global alignment of two sequences Align two DNA sequences: GAGTGA GAGGCGA (note the length difference)‏ Parameters of the algorithm: Match:score(A,A) = 1 Mismatch: score(A,T) = -1 Gap: g = -2 M[i-1, j-1] ± 1 M[i, j] = max M[i, j-1] – 2 M[i-1, j] – 2

The algorithm. Step 1: init Create the matrix Initiation 0 at [0,0] Apply the equation… j 0 1 2 3 4 5 6 i - G A G T G A 0 - 0 1 G 2 A 3 G 4 G 5 C 6 G 7 A M[i-1, j-1] ± 1 M[i, j] = max M[i, j-1] – 2 M[i-1, j] – 2

The algorithm. Step 1: init Initiation of the matrix: 0 at pos [0,0] Fill in the first row using the “” rule Fill in the first column using “” - G A G T G A - 0 -2 -4 -6 -8 -10 -12 G -2 A -4 G -6 G -8 C -10 G -12 A -14 M[i-1, j-1] ± 1 M[i, j] = max M[i, j-1] – 2 M[i-1, j] – 2 j i

The algorithm. Step 2: fill in Continue filling in of the matrix, remembering from which cell the result comes (arrows)‏ - G A G T G A - 0 -2 -4 -6 -8 -10 -12 G -2 1 -1 -3 A -4 -1 2 G -6 G -8 C -10 G -12 A -14 M[i-1, j-1] ± 1 M[i, j] = max M[i, j-1] – 2 M[i-1, j] – 2 j i

The algorithm. Step 2: fill in We are done… Where’s the result? - G A G T G A - 0 -2 -4 -6 -8 -10 -12 G -2 1 -1 -3 -5 -7 -9 A -4 -1 2 0 -2 -4 -6 G -6 -3 0 3 1 -1 -3 G -8 -5 -2 1 2 2 0 C -10 -7 -4 -1 0 1 1 G -12 -9 -6 -3 -2 1 0 A -14 -11 -8 -5 -4 -1 2 M[i-1, j-1] ± 1 M[i, j] = max M[i, j-1] – 2 M[i-1, j] – 2 j i

The algorithm. Step 3:traceback Start at the last cell of the matrix Go against the direction of arrows Sometimes the value may be obtained from more than one cell (which one?)‏ - G A G T G A - 0 -2 -4 -6 -8 -10 -12 G -2 1 -1 -3 -5 -7 -9 A -4 -1 2 0 -2 -4 -6 G -6 -3 0 3 1 -1 -3 G -8 -5 -2 1 2 2 0 C -10 -7 -4 -1 0 1 1 G -12 -9 -6 -3 -2 1 0 A -14 -11 -8 -5 -4 -1 2 M[i-1, j-1] ± 1 M[i, j] = max M[i, j-1] – 2 M[i-1, j] – 2 j i

The algorithm. Step 3: traceback Extract the alignments - G A G T G A - 0 -2 -4 -6 -8 -10 -12 G -2 1 -1 -3 -5 -7 -9 A -4 -1 2 0 -2 -4 -6 G -6 -3 0 3 1 -1 -3 G -8 -5 -2 1 2 2 0 C -10 -7 -4 -1 0 1 1 G -12 -9 -6 -3 -2 1 0 A -14 -11 -8 -5 -4 -1 2 j a a)‏ b GAGT-GA GAGGCGA b)‏ i GA-GTGA GAGGCGA

Affine gaps X1…Xi -Y1…Yj-1 Yj • Penalties: • go - gap opening (e.g. -8)‏ • ge - gap extension (e.g. -2)‏ M[i-1, j-1] X1…XiY1…Yj M[i, j] = max score(X[i], Y[j]) + Ix[i-1, j-1] Iy[i-1, j-1] M[i, j-1] + go X1…-Y1… Yj Ix[i, j] = max Ix[i, j-1] + ge X1… - -Y1…Yj-1 Yj M[i-1, j] + go Iy[i, j] = max Iy[i-1, j] + gx • @ home: think of boundary values M[*,0], I[*,0] etc.

Variation on global alignment Globalalignment: previous algorithms is called global alignment, because it uses all letters from both sequences.CAGCACTTGGATTCTCGG CAGC-----G-T----GG Semi-globalalignment: don’t penalize for start/end gaps (omit the start/end gaps of sequences).CAGCA-CTTGGATTCTCGG ---CAGCGTGG-------- Applications of semi-global: Finding a gene in genome Placing marker onto a chromosome One sequence much longer than the other Danger! – really bad alignments for divergent seqs seq X: seq Y:

Semi-global alignment Ignore start gaps First row/column set to 0 Ignore end gaps Read the result from last row/column - G A G T G - 0 0 0 0 0 0 A 0 -1 1 -1 -1 -1 G 0 1 -2 2 0 -2 T 0 -1 0 0 3 1

a heuristics Heuristics: A rule of thumb that often helps in solving a certain class of problems, but makes no guarantees.Perkins, DN (1981) The Mind's Best Work

Computational Genomics: Pairwise Alignment Techniques for Protein Similarity Analysis

Computational Genomics: Pairwise Alignment Techniques for Protein Similarity Analysis

Presentation Transcript

C urriculum A lignment P rocess

C ontrast R epetition A lignment P roximity

Creating a Pairwise Comparison Chart

Pairwise Alignments

Pairwise Testing

Pairwise Alignments

Pairwise sequence Alignment

C ontrast R epetition A lignment P roximity

Pairwise Alignment

A lignment Class II

Pairwise alignment

Pairwise alignments

A lignment Class III

Pairwise alignment

T he A lignment F actor tm

Pairwise alignment

Pairwise Alignments

Pairwise alignments

Pairwise alignments

A lignment Class III

Pairwise alignment

A lignment Class II