Sequence comparison

Sequence comparison: a little bit of theory does not hurt!
Ahmed Rebaï

Definitions
• Comparative method: analyzing similarities and differences between sequence residues in order to infer structural, functional or evolutionary relationships
• Sequence alignment: the procedure of comparing 2 or more sequences by searching for a series of individual residues or residue patterns that are in the same order in the sequences
• Global alignment: the alignment is stretched over the entire sequence length to include as many matching residues as possible, up to the sequence ends
• Local alignment: the alignment stops at the ends of regions of identity or strong similarity; priority is given to finding conserved patterns rather than extending the alignment to include neighbouring residues (we will see how!)

Homology

Measuring Similarity
• Measuring the extent of similarity between two sequences
• Based on percent sequence identity
• Based on conservation

Percent Sequence Identity
• The extent to which two nucleotide or amino acid sequences are invariant

A C C T G A G - A G
A C G T G - G C A G

Column 3 is a mismatch; columns 6 and 8 are indels. 7 of the 10 columns match: 70% identical.
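A minimal sketch of the percent-identity calculation for the alignment on this slide. The function name is hypothetical, and dividing by the alignment length (rather than, say, the length of the shorter sequence) is an assumption chosen because it reproduces the 70% quoted above.

```python
def percent_identity(row_a, row_b, gap="-"):
    """Percentage of alignment columns in which both rows carry the same residue."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    # A column counts as identical only if the residues agree and neither is a gap.
    matches = sum(1 for a, b in zip(row_a, row_b) if a == b and a != gap)
    return 100.0 * matches / len(row_a)


print(percent_identity("ACCTGAG-AG", "ACGTG-GCAG"))  # 70.0
```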

Making a Scoring Matrix
• Scoring matrices are created based on biological evidence.
• An alignment can be thought of as two sequences that differ because of mutations.
• Some of these mutations have little effect on the organism's function, so some substitutions are penalized less harshly than others.

Scoring Matrix: Example

A  K  R  A  N  R
K  A  A  A  N  K
-1 + (-1) + (-2) + 5 + 7 + 3 = 11

• Notice that although R and K are different amino acids, they have a positive score.
• Why? Both are positively charged amino acids, so substituting one for the other will not greatly change the function of the protein.
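The sum on this slide can be reproduced with a small lookup table. This is only a sketch: the six pair scores are copied from the slide, the dictionary and function names are hypothetical, and a real scoring run would read the values from a full 20x20 substitution matrix.

```python
# Pair scores copied from the slide's example (not a complete matrix).
PAIR_SCORES = {
    ("A", "K"): -1,
    ("K", "A"): -1,
    ("R", "A"): -2,
    ("A", "A"): 5,
    ("N", "N"): 7,
    ("R", "K"): 3,
}


def ungapped_score(seq_a, seq_b, scores):
    """Sum the substitution score of every aligned pair (no gaps allowed)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(scores[(a, b)] for a, b in zip(seq_a, seq_b))


print(ungapped_score("AKRANR", "KAAANK", PAIR_SCORES))  # 11
```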

Conservation
• Amino acid changes that preserve the physico-chemical properties of the original residue
• Polar to polar
  • aspartate → glutamate
• Nonpolar to nonpolar
  • alanine → valine
• Similarly behaving residues
  • leucine to isoleucine

Alignment scores
• A global score is calculated as follows:
• A score is attributed to each position of the alignment according to:
  • match/mismatch/gap in DNA sequences
  • substitution of one amino acid by another
  • gap, for protein and DNA sequences
• We add the scores over the whole alignment: S = Σ S_elem + Σ S_gap
• Gap scores are negative because they are penalties
• By choosing these scores we can favour either alignments with few gaps (or ungapped) or alignments with many gaps

Scoring Indels: Naive Approach
• A fixed penalty σ is given to every indel:
  • -σ for 1 indel, -2σ for 2 consecutive indels, -3σ for 3 consecutive indels, etc.
• This can be too severe a penalty for a series of 100 consecutive indels

Affine Gap Penalties
• In nature, indels very often come as a unit, not just 1 nucleotide at a time.

ATA__GC
ATATTGC
(one indel of length 2: in nature, this is more likely)

ATAG_GC
AT_GTGC
(two separate indels of length 1)

• Normal scoring would give the same score for both alignments

Gap scores
• The gap score depends on the gap length l: P = x + y·l
• x: gap-opening penalty
• y: elementary gap-extension penalty, with |y| << |x| (x/10, for example)
• Defaults: x = -11, y = -1 (blastp) and x = -5, y = -2 (blastn)
• With this choice, one large gap is penalized less than many small gaps of the same total length (which is consistent with biological phenomena)
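A short sketch comparing the naive penalty with the affine penalty P = x + y·l on the two ATA/ATATTGC alignments from the affine gap slide. The helper names are hypothetical, σ = 1 is an assumed value for the naive scheme, and x = -11, y = -1 are the blastp defaults quoted above.

```python
import re


def gap_runs(row, gap="_"):
    """Lengths of the maximal runs of gap characters in one alignment row."""
    return [len(m.group()) for m in re.finditer(re.escape(gap) + "+", row)]


def linear_penalty(length, sigma=1):
    """Naive scheme: -sigma for every gapped position."""
    return -sigma * length


def affine_penalty(length, opening=-11, extension=-1):
    """Affine scheme from the slide: P = x + y * l."""
    return opening + extension * length


def total_gap_score(rows, penalty):
    """Sum the penalty of every gap run found in either row of the alignment."""
    return sum(penalty(n) for row in rows for n in gap_runs(row))


one_long_gap = ("ATA__GC", "ATATTGC")
two_short_gaps = ("ATAG_GC", "AT_GTGC")

# Naive scoring cannot tell the two alignments apart.
print(total_gap_score(one_long_gap, linear_penalty),
      total_gap_score(two_short_gaps, linear_penalty))   # -2 -2
# Affine scoring prefers the single longer gap.
print(total_gap_score(one_long_gap, affine_penalty),
      total_gap_score(two_short_gaps, affine_penalty))   # -13 -24
```

Both alignments contain the same five matches, so under these assumptions the gap term alone decides which one scores higher.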

Score matrix or substitution matrix
• For DNA sequences: 1 for a match, -2 for a mismatch
• For protein sequences: a 20x20 matrix
  • based on a model of protein evolution
  • based on analysis of highly conserved segments in proteins (Blocks)
  • based on physico-chemical or structural properties

PAM Matrices (Dayhoff et al., 1978)
• List the likelihood of change from one amino acid (aa) to another in homologous protein sequences during evolution
• The change of an aa A to B is assumed to be the same regardless of any previous change at that site and of the position of A in the protein
• Changes are counted in closely related proteins and are assumed to represent aa substitutions that do not change the protein function; they are called "accepted mutations"

PAM: point accepted mutation
• 1 PAM = PAM1 = 1% average change of all amino acid positions
• After 100 PAMs of evolution, not every residue will have changed
  • some residues may have mutated several times
  • some residues may have returned to their original state
  • some residues may not have changed at all

PAM matrix
• The relative mutability of an aa is evaluated by counting (in a group of related sequences) the number of changes of that aa and dividing by its "exposure to mutation"
• Exposure is calculated as the product of the frequency of occurrence of the aa in the group of sequences and the total number of aa changes that occurred in that group per 100 sites
• Calculated from estimates of the aa substitutions that occurred in a group of evolving proteins, using 1572 changes in 71 groups of protein sequences that were at least 85% similar

PAM matrix
• The most mutable aa are Asn, Ser, Asp and Glu; the least mutable are Cys and Trp
• Once we have the PAM1 matrix, we multiply it by itself N times to predict changes for more distantly related proteins that have undergone N PAMs of evolution
• The most commonly used is PAM250, which represents a level of 250 changes per 100 aa, expected in 2500 my (million years). Such sequences are expected to have about 20% similarity
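The "multiply PAM1 by itself N times" step can be sketched with plain matrix powers. Everything numeric here is an assumption: `TOY_PAM1` is a made-up 3-letter transition matrix used only to show the mechanics, not Dayhoff's 20x20 data, and the helper names are hypothetical.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(matrix, power):
    """Raise a square matrix to a non-negative integer power by repeated multiplication."""
    size = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(power):
        result = mat_mul(result, matrix)
    return result


# Hypothetical 3-letter "PAM1": each row holds the probabilities that one
# residue stays the same or changes to another, and sums to 1.
TOY_PAM1 = [
    [0.990, 0.006, 0.004],
    [0.006, 0.990, 0.004],
    [0.004, 0.004, 0.992],
]

toy_pam250 = mat_pow(TOY_PAM1, 250)
for row in toy_pam250:
    print(" ".join(f"{p:.3f}" for p in row))
```

Each row of the result is still a probability distribution, and the diagonal (probability of seeing the original residue) shrinks as N grows, which matches the point that distant proteins retain little identity.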

PAM matrix
• 260 changes were observed between Phe and Tyr among the 1572 changes
• In the PAM1 matrix, the probability of changing Phe to Tyr was 0.0021 (0.9946 for not changing Phe)
• In PAM250, the probability of a Phe to Tyr change is 0.15 and that of no change is 0.32
• The frequency of Phe was 0.04, so the relative frequency of change is 0.15/0.04 = 3.75, and 10 log10(3.75) = 5.7. The score for a change of Tyr to Phe was found to be 8.3, so the average of the two values, 7, is used
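The arithmetic on this slide can be checked in a few lines. The function name is hypothetical; the inputs (0.15, 0.04 and the reverse score 8.3) are the values quoted on the slide.

```python
import math


def log_odds_score(p_change, frequency):
    """10 * log10 of the ratio between the observed change probability and the residue frequency."""
    return 10 * math.log10(p_change / frequency)


phe_to_tyr = log_odds_score(0.15, 0.04)
tyr_to_phe = 8.3  # value reported on the slide, not recomputed here

print(round(phe_to_tyr, 1))                    # 5.7
print(round((phe_to_tyr + tyr_to_phe) / 2))    # 7
```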

PAM250

PAM matrix
• Values range from -8 to 17
• 0 indicates that the frequency of substitution between the two aa is that expected by chance
• A negative value means that the substitution is less frequent than expected by chance
• For example, no changes were observed between Gly (G) and Trp (W), resulting in a score of -7

BLOSUM (Henikoff et al., 1993)
• Blocks Substitution Matrix
• Scores derived from observations of the frequencies of substitutions in blocks of local alignments in related proteins
• Matrix name indicates evolutionary distance
• BLOSUM62 was created using sequences sharing no more than 62% identity

BLOSUM matrix
• An alternative that does not involve an explicit model of evolutionary change is to just find blocks of similar sequence and count how often one amino acid is substituted by another
• Good for general comparisons without consideration of evolutionary distance
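The counting step described above can be sketched as follows. This covers only the pair counting: the block is a made-up toy example, the names are hypothetical, and the sequence clustering and conversion to log-odds scores used for the published BLOSUM matrices are left out.

```python
from collections import Counter
from itertools import combinations


def count_pairs(block):
    """Count every unordered residue pair observed within each column of an ungapped block."""
    counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            counts[tuple(sorted((a, b)))] += 1
    return counts


# Hypothetical block: four aligned segments of length 3.
TOY_BLOCK = ["AKL", "AKI", "ARL", "GKL"]

pair_counts = count_pairs(TOY_BLOCK)
for pair, n in sorted(pair_counts.items()):
    print(pair, n)
```

With four sequences, each column contributes 6 pairs, so the three columns of the toy block yield 18 counted pairs in total.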

Comparison PAM-BLOSUM

Choice of a Matrix!
Closely related sequences (e.g. rat versus mouse RBP)
  BLOSUM90   PAM30
  BLOSUM80   PAM120
  BLOSUM62   PAM180
  BLOSUM45   PAM240
Distantly related sequences (e.g. rat versus bacterial lipocalin)

Methods for aligning 2 sequences
• Dot matrix analysis
• Dynamic programming algorithm
• Word or k-tuple methods, such as those used in the programs BLAST and FASTA
• Bayesian sequence alignment algorithms

The Dot-plot: a first step !

Dot plot

Dynamic programming
• Computational method that provides the very best (optimal) alignment between two sequences
• Compares every pair of residues in the two sequences and accounts for matches, mismatches and gaps

A T G T - A A T G C A
| | | |   | | |     |
A T G T G A A T - - A

Alignment as a Path in the Edit Graph
[Figure: edit graph with v = ATGTTAT labelling the rows (i = 0..7) and w = ATCGTAC labelling the columns (j = 0..7)]

    0122345677
v=   AT_GTTAT_
w=   ATCGT_A_C
    0123455667

The path is built up one vertex (i, j) at a time:
(0,0), (1,1), (2,2), (2,3), (3,4), (4,5), (5,5), (6,6), (7,6), (7,7) - End Result -

Alignments in Edit Graph
• Horizontal (→) and vertical (↓) edges represent indels in v and w
  • Score 0.
• Diagonal (↘) edges represent exact matches
  • Score 1.

Alignments in Edit Graph
The score of the alignment path in the graph is 5.
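A minimal sketch of how an alignment maps onto a path in the edit graph and how that path is scored with the match = 1, indel = 0 convention of these slides. The function name is hypothetical; the input is the v/w alignment used in the example.

```python
def alignment_path(v_row, w_row, gap="_"):
    """Return the edit-graph vertices (i, j) visited by an alignment, plus its match count."""
    if len(v_row) != len(w_row):
        raise ValueError("aligned rows must have the same length")
    i = j = 0
    score = 0
    path = [(0, 0)]
    for a, b in zip(v_row, w_row):
        if a != gap:
            i += 1          # consumed one letter of v: move down
        if b != gap:
            j += 1          # consumed one letter of w: move right
        if a == b and a != gap:
            score += 1      # diagonal edge on an exact match
        path.append((i, j))
    return path, score


path, score = alignment_path("AT_GTTAT_", "ATCGT_A_C")
print(path)
print(score)  # 5
```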

Alignment as a Path in the Edit Graph
Every path in the edit graph corresponds to an alignment:

Alignment as a Path in the Edit Graph
Old Alignment
    0122345677
v=   AT_GTTAT_
w=   ATCGT_A_C
    0123455667
New Alignment
    0122345677
v=   AT_GTTAT_
w=   ATCG_TA_C
    0123445667

Alignment as a Path in the Edit Graph
    0122345677
v=   AT_GTTAT_
w=   ATCGT_A_C
    0123455667
(0,0), (1,1), (2,2), (2,3), (3,4), (4,5), (5,5), (6,6), (7,6), (7,7)

Alignment: Dynamic Programming
Use this scoring algorithm:

$$
S_{i,j} = \max \begin{cases} S_{i-1,j-1} + 1 & \text{if } v_i = w_j \\ S_{i-1,j} \\ S_{i,j-1} \end{cases}
$$

Dynamic Programming Example
• There are no matches before the beginning of the sequences
• Initialize the first column and the first row of the table (index 0) to all zeros

Dynamic Programming Example

$$
S_{i,j} = \max \begin{cases} S_{i-1,j-1} + 1 & \text{value from NW, plus 1, if } v_i = w_j \\ S_{i-1,j} & \text{value from North (top)} \\ S_{i,j-1} & \text{value from West (left)} \end{cases}
$$

Applying it to row 1 and column 1 fills both with 1s.

Alignment: Backtracking
Arrows show where the score originated from:
• ↑ if from the top
• ← if from the left
• ↖ if v_i = w_j

Backtracking Example
Find the matches in row 2 and column 2.
• i=2, j=2,5 is a match (T).
• j=2, i=4,5,7 is a match (T).
Since v_i = w_j, S(i,j) = S(i-1,j-1) + 1:
S(2,2) = [S(1,1) = 1] + 1
S(2,5) = [S(1,4) = 1] + 1
S(4,2) = [S(3,1) = 1] + 1
S(5,2) = [S(4,1) = 1] + 1
S(7,2) = [S(6,1) = 1] + 1
Row 2 becomes 1 2 2 2 2 2 2, and column 2 holds 2 from row 2 downwards.

Backtracking Example
Continuing with the scoring algorithm gives this result:

        A T C G T A C   (w)
      0 0 0 0 0 0 0 0
    A 0 1 1 1 1 1 1 1
    T 0 1 2 2 2 2 2 2
    G 0 1 2 2 3 3 3 3
    T 0 1 2 2 3 4 4 4
    T 0 1 2 2 3 4 4 4
    A 0 1 2 2 3 4 5 5
    T 0 1 2 2 3 4 5 5
   (v)
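The table and the backtracking step can be sketched in a few lines, assuming the match = 1, indel = 0 scoring used on these slides. The function names are hypothetical, and when several optimal alignments exist this sketch simply returns one of them.

```python
def score_table(v, w):
    """Fill the dynamic programming table S for the match=1 / indel=0 scoring."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]   # row 0 and column 0 stay at zero
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = max(s[i - 1][j], s[i][j - 1])          # from the top / from the left
            if v[i - 1] == w[j - 1]:
                best = max(best, s[i - 1][j - 1] + 1)     # diagonal, only on a match
            s[i][j] = best
    return s


def backtrack(v, w, s, gap="_"):
    """Walk back from (n, m) to (0, 0) and rebuild one optimal alignment."""
    i, j = len(v), len(w)
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and v[i - 1] == w[j - 1] and s[i][j] == s[i - 1][j - 1] + 1:
            top.append(v[i - 1]); bottom.append(w[j - 1]); i -= 1; j -= 1
        elif i > 0 and s[i][j] == s[i - 1][j]:
            top.append(v[i - 1]); bottom.append(gap); i -= 1
        else:
            top.append(gap); bottom.append(w[j - 1]); j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))


v, w = "ATGTTAT", "ATCGTAC"
table = score_table(v, w)
for row in table:
    print(" ".join(str(x) for x in row))

aligned_v, aligned_w = backtrack(v, w, table)
print(aligned_v)
print(aligned_w)
```

The bottom-right entry of the table is the score of the best global path, 5 for this pair of sequences.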

Local vs. Global Alignment
• The Global Alignment Problem tries to find the longest path between vertices (0,0) and (n,m) in the edit graph.
• The Local Alignment Problem tries to find the longest path among paths between arbitrary vertices (i,j) and (i', j') in the edit graph.

Local vs. Global Alignment (cont'd)
• Global Alignment

--T--CC-C-AGT--TATGT-CAGGGGACACG-A-GCATGCAGA-GAC
  |  || |  ||  | | | |||    || |      | ||||   |
AATTGCCGCC-GTCGT-T-TTCAG----CA-GTTATG-T-CAGAT--C

• Local Alignment: a better alignment for finding a conserved segment

                  tccCAGTTATGTCAGgggacacgagcatgcagagac
                     ||||||||||||
aattgccgccgtcgttttcagCAGTTATGTCAGatc
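One common way to solve the local alignment problem is the Smith-Waterman recurrence, which these slides do not name; it is used here as an assumed technique. The sketch below adds a 0 option to the global recurrence so that a path may start anywhere, and it uses match = +1, mismatch = -1, gap = -1, which are illustrative values rather than parameters from the slides.

```python
def smith_waterman(v, w, match=1, mismatch=-1, gap=-1):
    """Return (best score, aligned segment of v, aligned segment of w)."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = s[i - 1][j - 1] + (match if v[i - 1] == w[j - 1] else mismatch)
            # The extra 0 lets a local alignment restart instead of going negative.
            s[i][j] = max(0, diag, s[i - 1][j] + gap, s[i][j - 1] + gap)
            if s[i][j] > best:
                best, best_pos = s[i][j], (i, j)

    # Trace back from the best cell until a cell with score 0 is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and s[i][j] > 0:
        diag = s[i - 1][j - 1] + (match if v[i - 1] == w[j - 1] else mismatch)
        if s[i][j] == diag:
            top.append(v[i - 1]); bottom.append(w[j - 1]); i -= 1; j -= 1
        elif s[i][j] == s[i - 1][j] + gap:
            top.append(v[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(w[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


seq_v = "TCCCAGTTATGTCAGGGGACACGAGCATGCAGAGAC"
seq_w = "AATTGCCGCCGTCGTTTTCAGCAGTTATGTCAGATC"
local_score, local_v, local_w = smith_waterman(seq_v, seq_w)
print(local_score)
print(local_v)
print(local_w)
```

With the match = 1, indel = 0 scoring of the earlier slides nothing is ever penalized, so the best local path would coincide with the global one; negative mismatch and gap scores are what make a short conserved segment stand out.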
