Pairwise sequence alignments

Etienne de Villiers. Adapted with permission of the Swiss EMBnet node and SIB.

Outline • Introduction • Definitions • Biological context of pairwise alignments • Computing of pairwise alignments • Some programs

Importance of pairwise alignments
• Sequence analysis tools depending on pairwise comparison
  • Multiple alignments
  • Profile and HMM making (used to search for protein families and domains)
  • 3D protein structure prediction
  • Phylogenetic analysis
  • Construction of certain substitution matrices
  • Similarity searches in a database

Goal
• Sequence comparison through pairwise alignments
• The goal of pairwise comparison is to find conserved regions (if any) between two sequences
• Extrapolate information about our sequence using the known characteristics of the other sequence

    THIO_EMENI  GFVVVDCFATWCGPCKAIAPTVEKFAQTY
                G ++VD +A WCGPCK IAP +++ A  Y
    ???         GAILVDFWAEWCGPCKMIAPILDEIADEY

  Extrapolate from THIO_EMENI (SwissProt) to the unknown sequence (???)

Do alignments make sense?
• Evolution of sequences
  • Sequences evolve through mutation and selection
  • Selective pressure is different for each residue position in a protein (e.g. conservation of active site, structure, charge, etc.)
• Modular nature of proteins
  • Nature keeps re-using domains
• Alignments try to tell the evolutionary story of the proteins
[Figure: relationships between sequences: same sequence, same origin, same function, same 3D fold]

Example: An alignment - textual view
• Two similar regions of the Drosophila melanogaster Slit and Notch proteins

                       970       980       990      1000      1010      1020
    SLIT_DROME  FSCQCAPGYTGARCETNIDDCLGEIKCQNNATCIDGVESYKCECQPGFSGEFCDTKIQFC
                ..:.:  :.  :.: ...:.: ..  : :.. : ::.. . :.: ::..:. :.  :. :
    NOTC_DROME  YKCECPRGFYDAHCLSDVDECASN-PCVNEGRCEDGINEFICHCPPGYTGKRCELDIDEC
                       740       750       760       770       780       790

Example: An alignment - graphical view
• Comparing the tissue-type and urokinase-type plasminogen activators, displayed using a diagonal plot or dotplot.
[Figure: dotplot of tissue-type plasminogen activator against urokinase-type plasminogen activator]
URL: www.isrec.isb-sib.ch/java/dotlet/Dotlet.html

Some definitions
• Identity
  • Proportion of pairs of identical residues between two aligned sequences.
  • Generally expressed as a percentage.
  • This value strongly depends on how the two sequences are aligned.
• Similarity
  • Proportion of pairs of similar residues between two aligned sequences.
  • Whether two residues are similar is determined by a substitution matrix.
  • This value also depends strongly on how the two sequences are aligned, as well as on the substitution matrix used.
• Homology
  • Two sequences are homologous if and only if they have a common ancestor.
  • There is no such thing as a level of homology! (It is either yes or no.)
  • Homologous sequences do not necessarily serve the same function...
  • ... nor are they always highly similar: structure may be conserved while sequence is not.
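As a rough illustration of the identity definition, the sketch below counts identical pairs over the columns of an alignment given as two equal-length rows, with "-" marking a gap. The function name percent_identity and the choice of denominator (all alignment columns) are assumptions made for this example; other conventions divide by the length of the shorter sequence or by the ungapped columns only.

```python
def percent_identity(row_a: str, row_b: str) -> float:
    """Percentage of alignment columns that hold two identical residues."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    if not row_a:
        raise ValueError("alignment is empty")
    # A column counts as identical only if both rows carry the same residue
    # (two gap characters facing each other are not counted).
    identical = sum(1 for a, b in zip(row_a, row_b) if a == b and a != "-")
    return 100 * identical / len(row_a)


if __name__ == "__main__":
    print(percent_identity("GARFIELDTHELASTFA-TCAT", "GARFIELDTHEVERYFASTCAT"))
```

Because the value is computed from the aligned rows, realigning the same two sequences differently changes the result.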

Definition example: the set of all globins and a test to identify them
Consider:
• a set S (say, globins: G)
• a test t that tries to detect members of S (for example, through a pairwise comparison with another globin).
[Figure: globins (G) and other proteins (X) split by the test into matches and non-matches, giving true positives, false positives, true negatives and false negatives]

More definitions
Consider a set S (say, globins) and a test t that tries to detect members of S (for example, through a pairwise comparison with another globin).
• True positive: a protein is a true positive if it belongs to S and is detected by t.
• True negative: a protein is a true negative if it does not belong to S and is not detected by t.
• False positive: a protein is a false positive if it does not belong to S and is (incorrectly) detected by t.
• False negative: a protein is a false negative if it belongs to S and is not detected by t (but should be).

Even more definitions
• Sensitivity: ability of a method to detect positives, irrespective of how many false positives are reported.
• Selectivity: ability of a method to reject negatives, irrespective of how many positives are missed (false negatives).
• Trade-off: greater sensitivity generally comes with less selectivity, and greater selectivity with less sensitivity.
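One common way to put numbers on these two notions uses the four counts defined earlier. The sketch below uses the usual ratio definitions, which are an assumption here since the slides give no formulas, and the counts in the example are hypothetical.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of the real members of S that the test detects."""
    return true_pos / (true_pos + false_neg)


def selectivity(true_neg: int, false_pos: int) -> float:
    """Fraction of the non-members of S that the test rejects."""
    return true_neg / (true_neg + false_pos)


if __name__ == "__main__":
    # Hypothetical test result: 5 globins and 4 non-globins in the database.
    print(sensitivity(true_pos=4, false_neg=1))
    print(selectivity(true_neg=3, false_pos=1))
```

Loosening the detection threshold of a test typically raises the first value and lowers the second.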

Pairwise sequence alignment
• Concept of a sequence alignment
• Pairwise alignment:
  • Explicit mapping between the residues of 2 sequences
  • Tolerant to errors (mismatches, insertions / deletions or indels)
  • Evaluation of the alignment in a biological context (significance)

    Seq A  GARFIELDTHELASTFA-TCAT
           |||||||||||    || ||||
    Seq B  GARFIELDTHEVERYFASTCAT

  Columns 12-15: errors / mismatches. Column 18: insertion in Seq B (deletion in Seq A).

Pairwise sequence alignment
• Number of alignments
  • There are many ways to align two sequences
  • Consider the sequence fragments below: a simple alignment shows some conserved portions

    CGATGCAGACGTCA
    |||||||
    CGATGCAAGACGTCA

  but also:

     CGATGCAGACGTCA
           ||||||||
    CGATGCAAGACGTCA

• Number of possible alignments for 2 sequences of length 1000 residues:
  • more than 10^600 gapped alignments (for comparison: Avogadro's number is about 10^24, the estimated number of atoms in the universe about 10^80)
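The size of the search space can be checked with a short recurrence. The sketch below assumes one particular counting convention: an alignment is a series of columns, each being a residue pair, a gap in the first sequence, or a gap in the second. Other conventions give different, but similarly astronomical, numbers. The function name count_alignments is hypothetical.

```python
def count_alignments(n: int, m: int) -> int:
    """Number of gapped global alignments of sequences of length n and m.

    Uses f(i, j) = f(i-1, j) + f(i, j-1) + f(i-1, j-1): the last column is
    a gap in the second sequence, a gap in the first, or a residue pair.
    """
    row = [1] * (m + 1)  # f(0, j) = 1: only gaps can be placed
    for _ in range(n):
        new_row = [1]  # f(i, 0) = 1
        for j in range(1, m + 1):
            new_row.append(row[j] + new_row[j - 1] + row[j - 1])
        row = new_row
    return row[m]


if __name__ == "__main__":
    total = count_alignments(1000, 1000)
    print(len(str(total)), "decimal digits")
```

Python integers have arbitrary precision, so the exact count is obtained without overflow. Enumerating the alignments themselves is out of the question, which is why dynamic programming is used to find the best one.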

Alignment evaluation
• What is a good alignment?
  • We need a way to evaluate the biological meaning of a given alignment
  • Intuitively we "know" that the following alignment:

    CGAGGCACAACGTCA
    ||| |||  ||||||
    CGATGCAAGACGTCA

  is better than:

    ATTGGACAGCAATCAGG
    |     ||  |     |
    ACGATGCAAGACGTCAG

  • We can express this notion more rigorously by using a scoring system

Scoring system
• Simple alignment scores
  • A simple way (but not the best) to score an alignment is to count 1 for each match and 0 for each mismatch.

    CGAGGCACAACGTCA
    ||| |||  ||||||
    CGATGCAAGACGTCA
  • Score: 12

    ATTGGACAGCAATCAGG
    |     ||  |     |
    ACGATGCAAGACGTCAG
  • Score: 5
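The match/mismatch count can be written in a few lines. This is a minimal sketch of the simple scheme only (1 per match, 0 per mismatch); the function names match_line and simple_score are chosen for this example.

```python
def match_line(row_a: str, row_b: str) -> str:
    """Build the '|' line drawn between two aligned rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    return "".join(
        "|" if a == b and a != "-" else " " for a, b in zip(row_a, row_b)
    )


def simple_score(row_a: str, row_b: str) -> int:
    """Score 1 for each match and 0 for each mismatch."""
    return match_line(row_a, row_b).count("|")


if __name__ == "__main__":
    for a, b in [
        ("CGAGGCACAACGTCA", "CGATGCAAGACGTCA"),
        ("ATTGGACAGCAATCAGG", "ACGATGCAAGACGTCAG"),
    ]:
        print(a, match_line(a, b), b, simple_score(a, b), sep="\n")
```

Applied to the two alignments of this slide, the function returns 12 and 5.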

Introducing biological information
• Importance of the scoring system
  • Discrimination of significant biological alignments
• Based on physico-chemical properties of amino acids
  • Hydrophobicity, acid / base, steric properties, ...
  • Scoring system scales are arbitrary
• Based on biological sequence information
  • Substitutions observed in structural or evolutionary alignments of well studied protein families
  • Scoring systems have a probabilistic foundation
• Substitution matrices
  • In proteins some mismatches are more acceptable than others
  • Substitution matrices give a score for each substitution of one amino acid by another

Substitution matrices (log-odds matrices)
• For a set of well known proteins:
  • Align the sequences
  • Count the mutations at each position
  • For each substitution set the score to the log-odds ratio
• Positive score: the amino acids are similar; mutations from one into the other occur more often than expected by chance during evolution
• Negative score: the amino acids are dissimilar; the mutation from one into the other occurs less often than expected by chance during evolution
• Example matrix: PAM250, with (Leu, Ile): 2 and (Leu, Cys): -6
[Figure: PAM250 matrix. From: A. D. Baxevanis, "Bioinformatics"]
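The log-odds idea can be sketched as follows: the score compares how often a pair of amino acids is seen aligned with how often it would be seen if residues were paired at random. The frequencies below are made up for illustration, and the function name log_odds is hypothetical. Published matrices also apply a scaling factor and round to integers, which this sketch leaves out.

```python
import math


def log_odds(pair_freq: float, freq_a: float, freq_b: float) -> float:
    """log2 of (observed pair frequency / frequency expected by chance)."""
    expected = freq_a * freq_b
    return math.log2(pair_freq / expected)


if __name__ == "__main__":
    # Hypothetical numbers: both residues make up 10% of all positions.
    print(log_odds(pair_freq=0.02, freq_a=0.1, freq_b=0.1))   # seen more often than chance
    print(log_odds(pair_freq=0.005, freq_a=0.1, freq_b=0.1))  # seen less often than chance
```

A pair observed twice as often as expected gets a score close to +1, and a pair observed half as often as expected gets a score close to -1.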

Matrix choice
• Different kinds of matrices
• PAM series (Dayhoff M., 1968, 1972, 1978)
  • Percent Accepted Mutation: a unit introduced by Dayhoff et al. to quantify the amount of evolutionary change in a protein sequence. 1.0 PAM unit is the amount of evolution which will change, on average, 1% of amino acids in a protein sequence.
  • A PAM(x) substitution matrix is a look-up table in which scores for each amino acid substitution have been calculated based on the frequency of that substitution in closely related proteins that have experienced a certain amount (x) of evolutionary divergence.
  • Based on 1572 observed mutations in 71 families of closely related proteins
  • Old standard matrix: PAM250

Matrix choice
• Different kinds of matrices
• BLOSUM series (Henikoff S. & Henikoff JG., PNAS, 1992)
  • Blocks Substitution Matrix.
  • A substitution matrix in which scores for each position are derived from observations of the frequencies of substitutions in blocks of local alignments in related proteins. Each matrix is tailored to a particular evolutionary distance. In the BLOSUM62 matrix, for example, the alignment from which scores were derived was created using sequences sharing no more than 62% identity. Sequences more than 62% identical are represented by a single sequence in the alignment so as to avoid over-weighting closely related family members.
  • Based on alignments in the BLOCKS database
  • Standard matrix: BLOSUM62

Matrix choice
• Limitations
  • Substitution matrices do not take into account long-range interactions between residues.
  • They assume that identical residues are equal (whereas in real life a residue at the active site is under different evolutionary constraints than the same residue outside the active site).
  • They assume the rate of evolution to be constant.

Alignment score
• Amino acid substitution matrices
  • Example: PAM250
  • Most used: BLOSUM62
• Raw score of an alignment (PAM250)

    TPEA
    ¦| |
    APGA

  Score = 1 + 6 + 0 + 2 = 9
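The raw score of an ungapped alignment is a sum of matrix look-ups. The sketch below holds only the four PAM250 entries used in the TPEA / APGA example of this slide, not the full matrix; the names PAM250_SUBSET, substitution_score and raw_score are hypothetical.

```python
# Only the four PAM250 entries needed for the TPEA / APGA example.
# Keys are stored with the two residues in alphabetical order.
PAM250_SUBSET = {
    ("A", "T"): 1,
    ("P", "P"): 6,
    ("E", "G"): 0,
    ("A", "A"): 2,
}


def substitution_score(a: str, b: str) -> int:
    """Look up a symmetric substitution score."""
    first, second = sorted((a, b))
    return PAM250_SUBSET[(first, second)]


def raw_score(row_a: str, row_b: str) -> int:
    """Sum of substitution scores over the columns of an ungapped alignment."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    return sum(substitution_score(a, b) for a, b in zip(row_a, row_b))


if __name__ == "__main__":
    print(raw_score("TPEA", "APGA"))
```

With a complete matrix the dictionary would hold all amino acid pairs; a pair missing from this subset raises a KeyError.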

Gaps
• Insertions or deletions
  • Proteins often contain regions where residues have been inserted or deleted during evolution
  • There are constraints on where these insertions and deletions can happen (between structural or functional elements such as alpha helices, active sites, etc.)
• Gaps in alignments

    GCATGCATGCAACTGCAT
    |||||||||
    GCATGCATGGGCAACTGCAT

  can be improved by inserting a gap

    GCATGCATG--CAACTGCAT
    |||||||||  |||||||||
    GCATGCATGGGCAACTGCAT

Gap opening and extension penalties
• Costs of gaps in alignments
  • We want to simulate as closely as possible the evolutionary mechanisms involved in gap occurrence.
• Example
  • Two alignments with an identical number of gap positions but very different gap distributions. We may prefer one large gap to several small ones (e.g. poorly conserved loops between well-conserved helices)

    CGATGCAGCAGCAGCATCG
    ||||||      |||||||
    CGATGC------AGCATCG

    CGATGCAGCAGCAGCATCG
    || || |||| ||  || |
    CG-TG-AGCA-CA--AT-G

• Gap opening penalty
  • Counted each time a gap is opened in an alignment (some programs include the first extension in this penalty)
• Gap extension penalty
  • Counted for each extension of a gap in an alignment

Gap opening and extension penalties
• Example
  • With a match score of 1 and a mismatch score of 0
  • With an opening penalty of 10 and an extension penalty of 1, we have the following scores:

    CGATGCAGCAGCAGCATCG
    ||||||      |||||||
    CGATGC------AGCATCG
    13 x 1 - 10 - 6 x 1 = -3   (1 gap opening, 6 gap extensions)

    CGATGCAGCAGCAGCATCG
    || || |||| ||  || |
    CG-TG-AGCA-CA--AT-G
    13 x 1 - 5 x 10 - 6 x 1 = -43   (5 gap openings, 6 gap extensions)
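The arithmetic of this example can be reproduced with a small scoring function. The sketch follows the convention used in the example, where every gap position pays the extension penalty and the first position of each gap also pays the opening penalty. Some programs count differently, so the function gapped_score and its default parameters are illustrative only.

```python
def gapped_score(
    row_a: str,
    row_b: str,
    match: int = 1,
    mismatch: int = 0,
    gap_open: int = 10,
    gap_extend: int = 1,
) -> int:
    """Score an alignment with separate gap opening and extension penalties."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    in_gap = False
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            if not in_gap:
                score -= gap_open  # paid once per gap
            score -= gap_extend  # paid for every gap position
            in_gap = True
        else:
            score += match if a == b else mismatch
            in_gap = False
    return score


if __name__ == "__main__":
    top = "CGATGCAGCAGCAGCATCG"
    print(gapped_score(top, "CGATGC------AGCATCG"))
    print(gapped_score(top, "CG-TG-AGCA-CA--AT-G"))
```

For simplicity, a gap in one row directly followed by a gap in the other row is treated as a single gap. With the default parameters the two alignments score -3 and -43, so the single long gap is strongly preferred.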

Statistical evaluation of results
• Alignments are evaluated according to their score
• Raw score
  • It is the sum of the amino acid substitution scores and gap penalties (gap opening and gap extension)
  • Depends on the scoring system (substitution matrix, etc.)
  • Different alignments should not be compared based only on the raw score
    • It is possible that a "bad" long alignment gets a better raw score than a very good short alignment.
    • We need a normalised score to compare alignments!
    • We need to evaluate the biological meaning of the score (p-value, e-value).
• Normalised score
  • Is independent of the scoring system
  • Allows the comparison of different alignments
  • Units: expressed in bits
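One widely used normalisation is the bit score of Karlin-Altschul statistics, S' = (lambda x S - ln K) / ln 2, where lambda and K are constants tied to the scoring system. The slides do not say which formula they have in mind, so this is an assumption, and the lambda and K values in the example are placeholders rather than values for any real matrix.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Convert a raw alignment score to bits: (lam * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


if __name__ == "__main__":
    # Placeholder parameters; real values depend on the substitution
    # matrix and gap penalties and are supplied by the search program.
    print(bit_score(raw_score=80, lam=0.3, k=0.05))
```

Because lambda and K absorb the scale of the scoring system, bit scores obtained with different matrices can be compared directly.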

Statistical evaluation of results
• Distribution of alignment scores - Extreme Value Distribution
• Random sequences and alignment scores
  • Sequence alignment scores between random sequences are distributed following an extreme value distribution (EVD).
[Figure: random sequences (Ala, Val, ..., Trp) are aligned pairwise and the scores are plotted as a distribution (number of observations against score); most random alignments get a low score, a few get a high score due to "luck"]

Statistical evaluation of results
• Distribution of alignment scores - Extreme Value Distribution
  • High scoring random alignments have a low probability.
  • The EVD allows us to compute the probability with which our biological alignment could be due to randomness (to chance).
  • Caveat: finding the threshold of significant alignments.
[Figure: EVD of scores with a threshold for significant alignments. Score x, below the threshold: our alignment has a high probability of being the result of random sequence similarity. Score y, above the threshold: our alignment is very unlikely to be obtained with random sequences.]
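The probability of reaching a given score by chance can be read off the upper tail of the EVD. The sketch below uses the Gumbel form P(S >= x) = 1 - exp(-exp(-lambda (x - mu))); the location mu and scale lambda would have to be fitted to scores of random alignments, and the values used here are hypothetical.

```python
import math


def evd_p_value(score: float, mu: float, lam: float) -> float:
    """Probability that a random alignment scores at least `score`."""
    # -expm1(-y) equals 1 - exp(-y) but stays accurate for very small y.
    return -math.expm1(-math.exp(-lam * (score - mu)))


if __name__ == "__main__":
    mu, lam = 20.0, 0.3  # hypothetical fit to random-sequence scores
    for score in (20, 40, 80):
        print(score, evd_p_value(score, mu, lam))
```

The probability drops roughly exponentially as the score moves above mu, which is why a modest increase in score can move an alignment from "expected by chance" to "significant".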

Statistical evaluation of results
• Statistics derived from the scores
• p-value (ranges from 0 to 100%)
  • Probability that an alignment with this score occurs by chance in a database of this size
  • The closer the p-value is to 0, the better the alignment
• e-value (ranges from 0 to N)
  • Number of matches with this score one can expect to find by chance in a database of this size
  • The closer the e-value is to 0, the better the alignment
• Relationship between e-value and p-value:
  • In a database containing N sequences: e = p x N
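The relationship e = p x N translates directly into code. In this sketch p is taken as the chance probability for one comparison and N as the number of sequences searched; the numbers in the example are hypothetical.

```python
def e_value(p_value: float, database_size: int) -> float:
    """Expected number of chance matches: e = p x N."""
    return p_value * database_size


if __name__ == "__main__":
    # Hypothetical search: p = 1e-6 against a database of 500,000 sequences.
    print(e_value(1e-6, 500_000))
```

The same p-value therefore becomes less convincing as the database grows, since more chance matches are expected.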

Diagonal plots or Dotplot
• Concept of a Dotplot
  • Produces a graphical representation of similarity regions.
  • The horizontal and vertical dimensions correspond to the compared sequences.
  • A region of similarity stands out as a diagonal.
[Figure: dotplot of tissue-type plasminogen activator against urokinase-type plasminogen activator]
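A text-mode dotplot is easy to sketch. The version below marks a cell when at least min_matches residues agree within a sliding window; the function name dotplot and both parameters are assumptions for this illustration, and tools such as Dotlet use more refined scoring.

```python
def dotplot(seq_x: str, seq_y: str, window: int = 1, min_matches: int = 1) -> list[str]:
    """Return one text row per position of seq_y; '*' marks a similar window."""
    rows = []
    for i in range(len(seq_y) - window + 1):
        cells = []
        for j in range(len(seq_x) - window + 1):
            # Count identical residues in the two windows starting at i and j.
            matches = sum(seq_y[i + k] == seq_x[j + k] for k in range(window))
            cells.append("*" if matches >= min_matches else ".")
        rows.append("".join(cells))
    return rows


if __name__ == "__main__":
    for row in dotplot("GCATGCATGGGCAACTGCAT", "GCATGCATGCAACTGCAT", window=3, min_matches=3):
        print(row)
```

Raising window and min_matches removes isolated chance dots, so that only runs of similarity remain visible as diagonals.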

Reading a Dotplot
• As simple as projecting the diagonals onto the axes.
[Figure: dotplot of tissue-type plasminogen activator (regions A, A', B, C, D) against urokinase-type plasminogen activator (regions A, B, C, D)]

Dotplot limitations
• It is a visual aid. The human eye can rapidly identify similar regions in sequences.
• It is a good way to explore sequence organisation, either between 2 different sequences or inside the same sequence (ssDNA repeats, RNA stem loops, etc.).
• It does not provide an alignment.

Finding an alignment
• Alignment algorithms
  • An alignment program tries to find the best alignment between two sequences given the scoring system.
  • This can be seen as trying to find a path through the dotplot diagram including all (or the most visible) diagonals.
• Alignment types
  • Global: alignment between the complete sequence A and the complete sequence B
  • Local: alignment between a sub-sequence of A and a sub-sequence of B
• Computer implementation (algorithms)
  • Dynamic programming
  • Global: Needleman-Wunsch
  • Local: Smith-Waterman

Global alignment (Needleman-Wunsch)
• Example
  • Global alignments are very sensitive to gap penalties
  • Global alignments do not take into account the modular nature of proteins
[Figure: global alignment of tissue-type plasminogen activator and urokinase-type plasminogen activator]
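A minimal sketch of the Needleman-Wunsch dynamic programming scheme is given below. To stay short it assumes a single linear gap penalty instead of separate opening and extension penalties and a plain match / mismatch score instead of a substitution matrix. The scoring values are arbitrary, and when several optimal alignments exist only one is returned.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    f = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        f[i][0] = i * gap  # prefix of a aligned to gaps only
    for j in range(1, cols):
        f[0][j] = j * gap  # prefix of b aligned to gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            diag = f[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            f[i][j] = max(diag, f[i - 1][j] + gap, f[i][j - 1] + gap)

    # Trace back from the bottom-right corner to the top-left corner.
    i, j = len(a), len(b)
    out_a, out_b = [], []
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and f[i][j] == f[i - 1][j - 1] + pair:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and f[i][j] == f[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return f[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    score, row_a, row_b = needleman_wunsch("GCATGCATGCAACTGCAT", "GCATGCATGGGCAACTGCAT")
    print(score, row_a, row_b, sep="\n")
```

Every residue of both sequences ends up in the alignment, which is why unrelated regions are forced together and the result reacts strongly to the gap penalty.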

Local alignment (Smith-Waterman)
• Example
  • Local alignments are more sensitive to the modular nature of proteins
  • They can be used to search databases
[Figure: local alignments of tissue-type plasminogen activator and urokinase-type plasminogen activator]
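The Smith-Waterman scheme differs from the global one in two ways: cell scores are never allowed to drop below zero, and the traceback starts at the best-scoring cell and stops at the first zero. The sketch below makes the same simplifying assumptions as a basic global implementation would (linear gap penalty, match / mismatch scores, arbitrary values) and reports only one best local alignment.

```python
def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2):
    """Return (score, aligned_a, aligned_b) for one best local alignment."""
    h = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = h[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets an alignment restart wherever the score turns negative.
            h[i][j] = max(0, diag, h[i - 1][j] + gap, h[i][j - 1] + gap)
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)

    # Trace back from the best cell until a cell with score 0 is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and h[i][j] > 0:
        pair = match if a[i - 1] == b[j - 1] else mismatch
        if h[i][j] == h[i - 1][j - 1] + pair:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif h[i][j] == h[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(smith_waterman("TTTTGATTACATTTT", "CCCCGATTACACCCC"))
```

Only the shared region is reported and the dissimilar flanks are ignored, which suits modular proteins that share a single domain.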

Algorithms for pairwise alignments
• Web resources
  • LALIGN - pairwise sequence alignment: www.ch.embnet.org/software/LALIGN_form.html
  • PRSS - alignment score evaluation: www.ch.embnet.org/software/PRSS_form.html
• Concluding remarks
  • Substitution matrices and gap penalties introduce biological information into the alignment algorithms.
  • The fact that two sequences can be aligned does not mean that they share a common biological history. The relevance of the alignment must be assessed with a statistical score.
  • There are many ways to align two sequences.
  • Do not blindly trust your alignment to be the only truth. Gapped regions in particular may be quite variable.
  • Sequences sharing less than 20% similarity are difficult to align:
    • You enter the Twilight Zone (Doolittle, 1986)
    • Alignments may appear plausible to the eye but are no longer statistically significant.
    • Other methods are needed to explore these sequences (e.g. profiles)

Acknowledgments & References
Acknowledgments: Laurent Falquet, Lorenza Bordoli, Volker Flegel, Frédérique Galisson
References
• Ian Korf, Mark Yandell & Joseph Bedell, BLAST, O'Reilly
• David W. Mount, Bioinformatics, Cold Spring Harbor Laboratory Press
• Jean-Michel Claverie & Cedric Notredame, Bioinformatics for Dummies, Wiley Publishing
