Multiple Sequence Alignment

Multiple Sequence Alignment Julie Thompson Laboratory of Integrative Bioinformatics and Genomics IGBMC, Strasbourg, France julie@igbmc.fr

Multiple Sequence Alignment
• Introduction: what is a multiple alignment?
• Multiple alignment construction
  • Traditional approaches: optimal, progressive
  • Alignment parameters
  • Iterative and co-operative approaches
• Multiple alignment analysis
  • Quality analysis/error detection
  • Conserved/homologous regions
• Multiple alignment applications

What is a multiple alignment?
• a representation of a set of sequences, where equivalent residues (e.g. functional, structural) are aligned in rows or, more usually, columns
Example: part of an alignment of SH2 domains from 14 sequences (lnk_rat, crk1_mouse, nck_human, ht16_hydat, pip5_human, fer_human, 1ab2, 1mil, 1blj, 1shd, 1lkkA, 1csy, 1bfi, 1gri)
Legend: * conserved identical residues, : conserved similar residues

What is a multiple alignment?
[Figure: alignment annotated with conserved residues, secondary structure and a conservation profile]

Multiple Alignment Construction • Optimal multiple alignment example : MSA (Lipman et al. 1989, Gupta et al. 1995)

Optimal multiple alignment
Extension of dynamic programming for 2 sequences => k dimensions for k sequences
Example: alignment of 3 sequences
Problem: calculation time and memory requirements
Time proportional to N^k for k sequences of length N => limited to fewer than 10 sequences
Alignment of 5 sulfate binding proteins, length 224-263 residues:
MSA        >12 hours
OMA        62.9 min
ClustalW   0.6 sec
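The pairwise case that the optimal method generalises can be sketched as follows. This is only an illustration, not the MSA or OMA program: the function names and the match/mismatch/gap values are hypothetical, and a simple proportional gap cost is assumed. `dp_cells` shows how the lattice grows when one axis is added per sequence.

```python
def nw_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Optimal global alignment score of two sequences (2-D dynamic programming)."""
    # Row 0: aligning the empty prefix of a with b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: a[:i] aligned with the empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def dp_cells(length: int, n_sequences: int) -> int:
    """Cells in the full lattice: one axis of (length + 1) positions per sequence."""
    return (length + 1) ** n_sequences


if __name__ == "__main__":
    print(nw_score("ACGT", "AGT"))
    for k in (2, 3, 5):
        print(k, dp_cells(250, k))
```

With sequences of about 250 residues the lattice has tens of thousands of cells for 2 sequences but roughly 10^12 for 5, which is consistent with the running times quoted on the slide.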

Multiple Alignment Construction • Optimal multiple alignment MSA, OMA • Progressive multiple alignment ClustalW (Thompson et al. NAR. 1994) ClustalX (Thompson et al. NAR. 1997)

Progressive multiple alignment
Idea: progressively align pairs of sequences (or groups of sequences)
Problem: which sequences to start with? How to decide the order of alignment?
• first align the most closely related sequences
How to measure the similarity of the sequences?
• align all the sequences pairwise
• calculate the similarity between each pair from the alignment

Progressive multiple alignment
1) Pairwise alignments of all sequences
The alignment can be obtained by:
- local or global method
- dynamic programming or heuristic method (e.g. K-tuple count)
Ex: local pairwise alignments of globin sequences

Hbb_human  3 LTPEEKSAVTALWGKV..NVDEVGGEALGRLLVVYPWTQRFFESFGDLST ...
Hba_human  2 LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF.DLS. ...

Hbb_human  1 VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLST ...
Hbb_horse  1 VQLSGEEKAAVLALWDKVNEEEVGGEALGRLLVVYPWTQRFFDSFGDLSN ...

Hba_human  2 LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF.DLSH ...
Hbb_horse  3 LSGEEKAAVLALWDKVNEE..EVGGEALGRLLVVYPWTQRFFDSFGDLSN ...

Progressive multiple alignment
2) Construction of a distance matrix
Example in ClustalW/X: distance between 2 sequences = 1 - (No. identical residues / No. aligned residues)
Ex: 7 globin sequences

                  1    2    3    4    5    6    7
1 Hbb_human       -
2 Hbb_horse     .17    -
3 Hba_human     .59  .60    -
4 Hba_horse     .59  .59  .13    -
5 Myg_phyca     .77  .77  .75  .75    -
6 Glb5_petma    .81  .82  .73  .74  .80    -
7 Lgb2_lupla    .87  .86  .86  .88  .93  .90    -
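The distance defined on this slide can be computed directly from two rows of a pairwise alignment. The sketch below is an illustration only: the function name is hypothetical, and it assumes that "aligned residues" means columns where neither sequence has a gap character ('-' or '.').

```python
def pairwise_distance(row_a: str, row_b: str, gap_chars: str = "-.") -> float:
    """Distance = 1 - (identical residues / aligned residues) for two aligned rows."""
    aligned = 0
    identical = 0
    for x, y in zip(row_a, row_b):
        if x in gap_chars or y in gap_chars:
            continue  # assumption: columns containing a gap are not counted
        aligned += 1
        if x == y:
            identical += 1
    if aligned == 0:
        raise ValueError("no aligned residue pairs")
    return 1 - identical / aligned


if __name__ == "__main__":
    # Toy rows, not taken from the globin example.
    print(pairwise_distance("ACDE", "ACDF"))
```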

Progressive multiple alignment
3) Decide order of alignment
• Sequential branching
• Construction of a ‘guide tree’
  - Neighbor-Joining (NJ)
  - UPGMA
  - Maximum likelihood
[Figure: two panels for the 7 globin sequences, "Progressive alignment using sequential branching" (numbered alignment steps) and "Progressive alignment following a guide tree" (tree with branch lengths)]
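How a guide tree fixes the order of alignment can be sketched with UPGMA, one of the options listed on this slide, applied to the globin distance matrix from step 2. This is a simplified illustration with hypothetical function names; it only returns the order of merges (no branch lengths) and is not the neighbor-joining procedure used by ClustalW/X.

```python
from itertools import combinations

NAMES = ["Hbb_human", "Hbb_horse", "Hba_human", "Hba_horse",
         "Myg_phyca", "Glb5_petma", "Lgb2_lupla"]

# Lower triangle of the globin distance matrix from step 2.
LOWER = [
    [],
    [.17],
    [.59, .60],
    [.59, .59, .13],
    [.77, .77, .75, .75],
    [.81, .82, .73, .74, .80],
    [.87, .86, .86, .88, .93, .90],
]


def symmetric(lower):
    """Expand a lower triangle into a full symmetric matrix."""
    n = len(lower)
    full = [[0.0] * n for _ in range(n)]
    for i, row in enumerate(lower):
        for j, d in enumerate(row):
            full[i][j] = full[j][i] = d
    return full


def upgma_order(names, dist):
    """Return the sequence of cluster merges chosen by UPGMA."""
    clusters = [(i,) for i in range(len(names))]
    merges = []

    def average(c1, c2):
        # UPGMA: mean of all pairwise distances between the two clusters
        return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: average(*pair))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
        merges.append((tuple(names[i] for i in a), tuple(names[i] for i in b)))
    return merges


if __name__ == "__main__":
    for step, (left, right) in enumerate(upgma_order(NAMES, symmetric(LOWER)), 1):
        print(step, left, "+", right)
```

The closest pairs are joined first, so the two alpha chains and the two beta chains are aligned before the more divergent myoglobin, lamprey globin and leghemoglobin are added.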

Progressive multiple alignment
4) Progressive multiple alignment
The sequences are aligned progressively (global or local algorithm):
- alignment of 2 sequences
- alignment of 1 sequence and a profile (group of sequences)
- alignment of 2 profiles (groups of sequences)

Progressive multiple alignment H1 H3 H2 H4 H6 H5 H7

Progressive multiple alignment
Programs classified by tree method and alignment type:

Method   Global               Local
SB       multal               SBpima
UPGMA    multalign, pileup
NJ       clustalx
ML                            MLpima

SB - sequential branching
UPGMA - Unweighted Pair Grouping Method
ML - maximum likelihood
NJ - neighbor-joining

Alignment parameters : similarity matrices
Dynamic programming methods score an alignment using residue similarity matrices, containing a score for matching all pairs of residues
For nucleotide sequences:

     A   C   G   T
A    2  -2  -1  -2
C   -2   2  -2  -1
G   -1  -2   2  -2
T   -2  -1  -2   2

Transitions (A-G or C-T) are more frequent than transversions (e.g. A-T or C-G), so they are penalised less (-1 instead of -2)
More complex matrices exist where matches between ambiguous nucleotides are given values whenever there is any overlap in the sets of nucleotides represented
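Scoring with the nucleotide matrix on this slide can be sketched as a simple lookup. The matrix values are those shown above; the function name is hypothetical and the example handles ungapped alignments only.

```python
NUCLEOTIDES = "ACGT"
MATRIX = [
    [2, -2, -1, -2],   # A
    [-2, 2, -2, -1],   # C
    [-1, -2, 2, -2],   # G
    [-2, -1, -2, 2],   # T
]
# Lookup table keyed by residue pair, e.g. SCORE[("A", "G")] == -1
SCORE = {
    (a, b): MATRIX[i][j]
    for i, a in enumerate(NUCLEOTIDES)
    for j, b in enumerate(NUCLEOTIDES)
}


def score_ungapped(s1: str, s2: str) -> int:
    """Sum of matrix scores over the columns of an ungapped alignment."""
    if len(s1) != len(s2):
        raise ValueError("aligned sequences must have the same length")
    return sum(SCORE[(x, y)] for x, y in zip(s1, s2))


if __name__ == "__main__":
    print(score_ungapped("ACGT", "ACGT"))  # four identities
    print(score_ungapped("AC", "GT"))      # two transitions
    print(score_ungapped("AC", "TG"))      # two transversions
```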

Alignment parameters : similarity matrices
For proteins, a wide variety of matrices exist: Identity, PAM, Blosum, Gonnet etc.
Matrices are generally constructed by observing the mutations in large sets of alignments, either sequence-based or structure-based
Matrices range from strict ones for comparing closely related sequences to soft ones for very divergent sequences
e.g. PAM250 corresponds to an evolutionary distance of 250 PAM units (250 accepted point mutations per 100 residues), or approximately 80% residue divergence
PAM1 corresponds to about 1% divergence

Alignment parameters : similarity matrices
A single best matrix does not exist!
• Altschul (1991) suggests PAM250 for related sequences, PAM120 when the sequences are not known to be related and PAM40 to search for short segments of highly similar sequences
• Henikoff and Henikoff (1993) suggest Blosum62 as a good all-round matrix, Blosum45 for more divergent sequences and Blosum100 for strongly related sequences
• ClustalW automatically selects a suitable matrix depending on the observed pairwise % identity. By default:

% identity (ID)     Matrix
ID > 35%            Gonnet 80
35% > ID > 25%      Gonnet 250
ID < 25%            Gonnet 350
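The default thresholds quoted for ClustalW amount to a simple lookup on the observed percent identity. The sketch below is illustrative only: the function name is hypothetical, and because the slide does not say which matrix applies exactly at 25% or 35%, the handling of those boundary values is an assumption.

```python
def select_gonnet_matrix(percent_identity: float) -> str:
    """Pick a Gonnet matrix from the pairwise % identity (thresholds from the slide)."""
    if percent_identity > 35:
        return "Gonnet 80"
    if percent_identity > 25:  # assumption: exactly 35% falls in this band
        return "Gonnet 250"
    return "Gonnet 350"        # assumption: exactly 25% falls in this band


if __name__ == "__main__":
    for pid in (50, 30, 10):
        print(pid, select_gonnet_matrix(pid))
```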

Alignment parameters : gap penalties
• A gap penalty is a cost for introducing gaps into the alignment, corresponding to insertions or deletions in the sequences
SFGDLSNPGAVMG
HF-DLS-----HG
• proportional gap costs charge a fixed penalty for each residue aligned with a gap - the cost of a gap is proportional to its length: GAP_COST = uk, where k is the length of the gap
• ‘affine’ gap costs define a cost for introducing or ‘opening’ a gap, plus a length-dependent ‘extension’ cost: GAP_COST = v + uk, where v is the gap opening cost and u is the gap extension cost
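The two gap cost formulas can be compared on the example row from this slide, which contains one gap of length 1 and one of length 5. The function names and the penalty values (v = 10, u = 1) are hypothetical and chosen only for illustration.

```python
import re


def proportional_gap_cost(k: int, u: float) -> float:
    """GAP_COST = u * k for a gap of length k."""
    return u * k


def affine_gap_cost(k: int, v: float, u: float) -> float:
    """GAP_COST = v + u * k: opening cost v plus extension cost u per position."""
    return v + u * k


def total_gap_cost(row: str, v: float, u: float) -> float:
    """Sum the affine cost over every run of '-' in one aligned row."""
    return sum(affine_gap_cost(len(run), v, u) for run in re.findall(r"-+", row))


if __name__ == "__main__":
    row = "HF-DLS-----HG"
    print(total_gap_cost(row, v=10, u=1))  # affine: two openings are charged
    print(total_gap_cost(row, v=0, u=1))   # v = 0 reduces to the proportional cost
```

With an opening cost, one long gap is cheaper than several short gaps covering the same number of positions, which is why affine costs tend to group insertions and deletions together.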

Alignment parameters : gap penalties
• ClustalW uses position-specific gap penalties to make gaps more or less likely at different positions in the alignment
• Gap penalties are lowered at existing gaps and increased near to existing gaps
• Gap penalties are lowered in hydrophilic stretches
• Otherwise, gap opening penalties are modified according to their observed relative frequencies adjacent to gaps (Pascarella & Argos, 1992)
The goal is to introduce gaps in sequence segments corresponding to flexible regions of the protein structure
Example:
HLTPEEKSAVTALWGKVN--VDEVGGEALGRLLVVYPWTQRFFESFGDL
QLSGEEKAAVLALWDKVN--EEEVGGEALGRLLVVYPWTQRFFDSFGDL
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLS
VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHFDLS

Multiple Alignment Construction • Optimal multiple alignment MSA, OMA • Progressive multiple alignment ClustalW,ClustalX • Iterative multiple alignment PRRP (Gotoh, 1993) SAGA (Notredame et al. NAR. 1996) DIALIGN (Morgenstern et al. 1999) HMMER (Eddy 1998), SAM (Karplus et al. 2001)

Iterative refinement
PRRP (Gotoh, 1993) refines an initial progressive multiple alignment by iteratively dividing the alignment into 2 profiles and realigning them.
Flowchart: initial alignment (global progressive) -> divide sequences into 2 groups (profile 1, profile 2) -> pairwise profile alignment -> refined alignment -> converged? (no: return to the division step)

Genetic Algorithms
SAGA (Notredame et al. 1996) evolves a population of alignments in a quasi-evolutionary manner, iteratively improving the fitness of the population

Segment-to-segment alignment
Dialign (Morgenstern et al. 1996) compares segments of sequences instead of single residues
1. construct dot-plots of all possible pairs of sequences (axes: Sequence i, Sequence j)
2. find a maximal set of consistent diagonals in all the sequences
3. Local alignment - residues between the diagonals are not aligned

.......aeyVRALFDFngndeedlpfkKGDILRIrdkpeeq...............WWNAedsegkr.GMIPVPYVek..........
........nlFVALYDFvasgdntlsitKGEKLRVlgynhnge..............WCEAqtkngq..GWVPSNYItpvns.......
ieqvpqqptyVQALFDFdpqedgelgfrRGDFIHVmdnsdpn...............WWKGachgqt..GMFPRNYVtpvnrnv.....
gsmstselkkVVALYDYmpmnandlqlrKGDEYFIleesnlp...............WWRArdkngqe.GYIPSNYVteaeds......
.....tagkiFRAMYDYmaadadevsfkDGDAIINvqaideg...............WMYGtvqrtgrtGMLPANYVeai.........
..gsptfkcaVKALFDYkaqredeltfiKSAIIQNvekqegg...............WWRGdyggkkq.LWFPSNYVeemvnpegihrd
.......gyqYRALYDYkkereedidlhLGDILTVnkgslvalgfsdgqearpeeigWLNGynettgerGDFPGTYVeyigrkkisp..

Multiple alignment methods

Progressive:
Method   Global               Local
SB       multal               SBpima
UPGMA    multalign, pileup
NJ       clustalx
ML                            MLpima

Iterative:
Method         Global   Local
(refinement)   prrp     dialign
Genetic Algo.  saga
HMM            hmmt

Comparison of programs
League table based on the BAliBASE benchmark database (Thompson et al. 1999)
Programs compared: multal, multalign, pileup, clustalx, prrp, saga, hmmt, MLpima, SBpima, dialign
Reference 1: < 6 sequences
Reference 2: a family with an orphan
Reference 3: several sub-families
Reference 4: long N/C terminal extensions
Reference 5: long insertions
Sequence length categories: < 100 residues, > 400 residues, all
[Figure: league table of program rankings for each reference set; labels: GLOBAL iterative, LOCAL iterative; several entries are N/A]
• Iterative algorithms can improve alignment quality, but can be slow
• Global algorithms work well when sequences are homologous over their full lengths, local algorithms are better for non-colinear sequences

Multiple Alignment Construction
• Optimal multiple alignment: MSA, OMA
• Progressive multiple alignment: ClustalW, ClustalX
• Iterative multiple alignment: PRRP, SAGA, DIALIGN, HMMER, SAM
• Co-operative multiple alignment
  • T-COFFEE (Notredame et al. 2000) http://igs-server.cnrs-mrs.fr/Tcoffee/
  • DbClustal (Thompson et al. 2000) http://www-igbmc.u-strasbg.fr/BioInfo/
  • MAFFT (Katoh et al. 2002) http://www.biophys.kyoto-u.ac.jp/~katoh/programs/align/mafft/
  • MUSCLE (Edgar, 2004) http://www.drive5.com/muscle
  • Probcons (Do et al. 2005)
  • Kalign (Lassmann et al. 2005)

DbClustal
http://bips.u-strasbg.fr/PipeAlign/
[Figure: pipeline: Query Sequence -> Blast Database Search -> Database Hits (covering Domain A, Domain B, Domain C) -> Ballast Anchors -> DbClustal Alignment of the query sequence using the anchors]

Comparison ClustalW / DbClustal
[Figure: two panels, ClustalW and DbClustal]

MAFFT • Local homologous segments detected using a Fast Fourier Transform • Pairwise alignments are performed using restricted global dynamic programming • Multiple alignment is built up using a progressive algorithm, similar to ClustalW • Multiple alignment is then iteratively refined by dividing alignment into 2 parts and realigning

MAFFT: pairwise alignments
1. Fast Fourier Transform to detect local conserved segments
2. Segment Level Dynamic Programming to select ‘consistent’ segments
3. Fix residues at the centre of each segment pair and realign between fixed points (white regions only)
[Figure: plot of the correlation c(k) against the lag k, with k = -1 and k = 2 marked]
K=2
GLWGKAAAEEEGLWLFF--
--KGVFGAEQEGLFVFFGG
K=-1
-GLWGKAAAEEEGLWLFF
KGVFGAEQEGLFVFFGG-
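The role of the correlation c(k) in step 1 can be illustrated with a deliberately simplified stand-in. The sketch below counts identical residues at each lag instead of correlating physico-chemical property values, and it evaluates the sum directly rather than with a Fast Fourier Transform, so it is not MAFFT's actual computation. The function names are hypothetical, and the sign convention (a positive k means the second sequence is shifted to the right) is an assumption chosen to match the K=2 example.

```python
def match_correlation(s1: str, s2: str) -> dict:
    """c(k) = number of positions n with s1[n] == s2[n - k], for every possible lag k."""
    c = {}
    for k in range(-(len(s2) - 1), len(s1)):
        c[k] = sum(
            1
            for n in range(len(s1))
            if 0 <= n - k < len(s2) and s1[n] == s2[n - k]
        )
    return c


def best_lag(s1: str, s2: str) -> int:
    """Lag with the highest correlation (first one found if there are ties)."""
    c = match_correlation(s1, s2)
    return max(c, key=c.get)


if __name__ == "__main__":
    # Toy sequences: "GLW" matches the end of "AAGLW" once it is shifted right by 2.
    print(best_lag("AAGLW", "GLW"))
```

The direct sum takes time proportional to the product of the two lengths; the point of the Fourier transform in MAFFT is to obtain the whole correlation profile much faster.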

State-of-the-art
• Co-operative algorithms have led to significant improvements…
[Figure: scores of ClustalW (1994), Dialign (1996), Mafft (2002) and Probcons (2005) on BAliBASE 3: Ref 11 (<20% ID), Ref 12 (20-40% ID), Ref 2 (orphan), Ref 3 (sub-families), Ref 4 (extensions), Ref 5 (insertions)]
… but none of the methods currently available is capable of producing high-quality alignments for all test cases
Thompson et al. 2005, 2006

RNA alignment methods • Comparison using ‘BRAliBASE’ RNA structure alignments (Gardner et al, 2005) • Some more recent methods: • Sequence: R-Coffee (Wilm, 2008), MAFFT (Katoh, 2008) • Structure: LARA (Bauer, 2007), FoldalignM (Torarinsson, 2007), SCARNA (Tabei, 2008) • Above 60% identity, sequence and structure based approaches have similar scores • Algorithms incorporating structural information outperform pure sequence methods. However, these algorithms are computationally demanding which severely limits their use in practice.

DNA alignment methods • Complete genomes • Local alignments (BlastZ, MultiZ, MUMmer,…) • Global alignments (MGA, Multi-LAGAN,MAVID, MAUVE,MAP2, Mulan,…) Reviewed in Dewey and Pachter, Human Molecular Genetics, 2006

Multiple alignment analysis • Are the sequences correctly aligned? • Quality analysis: alignment objective functions (SP, NorMD) • error detection and correction (RASCAL, Refiner) • Are the sequences in the alignment homologous? • Conserved/homologous regions (MCOFFEE, LEON) • Conserved (functional) residues

Objective functions
• Sum-of-pairs (Carrillo, Lipman, 1988): sum of scores for all pairs of sequences

Example (Blosum62 scores: N-N = 6, N-C = -3, C-C = 9):
Sequence 1   N N N
Sequence 2   N N N
Sequence 3   N N C
Sequence 4   N C C

Pair      Residue pairs                        Score
Seq1-2    3 pairs N-N                          3x6 = 18
Seq1-3    2 pairs N-N, 1 pair N-C              2x6+(-3) = 9
Seq1-4    1 pair N-N, 2 pairs N-C              6+2x(-3) = 0
Seq2-3    2 pairs N-N, 1 pair N-C              2x6+(-3) = 9
Seq2-4    1 pair N-N, 2 pairs N-C              6+2x(-3) = 0
Seq3-4    1 pair N-N, 1 pair N-C, 1 pair C-C   6+(-3)+9 = 12
Total                                          48

• Information content (Hertz et al, 1999)
  • Entropy column scores (between 0 and 1), sum for all columns in the alignment
• norMD (Thompson et al, 2001)
  • Column scores
  • normalisation for sequence set to be aligned (number, length, similarity)
  • <0.3 bad alignment
  • 0.3-0.7 some local errors
  • >0.7 good alignment
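The sum-of-pairs example on this slide can be reproduced in a few lines. The scores and the four sequences are taken from the slide; the function names are hypothetical, and gaps are not handled in this sketch.

```python
from itertools import combinations

# Blosum62 scores for the two residues used in the example.
SCORES = {("N", "N"): 6, ("N", "C"): -3, ("C", "N"): -3, ("C", "C"): 9}
ALIGNMENT = ["NNN", "NNN", "NNC", "NCC"]


def pair_score(row_a: str, row_b: str) -> int:
    """Score of one pair of aligned sequences, summed over columns."""
    return sum(SCORES[(x, y)] for x, y in zip(row_a, row_b))


def sum_of_pairs(rows) -> int:
    """Sum of pair scores over all pairs of sequences in the alignment."""
    return sum(pair_score(a, b) for a, b in combinations(rows, 2))


if __name__ == "__main__":
    print(sum_of_pairs(ALIGNMENT))  # 48, as on the slide
```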

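Objective functions: NorMD
[Figure: NorMD profiles (scale 0.5 to 1.0) along an alignment of Archaeal/Eukaryotic GluRS + GlnRS and Bacterial GluRS sequences (structures 1gln, 1exd), computed with window length = 8 and window length = 40; marked features: ‘HIGH’, H8, ‘KMSKS’]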

Error detection and correction
• RASCAL (Thompson et al, 2003), Refiner (Chakrabarti et al, 2006)
RASCAL:
- Define sequence groups with the Secator program (Wicker N. et al., 2001)
- Define core blocks: regions with average NorMD_sw above a specified threshold
- Calculate a Gribskov profile for each block in each group

Error detection and correction
• RASCAL, errors within core blocks
Example: metalloprotease, motif HExxH

Error detection and correction
• RASCAL, errors between core blocks
Example: methyltransferase, motif DxxxG[AST]GxF[ILV]
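Motifs such as HExxH and DxxxG[AST]GxF[ILV] can be located in individual sequences with a regular expression, which is one simple way to check whether a motif ends up in the same alignment columns for every sequence. This is not how RASCAL works internally; the function names and the test sequences are hypothetical, and the conversion assumes that a lowercase 'x' means "any residue".

```python
import re


def motif_to_regex(motif: str) -> str:
    """Convert a motif such as 'HExxH' into a regular expression ('x' = any residue)."""
    return motif.replace("x", ".")


def find_motif(sequence: str, motif: str) -> list:
    """Return the 0-based start positions of non-overlapping motif matches."""
    return [m.start() for m in re.finditer(motif_to_regex(motif), sequence)]


if __name__ == "__main__":
    # Made-up sequences used only to exercise the two motifs from the slides.
    print(find_motif("AAHEMGHAA", "HExxH"))
    print(find_motif("MKDAAAGSGAFLQ", "DxxxG[AST]GxF[ILV]"))
```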

Homology detection methods
• Sequence percent identity:
  • >30% identity => sequences are homologous
  • 15-30% identity => ‘twilight zone’
• local analysis of positional conservation
  • AL2CO (Pei, Grishin, 2001), SEGID (Wang,Zu,2003), NorMD
• Conserved regions
  • LEON (Thompson et al, 2004), MCOFFEE (Moretti et al, 2007)
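A local analysis of positional conservation can be sketched with a per-column Shannon entropy, where a low value indicates a conserved column. This is a generic illustration, not the scoring used by AL2CO, SEGID or NorMD; the function names are hypothetical and gap characters are simply ignored, which is an assumption.

```python
import math
from collections import Counter


def column_entropy(column: str, gap_chars: str = "-.") -> float:
    """Shannon entropy (in bits) of the residues in one alignment column."""
    residues = [r for r in column if r not in gap_chars]
    if not residues:
        raise ValueError("column contains only gaps")
    counts = Counter(residues)
    total = len(residues)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def alignment_entropies(rows) -> list:
    """Entropy of every column of an alignment given as equal-length rows."""
    return [column_entropy("".join(col)) for col in zip(*rows)]


if __name__ == "__main__":
    # Toy alignment: first column conserved, second column split between two residues.
    print(alignment_entropies(["AC", "AC", "AD", "AD"]))
```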

Homology analysis with LEON • vertical analysis :sequence clustering, intermediate sequences • horizontal analysis : residue conservation, motif context information • composition analysis : prediction of compositionally biased segments • Homologous regions are delineated • Removal of sequences non-homologous to query

Homology analysis with LEON Query sequence: DKK1_HUMAN BlastP results :

Homology analysis with LEON
[Figure: alignment of dkk1, dkk2 and dkk3 with the Prokineticin/Intestinal toxin and Colipase (lipase protein cofactor) families; Pfam domains shown: Dickkopf N-terminal domain, Colipase, Colipase C-terminal domain]

Structural proteomics : target characterisation
Detection of structural homologs for targets in the SPINE (Structural Proteomics in Europe) project
For a training set of 510 potential targets:

Method                      No. of targets with at least 1 PDB neighbour
BlastP (E<10^-7)            142 (28%)
BlastP (E<10^-4)            166 (33%)
PipeAlign (BlastP E<10)     196 (38%)
PipeAlign (PDB-Blast)       223 (44%)

Conserved residue analysis • Active site residues are under evolutionary pressure to maintain their functional integrity and undergo fewer mutations than less functionally important amino acids • Methods: • Evolutionary trace (Lichtarge et al, 1996): sequence conservation patterns in homologous proteins are mapped onto the protein surface to generate clusters identifying functional interfaces

Conserved residue analysis • Comparison of sequence-based methods • FRcons combines information : • conservation at each site • amino acid distribution • predicted secondary structure (ss) • predicted relative solvent accessibility (rsa) FRcons: Fischer et al. Bioinformatics 2008

OrdAli : Ordered Alignment Analysis color scheme • residues conserved in all sequences in family • structural or functional importance: characteristic motifs • residues conserved within a sub-group of sequences • discriminant residues

Schematic alignment of aspartyl-tRNA synthetases
• universal proteins, play a key role in translation
[Figure: schematic alignment of eukaryotic (Euc), archaeal (Arc) and bacterial (Bac) sequences over alignment positions 180 to 930, showing the anticodon binding domain, catalytic core I (Motif I, flipping loop, Motif II), the insertion domain and catalytic core II (Motif III); labelled conserved residues: P, L, Q, PQ, KQ, R, H, G; colour key: family conserved, Archaea+Bacteria, Archaea+Eukaryote, Euc, Arc, Bac]
