Multiple Sequence Alignment

Alignment can be easy or difficult • Easy • Difficult due to insertions or deletions (indels)

Homology: Definition • Homology: similarity that is the result of inheritance from a common ancestor. The identification and analysis of homologies is central to phylogenetic systematics. • An alignment is a hypothesis of positional homology between bases or amino acids.

Multiple Sequence Alignment- Goals • To generate a concise, information-rich summary of sequence data. • Sometimes used to illustrate the dissimilarity between a group of sequences. • Alignments can be treated as models that can be used to test hypotheses. • Used to identify homologous residues within sequences.

Multiple sequence alignments - problems • All sequences show some similarity (even random sequences). • Similarity levels might be high in some parts of the sequence and low in other parts. • Sequences might show substantial length variation and presence/absence of various domains.

SSU rRNA • Structural RNA (not translated). • Found in the small ribosomal subunit. • Widely used for phylogeny reconstruction (found in every species). • Contains stem and loop structures. • Stem structures usually conform to Watson-Crick base pairing.

Alignment of 16S rRNA can be guided by secondary structure • [Figure: alignment of 16S rRNA sequences from different bacteria]

Protein Alignment may be guided by Tertiary Structure Interactions • [Figure: Homo sapiens DjlA protein and Escherichia coli DjlA protein]

Multiple Sequence Alignment- Methods • 3 main methods of alignment: • Manual (using custom-built text editors). • Automatic (using custom-built alignment software). • Combined

Manual Alignment - reasons • Might be carried out because: • Alignment is easy. • There is some extraneous information (structural). • Automated alignment methods have encountered the local minimum problem. • An automated alignment method can be “improved”.

Local minimum
GARFIELDTHEFAT---CAT
GARFIELDTHEFATFATCAT

Dotplots • The dotplot provides a way of quickly visualizing the similarities between all parts of two sequences simultaneously. • Let's consider a dotplot between sperm whale and human myoglobins. • Sperm whale myoglobin: GLSDGEWQLV LNVWGKVEAD IPGHGQEVLI RLFKGHPETL EKFDKFKHLK SEDEMKASED LKKHGATVLT ALGGILKKKG HHEAEIKPLA QSHATKHKIP VKYLEFISEC IIQVLQSKHP GDFGADAQGA MNKALELFRK DMASNYKELG FQG • Human myoglobin: VLSEGEWQLV LHVWAKVEAD VAGHGQDILI RLFKSHPETL EKFDRFKHLK TEAEMKASED LKKHGVTVLT ALGAILKKKG HHEAELKPLA QSHATKHKIP IKYLEFISEA IIHVLHSRHP GDFGADAQGA MNKALELFRK DIAAKYKELG YQG

Dotplot example sperm whale vs human myg • Put one sequence on top, the other on the side. • Where residues are identical put a dot. • Diagonal lines of dots show similarities. • Just do the first 10 amino acids of each: make a table with the whale sequence on top and the human sequence on the side.
Sperm whale myoglobin across the top, human myoglobin down the side:
  G L S D G E W Q L V ...
V                   *
L   *             *
S     *
E           *
G *       *
E           *
W             *
Q               *
L   *             *
V                   *
...

Dotplot example sperm whale vs human myg • [Figure: dotplot of the whole sperm whale myoglobin sequence (top) against the whole human myoglobin sequence (side)] • This is the result for the whole sequence. • It is easy to see that the diagonal is a line of dots. • So sperm whale and human myoglobin are very similar. • But the picture is noisy. It can be smoothed using a sliding window which considers neighbouring residues as well.

Dotplot example sperm whale vs human myg • Noise can be smoothed using a sliding window which considers neighbouring residues as well. • This has been done here, and the diagonal can be seen to be highly similar. • Also, instead of using simple identity, use a scoring matrix.
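The dotplot procedure described on these slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `dotplot` and its parameters `window` and `min_matches` are hypothetical, and the sliding window here counts identical residues rather than summing scores from a scoring matrix.

```python
def dotplot(top, side, window=1, min_matches=1):
    """Return the plot as a list of strings, one row per position in `side`.

    A cell holds '*' when at least `min_matches` of the `window` residue
    pairs starting at that cell are identical (window=1 is a plain
    identity dotplot).
    """
    rows = []
    for i in range(len(side) - window + 1):
        row = ""
        for j in range(len(top) - window + 1):
            matches = sum(side[i + k] == top[j + k] for k in range(window))
            row += "*" if matches >= min_matches else " "
        rows.append(row)
    return rows


# First 10 residues of each myoglobin, as on the slide.
whale = "GLSDGEWQLV"
human = "VLSEGEWQLV"

identity = dotplot(whale, human)              # noisy: every identical pair
smoothed = dotplot(whale, human, window=3, min_matches=3)  # runs of 3 only

print("  " + whale)
for residue, row in zip(human, identity):
    print(residue + " " + row)
```

With `window=3` the isolated off-diagonal dots disappear and only the run along the main diagonal survives, which is the smoothing effect the slide describes.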

Dotplots in practice • The best tool is an applet* called dotlet • www.isrec.isb-sib.ch/java/dotlet/Dotlet.html • www.bip.bham.ac.uk/dotlet/Dotlet.html • *An applet is a program that runs in a web browser. This means that you can produce dotplots within a Netscape/IE window. • Dotplots are often useful for identifying things like repeated domains or duplications in big proteins.

Example dotplot - repeated domains in Drosophila melanogaster SLIT protein. • Protein has many repeats • SLIT_DROME (P24014):MAAPSRTTLMPPPFRLQLRLLILPILLLLRHDAVHAEPYSGGFGSSAVSSGGLGSVGIHIPGGGVGVITEARCPRVCSCT GLNVDCSHRGLTSVPRKISADVERLELQGNNLTVIYETDFQRLTKLRMLQLTDNQIHTIERNSFQDLVSLERLDISNNVI TTVGRRVFKGAQSLRSLQLDNNQITCLDEHAFKGLVELEILTLNNNNLTSLPHNIFGGLGRLRALRLSDNPFACDCHLSW LSRFLRSATRLAPYTRCQSPSQLKGQNVADLHDQEFKCSGLTEHAPMECGAENSCPHPCRCADGIVDCREKSLTSVPVTL PDDTTDVRLEQNFITELPPKSFSSFRRLRRIDLSNNNISRIAHDALSGLKQLTTLVLYGNKIKDLPSGVFKGLGSLRLLL LNANEISCIRKDAFRDLHSLSLLSLYDNNIQSLANGTFDAMKSMKTVHLAKNPFICDCNLRWLADYLHKNPIETSGARCE SPKRMHRRRIESLREEKFKCSWGELRMKLSGECRMDSDCPAMCHCEGTTVDCTGRRLKEIPRDIPLHTTELLLNDNELGR ISSDGLFGRLPHLVKLELKRNQLTGIEPNAFEGASHIQELQLGENKIKEISNKMFLGLHQLKTLNLYDNQISCVMPGSFE HLNSLTSLNLASNPFNCNCHLAWFAECVRKKSLNGGAARCGAPSKVRDVQIKDLPHSEFKCSSENSEGCLGDGYCPPSCT CTGTVVACSRNQLKEIPRGIPAETSELYLESNEIEQIHYERIRHLRSLTRLDLSNNQITILSNYTFANLTKLSTLIISYN KLQCLQRHALSGLNNLRVVSLHGNRISMLPEGSFEDLKSLTHIALGSNPLYCDCGLKWFSDWIKLDYVEPGIARCAEPEQ MKDKLILSTPSSSFVCRGRVRNDILAKCNACFEQPCQNQAQCVALPQREYQCLCQPGYHGKHCEFMIDACYGNPCRNNAT CTVLEEGRFSCQCAPGYTGARCETNIDDCLGEIKCQNNATCIDGVESYKCECQPGFSGEFCDTKIQFCSPEFNPCANGAK CMDHFTHYSCDCQAGFHGTNCTDNIDDCQNHMCQNGGTCVDGINDYQCRCPDDYTGKYCEGHNMISMMYPQTSPCQNHEC KHGVCFQPNAQGSDYLCRCHPGYTGKWCEYLTSISFVHNNSFVELEPLRTRPEANVTIVFSSAEQNGILMYDGQDAHLAV ELFNGRIRVSYDVGNHPVSTMYSFEMVADGKYHAVELLAIKKNFTLRVDRGLARSIINEGSNDYLKLTTPMFLGGLPVDP AQQAYKNWQIRNLTSFKGCMKEVWINHKLVDFGNAQRQQKITPGCALLEGEQQEEEDDEQDFMDETPHIKEEPVDPCLEN KCRRGSRCVPNSNARDGYQCKCKHGQRGRYCDQGEGSTEPPTVTAASTCRKEQVREYYTENDCRSRQPLKYAKCVGGCGN QCCAAKIVRRRKVRMVCSNNRKYIKNLDIVRKCGCTKKCY • Perform a dotplot of the SLIT protein against itself www.bio.bham.ac.uk/dotlet/Dotlet.html.

Example dotplot - repeated domains in Drosophila melanogaster SLIT protein • [Figure: dotplot of the Swiss-Prot entry against itself] • For further discussion of dotplots see Attwood and Parry-Smith, pp. 116-8.

Dynamic programming • 2 methods: • Dynamic programming • Consider 2 protein sequences of 100 amino acids in length. • If it takes 100^2 seconds to exhaustively align these sequences, then it will take 100^3 seconds to align 3 sequences, 100^4 to align 4 sequences, etc. • More time than the universe has existed to align 20 sequences exhaustively. • Progressive alignment
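The growth claimed on this slide is easy to check with a little arithmetic. The sketch below assumes an age of the universe of roughly 13.8 billion years (about 4.35e17 seconds); that figure and the names `exhaustive_seconds` and `AGE_OF_UNIVERSE_S` are not from the slides.

```python
# Assumption: ~13.8 billion years expressed in seconds.
AGE_OF_UNIVERSE_S = 4.35e17


def exhaustive_seconds(n_sequences, length=100):
    """Cost model from the slide: length ** n_sequences seconds."""
    return length ** n_sequences


for n in (2, 3, 4, 20):
    seconds = exhaustive_seconds(n)
    ratio = seconds / AGE_OF_UNIVERSE_S
    print(f"{n:2d} sequences: {seconds:.1e} s "
          f"({ratio:.1e} times the age of the universe)")
```

Under this cost model 20 sequences need 100^20 = 10^40 seconds, many orders of magnitude beyond the assumed age of the universe, which is why a heuristic such as progressive alignment is used instead.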

Progressive Alignment • Devised by Feng and Doolittle in 1987. • Essentially a heuristic method and as such is not guaranteed to find the ‘optimal’ alignment. • Requires (n-1)+(n-2)+...+1 = n(n-1)/2 pairwise alignments as a starting point. • Most successful implementation is Clustal (Des Higgins). This software is cited 3,000 times per year in the scientific literature.

Overview of ClustalW Procedure (CLUSTAL W)
• Quick pairwise alignment: calculate distance matrix
Hbb_Human  1   -
Hbb_Horse  2  .17   -
Hba_Human  3  .59  .60   -
Hba_Horse  4  .59  .59  .13   -
Myg_Whale  5  .77  .77  .75  .75   -
• Neighbor-joining tree (guide tree) [Figure: tree with leaves Hbb_Human, Hbb_Horse, Hba_Human, Hba_Horse and Myg_Whale]
• Progressive alignment following guide tree (alpha-helices are marked in the figure)
1 PEEKSAVTALWGKVN--VDEVGG
2 GEEKAAVLALWDKVN--EEEVGG
3 PADKTNVKAAWGKVGAHAGEYGA
4 AADKTNVKAAWSKVGGHAGEYGA
5 EHEWQLVLHVWAKVEADVAGHGQ

ClustalW- Pairwise Alignments • First perform all possible pairwise alignments between each pair of sequences. There are (n-1)+(n-2)+...+1 = n(n-1)/2 possibilities. • Calculate the ‘distance’ between each pair of sequences based on these isolated pairwise alignments. • Generate a distance matrix.
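As a rough illustration of the distance-matrix step, the sketch below enumerates all n(n-1)/2 pairs and computes a simple distance for each. It makes several assumptions: the inputs are the short, already-aligned fragments from the overview slide rather than fresh pairwise alignments, the distance is the plain fraction of differing residues (the hypothetical helper `p_distance`), and the resulting values are therefore not expected to match the slide's matrix, which was presumably computed on full-length sequences.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of differing residues, ignoring columns that contain a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x != y for x, y in pairs) / len(pairs)


# Aligned fragments copied from the overview slide (illustration only).
fragments = {
    "Hbb_Human": "PEEKSAVTALWGKVN--VDEVGG",
    "Hbb_Horse": "GEEKAAVLALWDKVN--EEEVGG",
    "Hba_Human": "PADKTNVKAAWGKVGAHAGEYGA",
    "Hba_Horse": "AADKTNVKAAWSKVGGHAGEYGA",
    "Myg_Whale": "EHEWQLVLHVWAKVEADVAGHGQ",
}

# One entry per unordered pair: n(n-1)/2 of them for n sequences.
distances = {
    (a, b): p_distance(fragments[a], fragments[b])
    for a, b in combinations(fragments, 2)
}

for (a, b), d in distances.items():
    print(f"{a:10s} {b:10s} {d:.2f}")
```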

Path Graph for aligning two sequences.

Possible alignment • Scoring scheme: • Match: +1 • Mismatch: 0 • Indel: -1 • Scores along this path: 1, 1, 0, 1, 0, -1 • Score for this path = 2

Alignment using this path
GATTC-
GAATTC
Column scores: 1, 1, 0, 1, 0, -1

Optimal Alignment 1 • Alignment using this path:
GA-TTC
GAATTC
Column scores: 1, 1, -1, 1, 1, 1 • Alignment score: 4

Optimal Alignment 2 • Alignment using this path:
G-ATTC
GAATTC
Column scores: 1, -1, 1, 1, 1, 1 • Alignment score: 4
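The path-graph search on these slides corresponds to global alignment by dynamic programming (the Needleman-Wunsch approach). The sketch below uses the slide's scoring scheme (match +1, mismatch 0, indel -1). The function name and the tie-breaking order in the traceback are assumptions, so it reports only one of the two equally good alignments.

```python
def needleman_wunsch(a, b, match=1, mismatch=0, indel=-1):
    """Return (best score, aligned a, aligned b) for a global alignment."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * indel
    for j in range(1, m + 1):
        score[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diagonal, score[i - 1][j] + indel, score[i][j - 1] + indel)

    # Trace one optimal path back from the bottom-right corner.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + indel:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


best, top, bottom = needleman_wunsch("GATTC", "GAATTC")
print(best)
print(top)
print(bottom)
```

Because both optimal alignments score the same, which one is printed depends only on the order in which the traceback tests the three moves.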

Alignment of 3 sequences

ClustalW- Guide Tree • Generate a Neighbor-Joining ‘guide tree’ from these pairwise distances. • This guide tree gives the order in which the progressive alignment will be carried out.

Neighbor joining method • The neighbor joining method is a greedy heuristic which, at each step, joins the two closest sub-trees that are not already joined. • It is based on the minimum evolution principle. • One of the important concepts in the NJ method is neighbors, which are defined as two taxa that are connected by a single node in an unrooted tree. • [Figure: taxa A and B connected through Node 1]

What is required for the Neighbour joining method? • Distance matrix

First Step • PAM distance 3.3 (Human - Monkey) is the minimum, so we join Human and Monkey into MonHum and calculate the new distances. • [Tree figure: Human and Monkey joined at Mon-Hum; Mosquito, Spinach and Rice not yet joined]

Calculation of New Distances • After we have joined two species in a subtree we have to compute the distances from every other node to the new subtree. We do this with a simple average of distances: • Dist[Spinach, MonHum] = (Dist[Spinach, Monkey] + Dist[Spinach, Human])/2 = (90.8 + 86.3)/2 = 88.55 • [Tree figure: Spinach and the Mon-Hum subtree joining Human and Monkey]
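One joining cycle as described on these slides (pick the minimum distance, merge the pair, average the distances to the new subtree) might be sketched as follows. The function `join_closest` and the dictionary layout are assumptions, and only the three distances quoted on the slides are used. Note that this follows the slides' simplified averaging; the full neighbor-joining method of Saitou and Nei selects the pair with a corrected criterion and uses a different update formula.

```python
def join_closest(names, dist):
    """Join the closest pair and average its distances to every other node.

    `dist` maps frozenset({a, b}) to the distance between a and b.
    Returns the new list of node names and the new distance table.
    """
    a, b = min(
        ((x, y) for i, x in enumerate(names) for y in names[i + 1:]),
        key=lambda pair: dist[frozenset(pair)],
    )
    merged = f"({a}-{b})"
    others = [n for n in names if n not in (a, b)]
    # Distances between untouched nodes are carried over unchanged.
    new_dist = {
        frozenset((x, y)): dist[frozenset((x, y))]
        for i, x in enumerate(others) for y in others[i + 1:]
    }
    # Simple average, as on the slide.
    for other in others:
        new_dist[frozenset((merged, other))] = (
            dist[frozenset((a, other))] + dist[frozenset((b, other))]
        ) / 2
    return [merged] + others, new_dist


names = ["Human", "Monkey", "Spinach"]
dist = {
    frozenset(("Human", "Monkey")): 3.3,
    frozenset(("Spinach", "Monkey")): 90.8,
    frozenset(("Spinach", "Human")): 86.3,
}

names, dist = join_closest(names, dist)
print(names)
print(round(dist[frozenset(("(Human-Monkey)", "Spinach"))], 2))
```

Calling `join_closest` repeatedly until two nodes remain would reproduce the later cycles, given the full distance matrix from the slide figure.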

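Next Cycle • Mosquito is joined to Mon-Hum, giving Mos-(Mon-Hum). • [Tree figure: Mos-(Mon-Hum) subtree; Rice and Spinach not yet joined]

Penultimate Cycle • Spinach and Rice are joined, giving Spin-Rice. • [Tree figure: Mos-(Mon-Hum) and Spin-Rice subtrees]

Last Joining • Spin-Rice is joined to Mos-(Mon-Hum), giving (Spin-Rice)-(Mos-(Mon-Hum)). • [Tree figure: complete tree over Rice, Spinach, Mosquito, Human and Monkey]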

Unrooted Neighbor-Joining Tree • [Figure: unrooted tree with leaves Human, Monkey, Mosquito, Spinach and Rice]

Multiple Alignment- First pair • Align the two most closely-related sequences first. • This alignment is then ‘fixed’ and will never change. If a gap is to be introduced subsequently, then it will be introduced in the same place in both sequences, but their relative alignment remains unchanged.

ClustalW- Decision time • Consult the guide tree to see what alignment is performed next. • Option 1: align a third sequence to the first two, or • Option 2: align two entirely different sequences to each other.

ClustalW- Alternative 1 • If a third sequence is to be aligned to the first two, then whenever a gap has to be introduced to improve the alignment, the existing pair and the new sequence are each treated as a single sequence.

ClustalW- Alternative 2 • If, on the other hand, two separate sequences have to be aligned together, then the first pairwise alignment is placed to one side and the pairwise alignment of the other two is carried out.

ClustalW- Progression • The alignment is progressively built up in this way, with each step being treated as a pairwise alignment, sometimes with each member of a ‘pair’ having more than one sequence.

Progressive alignment - step 1
Unaligned sequences (leaves 1 to 5 of the guide tree):
1. gctcgatacgatacgatgactagcta
2. gctcgatacaagacgatgacagcta
3. gctcgatacacgatgactagcta
4. gctcgatacacgatgacgagcga
5. ctcgaacgatacgatgactagct
Sequences 1 and 2 are aligned:
1. gctcgatacgatacgatgactagcta
2. gctcgatacaagacgatgac-agcta

Progressive alignment - step 2
Sequences 3 and 4 are aligned:
3. gctcgatacacgatgactagcta
4. gctcgatacacgatgacgagcga

Progressive alignment - step 3
The alignment of 1 and 2 is aligned with the alignment of 3 and 4:
1. gctcgatacgatacgatgactagcta
2. gctcgatacaagacgatgac-agcta
+
3. gctcgatacacgatgactagcta
4. gctcgatacacgatgacgagcga
Result:
1. gctcgatacgatacgatgactagcta
2. gctcgatacaagacgatgac-agcta
3. gctcgatacacga---tgactagcta
4. gctcgatacacga---tgacgagcga

Progressive alignment - final step
Sequence 5 is aligned to the alignment of sequences 1 to 4:
1. gctcgatacgatacgatgactagcta
2. gctcgatacaagacgatgac-agcta
3. gctcgatacacga---tgactagcta
4. gctcgatacacga---tgacgagcga
+
5. ctcgaacgatacgatgactagct
Result:
1. gctcgatacgatacgatgactagcta
2. gctcgatacaagacgatgac-agcta
3. gctcgatacacga---tgactagcta
4. gctcgatacacga---tgacgagcga
5. -ctcga-acgatacgatgactagct-
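The four steps above can be imitated with a small group-to-group aligner. This is a minimal sketch under stated assumptions, not ClustalW itself: columns are scored by the average pairwise score using match +1, mismatch 0 and gap -1, there are no sequence weights or position-specific gap penalties, and the names `column_score` and `align_groups` are hypothetical. Gap placement may therefore differ from the slides where several alignments score equally.

```python
from itertools import product

MATCH, MISMATCH, GAP = 1, 0, -1  # assumption: scores reused from the pairwise example


def column_score(col_a, col_b):
    """Average score over every residue pair drawn from two columns."""
    total = 0
    for x, y in product(col_a, col_b):
        if x == "-" and y == "-":
            continue                      # gap against gap is neutral
        if x == "-" or y == "-":
            total += GAP
        else:
            total += MATCH if x == y else MISMATCH
    return total / (len(col_a) * len(col_b))


def align_groups(group_a, group_b):
    """Align two groups of already-aligned sequences as if each were one sequence.

    Existing columns are never altered; a new gap is inserted into every
    sequence of a group at the same position.
    """
    cols_a, cols_b = list(zip(*group_a)), list(zip(*group_b))
    n, m = len(cols_a), len(cols_b)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * GAP
    for j in range(1, m + 1):
        score[0][j] = j * GAP
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + column_score(cols_a[i - 1], cols_b[j - 1]),
                score[i - 1][j] + GAP,
                score[i][j - 1] + GAP,
            )
    gap_a, gap_b = ("-",) * len(group_a), ("-",) * len(group_b)
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and score[i][j]
                == score[i - 1][j - 1] + column_score(cols_a[i - 1], cols_b[j - 1])):
            out_a.append(cols_a[i - 1]); out_b.append(cols_b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + GAP:
            out_a.append(cols_a[i - 1]); out_b.append(gap_b); i -= 1
        else:
            out_a.append(gap_a); out_b.append(cols_b[j - 1]); j -= 1
    out_a.reverse(); out_b.reverse()
    rows_a = ["".join(col[k] for col in out_a) for k in range(len(group_a))]
    rows_b = ["".join(col[k] for col in out_b) for k in range(len(group_b))]
    return rows_a + rows_b


sequences = [
    "gctcgatacgatacgatgactagcta",
    "gctcgatacaagacgatgacagcta",
    "gctcgatacacgatgactagcta",
    "gctcgatacacgatgacgagcga",
    "ctcgaacgatacgatgactagct",
]

# Same joining order as the slides: (1,2), (3,4), then the two pairs, then 5.
pair_12 = align_groups([sequences[0]], [sequences[1]])
pair_34 = align_groups([sequences[2]], [sequences[3]])
four = align_groups(pair_12, pair_34)
alignment = align_groups(four, [sequences[4]])

for number, row in enumerate(alignment, start=1):
    print(f"{number}. {row}")
```

Each call treats its two inputs as a 'pair', even when one side already holds several sequences, which mirrors the "once fixed, never changed" behaviour described for the first pair.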

ClustalW-Good points/Bad points • Advantages: • Speed. • Disadvantages: • No objective function. • No way of quantifying whether or not the alignment is good • No way of knowing if the alignment is ‘correct’.

ClustalW-Local Minimum • Potential problems: • Local minimum problem. If an error is introduced early in the alignment process, it is impossible to correct this later in the procedure. • Arbitrary alignment.

Increasing the sophistication of the alignment process. • Should we treat all the sequences in the same way? - even though some sequences are closely-related and some sequences are distant relatives. • Should we treat all positions in the sequences as though they were the same? - even though they might have different functions and different locations in the 3-dimensional structure.
