
Multiple sequence alignment Why?


yuval

Presentation Transcript


  1. Multiple sequence alignment: why? • It is the most important means of assessing the relatedness of a set of sequences • Gain information about the structure/function of a query sequence (conservation patterns) • Construct a phylogenetic tree • Put together a set of sequenced fragments (fragment assembly) • Recognise alternative splice sites • Many bioinformatics methods depend on it (secondary/tertiary structure)

  2. Multiple sequence alignment (MSA) of 12 flavodoxin sequences + cheY

  3. Pairwise alignment • Now we know how to do a pairwise alignment • How do we get a multiple alignment (three or more sequences)? • Multiple alignment suffers a much greater combinatorial explosion than pairwise alignment
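
The pairwise case referred to on this slide can be sketched as a minimal global (Needleman-Wunsch) dynamic programming routine. The function name `nw_score` and the scoring values (match 1, mismatch -1, linear gap penalty -2) are illustrative assumptions, not values taken from the slides.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the optimal global alignment score of strings a and b."""
    # prev holds the previous row of the DP matrix; row 0 is all gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: a[:i] aligned against gaps only
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]
```

Each cell looks at three predecessors, so two sequences of length L cost on the order of L² steps; it is this count that explodes when more sequences are aligned simultaneously.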

  4. Multi-dimensional dynamic programming (Murata et al. 1985)

  5. Simultaneous multiple alignment: multi-dimensional dynamic programming • MSA (Lipman et al., 1989, PNAS 86, 4412): extremely slow and memory intensive; up to 8-9 sequences of ~250 residues • DCA (Stoye et al., 1997, CABIOS 13, 625): still very slow
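
Why simultaneous alignment is so slow can be seen from a rough cost estimate. This is the standard textbook count for unpruned N-dimensional dynamic programming, assuming N sequences of equal length L; it is not a figure from the slides.

```latex
\text{cells} = (L+1)^N, \qquad
\text{predecessors per cell} = 2^N - 1, \qquad
\text{time} = O\!\left(2^N L^N\right)
```

For N = 3 and L = 250 this is already 251³ ≈ 1.6 × 10⁷ cells, and for N = 8 about 1.6 × 10¹⁹, which is why exact methods must prune the search space and still handle only a handful of sequences.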

  6. Alternative multiple alignment methods • Biopat (Hogeweg & Hesper 1984, first method ever) • MULTAL (Taylor 1987) • DIALIGN (Morgenstern 1996) • PRRP (Gotoh 1996) • Clustal (Thompson, Higgins & Gibson 1994) • Praline (Heringa 1999) • T-Coffee (Notredame, Higgins & Heringa 2000) • HMMER (Eddy 1998) [Hidden Markov Model] • SAGA (Notredame & Higgins 1996) [Genetic algorithm]

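  7. Progressive multiple alignment: general principles • Pairwise alignment of all sequence pairs gives scores (Score 1-2, Score 1-3, …, Score 4-5 for sequences 1 to 5) • The scores fill a 5×5 similarity matrix • Scores are converted to distances • Guide tree • Multiple alignment • Iteration possibilities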
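
The step from pairwise scores to a guide tree can be sketched as follows. This is a simplified illustration that assumes similarities in [0, 1], the conversion distance = 1 - similarity, and average-linkage (UPGMA-style) clustering; the function name `guide_tree` and the data layout are hypothetical, and real programs may use other conversions and tree methods.

```python
from itertools import combinations

def guide_tree(names, sim):
    """Build a nested-tuple guide tree by average-linkage clustering.

    sim maps frozenset({a, b}) to a similarity in [0, 1].
    """
    dist = {pair: 1.0 - s for pair, s in sim.items()}  # scores to distances
    clusters = {name: (name,) for name in names}       # label -> member leaves
    trees = {name: name for name in names}             # label -> subtree

    def avg(c1, c2):
        # Average distance over all leaf pairs between two clusters.
        pairs = [(x, y) for x in clusters[c1] for y in clusters[c2]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Join the two closest clusters and record the merge in the tree.
        a, b = min(combinations(sorted(clusters), 2), key=lambda p: avg(*p))
        merged = a + "+" + b
        clusters[merged] = clusters.pop(a) + clusters.pop(b)
        trees[merged] = (trees.pop(a), trees.pop(b))
    return next(iter(trees.values()))
```

The nested tuples give the order in which sequences (and later, blocks of sequences) would be aligned when following the tree from the leaves to the root.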

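  8. General progressive multiple alignment technique (follow generated tree) [Diagram: guide tree over sequences 1, 3, 2, 5 and 4 with branch distance d; the alignment grows from sequences 1 and 3, to 1, 3, 2 and 5, to all five sequences at the root]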

  9. Progressive multiple alignment • Problem: accuracy is very important • Errors are propagated into the progressive steps • “Once a gap, always a gap” (Feng & Doolittle, 1987)

  10. Pair-wise alignment quality versus sequence identity (Vogt et al., JMB 249, 816-831, 1995)

  11. Multiple alignment profiles (Gribskov et al. 1987) • Example profile column i: A 0.3, C 0.1, D 0, …, W 0.3, Y 0.3 • Gap penalties: 1.0, 0.5 • Position-dependent gap penalties
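
A profile column of the kind shown on this slide can be sketched as a table of residue frequencies per alignment position. This is a deliberately simplified frequency profile under stated assumptions: the function name `build_profile` is hypothetical, gaps are simply skipped, and no substitution-matrix weighting or position-dependent gap penalties are computed.

```python
from collections import Counter

def build_profile(alignment, gap="-"):
    """Return one dict per column mapping residue -> relative frequency.

    Frequencies are relative to the number of non-gap residues in the column.
    """
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res != gap)
        total = sum(counts.values())
        # An all-gap column yields an empty dict.
        profile.append({res: n / total for res, n in counts.items()} if total else {})
    return profile
```

For example, `build_profile(["ACD", "A-D", "ACE", "GCD"])` gives a first column with A at 0.75 and G at 0.25.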

  12. Profile-sequence alignment [Diagram: a sequence aligned against a profile with rows ACD……VWY]

  13. Profile-profile alignment [Diagram: a profile with rows A C D . . Y aligned against a profile with rows ACD……VWY]

  14. Clustal, ClustalW, ClustalX • CLUSTAL W/X (Thompson et al., 1994) uses the Neighbour Joining (NJ) algorithm (Saitou and Nei, 1987), widely used in phylogenetic analysis, to construct a guide tree. • Sequence blocks are represented by profiles, in which the individual sequences are additionally weighted according to the branch lengths in the NJ tree. • Further carefully crafted heuristics include: (i) local gap penalties, (ii) automatic selection of the amino acid substitution matrix, (iii) automatic gap penalty adjustment, (iv) a mechanism to delay alignment of sequences that appear to be distant at the time they are considered. • CLUSTAL (W/X) does not allow iteration, unlike the methods of Hogeweg and Hesper (1984), Corpet (1988), Gotoh (1996) and Heringa (1999, 2002).

  15. Strategies for multiple sequence alignment • Profile pre-processing • Secondary structure-induced alignment • Globalised local alignment • Matrix extension Objective: try to avoid (early) errors

  16. Pre-profile generation • Pairwise scores (Score 1-2, Score 1-3, …, Score 4-5) • Cut-off • Pre-alignments • Pre-profiles [Diagram: each sequence gets its own pre-alignment, listed with the key sequence first (1 with 2 3 4 5; 2 with 1 3 4 5; 5 with 1 2 3 4), which is condensed into a pre-profile with rows A C D . . Y]

  17. Pre-profile alignment [Diagram: pre-profiles 1 to 5, each with rows A C D . . Y, are aligned to give the final alignment of sequences 1 2 3 4 5]

  18. Pre-profile alignment [Diagram: the pre-alignments for sequences 1 to 5, each listing its key sequence first, are combined into the final alignment]

  19. Strategies for multiple sequence alignment • Profile pre-processing • Secondary structure-induced alignment • Globalised local alignment • Matrix extension Objective: try to avoid (early) errors

  20. Protein structure hierarchical levels • PRIMARY STRUCTURE (amino acid sequence): VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH • SECONDARY STRUCTURE (helices, strands) • TERTIARY STRUCTURE (fold) • QUATERNARY STRUCTURE (oligomers)

  21. One of the dogmas of molecular biology: “Structure is more conserved than sequence”

  22. Secondary structure-induced alignment

  23. Using secondary structure for alignment • Dynamic programming search matrix for sequence MDAGSTVILCFV (secondary structure HHHCCCEEEEEE) against sequence MDAASTILCGS (secondary structure HHHHCCEEECC) • Amino acid exchange weights matrices: H-H, C-C, E-E, otherwise Default
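
The matrix choice on this slide can be sketched as a lookup that picks a structure-specific exchange matrix when both residues share a secondary structure state and falls back to a default matrix otherwise. All matrix values below are toy numbers invented for illustration, and the names `SS_MATRICES` and `exchange_score` are hypothetical.

```python
# Toy exchange weights (hypothetical values, for illustration only).
HELIX = {("A", "A"): 5, ("A", "G"): -2, ("G", "G"): 4}
STRAND = {("A", "A"): 3, ("A", "G"): -1, ("G", "G"): 6}
COIL = {("A", "A"): 2, ("A", "G"): 1, ("G", "G"): 7}
DEFAULT = {("A", "A"): 4, ("A", "G"): 0, ("G", "G"): 6}
SS_MATRICES = {"H": HELIX, "E": STRAND, "C": COIL}

def exchange_score(res_a, ss_a, res_b, ss_b):
    """Score a residue pair given the secondary structure state of each residue."""
    # Matching states (H-H, E-E, C-C) select the specific matrix; anything else uses DEFAULT.
    matrix = SS_MATRICES.get(ss_a, DEFAULT) if ss_a == ss_b else DEFAULT
    # The toy matrices store each unordered pair once, so try both orders.
    key = (res_a, res_b) if (res_a, res_b) in matrix else (res_b, res_a)
    return matrix[key]
```

Plugged into the dynamic programming search matrix, this replaces the single fixed exchange matrix: `exchange_score("A", "H", "G", "H")` uses the helix weights, while `exchange_score("A", "H", "G", "E")` uses the default.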

  24. Flavodoxin-cheY: using predicted secondary structure [Alignment figure, shown in two blocks: sequences 1fx1, FLAV_DESVH, FLAV_DESGI, FLAV_DESSA, FLAV_DESDE, 2fcr, FLAV_ANASP, FLAV_ECOLI, FLAV_AZOVI, FLAV_ENTAG, 4fxn, FLAV_MEGEL, FLAV_CLOAB and 3chy, each followed by its secondary structure string (h and e, and for the PDB entries 1fx1, 2fcr, 4fxn and 3chy also t, s, b and g)]

  25. Strategies for multiple sequence alignment • Profile pre-processing • Secondary structure-induced alignment • Globalised local alignment • Matrix extension Objective: try to avoid (early) errors

  26. Globalised local alignment • 1. Local (SW) alignment (with M + Po,e) • 2. Global (NW) alignment (no M or Po,e) • 1 + 2 = double dynamic programming

  27. M = BLOSUM62, Po= 0, Pe= 0

  28. M = BLOSUM62, Po= 12, Pe= 1

  29. M = BLOSUM62, Po= 60, Pe= 5

  30. Strategies for multiple sequence alignment • Profile pre-processing • Secondary structure-induced alignment • Globalised local alignment • Matrix extension Objective: try to avoid (early) errors

  31. Matrix extension • T-Coffee: Tree-based Consistency Objective Function For alignmEnt Evaluation • Cedric Notredame, Des Higgins, Jaap Heringa • J. Mol. Biol., 302, 205-217; 2000

  32. Matrix extension – T-COFFEE [Diagram: all pairwise alignments of four sequences: 1-2, 1-3, 1-4, 2-3, 2-4, 3-4]

  33. Integrating alignment methods and alignment information with T-Coffee • Integrating different pair-wise alignment techniques (NW, SW, ..) • Combining different multiple alignment methods (consensus multiple alignment) • Combining sequence alignment methods with structural alignment techniques • Plug in user knowledge

  34. Using different sources of alignment information • Structure alignments • Clustal (shown twice) • Dialign • Lalign • Manual • All feed into T-Coffee

  35. Search matrix extension

  36. T-Coffee • Combine different alignment techniques by adding scores: $W(A(x), B(y)) = \sum S(A(x), B(y))$ • $A(x)$ is residue $x$ in sequence $A$ • The summation is over the scores $S$ of the global and local alignments containing the residue pair $(A(x), B(y))$ • $S$ is the sequence identity percentage of the associated alignment • Combine the direct alignment seqA-seqB with each seqA-seqI-seqB: $W'(A(x), B(y)) = W(A(x), B(y)) + \sum_{I \neq A,B} \min(W(A(x), I(z)), W(I(z), B(y)))$ • The summation is over all third sequences $I$ other than $A$ or $B$
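
The extension step on this slide can be sketched as follows, assuming the primary weights W have already been collected. The dictionary layout, with residues written as (sequence name, position) tuples, and the function name `extend_library` are assumptions made for illustration; this is not the T-Coffee implementation.

```python
from collections import defaultdict

def extend_library(primary):
    """Return extended weights W' from primary weights W.

    primary maps ((seq_a, x), (seq_b, y)) -> W; pairs are treated as symmetric
    and are assumed to link residues from two different sequences.
    """
    w = {}
    neighbours = defaultdict(set)
    for (a, b), weight in primary.items():
        w[(a, b)] = w[(b, a)] = weight
        neighbours[a].add(b)
        neighbours[b].add(a)

    extended = defaultdict(float)
    for pair, weight in w.items():
        extended[pair] += weight          # direct alignment term W
    for i, linked in neighbours.items():  # i plays the role of I(z)
        for a in linked:
            for b in linked:
                if a[0] != b[0]:          # a and b come from different sequences
                    extended[(a, b)] += min(w[(a, i)], w[(i, b)])
    return dict(extended)
```

With W(A1, B1) = 88, W(A1, C1) = 77 and W(C1, B1) = 100, the extended weight of (A1, B1) becomes 88 + min(77, 100) = 165. A residue pair with no direct weight can still gain one through a third sequence, which is how consistency information enters the search matrix.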

  37. T-Coffee [Diagram: the direct alignment of two sequences combined with their alignments via other sequences]

  38. Search matrix extension

  39. Evaluating multiple alignments • Conflicting standards of truth • evolution • structure • function • With orphan sequences no additional information • Benchmarks depending on reference alignments • Quality issue of available reference alignment databases • Different ways to quantify agreement with reference alignment (sum-of-pairs, column score) • “Charlie Chaplin” problem

  40. Evaluating multiple alignments • As a standard of truth, often a reference alignment based on structural superpositioning is taken

  41. Evaluation measures • Column score • Sum-of-Pairs score [Diagram: query alignment compared with reference alignment]
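
The two measures can be sketched as follows, using common definitions that are assumed here rather than taken from the slides: the sum-of-pairs score is the fraction of residue pairs aligned in the reference that are also aligned in the query, and the column score is the fraction of reference columns reproduced exactly in the query. Function names are hypothetical; both alignments must contain the same sequences in the same order.

```python
from itertools import combinations

def _columns(alignment, gap="-"):
    """Translate each column into a tuple of residue indices (None for gaps)."""
    counters = [0] * len(alignment)
    cols = []
    for column in zip(*alignment):
        entry = []
        for k, ch in enumerate(column):
            if ch == gap:
                entry.append(None)
            else:
                entry.append(counters[k])
                counters[k] += 1
        cols.append(tuple(entry))
    return cols

def _pairs(cols):
    """Collect every aligned residue pair as ((seq_i, pos), (seq_j, pos))."""
    return {((i, a), (j, b))
            for col in cols
            for (i, a), (j, b) in combinations(enumerate(col), 2)
            if a is not None and b is not None}

def sp_score(query, reference):
    ref = _pairs(_columns(reference))
    return len(ref & _pairs(_columns(query))) / len(ref)

def column_score(query, reference):
    ref = _columns(reference)
    qry = set(_columns(query))
    return sum(col in qry for col in ref) / len(ref)
```

The column score is the stricter of the two: a single misplaced residue invalidates a whole column, whereas the sum-of-pairs score still credits the pairs in that column that remain correct.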

  42. Evaluating multiple alignments SP BAliBASE alignment nseq * len

  43. Summary • Weighting schemes simulating simultaneous multiple alignment • Profile pre-processing (global/local) • Matrix extension (well balanced scheme) • Smoothing alignment signals • globalised local alignment • Using additional information • secondary structure driven alignment • Schemes strike balance between speed and sensitivity

  44. References • Heringa, J. (1999) Two strategies for sequence comparison: profile-preprocessed and secondary structure-induced multiple alignment. Comput. Chem., 23, 341-364. • Notredame, C., Higgins, D.G., Heringa, J. (2000) T-Coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol., 302, 205-217. • Heringa, J. (2002) Local weighting schemes for protein multiple sequence alignment. Comput. Chem., 26(5), 459-477.

  45. Where to find this: http://www.ibivu.cs.vu.nl/teaching
