
Multiple Alignment

Multiple Alignment. Stuart M. Brown NYU School of Medicine. Pairwise Alignment. The alignment of two sequences (DNA or protein) is a relatively straightforward computational problem. The best solution seems to be an approach called Dynamic Programming. Dynamic Programming.


Presentation Transcript


  1. Multiple Alignment Stuart M. Brown NYU School of Medicine

  2. Pairwise Alignment • The alignment of two sequences (DNA or protein) is a relatively straightforward computational problem. • The best solution seems to be an approach called Dynamic Programming.

  3. Dynamic Programming • Dynamic Programming is a very general programming technique. • It is applicable when a large search space can be structured into a succession of stages, such that: • the initial stage contains trivial solutions to sub-problems • each partial solution in a later stage can be calculated by a recurrence on a fixed number of partial solutions from an earlier stage • the final stage contains the overall solution
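The three stages can be sketched on a small problem. The example below is illustrative only and is not from the slides: it uses the longest common subsequence of two strings, and the function name `lcs_length` is an assumption chosen for this sketch.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    # Initial stage: row 0 and column 0 are the trivial sub-problems
    # (an empty prefix shares nothing with anything, so the value is 0).
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            # Later stages: each cell depends on a fixed number (three)
            # of cells that were filled in earlier.
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    # Final stage: the last cell holds the overall solution.
    return table[len(a)][len(b)]


print(lcs_length("ACGT", "AGT"))
```

Sequence alignment uses the same table layout; only the recurrence changes, to account for match, mismatch, and gap scores.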

  4. Global vs. Local Alignments • Global alignment algorithms start at the beginning of two sequences and add gaps to each as needed so that the sequences align over their entire length. • Local alignment algorithms find the region (or regions) of highest similarity between two sequences and build the alignment outward from there.

  5. GAP • The GCG program GAP implements the Needleman-Wunsch global alignment algorithm. • Global algorithms are often not effective for highly diverged sequences and do not reflect the biological reality that two sequences may share only limited regions of conserved sequence. • Sometimes two sequences may be derived from ancient recombination events where only a single functional domain is shared. • GAP is useful when you want to force two sequences to align over their entire length.
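A minimal sketch of Needleman-Wunsch global alignment follows. It is not the GAP implementation: the scores (match +1, mismatch -1, linear gap penalty -1) and the function name are assumptions chosen for illustration, whereas GAP uses a comparison matrix with separate gap creation and extension penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    n, m = len(a), len(b)
    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, m + 1):
        score[0][j] = j * gap          # b[:j] aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
    # Trace back from the bottom-right corner to recover one alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("ACGT", "AGT"))
```

Because the traceback always starts in the last cell and ends in the first, every residue of both sequences appears in the result, which is what forces the alignment over the entire length.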

  6. BESTFIT • The GCG program BESTFIT implements the Smith-Waterman local alignment algorithm. • FASTA and BLAST are local alignment algorithms • NCBI has a “BLAST 2 Sequences” feature on its website: http://www.ncbi.nlm.nih.gov/gorf/bl2.html
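For comparison, here is a minimal sketch of Smith-Waterman local alignment. It is not the BESTFIT implementation: the scores (match +2, mismatch -1, linear gap penalty -1) and the function name are assumptions for illustration. The differences from the global method are that cell scores are never allowed to fall below zero and that the traceback starts at the highest-scoring cell and stops at the first zero.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for one best local alignment."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # The extra 0 lets an alignment start fresh anywhere.
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("TTTACGTTT", "GGACGGG"))
```

In this toy input only the shared `ACG` region is reported; the unrelated flanking residues are left out rather than forced into the alignment.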

  7. Pairwise Alignment on the Web • The ALIGN global alignment program is available at several servers: http://molbiol.soton.ac.uk/compute/align.html http://www2.igh.cnrs.fr/bin/align-guess.cgi • The LALIGN local alignment program is available at several servers: http://www2.igh.cnrs.fr/bin/lalign-guess.cgi http://www.ch.embnet.org/software/LALIGN_form.html • LFASTA uses FASTA for local alignment of 2 sequences: http://pbil.univ-lyon1.fr/lfasta.html • BLAST 2 Sequences (NCBI): http://www.ncbi.nlm.nih.gov/blast/bl2seq/bl2.html

  8. Multiple Alignments • In theory, making an optimal alignment between two sequences is computationally straightforward (Smith-Waterman algorithm), but aligning a large number of sequences using the same method is almost impossible. • The amount of computation grows exponentially with the number of sequences involved: it is proportional to the product of the sequence lengths.
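A quick calculation shows how fast the product of the sequence lengths grows. This sketch only counts the cells in a full dynamic-programming table, one dimension per sequence; the sequence length of 300 residues is an assumption for illustration.

```python
from math import prod


def dp_cells(lengths):
    """Cells in a full dynamic-programming table, one axis per sequence.

    Each axis has length + 1 entries because of the empty-prefix row.
    """
    return prod(n + 1 for n in lengths)


# Same-length sequences of 300 residues, for 2, 3, 5 and 10 sequences.
for k in (2, 3, 5, 10):
    print(k, "sequences:", dp_cells([300] * k), "cells")
```

Two sequences need about 90 thousand cells, three need about 27 million, and each further sequence multiplies the total by roughly 300 again.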

  9. Optimal Alignment • For a given group of sequences, there is no single "correct" alignment, only an alignment that is "optimal" according to some set of calculations. • Determining what alignment is best for a given set of sequences is really up to the judgement of the investigator.

  10. Progressive Pairwise Methods • Most of the available multiple alignment programs use some sort of incremental or progressive method that makes pairwise alignments, then adds new sequences one at a time to these aligned groups. • This is an approximate method!

  11. PILEUP • PILEUP is the multiple alignment program in the GCG package • CLUSTAL is another popular program (also available on the RCR server) that uses a similar algorithm.

  12. The PILEUP Algorithm • First, PILEUP calculates approximate pairwise similarity scores between all sequences to be aligned, and they are clustered into a dendrogram (tree structure). • Then the most similar pairs of sequences are aligned. • Averages (similar to consensus sequences) are calculated for the aligned pairs. • New sequences and clusters of sequences are added one by one, according to the branching order in the dendrogram.
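The first step, scoring all pairs and clustering them into a tree that fixes the order of alignment, can be sketched as follows. This is not PILEUP's code: the similarity measure (fraction of identical positions), the average-linkage clustering, the function names, and the toy sequences are all assumptions chosen to keep the example short.

```python
from itertools import combinations


def identity(a, b):
    """Fraction of identical positions; a crude stand-in for a pairwise score."""
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n if n else 0.0


def guide_tree(seqs):
    """Cluster sequences by average similarity; return the join order as nested tuples."""
    sim = {frozenset((p, q)): identity(seqs[p], seqs[q])
           for p, q in combinations(seqs, 2)}
    # Each cluster is (tree, member_names); start with one cluster per sequence.
    clusters = [(name, [name]) for name in seqs]

    def average_similarity(pair):
        (_, members1), (_, members2) = pair
        total = sum(sim[frozenset((x, y))] for x in members1 for y in members2)
        return total / (len(members1) * len(members2))

    while len(clusters) > 1:
        # Join the two most similar clusters, as a progressive aligner would
        # align the most similar pair first.
        c1, c2 = max(combinations(clusters, 2), key=average_similarity)
        clusters = [c for c in clusters if c is not c1 and c is not c2]
        clusters.append(((c1[0], c2[0]), c1[1] + c2[1]))
    return clusters[0][0]


toy = {"A": "AAAAAAAA", "B": "AAAAAAAT", "C": "TTTTTTTT", "D": "TTTTTTAA"}
print(guide_tree(toy))
```

The nested tuples read from the inside out: the innermost pairs are aligned first, and the outer levels give the order in which clusters are merged.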

  13. PILEUP Considerations • Since the alignment is calculated on a progressive basis, the order of the initial sequences can affect the final alignment. • PILEUP parameters: 2 gap penalties (gap insert and gap extend) and an amino acid comparison matrix. • PILEUP will refuse to align sequences that require too many gaps or mismatches. • PILEUP will take quite a while to align more than about 10 sequences.

  14. Instructions for running PILEUP • PILEUP uses a list of sequence files as input • You can use output from a FASTA or LOOKUP search as a list or make your own list in a text editor • A list file can include files from your own directory and/or GCG database files.

  15. LIST file format • List files always begin with two dots (..):
        ..
        gp:S31321
        gp:Yno3_Yeast
        S51900.pep
        Yan2_Schpo
        Ypd1_Caeel
        A36205
        Mpp1_Rat begin:100 end:345
        B46665.pep
        Ymxg_Bacsu begin:150 end:464
        A48043.pep
      • List files can also include Begin and End positions within a sequence.
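To check a list file before handing it to PILEUP, a small reader along these lines may help. It is a sketch under assumptions: it expects one entry per line after the `..` line, with optional `begin:` and `end:` attributes on the same line as the name, and the function name `parse_list_file` is hypothetical. It is not part of GCG.

```python
def parse_list_file(text):
    """Return (name, begin, end) tuples for the entries after the '..' line."""
    entries = []
    started = False
    for raw in text.splitlines():
        line = raw.strip()
        if not started:
            # Everything up to and including the line ending in '..' is skipped.
            started = line.endswith("..")
            continue
        if not line:
            continue
        tokens = line.split()
        name, begin, end = tokens[0], None, None
        for token in tokens[1:]:
            key, _, value = token.partition(":")
            if key.lower() == "begin":
                begin = int(value)
            elif key.lower() == "end":
                end = int(value)
        entries.append((name, begin, end))
    return entries


example = """..
gp:S31321
Mpp1_Rat begin:100 end:345
A48043.pep
"""
for entry in parse_list_file(example):
    print(entry)
```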

  16. PILEUP @myseqs.list • Now at the > prompt, type PILEUP and the name of the file that is your list of sequence names. • However, GCG requires that you precede the name of your list file with the @ character. • So the command looks like this: > PILEUP @myseqs.list

  17. PILEUP Output > more myseqs.msf

                1501                                              1550
      Hsirf2    SERPSKKGKK PKTEKEDKVK HIKQEPVESS LGLSNGVSDL SPEYAVLTST
      Muirf2    SERPSKKGKK PKTEKEERVK HIKQEPVESS LGLSNGVSGF SPEYAVLTSA
      Chirf2    SERPSKKGKK TKSEKDDKFK QIKQEPVESS FGI.NGLNDV TSDY.FLSSS
      Muirf1    LTRNQRKERK SKSSRDTKSK TKRKLCGDVS PDTFS..DGL SSSTLPDDHS
      Ratirf1   LTKNQRKERK SKSSRDTKSK TKRKLCGDSS PDTLS..DGL SSSTLPDDHS
      Hsirf1    LTKNQRKERK SKSSRDAKSK AKRKSCGDSS PDTFS..DGL SSSTLPDDHS
      Chkirf1a  LTKDQKKERK SKSSREARNK SKRKLYEDMR MEESA..ERL TSTPLPDDHS
      Hsirf3a   ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~
      Mmuirf3   ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~
      Hsirf5    GPAPTDSQPP EDYSFGAGEE EEEEEELQRM LPSLSLTDAV QSGPHMTPYS
      Mmuirf6   IPQPQGS.VI NPGSTGSAPW DEKDNDVDED EEEDELEQSQ HHVPIQDTFP
      Hump48    ...PPGIVSG QPGTQKVPSK RQHSSVSSER KEEEDAMQNC TLSPSVLQDS
      Mup48     ...PAGTLPN QPRNQKSPCK RSISCVSPER EEN...MENG RTNGVVNHSD
      Hsirf4    ...PEGAKKG AKQLTLEDPQ MSMSHPYTMT TPYPSLPA.Q VHNYMMPPLD
      Mupip     ...PEGAKKG AKQLTLDDTQ MAMGHPYPMT APYGSLPAQQ VHNYMMPPHD
      Huicsbp   ...PEEDQK. .......... .......... CKLGVATAGC VNEVTEMECG
      Muicsbp   ...PEEEQK. .......... .......... CKLGVAPAGC MSEVPEMECG
      Chkicsbp  ...PEEEQK. .......... .......... CKIGVGNGSS LTDVGDMDCS

                1551                                              1600
      Hsirf2    IKNEVDSTVN IIVVGQSHLD SNIENQEIVT NPPDICQVVE VTTESDEQPV
      Muirf2    IKNEVDSTVN IIVVGQSHLD SNIEDQEIVT NPPDICQVVE VTTESDDQPV
      Chirf2    IKNEVDSTVN IVVVGQPHLD GSSEEQVIVA NPPDVCQVVE VTTESDEQPL
      Muirf1    SYTTQGYLGQ DLDMER.DIT PALSPCVVSS SLSEWHMQMD I.IPDSTTDL
      Ratirf1   SYTAQGYLGQ DLDMDR.DIT PALSPCVVSS SLSEWHMQMD I.MPDSTTDL
      Hsirf1    SYTVPGYM.Q DLEVEQ.ALT PALSPCAVSS TLPDWHIPVE V.VPDSTSDL
      Chkirf1a  SYTAHDYTGQ EVEVENTSIT LDLSSCEVSG SLTDWRMPME IAMADSTNDI
      Hsirf3a   ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~
      Mmuirf3   ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~
      Hsirf5    LLKEDVKWPP TLQPPTLQPP VVLGPPAPDP SPLAPPPGNP AGFRELLSEV
      Mmuirf6   FL........ NINGSPMAPA SVGNCSVGNC SPESVWP... ......KTEP
      Hump48    LNNEEEGASG GAVHSDIGSS SSSSSPEPQE VTDTTEAPFQ ........GD
      Mup48     SGSNIGGGGN GSNRSD...S NSNCNSELEE GAGTTEATIR ........ED
      Hsirf4    RSWRDYVPDQ PHPEIPYQCP MTFGPRGHHW QGPACENGCQ VTGTFYACAP
      Mupip     RSWRDYAPDQ SHPEIPYQCP VTFGPRGHHW QGPSCENGCQ VTGTFYACAP
      Huicsbp   RSEIDELIKE .PSVDDYMGM IKRSPSP... P.DACRS..Q LLPDWWAHEP
      Muicsbp   RSEIEELIKE .PSVDEYMGM TKRSPSP... P.EACRS..Q ILPDWWVQQP
      Chkicsbp  PSAIDDLMKE PPCVDEYLGI IKRSPSP... PQETCRN..P PIPDWWMQQP

  18. PILEUP options • For a first try, take the default options, but give the output file a meaningful name. • If you don’t get a good alignment, try a less stringent matrix and/or gap penalties. > PILEUP -matr=oldpep.cmp • It is a good idea to run PILEUP in batch mode if you have more than 10 sequences to align: > PILEUP -bat

  19. CLUSTAL • CLUSTAL is a stand-alone (i.e. not integrated into GCG) multiple alignment program that is superior in some respects to PILEUP. • Gap penalties can be adjusted based on specific amino acid residues, regions of hydrophobicity, proximity to other gaps, or secondary structure. • It can re-align just selected sequences or selected regions in an existing alignment. • It can compute phylogenetic trees from a set of aligned sequences. • There are also Mac and PC versions with a nice graphical interface (CLUSTALX).

  20. Using CLUSTAL • On mcrcr0 type: clustal • CLUSTAL can only work with sequences in multi-sequence FASTA format. • The GCG program TOFASTA can convert lists of file names into FASTA multi-sequence format.
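If the sequences are not in GCG, a multi-sequence FASTA file is also easy to write by hand or with a few lines of code. The sketch below is not TOFASTA; the function name, the 60-character line width, and the example records are assumptions for illustration.

```python
def to_multi_fasta(records, width=60):
    """Format (name, sequence) pairs as multi-sequence FASTA text."""
    lines = []
    for name, seq in records:
        lines.append(">" + name)                      # header line for each sequence
        lines.extend(seq[i:i + width]                 # sequence wrapped at `width`
                     for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


# Two short fragments taken from the PILEUP output shown earlier.
records = [("Hsirf2", "SERPSKKGKKPKTEKEDKVK"), ("Muirf2", "SERPSKKGKKPKTEKEERVK")]
print(to_multi_fasta(records), end="")
```

Each sequence starts with a `>` header line, and all sequences sit in one file, which is the layout CLUSTAL expects.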

  21. Multiple Alignment tools on the Web • There are a variety of multiple alignment tools available for free on the web. • CLUSTAL is available from a number of sites (with a variety of restrictions) • Other algorithms are available too • Watch out for “experimental” algorithms; there may be a good reason why you have never heard of some oddball program

  22. Some URLs • EMBL-EBI http://www.ebi.ac.uk/clustalw/ • BCM Search Launcher: Multiple Alignment http://dot.imgen.bcm.tmc.edu:9331/multi-align/multi-align.html • Multiple Sequence Alignment for Proteins (Wash. U. St. Louis) http://www.ibc.wustl.edu/service/msa/

  23. Editing Multiple Alignments • There are a variety of tools that can be used to modify a multiple alignment. • These programs can be very useful in formatting and annotating an alignment for publication. • An editor can also be used to make modifications by hand to improve biologically significant regions in a multiple alignment created by one of the automated alignment programs.

  24. GCG alignment editors • Alignments produced with PILEUP (or CLUSTAL) can be adjusted with LINEUP. • Nicely shaded printouts can be produced with PRETTYBOX • GCG's SeqLab X-Windows interface has a superb multiple sequence editor - the best editor of any kind.

  25. Other editors • The MACAW and SeqVu programs for Macintosh and GeneDoc and DCSE for PCs are free and provide excellent editor functionality. • Many “comprehensive” molecular biology programs include multiple alignment functions: • MacVector, OMIGA, Vector NTI, and GeneTool/PepTool all include a built-in version of CLUSTAL

  26. SeqVu

  27. Editors on the Web • Check out CINEMA (Colour INteractive Editor for Multiple Alignments) • It is an editor created completely in JAVA (old browsers beware) • It includes a fully functional version of CLUSTAL, BLAST, and a DotPlot module • http://www.bioinf.man.ac.uk/dbbrowser/CINEMA2.1/
