Special Topics BSC4933/5936 Florida State University The Department of Biological Science www.bio.fsu.edu. An Introduction to Bioinformatics. Sept. 18, 2003. Multiple Sequence Alignment & Analysis. Steven M. Thompson
Florida State University School of Computational Science and Information Technology (CSIT)
More data yields stronger analyses — if done carefully!
Mosaic ideas and evolutionary ‘importance.’
As we’ve seen, dynamic programming reduces the complexity of the alignment problem from N^(4N) to N^2, yet some details were glossed over. Most of the dynamic programming examples that we saw treated gaps the same, whether they were inside an alignment or at the beginning or end of the alignment, and whether or not they existed all by themselves or in a run of multiple occurrences. Well, truth be told, through lots of practical experience, we’ve learned life doesn’t behave that way! Not at all . . . .
The programs, as implemented in all sequence analysis packages, by default do not penalize gaps at the beginning or end of an alignment, and they treat the first gap in a row differently than subsequent gaps:
Total penalty = gap creation penalty + ( [ length of gap ] x [ gap extension penalty ] )
The so-called ‘affine’ function. Look like anything you recognize?
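A minimal sketch of the affine penalty above, next to a flat per-position penalty for comparison. The default values (8 for creation, 2 for extension) are illustrative assumptions only, not the defaults of any particular package.

```python
def affine_gap_penalty(gap_length, gap_creation=8, gap_extension=2):
    """Total penalty = gap creation penalty + (length of gap x gap extension penalty).

    The default numbers are assumptions chosen for illustration.
    """
    if gap_length <= 0:
        return 0  # no gap, no penalty
    return gap_creation + gap_length * gap_extension


def flat_gap_penalty(gap_length, per_position=8):
    """Every gap position costs the same, as in the simplest DP examples."""
    return max(gap_length, 0) * per_position


for length in (1, 2, 5, 10):
    print(length, affine_gap_penalty(length), flat_gap_penalty(length))
```

Under the affine scheme one long gap costs far less than the same number of positions scattered as separate gaps, and the function has the same shape as the straight line y = b + mx.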
N-dimensional matrix . . . .
complexity = [sequence length]^[number of sequences]
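To get a feel for that growth, the arithmetic can be worked through directly. This is only a sketch: it counts matrix cells for sequences of equal length, and the length of 300 residues is an assumed example value.

```python
def dp_cells(sequence_length, number_of_sequences):
    """Cells in a full N-dimensional dynamic programming matrix,
    assuming every sequence has the same length."""
    return sequence_length ** number_of_sequences


# 300 residues is an arbitrary, illustrative protein length.
for n in (2, 3, 4, 5):
    print(n, "sequences:", dp_cells(300, n), "cells")
```

Each added sequence multiplies the work by the sequence length, which is why full N-dimensional alignment is restricted to very small datasets.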
MSA (‘global’ within ‘bounding box’) and
PIMA (‘local’ portions only) on the multiple alignment page at the
Baylor College of Medicine’s Search Launcher —
http://searchlauncher.bcm.tmc.edu/ — but,
severely limiting restrictions!
Therefore, pairwise, progressive dynamic programming restricts the solution to the neighborhood of only two sequences at a time.
All sequences are compared, pairwise, and then each is aligned to its most similar partner or group of partners. Each group of partners is then aligned to finish the complete multiple sequence alignment.
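The ordering step described above can be sketched as follows. This is not the code of ClustalW or PileUp: the scoring values, the linear gap cost, and the average-linkage grouping are simplifying assumptions, and the sketch only reports the order in which sequences and groups would be joined, not the alignment itself.

```python
from itertools import combinations


def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global pairwise alignment score by dynamic programming (linear gap cost)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def joining_order(seqs):
    """Return the list of (group, group) merges, most similar first.

    seqs maps a name to a sequence. Groups are tuples of names; similarity
    between two groups is the average of their pairwise scores.
    """
    score = {frozenset(pair): nw_score(seqs[pair[0]], seqs[pair[1]])
             for pair in combinations(seqs, 2)}

    def average(g, h):
        total = sum(score[frozenset((x, y))] for x in g for y in h)
        return total / (len(g) * len(h))

    groups = [(name,) for name in seqs]
    order = []
    while len(groups) > 1:
        g, h = max(combinations(groups, 2), key=lambda gh: average(*gh))
        order.append((g, h))
        groups = [k for k in groups if k not in (g, h)] + [g + h]
    return order


# Hypothetical toy sequences, for illustration only.
toy = {"a": "GATTACA", "b": "GATTACA", "c": "GATCACA", "d": "TTTTTTT"}
for step in joining_order(toy):
    print(step)
```

In this toy case the two identical sequences are joined first, the near-identical one is added next, and the outlier comes last, which is the behavior the progressive approach relies on.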
However, problems with very large datasets and huge multiple alignments make doing multiple sequence alignment on the Web impractical after your dataset has reached a certain size. You’ll know it when you’re there!
Stand-alone ClustalW is available for most every operating system imaginable. And its graphical user interface ClustalX makes running it very easy.
Dedicated biocomputing server software, such as the Wisconsin Package with its PileUp program and graphical user interface SeqLab, is another powerful solution.
explicit homologous correspondence;
manual adjustments based on knowledge,
especially structural, regulatory, and functional sites.
Therefore, editors like SeqLab and
the Ribosomal Database Project:
Twenty match symbols versus four, plus similarity! Way better signal to noise.
Also guarantees no indels are placed within codons. So translate, then align.
Nucleotide sequences will only reliably align if they are very similar to each other. And they will require extensive hand editing and careful consideration.
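One way to follow the ‘translate, then align’ advice and still end up with a nucleotide alignment is to thread each coding sequence back onto its aligned protein, three gap characters per protein gap. The sketch below assumes the protein alignment was produced separately and that each DNA sequence is an in-frame coding region with no stop codon; the function name is hypothetical.

```python
def thread_codons(aligned_protein, dna, gap="-"):
    """Map a gapped protein sequence back onto its coding DNA, codon by codon.

    Every protein gap becomes three DNA gap characters, so no indel can
    fall inside a codon.
    """
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    n_residues = sum(1 for ch in aligned_protein if ch != gap)
    if len(dna) % 3 or n_residues != len(codons):
        raise ValueError("DNA length must be exactly 3 x the number of residues")
    remaining = iter(codons)
    return "".join(gap * 3 if ch == gap else next(remaining)
                   for ch in aligned_protein)


# Toy example: two residues with one alignment gap between them.
print(thread_codons("M-K", "ATGAAA"))
```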
Paralogous versus orthologous;
genomic versus cDNA;
mature versus precursor.
Not that big of a deal.
Substitution matrices and gap penalties.
A very big deal!
Regional ‘realignment’ becomes incredibly important, especially with sequences that have areas of high and low similarity (GCG’s PileUp -InSitu option).
Specialized format conversion tools such as GCG’s ‘From’ and ‘To’ programs and PAUPSearch.
Don Gilbert’s public domain ReadSeq program.
Indels and missing data symbols (i.e. gaps) designation discrepancy headaches —
., -, ~, ?, N, or X
. . . . . Help!
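A small sketch of one way to tame part of this headache before moving an alignment between programs. It assumes that ‘.’ and ‘~’ both mean a gap in the source file; that is an assumption to check for every format, since ‘.’ can mean ‘same as the first sequence’ in some files. The symbols ?, N, and X are deliberately left alone, because they usually mark missing or ambiguous data rather than gaps.

```python
def normalize_gaps(seq, gap_symbols=".~", target="-"):
    """Rewrite the listed gap symbols as a single target symbol.

    Which symbols count as gaps is an assumption passed in by the caller;
    missing-data and ambiguity codes (?, N, X) are not touched by default.
    """
    table = str.maketrans({symbol: target for symbol in gap_symbols})
    return seq.translate(table)


print(normalize_gaps("~~AC..GT-N?"))
```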
A consensus isn’t necessarily the biologically “correct” combination.
A simple consensus throws much information away!
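To see how much a simple consensus discards, here is a minimal majority-rule sketch. The three aligned sequences are hypothetical toy data, and plurality voting per column is only one of several possible consensus rules.

```python
from collections import Counter


def simple_consensus(alignment):
    """Most common symbol in each column of an alignment (equal-length strings)."""
    return "".join(Counter(column).most_common(1)[0][0]
                   for column in zip(*alignment))


# Hypothetical toy alignment, one string per sequence.
toy_alignment = ["GKST", "GKSS", "AKTT"]
print(simple_consensus(toy_alignment))
```

The consensus comes out as one string, and the minority residues in columns 1, 3, and 4 leave no trace in it.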
Therefore, motif definition.
A one-dimensional ‘regular-expression’ of a conserved site.
Not necessarily biologically meaningful.
Motifs are limited in their ability to discriminate a residue’s ‘importance.’
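A motif written in PROSITE-style notation maps almost directly onto a regular expression. The sketch below handles only the basic elements of that notation (residues, x, [..], {..}, and (n) or (n,m) repeats), and the test sequence is an illustrative string rather than a database entry. The pattern used is the ATP/GTP-binding site motif A (P-loop) pattern from PROSITE.

```python
import re


def prosite_to_regex(pattern):
    """Convert a simple PROSITE-style pattern to a Python regular expression.

    Only a subset of the notation is handled; anchors (< and >) are not.
    """
    regex = pattern.rstrip(".").replace("-", "")       # drop element separators
    regex = regex.replace("x", ".")                    # x = any residue
    regex = regex.replace("{", "[^").replace("}", "]")  # {..} = excluded residues
    regex = regex.replace("(", "{").replace(")", "}")   # (n) = repeat count
    return regex


p_loop = prosite_to_regex("[AG]-x(4)-G-K-[ST]")
hit = re.search(p_loop, "MTEYKLVVVGAGGVGKSALT")  # illustrative test string
print(p_loop, hit.group() if hit else None)
```

Every residue allowed at a position matches equally well, which is the limitation noted above: the expression cannot say that one residue is more common or more important than another.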
So how do we include ‘all’ the information of a multiple sequence alignment, or of a region within an alignment, in a description that doesn’t throw anything away?
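One answer, in the spirit of the profile analysis cited in the references, is to keep the residue frequencies of every column instead of a single symbol per column. The sketch below is a bare frequency table over hypothetical toy data; real profiles add sequence weighting, a substitution matrix, and position-specific gap penalties, none of which are shown here.

```python
from collections import Counter


def column_frequencies(alignment):
    """For each column, return a dict mapping each symbol to its frequency."""
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        profile.append({symbol: n / len(column) for symbol, n in counts.items()})
    return profile


# Hypothetical toy alignment, one string per sequence.
toy_profile = column_frequencies(["GKST", "GKSS", "AKTT"])
for position, frequencies in enumerate(toy_profile, start=1):
    print(position, frequencies)
```

Unlike a consensus or a regular expression, this table records that column 2 is invariant while column 1 is G two times out of three.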
They asked me to contribute a chapter on multiple sequence analysis using GCG software.
Humana Press, Inc. also asked me to contribute. I’ve got two chapters in their Introduction to Bioinformatics —
A Theoretical And Practical Approach:
Both volumes were available early 2003.
FOR EVEN MORE INFO...
Participate in the lab for this course and/or my workshop series:
Contact me (firstname.lastname@example.org) for specific bioinformatics assistance and/or collaboration.
References
Bailey, T.L. and Elkan, C. (1994) Fitting a mixture model by expectation maximization to discover motifs in biopolymers, in Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, AAAI Press, Menlo Park, California, U.S.A. pp. 28–36.
Bairoch, A. (1992) PROSITE: A Dictionary of Sites and Patterns in Proteins. Nucleic Acids Research 20, 2013–2018.
Eddy, S.R. (1996) Hidden Markov models. Current Opinion in Structural Biology 6, 361–365.
Eddy, S.R. (1998) Profile hidden Markov models. Bioinformatics 14, 755–763.
Felsenstein, J. (1993) PHYLIP (Phylogeny Inference Package) version 3.5c. Distributed by the author. Dept. of Genetics, University of Washington, Seattle, Washington, U.S.A.
Feng, D.F. and Doolittle, R.F. (1987) Progressive sequence alignment as a prerequisite to correct phylogenetic trees. Journal of Molecular Evolution 25, 351–360.
Genetics Computer Group (Copyright 1982-2003) Program Manual for the Wisconsin Package, Version 10.3, Accelrys, subsidiary of Pharmacopeia Inc.
Gilbert, D.G. (1993 [C release] and 1999 [Java release]) ReadSeq, public domain software distributed by the author. http://iubio.bio.indiana.edu/soft/molbio/readseq/ Bioinformatics Group, Biology Department, Indiana University, Bloomington, Indiana, U.S.A.
Gribskov, M., McLachlan, M., and Eisenberg, D. (1987) Profile analysis: detection of distantly related proteins. Proc. Natl. Acad. Sci. U.S.A. 84, 4355–4358.
Gupta, S.K., Kececioglu, J.D., and Schaffer, A.A. (1995) Improving the practical space and time efficiency of the shortest-paths approach to sum-of-pairs multiple sequence alignment. Journal of Computational Biology 2, 459–472.
Smith, R.F. and Smith, T.F. (1992) Pattern-induced multi-sequence alignment (PIMA) algorithm employing secondary structure-dependent gap penalties for comparative protein modelling. Protein Engineering 5, 35–41.
Swofford, D.L., PAUP* (Phylogenetic Analysis Using Parsimony and other methods) version 4.0+ (1989–2002) Florida State University, Tallahassee, Florida, U.S.A. http://paup.csit.fsu.edu/ distributed through Sinauer Associates, Inc. http://www.sinauer.com/ Sunderland, Massachusetts, U.S.A.
Thompson, J.D., Gibson, T.J., Plewniak, F., Jeanmougin, F. and Higgins, D.G. (1997) The ClustalX windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Research 24, 4876–4882.
Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994) CLUSTALW: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research 22, 4673–4680.