An Introduction to Bioinformatics

Special Topics BSC4933/5936, Florida State University, The Department of Biological Science, www.bio.fsu.edu. An Introduction to Bioinformatics, Sept. 18, 2003

Multiple Sequence Alignment & Analysis Steven M. Thompson Florida State University School of Computational Science and Information Technology (CSIT) More data yields stronger analyses — if done carefully! Mosaic ideas and evolutionary ‘importance.’

But first, let’s go over some details from Dynamic Programming — As we’ve seen, dynamic programming reduces the complexity of the alignment problem from N^(4N) to N^2, yet some details were glossed over. Most of the dynamic programming examples that we saw treated all gaps the same, whether they were inside an alignment or at its beginning or end, and whether they stood alone or were part of a run of several gaps. Well, truth be told, through lots of practical experience, we’ve learned life doesn’t behave that way! Not at all . . . . The programs, as implemented in all sequence analysis packages, by default do not penalize gaps at the beginning or end of an alignment, and they treat the first gap in a row differently than subsequent gaps: Total penalty = gap creation penalty + ( [ length of gap ] x [ gap extension penalty ] ). This is the so-called ‘affine’ function. Look like anything you recognize?
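As a minimal sketch of the affine penalty above (it has the same shape as a straight line, y = b + mx, with the creation penalty as the intercept and the extension penalty as the slope): the function names and the numeric penalty values below are illustrative assumptions, not the defaults of any particular package.

```python
def affine_gap_penalty(gap_length, creation=8, extension=2):
    """Penalty for one contiguous gap: creation + length x extension.

    The default values are made up for illustration only.
    """
    if gap_length <= 0:
        return 0  # no gap, no penalty
    return creation + gap_length * extension


def linear_gap_penalty(gap_length, per_position=10):
    """For contrast: every gap position costs the same amount."""
    return gap_length * per_position


# One long gap is much cheaper under the affine scheme than under the
# linear one, which is the point of treating the first gap differently.
for length in (1, 2, 5, 10):
    print(length, affine_gap_penalty(length), linear_gap_penalty(length))
```

With these assumed values a single-position gap costs the same under both schemes, while a ten-position gap costs 28 under the affine scheme versus 100 under the linear one.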

OK, back to Multiple Sequence Alignment — Applicability? • So what; why even bother? • Applications: • Probe, primer, and motif design; • Graphical illustrations; • Comparative ‘homology’ inference; • Molecular evolutionary analysis. • All right — well, how do you do it?

Dynamic programming’s complexity increases exponentially with the number of sequences being compared: N-dimensional matrix . . . . complexity = [sequence length]^[number of sequences]
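To get a feel for how quickly that matrix grows, here is a small sketch that simply evaluates [sequence length]^[number of sequences]. The function name and the sequence length of 300 are arbitrary assumptions chosen for illustration.

```python
def full_dp_cells(sequence_length, number_of_sequences):
    """Number of cells in the N-dimensional dynamic programming matrix."""
    return sequence_length ** number_of_sequences


# Each added sequence multiplies the matrix size by the sequence length.
for count in range(2, 7):
    cells = full_dp_cells(300, count)
    print(f"{count} sequences of length 300: {cells:.3e} cells")
```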

‘Global’ heuristic solutions See — MSA (‘global’ within ‘bounding box’) and PIMA (‘local’ portions only) on the multiple alignment page at the Baylor College of Medicine’s Search Launcher — http://searchlauncher.bcm.tmc.edu/ — but, severely limiting restrictions!

Multiple Sequence Dynamic Programming Therefore — pairwise, progressive dynamic programming restricts the solution to the neighborhood of only two sequences at a time. All sequences are compared, pairwise, and then each is aligned to its most similar partner or group of partners. Each group of partners is then aligned to finish the complete multiple sequence alignment.
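The ordering step described above can be sketched roughly as follows. This is only an illustration of the "compare everything pairwise, then join the most similar partners first" idea, not the procedure of ClustalW, PileUp, or any other program: it scores pairs with a plain global alignment using a linear gap cost, treats group-to-group similarity as the average of the member pair scores, and reports only the join order without building the alignment itself. All names, scoring values, and toy sequences are assumptions.

```python
from itertools import combinations


def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch score with a linear gap cost, using two rolling rows."""
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            current.append(max(diagonal, previous[j] + gap, current[j - 1] + gap))
        previous = current
    return previous[-1]


def join_order(sequences):
    """Return the order in which sequences or groups would be merged.

    sequences maps a name to its sequence string.
    """
    scores = {
        frozenset((x, y)): global_score(sequences[x], sequences[y])
        for x, y in combinations(sequences, 2)
    }

    def average(g, h):
        # Mean pairwise score between the members of two groups.
        return sum(scores[frozenset((x, y))] for x in g for y in h) / (len(g) * len(h))

    groups = [(name,) for name in sequences]
    order = []
    while len(groups) > 1:
        g, h = max(combinations(groups, 2), key=lambda pair: average(*pair))
        groups.remove(g)
        groups.remove(h)
        groups.append(g + h)
        order.append((g, h))
    return order


toy = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "TTTTACGT", "D": "TTTTACCC"}
for left, right in join_order(toy):
    print("join", left, "with", right)
```

In a real progressive aligner each join would be followed by an actual alignment of the two groups; here the join order is the only output.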

Web resources for pairwise, progressive multiple alignment — • http://www.techfak.uni-bielefeld.de/bcd/Curric/MulAli/welcome.html • http://pbil.univ-lyon1.fr/alignment.html • http://www.ebi.ac.uk/clustalw/ • http://searchlauncher.bcm.tmc.edu/ However, very large datasets and huge multiple alignments make multiple sequence alignment on the Web impractical once your dataset has reached a certain size. You’ll know it when you’re there!

So, what else is available? Stand-alone ClustalW is available for almost every operating system imaginable, and its graphical user interface, ClustalX, makes running it very easy. Dedicated biocomputing server software, such as the Wisconsin Package with its PileUp program and SeqLab graphical user interface, is another powerful solution.

Reliability and the Comparative Approach — explicit homologous correspondence; manual adjustments based on knowledge, especially structural, regulatory, and functional sites. Therefore, editors like SeqLab and the Ribosomal Database Project: http://rdp.cme.msu.edu/html/.

Structural & Functional correspondence in the Wisconsin Package’s SeqLab —

Work with proteins! If at all possible — Twenty match symbols versus four, plus similarity! Way better signal to noise. Also guarantees no indels are placed within codons. So translate, then align. Nucleotide sequences will only reliably align if they are very similar to each other. And they will require extensive hand editing and careful consideration.
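One way to see why "translate, then align" keeps indels out of codons: once the protein alignment exists, each gapped protein sequence can be threaded back onto its own coding DNA, three bases per residue and three gap characters per protein gap. The sketch below assumes the DNA is exactly the in-frame coding sequence for the protein (no stop codon, no introns); the function name and the toy input are hypothetical.

```python
def thread_codons(aligned_protein, coding_dna, gap="-"):
    """Map a gapped protein sequence back onto its coding DNA, codon by codon."""
    codons = [coding_dna[i:i + 3] for i in range(0, len(coding_dna), 3)]
    residues = sum(1 for symbol in aligned_protein if symbol != gap)
    if len(coding_dna) % 3 or residues != len(codons):
        raise ValueError("DNA length must be exactly 3 x the number of residues")
    remaining = iter(codons)
    # A protein gap becomes three DNA gap characters; a residue becomes its codon.
    return "".join(gap * 3 if symbol == gap else next(remaining)
                   for symbol in aligned_protein)


print(thread_codons("M-K", "ATGAAA"))  # ATG---AAA
```

Because gaps are only ever inserted in multiples of three between whole codons, the reading frame of every sequence is preserved in the resulting nucleotide alignment.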

Beware of aligning apples and oranges [and grapefruit]! Paralogous versus orthologous; genomic versus cDNA; mature versus precursor.

Mask out uncertain areas —

Complications — Order dependence. Not that big of a deal. Substitution matrices and gap penalties. A very big deal! Regional ‘realignment’ becomes incredibly important, especially with sequences that have areas of high and low similarity (GCG’s PileUp -InSitu option).

Complications cont. — Format hassles! Specialized format conversion tools such as GCG’s From* and To* programs and PAUPSearch. Don Gilbert’s public domain ReadSeq program.

Still more complications — Indels and missing data symbols (i.e. gaps) designation discrepancy headaches — ., -, ~, ?, N, or X . . . . . Help!
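A small sketch of one way to tame the gap-symbol mismatch when moving an alignment between programs. Which symbols count as gaps is an assumption here: the sketch converts only `.` and `~` to `-` and deliberately leaves `?`, `N`, and `X` alone, since those usually denote missing or unknown data rather than indels (and `N` is also asparagine in protein sequences). Check what each program actually expects before converting.

```python
def normalize_gaps(aligned_sequence, gap_symbols=".~", gap="-"):
    """Replace each alternative gap symbol with a single agreed-upon one."""
    table = str.maketrans({symbol: gap for symbol in gap_symbols})
    return aligned_sequence.translate(table)


print(normalize_gaps("~~AC..GT-N?"))  # --AC--GT-N?
```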

The consensus and motifs — Conserved regions can be visualized with a sliding window approach and appear as peaks. Let’s concentrate on the first peak seen here to simplify matters.
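A rough sketch of the sliding window idea: score each alignment column, then average the scores over a window so that conserved regions stand out as peaks. The per-column measure used here (fraction of sequences carrying the most common residue) is a simple assumption for illustration; real plotting programs typically use substitution-matrix-based similarity instead. The toy alignment is made up.

```python
from collections import Counter


def column_conservation(alignment, gap="-"):
    """Fraction of sequences sharing the most common non-gap residue, per column."""
    scores = []
    for column in zip(*alignment):
        counts = Counter(symbol for symbol in column if symbol != gap)
        top = counts.most_common(1)[0][1] if counts else 0
        scores.append(top / len(column))
    return scores


def sliding_window(values, window):
    """Mean of each run of `window` consecutive values."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


toy_alignment = ["GHVDHGKS", "GHVDSGKT", "GNVDAGKS"]
per_column = column_conservation(toy_alignment)
print(sliding_window(per_column, 3))
```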

The first GTP binding domain of EF-1α/Tu — A consensus isn’t necessarily the biologically “correct” combination. A simple consensus throws much information away! Therefore, motif definition.
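To illustrate the point about a simple consensus, the sketch below takes the most common symbol in each column of a made-up toy alignment. The function name is hypothetical, and ties are broken by whichever symbol appears first in the column, which is an arbitrary choice.

```python
from collections import Counter


def simple_consensus(alignment):
    """Most common symbol in each column, joined into one string."""
    return "".join(Counter(column).most_common(1)[0][0]
                   for column in zip(*alignment))


toy_alignment = ["AHGK", "GHGR", "GNGK"]
consensus = simple_consensus(toy_alignment)
print(consensus)                    # GHGK
print(consensus in toy_alignment)   # False
```

The consensus here matches none of the input sequences, and it records nothing about which alternatives occurred at the variable columns or how often.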

The EF 1 /Tu P-Loop — Defined as: (A,G)x4GK(S,T). A one-dimensional ‘regular-expression’ of a conserved site. Not necessarily biologically meaningful. Motifs are limited in their ability to discriminate a residue’s ‘importance.’

Conclusions — So how do we include ‘all’ the information of a multiple sequence alignment, or of a region within an alignment, in a description that doesn’t throw anything away? • Enter, for remote homology searching, the ‘profile’ . . . . • But we’ll have to wait until we hear about ‘normal’ homology searching first. So next week I’ll present the basics of the FastA and BLAST families of heuristic algorithms, and after that I’ll review a lecture on interpreting search results given by William Pearson, the developer of FastA. • Then, the following week: profile algorithms, including MEME’s and HMMer’s.

And to ‘honk-my-own-horn’ a bit, check out Current Protocols in Bioinformatics from John Wiley & Sons, Inc: http://www.does.org/cp/bioinfo.html. They asked me to contribute a chapter on multiple sequence analysis using GCG software. Humana Press, Inc. also asked me to contribute. I’ve got two chapters in their Introduction to Bioinformatics — A Theoretical And Practical Approach: http://www.humanapress.com/Product.pasp?txtCatalog=HumanaBooks&txtCategory=&txtProductID=1-58829-241-X&isVariant=0. Both volumes were available early 2003. FOR EVEN MORE INFO... Participate in the lab for this course and/or my workshop series: http://bio.fsu.edu/~stevet/workshop.html. Contact me (stevet@bio.fsu.edu) for specific bioinformatics assistance and/or collaboration.

References —
Bailey, T.L. and Elkan, C. (1994) Fitting a mixture model by expectation maximization to discover motifs in biopolymers, in Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, AAAI Press, Menlo Park, California, U.S.A., pp. 28–36.
Bairoch, A. (1992) PROSITE: A Dictionary of Sites and Patterns in Proteins. Nucleic Acids Research 20, 2013–2018.
Eddy, S.R. (1996) Hidden Markov models. Current Opinion in Structural Biology 6, 361–365.
Eddy, S.R. (1998) Profile hidden Markov models. Bioinformatics 14, 755–763.
Felsenstein, J. (1993) PHYLIP (Phylogeny Inference Package) version 3.5c. Distributed by the author. Dept. of Genetics, University of Washington, Seattle, Washington, U.S.A.
Feng, D.F. and Doolittle, R.F. (1987) Progressive sequence alignment as a prerequisite to correct phylogenetic trees. Journal of Molecular Evolution 25, 351–360.
Genetics Computer Group (Copyright 1982–2003) Program Manual for the Wisconsin Package, Version 10.3, Accelrys, subsidiary of Pharmacopeia Inc.
Gilbert, D.G. (1993 [C release] and 1999 [Java release]) ReadSeq, public domain software distributed by the author. http://iubio.bio.indiana.edu/soft/molbio/readseq/ Bioinformatics Group, Biology Department, Indiana University, Bloomington, Indiana, U.S.A.
Gribskov, M., McLachlan, M., and Eisenberg, D. (1987) Profile analysis: detection of distantly related proteins. Proc. Natl. Acad. Sci. U.S.A. 84, 4355–4358.
Gupta, S.K., Kececioglu, J.D., and Schaffer, A.A. (1995) Improving the practical space and time efficiency of the shortest-paths approach to sum-of-pairs multiple sequence alignment. Journal of Computational Biology 2, 459–472.
Smith, R.F. and Smith, T.F. (1992) Pattern-induced multi-sequence alignment (PIMA) algorithm employing secondary structure-dependent gap penalties for comparative protein modelling. Protein Engineering 5, 35–41.
Swofford, D.L. (1989–2002) PAUP* (Phylogenetic Analysis Using Parsimony and other methods) version 4.0+. Florida State University, Tallahassee, Florida, U.S.A. http://paup.csit.fsu.edu/ Distributed through Sinauer Associates, Inc., Sunderland, Massachusetts, U.S.A. http://www.sinauer.com/
Thompson, J.D., Gibson, T.J., Plewniak, F., Jeanmougin, F., and Higgins, D.G. (1997) The ClustalX windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Research 25, 4876–4882.
Thompson, J.D., Higgins, D.G., and Gibson, T.J. (1994) CLUSTALW: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, positions-specific gap penalties and weight matrix choice. Nucleic Acids Research 22, 4673–4680.
