Florida State University — Bioinformatics Workshop #1

An Introduction to Multiple Sequence Alignment & Analysis thru GCG’s SeqLab
Steven M. Thompson
Florida State University School of Computational Science (SCS)
Sept. 21, 2006, 5:30 PM

But first a prelude: my definitions.
• Biocomputing and computational biology are synonymous and describe the use of computers and computational techniques to analyze any biological system, from molecules, through cells, tissues, and organisms, all the way to populations.
• Bioinformatics describes using computational techniques to access, analyze, and interpret the biological information in any of the available biological databases.
• Sequence analysis is the study of molecular sequence data for the purpose of inferring the function, mechanism, interactions, evolution, and perhaps structure of biological molecules.
• Genomics analyzes the context of genes or complete genomes (the total DNA content of an organism) within and across genomes.
• Proteomics is the subdivision of genomics concerned with analyzing the complete protein complement, i.e. the proteome, of organisms, both within and between different organisms.

And a ‘way’ to think about it: the reverse biochemistry analogy, from a ‘virtual’ DNA sequence to actual molecular physical characterization, not the other way ‘round. Using bioinformatics tools, you can infer all sorts of functional, evolutionary, and structural insights into a gene product without the need to isolate and purify massive amounts of protein! Eventually you can go on to clone and express the gene based on that analysis using PCR techniques. The computer and molecular databases are an essential part of this process.

The exponential growth of molecular sequence databases (& CPU power):

Year    BasePairs      Sequences
1982    680338         606
1983    2274029        2427
1984    3368765        4175
1985    5204420        5700
1986    9615371        9978
1987    15514776       14584
1988    23800000       20579
1989    34762585       28791
1990    49179285       39533
1991    71947426       55627
1992    101008486      78608
1993    157152442      143492
1994    217102462      215273
1995    384939485      555694
1996    651972984      1021211
1997    1160300687     1765847
1998    2008761784     2837897
1999    3841163011     4864570
2000    11101066288    10106023
2001    15849921438    14976310
2002    28507990166    22318883
2003    36553368485    30968418
2004    44575745176    40604319
2005    56037734462    52016762

Source: http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
Doubling time ~ 1 year!
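As a rough check on the doubling-time remark, the average doubling time can be estimated from any two rows of the GenBank figures, assuming steady exponential growth between them. The function name below is only illustrative.

```python
import math

def doubling_time(start_year, start_bp, end_year, end_bp):
    """Average doubling time in years, assuming steady exponential growth."""
    doublings = math.log2(end_bp / start_bp)
    return (end_year - start_year) / doublings

# Endpoints taken from the GenBank base-pair counts (1982 and 2005).
avg = doubling_time(1982, 680338, 2005, 56037734462)
# One unusually fast year for comparison (1999 to 2000).
fast = doubling_time(1999, 3841163011, 2000, 11101066288)
print(f"average: {avg:.2f} years; 1999-2000: {fast:.2f} years")
```

Averaged over the whole period the estimate comes out somewhat above one year, while individual years run faster, so "~ 1 year" is best read as a rule of thumb.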

Back to multiple sequence alignment: applicability?
• So what; why even bother?
• Applications:
  • Probe/primer and motif/profile design;
  • Graphical illustrations;
  • Comparative ‘homology’ inference;
  • Molecular evolutionary analysis.
• OK, well, how do you do it?

Dynamic programming’s complexity increases exponentially with the number of sequences being compared (an N-dimensional matrix): complexity = [sequence length]^[number of sequences]
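To get a feel for that growth, here is a minimal sketch that counts the cells of the full N-dimensional matrix, assuming all sequences share one length. The function name and the length of 300 are illustrative choices, not values from the workshop.

```python
def full_dp_cells(sequence_length, number_of_sequences):
    """Cells in the N-dimensional dynamic programming matrix."""
    return sequence_length ** number_of_sequences

# A 300-residue protein family of increasing size.
for n in (2, 3, 5, 10):
    print(f"{n:2d} sequences: {full_dp_cells(300, n):,} cells")
```

Two sequences need 90,000 cells; the count is multiplied by the sequence length for every sequence added, which is why exact multiple alignment is only practical for a handful of short sequences.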

‘Global’ heuristic solutions — See: MSA (‘global’ within ‘bounding box’) and PIMA (‘local’ portions only) on the multiple alignment page at the Baylor College of Medicine’s Search Launcher — http://searchlauncher.bcm.tmc.edu/ — but, severely limiting restrictions!

Multiple Sequence Dynamic Programming: Therefore, pairwise, progressive dynamic programming restricts the solution to the neighborhood of only two sequences at a time. All sequences are compared, pairwise, and then each is aligned to its most similar partner or group of partners. Each group of partners is then aligned to finish the complete multiple sequence alignment.
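The first stage, comparing every pair and finding the most similar partners, can be sketched as follows. This is only a toy: it uses Python's difflib similarity ratio as a stand-in for a real pairwise alignment score, and the ranked pair list is a crude substitute for the clustering that programs such as PileUp perform. All names are illustrative.

```python
from difflib import SequenceMatcher
from itertools import combinations

def pairwise_similarities(seqs):
    """Score every pair once; returns {(name_a, name_b): similarity}."""
    return {
        (a, b): SequenceMatcher(None, seqs[a], seqs[b], autojunk=False).ratio()
        for a, b in combinations(sorted(seqs), 2)
    }

def join_order(seqs):
    """Pairs ranked from most to least similar (stand-in for a guide tree)."""
    scores = pairwise_similarities(seqs)
    return sorted(scores, key=scores.get, reverse=True)

toy = {"a": "ACGTACGT", "b": "ACGTACGA", "c": "TTTTGGGG"}
order = join_order(toy)
print(len(order), "pairwise comparisons; align first:", order[0])
```

The number of comparisons grows as N(N-1)/2 rather than exponentially, which is the whole point of the progressive approach.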

Reliability and the Comparative Approach — explicit homologous correspondence; manual adjustments based on knowledge, especially structural, regulatory, and functional sites. Therefore, editors like SeqLab and the Ribosomal Database Project: http://rdp.cme.msu.edu/index.jsp.

Structural & Functional correspondence in the Wisconsin Package’s SeqLab —

Work with proteins if at all possible! Twenty match symbols versus four, plus similarity! Way better signal to noise. It also guarantees that no indels are placed within codons. So translate, then align. Nucleotide sequences will only reliably align if they are very similar to each other, and they will require extensive hand editing and careful consideration.
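The "translate, then align" advice implies a final step when a nucleotide alignment is still wanted: copying the gaps of the protein alignment back onto the coding DNA, three bases at a time. A minimal sketch of that step, assuming the DNA is exactly the coding region for the protein and that "-" marks a gap (the function name is hypothetical):

```python
def thread_codons(aligned_protein, dna, gap="-"):
    """Copy the gaps of an aligned protein onto its coding DNA, codon by codon."""
    residues = len(aligned_protein.replace(gap, ""))
    if len(dna) != 3 * residues:
        raise ValueError("DNA length must be three times the residue count")
    out, pos = [], 0
    for symbol in aligned_protein:
        if symbol == gap:
            out.append(gap * 3)           # one protein gap becomes a whole-codon gap
        else:
            out.append(dna[pos:pos + 3])  # next codon, kept intact
            pos += 3
    return "".join(out)

print(thread_codons("M-K", "ATGAAA"))
```

Because gaps are only ever inserted in multiples of three at codon boundaries, no indel can split a codon.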

Beware of aligning apples and oranges [and grapefruit]! Paralogous versus orthologous; genomic versus cDNA; mature versus precursor.

Mask out uncertain areas —

Complications: Order dependence. Not that big of a deal. Substitution matrices and gap penalties. A very big deal! Regional ‘realignment’ becomes incredibly important, especially with sequences that have areas of high and low similarity (GCG’s PileUp -InSitu option).

Complications cont. — Format hassles! Specialized format conversion tools such as GCG’s SeqConv+ program and PAUPSearch wrapper. Don Gilbert’s public domain ReadSeq program.

Still more complications — Indels and missing data symbols (i.e. gaps) designation discrepancy headaches — ., -, ~, ?, N, or X . . . . . Help!
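One way to tame the symbol discrepancy before moving an alignment between programs is to rewrite all gap-like characters to a single symbol. The sketch below assumes that ".", "~", and "-" all mean a gap and that "?", "N", and "X" should be left alone; which symbols mean what differs between programs, so check the target program's documentation before adopting any such mapping.

```python
GAP_LIKE = ".~-"  # assumption: all three are used as gap symbols

def normalize_gaps(aligned, target="-"):
    """Rewrite every gap-like symbol in an aligned sequence to one target symbol."""
    table = str.maketrans({symbol: target for symbol in GAP_LIKE})
    return aligned.translate(table)

print(normalize_gaps("AC..GT~~"))
```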

Web resources for pairwise, progressive multiple alignment:
• http://www.techfak.uni-bielefeld.de/bcd/Curric/MulAli/welcome.html
• http://pbil.univ-lyon1.fr/alignment.html
• http://www.ebi.ac.uk/clustalw/
• http://searchlauncher.bcm.tmc.edu/
However, problems with very large datasets and huge multiple alignments make doing multiple sequence alignment on the Web impractical after your dataset has reached a certain size. You’ll know it when you’re there!

If large datasets become intractable for analysis on the Web, what other resources are available? Desktop software solutions — public domain programs are available, but . . . complicated to install, configure, and maintain. User must be pretty computer savvy. So, commercial software packages are available, e.g. MacVector, DS Gene, DNAsis, DNAStar, etc., but . . . license hassles, big expense per machine, and Internet and/or CD database access all complicate matters!

Therefore, UNIX server-based solutions — Public domain solutions also exist, but now a very cooperative systems manager needs to maintain everything for users, so, commercial products, e.g. the Accelrys GCG Wisconsin Package and the SeqLab Graphical User Interface, simplify matters for administrators and users. One license fee for an entire institution and very fast, convenient database access on local server disks without the need to download and/or reformat sequences. Connections from any networked terminal or workstation anywhere/anytime! Operating system: UNIX command line operation hassles; communications software — ssh and terminal emulation; X11 graphics; file transfer via scp/sftp; and editors — vi, emacs, pico (or desktop word processing followed by file transfer [save as "text only, UNIX line breaks!"]). See the lab tutorial Appendix II.

The Genetics Computer Group: the Accelrys Wisconsin Package for Sequence Analysis
GCG began in 1982 in Oliver Smithies’ Genetics Dept. lab at the University of Wisconsin, Madison. In 1990 it became a private company, which was acquired by the Oxford Molecular Group, U.K., in 1997, and then by Pharmacopeia Inc., U.S.A., in 2000; in 2004 Accelrys, San Diego, California, left Pharmacopeia to become an independent entity.
The suite contains around 150 programs designed to work in a “toolbox” fashion: several simple programs used in succession can lead to very sophisticated results. It also has ‘internal compatibility,’ i.e. once you learn to use one program, all programs can be run similarly, and the output from many programs can be used as input for other programs.
It is used all over the world at over 950 institutions, so learning it will likely be useful at other research institutions as well.

To answer the always perplexing GCG question, “What sequence(s)? . . . .”
Specifying sequences, GCG style, in order of increasing power and complexity:
• The sequence is in a local GCG format single sequence file in your UNIX account (see the GCG Reformat and SeqConv+ programs).
• The sequence is in a local GCG database, in which case you ‘point’ to it by using any of the GCG database logical names. A colon, “:,” always sets the logical name apart from an accession number, a proper identifier name, or a wildcard expression. Logical names are case insensitive.
• The sequence is in a GCG format multiple sequence file, either an MSF (multiple sequence format) file or an RSF (rich sequence format) file. To specify sequences contained in a GCG multiple sequence file, supply the file name followed by a pair of braces, “{},” containing the sequence specification, e.g. a wildcard: {*}.
• Finally, the most powerful method of specifying sequences is a GCG “list” file. This is merely a list of other sequence specifications and can even contain other list files within it. The convention for using a GCG list file in a program is to precede it with an at sign, “@.” Furthermore, attribute information within list files can specify particular sequence aspects.
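The four specification styles can be told apart by their punctuation alone. The following sketch classifies a specification string using only the conventions described above ("@" prefix, braces, colon); it is an illustration of the syntax, not how GCG itself parses its input, and the function name is hypothetical.

```python
def classify_spec(spec):
    """Classify a GCG-style sequence specification by its punctuation."""
    spec = spec.strip()
    if spec.startswith("@"):
        return ("list file", spec[1:])
    if spec.endswith("}") and "{" in spec:
        filename, _, inner = spec[:-1].partition("{")
        return ("multiple sequence file", filename, inner)
    if ":" in spec:
        logical, _, entry = spec.partition(":")
        # Logical names are case insensitive, so normalize them.
        return ("database", logical.upper(), entry)
    return ("single sequence file", spec)

for example in ("my-special.pep", "SwissProt:EfTu_Ecoli", "Ef1a-Tu.msf{*}", "@another.list"):
    print(example, "->", classify_spec(example))
```

The checks run from the most specific marker to the least, so a path inside braces or after "@" is never mistaken for a database entry.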

‘Clean’ GCG format single sequence file after Reformat or SeqConv+:

!!NA_SEQUENCE 1.0
This is a small example of GCG single sequence format.
Always put some documentation on top, so in the future
you can figure out what it is you're dealing with!
The line with the two periods is converted to the checksum line.

example.seq  Length: 77  July 21, 1999 09:30  Type: N  Check: 4099 ..

       1  ACTGACGTCA CATACTGGGA CTGAGATTTA CCGAGTTATA CAAGTATACA
      51  GATTTAATAG CATGCGATCC CATGGGA

SeqLab’s Editor mode can also “Import” native GenBank or FastA format and ABI or LI-COR trace files!
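Reading such a file back is mostly a matter of finding the line that ends in two periods and discarding the position numbers and spaces after it. A minimal sketch, assuming the first line ending in ".." is the divider (the function name is hypothetical, and the example text is abbreviated from the file shown above):

```python
def read_gcg_sequence(text):
    """Return (documentation, sequence) from GCG single-sequence text."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.rstrip().endswith(".."):
            break
    else:
        raise ValueError("no '..' divider line found")
    documentation = "\n".join(lines[:i + 1])
    # Everything after the divider is data: keep letters, drop numbers and spaces.
    sequence = "".join(ch for ch in "".join(lines[i + 1:]) if ch.isalpha())
    return documentation, sequence

example = """!!NA_SEQUENCE 1.0
This is a small example of GCG single sequence format.
example.seq  Length: 77  July 21, 1999 09:30  Type: N  Check: 4099 ..

       1  ACTGACGTCA CATACTGGGA CTGAGATTTA CCGAGTTATA CAAGTATACA
      51  GATTTAATAG CATGCGATCC CATGGGA
"""
doc, seq = read_gcg_sequence(example)
print(len(seq))
```

The recovered length can be compared with the Length: field on the divider line as a quick sanity check.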

Hey, what’s the deal with the new “+” programs?
Quoting directly from the GCG Program Manual:
“Advantages of Plus “+” Programs:
√ Plus programs are enhanced to be able to read sequences in a variety of native formats such as GCG RSF, GCG SSF, GCG MSF, GenBank, EMBL, FastA, SwissProt, PIR, and BSML without conversion.
√ Plus programs remove sequence length restriction of 350,000 bp.
If you do not need these features and wish to have more interactivity, you might wish to seek out and run the original program version.”

OK, on to logical terms for GCG:

Sequence databases, nucleic acids:
GENBANKPLUS:*, GBP:*       all of GenBank plus EST, HTC, and GSS
GENBANK:*, GB:*            all of GenBank except EST, HTC, and GSS
BACTERIAL:*, BA:*          GenBank bacteria and archaea
INVERTEBRATE:*, IN:*       GenBank invertebrate
OTHERMAMMAL:*, OM:*        GenBank other mammal
OTHERVERTEBRATE:*, OV:*    GenBank other vertebrate
PHAGE:*, PH:*              GenBank phage
PLANT:*, PL:*              GenBank plant and fungi
PRIMATE:*, PR:*            GenBank primate
RODENT:*, RO:*             GenBank rodent
VIRAL:*, VI:*              GenBank viral
TAGS:*                     GenBank EST, HTC, and GSS
EST:*                      GenBank EST Expressed Sequence Tags
GSS:*                      GenBank Genome Survey Sequences
HTC:*                      GenBank High Throughput cDNA
HTG:*                      GenBank High Throughput Genomic
PATENT:*, PAT:*            GenBank patent
STS:*                      GenBank Sequence Tagged Sites
SYNTHETIC:*, SY:*          GenBank synthetic
UNANNOTATED:*, UN:*        GenBank unannotated
REFSEQNUC:*, RS_RNA:*      NCBI RefSeq transcriptomes

Genome sequence databases, nucleic acids:
HOMO:*                     NCBI human RefSeq working draft
PAN:*                      NCBI chimpanzee RefSeq working draft
DANIO:*                    Sanger Zebrafish assembly
CELEGANS:*                 NCBI nematode RefSeq assembly

Sequence databases, amino acids:
UNIPROT:*, UNI:*           all of Swiss-Prot and all of SPTREMBL
SWISSPROTPLUS:*, SWP:*     all of Swiss-Prot and all of SPTREMBL
SWISSPROT:*, SWISS:*, SW:* all of Swiss-Prot (fully annotated)
SPTREMBL:*, SPT:*          Swiss-Prot preliminary EMBL translations
GENPEPT:*, GP:*            all of GenBank’s CDS translations
REFSEQPROT:*, RS_PROT:*    NCBI RefSeq proteomes

These are easy: they make sense and you’ll have a vested interest. But beware BA and PL . . . .

GCG MSF & RSF format:

!!AA_MULTIPLE_ALIGNMENT 1.0

 small.pfs.msf  MSF: 735  Type: P  July 20, 2001 14:53  Check: 6619 ..

 Name: a49171  Len: 425  Check: 537   Weight: 1.00
 Name: e70827  Len: 577  Check: 21    Weight: 1.00
 Name: g83052  Len: 718  Check: 9535  Weight: 1.00
 Name: f70556  Len: 534  Check: 3494  Weight: 1.00
 Name: t17237  Len: 229  Check: 9552  Weight: 1.00
 Name: s65758  Len: 735  Check: 111   Weight: 1.00
 Name: a46241  Len: 274  Check: 3514  Weight: 1.00

//

//////////////////////////////////////////////////

!!RICH_SEQUENCE 1.0
..
{
name ef1a_giala
descrip PileUp of: @/users1/thompson/.seqlab-mendel/pileup_28.list
type PROTEIN
longname /users1/thompson/seqlab/EF1A_primitive.orig.msf{ef1a_giala}
sequence-ID Q08046
checksum 7342
offset 23
creation-date 07/11/2001 16:51:19
strand 1
comments

////////////////////////////////////////////////////////////

RSF is SeqLab’s native format. The trick is to not forget the braces and ‘wild card,’ e.g. filename{*}, when specifying!
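The Name: lines of an MSF header follow a regular pattern, so the sequence names available for a filename{name} specification can be pulled out with a regular expression. A minimal sketch, assuming the four fields always appear in the order shown in the header above (the function name and pattern are illustrative):

```python
import re

# Fields in the order they appear on an MSF "Name:" line.
NAME_LINE = re.compile(
    r"Name:\s*(\S+)\s+Len:\s*(\d+)\s+Check:\s*(\d+)\s+Weight:\s*([\d.]+)"
)

def msf_entries(header_text):
    """List the sequences declared in an MSF header."""
    return [
        {"name": name, "len": int(length), "check": int(check), "weight": float(weight)}
        for name, length, check, weight in NAME_LINE.findall(header_text)
    ]

header = """ Name: a49171  Len: 425  Check: 537  Weight: 1.00
 Name: s65758  Len: 735  Check: 111  Weight: 1.00"""
entries = msf_entries(header)
print([entry["name"] for entry in entries])
```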

The List File Format: remember the @ sign!

!!SEQUENCE_LIST 1.0
An example GCG list file of many elongation 1a and Tu factors follows.
As with all GCG data files, two periods separate documentation from data.
..
my-special.pep begin:24 end:134
SwissProt:EfTu_Ecoli
Ef1a-Tu.msf{*}
/usr/accounts/test/another.rsf{ef1a_*}
@another.list

The ‘way’ SeqLab works!
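A list file can be read with the same "documentation, two periods, data" rule as other GCG files: each data line holds one sequence specification, optionally followed by attributes such as begin: and end:. The sketch below assumes attributes are whitespace-separated key:value tokens after the specification, as in the example entry; the function name is hypothetical and the example text is abbreviated.

```python
def read_list_file(text):
    """Return [(specification, {attribute: value}), ...] from GCG list-file text."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.rstrip().endswith(".."):
            break
    else:
        raise ValueError("no '..' divider line found")
    entries = []
    for line in lines[i + 1:]:
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        attrs = {}
        for token in tokens[1:]:
            key, _, value = token.partition(":")
            attrs[key] = value
        entries.append((tokens[0], attrs))
    return entries

example = """!!SEQUENCE_LIST 1.0
An example GCG list file.
..
my-special.pep begin:24 end:134
SwissProt:EfTu_Ecoli
Ef1a-Tu.msf{*}
@another.list
"""
entries = read_list_file(example)
print(entries[0])
```

Only the tokens after the first are treated as attributes, so the colon inside a database specification such as SwissProt:EfTu_Ecoli is left untouched.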

SeqLab: GCG’s X-based GUI!
SeqLab is the merger of Steve Smith’s Genetic Data Environment and GCG’s Wisconsin Package Interface: GDE + WPI = SeqLab.
It requires an X11 windowing environment, either native on UNIX computers (including LINUX, but not included in default Apple Mac OS X installs; see Apple’s free X11 package or XDarwin), or with X-server emulation software on Windows personal computers.

Conclusions: Gunnar von Heijne, in his old but quite readable treatise, Sequence Analysis in Molecular Biology: Treasure Trove or Trivial Pursuit (1987), provides a very appropriate conclusion:
• “Think about what you’re doing; use your knowledge of the molecular system involved to guide both your interpretation of results and your direction of inquiry; use as much information as possible; and do not blindly accept everything the computer offers you.”
He continues:
• “. . . if any lesson is to be drawn . . . it surely is that to be able to make a useful contribution one must first and foremost be a biologist, and only second a theoretician . . . . We have to develop better algorithms, we have to find ways to cope with the massive amounts of data, and above all we have to become better biologists. But that’s all it takes.”

FOR MORE INFO... Explore my Web Home: http://bio.fsu.edu/~stevet/cv.html and contact me (stevet@bio.fsu.edu) for further bioinformatics assistance and collaboration.

AND FOR EVEN MORE INFO... Many texts are becoming available in the field. To ‘honk-my-own-horn’ a bit, check out:
• From John Wiley & Sons, Inc., Current Protocols in Bioinformatics (http://www.does.org/cp/bioinfo.html);
• From Horizon Scientific Press, Computational Genomics: Theory and Application (http://www.horizonpress.com/hsp/books/com.html);
• From Humana Press, Introduction to Bioinformatics: A Theoretical And Practical Approach (http://www.humanapress.com/Product.pasp?txtCatalog=HumanaBooks&txtCategory=&txtProductID=1-58829-241-X&isVariant=0).
They all asked me to contribute chapters on multiple sequence alignment and analysis using GCG software.

Now for some practical examples — Some of my favorite multiple sequence files (RSF) in the SeqLab Editor: Human G-Protein coupled TM7 receptors and Elongation Factor 1. Now it’s your turn — on to the tutorial.
