Sequence analysis using EMBOSS & wEMBOSS by Martin Sarachu

• Based on the EMBOSS tutorial by Nikos Drakos, Val Curwen, David Martin, Gary Williams and many more. Find this tutorial at www.emboss.org
• Throughout this tutorial, we're going to look at members of the rhodopsin family of G-protein coupled receptors.
• The general principles are, of course, applicable to any sequences you would like to analyse.

Retrieving sequences from databases
• Hands-on: Look at the databases available with showdb (Information>>showdb)
• The output is a simple table displaying the names, contents and access methods for the databases.
• ID allows programs to extract a single explicitly named entry from the database, for example: embl:x13776
• Query indicates that programs can extract a set of entries matching a wildcard entry name, for example: sw:pax*_human
• All allows programs to analyse all the entries in the database sequentially, for example: embl:*
• Hands-on: Retrieve the sequence with identifier xlrhodop from the embl DB (Edit>>seqret)
• Hands-on: Copy the sequence to your current project and include it in nucList
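
The three access methods can be told apart from the address alone. The following is a minimal Python sketch, not EMBOSS code: the function names access_method and matching_entries are hypothetical, only the "*" wildcard is handled, and the entry names in the usage note are made up for illustration.

```python
from fnmatch import fnmatchcase


def access_method(address: str) -> str:
    """Classify a 'db:entry' address as 'id', 'query' or 'all'.

    Assumption: only '*' is treated as a wildcard in this sketch.
    """
    db, _, entry = address.partition(":")
    if not db or not entry:
        raise ValueError(f"expected 'db:entry', got {address!r}")
    if entry == "*":
        return "all"      # every entry in the database, e.g. embl:*
    if "*" in entry:
        return "query"    # a set of matching entry names, e.g. sw:pax*_human
    return "id"           # one explicitly named entry, e.g. embl:x13776


def matching_entries(address: str, entry_names: list[str]) -> list[str]:
    """Return the entry names matched by the entry part of the address.

    Matching is case-insensitive in this sketch.
    """
    _, _, pattern = address.partition(":")
    return [name for name in entry_names
            if fnmatchcase(name.lower(), pattern.lower())]
```

For example, access_method("sw:pax*_human") gives "query", and matching_entries("sw:pax*_human", ["PAX6_HUMAN", "PAX6_MOUSE"]) keeps only "PAX6_HUMAN".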

Getting information about sequences
• infoseq is a small utility that lists a sequence's USA (Uniform Sequence Address), name, accession number, type (nucleic or protein), length, percentage G+C (for nucleic), and/or description.
• Hands-on: Run infoseq (Information>>infoseq) with sequence xlrhodop in your project
• This sequence corresponds to a sequence in SwissProt that has the identifier OPSD_XENLA
• Hands-on: Retrieve the information about all OPSD sequences in SwissProt (sw DB, use the opsd_* wildcard)
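
The percentage G+C that infoseq reports is simple arithmetic on the sequence. A minimal Python sketch follows; gc_percent is a hypothetical helper, and it assumes every character counts towards the length, which may differ from how infoseq treats ambiguity codes.

```python
def gc_percent(seq: str) -> float:
    """Percentage of G and C bases in a nucleotide sequence.

    Assumption: all characters count towards the length, including
    any ambiguity codes such as N.
    """
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    gc_count = seq.count("G") + seq.count("C")
    return 100.0 * gc_count / len(seq)
```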

Pairwise sequence alignment
• An alignment is an arrangement of two sequences which shows where the two sequences are similar, and where they differ.
• The most intuitive representation of the comparison between two sequences uses dot-plots. One sequence is represented on each axis and significant matching regions are distributed along diagonals in the matrix.
• Hands-on: Upload sequence xl23808 from your computer to the current project and add it to nucList
• Hands-on: Make a dotplot with dottup between xl23808 and xlrhodop (Alignment>>Dot Plots>>dottup)
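
The idea behind a word-match dot-plot can be sketched in a few lines of Python. This is an illustration of the principle, not the dottup implementation; word_matches and its wordsize default are assumptions made for the example.

```python
def word_matches(seq_a: str, seq_b: str, wordsize: int = 4) -> list[tuple[int, int]]:
    """Return (i, j) pairs (0-based) where the words of length `wordsize`
    starting at seq_a[i] and seq_b[j] are identical.

    Each pair is one dot; runs of dots with a constant i - j form a diagonal.
    """
    # Index every word of seq_b by its start positions.
    index: dict[str, list[int]] = {}
    for j in range(len(seq_b) - wordsize + 1):
        index.setdefault(seq_b[j:j + wordsize], []).append(j)

    # Look up every word of seq_a in that index.
    dots = []
    for i in range(len(seq_a) - wordsize + 1):
        for j in index.get(seq_a[i:i + wordsize], []):
            dots.append((i, j))
    return dots
```

Plotting the returned pairs with seq_a on one axis and seq_b on the other gives the dot-plot; a larger wordsize removes more of the background noise.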

Global alignment
• A global alignment is one that compares the two sequences over their entire lengths, and is appropriate for comparing sequences that are expected to share similarity over the whole length.
• needle is an implementation of the Needleman-Wunsch algorithm for global alignment. The computation is rigorous and needle can be time consuming to run if the sequences are long.
• Hands-on: Do a global alignment between xlrhodop (1-470 region) and xl23808 (1110-1700 region) (Alignment>>Global>>needle)
• stretcher is another EMBOSS program for global alignment. It is less rigorous and therefore runs more quickly. Useful for DB searching.
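
The dynamic programming behind a Needleman-Wunsch global alignment can be sketched as follows. This is a simplified Python illustration: it uses a flat match/mismatch score and a single linear gap penalty (all three values are assumptions chosen for the example), whereas needle scores with a substitution matrix and separate gap opening and extension penalties. It returns only the optimal score, not the alignment.

```python
def global_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Optimal global alignment score of a and b (linear gap penalty)."""
    # Row 0: aligning the empty prefix of a against prefixes of b costs gaps only.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: prefix of a against the empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diag, up, left))
        prev = curr
    # Global alignment: the score sits in the last cell, covering both full lengths.
    return prev[-1]
```

Every cell is filled, so the work grows with the product of the two lengths, which is why a rigorous global alignment gets slow on long sequences.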

Local alignment
• Local alignment methods are very useful for scanning databases or when you do not know that the sequences are similar over their entire lengths.
• water is a rigorous implementation of the Smith-Waterman algorithm for local alignments.
• Hands-on: Perform a local alignment between xlrhodop & xl23808 (Alignment>>Local>>water)
• matcher is an EMBOSS program for local alignment. It is less rigorous and therefore runs more quickly. Useful for DB searching.
• supermatcher is designed for local alignments of very large sequences and is even less rigorous in its implementation. You can look at its documentation by clicking the “Manual” button on the program’s menu.
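
A Smith-Waterman local alignment differs from the global case in two ways: no cell may drop below zero, and the result is the best cell anywhere in the matrix. The Python sketch below illustrates this with a flat match/mismatch score and a linear gap penalty; these values are assumptions for the example and are not the scoring water uses.

```python
def local_score(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between any substrings of a and b."""
    best = 0
    prev = [0] * (len(b) + 1)  # row 0 is all zeros: an alignment may start anywhere
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Clamping at 0 lets a poor region be dropped instead of dragging the score down.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best
```

With these scores, two sequences that share only the stretch ACG score 6 however different their flanking regions are.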

Identifying the ORF
• We can get a rapid visual overview of the distribution of ORFs in the six frames of our sequence using plotorf.
• Hands-on: Run plotorf with sequence xlrhodop (Nucleic>>Translation>>plotorf)
• The longest ORF is in frame 2, from around position 100 to 1200.
• Hands-on: Identify the exact start and end points for translation with getorf (Nucleic>>Gene finding>>getorf). Look at output options! Translate your sequence between START and STOP codons.
• We know from plotorf that our ORF will be in the region 100 to 1200. Identify the actual start and end positions.
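
Finding regions between START and STOP codons in all six frames can be sketched in Python as below. This is an illustration, not the getorf algorithm. The assumptions: ATG is the only start codon, the standard stop codons are used, the reported end is the last base before the stop codon, and reverse-frame coordinates are given on the reverse-complemented strand.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.upper().translate(COMPLEMENT)[::-1]


def orfs_in_strand(seq: str) -> list[tuple[int, int, int]]:
    """Return (frame, start, end) for each ATG...stop region, 1-based inclusive."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # first ATG since the previous stop codon
            elif start is not None and codon in STOPS:
                # 0-based `pos` is the first base of the stop codon, so as a
                # 1-based number it is the last coding base.
                orfs.append((frame + 1, start + 1, pos))
                start = None
    return orfs


def six_frame_orfs(seq: str) -> list[tuple[int, int, int]]:
    """Forward frames are numbered 1..3, reverse frames -1..-3."""
    forward = orfs_in_strand(seq)
    reverse = [(-f, s, e) for f, s, e in orfs_in_strand(reverse_complement(seq))]
    return forward + reverse


def longest_orf(seq: str):
    """Longest region found, or None when there is none."""
    return max(six_frame_orfs(seq), key=lambda orf: orf[2] - orf[1], default=None)
```

With this numbering, a start codon whose first base is at position 110 falls in frame 2.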

Translating the sequence
• Hands-on: You should have found that the region to be translated is from 110 to 1171 in our cDNA sequence. Use transeq to translate that region (Nucleic>>Translation>>transeq)
• Hands-on: Copy xlrhodop.pep to your project and add it to protList
• pepinfo produces information on amino acid properties (size, polarity, aromaticity, charge, etc).
• Hands-on: Run pepinfo with xlrhodop.pep and examine the information it provides (Protein>>Composition>>pepinfo)
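
Translating a region is a codon-by-codon table lookup. The region 110 to 1171 spans 1062 bases, that is 354 codons. The Python sketch below uses the standard genetic code; translate_region is a hypothetical helper, and rejecting regions whose length is not a multiple of three is a choice made for this example rather than a description of transeq.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def translate_region(seq: str, start: int, end: int) -> str:
    """Translate the 1-based inclusive region start..end of a DNA sequence.

    Stop codons are written as '*'; codons containing other letters as 'X'.
    """
    region = seq.upper()[start - 1:end]
    if len(region) % 3:
        raise ValueError("region length is not a multiple of three")
    return "".join(CODON_TABLE.get(region[i:i + 3], "X")
                   for i in range(0, len(region), 3))
```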

Pattern matching
• In a number of cases, the active site of a protein can be recognized by a specific fingerprint or template, a fairly small set of residues that are unique to a family of proteins. An example is the sequence GXGXXG (where G=glycine and X=any amino acid), which defines a GTP binding site. Searching for a (rather loose) predefined string of characters in a sequence is called Pattern Matching.
• Hands-on: Use patmatmotifs to search your protein sequence for motifs defined in the PROSITE DB of protein families and domains (Protein>>Motifs>>patmatmotifs). Look at output options! Specify a full documentation output.
• In our case we already know that our sequence is a rhodopsin. However, if you had an unknown sequence, we hope you can see that identifying motifs might provide you with information to help you plan further experiments.
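
A pattern such as GXGXXG maps directly onto a regular expression. The Python sketch below handles only this simple one-letter notation with X as a wildcard; it does not parse PROSITE pattern syntax and is not how patmatmotifs is implemented. The function name and the peptide in the usage note are made up for illustration.

```python
import re


def find_pattern(pattern: str, protein: str) -> list[int]:
    """Return the 1-based start positions of every match, overlaps included.

    'X' in the pattern stands for any residue; other letters match themselves.
    """
    body = "".join("." if ch == "X" else re.escape(ch) for ch in pattern.upper())
    # A lookahead matches without consuming characters, so overlapping hits are found.
    regex = re.compile("(?=" + body + ")")
    return [m.start() + 1 for m in regex.finditer(protein.upper())]
```

For example, find_pattern("GXGXXG", "MAGAGKSGLL") reports a single hit starting at residue 3.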

Protein fingerprints
• PRINTS is a database that defines functional protein families, identifying each domain by a number of short, particularly well conserved sequences.
• A full match to one of these "fingerprints" will match all the relevant short sequences in the correct order.
• A partial match is recorded if some are missing or if they occur in an incorrect order.
• Hands-on: Use pscan with your peptide sequence and examine the matches. (Protein>>Motifs>>pscan)

Multiple Sequence Analysis
• One of the most popular programs for performing multiple sequence alignments is clustalw. The EMBOSS interface to clustal is emma.
• pscan has told us that our sequence belongs to the rhodopsin family. We will now retrieve some further members of the family from SwissProt and produce a multiple alignment; we'll then use this multiple alignment to produce a profile of this group of sequences and use that to align them all to our original sequence.
• Hands-on: Use seqret to retrieve a set of sequences from the SwissProt DB; use the ops2_* wildcard to get all sequences whose identifiers begin with ops2_
• Hands-on: Copy the output file to your project, rename it to ops2.fasta and add it to protList.

Multiple Sequence Analysis
• Hands-on: Align these sequences using emma (Alignment>>Multiple>>emma). It will produce an alignment and a dendrogram.
• We have aligned ops2 sequences from two fruit fly species, two crab species, locust and scallop.
• Hands-on: Copy the alignment to your project, and view it. The sequences are similar, but there are differences. Add the alignment to your protList.
• Hands-on: prettyplot will give you a clearer view of differences by aligning the sequences on top of one another. (Alignment>>Multiple>>prettyplot)
• Identical residues are shown in red, and similar residues in green. This type of display can give you a first impression of regions of conservation.
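
Spotting fully conserved columns in an alignment can be sketched in Python as follows. This covers only the "identical residues" part of such a display: marking similar residues would need a substitution matrix, which is left out. The function name and the toy alignment are made up for illustration and are unrelated to the ops2 data.

```python
def identical_columns(aligned: list[str]) -> str:
    """Return a marker line with '*' under columns where every sequence
    has the same residue and a space elsewhere (gap columns are not marked)."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    return "".join(
        "*" if len(set(column)) == 1 and column[0] != "-" else " "
        for column in zip(*aligned)
    )


toy_alignment = ["MKV-L", "MRV-L", "MKVAL"]
for row in toy_alignment:
    print(row)
print(identical_columns(toy_alignment))
```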

Profiles
• Profile analysis is a sequence comparison method for finding and aligning distantly related sequences. The comparison allows a new sequence to be aligned optimally to a family of similar sequences.
• Hands-on: prophecy is an EMBOSS program for creating a profile from a set of multiple aligned sequences. Create a profile from the ops2 alignment (Protein>>Profiles>>prophecy). Look at output options! Specify a Gribskov profile type. When prophecy finishes, copy the profile to your current project.
• Hands-on: Use prophet to align xlrhodop.pep to the ops2 profile. (Protein>>Profiles>>prophet)
• The vertical bars (|) represent residues that are identical between the ops2 consensus and our rhodopsin, while the colons (:) represent conservative substitutions. We hope you can see that aligning members of a family can reveal conserved regions that may be important for structure and/or function.
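
The idea of a profile can be illustrated with a plain residue-frequency table per alignment column. Note that this Python sketch is a deliberately simpler stand-in: it is not a Gribskov profile, it allows no gaps when scoring, and it is not what prophecy or prophet compute. All names and the toy alignment in the usage note are made up for illustration.

```python
from collections import Counter


def frequency_profile(aligned: list[str]) -> list[dict[str, float]]:
    """One dict of residue frequencies per alignment column (gaps ignored)."""
    profile = []
    for column in zip(*aligned):
        counts = Counter(residue for residue in column if residue != "-")
        total = sum(counts.values())
        profile.append({r: n / total for r, n in counts.items()} if total else {})
    return profile


def consensus(profile: list[dict[str, float]]) -> str:
    """Most frequent residue per column, '-' for all-gap columns."""
    return "".join(max(col, key=col.get) if col else "-" for col in profile)


def best_ungapped_hit(profile: list[dict[str, float]], seq: str):
    """Slide the profile along seq and return (score, 1-based start) of the
    best window, or None if seq is shorter than the profile."""
    width = len(profile)
    best = None
    for start in range(len(seq) - width + 1):
        score = sum(col.get(seq[start + k], 0.0) for k, col in enumerate(profile))
        if best is None or score > best[0]:
            best = (score, start + 1)
    return best
```

With the toy alignment ["MKVL", "MRVL", "MKIL"], the consensus is MKVL and the best window in AAMKVLAA starts at residue 3.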

Conclusion
• We have shown you some of the programs available within EMBOSS, and have introduced you to the way you can run these programs from wEMBOSS.
• You can search for EMBOSS programs within wEMBOSS from the “Search for programs” frame.
• You can examine individual program documentation from the program menu.
• You can get a listing of all EMBOSS programs from wossname (Information>>wossname)
• EMBOSS site: www.emboss.org
• wEMBOSS site: www.ar.embnet.org/wEMBOSS
