Protein Structure Prediction

Protein Structure Prediction Sequence database searching Domain assignment Multiple sequence alignment Comparative or homology modeling Secondary structure prediction

Homologous Proteins • The term of homology as used in a biological context is defined as similarity of structure, physiology, development and evolution of organisms based upon common genetic factors. • The statement that two proteins are homologous implies that their genes have evolved from a common ancestral gene. Usually they might have similar functions. • Two proteins are considered to be homologous when they have identical amino acid residues in a significant number of sequential positions along the polypeptide chains (> 30 %). • Homologous proteins have conserved structural cores and variable loop regions.

The Divergence of Amino-acid Sequence and 3D Structure for the Core Region of Homologous Proteins • Known structures of 32 pairs of homologous proteins such as globins, serine proteinases, and immunoglobulin domains have been compared. The root mean square deviation of the main-chain atoms of the core regions is plotted as a function of amino acid homology. The curve represents the best fit of the dots to an exponential function. Pairs with high sequence homology are almost identical in three-dimensional structure, whereas deviations in atomic positions for pairs of low homology are on the order of 2 Å.

A Generalized Approach to Predicting Protein Structure • Relevant experimental data • Sequence data/preliminary analysis • Sequence Database searching • Domain assignment • Multiple sequence alignment • Comparative or homology modeling • Secondary structure prediction • Fold Recognition • Analysis of folds and alignment of secondary structures • Sequence to structure alignment

Flow Chart • This flowchart assumes that the protein is soluble, likely comprises a single domain, and does not contain non-globular regions.

Experimental Data Much experimental data can aid the structure prediction process. Some of these are listed below: • Disulphide bonds, which provide tight restraints on the location of cysteines in space • Spectroscopic data, which can give ideas as to the secondary structure content of the protein • Site-directed mutagenesis studies, which can give insights as to residues involved in active or binding sites • Knowledge of proteolytic cleavage sites, post-translational modifications, such as phosphorylation or glycosylation can suggest residues that must be accessible, etc. • Remember to keep all of the available data in mind when doing predictive work. Always ask whether a prediction agrees with the results of experiments. If not, then it may be necessary to modify what has been completed.

Protein Sequence Data • There is some value in doing some initial analysis on the protein sequence. If a protein has come (for example) directly from a gene prediction, it may consist of multiple domains. More seriously, it may contain regions that are unlikely to be globular, or soluble. • Is the protein a transmembrane protein, or does it contain transmembrane segments? There are many methods for predicting these segments, including: • TMAP (EMBL) http://www.mbb.ki.se/tmap/index.html • PredictProtein (EMBL/Columbia) http://dodo.cpmc.columbia.edu/predictprotein/ • TMHMM (CBS, Denmark) • TMpred (Baylor College) • DAS (Stockholm)

http://www.mbb.ki.se/tmap/index.html

COILS - Prediction of Coiled Coil Regions in Proteins • Does the protein contain coiled-coils? Prediction of coiled coils can be completed at the COILS server or by downloading the COILS program. http://www.ch.embnet.org/software/COILS_form.html • COILS is a program that compares a sequence to a database of known parallel two-stranded coiled-coils and derives a similarity score. By comparing this score to the distribution of scores in globular and coiled-coil proteins, the program then calculates the probability that the sequence will adopt a coiled-coil conformation. COILS was described in: Lupas, A., Van Dyke, M., and Stock, J. (1991) Predicting Coiled Coils from Protein Sequences, Science 252:1162-1164.

Does the Protein Contain Regions of Low Complexity? • Proteins frequently contain runs of poly-glutamine or poly-serine, which do not predict well. To check for this the program SEG (a version of SEG is also contained within the GCG suite of programs) can be employed. ftp://ftp.ncbi.nlm.nih.gov/pub/seg/seg/ • If the answer to any of the above questions is yes, then it is worthwhile trying to break the sequence into pieces or ignore particular sections of the sequence, etc. This is related to the problem of locating domains.

Multiple Sequence Alignment Alignments can provide: • Information to protein domain structure • The location of residues likely to be involved in protein function • Information of residues likely to be buried in the protein core or exposed to solvent • More information on a single sequence for applications like homology modeling and secondary structure prediction.

Sequence Database Searching • The most obvious first stage in the analysis of any new sequence is to perform comparisons with sequence databases to find homologues. These searches can now be performed just about anywhere and on just about any computer. In addition, there are numerous web servers for doing searches, where one can post or paste a sequence into the server and receive the results interactively.

Sequence Database Searching • There are many methods for sequence searching. By far the most well known are the BLAST suite of programs. One can easily obtain versions to run locally (either at NCBI or Washington University), and there are many web pages that permit one to compare a protein or DNA sequence against a multitude of gene and protein sequence databases. To name just a few: • National Center for Biotechnology Information (USA) Searches • http://www.ncbi.nlm.nih.gov/BLAST/ • European Bioinformatics Institute (UK) Searches • http://www2.ebi.ac.uk/ • BLAST search through SBASE (domain database; ICGEB, Trieste)

BLAST • One of the most important advances in sequence comparison recently has been the development of both gapped BLAST and PSI-BLAST (position specific interated BLAST). • Both of these have made BLAST much more sensitive, and the latter is able to detect very remote homologues by taking the results of one search, constructing a profile and then using this to search the database again to find other homologues (the process can be repeated until no new sequences are found). • It is essential that one compares any new protein sequence to the database with PSI-BLAST to see if known structures can be found prior to doing any of the other methods discussed in the next sections.

Sequence Database Searching Other methods for comparing a single sequence to a database include: • The FASTA suite (William Pearson, University of Virginia, USA) • http://alpha10.bioch.virginia.edu/fasta/ • SCANPS (Geoff Barton, European Bioinformatics Institute, UK) • http://barton.ebi.ac.uk/new/software.html • BLITZ (Compugen's fast Smith Waterman search) • http://www2.ebi.ac.uk/bic_sw/

Multiple Sequence Database Searching • It is also possible to use multiple sequence information to perform more sensitive searches. Essentially this involves building a profile from some kind of multiple sequence alignment. A profile essentially gives a score for each type of amino acid at each position in the sequence, and generally makes searches more sensitive. Tools for doing this include: • PSI-BLAST (NCBI, Washington) • ProfileScan Server (ISREC, Geneva) • http://www.isrec.isb-sib.ch/software/PFSCAN_form.html • HMMER Hidden Markov Model searching (Sean Eddy, Washington University) • http://hmmer.wustl.edu/ • Wise package (Ewan Birney, Sanger Centre; this is for protein versus DNA comparisons) and several others. • http://www.sanger.ac.uk/Software/Wise2/

Multiple Sequence Searching Using a Motif • A different approach for incorporating multiple sequence information into a database search is to use a MOTIF. Instead of giving every amino acid some kind of score at every position in an alignment, a motif ignores all but the most invariant positions in an alignment, and just describes the key residues that are conserved and define the family. Sometimes this is called a "signature". • For example, "H-[FW]-x-[LIVM]-x-G-x(5)-[LV]-H-x(3)-[DE]" describes a family of DNA binding proteins. It can be translated as "histidine, followed by either phenylalanine or tryptophan, followed by any amino acid (x), followed by leucine, isoleucine, valine or methionine, followed by any amino acid (x), followed by glycine, . . . [etc.]".

Multiple Sequence Searching Using a Motif • PROSITE (ExPASy Geneva) contains a huge number of such patterns, and several sites allow you to search these data: ExPASy http://www.expasy.ch/tools/scnpsite.html EBI http://www2.ebi.ac.uk/ppsearch/ • It is best to search a few different databases in order to find as many homologues as possible. A very important thing to do, and one which is sometimes overlooked, is to compare any new sequence to a database of sequences for which 3D structure information is available. Whether or not the sequence is homologous to a protein of known 3D structure is not obvious in the output from many searches of large sequence databases. Moreover, if the homology is weak, the similarity may not be apparent at all during the search through a larger database. • One can save a lot of time by making use of pre-prepared protein alignment.

Web sites for Performing Multiple Alignment • EBI (UK) Clustalw Server • http://www2.ebi.ac.uk/clustalw/ • IBCP (France) Multalin Server • http://www.ibcp.fr/multalin.html • IBCP (France) Clustalw Server • IBCP (France) Combined Multalin/Clustalw • MSA (USA) Server • http://www.ibc.wustl.edu/ibc/msa.html • BCM Multiple Sequence Alignment ClustalW Sever • http://dot.imgen.bcm.tmc.edu:9331/multi-align/Options/clustalw.html

Some Tips for Sequence Alignment • Don't just take everything found in the searches and feed them directly into the alignment program. Searches will almost always return matches that do not indicate a significant sequence similarity. Look through the output carefully and throw things out if they don't appear to be a member of the sequence family. Inclusion of non-members in the alignment will confuse things and likely lead to errors later. • Remember that the programs for aligning sequences aren't perfect, and do not always provide the best alignment. This is particularly so for large families of proteins with low sequence identities. If a better way of aligning the sequences is discovered, then by all means edit the alignment manually.

Locating Domains • If the sequence has more than about 500 amino acids, it is almost certain that it will be divided into discrete functional domains. If possible, it is preferable to split such large proteins up and consider each domain separately. One can predict the location of domains in a few different ways. The methods below are given (approximately) from the most to the least confident. • If homology to other sequences occurs only over a portion of the probe sequence and the other sequences are whole (i.e. not partial sequences), then this provides the strongest evidence for domain structure. Either complete database searches or make use of pre-defined databases of protein domains. Searches of these databases (see links below) will often assign domains easily.

Locating domains • Regions of low-complexity often separate domains in multi-domain proteins. Long stretches of repeated residues, particularly Proline, Glutamine, Serine or Threonine often indicate linker sequences and are usually a good place to split proteins into domains. • Low complexity regions can be defined using the program SEG which is generally available in most BLAST distributions or web servers. • Transmembrane segments are also very good dividing points, since they can easily separate extracellular from intracellular domains.

Locating Domains • Something else to consider are the presence of coiled-coils. These unusual structural features sometimes (but not always) indicate where proteins can be divided into domains. • Secondary structure prediction methods will often predict regions of proteins to have different protein structural classes. For example, one region of a sequence may be predicted to contain only a helices and another to contain only b sheets. These can often, though not always, suggest likely domain structure. • If a sequence has been separated into domains, then it is very important to repeat all the database searches and alignments using the domains separately. Searches with sequences containing several domains may not find all sub-homologies, particularly if the domains are abundant in the database (e.g. kinases, SH2 domains, etc.).

Domain Assignment

Locating Domains by Web Sites • SMART (Oxford/EMBL) • http://smart.embl-heidelberg.de/ • PFAM (Sanger Center/Wash-U/Karolinska Intitutet) • http://www.sanger.ac.uk/Software/Pfam/search.shtml • COGS (NCBI) • PRINTS (UCL/Manchester) • BLOCKS (Fred Hutchinson Cancer Research Center, Seattle) • http://blocks.fhcrc.org/blocks/blocks_search.html • SBASE (ICGEB, Trieste) • Domain descriptions can also be located in the annotations in SWISSPROT.

P68 RNA Helicase • ssyssdrdr grdrgfgapr fggsrtgpls gkkfgnpgek lvkkkwnlde lpkfeknfyqehpdlarrta qevdtyrrsk eitvrghncp kpvlnfyean fpanvmdvia rhnfteptai qaqgwpvals gldmvgvaqt gsgktlsyll paivhinhhp flergdgpic lvlaptrelaqqvqqvaaey cracrlkstc iyggapkgpq irdlergvei ciatpgrlid flecgktnlrrttylvldea drmldmgfep qirkivdqir pdrqtlmwsa twpkevrqla edflkdyihinigalelsan hnilqivdvc hdvekdekli rlmeeimsek enktivfvet krrcdeltrkmrrdgwpamg ihgdksqqer dwvlnefkhg kapiliatdv asrgldvedv kfvinydypnssedyihrig rtarstktgt aytfftpnni kqvsdlisvl reanqainpk llqlvedrgs grsrgrggmk ddrrdrysag krggfntfrd renydrgysn llkrdfgakt qngvysaanytngsfgsnfv sagiqtsfrt gnptgtyqng ydstqqygsn vanmhngmnq qayaypvpqp apmigypmpt gysq 614 aa f015812 (Genebank)

Sequence Alignment of p68 to DEAD Proteins Walker A AXTGSGKT Walker A motif for ATP binding DEAD ATP binding, ATP hydrolysis SAT Transmission energy from ATP to unwind RNA

P68 RNA Helicase

Comparative or Homology Modeling • If the protein sequence shows significant homology to another protein of known three-dimensional structure, then a fairly accurate model of the protein 3D structure can be obtained via homology modeling. • It is also possible to build models if one has found a suitable fold via fold recognition and is satisfied with the alignment of sequence to structure (Note that the accuracy of models constructed in this manner has not been assessed properly, so treat with caution).

Comparative or Homology Modeling • It is possible now to generate models automatically using the very useful SWISSMODEL server. It is possible to send in a protein sequence only when the degree of sequence homology is high (50% or greater). It is best, particularly if one has edited an alignment, to send an alignment directly to the server. • http://www.expasy.ch/swissmod/SWISS-MODEL.html Some other sites useful for homology modeling include: • WHAT IF (G. Vriend, EMBL, Heidelberg) • http://www.cmbi.kun.nl/whatif/ • MODELLER (A. Sali, Rockefeller University) • http://guitar.rockefeller.edu/modeller/modeller.html • MODELLER Mirror FTP site

Swiss-Model of P68 Based on EIF-4A DEAD SAT • EIF-4A is the initiation factor (1QAV) with 1.8 Å resolution. Walker A AQSGTGKT

Methods for Single Sequences • Secondary structure prediction has been around for almost a quarter of a century. The early methods suffered from a lack of data. Predictions were performed on single sequences rather than families of homologous sequences, and there were relatively few known 3D structures from which to derive parameters. Probably the most famous early methods are those of Chou & Fasman, Garnier, Osguthorbe & Robson (GOR) and Lim. • Although the authors originally claimed quite high accuracies (70 - 80 %), under careful examination, the methods were shown to be only between 56 and 60 % accurate (Kabsch & Sander, 1984). An early problem in secondary structure prediction had been the inclusion of structures used to derive parameters in the set of structures used to assess the accuracy of the method.

Methods for Single Sequences Early methods on single sequences • Chou, P.Y. & Fasman, G.D. (1974). Biochemistry, 13, 211-222. • Lim, V.I. (1974). Journal of Molecular Biology, 88, 857-872. • Garnier, J., Osguthorpe, D.~J. \& Robson, B. (1978).Journal of Molecular Biology, 120, 97-120. • Kabsch, W. & Sander, C. (1983). FEBS Letters, 155, 179-182. (An assessment of the above methods) Later methods on single sequences • Deleage, G. & Roux, B. (1987). Protein Engineering , 1, 289-294 (DPM) • Presnell, S.R., Cohen, B.I. & Cohen, F.E. (1992). Biochemistry, 31, 983-993. • Holley, H.L. & Karplus, M. (1989). Proceedings of the National Academy of Science, 86, 152-156. • King, R. & Sternberg, M. J.E. (1990). Journal of Molecular Biology, 216, 441-457. • D. G. Kneller, F. E. Cohen & R. Langridge (1990) Improvements in Protein Secondary Structure Prediction by an • Enhanced Neural Network, Journal of Molecular Biology, 214, 171-182. (NNPRED)

Assignment of Amino Acids

Frequency of Occurrence of Amino Acids in the b Turns

Secondary Structure Prediction Methods & Links There are now many web servers for structure prediction, here is a quick summary: • PSI-pred (PSI-BLAST profiles used for prediction; David Jones, Warwick) • JPRED Consensus prediction (Cuff & Barton, EBI) • http://barton.ebi.ac.uk/servers/jpred.html • PREDATORFrischman & Argos (EMBL) • http://www.embl-heidelberg.de/cgi/predator_serv.pl • PHD home page Rost & Sander, EMBL, Germany • http://www.embl-heidelberg.de/predictprotein/predictprotein.html • ZPRED server Zvelebil et al., Ludwig, U.K. • http://kestrel.ludwig.ucl.ac.uk/zpred.html (GOR) • nnPredict Cohen et al., UCSF, USA. • http://www.cmpharm.ucsf.edu/~nomi/nnpredict.html • BMERC PSA Server Boston University, USA • http://bmerc-www.bu.edu/psa/ • SSP (Nearest-neighbor) Solovyev and Salamov, Baylor College, USA. • http://dot.imgen.bcm.tmc.edu:9331/pssprediction/pssp.html

Recent Improvements • The availability of large families of homologous sequences revolutionized secondary structure prediction. • Traditional methods, when applied to a family of proteins rather than a single sequence, proved much more accurate at identifying core secondary structure elements. The combination of sequence data with sophisticated computing techniques such as neural networks has lead to accuracies well in excess of 70 %. Though this seems a small percentage increase, these predictions are actually much more useful than those for single sequence, since they tend to predict the core accurately. • Moreover, the limit of 70 – 80 % may be a function of secondary structure variation within homologous proteins.

Protein Structure Prediction