Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them Maya Schushan, The Ben-Tal lab

OUTLINE • Sequence: • Databases, domains, motifs & annotations • Structure: • Secondary structure, structure databases, visualization and identification of functional sites

Sequences, domains, motifs & annotations UniProt • UniProt is a collaboration between the European Bioinformatics Institute (EBI), the Swiss Institute of Bioinformatics (SIB) and the Protein Information Resource (PIR). • In 2002, the three institutes decided to pool their resources and expertise and formed the UniProt Consortium.

Sequences, domains, motifs & annotations UniProt • The world's most comprehensive catalog of information on proteins • Sequence, function & more… • Comprised mainly of two databases: • SwissProt – 366226 entries last year, 412525 protein entries now – high-quality annotation, non-redundant & cross-referenced to many other databases. • TrEMBL – 5708298 entries last year, 7341751 protein entries now – computer translation of the genetic information from the EMBL Nucleotide Sequence Database; as a result, many proteins are poorly annotated, since only automatic annotation is generated.

Sequences, domains, motifs & annotations UniProt • Annotation description includes: • Function(s) of the protein; • Posttranslational modification(s) such as carbohydrates, phosphorylation, acetylation and GPI-anchor; • Domains and sites, for example, calcium-binding regions, ATP-binding sites, zinc fingers, homeoboxes; • Secondary structure, e.g. alpha helix, beta sheet; • Quaternary structure, e.g. homodimer, heterotrimer, etc.; • Similarities to other proteins; • Disease(s) associated with any number of deficiencies in the protein; • Sequence conflicts, variants, etc.

Sequences, domains, motifs & annotations UniProt • Connected to many other databases (e.g. Pfam, PROSITE, EC, GO, PDBsum, PDB (to be discussed…)) • Each sequence has a unique 6-character accession (letters and digits, e.g. P05102) • Entries in SwissProt also have IDs, which usually make sense (e.g. CADH1_HUMAN for a human cadherin) • Download sequence in FASTA format

Sequences, domains, motifs & annotations UniProt: http://www.uniprot.org/ Type accession: P05102 Or ID: MTH1_HAEPH

Sequences, domains, motifs & annotations

Sequences, domains, motifs & annotations General data: name, origin, EC (enzymatic reaction)…

Sequences, domains, motifs & annotations Functional data, including the GO annotations Scroll down to find the sequence & download the FASTA

Sequences, domains, motifs & annotations Known sites, predicted/known secondary structures, Natural variation or mutagenesis

Sequences, domains, motifs & annotations The protein’s sequence in FASTA format Download Send to BLAST
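For readers who download the FASTA file, a minimal Python sketch of reading it may help. It assumes the usual layout: a header line starting with ">" followed by one or more sequence lines, with UniProt-style headers of the form db|accession|entry_name followed by a description. The record below is made up for illustration and is not a real UniProt entry; the function names are hypothetical.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_uniprot_header(header):
    """Split a header of the assumed form 'db|accession|entry_name description'."""
    ident, _, description = header.partition(" ")
    db, accession, entry_name = ident.split("|")
    return {
        "db": db,
        "accession": accession,
        "entry_name": entry_name,
        "description": description,
    }


# Made-up record for illustration only; not a real UniProt entry.
EXAMPLE = """>sp|X00000|TOY_EXAMP Hypothetical toy protein
MKTAYIAKQR
QISFVKSHFS
"""

records = list(parse_fasta(EXAMPLE))
header, sequence = records[0]
fields = split_uniprot_header(header)
print(fields["accession"], fields["entry_name"], len(sequence))
```

Sequence lines are joined without separators, so a sequence wrapped over several lines comes back as one string.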

Sequences, domains, motifs & annotations References for all the information on the page: important to take a look…

Sequences, domains, motifs & annotations Connections to other databases • Other sequence databases, e.g. GenBank • Related structures in the PDB (if available) • Model structure in the ModBase database (automatically derived!) • All sorts of domain/motif databases • The family related to the entry

Sequences, domains, motifs & annotations • Pfam- domain database • Proteins are generally composed of one or more functional regions, commonly termed domains. • Different combinations of domains give rise to the diverse range of proteins found in nature. • The identification of domains that occur within proteins can therefore provide insights into their function.

Sequences, domains, motifs & annotations • Pfam- domain database • The Pfam database is a large collection of protein domain families. • Each family is represented by multiple sequence alignments and hidden Markov models (HMMs). • Pfam entries are classified in one of four ways: • Family: A collection of related proteins • Domain: A structural unit which can be found in multiple protein contexts • Repeat: A short unit which is unstable in isolation but forms a stable structure when multiple copies are present • Motif: A short unit found outside globular domains

Sequences, domains, motifs & annotations • Pfam- domain database • There are two components to Pfam: • Pfam-A: high quality, manually curated families. These Pfam-A entries cover a large proportion of the sequences in the sequence database. • Pfam-B: automatically generated entries. Although of lower quality, Pfam-B families can be useful for identifying functionally conserved regions when no Pfam-A entries are found. • Pfam also generates higher-level groupings of related families, known as clans. A clan is a collection of Pfam-A entries which are related by similarity of sequence, structure or profile-HMM.

Sequences, domains, motifs & annotations • Pfam- domain database • The Pfam website (http://pfam.sanger.ac.uk/) allows you to: • Analyze your protein sequence for Pfam matches • View Pfam family annotation and alignments • See groups of related families • Look at the domain organization of a protein sequence • Find the domains on a PDB structure • Query Pfam by keyword

Sequences, domains, motifs & annotations Pfam- domain database Searching for a certain protein accession

Sequences, domains, motifs & annotations Pfam- domain database

Sequences, domains, motifs & annotations • Other domain/motif databases: • PROSITE • InterPro • BLOCKS • SMART • Etc…

Sequences, domains, motifs & annotations • Classifying protein function • Each protein performs one (or more…) specific functions. This can be, e.g., catalysis of a specific enzymatic reaction, transport of an ion, interaction with a DNA molecule, etc. • To make these specific functions easy to refer to, attempts have been made to enumerate and classify the various functions performed by proteins.

Sequences, domains, motifs & annotations • Classifying protein function Example: some of the diverse functions exhibited by membrane proteins.

Sequences, domains, motifs & annotations • Enzyme Commission number (EC number) • A numerical classification scheme for enzymes, based on the chemical reactions they catalyze • EC numbers do not specify enzymes, but enzyme-catalyzed reactions. If different enzymes (for instance from different organisms) catalyze the same reaction, then they receive the same EC number. • By contrast, the UniProt database identifiers uniquely specify a protein by its amino acid sequence.

Sequences, domains, motifs & annotations • Enzyme Commission number (EC number) • Every enzyme code consists of the letters "EC" followed by four numbers separated by periods. Those numbers represent a progressively finer classification of the enzyme. • For example, the tripeptide aminopeptidases have the code "EC 3.4.11.4": • EC 3 enzymes are hydrolases (enzymes that use water to break up some other molecule) • EC 3.4 are hydrolases that act on peptide bonds • EC 3.4.11 are those hydrolases that cleave off the amino-terminal amino acid from a polypeptide • EC 3.4.11.4 are those that cleave off the amino-terminal end from a tripeptide

Sequences, domains, motifs & annotations • Enzyme Commission number (EC number) • For example, the tripeptide aminopeptidases have the code "EC 3.4.11.4", as shown for an enzyme from Lactobacillus helveticus in BRENDA, the Comprehensive Enzyme Information System:

Sequences, domains, motifs & annotations • Enzyme Commission number (EC number) • EC 1 - Oxidoreductases • EC 2 - Transferases • EC 3 - Hydrolases • EC 4 - Lyases • EC 5 - Isomerases • EC 6 - Ligases
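The hierarchical reading of an EC number described above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the class table simply repeats the six top-level classes listed on this slide.

```python
# Top-level EC classes, as listed on the slide.
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
}


def _fields(ec_number):
    """Return the four dot-separated fields of an EC number as strings."""
    fields = ec_number.replace("EC", "").strip().split(".")
    if len(fields) != 4:
        raise ValueError(f"expected four fields in {ec_number!r}")
    return fields


def ec_levels(ec_number):
    """Return the progressively finer prefixes of an EC number."""
    fields = _fields(ec_number)
    return ["EC " + ".".join(fields[:i]) for i in range(1, 5)]


def ec_class(ec_number):
    """Return the name of the top-level class of an EC number."""
    return EC_CLASSES[int(_fields(ec_number)[0])]


print(ec_levels("EC 3.4.11.4"))
print(ec_class("EC 3.4.11.4"))
```

Because the classification is by reaction, two unrelated enzymes that catalyze the same reaction yield the same list of levels.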

Sequences, domains, motifs & annotations • Gene Ontology • A collaborative effort to address the need for consistent descriptions of gene products in different databases • The GO project has developed three structured controlled vocabularies (ontologies) that describe gene products in terms of their associated biological processes, cellular components and molecular functions in a species-independent manner. • The use of GO terms by collaborating databases facilitates uniform queries across them. The controlled vocabularies are structured so that they can be queried at different levels.

Sequences, domains, motifs & annotations Gene Ontology Cellular component A cellular component is just that, a component of a cell, but with the proviso that it is part of some larger object; this may be an anatomical structure (e.g. rough endoplasmic reticulum or nucleus) or a gene product group (e.g. ribosome, proteasome or a protein dimer).

Sequences, domains, motifs & annotations Gene Ontology Biological process A biological process is a series of events accomplished by one or more ordered assemblies of molecular functions. Examples of biological process terms are signal transduction or pyrimidine metabolism. It can be difficult to distinguish between a biological process and a molecular function, but the general rule is that a process must have more than one distinct step.

Sequences, domains, motifs & annotations Gene Ontology Molecular function Molecular function describes activities, such as catalytic or binding activities, that occur at the molecular level. Molecular functions generally correspond to activities that can be performed by individual gene products, but some activities are performed by assembled complexes of gene products. Examples of broad functional terms are catalytic activity, transporter activity, or binding; examples of narrower functional terms are adenylate cyclase activity or Toll receptor binding.

Sequences, domains, motifs & annotations Gene Ontology Topology The ontologies are in the form of directed acyclic graphs (DAGs), with the graph nodes being GO terms. The ontologies are hierarchically structured, but a more specialized term (child) can be related to more than one less specialized term (parent). E.g. the biological process "hexose biosynthetic process" has two parents, "hexose metabolic process" and "monosaccharide biosynthetic process", because a biosynthetic process is a type of metabolic process and a hexose is a type of monosaccharide. When any gene is involved in hexose biosynthetic process, it is automatically annotated to both hexose metabolic process and monosaccharide biosynthetic process.
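The automatic annotation to parent terms can be sketched as a walk up the DAG. The Python below is a toy illustration, not the GO project's own tooling: the graph contains only the child-to-parent edges named on this slide, and the gene name and function names are hypothetical.

```python
# Child -> parents edges; only the relationships stated on the slide.
PARENTS = {
    "hexose biosynthetic process": [
        "hexose metabolic process",
        "monosaccharide biosynthetic process",
    ],
}


def ancestors(term, parents):
    """Return every term reachable from `term` by following parent edges."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagate(annotations, parents):
    """Extend each gene's direct annotations with all ancestor terms."""
    result = {}
    for gene, terms in annotations.items():
        full = set(terms)
        for term in terms:
            full |= ancestors(term, parents)
        result[gene] = full
    return result


# "geneX" is a hypothetical gene name used only for this example.
direct = {"geneX": {"hexose biosynthetic process"}}
print(sorted(propagate(direct, PARENTS)["geneX"]))
```

The `seen` set matters in a DAG: a term with several children can be reached along more than one path, and it should be visited only once.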

Sequences, domains, motifs & annotations Gene Ontology Example

Sequences, domains, motifs & annotations Gene Ontology Interface Search by gene or protein accession http://www.geneontology.org/

Sequences, domains, motifs & annotations • Summary of the first part: protein sequence databases and tools • UniProt: the most comprehensive protein sequence database. Connected to many other databases and resources. • Pfam: domain database. Many others… InterPro, PROSITE, BLOCKS etc. • EC and GO classifications of protein function

OUTLINE • Sequence: • Databases, domains, motifs & annotations • Structure: • Secondary structure, structure databases, visualization and identification of functional sites

Investigating & visualizing protein structures From Sequence to Structure • All information about the native structure of a protein is encoded in the amino acid sequence + its native solution environment. • Many conformations are possible, yet only one or a few native folds are exhibited for each protein (Levinthal’s paradox) • Protein folding is driven by various forces: • Ionic forces • Hydrogen bonds • The hydrophobic effect • . . .

Investigating & visualizing protein structures • Secondary Structure Prediction • Why predict secondary structures of proteins? • 1) When the structure of the protein is still unknown, this can serve as the first step for structure prediction: first predict the secondary structures, then how they are arranged together. • 2) For calculating better multiple sequence alignments or pairwise alignments.

Investigating & visualizing protein structures Predicting 2° Structure • Each amino acid has a different propensity for being in each 2° structure. • For example, Proline causes a kink which destroys the helix structure. Thus, Proline is usually found only at the helix end. • The different structures also have typical lengths.

Investigating & visualizing protein structures Predicting 2° Structure http://www.predictprotein.org/

Investigating & visualizing protein structures Predicting 2° Structure All these and more…

Investigating & visualizing protein structures Predicting 2° Structure • Input: Sequence • Output: Secondary structure prediction, globular regions, coiled-coil regions, transmembrane helices, PROSITE motifs, bound cysteines… • The Meta Predict Protein server now allows many other options… • http://www.predictprotein.org/meta.php

Investigating & visualizing protein structures Predicting 2° Structure • A common measure is Q3 = the % of amino acids whose secondary structure state was predicted correctly. • Today, Q3 is about 75-78% (as determined objectively by CASP) • The theoretical limit is thought to be about 90%
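As a worked illustration of Q3, the sketch below compares a predicted and an observed secondary structure string position by position. It assumes the common three-state encoding (H = helix, E = strand, C = coil); the two strings are invented for the example and the function name is hypothetical.

```python
def q3(predicted, observed):
    """Return the percentage of positions whose state was predicted correctly."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


# Invented 10-residue example: one helix residue is mispredicted as coil.
observed = "HHHHEEEECC"
predicted = "HHHCEEEECC"
print(q3(predicted, observed))
```

Here 9 of the 10 positions agree, so Q3 is 90%.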

Investigating & visualizing protein structures Predicting 2° Structure • E.g. PSIPRED • http://bioinf.cs.ucl.ac.uk/psipred/psiform.html • A simple and accurate secondary structure prediction method, incorporating two feed-forward neural networks which perform an analysis on output obtained from PSI-BLAST. • Using a very stringent cross-validation method to evaluate the method's performance, the recent version of PSIPRED achieves an average Q3 score of 80.7%.

Investigating & visualizing protein structures Protein 3D Structures • A protein’s structure has a critical effect on its function: 1. Binding pockets PDB ID 1nw7

Investigating & visualizing protein structures Protein 3D Structures • A protein’s structure has a critical effect on its function: 2. Areas of specific chemical/electrical properties

Investigating & visualizing protein structures Protein 3D Structures • A protein’s structure has a critical effect on its function: 3. Importance of the global fold for function
