Protein structure prediction: The holy grail of bioinformatics


Proteins: Four levels of structural organization:

Primary structure

Secondary structure

Tertiary structure

Quaternary structure


Primary structure = the linear amino acid sequence


Secondary structure = spatial arrangement of amino-acid residues that are adjacent in the primary structure


α helix = a helical structure in which the chain coils tightly as a right-handed screw, with all the side chains sticking outward in a helical array. The tight structure of the α helix is stabilized by same-strand hydrogen bonds between -NH groups and -CO groups spaced at four amino-acid residue intervals.


The β-pleated sheet is made of loosely coiled β strands that are stabilized by hydrogen bonds between -NH and -CO groups from adjacent strands.


An antiparallel β sheet. Adjacent β strands run in opposite directions. Hydrogen bonds between NH and CO groups connect each amino acid to a single amino acid on an adjacent strand, stabilizing the structure.


A parallel β sheet. Adjacent β strands run in the same direction. Hydrogen bonds connect each amino acid on one strand with two different amino acids on the adjacent strand.


Silk fibroin


α helix

β sheet (parallel and antiparallel)

tight turns

flexible loops

irregular elements (random coil)


Tertiary structure = three-dimensional structure of protein


The tertiary structure is formed by the folding of secondary structures by covalent and non-covalent forces, such as hydrogen bonds, hydrophobic interactions, salt bridges between positively and negatively charged residues, as well as disulfide bonds between pairs of cysteines.



Holoproteins & Apoproteins

[Figure labels: holoprotein, apoprotein, prosthetic group]


Apohemoglobin = 2α + 2β


Prosthetic group

Heme



Christian B. Anfinsen

1916-1995

Sela M, White FH, & Anfinsen CB. 1959. The reductive cleavage of disulfide bonds and its application to problems of protein structure. Biochim. Biophys. Acta. 31:417-426.


Not all proteins fold independently.

Chaperones.


The denaturation and renaturation of proteins


Reducing agents:

Ammonium thioglycolate (alkaline) pH 9.0-10

Glyceryl monothioglycolate (acid) pH 6.5-8.2


Oxidant


What do we need to know in order to state that the tertiary structure of a protein has been solved?

Ideally: We need to determine the position of all atoms and their connectivity.

Less ideally: We need to determine the position of all Cα atoms (backbone structure).


Protein structure: Limitations and caveats

  • Not all proteins or parts of proteins assume a well-defined 3D structure in solution.

  • Protein structure is not static; different parts of the structure undergo different degrees of thermal motion.

  • There may be a number of slightly different conformations in solution.

  • Some proteins undergo conformational changes when interacting with other molecules (e.g. ligands or other proteins).


Experimental Protein Structure Determination

  • X-ray crystallography

    • most accurate

    • in vitro

    • needs crystals

    • ~$100-200K per structure

  • NMR

    • fairly accurate

    • in solution

    • no need for crystals

    • limited to very small proteins

  • Cryo-electron-microscopy

    • imaging technology

    • low resolution


Why predict protein structure?

  • Structural knowledge = some understanding of function and mechanism of action

  • Predicted structures can be used in structure-based drug design

  • It can help us understand the effects of mutations on structure and function

  • It is a very interesting scientific problem (still unsolved in its most general form after more than 50 years of effort)


Secondary structure prediction

  • Historically, the first structure prediction methods predicted secondary structure

  • Can be used to improve alignment accuracy

  • Can be used to detect domain boundaries within proteins with remote sequence homology

  • Often the first step towards 3D structure prediction

  • Informative for mutagenesis studies


Protein Secondary Structures (Simplifications)

α-HELIX

β-STRAND

COIL (everything else)


Assumptions

  • All the information needed to form secondary structure is contained in the primary sequence

  • side groups of residues will determine structure

  • examining windows of 13-17 residues is sufficient to predict secondary structure

    • α-helices 5–40 residues long

    • β-strands 5–10 residues long


Predicting Secondary Structure From Primary Structure

  • accuracy 64-75%

  • higher accuracy for α-helices than for β-sheets

  • accuracy is dependent on protein family

  • predictions of engineered (artificial) proteins are less accurate


A surprising result!

Chameleon sequences


The “Chameleon” sequence

sequence 1    sequence 2

TEAVDAATAEKVFKQYANDNGVDGEWTYDDATKTFTVTEK

Replace both sequences with an engineered peptide (“chameleon”)

TEAVDAWTVEKAFKTFANDNGVDGAWTVEKAFKTFTVTEK

α-helix    β-strand

Source: Minor and Kim. 1996. Nature 380:730-734
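
The engineered sequence on this slide contains the same peptide twice. As a minimal sketch (the function name `longest_repeat` is illustrative and not part of the original slides), the repeated segment can be located with a brute-force search for the longest substring that occurs at least twice:

```python
def longest_repeat(seq):
    """Return (substring, start_positions) for the longest substring seen at least twice."""
    n = len(seq)
    for length in range(n - 1, 0, -1):           # try the longest candidates first
        positions = {}
        for i in range(n - length + 1):
            positions.setdefault(seq[i:i + length], []).append(i)
        repeated = [(pos[0], s, pos) for s, pos in positions.items() if len(pos) > 1]
        if repeated:
            _, s, pos = min(repeated)            # earliest repeat wins ties
            return s, pos
    return "", []

# Engineered sequence copied from the slide above
engineered = "TEAVDAWTVEKAFKTFANDNGVDGAWTVEKAFKTFTVTEK"
peptide, starts = longest_repeat(engineered)
print(peptide, starts)  # AWTVEKAFKTF [5, 24]
```

Positions are 0-based. The search is quadratic in sequence length, which is fine for a 40-residue example but would need a suffix-based method for long sequences.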


Measures of prediction accuracy

  • Qindex and Q3

  • Correlation coefficient


Qindex

Qindex: (Qhelix, Qstrand, Qcoil, Q3)

  • percentage of residues correctly predicted as α-helix, β-strand, coil, or for all 3 conformations.

    Drawbacks:

    - even a random assignment of structure can achieve a high score (Holley & Karplus 1991)
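
The Q scores described above can be sketched as follows. This is an illustration under stated assumptions: the three-state alphabet H/E/C, the function name `q_scores`, and the two toy strings are hypothetical, and the per-state score is taken here as the percentage of observed residues of that state that were predicted correctly.

```python
def q_scores(observed, predicted, states="HEC"):
    """Percent of residues predicted correctly, overall (Q3) and per observed state."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("observed and predicted must be non-empty and of equal length")
    pairs = list(zip(observed, predicted))
    scores = {"Q3": 100.0 * sum(o == p for o, p in pairs) / len(pairs)}
    for s in states:
        n_obs = sum(o == s for o, _ in pairs)              # residues observed in state s
        n_ok = sum(o == s and p == s for o, p in pairs)    # ... and predicted as s
        scores["Q" + s] = 100.0 * n_ok / n_obs if n_obs else float("nan")
    return scores

# Hypothetical toy data: H = helix, E = strand, C = coil
observed  = "CHHHHHHCCEEEEC"
predicted = "CCHHHHHHCEEECC"
scores = q_scores(observed, predicted)
print({k: round(v, 1) for k, v in scores.items()})
```

For this toy pair, 11 of 14 residues agree, so Q3 is about 78.6%.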


Correlation coefficient

Cα = 1 (= 100%)
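
The formula on this slide did not survive extraction. A common choice for a per-state correlation coefficient in secondary structure prediction is the Matthews correlation coefficient, and the sketch below assumes that definition; the function name and toy strings are hypothetical.

```python
import math

def state_correlation(observed, predicted, state):
    """Matthews correlation coefficient for one state (assumed definition, see lead-in)."""
    p = n = u = o = 0
    for obs, pred in zip(observed, predicted):
        if obs == state and pred == state:
            p += 1      # correctly predicted as the state
        elif obs != state and pred != state:
            n += 1      # correctly rejected
        elif obs == state:
            u += 1      # underpredicted (missed)
        else:
            o += 1      # overpredicted
    denom = math.sqrt((p + u) * (p + o) * (n + u) * (n + o))
    return (p * n - u * o) / denom if denom else 0.0

observed  = "CHHHHHHCCEEEEC"   # hypothetical toy data
predicted = "CCHHHHHHCEEECC"
c_helix = state_correlation(observed, predicted, "H")
c_perfect = state_correlation(observed, observed, "H")
print(round(c_helix, 3), c_perfect)
```

A perfect prediction gives a coefficient of 1, which matches the "= 100%" note on the slide; unlike the Q scores, a prediction that ignores the sequence tends toward 0.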


Methods of secondary structure prediction


First generation methods: single residue statistics

Chou & Fasman (1974 & 1978) :

Some residues have particular secondary-structure preferences. Based on empirical frequencies of residues in α-helices, β-sheets, and coils.

Examples: Glu → α-helix

Val → β-strand
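
Single-residue statistics of this kind can be illustrated with a propensity ratio: how often a residue occurs in a given state relative to how often it occurs overall. The sketch below is not the published Chou-Fasman parameter set; the function name and the tiny sequence/structure pair are hypothetical and chosen only to make the arithmetic visible.

```python
def propensity(seq, ss, residue, state):
    """Frequency of `residue` within `state`, divided by its frequency in the whole sequence."""
    if len(seq) != len(ss):
        raise ValueError("sequence and structure strings must have equal length")
    in_state = [aa for aa, s in zip(seq, ss) if s == state]
    if not in_state or residue not in seq:
        return 0.0
    f_state = in_state.count(residue) / len(in_state)
    f_all = seq.count(residue) / len(seq)
    return f_state / f_all

# Hypothetical toy data. In `seq`, E is glutamate and V is valine;
# in `ss`, H = helix, E = strand, C = coil.
seq = "EEAEVVKVEA"
ss  = "HHHHEEEECC"
p_glu_helix = propensity(seq, ss, "E", "H")
p_val_strand = propensity(seq, ss, "V", "E")
print(round(p_glu_helix, 3), round(p_val_strand, 3))
```

A value above 1 means the residue is over-represented in that state in the data set; real propensity tables are derived from many solved structures rather than one short example.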


Chou-Fasman method

  • Accuracy: Q3 = 50-60%


Second generation methods: segment statistics

  • Similar to single-residue methods, but incorporating additional information (adjacent residues, segmental statistics).

  • Problems:

    • Low accuracy - Q3 below 66% (results).

    • Q3 of β-strands (E): 28% - 48%.

    • Predicted structures were too short.


The GOR method

  • developed by Garnier, Osguthorpe & Robson

  • builds on Chou-Fasman Pij values

  • evaluates each residue plus the adjacent 8 N-terminal and 8 C-terminal residues

  • sliding window of 17 residues

  • underpredicts β-strand regions

  • GOR method accuracy Q3 = ~64%
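
The 17-residue sliding window described above (the central residue plus 8 neighbours on each side) can be sketched as follows. Only the window extraction is shown, not the GOR scoring itself; the function name, the padding character, and the short example sequence are assumptions for illustration.

```python
def sliding_windows(seq, half_width=8, pad="-"):
    """Return one window of 2*half_width + 1 residues centred on each position."""
    padded = pad * half_width + seq + pad * half_width   # pad so end residues get full windows
    size = 2 * half_width + 1
    return [padded[i:i + size] for i in range(len(seq))]

windows = sliding_windows("MTYKLILN")
print(windows[0])   # --------MTYKLILN-
```

Each window would then be scored for helix, strand, and coil, and the state with the best score assigned to the central residue.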


Third generation methods

  • Third generation methods reached 77% accuracy.

  • They consist of two new ideas:

    1. A biological idea –

    Using evolutionary information based on conservation analysis of multiple sequence alignments.

    2. A technological idea –

    Using neural networks.
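
The evolutionary information mentioned above is typically summarised as a per-column residue profile of a multiple sequence alignment. The sketch below shows one simple way to compute such a profile; the function name and the four-sequence alignment are hypothetical, and gaps are ignored when computing frequencies (an assumption, other conventions exist).

```python
from collections import Counter

def column_profile(alignment, gap="-"):
    """Per-column residue frequencies of an alignment, ignoring gap characters."""
    if len({len(s) for s in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*alignment):
        residues = [c for c in column if c != gap]
        counts = Counter(residues)
        profile.append({aa: k / len(residues) for aa, k in counts.items()} if residues else {})
    return profile

# Hypothetical toy alignment
alignment = ["MKVLA", "MRVLA", "MKILG", "MKV-A"]
profile = column_profile(alignment)
print(profile[1])   # {'K': 0.75, 'R': 0.25}
```

A profile like this, rather than the single sequence, is what a third-generation predictor would take as input for each window.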


Artificial Neural Networks

An attempt to imitate the human brain (assuming that this is the way it works).


Neural network models

  • machine learning approach

  • provide training sets of structures (e.g. α-helices, non-α-helices)

  • computers are trained to recognize patterns in known secondary structures

  • provide test set (proteins with known structures)

  • accuracy ~ 70 –75%


Reasons for improved accuracy

  • Align sequence with other related proteins of the same protein family

  • Find members that have a known structure

  • If there are significant matches between structure and sequence, assign secondary structures to the corresponding residues


New and Improved Third-Generation Methods

Exploit evolutionary information. Based on conservation analysis of multiple sequence alignments.

  • PHD (Q3 ~ 70%)

    Rost, B., Sander, C. (1993) J. Mol. Biol. 232, 584-599.

  • PSIPRED (Q3 ~ 77%)

    Jones, D. T. (1999) J. Mol. Biol. 292, 195-202.

    Arguably remains the top secondary structure prediction method (won all CASP competitions since 1998).


Secondary Structure Prediction

Summary

  • 1st Generation - 1970s

    • Q3 = 50-55%

    • Chou & Fasman, GOR

  • 2nd Generation - 1980s

    • Q3 = 60-65%

    • Qian & Sejnowski, GORIII

  • 3rd Generation - 1990s

    • Q3 = 70-80%

    • PHD, PSIPRED

  • Many 3rd+ generation methods exist:

    • PSI-PRED - http://bioinf.cs.ucl.ac.uk/psipred/

    • JPRED - http://www.compbio.dundee.ac.uk/~www-jpred/

    • PHD - http://www.embl-heidelberg.de/predictprotein/predictprotein.html

    • NNPRED - http://www.cmpharm.ucsf.edu/~nomi/nnpredict.html


The sequence-structure gap

September 13, 2011

More than 13,137,813 known protein sequences, 76,495 experimentally determined structures.


The sequence-structure gap

The gap is getting bigger.

[Figure: counts of Sequences and Structures; axis scale from 0 to 200,000]


Protein Secondary Structures (Simplifications)

α-HELIX

β-STRAND

COIL (everything else)


Beyond Secondary Structure, Before Tertiary Structure

  • Supersecondary structures (motifs): small, discrete, commonly observed aggregates of secondary structures

    • helix-loop-helix

    • β-α-β

  • Domains: independent units of structure

    • β barrel

    • four-helix bundle

  • The terms “domain” and “motif” are sometimes used interchangeably.


Helix-loop-helix


Beyond Secondary Structure, Before Tertiary Structure

Folds: Compact folding arrangements of a polypeptide chain (a protein or part of a protein).

The terms “domain” and “fold” are sometimes used interchangeably.


EF Fold

Found in calcium-binding proteins such as calmodulin


Leucine Zipper


Rossmann Fold

  • The beta-alpha-beta-alpha-beta subunit

  • Often present in nucleotide-binding proteins


β sandwich

β barrel


α/β horseshoe


Four helix bundle

  • 24 amino acid peptide with a hydrophobic surface

  • Assembles into 4 helix bundle through hydrophobic regions

  • Maintains solubility of membrane proteins


TIM Barrel


PDB New Fold Growth

  • The number of unique folds in nature is fairly small (possibly a few thousand)

  • 90% of new structures submitted to PDB in the past three years have similar structural folds in PDB

[Figure legend: old fold, new fold]


Protein data bank

  • http://www.rcsb.org/pdb/


Protein 3D structure data:

The structure of a protein consists of the 3D (X,Y,Z) coordinates of each non-hydrogen atom of the protein.

Some protein structures also include coordinates of covalently linked prosthetic groups, non-covalently linked ligand molecules, or metal ions.

For some purposes (e.g. structural alignment) only the Cα coordinates are needed.

Example of PDB format (the last five columns are X, Y, Z, occupancy, temperature factor):

ATOM 18 N GLY 27 40.315 161.004 11.211 1.00 10.11

ATOM 19 CA GLY 27 39.049 160.737 10.462 1.00 14.18

ATOM 20 C GLY 27 38.729 159.239 10.784 1.00 20.75

ATOM 21 O GLY 27 39.507 158.484 11.404 1.00 21.88

Note: the PDB format provides no information about connectivity between atoms. The last two numbers (occupancy, temperature factor) relate to disorder of atomic positions in crystals.
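
The four ATOM lines above can be read with a few lines of code. This sketch assumes the simplified, whitespace-separated layout shown on the slide (ten fields per line); real PDB files use fixed column positions and usually carry extra fields such as a chain identifier, so the `parse_atoms` helper here is illustrative only.

```python
import math

PDB_LINES = """\
ATOM 18 N GLY 27 40.315 161.004 11.211 1.00 10.11
ATOM 19 CA GLY 27 39.049 160.737 10.462 1.00 14.18
ATOM 20 C GLY 27 38.729 159.239 10.784 1.00 20.75
ATOM 21 O GLY 27 39.507 158.484 11.404 1.00 21.88
"""

def parse_atoms(text):
    """Parse whitespace-separated ATOM lines in the simplified layout used on the slide."""
    atoms = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 10 or fields[0] != "ATOM":
            continue                                   # skip anything that is not a 10-field ATOM line
        atoms.append({
            "serial": int(fields[1]),
            "name": fields[2],
            "res_name": fields[3],
            "res_seq": int(fields[4]),
            "xyz": tuple(float(v) for v in fields[5:8]),
            "occupancy": float(fields[8]),
            "b_factor": float(fields[9]),
        })
    return atoms

atoms = parse_atoms(PDB_LINES)
by_name = {a["name"]: a for a in atoms}
ca_xyz = by_name["CA"]["xyz"]                          # the Cα coordinates of Gly 27
n_ca = math.dist(by_name["N"]["xyz"], ca_xyz)          # interatomic distance in Å
print(round(n_ca, 3))  # 1.495
```

Because the format carries no connectivity, bonded atoms have to be inferred from residue templates or, as here, from distances: the N and Cα atoms are about 1.5 Å apart, as expected for a covalent bond.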


Protein structure: Some computational tasks

  • Building a protein structure model from X-ray data

  • Building a protein structure model from NMR data

  • Computing the energy for a given protein structure (conformation)

  • Energy minimization: Finding the structure with the minimal energy according to some empirical “force fields”.

  • Simulating the protein folding process (molecular dynamics)

  • Structure visualization

  • Computing secondary structure from atomic coordinates

  • Protein superposition, structural alignment

  • Protein fold classification

  • Threading: finding a fold (prototype structure) that fits to a sequence

  • Docking: fitting ligands onto a protein surface by molecular dynamics or energy minimization

  • Protein 3D structure prediction from sequence


Viewing protein structures

  • When looking at a protein structure, we may ask the following types of questions:

    • Is a particular residue on the inside or outside of a protein?

    • Which amino acids interact with each other?

    • Which amino acids are in contact with a ligand (DNA, peptide hormone, small molecule, etc.)?

    • Is an observed mutation likely to disturb the protein structure?

  • Standard capabilities of protein structure software:

    • Display of protein structures in different ways (wireframe, backbone, sticks, spacefill, ribbon).

    • Highlighting of individual atoms, residues or groups of residues

    • Calculation of interatomic distances

    • Advanced feature: Superposition of related structures


Example: c-abl oncoprotein SH2 domain, display: wireframe


Example: c-abl oncoprotein SH2 domain, display: sticks


Example: c-abl oncoprotein SH2 domain, display: backbone


Example: c-abl oncoprotein SH2 domain, display: spacefill


Example: c-abl oncoprotein SH2 domain, display: ribbons


Predicting protein 3D structure

Goal: 3D structure from 1D sequence

An existing fold: homology modeling, fold recognition

A new fold: ab initio


Homology modeling

Based on the two major observations (and some simplifications):

  • The structure of a protein is uniquely defined by its amino acid sequence.

  • Similar sequences adopt similar structures. (Distantly related sequences may still fold into similar structures.)


Homology modeling needs three items of input:

  • The sequence of a protein with unknown 3D structure, the "target sequence."

  • A 3D “template” – a structure having the highest sequence identity with the target sequence (>30% sequence identity)

  • A sequence alignment between the target sequence and the template sequence
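
The 30% identity criterion above can be checked directly from a pairwise alignment. In this sketch, percent identity is computed over positions where neither sequence has a gap, which is one of several conventions in use; the function name and the two aligned toy sequences are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of identical residues over positions where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != gap and y != gap]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)

# Hypothetical aligned target and template sequences
target   = "MTYKLILNGK"
template = "MTYKLVLN-K"
identity = percent_identity(target, template)
usable_as_template = identity > 30.0      # threshold taken from the slide
print(round(identity, 1), usable_as_template)
```

Here 8 of the 9 gap-free positions match (about 88.9%), so this template would pass the threshold.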


Homology Modeling: How it works

  • Find template

  • Align target sequence with template

  • Generate model:

    • add loops

    • add sidechains

  • Refine model


Two zones of homology modeling

[Rost, Protein Eng. 1999]


Automated Web-Based Homology Modelling

  • SWISS Model : http://www.expasy.org/swissmod/SWISS-MODEL.html

  • WHAT IF : http://www.cmbi.kun.nl/swift/servers/

  • The CPHModels Server : http://www.cbs.dtu.dk/services/CPHmodels/

  • 3D Jigsaw : http://www.bmm.icnet.uk/~3djigsaw/

  • SDSC1 : http://cl.sdsc.edu/hm.html

  • EsyPred3D : http://www.fundp.ac.be/urbm/bioinfo/esypred/


Fold recognition = Protein Threading

Which of the known folds is likely to be similar to the (unknown) fold of a new protein when only its amino-acid sequence is known?


Protein Threading

MTYKLILN …. NGVDGEWTYTE

  • The goal: find the “correct” sequence-structure alignment between a target sequence and its native-like fold in PDB

  • Energy function – knowledge (or statistics) based rather than physics based

    • Should be able to distinguish correct structural folds from incorrect structural folds

    • Should be able to distinguish correct sequence-fold alignment from incorrect sequence-fold alignments


Protein Threading

  • Basic premise

  • Statistics from Protein Data Bank (~2,000 structures)

  • Chances for a protein to have a structural fold that already exists in PDB are quite good.

The number of unique structural (domain) folds in nature is fairly small (possibly a few thousand)

90% of new structures submitted to PDB in the past three years have similar structural folds in PDB


Protein Threading

Basic components:

  • Structure database

  • Energy function

  • Sequence-structure alignment algorithm

  • Prediction reliability assessment


Protein Threading – structure database

  • Build a template database


Process

  • Threading - A protein fold recognition technique that involves incrementally replacing the sequence of a known protein structure with a query sequence of unknown structure. The new “model” structure is evaluated using a simple heuristic measure of protein fold quality. The process is repeated against all known 3D structures until an optimal fit is found.
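
The process above can be illustrated with a deliberately simplified toy: each template is reduced to a string of buried (B) and exposed (E) positions, and the query is slid along it without gaps. The scoring rule (hydrophobic residues should sit in buried positions, all others in exposed ones), the residue set, and the template names are assumptions made for this sketch, not the energy function of any real threading program.

```python
HYDROPHOBIC = set("AVLIMFWC")   # assumed residue classification for this toy example

def fit_score(segment, environment):
    """+1 for each residue whose hydrophobicity matches the position's burial, -1 otherwise."""
    score = 0
    for aa, env in zip(segment, environment):
        buried = env == "B"
        score += 1 if (aa in HYDROPHOBIC) == buried else -1
    return score

def thread(query, templates):
    """Try every ungapped placement of the query on every template; return the best fit."""
    best = None
    for name, env in templates.items():
        for offset in range(len(env) - len(query) + 1):
            s = fit_score(query, env[offset:offset + len(query)])
            if best is None or s > best[2]:
                best = (name, offset, s)
    return best

# Hypothetical template library: one burial pattern per known fold
templates = {"fold_A": "EEBBEEBBEE", "fold_B": "EBEBEBEBEB"}
best_fit = thread("LVKDLIEK", templates)
print(best_fit)  # ('fold_A', 2, 8)
```

Real threading methods allow gaps in the sequence-structure alignment and use knowledge-based potentials, but the loop structure is the same: score the query against every template and keep the best.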


Fold recognition methods

  • 3D-PSSM

    http://www.sbg.bio.ic.ac.uk/~3dpssm/

  • Fugue

    http://www-cryst.bioc.cam.ac.uk/~fugue/

  • HHpred

    http://protevo.eb.tuebingen.mpg.de/toolkit/index.php?view=hhpred


ab-initio folding

Goal: Predict structure from “first principles”

Requires:

  • A free energy function, sufficiently close to the “true potential”

  • A method for searching the conformational space

    Advantages:

  • Works for novel folds

  • Shows that we understand the process

    Disadvantages:

  • Applicable to short sequences only


Rosetta [Simons et al. 1997]

http://www.bioinfo.rpi.edu/~bystrc/hmmstr/server.php


Qian et al. (Nature, 2007) used distributed computing* to predict the 3D structure of a protein from its amino-acid sequence. Here, their predicted structure (grey) of a protein is overlaid with the experimentally determined crystal structure (color) of that protein. The agreement between the two is excellent.

*70,000 home computers for about two years.


Overall Approach

[Flowchart boxes: Protein sequence; Database searching; Multiple sequence alignment; Homologue in PDB? (Yes / No); Secondary structure prediction; Fold recognition; Predicted fold? (Yes / No); Sequence-structure alignment; Homology modelling; Ab initio structure prediction; 3-D protein model]


ExPASy Proteomics Server: Expert Protein Analysis System

links to lots of protein prediction resources

http://expasy.org/


RMSDmin

The root mean square deviation (RMSD) is the measure of the average distance between the backbones of superimposed proteins. In the study of globular protein conformations, one customarily measures the similarity in three-dimensional structure by the RMSD of the Cα atomic coordinates after optimal rigid body superposition.

A widely used way to compare the structures of biomolecules or solid bodies is to translate and rotate one structure with respect to the other to minimize the RMSD. This RMSDmin can be used as a distance measure between two proteins.
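
As a minimal sketch of the RMSD definition above, the code below computes the RMSD between two equally sized sets of Cα coordinates and removes the translational part by centring both sets on their centroids. The optimal rotation (for example via the Kabsch algorithm) is not included, so the result equals RMSDmin only when the structures differ by a pure translation; the function names and coordinates are hypothetical.

```python
import math

def centroid(coords):
    n = len(coords)
    return tuple(sum(c[i] for c in coords) / n for i in range(3))

def center(coords):
    """Translate a coordinate set so that its centroid sits at the origin."""
    cx, cy, cz = centroid(coords)
    return [(x - cx, y - cy, z - cz) for x, y, z in coords]

def rmsd(a, b):
    """Root mean square deviation between paired coordinates (no superposition applied)."""
    if len(a) != len(b) or not a:
        raise ValueError("coordinate sets must be non-empty and of equal size")
    sq = sum((p - q) ** 2 for u, v in zip(a, b) for p, q in zip(u, v))
    return math.sqrt(sq / len(a))

# Hypothetical Cα traces: b is a copy of a shifted by (3, 4, 0)
a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
b = [(x + 3.0, y + 4.0, z) for x, y, z in a]
raw = rmsd(a, b)                          # every atom is displaced by 5 Å
centered = rmsd(center(a), center(b))     # translation removed
print(raw, round(centered, 6))
```

The raw value is 5 Å and drops to zero after centring, which shows why superposition must precede the RMSD calculation.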

