2d 3d structure modelling

2d-3D Structure Modelling

  • S. Shahriar Arab


Flow of information

Flow of information

DNA

RNA

PROTEIN SEQ

PROTEIN STRUCT

PROTEIN FUNCTION

……….


Prediction in bioinformatics

Prediction in bioinformatics

  • Important prediction problems:

    • Protein sequence from genomic DNA

    • Protein 3D structure from sequence

    • Protein function from structure

    • Protein function from sequence


Why predict protein structure

Why predict protein structure?

  • The sequence-structure gap

    • Millions of known sequences, but only about 80 000 known structures

  • Structural knowledge brings understanding of function and mechanism of action

  • Can help in prediction of function


Why predict protein structure

Why predict protein structure?

  • Predicted structures can be used in structure based drug design

  • It can help us understand the effects of mutations on structure or function

  • It is a very interesting scientific problem

    • still unsolved in its most general form after more than 20 years of effort


What is protein structure prediction

What is protein structure prediction?

  • In its most general form

    • a prediction of the (relative) spatial position of each atom in the tertiary structure generated from knowledge only of the primary structure (sequence)


Methods of structure prediction

Methods of structure prediction

  • Ab initio protein folding approaches

  • Comparative (homology) modelling

  • Fold recognition/threading


Prediction in one dimension

Prediction in one dimension

  • Secondary structure prediction

  • Surface accessibility prediction


2d structure identification

2D Structure Identification

  • DSSP - Database of Secondary Structures for Proteins (http://swift.cmbi.kun.nl/gv/dssp/)

  • VADAR - Volume Area Dihedral Angle Reporter (http://redpoll.pharmacy.ualberta.ca/vadar/)

  • PDB - Protein Data Bank (www.rcsb.org)

QHTAWCLTSEQHTAAVIWDCETPGKQNGAYQEDCA

HHHHHHCCEEEEEEEEEEECCHHHHHHHCCCCCCC


Secondary structure


Secondary Structure

  • The DSSP code

  • H = alpha helix

  • B = residue in isolated beta-bridge

  • E = extended strand, participates in beta ladder

  • G = 3-helix (3/10 helix)

  • I = 5 helix (pi helix)

  • T = hydrogen bonded turn

  • S = bend

  • C = coil


Simplifications

Simplifications

  • Eight states from DSSP

  • H: α−helix

  • G: 3₁₀ helix

  • I: π-helix

  • E: β−strand

  • B: bridge

  • T: β−turn

  • S: bend

  • C: coil

  • CASP Standard

  • H = (H, G, I),

  • E = (E, B),

  • C = (C, T, S)

  • Identification of secondary structures focused on

  • α-helices

  • β -strands

  • others (turns, coils, other helices) are collectively called “coils”
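The CASP reduction listed above is a plain lookup table. A minimal Python sketch follows; the function name `reduce_dssp` and the choice to map any unrecognised symbol (for example a blank) to coil are assumptions made for illustration, not part of the slides.

```python
# CASP-style reduction of the eight DSSP states to three classes.
DSSP_TO_3 = {
    "H": "H", "G": "H", "I": "H",   # all helix types -> H
    "E": "E", "B": "E",             # strand and isolated bridge -> E
    "C": "C", "T": "C", "S": "C",   # coil, turn, bend -> C
}

def reduce_dssp(states: str) -> str:
    """Map an 8-state DSSP string to H/E/C.

    Assumption: symbols outside the table are treated as coil.
    """
    return "".join(DSSP_TO_3.get(s, "C") for s in states)

print(reduce_dssp("HHHGGTTEEBSCC"))
```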


What is secondary structure prediction

What is Secondary structure prediction?

  • Given a protein sequence (primary structure)

GHWIATRGQLIREAYEDYRHFSSECPFIP

  • Predict its secondary structure content

  • (C=coils H=Alpha Helix E=Beta Strands)

CEEEEECHHHHHHHHHHHCCCHHCCCCCC


Why secondary structure prediction

Why Secondary Structure Prediction?

  • A simpler problem than 3D structure prediction

  • Accurate secondary structure prediction provides important information for tertiary structure prediction

  • Improving alignment accuracy

  • Protein function prediction

  • Protein classification


Secondary structure prediction

secondary structure prediction

  • less detailed results

    • only predicts the H (helix), E (extended) or C (coil/loop) state of each residue, does not predict the full atomic structure

  • Accuracy of secondary structure prediction

    • The best methods have an average accuracy of just about 73% (the percentage of residues predicted correctly)
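The accuracy figure quoted here (often called Q3) is the percentage of residues whose predicted state matches the observed one. A small sketch, assuming both strings use the same H/E/C alphabet and have equal length; the function name `q3` is illustrative.

```python
def q3(predicted: str, observed: str) -> float:
    """Percentage of residues whose predicted state equals the observed state."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("strings must be non-empty and of equal length")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)

# Toy strings for illustration: one of ten residues is predicted wrongly.
print(q3("CEEEEECHHH", "CEEEECCHHH"))
```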


History of protein secondary structure prediction

History of protein secondary structure prediction

  • First generation

    • How: single residue statistics

    • Example: Chou-Fasman method, LIM method, GOR I, etc

    • Accuracy: low

  • Second generation

    • How: segment statistics

    • Examples: ALB method, GOR III, etc

    • Accuracy: ~60%

  • Third generation

    • How: long-range interaction, homology based

    • Examples: PHD

    • Accuracy: ~70%


Chou fasman method

Chou-Fasman Method

  • Developed by Chou & Fasman in 1974 & 1978

  • Based on frequencies of residues in α-helices, β-sheets and turns

  • Accuracy ~50 - 60% Q3


Chou fasman statistics

Chou-Fasman statistics

  • R – amino acid, S- secondary structure

  • f(R,S) – number of occurrences of R in S

  • Ns – total number of amino acids in conformation S

  • N – total number of amino acids

  • P(R,S) – propensity of amino acid R to be in structure S

    • P(R,S) = (f(R,S)/f(R))/(Ns/N)


Example

Example

  • #residues=20,000,

  • #helix=4,000,

  • #Ala=2,000,

  • #Ala in helix=500

  • f(Ala, α) = 500/20,000,

  • f(Ala) = 2,000/20,000

  • p(α) = Να/Ν=4,000/20,000

  • P = (500/2000) / (4,000/20000) = 1.25
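The arithmetic of this example can be checked with a few lines of Python. This is only a sketch of the propensity ratio; it works directly from counts, so the common factor of 20,000 in f(Ala, α) and f(Ala) cancels. The function and argument names are illustrative.

```python
def propensity(n_res_in_state: int, n_res: int, n_state: int, n_total: int) -> float:
    """P(R,S) = (f(R,S) / f(R)) / (Ns / N), computed from raw counts."""
    return (n_res_in_state / n_res) / (n_state / n_total)

# Counts from the example: 500 Ala in helix, 2,000 Ala, 4,000 helix residues, 20,000 residues.
p_ala_helix = propensity(500, 2000, 4000, 20000)
print(round(p_ala_helix, 2))
```

A value above 1 means the residue occurs in that structure more often than expected from the overall composition.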


Chou fasman statistics

Chou-Fasman Statistics


Amino acid propensities

Amino acid propensities


Scan peptide for helix regions

Scan peptide for α−helix regions

2. Identify regions where 4 of 6 residues have P(H) > 100: an “alpha-helix nucleus”


Extend helix nucleus

Extend α-helix nucleus

3. Extend the helix in both directions until a set of four residues has an average P(H) < 100.

Repeat steps 1 – 3 for entire peptide


Scan peptide for sheet regions

Scan peptide for β-sheet regions

4. Identify regions where 3 of 5 residues have P(E) > 100: a “β-sheet nucleus”

5. Extend the β-sheet until 4 contiguous residues have an average P(E) < 100

6. If the region average P(E) > 105 and the average P(E) > average P(H), then assign “β-sheet”
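The nucleation rules (4 of 6 for helix, 3 of 5 for sheet) amount to a sliding-window count. The sketch below covers only that scanning step, not the extension or the helix/sheet conflict rule. The propensity values are made up for illustration and are not the published Chou-Fasman parameters.

```python
def find_nuclei(scores, window, min_hits, threshold=100):
    """Return start indices of windows in which at least `min_hits`
    residues have a propensity above `threshold`."""
    starts = []
    for i in range(len(scores) - window + 1):
        hits = sum(1 for p in scores[i:i + window] if p > threshold)
        if hits >= min_hits:
            starts.append(i)
    return starts

# Hypothetical per-residue propensities on the x100 scale used in the rules above.
toy_scores = [110, 120, 90, 130, 105, 80, 70, 60, 50, 40]

print(find_nuclei(toy_scores, window=6, min_hits=4))  # helix rule: 4 of 6
print(find_nuclei(toy_scores, window=5, min_hits=3))  # sheet rule: 3 of 5
```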


The gor method

The GOR method

  • developed by Garnier, Osguthorpe & Robson

  • built on Pij values based on information theory

  • evaluate each residue PLUS adjacent 8 N-terminal and 8 carboxyl-terminal residues

  • sliding window of 17

  • GOR III method accuracy ~64% Q3


Second generation

Second generation


Gor idea statistics that take into account the whole window

GOR idea: Statistics that take into account the whole window

  • Each residue carries two different types of information:

  • Intra-residue information – information about its own secondary structure

  • Inter-residue information – the influence of this residue on other residues


Gor continued

GOR….continued

  • Individual propensity of amino acid R to be in secondary structure S: the same idea as in Chou-Fasman

  • Contribution of 16 neighbours.

  • - take a window of radius 8 around the residue in question (8 before and 8 after the residue)

  • - for each residue in the window, consider its contribution to the conformation of the middle residue and add its value to PH, PS, PC.

  • - As in Chou-Fasman, the values of all contributions are based on statistics.
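The window idea can be sketched as a sum of table lookups over the 17 positions, followed by picking the state with the largest total. The table layout `info[state][(offset, aa)]` and the toy values below are assumptions for illustration only; real GOR tables are derived from statistics on known structures.

```python
def gor_predict(seq, info, radius=8):
    """For each residue, sum the contributions of all residues in the window
    to each state and assign the state with the largest total.

    info[state][(offset, aa)] is the contribution of amino acid `aa`, located
    `offset` positions from the centre, to the centre residue being in `state`.
    """
    prediction = []
    for centre in range(len(seq)):
        totals = {}
        for state, table in info.items():
            total = 0.0
            for offset in range(-radius, radius + 1):
                pos = centre + offset
                if 0 <= pos < len(seq):  # the window is truncated at the chain ends
                    total += table.get((offset, seq[pos]), 0.0)
            totals[state] = total
        prediction.append(max(totals, key=totals.get))
    return "".join(prediction)

# Hypothetical toy table: A favours helix, V favours strand, anything else is neutral.
toy_info = {
    "H": {(o, "A"): 1.0 for o in range(-8, 9)},
    "E": {(o, "V"): 1.0 for o in range(-8, 9)},
    "C": {},
}
print(gor_predict("AAAA", toy_info))
```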


Third generation

Third generation


Nearest neighbour method

Nearest Neighbour Method

  • Idea: similar sequences are likely to have the same secondary structure.

  • Take a window around the amino acid whose conformation is to be predicted

  • Find several, say k, closest sequences (with respect to a similarity measure defined differently depending on the variant of the method) of known structure.

  • Assign secondary structure based on conformation of the sequence neighbours.

  • Use max (nα, nβ, nc) or max(sα, sβ, sc)

  • Key: Scoring measure of evolutionary similarity.

  • Salamov, Solovyev NNSSP (1995) accuracy above 70%


Neighbours

Neighbours

1 - L H H H H H H L L - S1

2 - L L H H H H H L L - S2

3 - L E E E E E E L L - S3

4 - L E E E E E E L L - S4

n - L L L L E E E E E - Sn

n+1 - H H H L L L E E E - Sn+1

:

  • max (nα, nβ, nL) or max (Σsα, Σsβ, ΣsL) or something else…
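The two decision rules, counting neighbours per state or summing their similarity scores per state, can give different answers. A small sketch with made-up neighbours; the (state, similarity) tuple layout is an assumption for illustration.

```python
from collections import Counter, defaultdict

def vote_by_count(neighbours):
    """Pick the state seen most often among the neighbours: max(n_alpha, n_beta, n_L)."""
    counts = Counter(state for state, _ in neighbours)
    return counts.most_common(1)[0][0]

def vote_by_score(neighbours):
    """Pick the state with the largest summed similarity: max of the per-state sums."""
    totals = defaultdict(float)
    for state, similarity in neighbours:
        totals[state] += similarity
    return max(totals, key=totals.get)

# Hypothetical neighbours of the central residue: (state, similarity score).
toy_neighbours = [("H", 0.4), ("H", 0.3), ("E", 0.9), ("L", 0.1)]
print(vote_by_count(toy_neighbours), vote_by_score(toy_neighbours))
```

In this toy case the count rule favours helix (two neighbours), while the score rule favours strand because its single neighbour is much more similar.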


Advantages

Advantages

  • Information from structural neighbours can be used to provide details to predicted secondary structure (phi,psi angles)

  • Much higher accuracy than previous methods.


Neural network models

Neural network models

  • machine learning approach

  • provide training sets of structures (e.g. α-helices, non α -helices)

  • computers are trained to recognize patterns in known secondary structures

  • provide test set (proteins with known structures)

  • accuracy ~ 70 –75%


Neural network method

Neural Network Method

Recall artificial neurone:


How phd works

How PHD works

  • Step 1. BLAST search with input sequence

  • Step 2. Perform multiple seq. alignment and calculate aa frequencies for each position
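Step 2 turns the multiple alignment into per-position amino acid frequencies. The sketch below shows only that counting step, under the assumption that gaps are written as "-" and are ignored; it is not PHD's actual preprocessing code.

```python
from collections import Counter

def column_frequencies(alignment):
    """Return one {amino acid: frequency} dict per alignment column.

    `alignment` is a list of equal-length aligned sequences; gap characters
    ('-') are skipped, which is an assumption of this sketch.
    """
    profile = []
    for column in zip(*alignment):
        residues = [aa for aa in column if aa != "-"]
        counts = Counter(residues)
        total = len(residues)
        profile.append({aa: n / total for aa, n in counts.items()} if total else {})
    return profile

# Toy alignment of four sequences, three columns.
toy_msa = ["ACD", "ACE", "A-E", "GCE"]
for position, freqs in enumerate(column_frequencies(toy_msa), start=1):
    print(position, freqs)
```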


How phd works cont

How PHD works (cont.)

  • Step 3. Level 1: sequence to structure

  • Take window of 13 adjacent residues

  • Scores for helix, strand, loop in the output layer, for each residue


Prediction tools that use nns

Prediction tools that use NNs

  • MACMATCH

  • (Presnell et al., 1993)

  • for Macintosh

  • PHD

  • (Rost & Sander, 1993)

  • http://www.predictprotein.org/

  • NNPREDICT

  • (Kneller et al. 1990)

  • http://www.cmpharm.ucsf.edu/nomi/nnpredict.html


Phd prediction of rcd2

PHD Prediction of rCD2


Prediction accuracy

Prediction Accuracy


Best of the best

Best of the Best

  • PredictProtein-PHD (72%)

    • http://www.predictprotein.org/

  • Jpred (73-75%)

    • http://jura.ebi.ac.uk:8888/

  • PREDATOR (75%)

    • http://www.embl-heidelberg.de/cgi/predator_serv.pl

  • PSIpred (77%)

    • http://insulin.brunel.ac.uk/psipred


Accessible surface area

Accessible Surface Area

[Figure labels: Solvent Probe, Accessible Surface, Reentrant Surface, Van der Waals Surface]


Asa calculation

ASA Calculation

  • DSSP - Database of Secondary Structures for Proteins (swift.embl-heidelberg.de/dssp)

  • VADAR - Volume Area Dihedral Angle Reporter (http://redpoll.pharmacy.ualberta.ca/vadar/)

  • GetArea - www.scsb.utmb.edu/getarea/area_form.html

QHTAWCLTSEQHTAAVIWDCETPGKQNGAYQEDCAMD

BBPPBEEEEEPBPBPBPBBPEEEPBPEPEEEEEEEEE

1056298799415251510478941496989999999


Other asa sites

Other ASA sites

  • Connolly Molecular Surface Home Page

    • http://www.biohedron.com/

  • Naccess Home Page

    • http://sjh.bi.umist.ac.uk/naccess.html

  • ASA Parallelization

    • http://cmag.cit.nih.gov/Asa.htm

  • Protein Structure Database

    • http://www.psc.edu/biomed/pages/research/PSdb/


Accessibility

Accessibility

  • Accessibility = (Accessible Surface Area (ASA) in folded protein) / (Maximum ASA)

  • Two state = b (buried), e (exposed)

  • e.g. b ≤ 16%, e > 16%

  • Three state = b (buried), i (intermediate), e (exposed)

  • e.g. b ≤ 16%, 16% < i < 36%, e > 36%
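The state assignment is a ratio followed by threshold tests. A sketch using the example cut-offs of 16% and 36%; the function names are illustrative, the maximum ASA must be supplied by the caller, and counting a value of exactly 36% as exposed is an assumption, since the example thresholds leave that boundary open.

```python
def relative_accessibility(asa: float, max_asa: float) -> float:
    """Accessibility = ASA in the folded protein / maximum ASA for that residue type."""
    return asa / max_asa

def two_state(rel: float, cutoff: float = 0.16) -> str:
    """b (buried) if rel <= cutoff, otherwise e (exposed)."""
    return "b" if rel <= cutoff else "e"

def three_state(rel: float, low: float = 0.16, high: float = 0.36) -> str:
    """b, i (intermediate) or e; a value equal to `high` is counted as exposed (assumption)."""
    if rel <= low:
        return "b"
    if rel < high:
        return "i"
    return "e"

rel = relative_accessibility(20.0, 100.0)  # toy numbers, in square angstroms
print(rel, two_state(rel), three_state(rel))
```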


Accessibility prediction

QHTAW...

QHTAWCLTSEQHTAAVIW

BBPPBEEEEEPBPBPBPB

Accessibility Prediction

  • PredictProtein-PHDacc(58%)

    • http://cubic.bioc.columbia.edu/predictprotein

  • PredAcc (70%?)

    • http://condor.urbb.jussieu.fr/PredAccCfg.html


Phd prediction of rcd2

PHD Prediction of rCD2


3d structure prediction

3D structure prediction


3d structure prediction of proteins

3D structure prediction of proteins

[Figure labels: New folds, Existing folds; Ab initio prediction, Threading, Building by homology; x-axis: similarity (%), 0 to 100]


Choice of prediction methods

Choice of prediction methods

  • If you can find similar sequences of known structure then comparative modelling is the best way to predict structure

    • all other methods are less reliable

  • Of course, you can’t always find similar sequences of known structure.


When you can t do comparative modelling

When you can’t do comparative modelling?

  • Secondary structure prediction

  • Fold recognition/threading

  • Ab initio protein folding approaches


Divergent evolution

Divergent evolution

  • Different proteins in different organisms have diverged from a common ancestor protein

  • Each copy of this ancestor in various organisms has been subject to mutations, deletions, and insertions of amino acids in its sequence

  • In general, its 3-D fold and function have remained similar


Homology modelling of proteins

Homology Modelling of Proteins

  • Prediction of the three-dimensional structure of a target protein from its amino acid sequence (primary structure), using as a template a homologous protein for which an X-ray or NMR structure is available.


Comparative modelling

Comparative modelling

  • Makes a prediction of tertiary structure based on

    • sequences of known structure which are similar to the target sequence (called template structures)

    • an alignment between these and the target sequence

  • Remember: sequence identity of about 25% or more usually means two proteins have the same basic structure


Can and cannot of homology modelling

Can and cannot of homology modelling

  • Best results relative to other methods

  • Unreliable in predicting the conformations of insertions or deletions

  • Comparative models are unlikely to be useful in modelling ligand docking (drug design) unless the sequence identity with the template is >70%, and even then, less reliable than an empirical crystallographic or NMR structure.


What is good comparative model

What is “good” comparative model

  • Take the 3D alignment between the predicted structure A’ and the native structure A.

  • Let a1, …, an be the coordinates of the carbon atoms in the native structure and a’1, …, a’n those in the predicted structure

  • An RMSD below 2 Å is good for homology modelling results.
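The RMSD between the two coordinate sets is the square root of the mean squared distance between corresponding atoms, $\mathrm{RMSD} = \sqrt{\tfrac{1}{n}\sum_{i=1}^{n} \lVert a_i - a'_i \rVert^2}$. This is the standard definition, stated here because the formula itself does not appear in the transcript. A minimal sketch, assuming the structures are already superposed and the atoms are listed in the same order:

```python
import math

def rmsd(native, predicted):
    """Root-mean-square deviation between two equally long lists of (x, y, z) points.

    Assumes the two structures have already been superposed.
    """
    if len(native) != len(predicted) or not native:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (x - y) ** 2
        for p, q in zip(native, predicted)
        for x, y in zip(p, q)
    )
    return math.sqrt(squared / len(native))

# Toy coordinates: every atom of the model is displaced by 1 angstrom along z.
native = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
model = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(native, model))
```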


Factors affecting accuracy

Factors affecting accuracy

  • The accuracy of comparative modelling is controlled by the quality of the alignment between target sequence and template structures

  • Alignment is easier if the sequences are closely related (e.g. sequence identity > 80%).


Homology model

Homology model

Target sequence

Select templates from DB

Align target sequence with template structures

Build a model and evaluate


Homology modelling assumptions

Homology Modelling Assumptions

  • The overall 3-D structure of the target protein is not dissimilar to that of the related proteins.

  • Regions of homologous sequence have similar structure.

  • Residues homologous throughout a family of proteins are conserved structurally.

  • Residues involved in biological activity have similar topology throughout the protein family.

  • Loop regions (non-conserved residues) allow insertions and deletions without disrupting the overall structure of the protein.

  • Loop regions are flexible and therefore need not be constructed as strictly as the conserved regions, assuming that they play no role in biological activity.


Homology modelling of proteins

Homology Modelling of Proteins

  • Steps in Molecular Modelling

  • Identification of structures that will form the template for the target structure (model).

  • Sequence alignment: the most important step. For proteins with low sequence identity to the query protein (below about 30%), the model can be improved by using secondary structure prediction (i.e. align-model-realign-remodel).

  • Transfer the coordinates of the structurally conserved regions (SCRs) from the template(s) to the target

  • - many fragment method

  • - single structure

  • Modelling variable regions.

  • - Loops/insertions: search of a high resolution fragment database

  • - Deletions: local minimization often sufficient.

  • Modelling of side chains

  • - Rotamer database

  • Minimization

  • - Local, especially loop-hinge regions

  • - Global


Model building from template

Model Building from template

Core conserved regions

Protein Fold

Variable Loop regions

Side chains

Calculate the framework from average of all template structures

Multiple templates

Generate one model for each template and evaluate


Model in loops

Database search for 5 residue fragments

5 residue insertion

Anchor points (2 residues)

annealing

Model in loops

  • If it is a short deletion - often local Minimization is sufficient.

  • Insertions:

    • Look for same length in another homologue

    • Search database of short High Resolution fragments

      • Lowest RMSD from Anchor points

      • Best Sequence Homology

      • Least interference with Core structure.


Side chain modelling

Same side-chain (S.C.) conformer taken from template.

Substitution: built based on rotamer library & energetics.

Partial similarity: most of the S.C. built on the template.

Side Chain modelling


Core model with side chains

Core model with side chains


Minimization

Minimization

  • LOCAL: Minimize a fragment. Usually a loop and its anchor regions - as these often have bad geometries. First minimize without influence of surrounding structure then take surrounding structure into account.

  • GLOBAL: Minimize whole protein (& H2O). Mainly to relieve short contacts and to rectify bad geometry, like bond angles, peptide planarity etc.


Errors in models

Errors in Models !!!

  • Incorrect template selection

  • Incorrect alignments

  • Errors in positioning of side-chains and loops


Fold recognition or threading

Fold recognition or threading

  • Aimed at detecting when the target sequence adopts a known fold, even if it has no significant similarity to sequences of known fold


How many folds are there

How many folds are there ?

SCOP: Structural Classification of Proteins. 1.75 release: 38221 PDB entries (23 Feb 2009), 110800 domains, 1 literature reference (excluding nucleic acids and theoretical models)

Source: http://scop.mrc-lmb.cam.ac.uk/scop/count.html


Threading

Threading


Definition

Definition

  • Threading - A protein fold recognition technique that involves replacing the sequence of a known protein structure with a query sequence of unknown structure. The new “model” structure is evaluated using a simple heuristic measure of protein fold quality. The process is repeated against all known 3D structures until an optimal fit is found.


Why threading

Why Threading?

  • Secondary structure is more conserved than primary structure

  • Tertiary structure is more conserved than secondary structure

  • Therefore very remote relationships can be better detected through 2D or 3D structural homology instead of sequence homology


Threading idea

Threading idea

  • Choose a set of candidate structures - templates.

  • Align the sequence of a protein of unknown structure to each template structure.

  • Design a test that will evaluate which template is the most likely candidate for the correct fold for the given sequences. If none is reasonable – be able to recognize it as a possible new fold.


Threading

Threading

  • Database of 3D structures and sequences

    • Protein Data Bank (or non-redundant subset)

  • Query sequence

    • Sequence < 25% identity to known structures

  • Alignment protocol

    • Dynamic programming

  • Evaluation protocol

    • Distance-based potential or secondary structure

  • Ranking protocol


2 kinds of threading

2 Kinds of Threading

  • 2D Threading

    • Prediction Based Methods (PBM)

      • Predict secondary structure (SS) or ASA of query

      • Evaluate on basis of SS and/or ASA matches

  • 3D Threading

    • Distance Based Methods (DBM)

      • Create a 3D model of the structure

      • Evaluate using a distance-based “hydrophobicity” or pseudo-thermodynamic potential


    2d threading algorithm prediction based method

    2D Threading Algorithm (prediction based method)

    • Convert PDB to a database containing sequence, SS and ASA information

    • Predict the SS and ASA for the query sequence

    • Perform a dynamic programming alignment using the query against the database (include sequence, SS & ASA)

    • Rank the alignments and select the most probable fold


    Dynamic programming

    Dynamic Programming

    Initial matrix:

         G   E   N   E   T   I   C   S
    G   10   0   0   0   0   0   0   0
    E    0  10   0  10   0   0   0   0
    N    0   0  10   0   0   0   0   0
    E    0   0   0  10   0  10   0   0
    S    0   0   0   0   0   0   0  10
    I    0   0   0   0   0  10   0   0
    S    0   0   0   0   0   0   0  10

    Summed matrix:

         G   E   N   E   T   I   C   S
    G   60  40  30  20  20   0  10   0
    E   40  50  30  30  20   0  10   0
    N   30  30  40  20  20   0  10   0
    E   20  20  20  30  20  10  10   0
    S   20  20  20  20  20   0  10  10
    I   10  10  10  10  10  20  10   0
    S    0   0   0   0   0   0   0  10


    S ij identity matrix

    Sij (Identity Matrix)

    A C D E F G H I K L M N P Q R S T V W Y

    A1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

    C 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

    D 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

    E 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

    F 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

    G 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0

    H 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0

    I 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0

    K 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0

    L 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0

    M 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0

    N 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0

    P 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0

    Q 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0

    R 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0

    S 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0

    T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0

    V 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0

    W 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0

    Y 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1


    A simple example

    A Simple Example...

        A  A  T  V  D
    A   1  1  0  0  0
    V   0  1  1  2  1
    V   0  1  1  2  2
    D   0  1  1  1  3

    A A T V D

    | | | |

    A - V V D

    A A T V D

    | | | |

    A V V D

    A A T V D

    | | | |

    A V - V D
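The filled matrix in this example can be reproduced with a short sketch. It assumes, as an inference from the numbers rather than something the slides state, that each cell holds its own identity score plus the best value found anywhere strictly above and to the left of it, with no gap penalty. The function names are illustrative.

```python
def identity(a: str, b: str) -> int:
    """Identity scoring: 1 for a match, 0 otherwise."""
    return 1 if a == b else 0

def fill_matrix(rows: str, cols: str, score=identity):
    """Cumulative score matrix without gap penalties.

    H[i][j] = score(rows[i], cols[j]) + max of all H[a][b] with a < i and b < j.
    """
    n, m = len(rows), len(cols)
    H = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            best_prev = max(
                (H[a][b] for a in range(i) for b in range(j)), default=0
            )
            H[i][j] = score(rows[i], cols[j]) + best_prev
    return H

matrix = fill_matrix("AVVD", "AATVD")
for residue, row in zip("AVVD", matrix):
    print(residue, *row)
```

The largest entry is the score of the best alignment; tracing back from it through the best predecessor cells yields the alignments shown. The quadruple loop is chosen for clarity, not speed.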


    Let s include 2 d info asa

    Let’s Include 2D info & ASA

    Sij (strc):

        H E C
      H 1 0 0
      E 0 1 0
      C 0 0 1

    Sij (asa):

        E P B
      E 1 0 0
      P 0 1 0
      B 0 0 1

    $S_{ij}^{total} = k_1 S_{ij}^{strc} + k_2 S_{ij}^{asa} + k_3 S_{ij}^{seq}$
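The combined score is a weighted sum of three identity comparisons. A minimal sketch, assuming each position is described by a (residue, secondary structure, accessibility) triple; the tuple layout, the function names and the default weights of 1 are illustrative choices, not taken from the slides.

```python
def match(a: str, b: str) -> int:
    """Identity score: 1 if the two symbols are equal, else 0."""
    return 1 if a == b else 0

def combined_score(x, y, k_strc=1, k_asa=1, k_seq=1):
    """Weighted sum of structure, accessibility and sequence identity scores.

    x and y are (residue, secondary_structure, accessibility) triples.
    """
    return (
        k_strc * match(x[1], y[1])
        + k_asa * match(x[2], y[2])
        + k_seq * match(x[0], y[0])
    )

# Toy positions: residue, secondary structure (H/E/C), accessibility (E/P/B).
query = ("A", "E", "B")
print(combined_score(query, ("A", "E", "B")))            # all three match
print(combined_score(query, ("T", "E", "P")))            # only the structure matches
print(combined_score(query, ("A", "E", "P"), k_asa=0))   # accessibility ignored
```

Setting a weight to zero switches that term off, which is how a score using only sequence and secondary structure can be obtained.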


    A simple example

    A Simple Example...

            E  E  E  C  C
            A  A  T  V  D
    E  A    2  2  1  0  0
    E  V    1  3  3  3  2
    C  V    0  2  3  5  4
    C  D    0  2  3  4  7

    A A T V D

    | | | |

    A - V V D

    A A T V D

    | | | |

    A V V D

    A A T V D

    | | | |

    A V - V D


    2d threading performance

    2D Threading Performance

    • In test sets 2D threading methods can identify 30-40% of proteins having very remote homologues (i.e. not detected by BLAST) using “minimal” non-redundant databases (<700 proteins)

    • If the database is expanded ~4x the performance jumps to 70-75%


    2d threading advantages

    2D Threading Advantages

    • Algorithm is easy to implement

    • Algorithm is very fast (10x faster than 3D threading approaches)

    • The 2D database is small (<500 kbytes) compared to 3D database (>2 Gbytes)

    • Appears to be just as accurate as DBM or other 3D threading approaches

    • Very amenable to web servers


    2d threading disadvantages

    2D Threading Disadvantages

    • Reliability is not 100% making most threading predictions suspect unless experimental evidence can be used to support the conclusion

    • Does not produce a 3D model at the end of the process

    • Doesn’t include all aspects of 2D and 3D structure features in prediction process


    Servers predictprotein

    Servers - PredictProtein


    Servers psipred

    Servers - PSIPRED


    Servers libra i

    Servers - LIBRA I


    More servers www bronco ualberta ca

    More Servers - www.bronco.ualberta.ca


    Force fields

    Force Fields

    • Molecular Mechanics

    • Statistical or Knowledge based


    Molecular mechanic force field

    Molecular Mechanic Force Field

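    • $E_{FF} = E_{str} + E_{bend} + E_{tors} + E_{oop}$ (bonded terms)

      • $+\; E_{vdw} + E_{el} + E_{hb}$ (non-bonded terms)

    • $+\; E_{str-str} + E_{str-bnd} + E_{str-tor} + E_{bnd-bnd} + E_{bnd-tor}$ (cross terms)

    $E_{str} = \sum_i k_{b,i} (b_i - b_0)^2$ (bond length)

    $E_{bend} = \sum_i k_{\theta,i} (\theta_i - \theta_0)^2$ (bond angle)

    $E_{tors} = \sum_i k_{\varsigma,i} \cos(3\varsigma_i + \gamma_0)$ (torsion angle)

    $E_{oop} = \sum_i k_{imp} (\chi - \chi_0)^2$ (improper quadratic out of plane)

    $E_{vdw} = \sum_i \sum_j \left( A_{ij} d_{ij}^{-6} + B_{ij} d_{ij}^{-12} \right)$ (van der Waals interaction)

    $E_{el} = \sum_i \sum_j v_i v_j / (\varepsilon d_{ij})$ (electrostatic interaction)

    $E_{hb} = \sum_i \sum_j \varepsilon \left[ 5 (R_0/R_{ij})^{12} - 6 (R_0/R_{ij})^{10} \right]$ (hydrogen bond)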
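Two of these terms are simple enough to evaluate by hand, which helps show how a force field turns geometry into an energy. The sketch below implements only the bond-stretch and electrostatic terms. All parameter values are made up, a single force constant and reference length are shared by all bonds for brevity, and each atom pair is counted once (i < j), which is an assumption about how the double sum is meant.

```python
def stretch_energy(lengths, k, b0):
    """E_str = sum_i k * (b_i - b0)^2, with one shared k and b0 (simplification)."""
    return sum(k * (b - b0) ** 2 for b in lengths)

def electrostatic_energy(charges, distances, eps=1.0):
    """E_el = sum over pairs i < j of v_i * v_j / (eps * d_ij)."""
    total = 0.0
    n = len(charges)
    for i in range(n):
        for j in range(i + 1, n):
            total += charges[i] * charges[j] / (eps * distances[i][j])
    return total

# Hypothetical values in arbitrary units.
print(stretch_energy([1.1, 0.9], k=100.0, b0=1.0))
print(electrostatic_energy([1.0, -1.0], [[0.0, 2.0], [2.0, 0.0]]))
```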


    Molecular mechanic force field

    Molecular Mechanic Force Field

    • AMBER

    • CHARMM

    • GROMACS

    • ...

    • Differences

      • Terms of energy

      • Parameters

    Cornell WD, Cieplak P, Bayly CI, Gould IR, Merz KM Jr, Ferguson DM, Spellmeyer DC, Fox T, Caldwell JW, Kollman PA . A Second Generation Force Field for the Simulation of Proteins, Nucleic Acids, and Organic Molecules. J. Am. Chem. Soc. 1995 117: 5179–5197.

    Brooks BR, Bruccoleri RE, Olafson BD, States DJ, Swaminathan S, Karplus M: CHARMM: A program for macromolecular energy, minimization, and dynamics calculations. J Comput Chem 1983, 4:187-217

    Van Der Spoel D, Lindahl E, Hess B, Groenhof G, Mark AE, Berendsen HJ . GROMACS: fast, flexible, and free. J Comput Chem 2005, 26 (16): 1701–18


    Statistical force field

    Statistical Force Field

    • Derived from an analysis of known structures in the Protein Data Bank

    Tanaka and Scheraga (1976) : The idea of using Boltzmann distribution to find knowledge-based force field

    Schueler-Furman O, Wang C, Bradley P, Misura K, Baker D: Progress in modeling of protein structures and interactions. Science 2005, 310:638-642.

    Bradley P, Misura KM, Baker D: Toward high-resolution de novo structure prediction for small proteins. Science 2005, 309:1868-1871.


    Statistical force field

    Statistical Force Field

    • Reduce protein structure

    • Distribution of:

      • Distances

      • Angles

      • ASA

    Lazaridis T, Karplus M: Effective energy functions for protein structure prediction. Curr Opin Struct Biol 2000, 10:139-145.

    Bauer A, Beyer A: An improved pair potential to recognize native protein folds. Proteins Struct Funct Genet 1994, 18:254-261.

    Jernigan RL, Bahar I: Structure-derived potentials and protein simulations. Curr Opin Struct Biol 1996, 6:195-209.

    Melo F, Feytmans E: Assessing protein structures with a non-local atomic interaction energy. J Mol Biol 1998, 277:1141-1152.

    Melo F, Sanchez R, Sali A: Statistical potentials for fold assessment. Protein Sci 2002, 11:430-448.

    Tobi D, Elber R: Distance-dependent, pair potential for protein folding: Results from linear optimization. Proteins Struct Funct Genet 2000, 41:40-46.

    Sippl MJ: Knowledge-based potentials for proteins. Curr Opin Struct Biol 1995, 5:229-235.

    Covell DG: Folding protein α-carbon chains into compact forms by Monte Carlo methods. Proteins Struct Funct Genet 1992, 14:409-420.

    Sun S: Reduced representation model of protein structure prediction: statistical potential and genetic algorithms. Protein Sci 1993, 2:762-785.


    Statistical force field

    Statistical Force Field

    • P(c) ∝ exp(−βE(c))


    Contact potential calculation 1

    Contact Potential Calculation - 1

    • Interaction energy between AAs

    • E(interaction) = -kT ln(frequency of interaction)

      • k: Boltzmann constant

      • T: temperature (in K; 273 K = 0 ºC)

      • Frequency of interaction: measured in database of known struct.

    • More frequent ⇒ more favourable


    Energy based on contact potentials jones

    “energy” based on contact potentials (Jones)

    • Pairwise contact potentials:

    • ΔEab(s) = -kT ln (fab(s)/f(s))

      • s : separation length

      • fab(s): frequency of occurrence of a, b with separation s

      • f(s): frequency of the separation

    • Define energy of a structure as the sum over all pairwise contact potentials.
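The pairwise term is a log-ratio of observed to reference frequencies, and the structure energy is the sum of these terms. A minimal sketch with made-up frequencies; kT is set to 1 so that energies are in units of kT, and the function names are illustrative.

```python
import math

def contact_energy(f_ab: float, f_ref: float, kT: float = 1.0) -> float:
    """dE_ab(s) = -kT * ln(f_ab(s) / f(s))."""
    return -kT * math.log(f_ab / f_ref)

def structure_energy(pair_frequencies, kT: float = 1.0) -> float:
    """Sum of the pairwise terms; each item is (f_ab(s), f(s)) for one residue pair."""
    return sum(contact_energy(f_ab, f_ref, kT) for f_ab, f_ref in pair_frequencies)

# Hypothetical frequencies: the first pair is seen twice as often as the reference,
# the second half as often.
print(contact_energy(0.2, 0.1))    # negative: favourable
print(contact_energy(0.05, 0.1))   # positive: unfavourable
print(structure_energy([(0.2, 0.1), (0.05, 0.1)]))
```

Pairs observed more often than the reference get a negative (favourable) energy, matching "more frequent, more favourable".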


    Limitation of contact potential method

    Limitation of Contact Potential Method

    • The energy associated with an isolated AA pair is assumed to be similar to that found in known protein structures

    • Modification: the conformation energy of groups of AAs larger than 2 may provide a more reliable prediction


    Ab initio prediction

    Ab Initio Prediction

    • Predicting the 3D structure without any “prior knowledge”

    • Used when homology modelling or threading have failed (no homologues are evident)

    • Equivalent to solving the “Protein Folding Problem”

    • Still a research problem


    Ab initio protein folding

    Ab initio protein folding

    • Aims to predict tertiary structure from basic physico-chemical principles

      • does not rely on any detection of similarity to sequences of known structure

    • An important scientific question

    • As yet very unreliable for practical predictions


    Some ab initio methods

    Some Ab Initio Methods

    • Molecular Dynamic Simulation

      • Using complex energy functions, simulate folding of the primary sequence until it reaches its native state (1D->3D)

    • Genetic Algorithm

      • Used in refining a given potential function so that it can best predict the native state of a protein

    • Simulated Annealing

    • Branch and Bound Methods (usually used in side-chain conformation)


    Input

    INPUT

    • Sequence of amino acids

    • The chemical structures of amino acids and peptide backbone

      • constituent atoms

      • bond lengths, angles

      • constraints on dihedral angles

    • The properties of the media (water molecules, anions, cations, other molecules…)


    Output

    OUTPUT

    • 3D coordinates of atoms in the protein (or some equivalent representation)

    • We are also willing to accept partial information:

      • 3D structure of active site only

      • Location (in sequence) of secondary structures

      • Prediction of the “class” or “family” of the protein


    Is problem hard

    Is the problem hard?

    • YES.

    • Huge Search Space:

      • Assume each amino acid can adopt one of three conformations (alpha, beta, coil); then a chain of 100 amino acids has 3^100 ≈ 5 x 10^47 possible folds.

      • If we sample one fold every 10^-13 seconds, trying them all would take about 10^27 years.

      • The universe is about 10^10 years old.

    • Difficult criterion for the “correct fold”:

      • Interactions of thousands of atoms with each other, with the surrounding water, and with surrounding molecules.
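The search-space arithmetic above can be checked with a few lines. The three states per residue and the 10^-13 s sampling time come from the slide; the year length of 365.25 days and the function name are assumptions made for this sketch.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumed year length


def exhaustive_search_years(n_residues, states_per_residue=3, seconds_per_fold=1e-13):
    """Return (number of folds, years needed to sample every fold once)."""
    folds = states_per_residue ** n_residues
    years = folds * seconds_per_fold / SECONDS_PER_YEAR
    return folds, years


folds, years = exhaustive_search_years(100)
print(f"{folds:.1e} possible folds, {years:.1e} years to enumerate")
```

Both numbers land on the orders of magnitude quoted on the slide, roughly 10^17 times the age of the universe.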


    Can it be done

    Can it be done?

    • YES.

      • Nature does it all the time.

      • Most real proteins fold in milliseconds to seconds.

    • THUS

      • Nature cannot be sampling all conformations.

      • Nature knows the correct criterion.


    Potential energy function

    Potential Energy Function

    • How do we know when a predicted structure is the native shape of the protein?

    • In thermodynamics, a molecule is most stable when its free energy is at a minimum, so the native shape is at a free energy minimum

    • The potential energy function is a simplification of the actual forces acting on a real protein molecule, and its formulation is based on the given simplified structural model


    Polypeptides can be

    Polypeptides can be...

    • Represented by a range of approaches or approximations including:

      • all atom representations in cartesian space

      • all atom representations in dihedral space

      • simplified atomic versions in dihedral space

      • tube/cylinder/ribbon representations

      • lattice models


    Ab initio folding

    Ab Initio Folding

    • Two Central Problems

      • Sampling conformational space (10^100)

      • The energy minimum problem

    • The Sampling Problem (Solutions)

      • Lattice models, off-lattice models, simplified chain methods

    • The Energy Problem (Solutions)

      • Threading energies, simplified force fields, packing assessment, topology assessment


    A simple 2d lattice

    A Simple 2D Lattice

    3.5Å


    Lattice folding

    Lattice Folding


    Lattice algorithm

    Lattice Algorithm

    1) Build an “n x m” matrix (a 2D array)

    2) Choose an arbitrary point as your N terminal residue (start residue)

    3) Add or subtract “1” from the x or y position of the most recently placed residue (initially the start residue)

    4) Check to see if the new point (residue) is off the lattice or is already occupied; if so, discard the move and try another

    5) Evaluate the energy

    6) Go to step 3) and repeat until done
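The chain-building steps above can be sketched as a random self-avoiding walk. Two choices here are assumptions rather than part of the slide: a walk that traps itself is abandoned and restarted, and the energy evaluation step is left out so the sketch only covers placement. Function names are hypothetical.

```python
import random

MOVES = [(1, 0), (-1, 0), (0, 1), (0, -1)]  # add or subtract 1 from x or y


def grow_chain(length, n, m, rng):
    """Try to place `length` residues on an n x m lattice; None if the walk gets stuck."""
    start = (rng.randrange(n), rng.randrange(m))  # arbitrary N-terminal residue
    chain, occupied = [start], {start}
    while len(chain) < length:
        x, y = chain[-1]
        # Keep only moves that stay on the lattice and hit a free site.
        free = [(x + dx, y + dy) for dx, dy in MOVES
                if 0 <= x + dx < n and 0 <= y + dy < m
                and (x + dx, y + dy) not in occupied]
        if not free:
            return None  # dead end: every neighbouring site is blocked
        nxt = rng.choice(free)
        chain.append(nxt)
        occupied.add(nxt)
    return chain


def build_conformation(length, n, m, max_tries=1000, seed=None):
    """Restart the walk until one attempt places every residue."""
    rng = random.Random(seed)
    for _ in range(max_tries):
        chain = grow_chain(length, n, m, rng)
        if chain is not None:
            return chain
    raise RuntimeError("no self-avoiding chain found; try a larger lattice")


chain = build_conformation(10, 6, 6, seed=1)
print(chain)
```

Restarting on a dead end is the simplest policy; backtracking one residue at a time would waste fewer attempts on crowded lattices.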


    Lattice energy algorithm

    Lattice Energy Algorithm

    • Red = hydrophobic, Blue = hydrophilic

    • If Red is near empty space E = E+1

    • If Blue is near empty space E = E-1

    • If Red is near another Red E = E-1

    • If Blue is near another Blue E = E+0

    • If Blue is near Red E = E+0
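A minimal sketch of these scoring rules on a square 2D lattice. The slide does not define "near", so several points are assumptions: "near" means one of the four adjacent lattice sites, any unoccupied adjacent site counts as empty space, Red-Red contacts are counted once per pair, and residues adjacent in the chain are not counted as a contact. The `"R"`/`"B"` encoding and the function name are hypothetical.

```python
NEIGHBOURS = [(1, 0), (-1, 0), (0, 1), (0, -1)]


def lattice_energy(chain, colours):
    """Score a conformation: chain is a list of (x, y), colours a string of 'R'/'B'."""
    index_at = {pos: i for i, pos in enumerate(chain)}
    energy = 0
    for i, (x, y) in enumerate(chain):
        for dx, dy in NEIGHBOURS:
            j = index_at.get((x + dx, y + dy))
            if j is None:
                # Empty neighbouring site: penalise exposed Red, reward exposed Blue.
                energy += 1 if colours[i] == "R" else -1
            elif j > i + 1 and colours[i] == "R" and colours[j] == "R":
                # Non-bonded Red-Red contact, counted once from the lower index.
                energy -= 1
            # Blue-Blue and Blue-Red contacts add 0.
    return energy


# A 4-residue chain folded into a 2 x 2 square; residues 0 and 3 are in contact.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
print(lattice_energy(square, "RBBR"))  # -1
```

In the square, each residue has two empty neighbours, so the two Reds add +4 and the two Blues add -4, and the single Red-Red contact brings the total to -1.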


    More complex lattices

    More Complex Lattices


    3d lattices

    3D Lattices


    Really complex 3d lattices

    Really Complex 3D Lattices

    J. Skolnick


    Lattice methods

    Lattice Methods

    Advantages

    • Easiest and quickest way to build a polypeptide

    • More complex lattices allow reasonably accurate representation

    Disadvantages

    • At best, only an approximation to the real thing

    • Does not allow accurate constructs

    • Complex lattices are as “costly” as the real thing


    The casp contest

    The CASP “contest”

    • CASP is a blind prediction contest. There is a set of structures that are crystallized but not published.

    • The predictors attempt to predict their structures.

    • The results are compared.

    • http://predictioncenter.org/casp[1,2,3,4,5,6,7,8,9]/

