
Extracting and Exploiting Structural Patterns in Proteins, especially Relating to Function

Janet Thornton

James Watson, Roman Laskowski - EBI

Adel Golovin, Kim Henrick - EBI MSD

David Leader, James Milner-White – Glasgow

Andrzej Joachimiak, Aled Edwards – MCSG

(Mid-West Centre for Structural Genomics)



Outline

  • Structural Motifs

    • PDBsum

    • MSDmotif

  • Functional Motifs

    • Catalytic Site Atlas

    • DNA Binding Motifs

    • Automated templates

    • Reverse Templates

  • From Structure to Function? - ProFunc



Structural Motifs

Structural motifs are commonly occurring small sections of proteins – that are distinguished by:

Sequence – Gly-X-Gly

Conformation – , angles

Secondary structure - helix, bab unit

Function – catalytic triad, calcium binding site



Examples of Structural Motifs

AlphaBeta Motif

Beta Turn

Schellman Loop

Beta Bulge (classic)

Nest

Beta Bulge Loop



Structural Motifs

They may be continuous along the chain (e.g. GXG) or discontinuous (e.g. catalytic triad)

Historically motifs were identified and analysed in an effort to understand the relationship between protein sequence and structure, to improve prediction methods. They are also used to assign function (Prosite).

Many motifs can now be recognised automatically from coordinates, using programmes such as DSSP and Promotif

PDB files can be annotated with these structural motifs e.g. in PDBsum



http://www.ebi.ac.uk/thornton-srv/databases/pdbsum/

Roman Laskowski



Example page



Protein detail


MSD motif: http://www.ebi.ac.uk/msd-srv/msdmotif

Adel Golovin

Currently alpha test

Full Release probably ~Oct 2005

PDB: 1gci



MSD motif

Small 3D motifs from J.Milner-White search/view

Secondary structure patterns (HTH) search/view

Phi/Psi angle based search/view

Ligands and their environment search/view

Catalytic sites search/view

Blast sequence search/view

Prosite compliant patterns search/view

3D multiple alignment



MSDmotif options



Small motifs

Alpha-Beta Motif

Nest

ST staple

11 motifs in total (Prof James Milner-White)

http://doolittle.ibls.gla.ac.uk:9006/david/ProteinMotifDB.html



Motifs In MSDmotif (1)

AlphaBeta Motif

Beta Turn

Schellman Loop

Beta Bulge (classic)

Nest

Beta Bulge Loop



Motifs In MSDmotif (2)

Asx Motif

ST Motif

Asx Turn

ST Turn

ST Staple


Statistics provided by MSDmotif: ST motif

  a) Amino acid occurrence at each position

  b) Correlation between side chain charge and residue position

  c) Motif parameter variation



Hit List after clicking


Small motifs – 3D alignment from different families

ST-staple



MSDmotif options


Secondary structure patterns

Strand – turn – Strand, 2-3 residues gap

Glycosylation pattern N{P}[ST]{P}, where N binds sugar: Man or Nag
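The glycosylation pattern above follows PROSITE conventions: `{P}` means any residue except proline and `[ST]` means serine or threonine. As a minimal sketch of how such a pattern could be scanned against a sequence, the snippet below converts it to a regular expression. The function names and the example sequence are hypothetical and are not part of MSDmotif.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    regex = pattern.replace("-", "")                      # '-' only separates elements
    regex = regex.replace("{", "[^").replace("}", "]")   # {P} = any residue except P
    # x(3) or x(2,4) are repeat counts; rewrite them as {3} or {2,4}
    regex = re.sub(
        r"\((\d+)(?:,(\d+))?\)",
        lambda m: "{" + m.group(1) + ("," + m.group(2) if m.group(2) else "") + "}",
        regex,
    )
    return regex.replace("x", ".").replace("X", ".")      # x = any residue

def find_sites(pattern: str, sequence: str):
    """Return (1-based start, matched segment) for every, possibly overlapping, match."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]

# Hypothetical toy sequence, for illustration only
sites = find_sites("N{P}[ST]{P}", "MKNGSANPTQNATA")
print(sites)
```

The lookahead wrapper is a design choice that lets overlapping sites be reported; a plain `finditer` would skip a site that starts inside a previous match.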


Phi/Psi angle search

PDB:1gci

Ideal for short loops search



Example of a search using MSDmotif

PDB:1gci

Subtilases family

PDB:1f5p

Globins family

Phi/Psi Search using MSDmotif

+ Other Subtilases

Calcium binding site



Sequence search

ZN binding pattern: CXXCXXXFXXXXXLXXHXXXH



3D alignment



MSD motif

  • Available in alpha version

    • http://www.ebi.ac.uk/msd-srv/msdmotif

  • Will be published later this year

    • Incremental weekly update

    • 20 GB disk space on Oracle DB, growing linearly at ~0.8 MB per PDB entry

  • Web application server with J2EE servlet engine

  • NCBI Blast

Adel Golovin

Kim Henrick



Outline

  • Structural Motifs

    • PDBsum

    • MSDmotif

  • Functional Motifs

    • Catalytic Site Atlas

    • DNA Binding Motifs

    • Automated templates

    • Reverse Templates

  • From Structure to Function? - ProFunc



Catalytic Site Atlas

  • Taken from primary literature:

    • β-lactamase Class A

    • EC: 3.5.2.6

    • PDB: 1btl

    • Reaction: -lactam + H2O  -amino acid

    • Active site residues: S70, K73, S130, E166

    • Plausible mechanism:



The Catalytic Site Atlas: a resource of catalytic sites and residues identified in enzymes using structural data.

Craig T. Porter, Gail J. Bartlett, and Janet M. Thornton

Nucl. Acids. Res. 2004 32: D129-D133.

http://www.ebi.ac.uk/thornton-srv/databases/CSA



  • Annotates catalytic residues in the PDB

  • Based on a dataset of 514 enzyme families

    • Representative catalytic site for each family

    • Homologues assigned by Psi-BLAST

    • Limited substitution allowed.

    • Homologues updated monthly.

  • Literature references

  • Data also available via MSDsite

  • http://www.ebi.ac.uk/thornton-srv/databases/CSA

  • http://www.ebi.ac.uk/msd-srv/msdsite



3-D templates

  • Use 3D templates to describe the active site of the enzyme

    • analogous to 1-D sequence motifs such as PROSITE, but in 3-D

  • Sequence position independent

  • Captures essence of functional site in protein



Pepsin



Aspartic Proteinase - Active Site residues - [DTG]x2

Eukaryotic & Fungal Aspartic Proteinases:

all-atom DTG-DTG Template


Aspartic Proteases: Active Site Template

A template of 8 atoms is sufficient to identify all Aspartic Proteinases

Template atoms (figure labels): Asp CO2, Gly C, Asp O, Gly C, Thr/Ser O, Thr O



Aspartic Protease Template Search against all PDB

green= true

red=false


TEmplate Search and Superposition (TESS)

Wallace et al., 1997

  • defines a functional site as a sequence-independent set of atoms in 3-D space

  • search a new structure for a functional site

  • search a database of structures for similar clusters

e.g. serine proteinase, catalytic triad
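The idea of a functional site as a sequence-independent set of atoms can be illustrated with a small brute-force matcher. This is only a sketch under stated assumptions: it is not the TESS algorithm itself, and the atom tuples, the 0.5 Å tolerance and the toy coordinates are all hypothetical. It accepts a set of structure atoms when their labels match the template and every pairwise distance agrees within the tolerance.

```python
from math import dist
from typing import List, Tuple

# (residue type, atom name, xyz coordinates)
Atom = Tuple[str, str, Tuple[float, float, float]]

def match_template(template: List[Atom], structure: List[Atom],
                   tol: float = 0.5) -> List[List[int]]:
    """Return lists of structure indices whose labels and pairwise distances match the template."""
    matches: List[List[int]] = []

    def extend(chosen: List[int]) -> None:
        k = len(chosen)
        if k == len(template):
            matches.append(list(chosen))
            return
        res, name, xyz = template[k]
        for i, (s_res, s_name, s_xyz) in enumerate(structure):
            if i in chosen or (s_res, s_name) != (res, name):
                continue
            # Compare distances to every atom already placed; residue order in
            # the sequence is never consulted.
            if all(abs(dist(s_xyz, structure[j][2]) - dist(xyz, template[t][2])) <= tol
                   for t, j in enumerate(chosen)):
                extend(chosen + [i])

    extend([])
    return matches

# Hypothetical toy coordinates, not taken from any PDB entry
template = [("SER", "OG", (0, 0, 0)), ("HIS", "NE2", (3, 0, 0)), ("ASP", "OD1", (3, 4, 0))]
structure = [("ASP", "OD1", (13, 14, 10)), ("GLY", "CA", (0, 0, 0)),
             ("HIS", "NE2", (13, 10, 10)), ("SER", "OG", (10, 10, 10)),
             ("SER", "OG", (30, 30, 30))]
print(match_template(template, structure))
```

Because only distances are compared, a mirror-image arrangement would also match; a real template search would follow this with a superposition step to reject such cases.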



Serine Proteinase templates

  • A trypsin-based template of 7 atoms was able to identify almost all serine proteinases in PDB- including subtilisin

  • It also identified active sites of several other functionally distinct enzyme families - serine carboxypeptidase, acetylcholine esterase; lipase; dehalogenase

  • The catalytic triad has evolved independently many times



Active site convergence

Trypsin

Subtilisin



Trypsin

Subtilisin

Alpha/beta hydrolase

Brain platelet activating factor acetylhydrolase

Clp protease

CheB methylesterase


3D Templates to Characterise Functional Sites

Template searches

(189 enzyme active site templates)

(~600 Metal binding site templates)


Database of enzyme active site templates

189 templates

Examples: GARTfase; Cholesterol oxidase; IIAglc histidine kinase; Carbamoylsarcosine amidohydrolase; Ser-His-Asp catalytic triad; Dihydrofolate reductase



DNA

Protein

+



DNA-binding Motifs

  • Helix-Turn-Helix (HTH)

    • Standard HTH

    • Winged helix

  • Beta Sheet

  • Zinc-finger



Prediction of DNA Binding Function using Structural Motifs

  • Predicting function from structure

  • Structural motifs

  • Helix-Turn-Helix (HTH)

  • Bind in major groove

  • Carboxyl terminal helix - DNA recognition

  • Found in about 1/3 of DNA-binding protein families (16/54)

  • Brennan and Matthews, 1989; Brennan, 1991



HTH Motif Proteins

Catabolic activator protein (1ber)

Lambda repressor/operator complex (1lmb)



HTH Motif Templates

3D template library

(E.g. 1berA16-36)



Predicting DNA binding function

  • Scanning template library against 3D structures

  • One template T (length n) is scanned against protein P (length m); the RMSD after optimal superposition is calculated at each of the m-n+1 possible positions in P

  • Calculate lowest RMSD for optimal superposition
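The scan described above can be sketched as a sliding window over the protein coordinates. To keep the example dependency-free, the snippet uses distance RMSD (which compares internal distances and needs no superposition) as a stand-in for the superposition RMSD named in the slides; that substitution, the function names and the toy coordinates are assumptions for illustration only.

```python
from math import dist, sqrt

def drmsd(a, b):
    """Distance RMSD between two equal-length coordinate lists (needs at least 2 points)."""
    total, pairs = 0.0, 0
    for i in range(len(a)):
        for j in range(i + 1, len(a)):
            total += (dist(a[i], a[j]) - dist(b[i], b[j])) ** 2
            pairs += 1
    return sqrt(total / pairs)

def scan(template, protein):
    """Slide a template of length n along a protein of length m (m - n + 1 windows).

    Returns (best window start, lowest score, number of windows)."""
    n, m = len(template), len(protein)
    scores = [drmsd(template, protein[s:s + n]) for s in range(m - n + 1)]
    best = min(range(len(scores)), key=scores.__getitem__)
    return best, scores[best], len(scores)

# Hypothetical toy coordinates; window 2 of the "protein" has the template's shape
template = [(0, 0, 0), (1, 0, 0), (1, 1, 0)]
protein = [(5, 5, 5), (9, 5, 5), (9, 9, 5), (10, 9, 5), (10, 10, 5), (14, 10, 5)]
best_start, best_score, n_windows = scan(template, protein)
print(best_start, best_score, n_windows)
```

Replacing `drmsd` with a least-squares superposition RMSD would recover the measure used in the slides without changing the scanning loop.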



Ideal RMSD distribution


RMSD Distributions with HTH templates

RMSD threshold: 1.2Å

831/23,506 = 3.5% false positives

2/142 = 1.4% false negatives



HTH Motif Extended Templates

  • Extend templates by adding +2 residues to start and end

  • 1berA16-36

  • 1berA14-38


RMSD Distributions with extended HTH templates

RMSD threshold: 1.2Å

110/23,506 = 0.5% false positives

2/144 = 1.4% false negatives



Comparison of RMSD Distributions


HTH Accessible Surface Area

Data Set                 Min (Å2)   Max (Å2)   Mean (Å2)
HTH Proteins (144)       990        2740       1732
False Positives (110)    856        2747       1264

ASA threshold 990Å2 reduced false positives from 110 to 80

False positive rate of 0.3% (80/23506)



Summary

  • Structural template library of 144 HTH motifs

  • Minimum RMSD for optimal superpositions on whole protein structures, based on Cα coordinates

  • Thresholds of 1.2 Å RMSD and 990 Å2 ASA

  • Hit rate of 98.6% & false positive rate of 0.3%

  • Recognition across sequence families and fold families
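The two thresholds and the quoted rates can be checked with a few lines of arithmetic. In this sketch the counts (142 of 144 HTH proteins recovered, 80 false positives among 23,506 structures) come from the slides, while the function names and the choice of inclusive comparisons are assumptions.

```python
RMSD_MAX = 1.2    # Å, threshold quoted in the slides
ASA_MIN = 990.0   # Å2, threshold quoted in the slides

def is_hth_hit(rmsd: float, asa: float) -> bool:
    """Accept a template match only if it passes both thresholds (inclusive by assumption)."""
    return rmsd <= RMSD_MAX and asa >= ASA_MIN

def percent(count: int, total: int) -> float:
    """Percentage rounded to one decimal place."""
    return round(100.0 * count / total, 1)

print(percent(142, 144))     # hit rate with extended templates
print(percent(831, 23506))   # false positives, original templates
print(percent(110, 23506))   # false positives, extended templates
print(percent(80, 23506))    # false positives after the ASA filter
```

A match with RMSD 1.0 Å but ASA 900 Å2 would be rejected by `is_hth_hit`, which is how the ASA filter removes some of the remaining false positives.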



Template databases

  • HAND CURATED

    • Enzyme active sites (PROCAT) – 189 templates

      • Currently being extended

    • Metal-binding sites – 600 templates

  • AUTOMATED

    • Ligand-binding sites – 10,000 templates

    • DNA-binding sites – 800 templates


Automatically generated templates

1. Ligand-binding templates

a. For each Het Group in the PDB extract a non-homologous data set of proteins binding that Het Group

b. Identify residues interacting with ligand (via H-bonds or non-bonded contacts)

c. Templates generated from overlapping local groups of 3-residue clusters

d. Gives over 10,000 ligand-binding templates
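Steps b and c can be sketched as follows. This is an illustrative outline only: the 4.0 Å contact cut-off, the locality cut-off, the residue identifiers and the toy coordinates are all hypothetical, and real contact detection would distinguish H-bonds from non-bonded contacts.

```python
from itertools import combinations
from math import dist

def interacting_residues(residue_atoms, ligand_atoms, cutoff=4.0):
    """Residues with at least one atom within `cutoff` Å of any ligand atom."""
    return sorted(r for r, atoms in residue_atoms.items()
                  if any(dist(a, l) <= cutoff for a in atoms for l in ligand_atoms))

def three_residue_templates(residue_atoms, residues, local=8.0):
    """All 3-residue clusters whose members are mutually within `local` Å."""
    def close(r1, r2):
        return any(dist(a, b) <= local
                   for a in residue_atoms[r1] for b in residue_atoms[r2])
    return [trio for trio in combinations(residues, 3)
            if all(close(a, b) for a, b in combinations(trio, 2))]

# Hypothetical toy data: one ligand atom at the origin, five one-atom residues
ligand = [(0, 0, 0)]
residues = {"R1": [(3, 0, 0)], "R2": [(0, 3, 0)], "R3": [(0, 0, 3)],
            "R4": [(-3, 0, 0)], "R5": [(20, 0, 0)]}
contacts = interacting_residues(residues, ligand)
templates = three_residue_templates(residues, contacts, local=5.0)
print(contacts, templates)
```

In the toy data the two resulting clusters share R2 and R3, which mirrors the "overlapping local groups" wording of step c.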


Automatically generated templates

2. DNA-binding templates

a. Extract a non-homologous data set of DNA/RNA-binding proteins from the PDB

b. Identify residues interacting with DNA/RNA (via H-bonds or non-bonded contacts)

c. Templates generated from overlapping local groups of 3-residue clusters

d. Gives over 800 DNA/RNA-binding templates



Problems with automated template methods

  • WITH A LARGE NUMBER OF TEMPLATES:

  • Too many hits (usually tens, and often hundreds)

  • Use of rmsd rarely discriminates true from false positives

  • Local distortion in structure may give a large rmsd

  • Top hit rarely the correct hit – even in “obvious” cases


An example

PDB code: 1hsk

UDP-N-acetylenolpyruvoylglucosamine reductase (MURB), E.C.1.1.1.158

Contains the 3D template that characterises this enzyme class

Sequence identity to template’s representative structure (1mbb) is 28%

Template residues: Ser, Arg, Glu


Enzyme active site templates: hits for 1hsk

Hit    E.C. number      Rmsd     Enzyme
1.     E.C.1.3.99.2     0.76Å    Acyl-CoA dehydrogenase
2.     E.C.4.2.1.20     0.76Å    Tryptophan synthase α-subunit
3.     E.C.3.2.1.73     1.19Å    Glycosyl hydrolases, family 17
4.     E.C.3.2.1.73     1.21Å    Glycosyl hydrolases, family 16
5.     E.C.4.1.2.13     1.25Å    Fructose-bisphosphate aldolase (class I)
…      …                …        …
102.   E.C.1.1.1.158    2.19Å    UDP-N-acetylmuramate dehydrogenase
…      …                …        …
386.   …                3.94Å    …

Matched template residues: Ser, Arg, Glu (rmsd=2.19Å)


Comparison of template environments

Template residues: Ser, Arg, Glu

Similar residues in neighbourhood:

Template structure – 1mbb

Target structure – 1hsk


Comparison of template environments

Match to template: Ser, Arg, Glu

Template structure – 1mbb

Target structure – 1hsk




Environment similarity score

Slices through 10Å sphere centred on template match

Template structure 1mbb, target structure 1hsk

Score equivalent grid-points using Dayhoff matrix and taking voids into account

Total similarity score obtained from sum of all grid-point scores
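The grid-based score can be outlined as a sum over equivalent grid points. In this sketch the grids map a point to a one-letter residue code or to `None` for a void; the substitution values are toy numbers rather than real Dayhoff matrix entries, and the void handling (zero for void against void, a fixed penalty for void against residue) is an assumption, since the slides do not give the actual scheme.

```python
def environment_score(template_grid, target_grid, subst,
                      void_void=0.0, void_penalty=-1.0):
    """Sum per-point scores over the grid points defined for the template."""
    total = 0.0
    for point, t_res in template_grid.items():
        q_res = target_grid.get(point)          # None if void or absent
        if t_res is None and q_res is None:
            total += void_void
        elif t_res is None or q_res is None:
            total += void_penalty
        else:
            # Look the pair up in either order; unknown pairs score zero
            total += subst.get((t_res, q_res), subst.get((q_res, t_res), 0.0))
    return total

# Toy substitution scores and grids, for illustration only
toy_subst = {("S", "S"): 2.0, ("S", "T"): 1.0, ("R", "R"): 6.0}
template_grid = {(0, 0, 0): "S", (1, 0, 0): "R", (2, 0, 0): None, (3, 0, 0): "S"}
target_grid = {(0, 0, 0): "T", (1, 0, 0): "R", (2, 0, 0): None, (3, 0, 0): None}
score = environment_score(template_grid, target_grid, toy_subst)
print(score)
```

Grid points present only in the target are ignored here; a symmetric version would iterate over the union of both grids.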


Results for 1hsk

Hit   E.C. number      Rmsd   Score   Enzyme
1.    E.C.1.1.1.158    2.08   209.1   UDP-N-acetylmuramate dehydrogenase
2.    E.C.3.2.1.14     2.13   146.0   Chitinase A; chitodextrinase; 1,4-beta-poly-N-acetylglucosaminidase; poly-beta-glucosaminidase
3.    E.C.3.2.1.17     1.92   142.4   Turkey lysozyme
4.    E.C.3.2.1.17     1.89   138.7   Hen lysozyme
5.    E.C.3.5.1.26     1.47   132.3   Aspartylglucosylaminidase
6.    E.C.3.2.1.3      1.54   131.1   Glucan 1,4-alpha-glucosidase


Residue conservation

Rank template hits according to conservation scores of the matched residues

Hit   E.C. number      Rmsd    Signif   Enzyme
1.    E.C.1.1.1.158    2.08Å   98.3%    UDP-N-acetylmuramate dehydrogenase
2.    E.C.3.5.1.11     2.06Å   98.3%    Penicillin acylase
3.    E.C.5.99.1.2     2.22Å   98.3%    Topoisomerase Ia/II
4.    E.C.5.1.2.2      2.69Å   98.3%    Mandelate racemase
5.    E.C.5.1.2.2      2.59Å   97.8%    Topoisomerase Ia/II
…     …                …       …        …


Residue conservation and cleft proximity

Rank by conservation and proximity to protein’s two largest clefts

Hit   E.C. number      Rmsd    Signif   Enzyme
1.    E.C.5.1.2.2      2.69Å   98.4%    Mandelate racemase
2.    E.C.1.1.1.158    2.08Å   98.3%    UDP-N-acetylmuramate dehydrogenase
3.    E.C.3.5.1.11     2.06Å   98.3%    Penicillin acylase
4.    E.C.5.99.1.2     2.22Å   98.3%    Topoisomerase Ia/II
5.    E.C.5.1.2.2      2.59Å   97.8%    Topoisomerase Ia/II
…     …                …       …        …


“Reverse” templates

[Figure: 3-residue templates 1 to 9, 1hsk]



Comparison of template environments

Identical residues in

neighbourhood:

Template structure – 1mbb

Target structure – 1hsk



“Reverse” templates

  • Typically get 20-40 templates from a single structure

  • Search each template vs PDB (or representative subset)

  • Non-homologous dataset of 2,500 protein chains

  • Focused search (eg top DALI hits)

  • Locate known PDB entries with closest local similarity

  • Program called: the Protein SiteSeer

  • Times for search vs 2,500 set

  • JESS – 30 minutes

  • SiteSeer – 3 hours


Structure to Function

[Diagram centred on 3D STRUCTURE]

Properties derived from the structure: FOLD, MULTIMERS, INTERACTIONS, SURFACE, MUTANTS & SNPs, ELECTROSTATICS, LIGANDS, CLUSTERS

Functional information: evolutionary relationships; biological multimeric state; enzyme active sites; ligand & functional sites; catalytic clusters, mechanisms & motifs



Protein Function

Protein function has many definitions:

  • Biochemical Function - The biochemical role of the protein e.g. serine protease

  • Biological Function - The role of the protein in the cell/organism e.g. digestion, blood clotting, fertilisation

    The 3D structure usually only provides information about biochemical function


~250 structures solved to date by MCSG

~40% are ‘hypothetical proteins’

Some examples …

ylxR hypothetical cytosolic protein

ygbM hypothetical protein (EC1530)

Hypothetical protein (MTH1)

Conserved hypothetical protein (MT777)

Hypothetical protein (EC4030_F)

cutA protein implicated in Cu homeostasis (TM1056)


From Gene To Biochemical Function

Gene → Protein → 3D Structure → Function

Identifying sequence or structural similarity (i.e. identifying an evolutionary relationship) is the most powerful route to function


From Gene To Biochemical Function

Gene → Protein → 3D Structure → Function

Given a protein structure:

  • Where is the functional site?

  • Which ligands bind to the protein?



Predicting function from 3D Structure: conservation

Residue conservation

  • Conservation

  • Valdar & Thornton

  • Lichtarge et al.

  • Aloy et al.

  • Glaser et al.

  • etc.



Predicting function from 3D structure: binding sites

Binding sites

  • Binding site comparison

  • Geometrical hashing

  • eF-site (Nakamura et al.)

  • PINTS (Russell)

  • Pseudospheres (Klebe)

  • pvSOAR (Binkowski et al.)

  • etc


Predicting function from 3D Structure: templates

3D templates

  • Template methods

  • PROCAT/CSA (Wallace..)

  • ASSAM (Artymiuk..)

  • Rigor/Spasm (Kleywegt)

  • MSD (Henrick, Oldfield…)

  • etc


Predicting Binding Site

Binding-site analysis: cutA

Surface clefts

Residue conservation

Conserved surface patches

Most likely binding site



Identifying Binding Site Function Using Motifs

- 3D enzyme active site structural motifs (Craig Porter)

- Catalytic Site Atlas - Identification of catalytic residues (Gail Bartlett, Alex Gutteridge)

- Metal binding sites (Malcolm MacArthur)

- Binding site features (Gareth Stockwell)

- Automatically generated templates of ligand-binding and

- DNA binding motifs (Sue Jones, Hugh Shanahan)

- “Reverse” templates (Roman Laskowski)

JESS – fast template search algorithm (Jonathan Barker)


An example

MCSG structure: BioH – unknown function, involved in biotin synthesis in E. coli

Expected to be an enzyme

Sequence contains two Gly-X-Ser-X-Gly motifs typical of acyltransferases and thioesterases

Structure: Rossmann fold, hence many structural homologues


PROCAT template search

One very strong hit

Ser-His-Asp catalytic triad of the lipases with rmsd=0.28Å (template cut-off is 1.2Å)

Experimentally confirmed by hydrolase assays

Novel carboxylesterase acting on short acyl chain substrates


ProFunc – function from 3D structure

Homologous structures of known function

Homologous sequences of known function

DNA-, ligand- binding and “reverse” templates

Residue conservation analysis

Functional sequence motifs

Binding site identification and analysis

Q-x(3)-[GE]-x-C-[YW]-x(2)-[STAGC]

HTH-motifs Electrostatics Surface comparison

… etc

Enzyme active site 3D-templates

Roman Laskowski



http://www.ebi.ac.uk/thornton-srv/databases/ProFunc/


Goal: Function Prediction from Structure

Roman Laskowski, James Watson





MCSG Dataflow

Crystallographers

(Structure Solution)

Deposition and release

Central Database

Function Prediction (Neural Network)

NIH Report

Experimental Validation Of Function


Functional Annotation

All MCSG structures are automatically run through ProFunc.

The results are examined manually to try to estimate the most likely function. The most recent (Nov 2004) dataset contains 193 unique structures:

  • Prior function known: 68 (35%)

  • Some assignment possible: 102 (53%)

    • Confident: 42/102

    • Putative: 50/102

    • Conflicting: 10/102

  • Function remains unknown: 23 (12%)



Acknowledgements

James Watson, Roman Laskowski - EBI

Adel Golovin, Kim Henrick - EBI MSD

David Leader, James Milner-White – Glasgow

Andrzej Joachimiak, Aled Edwards – MCSG

(Mid-West Centre for Structural Genomics)

http://www.ebi.ac.uk/thornton-srv/databases/pdbsum/

http://www.ebi.ac.uk/msd-srv/msdmotif

http://www.ebi.ac.uk/thornton-srv/databases/ProFunc/

