domain database


domain database - PowerPoint PPT Presentation



The CATH domain database and associated resources - DHS, Gene3D. How do we determine domain boundaries? How do we identify fold groups and evolutionary superfamilies? What is the distribution of the CATH domain families in the PDB and in the genomes?


PowerPoint Slideshow about ' domain database' - chessa


Presentation Transcript
slide1

The CATH domain database and associated resources - DHS, Gene3D

How do we determine domain boundaries?

How do we identify fold groups and evolutionary superfamilies?

What is the distribution of the CATH domain families in the PDB and in the genomes?

CATH domain database (Orengo & Thornton 1994)

Class

Architecture

Topology or Fold Group

Homologous Superfamily

slide2

Multidomain proteins

~20,000 chains from Protein Databank (PDB)

~50,000 domains in CATH structure database

~40% of the entries in CATH are multidomain

slide3

Domains are important evolutionary units

analysis by Teichmann and others suggests that ~60-80% of genes in genomes may be multidomain

slide4

Carboxypeptidase A (2ctc)

Carboxypeptidase G2 (1cg2A)

~30% of multidomains in CATH are discontinuous

Algorithms for Recognising Domain Boundaries

DETECTIVE (Swindells, 1995)

each domain should have a recognisable hydrophobic core

DOMAK (Siddiqui & Barton, 1995)

residues comprising a domain make more internal contacts than external ones

PUU (Holm & Sander, 1994)

parser for protein folding units: maximal interaction within domains and minimal interaction between domains

Consensus is sought between the three methods – on average this occurs about 20% of the time
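
As an illustration of the consensus step, the sketch below checks whether several methods place domain boundaries at roughly the same residues. The slides do not say how agreement is judged, so the representation (one list of boundary residue numbers per method), the function name `boundaries_agree` and the 10-residue tolerance are all assumptions.

```python
def boundaries_agree(predictions, tolerance=10):
    """Return True if every method predicts the same number of boundaries
    and corresponding boundaries lie within `tolerance` residues of each other.

    `predictions` is a list with one entry per method; each entry is a list of
    boundary residue numbers (hypothetical layout, not taken from the slides).
    """
    # All methods must agree on how many domains the chain contains.
    if len({len(p) for p in predictions}) != 1:
        return False
    # Compare the first boundary from each method, then the second, and so on.
    for positions in zip(*(sorted(p) for p in predictions)):
        if max(positions) - min(positions) > tolerance:
            return False
    return True


# Made-up boundary positions for a two-domain chain.
detective = [152]
domak = [148]
puu = [155]
print(boundaries_agree([detective, domak, puu]))
```

Chains on which the methods disagree would then need some other treatment; the slides do not describe that step.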

slide6

[Figure: chart with regions labelled Close homologues, Twilight zone, Midnight zone and Homologues/analogues; values shown: 74%, 29%, 21%, 4%, 11%]

Algorithms for Recognising Homologues

Sequence Based Methods

close homologues - BLAST (Altschul et al.), SSEARCH (Smith & Waterman)

remote homologues - SAM-T99 (Karplus et al.)

Structure Based Methods

close & remote homologues - CATHEDRAL (Harrison, Thornton & Orengo), SSAP (Taylor & Orengo), CORA (Orengo)
slide8

[Figure: chart annotated with the methods used in each region. Close homologues: SSEARCH. Twilight zone: HMMs, SSAP. Midnight zone: CATHEDRAL, SSAP. Homologues/analogues: CATHEDRAL, SSAP. Values shown: 74%, 29%, 21%, 4%, 11%]

slide9

Hidden Markov Models (HMMs)

SAM-T99 Karplus Group

SAMOSA Orengo Group

Non redundant GenBank database

query sequence

hits

these methods can currently identify ~70% of remote homologues

(3 times more powerful than BLAST)

slide10

Percentage of PDB structures classified in CATH by different methods over the last 2 years

[Figure: chart of categories and the method used for each. Near-identical: SSEARCH. Close homologues (>30%): SSEARCH. Remote homologues (<30%): HMMs. Remote homologues (8.6): SSAP. Analogues (1.9): SSAP. Novel folds. Values shown: 59.2, 20.7, 8.6, 7.6, 2.0, 1.9]

slide11

Percentage of structural genomics PDB structures classified in CATH by different methods over the last 2 years

[Figure: chart of categories and the method used for each. Near-identical: SSEARCH. Close homologues (>30%): SSEARCH. Remote homologues (<30%): HMMs. Remote homologues: SSAP. Analogues: SSAP. Novel folds. Values shown: 7.7, 11.8, 8.0, 22.0, 28.4, 22.0]

Structure Based Algorithms for Recognising Homologues

CATHEDRAL: pairwise alignment - secondary structure comparison

SSAP: pairwise alignment - residue comparison

CORA: multiple alignment - residue comparison
slide13

[Figure: chart annotated with the methods used in each region. Close homologues: SSEARCH. Twilight zone: HMMs. Midnight zone: CATHEDRAL, SSAP. Homologues/analogues: CATHEDRAL, SSAP. Values shown: 74%, 29%, 21%, 4%, 11%]

slide14

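structure is much more highly conserved than sequence

[Figure: structures of heat labile enterotoxin, cholera toxin and pertussis toxin]

Heat labile enterotoxin compared with cholera toxin: structure similarity (SSAP) score 97, sequence identity 79%

Heat labile enterotoxin compared with pertussis toxin: structure similarity (SSAP) score 81, sequence identity 12%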

Pairwise Sequence Identities and Structure Similarity (SSAP) Scores in CATH Domain Families

[Figure: plot of structure similarity (SSAP) score against sequence identity (%), distinguishing pairs with the same function from pairs with different functions]

slide16

  • Residue insertions in the loops connecting secondary structures
  • Shifts in the orientations of secondary structures
Fast Structure Comparison Method (CATHEDRAL)

Andrew Harrison et al., JMB, 2002

ignore the variable loop regions and only compare the secondary structures

derive vectors through secondary structure elements

compare closest approach distances and vector orientations using graph theory
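
The geometric quantities named above can be sketched as follows. This is a minimal illustration, not the published CATHEDRAL code: the vector through each secondary structure is approximated by joining its first and last C-alpha atoms, and the closest approach is computed between infinite lines rather than finite segments. All function names and coordinates are assumptions.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def norm(a):
    return math.sqrt(dot(a, a))


def sse_vector(ca_coords):
    """Crude axis of a secondary structure: start point plus the vector
    from the first to the last C-alpha coordinate."""
    start, end = ca_coords[0], ca_coords[-1]
    return start, sub(end, start)


def angle_between(u, v):
    """Angle in degrees between two vectors, from a.b = |a||b| cos(theta)."""
    cos_theta = dot(u, v) / (norm(u) * norm(v))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_theta))))


def closest_approach(p1, u, p2, v):
    """Shortest distance between the infinite lines p1 + s*u and p2 + t*v."""
    w = sub(p2, p1)
    n = cross(u, v)
    if norm(n) < 1e-9:  # parallel lines: distance from p2 to the first line
        return norm(cross(w, u)) / norm(u)
    return abs(dot(w, n)) / norm(n)


# Two made-up helices: one along x, one along y and 5 units higher in z.
helix_a = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
helix_b = [(0.0, 0.0, 5.0), (0.0, 1.5, 5.0), (0.0, 3.0, 5.0)]
pa, va = sse_vector(helix_a)
pb, vb = sse_vector(helix_b)
print(angle_between(va, vb), closest_approach(pa, va, pb, vb))
```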

slide20

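[Figure: two secondary structure vectors a and b separated by a distance d]

$\mathbf{a} \cdot \mathbf{b} = |\mathbf{a}||\mathbf{b}| \cos\theta$, where $\theta$ is the angle between the two vectors

+ dihedral angle

+ chirality

CATHEDRAL: CATH's Existing Domain Recognition ALgorithm

Compares graphs of proteins

[Figure: protein graph in which each node is a secondary structure (labelled H) and each edge is labelled with d, angle, dihedral angle and chirality]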

slide22

Comparing proteins with similar folds identifies an overlap graph with the largest common structural motif

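[Figure: graphs of two proteins, one with secondary structures A, B and C and the other with a, b, c and d, with further labels I-V, and their overlap graph with matched nodes A,a; B,c; C,d]

overlap graph has a structural motif of 3 secondary structures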

slide23

Graphs are compared using the Bron-Kerbosch algorithm to find the largest common graph

In this example the common graph contains 5 nodes.

1000 times faster than residue based methods (e.g. SSAP)
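
For readers unfamiliar with it, the Bron-Kerbosch algorithm enumerates the maximal cliques of a graph; the largest clique of a correspondence graph gives the largest common subgraph. The sketch below uses the pivoting variant on a toy graph whose nodes are matched pairs of secondary structures. How CATHEDRAL builds and prunes its graphs is not shown in the slides, so the node names and edges here are illustrative assumptions.

```python
def bron_kerbosch(r, p, x, adj, cliques):
    """Collect all maximal cliques (Bron-Kerbosch with pivoting).

    r: current clique, p: candidate nodes, x: already-processed nodes,
    adj: dict mapping each node to the set of its neighbours.
    """
    if not p and not x:
        cliques.append(set(r))
        return
    # Pick the pivot with most neighbours among the candidates to cut branching.
    pivot = max(p | x, key=lambda n: len(adj[n] & p))
    for v in list(p - adj[pivot]):
        bron_kerbosch(r | {v}, p & adj[v], x & adj[v], adj, cliques)
        p = p - {v}
        x = x | {v}


def largest_clique(adj):
    """Return one largest clique of the graph described by `adj`."""
    cliques = []
    bron_kerbosch(set(), set(adj), set(), adj, cliques)
    return max(cliques, key=len) if cliques else set()


# Toy correspondence graph: an edge joins two matched pairs that are
# geometrically compatible. "B,b" is an extra hypothetical pairing.
adj = {
    "A,a": {"B,c", "C,d"},
    "B,c": {"A,a", "C,d"},
    "C,d": {"A,a", "B,c", "B,b"},
    "B,b": {"C,d"},
}
print(sorted(largest_clique(adj)))
```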

slide25

statistical significance can be assessed by scanning a protein ‘graph’ against ‘graphs’ of all known structures

$$\mathrm{Score} \sim \frac{\text{common graph size}}{(\text{size protein1} \cdot \text{size protein2})^{1/2}}$$

slide26


scores for unrelated structures exhibit an extreme value distribution

$$F = A e^{-b \cdot \mathrm{score}} \qquad \log F = \log A - b \cdot \mathrm{score}$$

allows you to calculate the probability (P-value, E-value) of obtaining any score by chance
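
A minimal sketch of how the log-linear relationship could be used: fit log F = log A - b . score by least squares to the frequencies observed for unrelated structures, then evaluate F for a new score. The function names and the data are invented for illustration, and the slides do not describe the fitting procedure actually used.

```python
import math


def fit_log_linear(scores, freqs):
    """Least-squares fit of log F = log A - b * score. Returns (A, b)."""
    ys = [math.log(f) for f in freqs]
    n = len(scores)
    mean_x = sum(scores) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in scores)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(scores, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def chance_frequency(score, a, b):
    """Expected frequency F = A * exp(-b * score) of a score among unrelated structures."""
    return a * math.exp(-b * score)


# Invented frequencies of scores from a scan against unrelated structures.
example_scores = [0.1, 0.2, 0.3, 0.4]
example_freqs = [2.0 * math.exp(-3.0 * s) for s in example_scores]
a, b = fit_log_linear(example_scores, example_freqs)
print(a, b, chance_frequency(0.8, a, b))
```

A high score then maps to a small F, which is what makes a match stand out from chance similarity.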

slide28

Using CATHEDRAL to Identify Domain Boundaries

Graph based secondary structure comparison is very fast - 1000 times faster than residue based methods

New multi-domain structures can be rapidly scanned against the library of CATH domains. E-values can be used to identify significant matches.

85-90% of domains in new multi-domain structures have relatives in CATH

slide29

CATHEDRAL

Multi-domain structure

Secondary structure match by graph

SSAP residue alignment

residues in new multi-domain

residues in CATH domain family 1

Fold A

residues in CATH domain family 2

Fold B

slide30

SSAP

Taylor & Orengo, J. Mol. Biol. 1989

residue based structure comparison method using dynamic programming

Scores range from 0-100

[Figure: comparison matrix for Protein A and Protein B; axes: residues in protein A, residues in protein B]
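
To make the dynamic programming step concrete, here is a simplified sketch: a single global alignment over a precomputed residue-pair score matrix. SSAP itself uses double dynamic programming on residue structural environments, which is not reproduced here; the score matrix, gap penalty and function name are assumptions for illustration.

```python
def align(score, gap=-1.0):
    """Global alignment by dynamic programming.

    score[i][j] is the similarity of residue i in protein A and residue j in
    protein B. Returns (total score, list of aligned (i, j) index pairs).
    """
    n, m = len(score), len(score[0])
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + gap
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(dp[i - 1][j - 1] + score[i - 1][j - 1],  # match
                           dp[i - 1][j] + gap,                      # gap in B
                           dp[i][j - 1] + gap)                      # gap in A
    # Trace back from the bottom-right corner to recover the aligned pairs.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if dp[i][j] == dp[i - 1][j - 1] + score[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i -= 1
            j -= 1
        elif dp[i][j] == dp[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return dp[n][m], pairs


# Made-up similarity matrix: 2 residues in protein A, 3 in protein B.
total, pairs = align([[5.0, 0.0, 0.0],
                      [0.0, 5.0, 0.0]])
print(total, pairs)
```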

slide31

CATHEDRAL

One third of known multi-domain structures are discontinuous

Reasons for Structural Similarity

Divergence - similarity arises due to divergent evolution from a common ancestor - structure much more highly conserved than sequence

Convergence - similarity due to there being a limited number of ways of packing helices and strands in 3D space
slide35

Domain structure database (Orengo & Thornton 1994)

Class

Architecture

Topology or Fold Group

Homologous Superfamily

~50,000 domains in PDB

~1500 domain superfamilies in CATH

slide36

CATH domain database

Class: 3

Architecture: ~36

Topology or Fold: ~810

~50,000 domains

slide37

CATH

Topology or Fold Group: ~810

40,000 domain entries

~50,000 domain entries

Homologous Superfamily (Domain Family): ~1500

Sequence Family (35%, 60%, 95%)

slide38

DHS

Dictionary of Homologous Superfamilies

http://www.biochem.ucl.ac.uk/bsm/dhs

Description of structural and functional characteristics for each superfamily

slide40

Variation in Secondary Structures Across Superfamily


slide41

Functional annotations from GO, EC, COGs, KEGG


slide42

Multiple structure alignments with conserved residues highlighted

DHS: Dictionary of Homologous Superfamilies

http://www.biochem.ucl.ac.uk/bsm/cath_new/Gene3D

Population of CATH Families and Structural Groups

~50,000 structural domains

S: ~4000 sequence families (35%) - cluster proteins with similar sequences

H: ~1,500 homologous superfamilies - cluster proteins with similar structures and functions

T: ~810 fold groups - cluster proteins with similar structures

A: ~36 architectures

C: 3 major protein classes

slide44

CATH

nearly one third of the superfamilies belong to <10 fold groups

[Figure: fold groups shown: Arc repressor-like, Up-down, Rossmann, SH3-like, OB fold, Immunoglobulin, Jelly Roll, Alpha-beta plait, TIM barrel]

slide45

CATH numbering scheme

2.40.50.100

Class: 2. Mainly beta

Architecture: 40. Barrel

Topology: 50. OB Fold

Homology: 100. Heat labile enterotoxin superfamily
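
The numbering scheme lends itself to simple parsing. The sketch below splits a code such as 2.40.50.100 into its four levels and checks whether two domains share a fold group; the class and function names are invented for illustration and are not part of any CATH software.

```python
from typing import NamedTuple


class CathCode(NamedTuple):
    cath_class: int
    architecture: int
    topology: int
    homologous_superfamily: int


def parse_cath_code(code):
    """Split a four-level CATH code such as '2.40.50.100' into its levels."""
    parts = code.strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated numbers, got {code!r}")
    return CathCode(*(int(p) for p in parts))


def same_fold_group(code_a, code_b):
    """True if two codes share class, architecture and topology."""
    return parse_cath_code(code_a)[:3] == parse_cath_code(code_b)[:3]


enterotoxin = parse_cath_code("2.40.50.100")
print(enterotoxin.cath_class, enterotoxin.architecture,
      enterotoxin.topology, enterotoxin.homologous_superfamily)
```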

slide46

CATH

http://www.biochem.ucl.ac.uk/bsm/cath

CATH domain structure database

slide47


CATH class level

slide48


CATH architecture level

slide49


CATH Topology or fold group level

slide50


CATH homologous superfamilies in each fold group

slide51


CATH homologous superfamily level

slide52


CATH sequence families (>=35% identity) in each superfamily

slide53


CATH classification information for individual domains

slide54


CATH structural relatives listed for each domain

slide55

CATH server

http://www.biochem.ucl.ac.uk/cgi-bin/cath/CathServer.pl

slide57


structural matches and statistics listed for query domain

slide58

Expanding CATH with sequence relatives from genomes

  • Library of HMMs built for representative sequences from each CATH domain superfamily

protein sequences from genomes

scan against CATH HMM library

assign domains to CATH superfamilies
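
The assignment step can be sketched as below, assuming the HMM scan has already produced a list of hits for one protein sequence. The hit layout (superfamily, start, end, E-value), the greedy non-overlap rule and the E-value cut-off are assumptions for illustration; the slides do not describe how competing hits are resolved, and the superfamily codes are example values only.

```python
def assign_domains(hits, max_evalue=1e-3):
    """Pick non-overlapping HMM hits for one sequence, best E-value first.

    Each hit is a tuple (superfamily, start, end, evalue) with inclusive
    residue positions. Returns the chosen hits ordered along the sequence.
    """
    chosen = []
    for hit in sorted(hits, key=lambda h: h[3]):
        _, start, end, evalue = hit
        if evalue > max_evalue:
            break  # remaining hits are even weaker
        # Keep the hit only if it overlaps none of the regions already taken.
        if all(end < s or start > e for _, s, e, _ in chosen):
            chosen.append(hit)
    return sorted(chosen, key=lambda h: h[1])


# Invented hits for a single genome sequence.
hits = [
    ("3.40.50.300", 1, 150, 1e-30),
    ("3.40.50.720", 100, 220, 1e-5),
    ("2.40.50.100", 160, 260, 1e-12),
    ("1.10.10.10", 270, 330, 0.5),
]
for superfamily, start, end, evalue in assign_domains(hits):
    print(superfamily, start, end, evalue)
```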

slide59

Expanding CATH

~1400 Domain Structure Superfamilies

sequences added from GenBank, genomes, SWPT-TrEMBL

CATH-HMMs

[Figure: homologous superfamilies (H), each containing sequence families (S1-S5), shown before and after sequence relatives are added]

~50,000 sequences, ~4,000 sequence families

~600,000 sequences, ~24,000 sequence families

Up to 70% of sequences in completed genomes can be assigned to CATH domain superfamilies

slide60

Gene3D

[Figure: fold groups shown: Arc repressor-like, Up-down (four helix bundle), Alpha horseshoe, SH3-like, Rossmann, OB fold, Immunoglobulin, Jelly Roll, Alpha-beta plait, TIM barrel]

slide61

Gene3D

http://www.biochem.ucl.ac.uk/bsm/Gene3D

CATH domain structure annotations for complete genomes

slide62


Individual genome statistics

slide63


Assignment of sequences to Gene3D protein families

slide64


Functional annotations for individual sequences

slide66


Domain annotations for individual sequences

Summary
  • CATH currently identifies ~1500 superfamilies in the ~50,000 structural domains from the PDB
  • These domain families contain over 600,000 domain sequences from the genomes and sequence databases
  • Up to 70% of genome sequences can be assigned to domain structure families using HMMs and threading
slide69
Acknowledgements

Frances Pearl, Ian Sillitoe, Oliver Redfern, Mark Dibley, Tony Lewis, Chris Bennett, Andrew Harrison, Gabrielle Reeves, Alastair Grant, David Lee

Janet Thornton

http://www.biochem.ucl.ac.uk/bsm/cath

Medical Research Council, Wellcome Trust, NIH, Biotechnology and Biological Sciences Research Council
