Domain database

The CATH domain database and associated resources - DHS, Gene3D

How do we determine domain boundaries?

How do we identify fold groups and evolutionary superfamilies?

What is the distribution of the CATH domain families in the PDB and in the genomes?

CATH domain database (Orengo & Thornton 1994)

Class

Architecture

Topology or Fold Group

Homologous Superfamily


Domain database

Multidomain proteins

~20,000 chains from Protein Databank (PDB)

~50,000 domains in CATH structure database

~40% of the entries in CATH are multidomain


Domain database

Domains are important evolutionary units

analysis by Teichmann and others suggests that ~60-80% of genes in genomes may be multidomain


Domain database

Carboxypeptidase A (2ctc)

Carboxypeptidase G2 (1cg2A)

~30% of multidomains in CATH are discontinuous


Algorithms for Recognising Domain Boundaries

DETECTIVE (Swindells, 1995)

each domain should have a recognisable hydrophobic core

DOMAK (Siddiqui & Barton, 1995)

residues comprising a domain make more internal contacts than external ones

PUU (Holm & Sander, 1994)

parser for protein folding units: maximal interaction within domains and minimal interaction between domains

Consensus is sought between the three methods; on average this occurs about 20% of the time
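
The consensus step can be illustrated with a minimal sketch. It assumes each method reports a sorted list of boundary residue positions, and that a consensus exists when all methods predict the same number of boundaries and corresponding boundaries lie within a tolerance. The function name, the input format and the 10-residue tolerance are assumptions for illustration; the slides do not say how agreement is measured.

```python
def consensus_boundaries(predictions, tolerance=10):
    """Return averaged boundaries if all methods agree, else None.

    predictions: dict mapping method name -> list of boundary residue
    positions (hypothetical representation, not a real file format).
    """
    boundary_lists = [sorted(b) for b in predictions.values()]
    # All methods must predict the same number of boundaries.
    if len({len(b) for b in boundary_lists}) != 1:
        return None
    consensus = []
    for positions in zip(*boundary_lists):
        # Corresponding boundaries must lie within the tolerance.
        if max(positions) - min(positions) > tolerance:
            return None
        consensus.append(round(sum(positions) / len(positions)))
    return consensus


agreed = consensus_boundaries({"DETECTIVE": [120], "DOMAK": [124], "PUU": [118]})
disputed = consensus_boundaries({"DETECTIVE": [120], "DOMAK": [60, 124], "PUU": [118]})
```

Here `agreed` holds a single averaged boundary, while `disputed` is `None` because one method predicts an extra domain.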


Domain database

[Figure: relatives grouped as close homologues, twilight zone, midnight zone and homologues/analogues; percentages shown: 74%, 29%, 21%, 4%, 11%]


Algorithms for Recognising Homologues

Sequence Based Methods

close homologues - BLAST (Altschul et al.)

- SSEARCH (Smith & Waterman)

remote homologues - SAM-T99 (Karplus et al.)

Structure Based Methods

close & remote homologues - CATHEDRAL (Harrison, Thornton & Orengo)

- SSAP (Taylor & Orengo)

- CORA (Orengo)


Domain database

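[Figure: groups of relatives labelled with the methods used to detect them; percentages shown: 74%, 29%, 21%, 4%, 11%]

Close homologues: SSEARCH

Twilight zone: HMMs, SSAP

Midnight zone: CATHEDRAL, SSAP

Homologues/analogues: CATHEDRAL, SSAP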


Domain database

Hidden Markov Models (HMMs)

SAM-T99 Karplus Group

SAMOSA Orengo Group

Non redundant GenBank database

query sequence

hits

these methods can currently identify ~70% of remote homologues

(3 times more powerful than BLAST)


Domain database

Percentage of PDB structures classified in CATH by different methods over the last 2 years

[Pie chart. Labelled slices: novel folds 2.0; analogues (SSAP) 1.9; remote homologues (SSAP) 8.6. Remaining categories: remote homologues, <30% (HMMs); close homologues, >30% (SSEARCH); near-identical (SSEARCH). Remaining values: 7.6, 20.7, 59.2]


Domain database

Percentage of structural genomics PDB structures classified in CATH by different methods over the last 2 years

[Pie chart. Categories: near-identical (SSEARCH); novel folds; analogues (SSAP); close homologues, >30% (SSEARCH); remote homologues (SSAP); remote homologues, <30% (HMMs). Values shown: 7.7, 11.8, 8.0, 22.0, 28.4, 22.0]


Structure Based Algorithms for Recognising Homologues

CATHEDRAL: pairwise alignment - secondary structure comparison

SSAP: pairwise alignment - residue comparison

CORA: multiple alignment - residue comparison


Domain database

[Figure: groups of relatives labelled with the methods used to detect them; percentages shown: 74%, 29%, 21%, 4%, 11%]

Close homologues: SSEARCH

Twilight zone: HMMs

Midnight zone: CATHEDRAL, SSAP

Homologues/analogues: CATHEDRAL, SSAP


Domain database

structure is much more highly conserved than sequence

Heat labile enterotoxin compared with:

                   Structure similarity (SSAP) score   Sequence identity
cholera toxin      97                                  79%
pertussis toxin    81                                  12%


Pairwise Sequence Identities and Structure Similarity (SSAP) Scores in CATH Domain Families

[Plot: structure similarity (SSAP) score against sequence identity (%), with points marked as same function or different function]


Domain database




Fast Structure Comparison Method (CATHEDRAL)

ignore the variable loop regions and only compare the secondary structures

derive vectors through secondary structure elements

compare closest approach distances and vector orientations using graph theory

Andrew Harrison et al., JMB, 2002


Domain database

[Figure: two secondary structure vectors a and b separated by distance d]

$\mathbf{a} \cdot \mathbf{b} = |\mathbf{a}|\,|\mathbf{b}| \cos\theta$

+ dihedral angle

+ chirality
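
The dot-product relation on this slide gives the angle between two secondary structure vectors. The sketch below is one way to compute it, assuming each vector is given as three Cartesian components; the function name is illustrative, and the dihedral angle and chirality terms are not covered.

```python
import math


def vector_angle(a, b):
    """Angle in degrees between two 3D vectors, from a . b = |a||b| cos(theta)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    cos_theta = dot / (norm_a * norm_b)
    # Clamp to [-1, 1] to guard against floating-point rounding.
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.degrees(math.acos(cos_theta))


perpendicular = vector_angle((1.0, 0.0, 0.0), (0.0, 2.0, 0.0))
antiparallel = vector_angle((0.0, 0.0, 1.0), (0.0, 0.0, -3.0))
```

Two perpendicular elements give 90 degrees and two antiparallel elements give 180 degrees, independent of the vector lengths.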


CATHEDRAL: CATH's Existing Domain Recognition ALgorithm

Compares graphs of proteins

[Figure: graph in which each node is a secondary structure (H) and each edge is labelled with d, two angles, and chirality]


Domain database

Comparing proteins with similar folds identifies an overlap graph with the largest common structural motif

[Figure: two protein graphs with nodes A, B, C and a, b, c, d and edges labelled I to V; the overlap graph has nodes (A,a), (B,c) and (C,d)]

overlap graph has a structural motif of 3 secondary structures


Domain database

Graphs are compared using the Bron-Kerbosch algorithm to find the largest common graph

In this example the common graph contains 5 nodes.

1000 times faster than residue based methods (e.g. SSAP)
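
For readers unfamiliar with it, the Bron-Kerbosch algorithm enumerates the maximal cliques of a graph. A common way to find the largest common subgraph is to run it on a graph whose nodes pair up elements of the two inputs, as in the (A,a) style nodes of the overlap graph. The sketch below is the basic algorithm without pivoting, on a plain adjacency dictionary; it is a generic illustration rather than the CATHEDRAL implementation, and all names are assumptions.

```python
def bron_kerbosch(adjacency):
    """Return all maximal cliques of an undirected graph.

    adjacency: dict mapping node -> set of neighbouring nodes.
    """
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidate nodes, x: already-processed nodes.
        if not p and not x:
            cliques.append(r)
            return
        for v in list(p):
            expand(r | {v}, p & adjacency[v], x & adjacency[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adjacency), set())
    return cliques


def largest_clique(adjacency):
    """Largest of the maximal cliques (empty set for an empty graph)."""
    return max(bron_kerbosch(adjacency), key=len, default=set())


# Triangle 1-2-3 with an extra node 4 attached to node 3.
graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
motif = largest_clique(graph)
```

In this toy graph the largest clique is the triangle of nodes 1, 2 and 3.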


Domain database

Performance


Domain database

statistical significance can be assessed by scanning a protein ‘graph’ against ‘graphs’ of all known structures

$\text{Score} \sim \dfrac{\text{common graph size}}{(\text{size}_{\text{protein 1}} \cdot \text{size}_{\text{protein 2}})^{1/2}}$
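
Reading the score on this slide as the common graph size divided by the geometric mean of the two protein graph sizes, a small sketch shows the effect of the normalisation. The function name is illustrative, and the slide gives only a proportionality, so any constant factor is omitted.

```python
import math


def normalised_score(common_size, size1, size2):
    """Common graph size divided by sqrt(size1 * size2)."""
    return common_size / math.sqrt(size1 * size2)


small_pair = normalised_score(5, 10, 10)
large_pair = normalised_score(5, 40, 40)
```

The same 5-node common graph counts for less between two large proteins than between two small ones, which is the point of dividing by the sizes.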


scores for unrelated structures exhibit an extreme value distribution

$F = A\,e^{-b \cdot \text{score}}$

$\log F = \log A - b \cdot \text{score}$

allows you to calculate the probability (P-value, E-value) of obtaining any score by chance
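
Because log F is linear in the score, A and b can be estimated by a straight-line fit to the tail of the score distribution for unrelated structures. The sketch below uses ordinary least squares and assumes F is the observed frequency at each score; the slides do not define F or say how the fit was done, so the function names and inputs are illustrative.

```python
import math


def fit_log_linear(scores, frequencies):
    """Least-squares fit of log F = log A - b * score. Returns (A, b)."""
    logs = [math.log(f) for f in frequencies]
    n = len(scores)
    mean_s = sum(scores) / n
    mean_l = sum(logs) / n
    sxx = sum((s - mean_s) ** 2 for s in scores)
    sxy = sum((s - mean_s) * (l - mean_l) for s, l in zip(scores, logs))
    slope = sxy / sxx
    intercept = mean_l - slope * mean_s
    return math.exp(intercept), -slope


def expected_frequency(score, a, b):
    """Evaluate F = A * exp(-b * score) for a new score."""
    return a * math.exp(-b * score)


# Synthetic data generated from A = 2.0, b = 0.5 to check the fit.
example_scores = [1.0, 2.0, 3.0, 4.0]
example_freqs = [2.0 * math.exp(-0.5 * s) for s in example_scores]
a_fit, b_fit = fit_log_linear(example_scores, example_freqs)
```

With the fitted A and b, `expected_frequency` extrapolates to scores higher than any seen among unrelated structures, which is where chance probabilities are needed.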


Domain database

Using CATHEDRAL to Identify Domain Boundaries

Graph based secondary structure comparison is very fast - 1000 times faster than residue based methods

New multi-domain structures can be rapidly scanned against the library of CATH domains. E-values can be used to identify significant matches.

85-90% of domains in new multi-domain structures have relatives in CATH


Domain database

CATHEDRAL

Multi-domain structure

Secondary structure match by graph

SSAP residue alignment

[Figure: residues in new multi-domain structure aligned against residues in CATH domain family 1 (Fold A) and residues in CATH domain family 2 (Fold B)]


Domain database

SSAP

Taylor & Orengo, J. Mol. Biol. 1989

residue based structure comparison method using dynamic programming

Scores range from 0-100

[Figure: Protein A and Protein B, with a matrix whose axes are residues in protein A and residues in protein B]
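
To show what dynamic programming over a residue-by-residue matrix looks like, here is a generic global alignment sketch. It is not SSAP itself: the input matrix of residue similarity scores, the linear gap penalty and the function name are assumptions for illustration, and the slides do not describe how SSAP derives its residue scores.

```python
def align(score, gap=-1.0):
    """Global alignment by dynamic programming.

    score[i][j]: similarity of residue i in protein A and residue j in
    protein B (hypothetical input). Returns (total score, aligned pairs).
    """
    n, m = len(score), len(score[0])
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0] = i * gap
    for j in range(1, m + 1):
        best[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(
                best[i - 1][j - 1] + score[i - 1][j - 1],  # align i with j
                best[i - 1][j] + gap,                      # residue of A unaligned
                best[i][j - 1] + gap,                      # residue of B unaligned
            )
    # Trace back from the bottom-right corner to recover aligned pairs.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if best[i][j] == best[i - 1][j - 1] + score[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif best[i][j] == best[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best[n][m], pairs


# Protein A has an extra middle residue with no counterpart in protein B.
total, aligned = align([[2, 0], [0, 0], [0, 2]])
```

In the example the middle residue of protein A is left unaligned, so the first and last residues of A pair with the two residues of B.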


Domain database

CATHEDRAL

One third of known multi-domain structures are discontinuous


Reasons for Structural Similarity

Divergence - similarity arises due to divergent evolution from a common ancestor - structure much more highly conserved than sequence

Convergence - similarity due to there being a limited number of ways of packing helices and strands in 3D space


Domain database

Domain structure database (Orengo & Thornton 1994)

Class

Architecture

Topology or Fold Group

Homologous Superfamily

~50,000 domains in PDB

~1500 domain superfamilies in CATH


Domain database

CATH domain database

Class: 3

Architecture: ~36

Topology or Fold: ~810

~50,000 domains


Domain database

CATH

Topology or Fold Group: ~810

40,000 domain entries

~50,000 domain entries

Homologous Superfamily (Domain Family): ~1500

Sequence Family (35%, 60%, 95%)
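
Grouping domains into sequence families at the 35%, 60% and 95% identity levels can be pictured as clustering on pairwise sequence identity. The sketch below uses single-linkage clustering with a union-find structure. The slides give only the thresholds, so the linkage rule, the input format and the names are assumptions.

```python
def cluster_by_identity(names, identity, threshold):
    """Single-linkage clustering: link pairs with identity >= threshold.

    identity: dict mapping (name1, name2) -> percent sequence identity
    (hypothetical input; pairs that are absent are treated as unrelated).
    """
    parent = {name: name for name in names}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), pct in identity.items():
        if pct >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(c) for c in clusters.values())


domains = ["d1", "d2", "d3", "d4"]
pairwise = {("d1", "d2"): 96, ("d2", "d3"): 62, ("d1", "d3"): 58, ("d3", "d4"): 20}
families_95 = cluster_by_identity(domains, pairwise, 95)
families_35 = cluster_by_identity(domains, pairwise, 35)
```

Lowering the threshold from 95 to 35 merges d1, d2 and d3 into one family, while d4 stays on its own at every level.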


Domain database

DHS: Dictionary of Homologous Superfamilies

http://www.biochem.ucl.ac.uk/bsm/dhs

Description of structural and functional characteristics for each superfamily

Variation in Secondary Structures Across Superfamily

Functional annotations from GO, EC, COGs, KEGG

Multiple structure alignments with conserved residues highlighted (http://www.biochem.ucl.ac.uk/bsm/cath_new/Gene3D)


Population of CATH Families and Structural Groups

~50,000 structural domains

cluster proteins with similar sequences

S: ~4000 sequence families (35%)

cluster proteins with similar structures and functions

H: ~1,500 homologous superfamilies

cluster proteins with similar structures

T: ~810 fold groups

A: ~36 architectures

C: 3 major protein classes


Domain database

CATH

nearly one third of the superfamilies belong to <10 fold groups

[Figure labels: Arc repressor-like, Up-down, Rossmann, SH3-like, OB fold, Immunoglobulin, Jelly Roll, Alpha-beta plait, TIM barrel]


Domain database

CATH numbering scheme

2.40.50.100

Class: 2. Mainly beta

Architecture: 40. Barrel

Topology: 50. OB Fold

Homology: 100. Heat labile enterotoxin superfamily
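
The four-part numbering lends itself to simple parsing. The sketch below splits a code such as 2.40.50.100 into its class, architecture, topology and homologous superfamily numbers; the type and function names are illustrative and not part of any CATH software.

```python
from typing import NamedTuple


class CathCode(NamedTuple):
    class_: int
    architecture: int
    topology: int
    homologous_superfamily: int


def parse_cath_code(code):
    """Split a C.A.T.H code into its four integer levels."""
    parts = code.strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated numbers, got {code!r}")
    return CathCode(*(int(p) for p in parts))


def same_fold(code_a, code_b):
    """True if two codes share class, architecture and topology."""
    return parse_cath_code(code_a)[:3] == parse_cath_code(code_b)[:3]


enterotoxin = parse_cath_code("2.40.50.100")
```

Comparing the first three numbers of two codes is enough to tell whether two domains are in the same fold group, whatever their superfamily.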


Domain database

CATH

http://www.biochem.ucl.ac.uk/bsm/cath

CATH domain structure database

CATH class level

CATH architecture level

CATH Topology or fold group level

CATH homologous superfamilies in each fold group

CATH homologous superfamily level

CATH sequence families (>=35% identity) in each superfamily

CATH classification information for individual domains

CATH structural relatives listed for each domain


Domain database

CATH server

http://www.biochem.ucl.ac.uk/cgi-bin/cath/CathServer.pl

structural matches and statistics listed for query domain


Domain database

Expanding CATH with sequence relatives from genomes

  • Library of HMMs built for representative sequences from each CATH domain superfamily

protein sequences from genomes -> scan against CATH HMM library -> assign domains to CATH superfamilies
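
The final assignment step can be sketched as follows, assuming the HMM scan has already produced a list of hits for one sequence. Hits are accepted in order of increasing E-value and skipped if they overlap a region already assigned. The hit format, the E-value cut-off of 0.001 and the greedy overlap rule are all assumptions; the slides do not say how competing hits are resolved.

```python
def assign_domains(hits, max_evalue=0.001):
    """Greedy assignment of non-overlapping domains to one sequence.

    hits: list of (superfamily, start, end, evalue) tuples with inclusive
    residue ranges (hypothetical format, not the output of a real tool).
    """
    assigned = []
    for hit in sorted(hits, key=lambda h: h[3]):
        superfamily, start, end, evalue = hit
        if evalue > max_evalue:
            break  # remaining hits are even less significant
        overlaps = any(start <= e and s <= end for _, s, e, _ in assigned)
        if not overlaps:
            assigned.append(hit)
    return sorted(assigned, key=lambda h: h[1])


example_hits = [
    ("H1", 1, 150, 1e-30),
    ("H2", 140, 250, 1e-5),
    ("H3", 160, 240, 1e-12),
    ("H4", 300, 380, 0.5),
]
domains_found = assign_domains(example_hits)
```

In the example, H2 is dropped because it overlaps better-scoring hits, and H4 is dropped because its E-value is above the cut-off.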


Domain database

Expanding CATH

~1400 Domain Structure Superfamilies

sequences added from GenBank, genomes, SWPT-TrEMBL

[Figure: a homologous superfamily (H) with sequence families S1 to S3 and, after adding sequences with CATH-HMMs, with sequence families S1 to S5]

~50,000 sequences, ~4,000 sequence families

~600,000 sequences, ~24,000 sequence families

Up to 70% of sequences in completed genomes can be assigned to CATH domain superfamilies


Domain database

Gene3D

[Figure labels: Arc repressor-like, Four helix bundle, Up-down, Alpha horseshoe fold, SH3-type barrel (SH3-like), Rossmann Fold, OB Fold, Immunoglobulin-like, Jelly Roll, Alpha/Beta Plaits, TIM Barrel]


Domain database

Gene3D

http://www.biochem.ucl.ac.uk/bsm/Gene3D

CATH domain structure annotations for complete genomes

Individual genome statistics

Assignment of sequences to Gene3D protein families

Functional annotations for individual sequences

Domain annotations for individual sequences


Summary

  • CATH currently identifies ~1500 superfamilies in the ~50,000 structural domains from the PDB

  • These domain families contain over 600,000 domain sequences from the genomes and sequence databases

  • Up to 70% of genome sequences can be assigned to domain structure families using HMMs and threading


Domain database

Acknowledgements

Frances Pearl

Ian Sillitoe

Oliver Redfern

Mark Dibley

Tony Lewis

Chris Bennett

Andrew Harrison

Gabrielle Reeves

Alastair Grant

David Lee

Janet Thornton

http://www.biochem.ucl.ac.uk/bsm/cath

Medical Research Council, Wellcome Trust, NIH, Biotechnology and Biological Sciences Research Council