- 135 Views
- Uploaded on
- Presentation posted in: General

Fold Recognition

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Bioinformatics Subtopics

SecondaryStructurePrediction

FoldRecognition

HomologyModeling

Docking &Drug Design

E-literature

FunctionClass-ification

SequenceAlignment

Expression Clustering

ProteinGeometry

DatabaseDesign

GenePrediction

Large-ScaleGenomicSurveys

Structural

Informatics

StructureClassification

GenomeAnnotation

Databases

NCBI GenBank- Protein and DNA sequence

NCBI Human Map - Human Genome Viewer

NCBI Ensembl - Genome browsers for human, mouse, zebra fish, mosquito

TIGR - The Institute for Genome Research

SwissProt - Protein Sequence and Function

ProDom - Protein Domains

Pfam - Protein domain families

ProSite - Protein Sequence Motifs

Protein Data Base (PDB) - Coordinates for Protein 3D structures

SCOP Database- Domain structures organized into evolutionary families

HSSP - Domain database using Dali

FlyBase

WormBase

PubMed / MedLine

Sequence Alignment Tools

BLAST

Clustal MSAs

FASTA

PSI-Blast

Hidden Markov Models

3D Structure Alignments / Classifications

Dali

VAST

PRISM

CATH

SCOP

Databases

NCBI GenBank- Protein and DNA sequence

NCBI Human Map - Human Genome Viewer

NCBI Ensembl - Genome browsers for human, mouse, zebra fish, mosquito

TIGR - The Institute for Genome Research

SwissProt - Protein Sequence and Function

ProDom - Protein Domains

Pfam - Protein domain families

ProSite - Protein Sequence Motifs

Protein Data Base (PDB) - Coordinates for Protein 3D structures

SCOP Database- Domain structures organized into evolutionary families

HSSP - Domain database using Dali

CATH Database

FlyBase

WormBase

PubMed / MedLine

Sequence Alignment Tools

BLAST

Clustal MSAs

FASTA

PSI-Blast

Hidden Markov Models

3D Structure Alignments / Classifications

Dali

CATH

SCOP

VAST

PRISM

A B C - N Y R Q C L C R - P MA Y C Y N - R - C K C R B P

}

“Twilight zone”

% Sequence Identity

Homologous 3D Structure

Non-homologous

3D Structure

Residues Aligned

Adapted from Chris Sander

19% Seq ID

Z = 12.2

Adenylate Kinase Guanylate Kinase

Asp tRNA Synthetase

Staphylococcal Nuclease

CspA

Gene 5 ssDNA Binding Protein

Topoisomerase I

CspB

Protein Domains

“Independent Folding Units”

50 - 350 residues

Mean size - 125 residues

Alpha folds; Beta Folds;

Alpha+Beta Folds; Alpha/Beta Folds

COG 272, BRCT family

P. Bork et al

Cadherins

courtesy of C. Chothia

All beta

All alpha

alpha + beta

alpha / beta

- Classification of Protein Folds
- SCOP
- CATH
- DALI / FSSP

Most proteins in biology have been

produced by the duplication,

divergence and recombination of

the members of a small number of

protein families.

courtesy of C. Chothia

Average Domain Size: 170 residues

courtesy of C. Chothia

Class - 5

Fold - ~500

Superfamily - ~ 700

Family ~ 1000

Family - domains with common evolutionary origin

Homology: Derived by evolutionary divergence

All a folds

All b folds

a + b folds

a / b folds

small irregular folds

courtesy of C. Chothia

courtesy of C. Chothia

courtesy of C. Chothia

courtesy of C. Chothia

courtesy of C. Chothia

courtesy of C. Chothia

CATH is a hierarchical classification of protein domain structures, which clusters proteins at four major levels, Class(C), Architecture(A), Topology(T) and Homologous superfamily (H).

Orengo, C.A., Michie, A.D., Jones, S., Jones, D.T., Swindells, M.B., and Thornton, J.M. (1997) CATH- A Hierarchic Classification of Protein Domain Structures. Structure. Vol 5. No 8. p.1093-1108.

Pearl, F.M.G, Lee, D., Bray, J.E, Sillitoe, I., Todd, A.E., Harrison, A.P., Thornton, J.M. and Orengo, C.A. (2000) Assigning genomic sequences to CATH Nucleic Acids Research. Vol 28. No 1. 277-282

Class, derived from secondary structure content, is assigned for more than 90% of protein structures automatically.

Architecture, which describes the gross orientation of secondary structures, independent of connectivities, is currently assigned manually.

The topology level clusters structures according to their topological connections and numbers of secondary structures.

The homologous superfamilies cluster proteins with highly similar structures and functions. The assignments of structures to topology families and homologous superfamilies are made by sequence and structure comparisons.

Representations of Protein Stuctures

a - full atom

b,c - strands / helices

d - Topology diagrams

Hb

Alignment of Individual Structures

Fusing into a Single Fold “Template”

Mb

Hb VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-----HGSAQVKGHGKKVADALTNAV

||| .. | |.|| | . | . | | | | | | | .| .| || | || .

Mb VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASEDLKKHGVTVLTALGAIL

Hb AHVD-DMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR------

| | . || | .. . .| .. | |..| . . | | . ||.

Mb KK-KGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHPGDFGADAQGAMNKALELFRKDIAAKYKELGYQG

Elements: Domain definitions; Aligned structures, collecting together Non-homologous Sequences; Core annotation

Previous work: Remington, Matthews ‘80; Taylor, Orengo ‘89, ‘94; Artymiuk, Rice, Willett ‘89; Sali, Blundell, ‘90; Vriend, Sander ‘91; Russell, Barton ‘92; Holm, Sander ‘93; Godzik, Skolnick ‘94; Gibrat, Madej, Bryant ‘96; Falicov, F Cohen, ‘96; Feng, Sippl ‘96; G Cohen ‘97; Singh & Brutlag, ‘98

Easy:Globins

125 res., ~1.5 Å

Tricky:Ig C & V

85 res., ~3 Å

Very Subtle: G3P-dehydro-genase, C-term. Domain >5 Å

Given 2 Structures (A & B),

2 Basic Comparison Operations

- Find an Alignment between A and B based on their 3D coordinates
2 Given an alignment optimally SUPERIMPOSE A onto B

Find Best R & T to move A onto B

Distance Matices Provide a 2D Represenation of the 3D Structure

N x N distance matrix

Antiparallel beta strands

Parallel beta strands

Helices

N dimensional space

Metric matrix

Mij = Dij2 - Dio2 - Djo2

M Eigenvectors (M = 3 for 3D structure)

- Generate Ca-Ca distance matrix for each protein A and B
- Decompose into elementary contact patterns; e.g. hexapeptide-hexapeptide submatrices
- Systematic comparisons of all elementary contact patterns in the 2 distance matrices; similar contact patterns are stored in a “pair list”
- Assemble pairs of contact patterns into larger consistent sets of pairs (alignments), maximizing the similarity score between these local structures
- A Monte-Carlo algorithm is used to deal with the combinatorial complexity of building up alignments from contact patterns
- Dali Z score - number of standard deviations away from mean pairwise similarity value

- Dali Domain Dictionary is a numerical taxonomy of all known domain structures in the PDB
- Evolves from Dali / FSSP Database
Holm & Sander, Nucl. Acid Res. 25: 231-234 (1997)

- Dali Domain Dictionary Sept 2000
- 10,532 PDB enteries
- 17,101 protein chains
- 5 supersecondary structure motifs (attractors)
- 1375 fold types
- 2582 functional families
- 3724 domain sequence families

N x N distance matrix

N dimensional space

Metric matrix

Mij = Dij2 - Dio2 - Djo2

Eigenvectors of metric matrix

Principal component analysis

Database of 498 SCOP “Folds” or “Superfamilies”

The overall pair-wise comparisons of 498 folds lead to a 498 x 498 matrix of similarity scores Sijs, where Sij is the alignment score between the ith and jth folds.

An appropriate method for handling such data matrices as a whole is metric matrix distance geometry . We first convert the similarity score matrix [Sij] to a distance matrix [Dij] by using Dij = Smax - Sij, where Smax is the maximum similarity score among all pairs of folds.

We then transform the distance matrix to a metric (or Gram) matrix [Mij] by using Mij = Dij2 - Dio2 - Djo2

where Di0, the distance between the ith fold and the geometric centroid of all N = 498 folds. The eigen values of the metric matrix define an orthogonal system of axes, called factors. These axes pass through the geometric centroid of the points representing all observed folds and correspond to a decreasing order of the amount of information each factor represents.