Fold recognition

Bioinformatics Subtopics

Secondary Structure Prediction

Fold Recognition

Homology Modeling

Docking & Drug Design

E-literature

Function Classification

Sequence Alignment

Expression Clustering

Protein Geometry

Database Design

Gene Prediction

Large-Scale Genomic Surveys

Structural

Informatics

Structure Classification

Genome Annotation


Some Specific “Informatics” Tools of Bioinformatics

Databases

NCBI GenBank - Protein and DNA sequence

NCBI Human Map - Human Genome Viewer

Ensembl - Genome browsers for human, mouse, zebrafish, mosquito

TIGR - The Institute for Genomic Research

SwissProt - Protein Sequence and Function

ProDom - Protein Domains

Pfam - Protein domain families

ProSite - Protein Sequence Motifs

Protein Data Bank (PDB) - Coordinates for Protein 3D structures

SCOP Database - Domain structures organized into evolutionary families

HSSP - Domain database using Dali

CATH Database

FlyBase

WormBase

PubMed / MedLine

Sequence Alignment Tools

BLAST

Clustal MSAs

FASTA

PSI-Blast

Hidden Markov Models

3D Structure Alignments / Classifications

Dali

VAST

PRISM

CATH

SCOP


Dynamic Programming Algorithm: Alternate Tracebacks Correspond to Alternative Alignments

A B C - N Y R Q C L C R - P M

A Y C Y N - R - C K C R B P
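The idea that one dynamic programming matrix can hold several equally good tracebacks can be sketched as follows. This is a minimal global-alignment example, not the scoring used in the slides: the function name `all_optimal_alignments` and the match, mismatch and gap values are assumptions chosen for illustration.

```python
def all_optimal_alignments(a, b, match=1, mismatch=-1, gap=-1):
    """Return (best_score, sorted list of all optimal global alignments)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    def traceback(i, j):
        """Yield every optimal alignment of a[:i] with b[:j]."""
        if i == 0 and j == 0:
            yield "", ""
            return
        # Follow every predecessor cell that explains score[i][j], not just one.
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                for x, y in traceback(i - 1, j - 1):
                    yield x + a[i - 1], y + b[j - 1]
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            for x, y in traceback(i - 1, j):
                yield x + a[i - 1], y + "-"
        if j > 0 and score[i][j] == score[i][j - 1] + gap:
            for x, y in traceback(i, j - 1):
                yield x + "-", y + b[j - 1]

    return score[n][m], sorted(traceback(n, m))


if __name__ == "__main__":
    best, alignments = all_optimal_alignments("ABCNYRQCLCRPM", "AYCYNRCKCRBP")
    print(best, len(alignments))
    for top, bottom in alignments[:3]:
        print(top)
        print(bottom)
        print()
```

The number of tied tracebacks can grow quickly with sequence length, so enumerating all of them is only practical for short sequences such as the pair on the slide.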


Sequence Similarity May Miss Functional Homologies Which Can Be Detected by 3D Structural Analysis

[Figure: % Sequence Identity versus Residues Aligned, with regions labeled Homologous 3D Structure, Non-homologous 3D Structure, and “Twilight zone”. Adapted from Chris Sander.]


Structural Validation of Homology

Adenylate Kinase vs. Guanylate Kinase

19% Seq ID

Z = 12.2


Asp tRNA Synthetase

Staphylococcal Nuclease

CspA

Gene 5 ssDNA Binding Protein

Topoisomerase I

CspB


Protein Domains

“Independent Folding Units”

50 - 350 residues

Mean size - 125 residues

Alpha folds; Beta folds; Alpha+Beta folds; Alpha/Beta folds


COG 272, BRCT family

P. Bork et al


Cadherins

courtesy of C. Chothia


Principal Protein Fold Classes

All beta

All alpha

alpha + beta

alpha / beta


  • Classification of Protein Folds

  • SCOP

  • CATH

  • DALI / FSSP


Most proteins in biology have been produced by the duplication, divergence and recombination of the members of a small number of protein families.

courtesy of C. Chothia


Average Domain Size: 170 residues

courtesy of C. Chothia


SCOP - Protein Fold Hierarchy: Manually Curated Database of Domain Structures

Class - 5

Fold - ~500

Superfamily - ~700

Family - ~1000

Family - domains with common evolutionary origin

Homology: Derived by evolutionary divergence


Five Principal Fold Classes

All α folds

All β folds

α + β folds

α / β folds

small irregular folds


[Six image-only slides, courtesy of C. Chothia]


CATH Protein Domain Database: Partially Automatic Fold Classification

CATH is a hierarchical classification of protein domain structures, which clusters proteins at four major levels: Class (C), Architecture (A), Topology (T) and Homologous superfamily (H).

Orengo, C.A., Michie, A.D., Jones, S., Jones, D.T., Swindells, M.B., and Thornton, J.M. (1997) CATH - A Hierarchic Classification of Protein Domain Structures. Structure. Vol 5. No 8. p. 1093-1108.

Pearl, F.M.G., Lee, D., Bray, J.E., Sillitoe, I., Todd, A.E., Harrison, A.P., Thornton, J.M. and Orengo, C.A. (2000) Assigning genomic sequences to CATH. Nucleic Acids Research. Vol 28. No 1. p. 277-282.


CATH Protein Domain Database: Partially Automatic Fold Classification

Class, derived from secondary structure content, is assigned for more than 90% of protein structures automatically.

Architecture, which describes the gross orientation of secondary structures, independent of connectivities, is currently assigned manually.

The topology level clusters structures according to their topological connections and numbers of secondary structures.

The homologous superfamilies cluster proteins with highly similar structures and functions. The assignments of structures to topology families and homologous superfamilies are made by sequence and structure comparisons.
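The four-level hierarchy can be made concrete with a small sketch that splits a dotted CATH-style identifier into its C, A, T and H numbers and reports how deep two domains agree. The names `CathId`, `parse_cath_id` and `shared_level` are hypothetical helpers for illustration, and the identifiers in the usage lines are given only as examples of the dotted format.

```python
from typing import NamedTuple


class CathId(NamedTuple):
    """One number per level: Class, Architecture, Topology, Homologous superfamily."""
    cls: int
    architecture: int
    topology: int
    superfamily: int

    def prefix(self, depth):
        """Dotted identifier truncated to the first `depth` levels."""
        return ".".join(str(part) for part in self[:depth])


def parse_cath_id(text):
    """Parse a dotted four-part identifier such as '3.40.50.720'."""
    parts = text.strip().split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected four dot-separated integers, got {text!r}")
    return CathId(*(int(p) for p in parts))


def shared_level(a, b):
    """Number of leading levels (0 to 4) on which two identifiers agree."""
    depth = 0
    for x, y in zip(a, b):
        if x != y:
            break
        depth += 1
    return depth


if __name__ == "__main__":
    first = parse_cath_id("3.40.50.720")
    second = parse_cath_id("3.40.50.300")
    # Same class, architecture and topology; different homologous superfamily.
    print(first.prefix(3), shared_level(first, second))
```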


Representations of Protein Structures

a - full atom

b,c - strands / helices

d - Topology diagrams


Structural Alignment of Two Globins


Structure-based Sequence Alignments

Alignment of Individual Structures

Fusing into a Single Fold “Template”

Hb VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-----HGSAQVKGHGKKVADALTNAV

Mb VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASEDLKKHGVTVLTALGAIL

Hb AHVD-DMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR------

Mb KK-KGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHPGDFGADAQGAMNKALELFRKDIAAKYKELGYQG

Elements: Domain definitions; Aligned structures, collecting together Non-homologous Sequences; Core annotation

Previous work: Remington, Matthews ‘80; Taylor, Orengo ‘89, ‘94; Artymiuk, Rice, Willett ‘89; Sali, Blundell, ‘90; Vriend, Sander ‘91; Russell, Barton ‘92; Holm, Sander ‘93; Godzik, Skolnick ‘94; Gibrat, Madej, Bryant ‘96; Falicov, F Cohen, ‘96; Feng, Sippl ‘96; G Cohen ‘97; Singh & Brutlag, ‘98


Some Similarities Are Readily Apparent, Others Are More Subtle

Easy: Globins, 125 res., ~1.5 Å

Tricky: Ig C & V, 85 res., ~3 Å

Very Subtle: G3P-dehydrogenase, C-term. Domain, >5 Å


Automatically Comparing Protein Structures

Given 2 Structures (A & B), 2 Basic Comparison Operations

  1 Find an Alignment between A and B based on their 3D coordinates

  2 Given an alignment, optimally SUPERIMPOSE A onto B

    Find Best R & T to move A onto B
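The second operation, finding the best R and T for an already aligned set of atom pairs, can be sketched as below. The slides do not say which method is used; this sketch assumes a least-squares fit via Horn's unit-quaternion method, with the leading eigenvector found by simple power iteration. The function names `centroid` and `superimpose` and the iteration count are illustrative choices.

```python
import math


def centroid(points):
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(3))


def superimpose(a, b, iterations=500):
    """Return (R, t, rmsd) so that R*a_i + t best fits b_i in the least-squares sense.

    a and b are equal-length lists of (x, y, z), paired by a given alignment.
    """
    ca, cb = centroid(a), centroid(b)
    a0 = [tuple(p[k] - ca[k] for k in range(3)) for p in a]
    b0 = [tuple(p[k] - cb[k] for k in range(3)) for p in b]
    # Cross-covariance: s[i][j] = sum over pairs of a_i * b_j (centered coordinates).
    s = [[sum(p[i] * q[j] for p, q in zip(a0, b0)) for j in range(3)] for i in range(3)]
    (sxx, sxy, sxz), (syx, syy, syz), (szx, szy, szz) = s
    # Horn's symmetric 4x4 matrix; its top eigenvector is the optimal unit quaternion.
    n_mat = [
        [sxx + syy + szz, syz - szy, szx - sxz, sxy - syx],
        [syz - szy, sxx - syy - szz, sxy + syx, szx + sxz],
        [szx - sxz, sxy + syx, -sxx + syy - szz, syz + szy],
        [sxy - syx, szx + sxz, syz + szy, -sxx - syy + szz],
    ]
    # Shifting by a Gershgorin bound makes every eigenvalue non-negative, so
    # power iteration converges to the eigenvector of the largest eigenvalue.
    shift = max(sum(abs(v) for v in row) for row in n_mat)
    quat = [1.0, 0.0, 0.0, 0.0]  # identity rotation for degenerate input
    if shift > 0.0:
        quat = [1.0, 0.5, 0.25, 0.125]  # arbitrary start vector (assumption)
        for _ in range(iterations):
            quat = [sum(n_mat[i][j] * quat[j] for j in range(4)) + shift * quat[i]
                    for i in range(4)]
            norm = math.sqrt(sum(v * v for v in quat))
            quat = [v / norm for v in quat]
    q0, qx, qy, qz = quat
    rot = [
        [q0*q0 + qx*qx - qy*qy - qz*qz, 2 * (qx*qy - q0*qz), 2 * (qx*qz + q0*qy)],
        [2 * (qy*qx + q0*qz), q0*q0 - qx*qx + qy*qy - qz*qz, 2 * (qy*qz - q0*qx)],
        [2 * (qz*qx - q0*qy), 2 * (qz*qy + q0*qx), q0*q0 - qx*qx - qy*qy + qz*qz],
    ]
    # Translation maps the rotated centroid of A onto the centroid of B.
    t = tuple(cb[i] - sum(rot[i][k] * ca[k] for k in range(3)) for i in range(3))
    moved = [tuple(sum(rot[i][k] * p[k] for k in range(3)) + t[i] for i in range(3))
             for p in a]
    rmsd = math.sqrt(sum(math.dist(m, q) ** 2 for m, q in zip(moved, b)) / len(a))
    return rot, t, rmsd
```

A production implementation would normally use a linear-algebra library for the eigenvector step; power iteration is used here only to keep the sketch dependency-free, and it can converge slowly for nearly collinear point sets.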


Distance Matrices Provide a 2D Representation of the 3D Structure
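A short sketch of what such a matrix looks like in practice: it builds the pairwise distance matrix for a list of Cα coordinates and thresholds it into a contact map. The idealized helix geometry (radius 2.3 Å, rise 1.5 Å and 100° per residue), the 8 Å cutoff and all function names are assumptions made for the example, not values taken from the slides.

```python
import math


def distance_matrix(coords):
    """Symmetric N x N matrix of pairwise distances between points."""
    return [[math.dist(p, q) for q in coords] for p in coords]


def contact_map(dmat, cutoff=8.0, min_separation=3):
    """1 where two residues are closer than cutoff and not near-neighbors in sequence."""
    n = len(dmat)
    return [[int(abs(i - j) >= min_separation and dmat[i][j] < cutoff)
             for j in range(n)] for i in range(n)]


def ideal_helix(n, radius=2.3, rise=1.5, turn_deg=100.0):
    """Approximate C-alpha trace of an alpha helix (idealized geometry, an assumption)."""
    return [(radius * math.cos(math.radians(turn_deg * i)),
             radius * math.sin(math.radians(turn_deg * i)),
             rise * i) for i in range(n)]


if __name__ == "__main__":
    dmat = distance_matrix(ideal_helix(10))
    for row in contact_map(dmat):
        # A helix shows up as a band of contacts running close to the diagonal.
        print("".join("#" if c else "." for c in row))
```

Because the matrix holds only internal distances, it is unchanged when the structure is rotated or translated, which is what makes it convenient for comparing two proteins without superimposing them first.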


Explain Concept of Distance Matrix on Blackboard

N x N distance matrix

Antiparallel beta strands

Parallel beta strands

Helices

N dimensional space

Metric matrix

Mij = (Di0² + Dj0² - Dij²) / 2

M Eigenvectors (M = 3 for 3D structure)
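The path from a distance matrix back to coordinates can be sketched as follows, assuming the usual distance-geometry definition of the metric matrix, Mij = (Di0² + Dj0² - Dij²) / 2, with Di0 the distance of point i from the centroid. The centroid distances are computed from the pairwise distances alone, and the leading eigenvectors are found with a plain power iteration; the function names, the random seed and the iteration count are illustrative assumptions.

```python
import math
import random


def distances(points):
    return [[math.dist(p, q) for q in points] for p in points]


def metric_matrix(dmat):
    """Gram matrix M_ij = (D_i0^2 + D_j0^2 - D_ij^2) / 2 built from distances only."""
    n = len(dmat)
    sq = [[d * d for d in row] for row in dmat]
    total = sum(sq[j][k] for j in range(n) for k in range(j + 1, n))
    # Squared distance of each point from the centroid, without using coordinates.
    d0 = [sum(sq[i]) / n - total / (n * n) for i in range(n)]
    return [[(d0[i] + d0[j] - sq[i][j]) / 2 for j in range(n)] for i in range(n)]


def top_eigenpairs(mat, k, iterations=1000, seed=0):
    """Largest k eigenpairs by power iteration with deflation (symmetric PSD input)."""
    n = len(mat)
    rng = random.Random(seed)
    work = [row[:] for row in mat]
    pairs = []
    for _ in range(k):
        v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        for _ in range(iterations):
            w = [sum(work[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        lam = sum(v[i] * sum(work[i][j] * v[j] for j in range(n)) for i in range(n))
        pairs.append((lam, v))
        # Deflate: remove the component just found so the next one can emerge.
        work = [[work[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return pairs


def embed(dmat, dims=3):
    """Coordinates whose pairwise distances reproduce dmat (up to rotation/reflection)."""
    pairs = top_eigenpairs(metric_matrix(dmat), dims)
    return [tuple(math.sqrt(max(lam, 0.0)) * vec[i] for lam, vec in pairs)
            for i in range(len(dmat))]
```

For distances that come from a real 3D structure, three eigenvectors carry all of the information, which is the sense of "M = 3 for 3D structure" on the slide.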


DALI: Protein Structure Comparison by Alignment of Distance Matrices

L. Holm and C. Sander, J. Mol. Biol. 233: 123 (1993)

  • Generate Cα-Cα distance matrix for each protein A and B

  • Decompose into elementary contact patterns; e.g. hexapeptide-hexapeptide submatrices

  • Systematic comparisons of all elementary contact patterns in the 2 distance matrices; similar contact patterns are stored in a “pair list”

  • Assemble pairs of contact patterns into larger consistent sets of pairs (alignments), maximizing the similarity score between these local structures

  • A Monte-Carlo algorithm is used to deal with the combinatorial complexity of building up alignments from contact patterns

  • Dali Z score - number of standard deviations away from mean pairwise similarity value
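The Z score as described in the last bullet can be illustrated with a plain standardization of a set of pairwise similarity values. This is only the textbook definition; the published Dali procedure may normalize the raw scores differently, and the function name `z_scores` and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev


def z_scores(similarities):
    """Express each similarity as standard deviations above the mean of the set."""
    mu = mean(similarities)
    sigma = pstdev(similarities)  # population standard deviation (assumption)
    if sigma == 0:
        raise ValueError("all similarity values are identical; Z score is undefined")
    return [(s - mu) / sigma for s in similarities]


if __name__ == "__main__":
    # Hypothetical raw similarity values for one query against five structures.
    raw = [120.0, 95.0, 410.0, 101.0, 88.0]
    for value, z in zip(raw, z_scores(raw)):
        print(f"{value:7.1f}  Z = {z:5.2f}")
```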


Dali Domain Dictionary

Dietmann, Park, Notredame, Heger, Lappe, and Holm, Nucleic Acids Res. 29: 55-57 (2001)

  • Dali Domain Dictionary is a numerical taxonomy of all known domain structures in the PDB

  • Evolves from Dali / FSSP Database

    Holm & Sander, Nucl. Acid Res. 25: 231-234 (1997)

  • Dali Domain Dictionary Sept 2000

    • 10,532 PDB entries

    • 17,101 protein chains

    • 5 supersecondary structure motifs (attractors)

    • 1375 fold types

    • 2582 functional families

    • 3724 domain sequence families


Explain Concept of Distance Matrix on Blackboard

N x N distance matrix

N dimensional space

Metric matrix

Mij = (Di0² + Dj0² - Dij²) / 2

Eigenvectors of metric matrix

Principal component analysis


A Global Representation of Protein Fold Space

Hou, Sims, Zhang, Kim, PNAS 100: 2386-2390 (2003)

Database of 498 SCOP “Folds” or “Superfamilies”

The overall pair-wise comparisons of 498 folds lead to a 498 x 498 matrix of similarity scores Sij, where Sij is the alignment score between the ith and jth folds.

An appropriate method for handling such data matrices as a whole is metric matrix distance geometry. We first convert the similarity score matrix [Sij] to a distance matrix [Dij] by using Dij = Smax - Sij, where Smax is the maximum similarity score among all pairs of folds.

We then transform the distance matrix to a metric (or Gram) matrix [Mij] by using Mij = (Di0² + Dj0² - Dij²) / 2, where Di0 is the distance between the ith fold and the geometric centroid of all N = 498 folds. The eigenvalues of the metric matrix define an orthogonal system of axes, called factors. These axes pass through the geometric centroid of the points representing all observed folds and correspond to a decreasing order of the amount of information each factor represents.
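Two small steps of this procedure can be sketched directly: the conversion Dij = Smax - Sij, and the share of information carried by each factor. The second function assumes that "amount of information" means each eigenvalue's fraction of the total of the positive eigenvalues, which is a common reading but is not stated in the text; both function names and the sample numbers are hypothetical.

```python
def similarity_to_distance(sim):
    """Convert a similarity matrix to distances with D_ij = S_max - S_ij."""
    s_max = max(max(row) for row in sim)
    return [[s_max - s for s in row] for row in sim]


def factor_information(eigenvalues):
    """Fraction of the total carried by each factor, largest eigenvalue first.

    Negative eigenvalues (possible when the distances are not exactly
    Euclidean) are treated as carrying no information.
    """
    ordered = sorted((max(e, 0.0) for e in eigenvalues), reverse=True)
    total = sum(ordered)
    if total == 0:
        raise ValueError("no positive eigenvalues")
    return [e / total for e in ordered]


if __name__ == "__main__":
    # Hypothetical 3 x 3 similarity scores; self-comparisons score highest.
    scores = [[10.0, 7.0, 4.0],
              [7.0, 10.0, 6.0],
              [4.0, 6.0, 10.0]]
    for row in similarity_to_distance(scores):
        print(row)
    print(factor_information([6.0, 3.0, 1.0, -0.5]))
```

The distance matrix produced here is the input to the metric matrix transformation described in the paragraph above.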

