LSM2104/CZ2251 Essential Bioinformatics and Biocomputing Protein Structure and Visualization (2) Chen Yu Zong firstname.lastname@example.org 6874-6877
Lecture 10: Protein structure databases, visualization, and classification 1. Introduction to the Protein Data Bank (PDB) 2. Free graphics software for 3D structure visualization 3. Hierarchical classification of protein domains: SCOP, CATH and DALI
1. Protein Data Bank (PDB) • Protein Data Bank: maintained by the Research Collaboratory for Structural Bioinformatics (RCSB) • http://www.rcsb.org/pdb/ • 30060 structures as of 15-Mar-2005 • 27570 structures as of 05-Oct-2004 • 23997 structures as of 20-Jan-2004 • Also contains structures of other bio-macromolecules: DNA, carbohydrates and protein-DNA complexes.
Deficiencies in our structural knowledge • Only deposited data are actually available; many structures are not deposited in the PDB. Why? • Structures are available mainly for soluble proteins; there are only a few dozen entries for membrane protein domains. Why? • X-ray data exist only for proteins that crystallize well and diffract properly. Why? • NMR structures are usually of small proteins. How could one survey the sizes of NMR-determined proteins? • Structural data are estimated to be available for only 10-15% of all known proteins.
Protein Structure in PDB • Text files • Each entry is identified by a unique 4-character code (PDB code), e.g. 1HUY for a variant of GFP and 1BGK for a 37-residue toxin protein isolated from a sea anemone • Header information • Atomic coordinates in Å (1 ångström = 1.0e-10 m)
Header Details • Identifies the molecule, modifications, date of release • Host organism, keywords, method of study • Authors, reference, resolution for X-ray structures • The smaller the resolution value (in Å), the better the structure • Sequence, reference
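The header fields listed above can be read from a PDB text file with a few fixed-column slices. The following is a minimal sketch, assuming the standard PDB flat-file layout (HEADER, EXPDTA and REMARK 2 records). The function name `parse_header` and the sample records are invented for illustration and do not come from a real entry.

```python
def parse_header(lines):
    """Collect a few header fields from PDB-format text lines."""
    info = {}
    for line in lines:
        if line.startswith("HEADER"):
            # Columns 11-50: classification, 51-59: deposition date, 63-66: ID code
            info["classification"] = line[10:50].strip()
            info["deposition_date"] = line[50:59].strip()
            info["id_code"] = line[62:66].strip()
        elif line.startswith("EXPDTA"):
            # Experimental method, e.g. X-ray diffraction or solution NMR
            info["method"] = line[10:79].strip()
        elif line.startswith("REMARK"):
            fields = line.split()
            if fields[1:3] == ["2", "RESOLUTION."]:
                try:
                    info["resolution_angstrom"] = float(fields[3])
                except (IndexError, ValueError):
                    # NMR entries carry no numeric resolution
                    info["resolution_angstrom"] = None
    return info


# Made-up sample records (not a real PDB entry), built so the columns line up.
SAMPLE_HEADER = [
    "HEADER    " + "EXAMPLE PROTEIN".ljust(40) + "01-JAN-00" + "   " + "9XYZ",
    "EXPDTA    X-RAY DIFFRACTION",
    "REMARK   2 RESOLUTION.    2.20 ANGSTROMS.",
]

header_info = parse_header(SAMPLE_HEADER)
print(header_info)
```

A lower `resolution_angstrom` value indicates a better-resolved X-ray structure; for NMR entries the sketch stores `None` instead of a number.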
The Atomic Coordinates • XYZ coordinates for each atom of the bio-macromolecule (records starting with ATOM), from the first residue to the last • XYZ coordinates for any ligands complexed to the bio-macromolecule (records starting with HETATM) • O atoms of water molecules (also HETATM records, normally at the end of the coordinate section) • For X-ray structures, the resolution is usually not high enough to locate H atoms, so only heavy atoms appear in the PDB file • For NMR structures, all atoms (including hydrogen atoms) are specified in the PDB file.
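The coordinate records described above use fixed columns, so they can be read by slicing each line. The following is a minimal sketch, assuming the standard PDB column layout for ATOM and HETATM records. The names `Atom` and `parse_atoms` are hypothetical, and the three sample records use invented coordinates rather than data from 1HUY or 1BGK.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    record: str    # "ATOM" or "HETATM"
    serial: int
    name: str      # atom name, e.g. "CA"
    res_name: str  # residue name, e.g. "MET" or "HOH"
    chain: str
    res_seq: int
    x: float       # coordinates in angstroms
    y: float
    z: float


def parse_atoms(lines):
    """Return an Atom for every ATOM/HETATM record in the given lines."""
    atoms = []
    for line in lines:
        record = line[:6].strip()
        if record not in ("ATOM", "HETATM"):
            continue
        atoms.append(Atom(
            record=record,
            serial=int(line[6:11]),
            name=line[12:16].strip(),
            res_name=line[17:20].strip(),
            chain=line[21],
            res_seq=int(line[22:26]),
            x=float(line[30:38]),
            y=float(line[38:46]),
            z=float(line[46:54]),
        ))
    return atoms


# Invented sample records: two protein atoms and one water oxygen.
SAMPLE_COORDS = [
    "ATOM      1  N   MET A   1      11.000  12.500  13.250  1.00 20.00           N",
    "ATOM      2  CA  MET A   1      12.000  13.500  14.250  1.00 20.00           C",
    "HETATM  500  O   HOH A 301      21.000  22.500  23.250  1.00 30.00           O",
]

atoms = parse_atoms(SAMPLE_COORDS)
waters = [a for a in atoms if a.record == "HETATM" and a.res_name == "HOH"]
print(len(atoms), "atoms,", len(waters), "water oxygen(s)")
```

Slicing by column rather than splitting on whitespace matters here, because adjacent fields in real PDB files can run together without a separating space.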
2. Free Software for Protein Structure Visualization • RASMOL: available for all platforms http://www.openrasmol.org • Swiss PDB Viewer: from Swiss-Prot http://www.expasy.ch/spdbv/ • Chemscape Chime Plug-in: for PC and Mac http://www.mdl.com/downloads/downloadable/index.jsp • YASARA: http://www.yasara.org/ • MOLMOL: MOLecule analysis and MOLecule display http://126.96.36.199/wuthrich/software/molmol/index.html
Ribbon representation by RasMol: 1HUY, an improved yellow variant of green fluorescent protein from Tsien’s group, J. Biol. Chem. 276, 29188 (2001)
An ensemble of 15 structures (NMR, toxin Bgk); hydrogen atoms (protons) are also included • 15 backbone structures of the sea anemone toxin Bgk
15 all-atom structures of the sea anemone toxin Bgk (line representation)
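An NMR entry such as the Bgk ensemble stores its structures one after another in a single file. The following is a minimal sketch for separating them, assuming each structure is bracketed by MODEL and ENDMDL records as in the standard PDB format. The function name `split_models` is hypothetical, and the sample lines are truncated placeholders rather than data from 1BGK.

```python
def split_models(lines):
    """Group ATOM/HETATM lines by the MODEL number that encloses them."""
    models = {}
    current = None
    for line in lines:
        record = line[:6].strip()
        if record == "MODEL":
            current = int(line.split()[1])
            models[current] = []
        elif record == "ENDMDL":
            current = None
        elif record in ("ATOM", "HETATM") and current is not None:
            models[current].append(line)
    return models


# Truncated placeholder records for a two-model ensemble (invented, not 1BGK).
SAMPLE_ENSEMBLE = [
    "MODEL        1",
    "ATOM      1  N   VAL A   1",
    "ATOM      2  CA  VAL A   1",
    "ENDMDL",
    "MODEL        2",
    "ATOM      1  N   VAL A   1",
    "ATOM      2  CA  VAL A   1",
    "ENDMDL",
]

models = split_models(SAMPLE_ENSEMBLE)
print(len(models), "models;", len(models[1]), "atoms in model 1")
```

Applied to a full NMR file, the number of keys in the returned dictionary gives the ensemble size (15 for the ensemble shown on the slide).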
3. Hierarchical classification of protein domains: SCOP & CATH • SCOP: Structural Classification of Proteins, University of Cambridge, UK http://scop.mrc-lmb.cam.ac.uk/scop/ (mirror in Singapore: http://scop.bic.nus.edu.sg/) • CATH: Class, Architecture, Topology, Homologous superfamily, Sequence family; University College London, UK http://www.biochem.ucl.ac.uk/bsm/cath/
Basis for protein classification • Proteins adopt a limited number of topologies: more than 50,000 sequences fold into ~1,000 unique folds. • Homologous sequences have similar structures: usually, when sequence identity is > 30%, proteins adopt the same fold. Even in the absence of sequence homology, some folds are preferred by vastly different sequences. • The “active site” is highly conserved: a subset of functionally critical residues is found to be conserved even when the folds vary.
How many unique folds do organisms use to express functions? • Sequence space: > 50,000 sequences • Conformational space: ~1,000 unique folds • Many sequences form one unique fold
Structural Classification of Proteins (SCOP) • University of Cambridge, UK: http://scop.mrc-lmb.cam.ac.uk/scop/ • Mirrored in Singapore: http://scop.bic.nus.edu.sg/ • Contains PDB entries grouped hierarchically by: • Structural class, • Fold, • Superfamily, • Family, • Individual member (domain-based)
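The hierarchy above can be modelled as a path from class down to family. SCOP itself writes such paths as compact dotted identifiers (class letter, then fold, superfamily and family numbers). This identifier format is not covered on the slide and is brought in here as an assumption. The following is a minimal sketch that finds the deepest level two domains share; the names `ScopPath`, `parse_sccs` and `deepest_shared_level` are hypothetical, and the identifiers are illustrative values, not looked-up classifications.

```python
from typing import NamedTuple, Optional


class ScopPath(NamedTuple):
    cls: str          # structural class, e.g. "a"
    fold: int
    superfamily: int
    family: int


LEVELS = ("class", "fold", "superfamily", "family")


def parse_sccs(sccs: str) -> ScopPath:
    """Split a dotted identifier such as 'a.1.1.2' into its four levels."""
    cls, fold, superfamily, family = sccs.split(".")
    return ScopPath(cls, int(fold), int(superfamily), int(family))


def deepest_shared_level(a: str, b: str) -> Optional[str]:
    """Name the lowest hierarchy level two domains have in common, if any."""
    shared = None
    for level, x, y in zip(LEVELS, parse_sccs(a), parse_sccs(b)):
        if x != y:
            break
        shared = level
    return shared


# Illustrative identifiers only.
print(deepest_shared_level("a.1.1.2", "a.1.1.1"))  # same superfamily, different family
print(deepest_shared_level("a.1.1.2", "b.1.1.1"))  # different class
```

Comparing level by level from the top reflects the nesting: two domains can only share a family if they already share class, fold and superfamily.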
Structural Classification of Proteins (SCOP): Family • Proteins are clustered into families on the basis of one of two criteria that imply a common evolutionary origin: • All proteins that have residue identities of 30% or greater; • Proteins with lower sequence identities but whose functions and structures are very similar • Example: globins with sequence identities of 15%.
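The 30% criterion above refers to the fraction of identical residues between two aligned sequences. The following is a minimal sketch of that calculation, assuming the two sequences have already been aligned and that identity is counted over columns where neither sequence has a gap (other definitions of the denominator exist). The function name `percent_identity` and the short sequences are invented for illustration.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of identical residues over the non-gap columns of an alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only columns where both sequences have a residue
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != gap and b != gap]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


# Invented toy alignment: 8 gap-free columns, 6 of them identical.
identity = percent_identity("MKT-AYIAK", "MKTLAYLAR")
print(f"{identity:.1f}% identity; meets the 30% criterion: {identity >= 30.0}")
```

A pair falling below 30%, like the globins mentioned above, would need the second criterion (similar structure and function) to be placed in the same family.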
Structural Classification of Proteins (SCOP): Superfamily • Families whose proteins have low sequence identities, but whose structures and, in many cases, functional features suggest that a common evolutionary origin is probable, are placed together in superfamilies • Example: actin, the ATPase domain of the heat-shock protein, and hexokinase
Structural Classification of Proteins (SCOP): Fold • Superfamilies and families are defined as having a common fold if their proteins have the same major secondary structures in the same arrangement with the same topological connections.
Structural Classification of Proteins (SCOP): Class • For the convenience of users, the different folds have been grouped into classes. Most folds are assigned to one of a few structural classes on the basis of the secondary structures of which they are composed.
SCOP Class: All-α topologies • cytochrome b-562 • ferritin
SCOP Class: All-β topologies • β-barrels • β-sandwiches
SCOP Class: α/β topologies • α/β horseshoe
SCOP Class: α/β topologies • α/β barrels