
Formats of Structural data

Sameer Velankar

http://msd.ebi.ac.uk

slide2

Data Format

‘A precise specification of how to write data to a file’

E.g. PDB:

ATOM 230 C GLY A 34 23.947 85.223 129.533 1.00 14.43 C

Data Formats can be hard to change once made

You can convert between formats, provided the underlying data model is the same
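To make "a precise specification" concrete, here is a minimal sketch in Python that writes and reads an ATOM record by fixed character columns. The column positions follow the published PDB format description. The record is rebuilt from the values on the slide so that the spacing is exact. The names `ATOM_FORMAT`, `ATOM_SLICES`, `format_atom_record` and `parse_atom_record` are illustrative and come from no particular library.

```python
# Writer template: field widths add up to the 78 columns used here.
# The two single-width "{:1}" slots are the alternate-location and
# insertion-code columns, left blank in this example.
ATOM_FORMAT = ("{:<6}{:>5} {:<4}{:1}{:>3} {:1}{:>4}{:1}   "
               "{:8.3f}{:8.3f}{:8.3f}{:6.2f}{:6.2f}          {:>2}")

# Reader layout: the same columns expressed as 0-based Python slices.
ATOM_SLICES = {
    "record": slice(0, 6),
    "serial": slice(6, 11),
    "name": slice(12, 16),
    "res_name": slice(17, 20),
    "chain_id": slice(21, 22),
    "res_seq": slice(22, 26),
    "x": slice(30, 38),
    "y": slice(38, 46),
    "z": slice(46, 54),
    "occupancy": slice(54, 60),
    "b_factor": slice(60, 66),
    "element": slice(76, 78),
}


def format_atom_record(serial, name, res_name, chain_id, res_seq,
                       x, y, z, occupancy, b_factor, element):
    """Write one ATOM record with every field in its fixed columns."""
    return ATOM_FORMAT.format("ATOM", serial, name, "", res_name, chain_id,
                              res_seq, "", x, y, z, occupancy, b_factor,
                              element)


def parse_atom_record(line):
    """Read an ATOM record back by slicing columns, not by splitting."""
    atom = {key: line[part].strip() for key, part in ATOM_SLICES.items()}
    for key in ("serial", "res_seq"):
        atom[key] = int(atom[key])
    for key in ("x", "y", "z", "occupancy", "b_factor"):
        atom[key] = float(atom[key])
    return atom


# The carbonyl carbon of GLY A 34 from the slide; " C" keeps the atom
# name in the column where one-letter element names conventionally start.
line = format_atom_record(230, " C", "GLY", "A", 34,
                          23.947, 85.223, 129.533, 1.00, 14.43, "C")
atom = parse_atom_record(line)
```

Reader and writer must agree on every column. That is why such a format is hard to change once files and programs that depend on it exist.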

slide3

Formats appear and disappear like meteoroids even before we hear about them. We have (enough!) standards. We are happy and proud of the formats we use now and we do not understand why others do not like them!!

slide4

[Diagram: data capture, storage and delivery]

Data capture: new data from NMR spectroscopy, X-ray crystallography and electron microscopy (harvesting or 'manual'); legacy data (old PDB) after clean-up

Data storage: relational database (Oracle)

Data delivery: XML, HTML, flat files, CORBA

slide7

Annotated PDB entry

HEADER

HEADER OUTER MEMBRANE 29-JUN-99 1QJP

TITLE HIGH RESOLUTION STRUCTURE OF THE OUTER MEMBRANE PROTEIN A

TITLE 2 (OMPA) TRANSMEMBRANE DOMAIN

COMPND MOL_ID: 1;

COMPND 2 MOLECULE: OUTER MEMBRANE PROTEIN A;

COMPND 3 CHAIN: A;

COMPND 4 FRAGMENT: TRANSMEMBRANE DOMAIN;

COMPND 5 ENGINEERED: YES;

COMPND 6 MUTATION: YES

SOURCE MOL_ID: 1;

SOURCE 2 ORGANISM_SCIENTIFIC: ESCHERICHIA COLI;

SOURCE 3 STRAIN: BL21DE3;

SOURCE 4 PLASMID: PET3B-171;

SOURCE 5 GENE: OMPA;

SOURCE 6 EXPRESSION_SYSTEM: ESCHERICHIA COLI;

SOURCE 7 EXPRESSION_SYSTEM_STRAIN: BL21DE3;

SOURCE 8 EXPRESSION_SYSTEM_PLASMID: PET3B-171

KEYWDS OUTER MEMBRANE

EXPDTA X-RAY DIFFRACTION

AUTHOR A.PAUTSCH,G.E.SCHULZ

REVDAT 1 30-JUN-00 1QJP 0

JRNL AUTH A.PAUTSCH,G.E.SCHULZ

JRNL TITL HIGH RESOLUTION STRUCTURE OF THE OMPA MEMBRANE

JRNL TITL 2 DOMAIN

JRNL REF J.MOL.BIOL. V. 298 273 2000

JRNL REFN ASTM JMOBAK UK ISSN 0022-2836

slide8

Annotated PDB entry

HEADER

REMARK 3

REMARK 3 REFINEMENT.

REMARK 3 PROGRAM : REFMAC

REMARK 3 AUTHORS : MURSHUDOV,VAGIN,DODSON

REMARK 3

REMARK 3 DATA USED IN REFINEMENT.

REMARK 3 RESOLUTION RANGE HIGH (ANGSTROMS) : 1.65

REMARK 3 RESOLUTION RANGE LOW (ANGSTROMS) : 12

REMARK 3 DATA CUTOFF (SIGMA(F)) : 0.0

REMARK 3 COMPLETENESS FOR RANGE (%) : 95.4

REMARK 3 NUMBER OF REFLECTIONS : 29702

REMARK 3

REMARK 3 FIT TO DATA USED IN REFINEMENT.

REMARK 3 CROSS-VALIDATION METHOD : THROUGHOUT

REMARK 3 FREE R VALUE TEST SET SELECTION : RANDOM

REMARK 3 R VALUE (WORKING + TEST SET) : NULL

REMARK 3 R VALUE (WORKING SET) : 0.155

REMARK 3 FREE R VALUE : 0.198

REMARK 3 FREE R VALUE TEST SET SIZE (%) : 5.0

REMARK 3 FREE R VALUE TEST SET COUNT : NULL

slide9

Annotated PDB entry

HEADER

REMARK 200 – Information about the experiment

REMARK 200 EXPERIMENT TYPE : X-RAY DIFFRACTION

REMARK 200 DATE OF DATA COLLECTION : 15-APR-1998

REMARK 200 TEMPERATURE (KELVIN) : 100

REMARK 200 PH : 5.0

REMARK 200 NUMBER OF CRYSTALS USED : 1

REMARK 300/350 – Assembly information

REMARK 300 BIOMOLECULE: 1

REMARK 300 THIS ENTRY CONTAINS THE CRYSTALLOGRAPHIC ASYMMETRIC UNIT

REMARK 300 WHICH CONSISTS OF 1 CHAIN(S). SEE REMARK 350 FOR

REMARK 300 INFORMATION ON GENERATING THE BIOLOGICAL MOLECULE(S).

REMARK 300

REMARK 300 BIOLOGICAL_UNIT: MONOMER

REMARK 350

REMARK 350 GENERATING THE BIOMOLECULE

REMARK 350 COORDINATES FOR A COMPLETE MULTIMER REPRESENTING THE KNOWN

REMARK 350 BIOLOGICALLY SIGNIFICANT OLIGOMERIZATION STATE OF THE

REMARK 350 MOLECULE CAN BE GENERATED BY APPLYING BIOMT TRANSFORMATIONS

REMARK 350 GIVEN BELOW. BOTH NON-CRYSTALLOGRAPHIC AND

REMARK 350 CRYSTALLOGRAPHIC OPERATIONS ARE GIVEN.

REMARK 350

REMARK 350 BIOMOLECULE: 1

REMARK 350 APPLY THE FOLLOWING TO CHAINS: A

REMARK 350 BIOMT1 1 1.000000 0.000000 0.000000 0.00000

REMARK 350 BIOMT2 1 0.000000 1.000000 0.000000 0.00000

REMARK 350 BIOMT3 1 0.000000 0.000000 1.000000 0.00000
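The REMARK 350 text above says that the biological molecule is generated by applying the BIOMT transformations to the listed chains. A minimal sketch of that step in Python follows. It assumes each operator is a 3x3 rotation plus a translation column, as in the three BIOMT rows shown, and that the three rows of an operator appear in order. The helper names `parse_biomt` and `apply_operator` are hypothetical.

```python
REMARK_350 = [
    "REMARK 350 BIOMT1 1 1.000000 0.000000 0.000000 0.00000",
    "REMARK 350 BIOMT2 1 0.000000 1.000000 0.000000 0.00000",
    "REMARK 350 BIOMT3 1 0.000000 0.000000 1.000000 0.00000",
]


def parse_biomt(lines):
    """Collect BIOMT rows as {operator number: [[r1, r2, r3, t], ...]}."""
    operators = {}
    for line in lines:
        tokens = line.split()
        is_biomt = (len(tokens) == 8
                    and tokens[:2] == ["REMARK", "350"]
                    and tokens[2].startswith("BIOMT"))
        if is_biomt:
            number = int(tokens[3])
            row = [float(value) for value in tokens[4:8]]
            operators.setdefault(number, []).append(row)
    return operators


def apply_operator(rows, xyz):
    """Return rotation * xyz + translation for one BIOMT operator."""
    x, y, z = xyz
    return tuple(r[0] * x + r[1] * y + r[2] * z + r[3] for r in rows)


operators = parse_biomt(REMARK_350)
# Coordinates of atom 1 (N of ALA A 1) from the ATOM records of this entry.
moved = apply_operator(operators[1], (40.685, 9.235, 39.250))
```

For 1QJP the only operator is the identity, so the coordinates come back unchanged, which is consistent with the entry describing the biological unit as a monomer.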

slide10

Annotated PDB entry

HEADER

DBREF 1QJP A 1 171 SWS P02934 OMPA_ECOLI 22 192

SEQADV 1QJP PHE A 23 SWS P02934 LEU 44 ENGINEERED MUTATION

SEQADV 1QJP LYS A 34 SWS P02934 GLN 55 ENGINEERED MUTATION

SEQADV 1QJP TYR A 107 SWS P02934 LYS 128 ENGINEERED MUTATION

SEQRES 1 A 171 ALA PRO LYS ASP ASN THR TRP TYR THR GLY ALA LYS LEU

SEQRES 2 A 171 GLY TRP SER GLN TYR HIS ASP THR GLY LEU ILE ASN ASN

SEQRES 3 A 171 ASN GLY PRO THR HIS GLU ASN LYS LEU GLY ALA GLY ALA

SEQRES 4 A 171 PHE GLY GLY TYR GLN VAL ASN PRO TYR VAL GLY PHE GLU

SEQRES 5 A 171 MET GLY TYR ASP TRP LEU GLY ARG MET PRO TYR LYS GLY

SEQRES 6 A 171 SER VAL GLU ASN GLY ALA TYR LYS ALA GLN GLY VAL GLN

SEQRES 7 A 171 LEU THR ALA LYS LEU GLY TYR PRO ILE THR ASP ASP LEU

SEQRES 8 A 171 ASP ILE TYR THR ARG LEU GLY GLY MET VAL TRP ARG ALA

SEQRES 9 A 171 ASP THR TYR SER ASN VAL TYR GLY LYS ASN HIS ASP THR

SEQRES 10 A 171 GLY VAL SER PRO VAL PHE ALA GLY GLY VAL GLU TYR ALA

SEQRES 11 A 171 ILE THR PRO GLU ILE ALA THR ARG LEU GLU TYR GLN TRP

SEQRES 12 A 171 THR ASN ASN ILE GLY ASP ALA HIS THR ILE GLY THR ARG

SEQRES 13 A 171 PRO ASP ASN GLY MET LEU SER LEU GLY VAL SER TYR ARG

SEQRES 14 A 171 PHE GLY

slide11

Annotated PDB entry

HEADER

HET C8E 181 21

HET C8E 182 21

HET C8E 183 21

HET C8E 184 21

HET C8E 185 21

HET C8E 186 21

HETNAM C8E (HYDROXYETHYLOXY)TRI(ETHYLOXY)OCTANE

HETSYN C8E N-OCTYL TETRAOXYETHYLENE

FORMUL 2 C8E 6(C16 H34 O5)

FORMUL 3 HOH *66(H2 O1)

SHEET 1 A 9 LYS A 34 GLN A 44 0

SHEET 2 A 9 THR A 6 SER A 16 -1 N SER A 16 O LYS A 34

SHEET 3 A 9 MET A 161 ARG A 169 -1 N TYR A 168 O THR A 9

SHEET 4 A 9 ILE A 135 THR A 144 -1 N GLN A 142 O MET A 161

SHEET 5 A 9 LYS A 113 THR A 132 -1 N THR A 132 O ILE A 135

SHEET 6 A 9 LEU A 91 TYR A 107 -1 N THR A 106 O ASN A 114

SHEET 7 A 9 TYR A 72 PRO A 86 -1 N TYR A 85 O ILE A 93

SHEET 8 A 9 VAL A 49 ARG A 60 -1 N LEU A 58 O ALA A 74

SHEET 9 A 9 LYS A 34 ASN A 46 -1 N ASN A 46 O VAL A 49

slide12

Annotated PDB entry

ATOMS

ATOM 1 N ALA A 1 40.685 9.235 39.250 1.00 41.73 N

ANISOU 1 N ALA A 1 5189 5244 5422 1517 -755 251 N

ATOM 2 CA ALA A 1 40.437 10.616 39.731 1.00 34.21 C

ANISOU 2 CA ALA A 1 3852 5690 3456 978 203 -95 C

ATOM 3 C ALA A 1 41.700 11.455 39.655 1.00 33.90 C

ANISOU 3 C ALA A 1 2833 5874 4172 1554 210 -15 C

ATOM 4 O ALA A 1 42.811 10.921 39.582 1.00 34.80 O

ANISOU 4 O ALA A 1 3211 5948 4064 1990 418 301 O

ATOM 5 CB ALA A 1 40.003 10.537 41.194 1.00 35.34 C

ANISOU 5 CB ALA A 1 4077 5523 3827 28 458 466 C

ATOM 6 N PRO A 2 41.565 12.766 39.609 1.00 32.57 N

ANISOU 6 N PRO A 2 2775 5735 3866 1145 240 252 N

ATOM 7 CA PRO A 2 42.672 13.694 39.684 1.00 32.36 C

ANISOU 7 CA PRO A 2 2759 6152 3384 1006 460 57 C

ATOM 8 C PRO A 2 43.515 13.362 40.915 1.00 35.79 C

ANISOU 8 C PRO A 2 3145 6418 4037 443 30 714 C

Components of PDB file
  • HEADER RECORDS
  • COORDINATE RECORDS
Header Records

The PDB format does not enforce the many relationships between its records

e.g. the CHAIN names in the COMPND record do not always match those in the SEQRES, SITE, TURN, or even ATOM records
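Because the format itself does not enforce these relationships, any consistency check has to be written separately. The sketch below, in Python, compares the chain identifiers found in COMPND, SEQRES and ATOM records. The sample lines are deliberately inconsistent and invented for illustration. The chain columns used for SEQRES and ATOM are those of the published PDB format description, and the function names are hypothetical.

```python
import re


def chain_ids_by_record(lines):
    """Gather the chain identifiers seen in each record type."""
    found = {"COMPND": set(), "SEQRES": set(), "ATOM": set()}
    for line in lines:
        record = line[:6].strip()
        if record == "COMPND":
            # Chains are listed as free text, e.g. "CHAIN: A, B;"
            match = re.search(r"CHAIN:\s*([^;]+)", line[6:])
            if match:
                names = match.group(1).split(",")
                found["COMPND"].update(name.strip() for name in names)
        elif record == "SEQRES":
            found["SEQRES"].add(line[11])   # column 12
        elif record == "ATOM":
            found["ATOM"].add(line[21])     # column 22
    return found


def missing_chains(lines):
    """Map each chain ID to the record types that do not mention it."""
    found = chain_ids_by_record(lines)
    every_id = set().union(*found.values())
    report = {}
    for chain in sorted(every_id):
        absent = sorted(rec for rec, ids in found.items() if chain not in ids)
        if absent:
            report[chain] = absent
    return report


# Invented, deliberately inconsistent sample: the header says chain A,
# but the coordinates are labelled chain B.
sample = [
    "COMPND   3 CHAIN: A;",
    "SEQRES   1 A  171  ALA PRO LYS",
    "ATOM      1  N   ALA B   1",
]
report = missing_chains(sample)
```

An empty report means the three record types agree. Here chain A is missing from ATOM, and chain B is missing from COMPND and SEQRES.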

WHAT IS XML?
  • XML is a markup language for documents containing structured information.
  • Structured information contains both content (COMPND,SOURCE,ATOM) and some indication of what role that content plays (for example, SOURCE record gives information about source of the molecule, expression system etc.)
  • A markup language is a mechanism to identify structures in a document. The XML specification defines a standard way to add markup to documents.
slide23

Target Data Formats

XML:

<residue name="TYR" SEQID="108">
  <atom name="HB1">
  </atom>
</residue>

NMR-STAR (Self-defining Text Archive and Retrieval Format)

SQL database
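As a small illustration of why markup helps, the XML fragment above can be read by a generic parser with no format-specific column rules. This sketch uses `xml.etree.ElementTree` from the Python standard library. The element and attribute names are those on the slide, and the `summary` structure is only an example of what a program might extract.

```python
import xml.etree.ElementTree as ET

FRAGMENT = """
<residue name="TYR" SEQID="108">
  <atom name="HB1">
  </atom>
</residue>
"""

# The parser finds the structure from the tags themselves.
residue = ET.fromstring(FRAGMENT.strip())

summary = {
    "residue": residue.get("name"),
    "seq_id": int(residue.get("SEQID")),
    "atoms": [atom.get("name") for atom in residue.findall("atom")],
}
```

Adding a new attribute or child element to the fragment would not break this reader, which is the flexibility a fixed-column format lacks.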

Information related to structure

The emergence of new techniques for determining or validating structures means that the HEADER records need to adapt to integrate the new information.

e.g.

FREE R VALUE : 0.198

It was not possible to refine structure with anisotropic B factors.

We may also want to include cross-references to new databases (e.g. GO, the Gene Ontology)


The present archive (PDB style file format) is also difficult to search against for complex queries!!

slide26

How do we create a storage mechanism which is flexible enough to adapt to new changes and at the same time can be widely used for complex searches or data mining?


Data Base Tasks
  • GET DATA
  • ORGANISE AND STORE DATA
  • GIVE IT BACK
What does an RDBMS provide?
  • efficient database management
  • non-procedural
  • maintain data in an organised form
  • reading and writing data to the computer
  • fast data access mechanisms
  • reduce or eliminate need for redundant data
  • ensure integrity and consistency of data
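Two of the points above, "non-procedural" access and data integrity, can be sketched with a toy schema. The example uses SQLite from the Python standard library as a stand-in for Oracle. The two tables are heavily simplified versions of DEPOSITION and DEP_ENTITY, the key column `DEP_ID` is an assumption, and the single deposition row reuses the 1QJP header values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only on request
conn.executescript("""
create table DEPOSITION (
    DEP_ID         integer primary key,   -- assumed key column
    ACCESSION_CODE text unique,
    TITLE          text not null
);
create table DEP_ENTITY (
    DE_ID       integer primary key,
    DEP_DEP_ID  integer not null references DEPOSITION (DEP_ID),
    NAME        text not null,
    MUTANT_FLAG text
);
""")

conn.execute(
    "insert into DEPOSITION values (1, '1QJP', "
    "'HIGH RESOLUTION STRUCTURE OF THE OUTER MEMBRANE PROTEIN A (OMPA) "
    "TRANSMEMBRANE DOMAIN')")
conn.execute(
    "insert into DEP_ENTITY values (1, 1, 'OUTER MEMBRANE PROTEIN A', 'Y')")

# Non-procedural: the query states what is wanted, not how to find it.
rows = conn.execute("""
    select d.ACCESSION_CODE, e.NAME
    from DEPOSITION d
    join DEP_ENTITY e on e.DEP_DEP_ID = d.DEP_ID
    where e.MUTANT_FLAG = 'Y'
""").fetchall()

# Integrity: an entity pointing at a non-existent deposition is refused.
# (The 'ORPHAN ENTITY' row is made up purely to trigger the check.)
try:
    conn.execute("insert into DEP_ENTITY values (2, 99, 'ORPHAN ENTITY', null)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

The same question asked of flat files would need a program that opens and parses every entry. Here the database engine chooses the access path and refuses inconsistent rows.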
Example Tables and Relationships

DEPOSITION

* CREATION_DATE
* LAST_UPDATE
* TITLE
o ACCESSION_CODE
o DATE_ALL_ARRIVED_DATE
o DEP_PASSWORD
o DETAILS_NDB
o HARVEST_PROJECT_NAME
o HOLD_DATE
o HTTP_REFERER
o HTTP_USER_AGENT
o IMMUNE_RELATED
o RELEASE_DATE
o VIRUS_FLAG

REF_CONTACT_NAME

* EMAIL
* NAME_FAMILY
o BUILDING
o DEPARTMENT_1
o DEPARTMENT_2
o FAX
o NAME_GIVEN
o NAME_INITIALS
o NAME_SUFFIX
o PHONE
o TITLE
o URL

Example of SQL

create table DEP_ENTITY (
    DE_ID            number(10,0)    not null,
    DEP_DEP_ID       number(10,0)    not null,
    ID               varchar2(255)   not null,
    NAME             varchar2(255)   not null,
    DE_TYPE          varchar2(10)    not null,
    SYNTHETIC        char(1)         default 'N' not null,
    DETAILS          varchar2(2000)  null,
    SYSTEM           varchar2(255)   null,
    SYSTEMATIC_NAME  varchar2(80)    null,
    COMMON_NAME      varchar2(80)    null,
    FORMULA_WGHT     number(8,0)     null,
    ENGINEERED       varchar2(255)   null,
    MUTANT_FLAG      varchar2(255)   null,
    FRAGMENT_FLAG    varchar2(255)   null,
    MUTATION_STRING  varchar2(40)    null
);

slide32

Normalised database: ~410 tables, 2000 attributes
  • deposition data: ~260 tables, 1300 attributes
  • reference data: ~130 tables, 500 attributes
  • derived data: ~20 tables, 200 attributes

Deposition data is used to calculate derived data; reference data provides standard values for deposition data.

slide35

Funding and Partners of the EBI-Macromolecular Structure Database Group (EBI-MSD)

CCPN, Cambridge (UK)
Erlangen (Germany)
Aarhus (Denmark)
CCP4, Daresbury (UK)
Brussels (Belgium)
CCP4, Daresbury (UK)
Nijmegen (NL)
Frankfurt (Germany)
Uppsala (Sweden)
CNB, Madrid (Spain)
Oxford (UK)
York (UK)
CNRS (Paris)
CNB, Madrid (Spain)
AUTO-STRUCT: EU FPV (0)
IIMS: EU FPV (1)
TEMBLOR: EU FPV (12)
EMBL, Hamburg (Germany)
Nijmegen (NL)
MRC, Cambridge (UK)
NMRQUAL: EU FPV (1)
Cambridge (UK)
EBI-MSD Group
EMBL (4)
WT (5)
BBSRC (1)
CCP4 (0.5)
Utrecht (NL)
EMBL, Heidelberg (Germany)
RCSB, USA
Glaxo Wellcome
BMRB, USA
CCP4, Daresbury (UK)
UCL, London (UK)
Zeneca Ltd
Azara Ltd
Bruker Analytica

Key: Circles represent coordinating partners. Numbers in brackets are the number of staff supported (e.g. AUTO-STRUCT supports no staff at EBI).
RCSB: Research Collaboratory for Structural Bioinformatics
BMRB: BioMagResBank
EMBL: European Molecular Biology Laboratory
WT: Wellcome Trust
BBSRC: UK Biotechnology and Biological Sciences Research Council
CCP4: UK Collaborative Computational Project 4
EU FPV: European Union, Framework Five