
3D databases and data warehouse technology


minty

Presentation Transcript


  1. 3D databases and data warehouse technology

  2. Overview
      • Overall strategy
      • Terms and background
      • Populating the databases
      • Clean-up processes
      • How can I use the database?
      • What next

  3. What is a database?
      • By the term ‘database’ we refer to the system rather than the data: indexed file space
      • Also used as a shorthand for a database management system (DBMS): methods for accessing and changing data; controls for referential integrity

  4. Normalisation
      • Data fields in a normalised database appear only once
          RESIDUE (CHAIN ID, SEQ, COMP ID): (A, 1, ASP), (A, 2, LYS), ...
          COMPONENT (ID, attr): (ASP, -1), (LYS, +1), ...
          CHAIN (ID, attr): (A, 185), ...
      • Data fields in a denormalised database are repeated in different places
          RESIDUE (CHAIN ID, SEQ, COMP ID, CHAINattr, COMPattr): (A, 1, ASP, 185, -1), (A, 2, LYS, 185, +1), ...
          COMPONENT (ID, attr): (ASP, -1), (LYS, +1), ...
          CHAIN (ID, attr): (A, 185), ...
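The difference between the two layouts can be sketched with a small in-memory example. This is only an illustration using Python's built-in `sqlite3` module (the slides mention Oracle and MySQL, not SQLite). The lower-case table and column names, and the integer type chosen for `attr`, are assumptions based on the slide's RESIDUE, COMPONENT and CHAIN tables.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- Normalised: each attribute is stored in exactly one place.
    CREATE TABLE chain     (id TEXT PRIMARY KEY, attr INTEGER);
    CREATE TABLE component (id TEXT PRIMARY KEY, attr INTEGER);
    CREATE TABLE residue (
        chain_id TEXT REFERENCES chain(id),
        seq      INTEGER,
        comp_id  TEXT REFERENCES component(id),
        PRIMARY KEY (chain_id, seq)
    );
    INSERT INTO chain     VALUES ('A', 185);
    INSERT INTO component VALUES ('ASP', -1);
    INSERT INTO component VALUES ('LYS', 1);
    INSERT INTO residue   VALUES ('A', 1, 'ASP');
    INSERT INTO residue   VALUES ('A', 2, 'LYS');

    -- Denormalised: chain and component attributes are copied onto each residue row.
    CREATE TABLE residue_denorm AS
        SELECT r.chain_id, r.seq, r.comp_id,
               c.attr AS chain_attr, k.attr AS comp_attr
        FROM residue r
        JOIN chain c     ON c.id = r.chain_id
        JOIN component k ON k.id = r.comp_id;
""")

denorm_rows = db.execute("SELECT * FROM residue_denorm ORDER BY seq").fetchall()

# Changing one chain attribute touches one row in the normalised table ...
normalised_updates = db.execute(
    "UPDATE chain SET attr = 186 WHERE id = 'A'").rowcount
# ... but one row per residue of that chain in the denormalised copy.
denormalised_updates = db.execute(
    "UPDATE residue_denorm SET chain_attr = 186 WHERE chain_id = 'A'").rowcount
```

The trade-off this illustrates: the denormalised table can be read without joins, but every repeated field has to be kept in step when the underlying value changes.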

  5. Structural hierarchy: assembly, molecule (entity), chain, residue

  6. ASU and assemblies (diagram labels: assembly, ASU, chain, chain, residues, residues)

  7. The pipeline (diagram labels: archive, PDB, services, pdb, data warehouse, edited PDB, archive DB, pdb, cif, post-load processes, manual edit, distribution)

  8. The first steps (pipeline diagram as on slide 7)

  9. The first steps
      • A series of scripts: parses non-standard header records; fills in chain identifiers; outputs a first-cut clean file
      • Manual editing: ~1000 entries require manual editing
      • The result is a PDB format file that can be passed to the subsequent automatic steps

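  10. Bizarre errors, e.g. 1ew1. The slide shows the same records twice; the two copies differ only in the sign of the last coordinate of atom 60 (-2.985 vs. 2.985).
        ...
        ATOM 47 N6 A A 2 2.068 5.433 -2.482
        ...
        ATOM 59 1H6 A A 2 1.160 5.722 -2.818
        ATOM 60 2H6 A A 2 2.901 5.700 -2.985
        ...
        ...
        ATOM 47 N6 A A 2 2.068 5.433 -2.482
        ...
        ATOM 59 1H6 A A 2 1.160 5.722 -2.818
        ATOM 60 2H6 A A 2 2.901 5.700 2.985
        ...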
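A simple geometric check is enough to tell the two versions of these 1ew1 records apart. The sketch below is an assumption about how such an error might be flagged, not the method used in the pipeline: it measures the distance from N6 to each of its hydrogens and reports any that exceed a hypothetical 1.2 Å limit. The parser splits on whitespace because that is how the records appear on the slide; real PDB files use fixed columns.

```python
import math

def parse_atom(record):
    """Parse a whitespace-separated ATOM record as printed on the slide."""
    fields = record.split()
    # fields: ATOM, serial, atom name, residue, chain, residue number, x, y, z
    name = fields[2]
    xyz = tuple(float(value) for value in fields[6:9])
    return name, xyz

def suspect_hydrogens(heavy_record, hydrogen_records, max_bond=1.2):
    """Return names of hydrogens further than max_bond (in Å) from the heavy atom.

    max_bond is an illustrative threshold, not a value from the slides.
    """
    _, heavy_xyz = parse_atom(heavy_record)
    flagged = []
    for record in hydrogen_records:
        name, xyz = parse_atom(record)
        if math.dist(heavy_xyz, xyz) > max_bond:
            flagged.append(name)
    return flagged

n6 = "ATOM 47 N6 A A 2 2.068 5.433 -2.482"
version_a = ["ATOM 59 1H6 A A 2 1.160 5.722 -2.818",
             "ATOM 60 2H6 A A 2 2.901 5.700 -2.985"]
version_b = ["ATOM 59 1H6 A A 2 1.160 5.722 -2.818",
             "ATOM 60 2H6 A A 2 2.901 5.700 2.985"]

flagged_a = suspect_hydrogens(n6, version_a)
flagged_b = suspect_hydrogens(n6, version_b)
```

With the coordinates above, both hydrogens in the first version sit about 1.0 Å from N6, while 2H6 in the second version is about 5.5 Å away, so only the second version is flagged.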

  11. Automatic processing (pipeline diagram as on slide 7)

  12. Process details
      • Automatic cleanup (d2c): incorporates quaternary structure information; runs a lot of checks and corrections; outputs mmCIF file
      • Loading: metadata-driven custom loader; load through views with insert triggers; many heuristics also applied to data within these triggers
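Loading through a view with an insert trigger can be sketched as follows. This is a minimal illustration in SQLite via Python's `sqlite3` module, whereas the slides describe an Oracle database. The table, view and trigger names are hypothetical, and the clean-up applied inside the trigger (trimming, case folding, dropping a trailing full stop) is an invented stand-in for the heuristics mentioned on the slide.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE entry_source (entry_id TEXT PRIMARY KEY, organism TEXT);

    -- The loader inserts into this view, never into the table directly.
    CREATE VIEW entry_source_load AS
        SELECT entry_id, organism FROM entry_source;

    -- The INSTEAD OF trigger intercepts each insert and cleans the values
    -- before they reach the underlying table.
    CREATE TRIGGER entry_source_load_insert
    INSTEAD OF INSERT ON entry_source_load
    BEGIN
        INSERT INTO entry_source (entry_id, organism)
        VALUES (
            LOWER(TRIM(NEW.entry_id)),
            UPPER(RTRIM(TRIM(NEW.organism), '.'))
        );
    END;
""")

db.execute(
    "INSERT INTO entry_source_load (entry_id, organism) VALUES (?, ?)",
    (" 1EW1 ", "Escherichia coli. "),
)
loaded_row = db.execute("SELECT entry_id, organism FROM entry_source").fetchone()
```

Keeping the corrections in the trigger means the loader itself can stay generic: it only needs to know which view to insert into.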

  13. Using reference data
      Example variants in legacy data: $COLI; COLI; E. COLI; E.COLI; ESCHERCHIA COLI; ESCHERICHI $COLI; ESCHERICHIA $ COLI; ESCHERICHIA $COLI; ESCHERICHIA COLI; ESCHERICHIA COLI.; EXCHERICHIA COLI; EXPRESCHERICHIA COLI
      • Variations in legacy data
      • Hinders accurate searches
      • Hinders links to other services
      • Match data against controlled vocabularies
      • Within scripts
      • Within database during load
      • Semi-automated
      • Use string matching algorithms
      • Effective when controlled vocabulary well maintained
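Matching legacy strings against a controlled vocabulary can be sketched with Python's standard `difflib` module. This is only one possible string matching approach; the slides do not say which algorithm is used. The three-term vocabulary, the `match_term` helper and the 0.85 cutoff are assumptions made for the example.

```python
import difflib

# Hypothetical controlled vocabulary; a real one would be far larger.
VOCABULARY = ["ESCHERICHIA COLI", "HOMO SAPIENS", "SACCHAROMYCES CEREVISIAE"]

def match_term(raw, vocabulary=VOCABULARY, cutoff=0.85):
    """Return the closest vocabulary term, or None if nothing is close enough.

    A None result corresponds to the semi-automated case: the value is
    left for a curator instead of being guessed.
    """
    # Light normalisation: upper case, collapse whitespace, drop a trailing full stop.
    cleaned = " ".join(raw.upper().split()).rstrip(". ")
    hits = difflib.get_close_matches(cleaned, vocabulary, n=1, cutoff=cutoff)
    return hits[0] if hits else None

misspelt = match_term("ESCHERCHIA COLI")
typo = match_term("EXCHERICHIA COLI")
trailing_dot = match_term("ESCHERICHIA COLI.")
abbreviated = match_term("E. COLI")
```

Misspellings close to the canonical name are resolved automatically, while an abbreviation such as `E. COLI` falls below the cutoff and returns `None`. That case would need an explicit synonym list or manual review, which is consistent with the slide's point that matching works well only when the vocabulary is well maintained.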

  14. Chemical Components
      • We maintain a curated database of compounds, against which legacy data is compared
      • Atom nomenclature: ongoing; relatively easy to correct where the compound has been correctly identified
      • Stereochemistry: may indicate that the compound name is incorrect
      • More difficult to deal with: where coordinates and nomenclature do not agree, we have to make a judgement on which, if either, is correct
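One way a stereochemistry check of this kind could work is to compute a signed chiral volume from the coordinates and compare its sign with what the compound name implies. The sketch below is an assumption, not the pipeline's method. It uses the common convention that the triple product over N, C and CB around CA is positive for an L-amino acid, and the coordinates are idealised, made-up values, not taken from any entry.

```python
def subtract(p, q):
    """Component-wise difference of two 3D points."""
    return tuple(pi - qi for pi, qi in zip(p, q))

def chiral_volume(center, a, b, c):
    """Signed triple product (a - center) . ((b - center) x (c - center))."""
    u, v, w = subtract(a, center), subtract(b, center), subtract(c, center)
    cross = (
        v[1] * w[2] - v[2] * w[1],
        v[2] * w[0] - v[0] * w[2],
        v[0] * w[1] - v[1] * w[0],
    )
    return sum(ui * ci for ui, ci in zip(u, cross))

def amino_acid_hand(ca, n, c, cb):
    """Return 'L' for a positive chiral volume around CA, otherwise 'D'."""
    return "L" if chiral_volume(ca, n, c, cb) > 0 else "D"

# Idealised geometry (hypothetical): CA at the origin, its hydrogen along +z,
# and N, C, CB spread around the axis just below the xy plane.
ca = (0.0, 0.0, 0.0)
n = (-0.866, -0.5, -0.35)
c = (0.0, 1.0, -0.35)
cb = (0.866, -0.5, -0.35)

def mirror(point):
    """Reflect a point through the xy plane, which inverts the handedness."""
    return (point[0], point[1], -point[2])

hand = amino_acid_hand(ca, n, c, cb)
mirrored_hand = amino_acid_hand(ca, mirror(n), mirror(c), mirror(cb))
```

If a residue is named as a D-amino acid but its coordinates give the L sign (or the reverse), either the name or the coordinates are wrong, and a curator still has to decide which.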

  15. Ligand nomenclature
      • Ligands are often named inconsistently or even entirely incorrectly, e.g. α-D-mannose (MAN) vs. β-D-mannose (BMA)
      • Errors are detected using a graph-based structure comparison algorithm
      (Figure: structures of MAN and BMA)
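The idea of a graph-based comparison can be sketched as a small labelled-graph isomorphism test: atoms are nodes labelled by element, bonds are edges, and two structures match if some one-to-one atom mapping preserves every bond. This is a generic backtracking sketch, not the algorithm referred to on the slide. It compares connectivity only, so it cannot separate anomers such as MAN and BMA, which share the same bonds; that would need stereo descriptors on the nodes. The toy molecules (heavy atoms of ethanol and dimethyl ether) and all atom names are invented for the example.

```python
def find_mapping(elements_a, bonds_a, elements_b, bonds_b):
    """Return a dict mapping atoms of A onto atoms of B, or None if none exists.

    elements_*: dict of atom name -> element symbol
    bonds_*:    list of (atom name, atom name) pairs
    """
    if sorted(elements_a.values()) != sorted(elements_b.values()):
        return None  # different composition, no need to search
    edges_a = {frozenset(bond) for bond in bonds_a}
    edges_b = {frozenset(bond) for bond in bonds_b}
    order = list(elements_a)

    def extend(mapping):
        if len(mapping) == len(order):
            return dict(mapping)
        atom = order[len(mapping)]
        for candidate in elements_b:
            if candidate in mapping.values():
                continue
            if elements_b[candidate] != elements_a[atom]:
                continue
            # Bonds to already-mapped atoms must agree in both structures.
            consistent = all(
                (frozenset((atom, prev)) in edges_a)
                == (frozenset((candidate, mapping[prev])) in edges_b)
                for prev in mapping
            )
            if consistent:
                mapping[atom] = candidate
                result = extend(mapping)
                if result is not None:
                    return result
                del mapping[atom]  # backtrack
        return None

    return extend({})

# Reference compound: heavy atoms of ethanol (C-C-O).
reference = ({"C1": "C", "C2": "C", "O1": "O"}, [("C1", "C2"), ("C2", "O1")])
# Same compound with different atom names, as might occur in legacy data.
renamed = ({"CA": "C", "CB": "C", "OX": "O"}, [("CB", "CA"), ("OX", "CB")])
# Same formula but different connectivity: dimethyl ether (C-O-C).
ether = ({"C1": "C", "C2": "C", "O1": "O"}, [("C1", "O1"), ("O1", "C2")])

same_compound = find_mapping(*reference, *renamed)
wrong_compound = find_mapping(*reference, *ether)
```

A successful mapping also tells a curator which legacy atom name corresponds to which reference atom name, which is the information needed to correct the nomenclature.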

  16. Not all cases resolvable: 1d7t, DTY 4 in chain A, model 1. Is it D or L?
        HEADER DE NOVO PROTEIN 19-OCT-99 1D7T
        TITLE NMR STRUCTURE OF AN ENGINEERED CONTRYPHAN CYCLIC PEPTIDE
        TITLE 2 (MOTIF CPXXPXC)
        ...
        MODRES 1D7T DTY A 4 TYR D-TYROSINE
        ...
        HET DTY A 4 21
        ...
        HETNAM DTY D-TYROSINE
        ...
        FORMUL 1 DTY C9 H11 N1 O3

  17. Post-load processing (pipeline diagram as on slide 7)

  18. Process details
      • Involved in deriving data and building crosslinks to other services
      • Geometric information
      • Analysing non-polymer components and assembling full entities from individual components
      • Links to taxonomy and sequence databases

  19. Transformation to DW (pipeline diagram as on slide 7)

  20. Process details
      • Set of SQL scripts: supports Oracle (routinely) and MySQL (development)
      • Periodically undertake full transform: takes a couple of weeks
      • Provide weekly incremental patches: much faster
      • Supports transforms into different data marts
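The difference between a full transform and an incremental patch can be sketched as below. This is an illustrative assumption about the general pattern, not the actual SQL scripts. The tables `archive_entry` and `mart_entry`, the `modified` date column, the entry identifiers and the trivial upper-casing transform are all hypothetical, and SQLite stands in for Oracle or MySQL.

```python
import sqlite3

# One shared transform, so that full and incremental runs cannot drift apart.
TRANSFORM = "SELECT entry_id, UPPER(title) FROM archive_entry"

def full_transform(db):
    """Rebuild the data mart table from scratch."""
    db.execute("DELETE FROM mart_entry")
    db.execute("INSERT INTO mart_entry " + TRANSFORM)

def incremental_patch(db, since):
    """Re-transform only the entries modified after the given ISO date."""
    db.execute(
        "INSERT OR REPLACE INTO mart_entry " + TRANSFORM + " WHERE modified > ?",
        (since,),
    )

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE archive_entry (entry_id TEXT PRIMARY KEY, title TEXT, modified TEXT);
    CREATE TABLE mart_entry    (entry_id TEXT PRIMARY KEY, title TEXT);
    INSERT INTO archive_entry VALUES ('aaaa', 'first title',  '2006-01-01');
    INSERT INTO archive_entry VALUES ('bbbb', 'second title', '2006-01-01');
""")
full_transform(db)

# A week later: one entry is revised and one new entry arrives.
db.execute("UPDATE archive_entry SET title = 'revised title', "
           "modified = '2006-01-08' WHERE entry_id = 'bbbb'")
db.execute("INSERT INTO archive_entry VALUES ('cccc', 'new entry', '2006-01-08')")
incremental_patch(db, since="2006-01-01")

patched = db.execute("SELECT * FROM mart_entry ORDER BY entry_id").fetchall()
```

The patch only touches rows that changed, which is why it is much faster than a full rebuild. This sketch does not handle entries deleted from the archive; a real patch would need a way to propagate removals.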

  21. Coming soon …
      • Continuing cleanup: HET group curation; sequence cross-references; citations
      • More choice on downloads: data marts (even single tables); groups of entries
      • Release of clean PDB files (end 2006)

  22. Who did what (pipeline diagram as on slide 7)
