Semantic Data Management for Organising Terabyte Data Archives

  1. Semantic Data Management for Organising Terabyte Data Archives
  Michael Lautenschlager
  World Data Center for Climate (M&D/MPIMET, Hamburg)
  HAMSTER-WS, Langen, 11.07.03

  2. Content:
  • DKRZ archive development
  • CERA 1) concept and structure
  • Automatic fill process
  • CERA access statistics
  1) Climate and Environmental data Retrieval and Archiving

  3. DKRZ Archive Development

  4. Problems in file archive access:
  • Missing data catalogue
    – The directory structure of the Unix file system is not sufficient to organise millions of files.
  • Data are not stored in an application-oriented way
    – Raw data contain time series of 4D data blocks, while the access pattern is time series of 2D fields.
  • Lack of experience with climate model data
    – Problems in extracting the relevant information from climate model raw data files.
  • Lack of computing facilities at the client site
    – Non-modelling scientists are not equipped to handle large amounts of data (1/2 TB = 10 years of T106 or 50 years of T42 output at 6-hour storage intervals).
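The access-pattern mismatch is the core problem here. A minimal sketch, assuming one raw file per 6-hour snapshot and purely illustrative grid sizes (this is not WDCC code), of why per-snapshot raw files are the wrong unit for time-series access:

```python
# Illustrative only: the raw layout forces a full-archive scan for one time series.
import numpy as np

NVAR, NLEV, NLAT, NLON = 100, 20, 160, 320   # hypothetical T106-like dimensions

def read_snapshot(path):
    """Stand-in for decoding one raw 4D block (var x level x lat x lon)."""
    return np.zeros((NVAR, NLEV, NLAT, NLON), dtype=np.float32)

def timeseries_from_raw(paths, var, lev):
    # Raw layout: fetching one 2D field per time step means opening and
    # decoding EVERY snapshot file, even though only a tiny slice is needed.
    return np.stack([read_snapshot(p)[var, lev] for p in paths])

def timeseries_from_cera(blob_store, var, lev):
    # Application-oriented layout: the whole time series of one variable
    # and level is a single stored object, read in one access.
    return blob_store[(var, lev)]
```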

  5. 5D Model Data Structure
  (Diagram: the model raw data structure, one 4D block per 6-hour storage interval, contrasted with the application-oriented data storage.)

  6. CERA Concept: Semantic Data Management
  (I) Data catalogue and pointers to Unix files
  • Enables search and identification of data.
  • Allows access to the data as they are.
  (II) Application-oriented data storage
  • Time series of individual variables are stored as BLOB entries in DB tables, allowing fast and selective data access.
  • Storage in a standard file format (GRIB) allows the application of standard data-processing routines (PINGOs).
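A minimal sketch of this two-part concept, using sqlite3 as a stand-in for the production RDBMS; all table and column names are illustrative, not the actual CERA schema:

```python
import sqlite3

db = sqlite3.connect("cera_sketch.db")
db.executescript("""
-- (I) Data catalogue: searchable metadata plus pointers to the Unix files,
--     so the data can also be accessed 'as they are'.
CREATE TABLE catalogue (
    dataset_id   INTEGER PRIMARY KEY,
    experiment   TEXT,
    variable     TEXT,
    unix_path    TEXT          -- pointer to the raw data file
);
-- (II) Application-oriented storage: the time series of one variable is
--      kept as BLOB entries (GRIB records) directly in a DB table.
CREATE TABLE blob_data (
    dataset_id   INTEGER REFERENCES catalogue(dataset_id),
    time_step    TEXT,
    grib_record  BLOB          -- standard format keeps PINGOs applicable
);
""")

# Selective access: one variable's time series without touching raw files.
rows = db.execute(
    "SELECT grib_record FROM blob_data WHERE dataset_id = ? ORDER BY time_step",
    (1,),
).fetchall()
```

Keeping the BLOBs in GRIB rather than a database-private encoding is what preserves the existing processing tool chain.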

  7. Parts of the CERA DB
  Internet access via a web-based user interface: catalogue inspection and climate data retrieval.
  • Current database size: 15.3778 TB (DB size 09.07.03: 19.6 TB; 7.1 TB in 12.2001)
  • Number of experiments: 299
  • Number of datasets: 19,818
  • Number of BLOBs within CERA at 20-MAY-03: 1,104,798,147
  • Typical BLOB sizes: 17 kB and 100 kB
  • Number of data retrievals: 1,500 – 8,000 / month
  CERA database system: data catalogue, processed climate data, pointers to raw data files.
  DKRZ mass storage archive: 210 TB, neglecting security copies (12.2001).

  8. Data Structure in the CERA DB
  (Diagram: an experiment description holds pointers to Unix files and to the descriptions of datasets 1…n; each dataset description points to its own BLOB data table.)
  • Level 1 interface: metadata entries (XML, ASCII)
  • Level 2 interface: separate files containing the BLOB table data
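As a rough illustration of a level-1 entry, the following sketch builds a metadata record for the experiment/dataset hierarchy; the element and attribute names are hypothetical, not the actual CERA metadata model:

```python
import xml.etree.ElementTree as ET

# Hypothetical level-1 metadata entry: experiment -> datasets -> BLOB tables.
exp = ET.Element("experiment", name="ECHAM_T106_run01")
exp.append(ET.Element("pointer", unix_file="/arch/echam/t106/run01/raw.tar"))
ds = ET.SubElement(exp, "dataset", name="2m_temperature")
ds.set("blob_table", "BLOB_T2M_RUN01")   # level-2 data arrive as separate files

print(ET.tostring(exp, encoding="unicode"))
```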

  9. Primary Data Processing
  (Diagram: climate model raw data are transformed into the application-oriented data storage, i.e. interface level 2.)
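A minimal sketch of this processing step, assuming the raw files decode into per-time-step records (this is a stand-in, not the M&D processing chain): the records are regrouped into per-variable time series, which form the BLOB table input of interface level 2.

```python
from collections import defaultdict

def primary_processing(raw_records):
    """raw_records: iterable of (time, variable, level, grib_bytes) tuples
    as they appear in the model raw files, ordered by time."""
    series = defaultdict(list)              # (variable, level) -> time series
    for time, var, lev, grib in raw_records:
        series[(var, lev)].append((time, grib))
    return series                           # one BLOB sequence per variable

# Each entry of `series` becomes one stream of BLOB inserts for the fill cache.
```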

  10. Automatic Fill Process (AFP)
  Creation of the application-oriented data storage must be automatic!

  11. Archive Data Flow
  (Diagram: the compute server writes Unix files to the common file system, from which they migrate to the mass storage archive at 50 TB per month, 75 TB per month from 2006 on; in parallel, after metadata initialisation, the CERA DB system fills the application-oriented data hierarchy at 4 TB per month in 2003, 10 TB in 2004, 17 TB in 2005 and 25 TB from 2006 on.)
  Important: the automatic fill process has to be performed before the corresponding files migrate to the mass storage archive.

  12. Automatic Fill Process: Steps and Relations
  DB server:
  1. Initialisation of the CERA DB: the metadata and BLOB data tables are created.
  2. BLOB data table input is accessed from the DB fill cache; BLOB table injection and update of the metadata.
  3. Step 2 is repeated until the table partition is filled (BLOB table fill cache).
  4. Close the partition, write the corresponding DB files to the HSM archive, open a new partition and continue with step 2.
  5. Close the entire table and update the metadata after the end of the model experiment.
  Compute server:
  1. The climate model calculation starts with the first month.
  2. The next model month starts, together with primary data processing of the previous month; the BLOB table input is produced and stored in the dynamic DB fill cache.
  3. Step 2 is repeated until the end of the model experiment.
  A sketch of the DB-server loop follows.
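A minimal sketch of the DB-server side of this loop, under the assumptions above; `db`, `hsm` and their methods are hypothetical stand-ins for the real database and archive operations, and the fill cache is modelled as a blocking queue that yields None at the end of the experiment:

```python
PARTITION_LIMIT = 0.5e12   # bytes; the 0.5 TB partitions of slide 15

def afp_db_server(fill_cache, db, hsm):
    """fill_cache: e.g. a queue.Queue fed by the compute server's
    primary data processing, one item per processed model month."""
    partition = db.open_partition()            # step 1: tables initialised
    while True:
        item = fill_cache.get()                # step 2: next month's input
        if item is None:
            break                              # end of the model experiment
        db.insert_blobs(partition, item)       # BLOB table injection ...
        db.update_metadata(item)               # ... and metadata update
        if partition.size >= PARTITION_LIMIT:  # step 3 -> 4: partition full
            db.close_partition(partition)
            hsm.write(partition.files)         # DB files go to the HSM archive
            partition = db.open_partition()
    db.close_partition(partition)              # step 5: finalise
    hsm.write(partition.files)
    db.finalise_table_metadata()
```

The queue is exactly the dynamic DB fill cache of slide 14: it decouples the compute server's production rate from the DB server's injection rate.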

  13. Automatic Fill Process Development
  (Diagram: AFP development with a 1 TB dynamic DB fill cache and a BLOB table fill cache; the compute server has 192 CPUs and 1.5 TB of memory; GFS holds ca. 60 TB at a fill rate of 4 TB per month.)
  • The Oracle RDBMS is currently not connected with GFS.
  • Future: movement of the RDBMS to a Linux system and connection with GFS; the AFP is planned to move to the DB server.

  14. AFP Cache Sizes
  Dynamic DB fill cache:
  • To guarantee stable operation, the fill cache should buffer the data of 10 days of production.
  • The cache size is therefore determined by the automatic data fill rate; a worked example follows.
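The 10-day rule can be checked against the planned fill rates of slide 11; a small worked calculation (taking a month as 30 days):

```python
# Fill rates in TB per month, from slide 11.
fill_rates = {"2003": 4, "2004": 10, "2005": 17, "2006+": 25}

for year, tb_per_month in fill_rates.items():
    cache_tb = tb_per_month * 10 / 30   # buffer 10 days of production
    print(f"{year}: {cache_tb:.1f} TB fill cache")
# 2003: 1.3 TB -- of the same order as the ~1 TB cache of slide 13
# 2004: 3.3 TB, 2005: 5.7 TB, 2006+: 8.3 TB
```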

  15. AFP Cache Sizes
  BLOB table partition cache:
  • The number of parallel climate model calculations and the size of the BLOB table partitions determine the cache size.
  • BLOB table sections of 10 years for the high-resolution model and 50 years for the medium-resolution model result in 0.5 TB partitions.
  • Present-day experience with parallel model runs leads to a BLOB table partition cache of the same size as the dynamic DB fill cache.
  • This will change if the number of parallel runs and/or the table partitions change.
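A hedged plausibility check of the 0.5 TB partition size, assuming one BLOB per 6-hour time step and using the typical BLOB sizes from slide 7; the pairing of BLOB size with model resolution is my assumption, not stated in the slides:

```python
steps_per_year = 365.25 * 4                     # 6-hour storage interval
for years, blob_kb in ((10, 100), (50, 17)):    # assumed: high / medium res.
    series_gb = years * steps_per_year * blob_kb / 1e6
    n_series = 0.5e3 / series_gb                # series per 0.5 TB partition
    print(f"{years} years @ {blob_kb} kB: {series_gb:.1f} GB per series, "
          f"~{n_series:.0f} series per partition")
# 10 years @ 100 kB: ~1.5 GB per series, ~340 series per partition
# 50 years @ 17 kB:  ~1.2 GB per series, ~400 series per partition
```

Under these assumptions both section lengths give series of roughly equal size, which is consistent with a single common partition size.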

  16. CERA Access Statistics

  17. Countries Using the CERA DB
