World Data Center Climate: Terabyte Data Storage in a Relational Database System
Michael Lautenschlager, Hannes Thiemann and Frank Toussaint
ICSU World Data Center Climate, Model and Data / Max-Planck-Institute for Meteorology, Hamburg, Germany
WS Spatiotemporal Databases for Geosciences, Biomedical Sciences and Physical Sciences, Edinburgh, November 1st + 2nd, 2005
WDCC Home: www.wdcc-climate.de / WDCC Contact: data@dkrz.de
Content:
• Introduction of WDCC
• CERA2 Data Model
• Data Access
• Connection to Mass Storage Archive
• Summary
WDCC Content (October 2005): 580 Experiments / 68,000 Data Sets
Data from Earth System Modelling and Related Observations: IPCC, WOCE, GEBCO, BALTEX, HOAPS, CEOP, COSMOS, CARIBIC, EH5/MPI-OM, IPCC-AR4, ERA15/40, NCEP, simulations @ MPI, GKSS, …
Start: Approved in January 2003
Maintenance: Model and Data (M&D/MPI-M) and German Climate Computing Centre (DKRZ)
WDCC Size 4.6 Billion BLOBs
WDCC DB Storage: how we get the grid data
• Files from climate model: all levels, all parameters, arbitrary time intervals
• Postprocessing step 1 (homogenizing time and calculation of diagnostics): all levels, all parameters, 1 moment (6 by 6 hours)
• Postprocessing step 2 (isolation of levels & parameters, creation of BLOB table input): 1 level, 1 parameter, 1 moment (= 1 BLOB = 1 global field)
Storage of global coverages per file or BLOB.
CERA1) Concept: Semantic Data Management
(I) Data catalogue and Unix files (pointer or BLOB-table entry)
• Enable search and identification of data
• Allow for data access as they are (coarse granularity)
(II) Application-oriented data storage
• Time series of individual variables are stored as BLOB entries in DB tables (fine granularity)
• Allow for fast and selective data access
• Storage in standard data formats (GRIB, NetCDF)
• Allow for application of standard data processing routines (PINGOs, CDOs)
1) Climate and Environmental data Retrieval and Archiving
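The fine-granularity idea above can be sketched in a few lines: per-timestep model files (all parameters, all levels) are regrouped into time series of single variables, one global field per BLOB row, so one variable can be read over time without touching the rest. The following Python is a minimal illustration; all names and data are invented, not the actual M&D postprocessing tools.

```python
# Hedged sketch of the CERA fine-granularity regrouping: turn
# per-timestep files into per-(parameter, level) time series.
def regroup(timestep_files):
    """timestep_files: dict time -> {(parameter, level): field_bytes}.
    Returns dict (parameter, level) -> time-ordered list of (time, field)."""
    series = {}
    for t in sorted(timestep_files):               # keep time order
        for key, field in timestep_files[t].items():
            series.setdefault(key, []).append((t, field))
    return series

# Two timesteps, two variables -> two time series of length two each
timestep_files = {
    "2005-11-01T00": {("t2m", 0): b"f1", ("slp", 0): b"f2"},
    "2005-11-01T06": {("t2m", 0): b"f3", ("slp", 0): b"f4"},
}
series = regroup(timestep_files)
```

Each value list then maps naturally onto one BLOB table, which is why a single variable can be fetched selectively.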
WDCC Data Topology
• Level 1 interface: metadata entries (XML, ASCII) + data files; the experiment description and the dataset descriptions point to Unix files (table / pointer).
• Level 2 interface: separate BLOB data tables containing the data in application-adapted structure (time series of single variables).
A BLOB DB table corresponds to a scalable, virtual file at the operating-system level.
CERA Data Model: blocks Entry, Contact, Coverage, Reference, Status, Parameter, Spatial Reference, Distribution, Local Adm., Data Org, Data Access
CERA Modules
3 modules:
• DATA_ACCESS for automated data access (remote data access)
• DATA_ORG for organization of grid data (geo-references of grid points in BLOBs)
• CODE for matching of (internal) model code numbers
Data Model Functions
The CERA2 data model …
• allows for data search according to discipline, keyword, variable, project, author, geographical region and time interval, and for data retrieval.
• allows for specification of data processing (aggregation and selection) without touching the primary data.
• is flexible with respect to local adaptations, to storage of different types of geo-referenced data, and to definition of data topologies (hierarchical, network, …).
• is open for cooperation and interchange with other database systems (e.g. FGDC metadata standard and ISO 19115 included).
But: it is not the simplest data model for each single application.
Interactive Catalogue Access
Catalogue access via WWW (web browser → Internet → application server, Servlet / JSP):
• request: URL, parsed by JSP
• integrated DB retrieval by JSP
• response in standard dynamic HTML pages
• efficient administration of detailed meta information
HTTP and JDBC Data Download
Data download via WWW (web browser → Internet → application server, Servlet / JSP):
• request via HTML form, handled by JSP
• return of binary file via HTTP file download, written to client disk
Data download via script/batch program "jblob":
• standard client-side JDBC retrieval
• return of binary file, written to client disk
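The client-side retrieval pattern behind "jblob" actually runs over JDBC in Java; as a hedged, language-neutral sketch, the Python below uses the stdlib `sqlite3` module as a stand-in database to show the same pattern: select one BLOB by dataset and time step, then write the binary object to the client disk. Table and column names are invented for illustration.

```python
# Stand-in sketch of a jblob-style BLOB download (sqlite3 replaces
# the real Oracle/JDBC stack; schema is hypothetical).
import os
import sqlite3
import tempfile

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE blob_data (dataset TEXT, step INTEGER, field BLOB)")
con.execute("INSERT INTO blob_data VALUES (?, ?, ?)",
            ("t2m_series", 0, b"GRIB..."))

def download(con, dataset, step, path):
    """Fetch one BLOB row and write the binary field to disk."""
    row = con.execute(
        "SELECT field FROM blob_data WHERE dataset=? AND step=?",
        (dataset, step)).fetchone()
    with open(path, "wb") as f:
        f.write(row[0])

path = os.path.join(tempfile.mkdtemp(), "t2m_series.grb")
download(con, "t2m_series", 0, path)
```

Because each row holds exactly one global field, the client downloads only the time steps it asks for rather than a whole experiment file.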
XML Interface for HTTP Metadata Output
Metadata access via WWW (user applications → Internet → application server):
• request: URL with xsql query to the DB
• raw XML output from the DB
• xsl mapping to any metadata format (XHTML, ISO XML, DC XML, …)
For the various metadata formats see wini.wdc-climate.de
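The mapping step above is done with XSL stylesheets on the server; as a minimal stand-in sketch, the Python below uses the stdlib `xml.etree.ElementTree` to map a raw metadata record to a Dublin-Core-like form. All element names are invented for illustration, not the actual CERA schema.

```python
# Hedged sketch: map raw DB metadata XML to a DC-like format
# (ElementTree stands in for the server-side XSL transformation).
import xml.etree.ElementTree as ET

raw = ET.fromstring(
    "<entry><title>EH5 run</title><author>MPI-M</author></entry>")

def to_dublin_core(entry):
    """Rename fields into a Dublin-Core-style element tree."""
    dc = ET.Element("dc")
    ET.SubElement(dc, "dc_title").text = entry.findtext("title")
    ET.SubElement(dc, "dc_creator").text = entry.findtext("author")
    return dc

dc = to_dublin_core(raw)
```

In the real pipeline the same raw XML can be mapped to XHTML, ISO XML or DC XML just by choosing a different stylesheet.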
HTTP Data Output
Data access via WWW (user applications → Internet → application server, Java Servlet):
• request: URL parsed by servlet
• query: DB access by JDBC
• response in any of various data formats (plain ASCII, HTML tables, binary objects)
Oracle DBMS + HSM (disks and tapes)
DXDB: UniTree client on the DB machines for communication between the Oracle DB and the tape archive
Use of DXDB
DXDB is used for:
• ordinary Oracle data files
• redo logs
• backup
Migout / Migin tablespaces (figure): table partitions live in read-write (RW) and read-only (RO) tablespaces; all tablespaces are moved "at once" to dxdb.
Migout / Migin
• Migout takes place after files haven't been modified for x minutes; only one migout process per dxdb file system.
• Migin takes place immediately after a file is requested; only the parts accessed are retrieved from the backend storage; one migin process per requested file.
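The migout rule can be sketched as a simple selection over file modification times. This Python is purely illustrative (not the actual DXDB/UniTree code), and the delay value of 30 minutes is an assumption standing in for the unspecified "x minutes".

```python
# Hedged sketch of the migout selection rule: migrate to tape only
# files that have been unmodified for at least MIGOUT_DELAY_MIN minutes.
import time

MIGOUT_DELAY_MIN = 30  # "x minutes"; the real value is not given

def select_for_migout(files, now):
    """files: dict name -> last-modified epoch seconds.
    Return names idle for at least the migout delay, in stable order."""
    cutoff = now - MIGOUT_DELAY_MIN * 60
    return [name for name, mtime in sorted(files.items()) if mtime <= cutoff]

now = time.time()
files = {"redo01.log": now - 10,            # still being written
         "cera_blob_2004.dbf": now - 3600}  # idle for an hour
to_tape = select_for_migout(files, now)
```

Running one such scan per dxdb file system matches the single-migout-process constraint, while migin stays demand-driven with one process per requested file.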
Purging: files in the dxdb disk cache are purged between a high watermark (HWM) and a low watermark (LWM).
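The watermark mechanism can be sketched as follows: when cache usage exceeds the high watermark, already-migrated files are purged from disk (least recently used first) until usage drops below the low watermark. The thresholds and the LRU order in this Python sketch are illustrative assumptions, not DXDB's actual settings.

```python
# Hedged sketch of HWM/LWM purging in an HSM disk cache.
HWM = 0.90          # high watermark: purging starts above this (assumed)
LWM = 0.70          # low watermark: purging stops below this (assumed)
CACHE_BYTES = 1000  # toy cache size

def purge(cached, used):
    """cached: list of (name, size, last_access) for migrated files.
    Purge LRU-first until usage falls to the low watermark."""
    purged = []
    for name, size, _ in sorted(cached, key=lambda f: f[2]):  # LRU first
        if used / CACHE_BYTES <= LWM:
            break
        purged.append(name)
        used -= size
    return purged, used

cached = [("a.dbf", 300, 100), ("b.dbf", 300, 200), ("c.dbf", 350, 300)]
used = 950  # 95% of the cache, above the high watermark
if used / CACHE_BYTES > HWM:
    purged, used = purge(cached, used)
```

Purged files remain safe on tape, so a later access simply triggers a migin.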
Pro
• It works
• It's fast
• Applications don't have to wait until files are completely restored from tapes.
Contra (if the backend works)
• DXDB is not supported by Oracle.
• Oracle's officially supported backend requirements do not necessarily match the requirements of other applications such as HSM systems (i.e. the connection to UniTree is not standardised).
Summary
• Efficient handling of detailed metadata
• easy and structured administration of > 60 metadata tables
• access support: Java Server Pages (JSP), Servlets, JDBC, xsql, including standard DB features (SQL, views, triggers, …)
• Efficient handling of fine-granularity data
• random access to arbitrary time steps of single parameters
• access support: Java Server Pages (JSP), Servlets, JDBC, including standard DB features (authorisation, …)
• transparent migration of bulk data to tape
The Winter TopTen Program identifies the world's largest and most heavily used databases. Email received on September 13th:
"… Congratulations on achieving Grand Prize award winner status (1) in Database Size, Other, All, and TopTen Winner status in Database Size, Other, Linux and Workload, Other, Linux in Winter Corp.'s 2005 TopTen Program! …"
(1) Grand prizes are awarded for first-place winners in the All Environments categories only.
WDCC's CERA DB has been identified as the largest Linux DB.
http://www.wintercorp.com/VLDB/2005_TopTen_Survey/2005TopTenWinners.pdf