
Where Should the GALION Data Reside? Centrally or Distributed? Introduction to the Discussion


Presentation Transcript


  1. Where Should the GALION Data Reside? Centrally or Distributed? Introduction to the Discussion
  Fiebig, M.; Fahre Vik, A., Norwegian Institute for Air Research

  2. User Demands for GALION Data Management
  • Data should be easy to find and accessible via one common location.
  • Data should be searchable by location, time window, parameter, … (a sketch of the implied search interface follows below).
  • Plotting and browsing tool for online comparison.
  • Data should be downloadable in a homogeneous format, with the option for users to select between a few commonly used formats.
  • Data should be of homogeneous, high quality, including detailed documentation of processing steps for assessing comparability.
  • Different applications require different proximity to the raw measurement.
  • Data should include a measure of uncertainty and variability.
  • Data should be available in near-real-time (crisis management, forecast, …) -> one location, one format!
  • Option for aggregating datasets into climatologies.
  • …
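
These demands essentially describe a single queryable access point. A minimal sketch of what such a query could look like in Python; the portal URL, endpoint, and parameter names are hypothetical assumptions, not an existing GALION API:

```python
# Sketch of a search against a (hypothetical) common GALION portal.
import requests

GALION_PORTAL = "https://galion.example.org/api/search"  # hypothetical endpoint

def find_datasets(parameter, bbox, start, end):
    """Query by parameter, bounding box (lon_min, lat_min, lon_max, lat_max),
    and time window; returns access-metadata records."""
    response = requests.get(GALION_PORTAL, params={
        "parameter": parameter,            # e.g. "aerosol_backscatter"
        "bbox": ",".join(map(str, bbox)),
        "start": start,                    # ISO 8601 timestamps
        "end": end,
    })
    response.raise_for_status()
    return response.json()

# Example: backscatter profiles over Europe for one week
# records = find_datasets("aerosol_backscatter", (-10, 35, 30, 70),
#                         "2010-04-14T00:00Z", "2010-04-21T00:00Z")
```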

  3. Current Strategy for Data Management in GALION
  • At least one common point of access for the common data pool.
  • Responsibility for QA and long-term availability remains with the contributing institutions / networks.
  • Features of the common access portal (a sketch of an access-metadata record follows below):
    • Holds access metadata from all contributing stations, i.e. dates, times, and types of measurements.
    • Allows search with criteria such as network, date, location, …
    • Browsing / quicklook of data.
    • Link to download from the original location.
    • Tools for format conversion.
    • Control of access rights.
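
A minimal sketch of the access-metadata record such a portal would hold and filter; the field names and search criteria shown are illustrative assumptions, not a defined GALION schema:

```python
# Sketch of an access-metadata record and a portal-style search over it.
from dataclasses import dataclass

@dataclass
class AccessMetadata:
    station_id: str     # e.g. a GAW station ID
    network: str        # contributing network, e.g. "EARLINET"
    parameter: str      # measured quantity
    start: str          # temporal availability (ISO 8601)
    end: str
    latitude: float
    longitude: float
    download_url: str   # link back to the original repository

def search(records, network=None, parameter=None):
    """Filter access-metadata records by the portal's search criteria."""
    return [r for r in records
            if (network is None or r.network == network)
            and (parameter is None or r.parameter == parameter)]
```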

  4. Solution 1: GAWSIS as Data Discovery Portal

  5. GAWSIS Features
  • Data directory encompassing all GAW data centres; holds access metadata.
  • Search data availability by country, network, station name, station ID, station type, and parameter.
  • Map visualisation of availability.
  • Station page with station metadata and a list of available datasets.
  • Link to the original repository; direct link to the dataset if available.
  • Functionality similar to a Global Information System Centre (GISC) in the WMO Information System (WIS) concept.
  • GAWSIS plans include WIS compliance (once that is defined) and a plotting tool.

  6. Solution 2: EARLINET-ASOS Database and Portal

  7. EARLINET-ASOS Database Features
  • Search all EARLINET-ASOS data by date, time of day, season, station, event category, parameter.
  • Select and download data (NetCDF format); a reading sketch follows below.
  • Plotting, browsing, and comparing functions.
  • The EARLINET-ASOS database will be part of the ACTRIS distributed database, which is planned to be WIS compliant (once we know what that means).
  • ACTRIS: EU FP7 project that will network European ground-based in situ & lidar aerosol observations, cloud property observations, and reactive trace gas observations.
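
A minimal sketch of reading a downloaded EARLINET file with the Python netCDF4 library; the file name and variable names are assumptions for illustration, since the EARLINET variable vocabulary is not given here:

```python
# Sketch: inspect and read a downloaded NetCDF profile file.
from netCDF4 import Dataset

with Dataset("earlinet_profile.nc") as nc:   # hypothetical file name
    print(nc.variables.keys())               # inspect what the file contains
    # Variable names below are assumed for illustration only:
    altitude = nc.variables["altitude"][:]
    backscatter = nc.variables["backscatter"][:]
    print(altitude.shape, backscatter.shape)
```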

  8. Solution 3: GEOmon Distributed Database
  • Data discovery portal holding access metadata.
  • Data may be searched by parameter, station, home database, type (in situ, remote sensing, simulation), platform, matrix, geolocation, altitude, and temporal availability.
  • Portal links to the individual dataset where possible, to the database homepage otherwise.
  • Will be developed into the entry portal of the ACTRIS distributed database.

  9. Distributed Data Architecture Pros & Cons
  • Pros:
    • Institutions / networks keep control over data access, data quality, and long-term availability, and maintain visibility.
    • Know-how on measurement principles and on data management is combined for tailored solutions.
  • Cons:
    • All institutions / networks have to maintain server infrastructure (file archive, metadata server, web service, WIS compliance, …).
    • Well-defined formats are essential for smooth interoperability. Implementing on-the-fly conversion of dozens of formats would be a resource drain and a predefined vulnerability (see the sketch after this list).
    • Near-real-time dissemination with uniform QA is almost impossible to implement.
    • Long-term availability is not ensured.
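
To make the conversion con concrete: with n formats, up to n*(n-1) source/target converters must be written, tested, and kept secure. A minimal sketch of what that implies; all format names and converter functions below are hypothetical placeholders:

```python
# Sketch: every supported (source, target) pair needs its own converter.

def nasa_ames_to_netcdf(data): ...   # each entry is real engineering work
def netcdf_to_nasa_ames(data): ...
def csv_to_netcdf(data): ...
# ... one function per (source, target) pair, growing with every new format

CONVERTERS = {
    ("nasa-ames", "netcdf"): nasa_ames_to_netcdf,
    ("netcdf", "nasa-ames"): netcdf_to_nasa_ames,
    ("csv", "netcdf"): csv_to_netcdf,
}

def convert(data, source, target):
    """A single agreed format would remove this table entirely."""
    try:
        return CONVERTERS[(source, target)](data)
    except KeyError:
        raise ValueError(f"no converter from {source} to {target}")
```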

  10. Centralised Data Architecture Pros & Cons
  • Pros:
    • Server infrastructure needs to be maintained only once or a few times (economy of scale).
    • Long-term availability is ensured.
    • Easy to ensure homogeneous data formatting and quality; frequent reformatting is not necessary.
    • Almost the only option for implementing an NRT service with homogeneous, automated QA.
  • Cons:
    • Somewhat less visibility for the individual institution / network.
    • Institution(s) hosting the data centre(s) need to ensure access management.
    • Institution(s) hosting the data centre also need experimental expertise.

  11. Well-Defined Common Data Formats are Essential for any Data Architecture
  • A data format is more than just a choice of NASA-Ames, NetCDF, …
  • It needs to include an implementation profile for the format standard and a defined vocabulary: which parameters / metadata are included, in what units, and how they are named; which processing steps were conducted; all self-explaining; flags to indicate special conditions.
  • Example: the EUSAAR data formats (all NASA-Ames 1001):
    • Level 0: annotated, instrument-specific raw data, "native" time resolution.
    • Level 1: processed to the final physical variable, original time resolution.
    • Level 1.5: automatically aggregated to (hourly) averages, includes uncertainty for the averaging period.
    • Level 2: same as level 1.5, but manually quality assured.
  • Well-defined common processing steps between levels establish traceability (a level 1 -> level 1.5 sketch follows below).
  • Well-defined formats don't limit the usability of the data, but make routine work more efficient.
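
A minimal sketch of the level 1 -> level 1.5 step described above: automatic aggregation into hourly averages with an uncertainty for the averaging period. Using the sample standard deviation as that uncertainty is an assumption for illustration; the slide does not define the measure:

```python
# Sketch: aggregate level 1 samples into hourly level 1.5 records.
import numpy as np

def to_level_1_5(times, values):
    """times: np.datetime64 array, values: float array.
    Returns (hour, hourly mean, uncertainty over the hour) tuples."""
    hours = times.astype("datetime64[h]")   # truncate timestamps to the hour
    out = []
    for hour in np.unique(hours):
        v = values[hours == hour]
        out.append((hour, v.mean(), v.std(ddof=1) if v.size > 1 else 0.0))
    return out

# Example: three 20-minute samples collapse into one hourly record
times = np.array(["2010-04-14T10:00", "2010-04-14T10:20",
                  "2010-04-14T10:40"], dtype="datetime64[m]")
values = np.array([1.0, 1.2, 0.9])
print(to_level_1_5(times, values))
```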

  12. Efficient Use of Project Resources: GAW aerosol NRT data flow
  • Station:
    • collects raw data in a custom format.
    • auto-creates hourly data files (level 0) and initiates the auto-upload (FTP transfer) to the NRT server; alternatively, a sub-network data centre creates and uploads the hourly level 0 files on the station's behalf.
  • Data centre:
    • checks for the correct data format (level 0).
    • checks whether the data stay within specified boundaries (sanity check); see the sketch after this list.
    • returns automatic feedback to the station.
    • processes the hourly files to level 1 and then to level 1.5, and stores them in the EBAS database.
  • User access: restricted access via the web interface ebas.nilu.no, and machine-to-machine access via a web service.
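
A minimal sketch of the data centre's two automatic checks on an incoming hourly level 0 file, the format check and the boundary sanity check; the record layout, parameter name, and boundary values are invented for illustration:

```python
# Sketch: automatic format and sanity checks producing station feedback.

BOUNDS = {"particle_number": (0.0, 1.0e6)}  # assumed plausible range, cm^-3

def check_level0(records):
    """Return feedback messages; this stands in for the 'automatic
    feedback' arrow in the slide's flow diagram."""
    feedback = []
    for i, rec in enumerate(records):
        if "parameter" not in rec or "value" not in rec:
            feedback.append(f"record {i}: format error, missing field")
            continue
        lo, hi = BOUNDS.get(rec["parameter"], (float("-inf"), float("inf")))
        if not lo <= rec["value"] <= hi:
            feedback.append(f"record {i}: value {rec['value']} outside "
                            f"[{lo}, {hi}] (sanity check failed)")
    return feedback or ["OK"]

print(check_level0([{"parameter": "particle_number", "value": 2500.0}]))
```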

  13. How Do You Access the Data?

  14. NRT-Example: Auto-Processed DMPS data
