Enhancing Data Storage Flexibility: The MonAliSA Project's Innovative Formats

Flexible data storage for minimal user effort
A tale of two formats
P Coe : MonAliSA project Oxford Physics

Persuasion
• General problem to be addressed
• First use case: ATLAS FSI c. 1998
• ATLAS FSI data format
Widening the scope:
• Recent work: MonAliSA format 2006

Great Expectations
Want to use disk files to record...
• Data from experiments/simulations
• Prepared data ready for analysis
• Analysis results
Want to see the data (not unrelated)
• Plotting graphs is essential for analysis, publication, etc.

Laboratory data sources...
Take instrument readings...
...and deliberately induced signals
...measuring ambient conditions...

Laboratory data is usually...
• A collection of one-dimensional arrays
  • Counts from an ADC: 16 bit integer
  • Photon counts: 16 or 32 bit integer, etc.
• Some arrays only have a single element
  • Ambient relative humidity: 1 float
• Treating all data as if it were in arrays may be a presumptuous idea... but fruitful

Data increases with compound interest...
• Preparation of acquired data
  • filtering, averaging, smoothing, cutting, etc.
• Analysis of prepared data
  • fitting, calculating FFT, time derivatives, etc.
• Data collation
  • measurement-to-measurement trends, etc.
Want to emphasise that it is all treated as one-dimensional arrays of data: e.g. an FFT spectrum in two 1-d arrays of doubles

Let us not forget "annotation" (META?) data
• Experimental set-up (which instruments, how and where connected, etc.)
• Other "one-off" parameters, e.g. timestamp
• Version info for DAQ/analysis algorithms
• Seed parameters
• Other fit/preparation control parameters
• ...
Software can deliver much more if annotations are included

An ideal data file format
• Holds data and meta data / related information
• Should be simple to write code to:
  • find/read stored data of interest from the file
  • write any stored data to the file so it can be identified
  • append new data to the file, without disruption
• Handle (store/retrieve) data:
  • flexibly (of any format, in any order)
  • reliably (data should come back intact)
  • robustly (absent data should not break the format)
  • with language / platform independence

A database as a solution?
• A database in place of a file meets most requirements of a file format
• I have no database experience and did not want to be coupled to (tied down by) database related issues
• For example... is it easy to access the same data using different languages?
A pseudo-random quote from the web...
“17.1. Do You Really Need a Relational Database? It is common for web developers to jump to the conclusion that they need an SQL-compliant RDBMS like Oracle, when in fact they have a rather small data set that could be organized as one table. Commercial RDBMSs are expensive as well as nontrivial to install and administer.”

ASCII text file: Less than ideal (1)
• Meta data
  • some may be in column labels
  • remaining meta data often poured out into the filename!
• How easy is it to...
  • ...find/read stored data of interest from the file?
    • Code for reading single lines/columns is reasonable
    • The price is a rigid file structure
  • ...write any stored data to the file so it can be identified?
    • Code for writing ASCII columns is very simple
    • Data identity is based on assumptions about column order
  • ...append new data to the file, without disruption?
    • Maybe possible in Perl but...
    • ...in most languages it is easier to create a new file

ASCII text file: Less than ideal (2)
• What about handling data?
  • flexibly (of any format, in any order)
    • Data format is ASCII columns (or similar enough format) only!
    • Ordering flexibility is lost (unless only humans read)
  • reliably (data should come back intact)
    • Rounding / formatting issues place effort on the user
  • robustly (absent data should not break the format)
    • Empty columns do break the format
  • with language / platform independence
    • Most languages allow ASCII I/O
    • Most applications (Excel, Origin, etc.) read ASCII
    • Minor cross platform issues, very rarely fatal
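The rounding point can be illustrated with a minimal sketch. It assumes a typical six-significant-digit column format for the ASCII case and a big-endian IEEE 64 bit double for the binary case; neither choice comes from the slides, they are only plausible examples.

```python
import struct

value = 0.1 + 0.2  # a double that needs many significant digits

# ASCII column written with a typical short format: digits are lost.
ascii_field = "%.6g" % value
from_ascii = float(ascii_field)

# Binary storage as a big-endian IEEE 64 bit double: bit-exact round trip.
binary_field = struct.pack(">d", value)
from_binary = struct.unpack(">d", binary_field)[0]

ascii_intact = (from_ascii == value)    # False with this format
binary_intact = (from_binary == value)  # True
```

The ASCII case can be made exact by printing enough digits, but that is precisely the effort the slide says is placed on the user.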

Can a binary file format help?
The answer depends greatly on the:
• format design
  • flexibility is easily lost without careful forethought / revision
  • simplicity: the format should be “as simple as it needs to be, but no simpler” (paraphrasing Einstein, 1933)
• format implementation
  • design advantages are easily lost by the implementation

Two binary formats
• Old ATLAS format
  • relies on ID codes to identify data
  • has some naive file structures, too rigid
  • implemented in LabVIEW (and ROOT!)
• New “MonAliSA” format
  • relies on ID codes to identify data
  • simpler and more flexible
  • implemented in C, Java and LabVIEW

ATLAS demo FSI 1998-
• Experiments based on tuning a laser
• Timing of experiment has two "modes" based on rapid or slow laser tuning
• DAQ & analysis data file structures reflect this
  • alternate fast/slow periods in "blocks"
• All I/O software in the National Instruments language LabVIEW (v4.x, later v6.1 for analysis)

Binary file format (ATLAS FSI 1998-)
• Simple structure at highest level
• A single, minimal sized file header
• Followed by N data blocks
  • rarely of equal size
[Figure: layout by location inside file. A file header (version number + number of data blocks; sizes marked 1 byte and 2 bytes) is followed by data blocks 1 to 5.]

Inside a data block (roughly)
• Block prefix: points to a) start of first data label, b) start of next data block; also counts how many arrays are stored
• Header: most of the meta data
• Array label: identifies the array (details on next slide)
• 1-d array: BIG ENDIAN or IEEE

ATLAS FSI format - data array label
Label contents:
• Location at end of the data array (pointer to next label)
• Fixed length string labelling the array contents
• Instrument channel ID
• Number of array elements
• ID codes
• Element type
Position (by byte location inside file): the previous array runs up to the start of this label; the array is attached to the end of the label
Example: ID codes 3.1, element type unsigned 16 bit integer, string “Long Ref raw data” + white space padding to fixed length

ID code examples from ATLAS
• 2 numbers in code (category, subcategory)
• Categories:
  • DAQ timing parameters
  • Thermometer / humidity: "environmental"
  • Reference Interferometer System
  • ...
  • Grid line interferometers
  • (Reference) Phase
  • Sine fitting prepared data
  • ...

ID code examples from ATLAS
• Subcategories arbitrarily assigned by hand
• In 3 (Reference Interferometer System):
  • 3.1 Long Reference Interferometer raw data
  • 3.3 Etalon raw data
  • 3.129 Long reference data for laser 1
  • 3.257 Long reference data for laser 2

How does the ATLAS format work? 1) Writing data to the file
• Meta structures in place first (tedious)
• Each array with label: placed at end of file
• Any order permitted by unique array ID codes (inside a given block at least)
• Writing each data array involves:
  • Preparing meta data for label
  • Writing label, including pointer to end of the array
  • Writing the array after the label
  • Updating meta structures: end of block & no. of arrays

How does the ATLAS format work? 2) Reading data from the file
• Finding the correct data block is similar to finding an array (below)
  • block label and pointer to next block are in the block prefix
• Find array: seek array at (block, ID)
  • (In the correct block) 1st array label easily found from prefix
  • then iterate within the block...
  • Read array label: do the ID codes match the two required?
    • If YES, read label & array at end of the label
    • If NO, find next label using pointer in this label and continue iteration
• N.B. reading finds the 1st matching instance (in block) only
  • only one matching instance should have been written
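The seek within a block can be sketched as below. This is only an illustration: the label layout (`LABEL`), the helper names `append_array` and `find_array`, the 32 byte string length and the restriction to unsigned 16 bit elements are all assumptions made for the example, not the real ATLAS FSI byte layout or the LabVIEW routines.

```python
import struct

# Hypothetical label layout (NOT the real ATLAS FSI layout), big endian:
# pointer to next label, category, subcategory, element count, 32 byte name.
LABEL = struct.Struct(">IHHI32s")

def append_array(buf, category, subcategory, name, values):
    """Append a label followed by its array of unsigned 16 bit integers."""
    next_label = len(buf) + LABEL.size + 2 * len(values)
    buf += LABEL.pack(next_label, category, subcategory, len(values),
                      name.encode("ascii").ljust(32))
    buf += struct.pack(">%dH" % len(values), *values)

def find_array(buf, first_label, n_arrays, category, subcategory):
    """Return the first array whose ID codes match, or None if absent."""
    pos = first_label
    for _ in range(n_arrays):
        next_label, cat, sub, count, _name = LABEL.unpack_from(buf, pos)
        if (cat, sub) == (category, subcategory):
            start = pos + LABEL.size  # array is attached to end of label
            return list(struct.unpack_from(">%dH" % count, buf, start))
        pos = next_label  # no match: follow the pointer to the next label
    return None

# One in-memory "block" holding three labelled arrays.
block = bytearray()
append_array(block, 3, 1, "Long Ref raw data", [10, 20, 30])
append_array(block, 3, 3, "Etalon raw data", [7, 8])
append_array(block, 7, 1028, "Example data", [1, 2, 3, 4])

found = find_array(block, 0, 3, 7, 1028)
missing = find_array(block, 0, 3, 9, 9)  # absent data: None, no failure
```

Because each label carries a pointer to the next one, the reader never needs to know the storage order, and a missing ID code simply ends the iteration.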

Search pattern schematic
e.g. looking for 7.1028 data in block 3
[Figure: search path through a file of 5 blocks. The file header reports "There are 5 blocks". Block 1 reports "I am block 1, block 2 starts here"; block 2 reports "I am block 2, block 3 starts here"; block 3 reports "I am block 3, there are 26 arrays in this block, my first data label starts here". Labels in block 3 are compared with the required 7.1028 ID code until the array of interest ("I label 7.1028 data") is reached.]

ATLAS format - review (1)
Ideal: “Holds data and meta data / related information”
Meta data storage was useful but on the down side was...
• Scattered
  • some in the label
  • some in header
• Inflexible
  • No easy way to augment meta data
  • All block header sections had to be complete or left out
Also
• Block headers added large effort overhead to writing new software
  • made innovation in other areas painful / tedious
  • not all block header meta data used, some never

ATLAS format - review (2)
Ideal: “Should be simple to write code to:”
• “find/read stored data of interest from the file”
  • Very simple, small, stable I/O routines library
  • ID codes stored in one place
    • easy to maintain
    • easy to use
• “write any stored data to the file so it can be identified”
  • Same success with same I/O routines and ID codes
• “append new data to the file, without disruption”
  • Only possible to append to last block of the file

ATLAS format - review (3)
Ideal: "Handle (store/retrieve) data"
• flexibly (of any format, in any order)
  • Storage order flexible (within a block)
  • All required numerical formats supported (string as bytes)
• reliably (data should come back intact)
  • Never reported any data errors in 9 years
• robustly (absent data should not break the format)
  • Absent data does not break the format
  • Earlier caveat about meta data applies
• with language / platform independence
  • Never fully tested this point in the ATLAS format
  • Was beyond the scope of the implementation
  • Mostly used LabVIEW; format worked across LabVIEW versions
  • T. Kohno wrote a file reader for ROOT, no problems reported

MonAliSA 2006: A new format
• Want to read/write files from different operating systems
• C and Java for DAQ, analysis, simulation
  • Run C inside LabVIEW on Windows XP (DAQ)
  • Most analysis / simulation on Linux
  • Java work with LiCAS

Broadening the scope
Once you have a cross platform file format:
• Want to offer to ATLAS, LiCAS, etc...
  • Saves duplicating "reinvented wheels"
• Same I/O software for each group
  • Hence same basic format / file structure
• Using ID codes for data finding
  • ID code range expanded from 2 numbers to 5
  • Each group will want control over their own ID codes / software versions

Feeding lessons / requirements into the new format (1)...
• Kept the labelled data arrays
  • Data labels retain ID codes, strings, pointers
  • Data labels drop meta data (instrument ID)
• Removed data blocks
  • same DoF recorded with "instance" label element
  • now possible to append any data array to the end of the file
[Figure: OLD, arrays stored in blocks (block 1, block 2, ... block 5, each holding its own 1st, 2nd, 3rd arrays); NEW, standalone arrays (1st, 2nd, 3rd, ... 5th).]
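A minimal sketch of what standalone arrays with an "instance" element allow. The label layout here (`LABEL`: five ID code parts, instance, element count) is an assumption for illustration and omits the pointer, data type, checksum and string of the real label; `HUMIDITY` is a hypothetical ID code, and only 64 bit doubles are handled.

```python
import struct

# Hypothetical label: five ID code parts, instance, element count (big endian).
LABEL = struct.Struct(">5HII")

def append_array(buf, id_code, instance, values):
    """Append one standalone array of 64 bit doubles to the end of the file."""
    buf += LABEL.pack(*id_code, instance, len(values))
    buf += struct.pack(">%dd" % len(values), *values)

def read_all(buf):
    """Walk the file and return {(id_code, instance): values}."""
    out, pos = {}, 0
    while pos < len(buf):
        *id_code, instance, count = LABEL.unpack_from(buf, pos)
        pos += LABEL.size
        out[(tuple(id_code), instance)] = list(
            struct.unpack_from(">%dd" % count, buf, pos))
        pos += 8 * count
    return out

HUMIDITY = (2, 1, 0, 0, 5)  # hypothetical 5-part ID code

data_file = bytearray()
append_array(data_file, HUMIDITY, 1, [41.5])  # would have lived in block 1
append_array(data_file, HUMIDITY, 2, [42.0])  # would have lived in block 2
arrays = read_all(data_file)
```

The same quantity measured twice is distinguished by its instance number rather than by which block it sits in, so any array can simply be added at the end of the file.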

Feeding lessons / requirements into the new format (2)...
• Removed almost all header structures
  • Meta data stored in arrays like other data
• Very small remaining file header holds
  • file compatibility information
  • Group ID (e.g. 2 = MonAliSA)
  • Format version ID (file / header structures change)
  • ID codes look up table version
This needs further explanation

New format: Simpler file structure
File Header (13 bytes)
• Group ID
• Software ID
• ID codes version
• File format version
• File lock
• Number of arrays in file
Array Label
• ID codes
• Instance count
• Pointer to next label
• Data type
• Number of array elements
• Error detection checksum
• Variable length label string
One-dimensional array
• byte (can be general purpose)
• 16, 32, 64 bit integers (big endian)
• IEEE 32 bit float, 64 bit double
KEY TO TEXT COLOURS
• RED: Centralised, set by protocol definitions
• BLUE: "Group" specific, managed by a "Group"
• File / array specific: immutable or mutable
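As a rough sketch, a 13 byte header with these six fields could be packed as follows. The field widths (2 + 2 + 2 + 2 + 1 + 4 bytes) are an assumption chosen only so that they add up to 13; the slide gives the total size but not the individual widths, and the names `HEADER`, `pack_header` and `unpack_header` are hypothetical.

```python
import struct

# Assumed field widths (2+2+2+2+1+4 = 13 bytes), big endian, no padding.
HEADER = struct.Struct(">HHHHBI")
FIELDS = ("group_id", "software_id", "id_codes_version",
          "format_version", "file_lock", "n_arrays")

def pack_header(**fields):
    """Build the header bytes from named field values."""
    return HEADER.pack(*(fields[name] for name in FIELDS))

def unpack_header(raw):
    """Decode the first 13 bytes of a file into a dict of field values."""
    return dict(zip(FIELDS, HEADER.unpack(raw[:HEADER.size])))

# Group ID 2 is the MonAliSA example from the slides; other values are made up.
raw = pack_header(group_id=2, software_id=1, id_codes_version=1,
                  format_version=1, file_lock=0, n_arrays=0)
header = unpack_header(raw)
group_matches = (header["group_id"] == 2)  # the check a reader can make
```

Keeping the header this small means that everything else, including meta data, travels in ordinary labelled arrays.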

Obvious questions
• Is reading / writing data with the new format similar to the old? ANSWER: YES!
• Why does the new, simpler format appear to be so complicated?
• What is all this about "central" and "group" management?
• How much longer will this talk go on for?

All because of ID codes...
• For the 1998 - ATLAS format:
  • ID codes were created and used
    • by 1 or 2 programmers
    • in the one set of software
    • written on one machine
    • in one language
  • Keeping ID codes unique and distinct was simple enough
• For the 2007 - MonAliSA format:
  • ID codes could be created and used by
    • any users
    • for any software they require
    • written on any number of machines
    • in any language (although so far only C, LabVIEW, Java are possible)
  • ID code clashes need to be prevented

ID code management: Central
Any project / person producing software
• Issued with a copy of GIACoNDE
  • a "Group" level ID code management tool
  • written in Java (for platform independence)
  • has group ID hardwired as a constant
Any binary file reading software
• Can check for matching group ID in file header

ID code management: Group
Uses GIACoNDE tool
• Present state: Beta version
• Creates ID codes with 5 parts
• Publishes ID code template
  • ID codes represented by named constants
    • in C header files
    • in Java interface files
  • together with writing
    • group ID
    • software ID
    • ID code template version
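To make the idea of published named constants concrete, here is a hypothetical sketch of a generator that rejects clashing ID codes and emits C `#define` lines. It is not GIACoNDE and does not reproduce its output format; the function name `publish_c_header`, the macro naming scheme and the example codes are all invented for illustration.

```python
def publish_c_header(group_id, template_version, codes):
    """Render named 5-part ID codes as C #define lines; reject clashes."""
    seen = {}
    lines = ["#define GROUP_ID %d" % group_id,
             "#define ID_TEMPLATE_VERSION %d" % template_version]
    for name, code in codes.items():
        if len(code) != 5:
            raise ValueError("%s: ID code must have 5 parts" % name)
        if code in seen:
            raise ValueError("%s clashes with %s" % (name, seen[code]))
        seen[code] = name
        # One constant per part, e.g. HUMIDITY_1 ... HUMIDITY_5.
        for part, value in enumerate(code, start=1):
            lines.append("#define %s_%d %d" % (name, part, value))
    return "\n".join(lines) + "\n"

# Hypothetical codes for group 2 (MonAliSA in the slides).
codes = {
    "HUMIDITY": (1, 0, 0, 0, 1),
    "LONG_REF_RAW": (3, 0, 0, 0, 1),
}
header_text = publish_c_header(2, 1, codes)
```

Generating the constants from one table per group is one way to keep codes unique across machines and languages, since every program reads the same published names instead of hand-typed numbers.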

ID code management: GIACoNDE

Future outlook
• GIACoNDE is close to completion
  • already produces useable output
  • some polishing still to be done
• File I/O libraries already written and tested
  • written in C, LabVIEW, Java
  • files written by one language can be read by another
  • other groups encouraged to use the software
• Java I/O will be added to LiCAS framework

Not just suitable for lab data...
For example: plan to use binary I/O in the next version of the 3 player game "Austerlitz" for saving the state of play

THE END
Thank you for listening
