
The design and implementation of the Neurophysiology Data translation Format (NDF)


Presentation Transcript


1. The design and implementation of the Neurophysiology Data translation Format (NDF)
Developed by Bojian Liang, Martyn Fletcher, Jim Austin. Advanced Computer Architectures Group, Dept. of Computer Science, University of York, York, YO10 5DD, UK. {bojian, martyn.fletcher, austin}@cs.york.ac.uk
Presented by Leslie Smith, University of Stirling, l.s.smith@cs.stir.ac.uk

2. Overview
• Data problems / issues.
• Our solution: the Neurophysiology Data translation Format (NDF).
• What is NDF and what does it provide?
• Future work.

3. The CARMEN Project
• The CARMEN (Code Analysis, Repository and Modelling for e-Neuroscience) project provides an environment for sharing neurophysiological experimental data and algorithms using GRID technology.
• A consortium effort to create a virtual laboratory for neurophysiology, led by 11 UK universities in collaboration with other academic and commercial partners, for the benefit of the neuroscience community.

4. Data interchangeability problem
• The CARMEN system has to handle a wide range of incoming data types as well as derived data.
• Such data is often unreadable unless vendor-specific software is used or the encoding format is known.
• Data may be used by human users or by services.
• In a processing chain, the output of one service may be the input of other services.
• It is impractical to have services that use arbitrary input and output data formats, particularly in workflows.
• There is therefore a need for data translation:
  • to allow resources to access a standard data format;
  • to facilitate an environment where data can be processed in a consistently interpretable way by both human users and machines.

5. Remote Data Issues
To avoid unnecessary downloading, moving and processing of remote data:
• A user needs as much information as possible about the data before it is downloaded or processed.
• A service needs to verify that the data is a valid input type before processing it.
• A workflow editor needs information to pre-verify the type of an input data set, coming from a remote data repository or from the output of another service, when constructing a workflow script.
Questions:
• How do we interrogate and understand remote data without downloading / accessing the whole binary data set?
• A file extension is not enough to pre-verify a workflow input / output file, so where does a workflow editor get the information to perform this verification?

6. Partial data access issues
Sub-dataset selection and partial data extraction / downloading:
• Neurophysiological experimental data are complex data sets, and most CARMEN services are designed to process only one of the data types within a data set.
• Raw data contains multiple channels from the acquisition equipment, but only some of these channels may be wanted.
• The volume of data in a single channel may be very large, while only certain channels and time intervals are of interest.
• Processed data and raw data may be mixed in the same data set.
Questions:
• Can we tell a service exactly which data portions we need to process?
• Can we download (or use) only the channels, or parts of channels, of interest?

7. Evolving data type issues
In a research environment, new data types / formats are created whenever new scientific instruments or services / algorithms are introduced. It is difficult or impossible to specify these precisely in advance.
Questions:
• Can we create services that accept new data types as input?
• Can we create services that produce new data types as output?
• Can all this be done in a consistent manner, using the predefined data types?
• How can a service that uses new data types support pre-verification, as with the predefined data types?

8. Can a well-designed metadata system solve the problems?
• With a generic metadata system, most users are specialists and will not appreciate many of the generic metadata specifications. When manually completing a metadata upload form, a user does not know which fields are required for the data set; consequently, the uploaded metadata may be incomplete and unusable.
• On uploading a data set, the metadata may not be directly available to the user – a special tool for a particular data format may be required.
• It is impractical to upload metadata manually for a huge number of data files.
• Uploading metadata automatically is equivalent to having a data standard: it implies that the metadata is already included in the data set and that a data standard is in use.
• Metadata for temporary data sets, such as the output of a service (which may be the input of other services), is not available from the metadata system.
• Separating the metadata from a data set reduces the data set's portability.
Our conclusion: metadata used for the above purposes should be integrated with the data set.

9. Basic data types
The primary data types are:
• TIMESERIES: continuous time series.
• NEURALEVENT: events such as spike times.
• EVENT: other event data (e.g. stimuli).
• SEGMENT: sections of TIMESERIES data.
• GMATRIX: generic, user-defined matrix data.
• IMAGE: image data.
Since the content is described using XML, additional data types can be added to cope with new developments in electrophysiology.
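As a rough illustration, the primary types could be carried in code as a simple tag enumeration. The sketch below is not the NDF library's actual definition; the identifier names are assumptions derived from the type names on this slide.

```c
/* A minimal sketch of the NDF primary data types as a C tag
 * enumeration. The enumerator names follow the slide; the exact
 * identifiers and values used by the NDF library are assumptions. */
typedef enum {
    NDF_TIMESERIES,  /* continuous time series            */
    NDF_NEURALEVENT, /* events such as spike times        */
    NDF_EVENT,       /* other event data (e.g. stimuli)   */
    NDF_SEGMENT,     /* sections of TIMESERIES data       */
    NDF_GMATRIX,     /* generic, user-defined matrix data */
    NDF_IMAGE        /* image data                        */
} ndf_data_type;
```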

10. The NDF data format (1)
NDF wraps metadata and binary data together with a configuration file.
• A separate NDF configuration file, using an XML format, minimizes the work necessary to extract metadata from a data set, obviating the need to look inside the associated binary data file. It is only necessary to download the NDF configuration file, and the metadata can then easily be viewed in a web browser (see the sketch after this slide).
• Two semi-defined data types are extendable on a per-application basis, and conventional vendor data files may also be "wrapped" as an NDF data set. A dedicated ID field allows this application-specified data to be identified.
• NDF supports the most commonly used numerical data types, from 8-bit integer to double-precision floating point. Using the most efficient data type reduces the data size, and with it the network traffic load when downloading / uploading NDF data sets.
• The NDF data format permits downloading "regions of interest" (partial data access) rather than the whole data set, further reducing network traffic. Partial access to a zipped MAT-file stream is supported.
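A minimal sketch of that header-only workflow, assuming a hypothetical C API: every ndf_* name below is invented for illustration, and the real NDF library's functions may be named and shaped differently. The point is only that the small XML configuration file is opened while the binary MAT data is never touched.

```c
/* Sketch: inspect an NDF data set from its XML configuration file
 * alone, without opening the (possibly very large) host MAT files.
 * All ndf_* names are hypothetical. */
#include <stdio.h>
#include "ndf.h"   /* hypothetical NDF API header */

int main(void)
{
    ndf_header *hdr = ndf_header_open("recording.ndf.xml");
    if (hdr == NULL) {
        fprintf(stderr, "not a valid NDF configuration file\n");
        return 1;
    }
    /* Enough metadata to pre-verify a service or workflow input. */
    printf("channels: %d\n", ndf_header_channel_count(hdr));
    printf("type of channel 0: %s\n", ndf_header_channel_type(hdr, 0));
    ndf_header_close(hdr);
    return 0;
}
```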

11. The NDF data format (2)
• For a data processing chain, a history or "foot-print" of each previous process can be included in the output data. This information is useful (and may be required) for later processing or reference; in particular, other researchers can easily repeat the work by referring to the data processing history records (see the sketch after this slide).
• NDF supports image data and image sequence data.
• A separate XML file can be used to store experimental event data, annotations and additional third-party data objects.
• NDF minimizes the need to re-implement research tools currently used by neuroscientists and researchers: a MAT file, a publicly described data format, is used as the main numerical data file format.
• NDF supports multiple data files for one data channel. This allows the data size of either a single channel or a full data set to exceed 2 GB on both 32-bit and 64-bit operating systems.
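For illustration, a service might append its "foot-print" to an output set with something like the following. The ndf_history_* names and the entry fields are assumptions, not the published NDF API; the slide specifies only that such a history record travels with the data.

```c
/* Sketch: appending a processing "foot-print" to an output NDF set.
 * All ndf_* names and the ndf_history_entry fields are hypothetical. */
#include "ndf.h"   /* hypothetical NDF API header */

void record_footprint(ndf_set *out)
{
    ndf_history_entry e = {
        .tool      = "spike-detection-service", /* what ran               */
        .version   = "1.2",                     /* which version          */
        .params    = "threshold=4.5",           /* how it was configured  */
        .timestamp = "2009-06-01T12:00:00Z"     /* when it ran            */
    };
    /* The record becomes part of the data set, so later services and
       other researchers can reconstruct the processing chain. */
    ndf_history_append(out, &e);
}
```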

12. The CARMEN Portal: NDF Data Channel & Time Selector

13. The NDF Data I/O API (1)
The NDF API:
• Is implemented as a C library.
• Provides a low-level I/O interface for accessing the NDF data set, including the XML-format header file, the MAT-format host data files and the XML-format annotation files.
• Translates the XML tree/nodes into C-style data structures.
• Insulates the MAT data format (and image-format data) from clients.
• Provides a standard way to manage data structure memory.
A sketch of the resulting usage pattern follows.
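Roughly: open the set, receive C structures translated from the XML tree, read the data, and release everything through the library's own memory-management calls. As before, every ndf_* identifier and structure field here is hypothetical; only the shape of the workflow is taken from the slide.

```c
/* Sketch of the low-level I/O pattern: clients see only C structures
 * and never touch the MAT format directly. ndf_* names hypothetical. */
#include <stdio.h>
#include "ndf.h"   /* hypothetical NDF API header */

int main(void)
{
    ndf_set *set = ndf_open("recording.ndf.xml");
    if (set == NULL)
        return 1;

    /* The XML tree has already been translated into C structures. */
    ndf_timeseries *ts = ndf_read_timeseries(set, /* channel */ 0);
    if (ts != NULL) {
        printf("read %lu samples at %g Hz\n",
               (unsigned long)ts->nsamples, ts->sample_rate);
        ndf_free_timeseries(ts);  /* library-managed deallocation */
    }
    ndf_close(set);
    return 0;
}
```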

14. The NDF Data I/O API (2)
The NDF API:
• Supports a multiple-run data writing mode for large data sets whose total length is known.
• Supports a multiple-run data writing mode for data streams whose total length is unknown.
• Supports zipped data streams in MAT files.
• Supports partial data reading of both compressed and uncompressed data in MAT files.
• Automatically manages data file splitting for large data sets.
A sketch combining streamed writing and partial reading follows.
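A compact sketch of these two capabilities, again with entirely hypothetical ndf_* names and flags: blocks are appended in multiple runs without knowing the total length in advance, and a read fetches only one channel's time window, whether the underlying MAT stream is zipped or not.

```c
/* Sketch: multiple-run (streamed) writing plus partial reading.
 * All ndf_* names and the NDF_WRITE_STREAM flag are hypothetical. */
#include <stddef.h>
#include "ndf.h"   /* hypothetical NDF API header */

void example(ndf_set *set, const double *buf, size_t n)
{
    /* Streamed writing for an unknown total length: each call appends
       a block, and the library splits the host MAT files automatically
       as they grow past the size limit. */
    ndf_writer *w = ndf_writer_open(set, /* channel */ 0, NDF_WRITE_STREAM);
    ndf_write_block(w, buf, n);   /* may be called repeatedly */
    ndf_writer_close(w);          /* finalizes the length in the header */

    /* Partial reading: only the samples in [10 s, 20 s) of channel 0
       are fetched, compressed or not. */
    size_t got = 0;
    double *win = ndf_read_timeseries_window(set, 0, 10.0, 20.0, &got);
    ndf_free_samples(win);
}
```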

15. The NDF MatLab Toolbox
The NDF MatLab Toolbox is implemented on top of the NDF C library API.
• It consists of a set of object-oriented MatLab classes and functions that provide high-level support for NDF data I/O.
• A converter from multiple data formats to NDF is embedded in the toolbox as the data input module.
• It provides full protection and auto-correction for misused data types in parameter structures.
• It has been used in programming CARMEN service code.
• It also serves as a set of convenient desktop tools for researchers performing NDF data I/O and data conversion.

16. Future work
• Expand the specification to improve compatibility with data sets from fields other than neuroscience.
• Provide services for partial downloading of remote data sets.
• Provide services for previewing remote data sets.
• Extend the data converter to support conversion from additional appropriate formats.
• … and enable future-proofing!
Detailed information is available at the CARMEN portal: https://portal.carmen.org.uk/
