Mapping Physical Formats to Logical Models to Extract Data and Metadata

Tara Talbott, IPAW '06

The problem & solutions
• Wide range of files and formats
  • Standard formats
  • Prescriptive parsers
  • Arbitrary formats
• Machines need to merge, parse, and generally comprehend these various formats
• Potential solutions:
  • Data must adhere to a pre-specified format
  • Customized programs are written for each format and version
  • Users describe the format of their data and use tools to convert the data to a widely used, machine-understandable format (e.g. XML)

Descriptive parser solution: DFDL
• Data Format Description Language
• Uses XML Schema with DFDL-specific annotations to describe the underlying data and how to transform it into a logical model.
• Example:

  "5, 9.35091E+02, 2.63227E+02, -6.20633E+07"

  <step id="5">
    <density unit="kg/m**3">935.091</density>
    <temp>263.227</temp>
    <pressure>-6.20633E7</pressure>
  </step>

Example DFDL Schema

<xs:element name="step">
  <xs:annotation>
    <xs:appinfo>
      <dfdl:repType>text</dfdl:repType>
      <dfdl:charset>UTF-8</dfdl:charset>
      <dfdl:separator>,</dfdl:separator>
    </xs:appinfo>
  </xs:annotation>
  <xs:complexType>
    <xs:sequence>
      <xs:element name="density">
        <xs:complexType>
          <xs:simpleContent>
            <xs:extension base="xs:float">
              <xs:attribute name="unit" type="xs:string" fixed="kg/m**3"/>
            </xs:extension>
          </xs:simpleContent>
        </xs:complexType>
      </xs:element>
      <xs:element name="temp" type="xs:float"/>
      <xs:element name="pressure" type="xs:float"/>
    </xs:sequence>
    <xs:attribute name="id" type="xs:integer" use="required"/>
  </xs:complexType>
</xs:element>
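As a rough illustration of what such a description drives, the sketch below hand-codes the same mapping in Python. It is not Defuddle and it does not read DFDL: the `STEP_FORMAT` dictionary and the `parse_step` function are hypothetical stand-ins for the representation annotations (charset, separator) and the element declarations in the schema above.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the schema: representation properties plus the
# ordered fields of one "step" record. None of these key names come from DFDL.
STEP_FORMAT = {
    "charset": "UTF-8",
    "separator": ",",
    "attribute": ("id", int),
    "elements": [
        ("density", float, {"unit": "kg/m**3"}),
        ("temp", float, {}),
        ("pressure", float, {}),
    ],
}


def parse_step(raw: bytes, fmt: dict = STEP_FORMAT) -> ET.Element:
    """Map one separated text record onto a <step> element."""
    text = raw.decode(fmt["charset"])
    tokens = [token.strip() for token in text.split(fmt["separator"])]
    if len(tokens) != 1 + len(fmt["elements"]):
        raise ValueError(f"expected {1 + len(fmt['elements'])} fields, got {len(tokens)}")

    # The first field becomes the attribute on <step>.
    attr_name, attr_type = fmt["attribute"]
    step = ET.Element("step", {attr_name: str(attr_type(tokens[0]))})

    # The remaining fields become child elements, in declaration order.
    for token, (name, value_type, attrs) in zip(tokens[1:], fmt["elements"]):
        child = ET.SubElement(step, name, attrs)
        child.text = repr(value_type(token))
    return step


step = parse_step(b"5, 9.35091E+02, 2.63227E+02, -6.20633E+07")
print(ET.tostring(step, encoding="unicode"))
```

The numeric values match the slide's example, although Python's float formatting may render them differently (for instance, the pressure is not printed in exponent notation). The point of the sketch is that changing the separator or the field list means editing the description, not the parsing code.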

Defuddle Parser Design
• An implementation of the DFDL specification

Capabilities
• Basic
  • Binary/text parsing of simple types
  • Basic math operations
  • Looping
  • Conditional logic
  • Use of regular expressions for separators and terminators
  • Input from multiple data sources
• Advanced
  • External translators
  • Intermediate layers in the data that can be used for processing but are not reflected in the output
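The idea of regular expressions as separators and terminators can be pictured with Python's standard `re` module. This is only a sketch and says nothing about how Defuddle implements the feature: the two patterns, the `split_records` helper, and the sample text are all invented for illustration.

```python
import re

# Assumed patterns for illustration: a record ends at a newline or a semicolon,
# and fields are separated by a comma (with optional padding) or by whitespace.
TERMINATOR = re.compile(r"\r?\n|;")
SEPARATOR = re.compile(r"\s*,\s*|\s+")


def split_records(text: str) -> list[list[str]]:
    """Split text into records on TERMINATOR, then into fields on SEPARATOR."""
    records = []
    for chunk in TERMINATOR.split(text):
        chunk = chunk.strip()
        if chunk:  # skip the empty piece after a trailing terminator
            records.append(SEPARATOR.split(chunk))
    return records


# Made-up sample mixing commas, tabs, spaces, semicolons, and line endings.
sample = "1, 2.5\t3.0\n4 5.5,6.0;7,8.5 , 9.0\r\n"
rows = split_records(sample)
print(rows)
```

With fixed-string delimiters, each variation in this sample would need its own case; a pattern lets one description cover them all.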

Parsing Complex Formats
• Scientific formats on which Defuddle's capabilities have been demonstrated:
  • CHEMKIN solution file
  • NWChem molecular dynamics property file
  • NWChem electronic structure output file
  • Microarray and protein-protein interaction spreadsheets
• Transformations within scientific workflows to avoid custom programming
• Other formats that we would like to see handled in the future: HDF, JPEG, etc.

What problems does Defuddle address?
• Integrating different data formats, for collaboration on data generated before or without standardization
• Naming/identification of arbitrary file sub/super-structures
• Long-term preservation and reading of data when the applications used to create it are no longer available
• Efficient, general data access capabilities
  • Random access
  • Data virtualization
  • Multiple descriptions of the same data
• Using DFDL and DFDL-1 as a general subsetting/transformation mechanism
• Metadata extraction

Extracting metadata
• SAM
• DFDL+XSLT
• Benefits of automatic provenance/annotation capture
• Example use: Microarray data – extracting header information
• Application to provenance

Discussion
• Challenges
  • Efficient and generic: is it possible?
    • Size
    • Variable-length text
  • Data virtualization: providing an abstract view of the data, independent of the underlying storage system
  • Naming of data subsets: map a name to a reference into the logical model, not the physical one. E.g.:

    //step[5]/pressure

    <step id="5"> … <pressure>-6.20633E7</pressure> </step>
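Naming a subset against the logical model can be tried with any XPath-capable library. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath and requires a relative path (`.//` rather than `//`). The `steps` wrapper element, the pressure values, and the `resolve` helper are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Build a small logical model: five <step> elements with made-up pressures.
root = ET.Element("steps")
for i in range(1, 6):
    step = ET.SubElement(root, "step", {"id": str(i)})
    ET.SubElement(step, "pressure").text = str(-1.0e7 * i)


def resolve(name: str) -> str:
    """Resolve a logical name (an XPath-style path) to the text it refers to."""
    node = root.find(name)
    if node is None:
        raise KeyError(name)
    return node.text


# [5] selects the fifth <step> by position; [@id='5'] selects by attribute.
# In this model the two coincide, but they are different kinds of name.
by_position = resolve(".//step[5]/pressure")
by_id = resolve(".//step[@id='5']/pressure")
print(by_position, by_id)
```

Neither name mentions byte offsets, separators, or file layout, so the same reference would remain valid if the physical format behind the logical model changed.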

Questions?
• http://sdg.pnl.gov
• http://defuddle.pnl.gov
• http://forge.gridforum.org/projects/dfdl-wg
• Tara.Talbott@pnl.gov
