Mapping Physical Formats to Logical Models to Extract Data and Metadata

Tara Talbott, IPAW ’06



Presentation Transcript


  1. Mapping Physical Formats to Logical Models to Extract Data and Metadata. Tara Talbott, IPAW ’06

  2. The problem & solutions
     • Wide range of files and formats
     • Standard formats
     • Prescriptive parsers
     • Arbitrary formats
     • Machines need to merge, parse, and generally comprehend these various formats
     • Potential solutions:
       • Data must adhere to a pre-specified format
       • Customized programs are written for each format and version
       • Users describe the format of their data and use tools to convert the data to a widely used, machine-understandable format (e.g. XML)

  3. Descriptive parser solution: DFDL
     • Data Format Description Language
     • Uses XML Schema with DFDL-specific annotations to describe the underlying data and how to transform it into a logical model
     • Example: the text record
         5, 9.35091E+02, 2.63227E+02, -6.20633E+07
       maps to
         <step id="5">
           <density unit="kg/m**3">935.091</density>
           <temp>263.227</temp>
           <pressure>-6.20633E7</pressure>
         </step>
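The mapping in the slide 3 example can be sketched in a few lines of Python. This is only an illustration of the input-to-output transformation, not how a DFDL parser works internally: the field order is hard-coded here, whereas DFDL reads it from a schema. The function name `record_to_step` is hypothetical.

```python
import xml.etree.ElementTree as ET


def record_to_step(line: str) -> ET.Element:
    """Turn 'id, density, temp, pressure' text into a <step> element.

    The field order and the fixed unit are taken from the slide's example;
    everything else (names, error handling) is an assumption of this sketch.
    """
    parts = [field.strip() for field in line.split(",")]
    if len(parts) != 4:
        raise ValueError(f"expected 4 fields, got {len(parts)}")
    step_id, density, temp, pressure = parts

    step = ET.Element("step", {"id": str(int(step_id))})
    # The density unit is fixed by the format description, not read from the data.
    ET.SubElement(step, "density", {"unit": "kg/m**3"}).text = repr(float(density))
    ET.SubElement(step, "temp").text = repr(float(temp))
    ET.SubElement(step, "pressure").text = repr(float(pressure))
    return step


if __name__ == "__main__":
    element = record_to_step("5, 9.35091E+02, 2.63227E+02, -6.20633E+07")
    print(ET.tostring(element, encoding="unicode"))
```

Python's default float formatting may render the pressure differently from the slide's `-6.20633E7`; the numeric value is the same.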

  4. Example DFDL Schema
     <element name="step">
       <xs:annotation>
         <xs:appinfo>
           <dfdl:repType>text</dfdl:repType>
           <dfdl:charset>UTF-8</dfdl:charset>
           <dfdl:separator>,</dfdl:separator>
         </xs:appinfo>
       </xs:annotation>
       <complexType>
         <attribute name="id" type="xs:integer" use="required"/>
         <sequence>
           <element name="density" type="xs:float">
             <complexType>
               <attribute name="unit" type="xs:string" fixed="kg/m**3"/>
             </complexType>
           </element>
           <element name="temp" type="xs:float"/>
           <element name="pressure" type="xs:float"/>
         </sequence>
       </complexType>
     </element>
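The idea behind the schema on slide 4 is that a declarative description, rather than format-specific code, drives the parser. A minimal Python sketch of that idea follows. It does not read DFDL or XML Schema; the `TextFormat` and `Field` classes are hypothetical stand-ins that carry the same three properties the annotations declare (text representation, charset, separator) plus the typed fields.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class Field:
    name: str                        # logical name in the output model
    convert: Callable[[str], object]  # text-to-value conversion (e.g. int, float)


@dataclass(frozen=True)
class TextFormat:
    charset: str                # mirrors <dfdl:charset>
    separator: str              # mirrors <dfdl:separator>
    fields: Tuple[Field, ...]   # mirrors the declared attribute and elements


# Description of the "step" record, transcribed by hand from the slide's schema.
STEP_FORMAT = TextFormat(
    charset="UTF-8",
    separator=",",
    fields=(
        Field("id", int),
        Field("density", float),
        Field("temp", float),
        Field("pressure", float),
    ),
)


def parse_record(raw: bytes, fmt: TextFormat) -> Dict[str, object]:
    """Generic parser: all format knowledge comes from `fmt`, none is hard-coded."""
    parts = [p.strip() for p in raw.decode(fmt.charset).split(fmt.separator)]
    if len(parts) != len(fmt.fields):
        raise ValueError(f"expected {len(fmt.fields)} fields, got {len(parts)}")
    return {f.name: f.convert(p) for f, p in zip(fmt.fields, parts)}


if __name__ == "__main__":
    print(parse_record(b"5, 9.35091E+02, 2.63227E+02, -6.20633E+07", STEP_FORMAT))
```

Supporting a new format or version then means writing a new description rather than a new parser, which is the trade-off the "Potential solutions" slide points to.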

  5. Defuddle Parser Design
     • An implementation of the DFDL specification

  6. Capabilities
     • Basic
       • Binary/text parsing of simple types
       • Basic math operations
       • Looping
       • Conditional logic
       • Use of regular expressions for separators and terminators
       • Input from multiple data sources
     • Advanced
       • External translators
       • Specify intermediate layers in the data which can be used for processing, but are not reflected in the output
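Two of the basic capabilities listed on slide 6, regular-expression separators and binary parsing of simple types, can be illustrated with the Python standard library. This is a sketch of the concepts only and says nothing about how Defuddle implements them. The separator pattern and the binary layout (a little-endian 32-bit integer followed by three 64-bit floats) are assumptions chosen for the example.

```python
import re
import struct
from typing import List, Tuple

# Assumed separator: a comma or semicolon with optional surrounding whitespace.
SEPARATOR = re.compile(r"\s*[,;]\s*")

# Assumed binary layout: little-endian int32 id, then three float64 values.
STEP_LAYOUT = struct.Struct("<i3d")


def split_fields(text: str) -> List[str]:
    """Split a text record on a regex separator instead of a fixed character."""
    return SEPARATOR.split(text.strip())


def read_binary_step(buf: bytes) -> Tuple[int, float, float, float]:
    """Decode one fixed-size binary record into (id, density, temp, pressure)."""
    if len(buf) != STEP_LAYOUT.size:
        raise ValueError(f"expected {STEP_LAYOUT.size} bytes, got {len(buf)}")
    return STEP_LAYOUT.unpack(buf)


if __name__ == "__main__":
    print(split_fields("5 ; 9.35091E+02,2.63227E+02 ;  -6.20633E+07"))
    packed = STEP_LAYOUT.pack(5, 935.091, 263.227, -6.20633e7)
    print(read_binary_step(packed))
```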

  7. Parsing Complex Formats
     • Scientific formats on which Defuddle has been demonstrated:
       • CHEMKIN solution file
       • NWChem molecular dynamics property file
       • NWChem electronic structure output file
       • Microarray and protein-protein interaction spreadsheets
     • Transformations within scientific workflows to avoid custom programming
     • Other formats that we would like to see handled in the future: HDF, JPEG, etc.

  8. What problems does Defuddle address?
     • Integrating different data formats, for collaboration on data generated before or without standardization
     • Naming/identification of arbitrary file sub/super-structures
     • Long-term preservation and reading of data when the applications used to create it are no longer available
     • Efficient, general data access capabilities
     • Random access
     • Data virtualization
     • Multiple descriptions of the same data
     • Using DFDL and DFDL-1 as a general subsetting/transformation mechanism
     • Metadata extraction

  9. Extracting metadata
     • SAM
     • DFDL+XSLT
     • Benefits of automatic provenance/annotation capture
     • Example use: microarray data, extracting header information
     • Application to provenance

  10. Discussion
      • Challenges
        • Efficient and generic: is it possible?
        • Size
        • Variable-length text
        • Data virtualization: providing an abstract view of the data, independent of the underlying storage system
        • Naming of data subsets: map a name to a reference in the logical model, not the physical one. E.g. //step[5]/pressure refers to
            <step id="5"> … <pressure>-6.20633E7</pressure> </step>
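Naming a data subset through the logical model, as in the //step[5]/pressure example on slide 10, can be tried out with Python's built-in XPath subset. In this sketch the `<run>` wrapper element and the placeholder pressures of steps 1 to 4 are hypothetical; only the step 5 value comes from the slides. Note that `step[5]` selects by position, while `step[@id='5']` selects by the `id` attribute; the two agree here only because the steps are numbered consecutively from 1.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional


def build_run(pressures: List[str]) -> ET.Element:
    """Build a logical model: <run> containing one <step> per pressure value."""
    run = ET.Element("run")  # hypothetical wrapper element
    for index, pressure in enumerate(pressures, start=1):
        step = ET.SubElement(run, "step", {"id": str(index)})
        ET.SubElement(step, "pressure").text = pressure
    return run


def resolve(root: ET.Element, path: str) -> Optional[str]:
    """Resolve a logical name (an XPath expression) to the text it refers to."""
    node = root.find(path)
    return None if node is None else node.text


# Steps 1-4 hold placeholder values; step 5 uses the value shown on the slide.
RUN = build_run(["0.0", "0.0", "0.0", "0.0", "-6.20633E7"])

if __name__ == "__main__":
    # ElementTree paths are relative, hence the leading "." before "//".
    print(resolve(RUN, ".//step[5]/pressure"))
    print(resolve(RUN, ".//step[@id='5']/pressure"))
```

The caller never sees byte offsets or line numbers, so the same name stays valid if the physical layout of the underlying file changes.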

  11. Questions?
      • http://sdg.pnl.gov
      • http://defuddle.pnl.gov
      • http://forge.gridforum.org/projects/dfdl-wg
      • Tara.Talbott@pnl.gov
