
Data Formats, Standards, and Provenance in Data Science

Learn about different data formats, metadata standards, and conventions, as well as reading and writing data and information with a focus on provenance. Covers ASCII, UTF-8, ISO-8859-1, self-describing formats, database, graphs, and more.

Presentation Transcript


  1. Data formats, metadata standards, conventions, reading and writing data and information (provenance) Peter Fox and Greg Hughes Data Science – ITWS/CSCI/ERTH Module 3, September 13, 2016

  2. Contents • Assignment on data collection exercise • Reading from last week • Data formats • Metadata standards, conventions • Reading and writing data and information (embedded), where does provenance show up? • Next week (data analysis I)

  3. Data Formats • We will cover some (not all) • ASCII, UTF-8, ISO 8859-1 • Self-describing formats • Table-driven • Markup languages and other web-based • Database • Graphs • Unstructured

  4. ASCII • American Standard Code for Information Interchange • http://www.webopedia.com/TERM/A/ASCII.html • Table of characters http://www.webopedia.com/quick_ref/asciicode.asp • ISO-8859-1 (aka ISO Latin 1) is a superset of ASCII – used on the web to represent ‘non-ASCII’ characters • Non-printing characters
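
A minimal sketch of the superset relationship on this slide, using Python's built-in codecs. The sample strings are illustrative assumptions, not taken from the lecture.

```python
# Sample text with one character outside the 7-bit ASCII range (illustrative).
text = "café"
ascii_part = "cafe"

# Pure ASCII text encodes to identical bytes in ASCII, ISO-8859-1, and UTF-8.
same_bytes = (
    ascii_part.encode("ascii")
    == ascii_part.encode("iso-8859-1")
    == ascii_part.encode("utf-8")
)

# 'é' is a single byte (0xE9) in ISO-8859-1 but two bytes in UTF-8.
latin1_bytes = text.encode("iso-8859-1")
utf8_bytes = text.encode("utf-8")

# ASCII itself cannot represent 'é' at all.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(same_bytes, len(latin1_bytes), len(utf8_bytes), ascii_ok)
```

Decoding bytes with the wrong one of these encodings is a common source of garbled characters, which is why the encoding belongs in the metadata of any text file.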

  5. Example – good or bad?

  6. Example – good or bad?

  7. Example – good or bad?

  8. Example – good or bad? Where is the data? Where is the provenance?

  9. Example – good or bad?

  10. EBCDIC • Extended Binary-Coded Decimal Interchange Code • IBM • 7-bit, 8-bit, 9-bit • If someone mentions this just RUN away • Seriously, it is okay to convert
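
As a sketch of "it is okay to convert": Python ships codecs for several EBCDIC code pages. The example assumes code page `cp037` (US/Canada); which code page a real file uses is something you would have to find out from its provenance.

```python
# Assumption: the data uses EBCDIC code page 037; other pages (e.g. cp500) differ.
EBCDIC_CODEC = "cp037"

# Simulate bytes as they might arrive from an EBCDIC system.
ebcdic_bytes = "HELLO".encode(EBCDIC_CODEC)

# Convert: decode EBCDIC into text, then write it back out as ASCII.
text = ebcdic_bytes.decode(EBCDIC_CODEC)
ascii_bytes = text.encode("ascii")

print(ebcdic_bytes.hex(), ascii_bytes.hex())
```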

  11. Making ASCII more useful • Delimited: CSV or tab or (gulp) ASCII space • Improves parsing • How to handle special characters? • How to handle ambiguous delimiters? • Moving them in/out of “Excel”, for example • Templates • Encoding strings, e.g. • ‘f7.4, c32, i8,f5.3,e9.4’
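
One possible answer to the "special characters" and "ambiguous delimiters" questions is the quoting convention used by CSV. The sketch below uses Python's standard `csv` module; the station record is made up for illustration.

```python
import csv
import io

# Illustrative rows: the comment field contains both the delimiter and quotes.
rows = [
    ["station", "comment", "value"],
    ["A1", 'windy, gusts to "gale" force', "7.4"],
]

# Write with minimal quoting: only fields that need protection are quoted.
buffer = io.StringIO()
csv.writer(buffer, quoting=csv.QUOTE_MINIMAL).writerows(rows)
encoded = buffer.getvalue()

# A naive split on commas miscounts the fields in the data row.
naive = encoded.splitlines()[1].split(",")

# A CSV-aware reader recovers the original three fields.
parsed = list(csv.reader(io.StringIO(encoded)))

print(len(naive), len(parsed[1]))
```

The naive split sees four fields where the CSV reader sees three, which is the kind of silent error that appears when delimited files are parsed by hand.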

  12. Reading and writing ASCII • Text editor (vi, emacs, wordpad) • Dangers exist in applications that add hidden formatting, i.e. characters • Even cr, lf (two ASCII characters, non-printable) can often cause major reading and interoperability problems • Procedural languages (e.g. C) • Interpreted languages (e.g. Perl, Python) • Data structures are utilized (e.g. typed arrays, often multidimensional) to provide logical organization to data that is read in, or in preparation for writing it out
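
A small sketch of the cr/lf problem, assuming the same table was saved once with Unix line endings (lf) and once with Windows line endings (cr lf). The byte strings are illustrative.

```python
# The same three-line table with lf and with cr+lf line endings (illustrative).
unix_bytes = b"time,value\n0,1.5\n1,2.5\n"
windows_bytes = b"time,value\r\n0,1.5\r\n1,2.5\r\n"

# Splitting only on lf leaves a hidden cr at the end of every field-bearing line.
naive_lines = windows_bytes.decode("ascii").split("\n")

# str.splitlines() treats lf, cr, and cr+lf all as line boundaries.
robust_lines = windows_bytes.decode("ascii").splitlines()

print(repr(naive_lines[0]), repr(robust_lines[0]))
```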

  13. Data in “data structures” • JSON – JavaScript Object Notation json.org/example {"menu": { "id": "file", "value": "File", "popup": { "menuitem": [ {"value": "New", "onclick": "CreateNewDoc()"}, {"value": "Open", "onclick": "OpenDoc()"}, {"value": "Close", "onclick": "CloseDoc()"} ] } }}

  14. Data in “data structures” • JSON – JavaScript Object Notation json.org/example {"menu": { "id": "file", "value": "File", "popup": { "menuitem": [ {"value": "New", "onclick": "CreateNewDoc()"}, {"value": "Open", "onclick": "OpenDoc()"}, {"value": "Close", "onclick": "CloseDoc()"} ] } }} The same text expressed as XML: <menu id="file" value="File"> <popup> <menuitem value="New" onclick="CreateNewDoc()" /> <menuitem value="Open" onclick="OpenDoc()" /> <menuitem value="Close" onclick="CloseDoc()" /> </popup> </menu>
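
To check that the two encodings on this slide really carry the same information, the sketch below parses both with Python's standard library and compares the menu items. Only the whitespace of the slide's text has been changed.

```python
import json
import xml.etree.ElementTree as ET

json_text = """{"menu": {"id": "file", "value": "File", "popup": {"menuitem": [
  {"value": "New", "onclick": "CreateNewDoc()"},
  {"value": "Open", "onclick": "OpenDoc()"},
  {"value": "Close", "onclick": "CloseDoc()"}]}}}"""

xml_text = """<menu id="file" value="File"><popup>
  <menuitem value="New" onclick="CreateNewDoc()" />
  <menuitem value="Open" onclick="OpenDoc()" />
  <menuitem value="Close" onclick="CloseDoc()" />
</popup></menu>"""

# JSON maps directly onto dicts and lists.
menu = json.loads(json_text)["menu"]
json_items = [(m["value"], m["onclick"]) for m in menu["popup"]["menuitem"]]

# XML maps onto a tree of elements with attributes.
root = ET.fromstring(xml_text)
xml_items = [(m.get("value"), m.get("onclick")) for m in root.iter("menuitem")]

print(json_items == xml_items)
```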

  15. Data in “applications” • Increasing trend in storing data in application files, e.g. Excel, .mat (Matlab), .sav (IDL), … • What advantages? • Ready to use • Data structures are provided • What problems? • Data structures may not match the underlying data representation (model), i.e. information and data may be lost (e.g. float instead of double) • Format versions • Interoperability – can it be read by another app?
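
The "float instead of double" problem can be reproduced with the standard `struct` module: the sketch stores the same value as a 64-bit double and as a 32-bit float and reads each back. The value 0.1 is an arbitrary illustration.

```python
import struct

value = 0.1  # arbitrary illustrative measurement

# Round trip through an 8-byte IEEE 754 double: nothing is lost.
as_double = struct.unpack("<d", struct.pack("<d", value))[0]

# Round trip through a 4-byte IEEE 754 float: low-order bits are discarded.
as_float = struct.unpack("<f", struct.pack("<f", value))[0]

print(as_double == value, as_float == value, abs(as_float - value))
```

The error is small, but it is introduced silently by the storage format, so it is exactly the kind of change that provenance records should capture.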

  16. FreeForm • 15+ years ago there was an attempt to provide a templated (almost table driven) approach • Good homework assignment when you are bored – find out why it was created and what happened to it

  17. Spreadsheets • E.g. Excel – import data, Save As csv

  18. Documentation?

  19. CDF • Common Data Format • The Common Data Format (CDF) is a self-describing data format for the storage and manipulation of scalar and multidimensional data in a platform- and discipline-independent fashion • Although CDF has its own internal self describing format, it consists of more than just a data format. CDF is a scientific data management package (known as the "CDF Library") which allows programmers and application developers to manage and manipulate scalar, vector, and multi-dimensional data arrays

  20. CDFML • The CDF office realized that scientific progress is often impeded by the lack of, or excessive multiplicity of, available standards for data formats and structures and/or data format translators. In a bid to facilitate and promote data sharing with other data formats, the CDF office has decided to adopt Extensible Markup Language (XML) as a basis for establishing interoperability with other scientific data formats and created CDF Markup Language (CDFML) to describe CDF data and metadata.

  21. netCDF • Network Common Data Form (and API) • Self describing – what does this mean? • Variables, dimensions, types, attributes, coordinates • ncdump • nc_open • nc_inquire • nc_dim • nc_varget/ put • nc_attget/ put

  22. NcML • NcML is an XML representation of netCDF metadata, (approximately) the header information one gets from a netCDF file with the "ncdump -h" command. • NcML is similar to the netCDF CDL (network Common data form Description Language), except, of course, it uses XML syntax. • http://www.unidata.ucar.edu/software/netcdf/ncml/ • NetCDF-Java library support - http://www.unidata.ucar.edu/software/netcdf-java/index.html

  23. HDF5 • HDF5 is a data model, library, and file format for storing and managing data. It supports an unlimited variety of datatypes, and is designed for flexible and efficient I/O and for high volume and complex data. • HDF5 is portable and is extensible, allowing applications to evolve in their use of HDF5. • The HDF5 Technology suite includes tools and applications for managing, manipulating, viewing, and analyzing data in the HDF5 format. • VERY complex API

  24. HDF4 • At its lowest level, HDF is a physical file format for storing scientific data • At its highest level, HDF is a collection of utilities and applications for manipulating, viewing, and analyzing data in HDF files • Between these levels, HDF is a software library that provides high-level APIs and a low-level data interface

  25. HDFEOS • A variant of HDF for the Earth Observing System (EOS) • http://hdfeos.org/ • More under metadata later

  26. HDFEOS Profiles over time

  27. Common Data Model • Combines netCDF and HDF into one model, and API • Uses the underlying HDF format representation but uses the netCDF v4 API • Simplifies access • Version 4: https://www.unidata.ucar.edu/software/thredds/current/netcdf-java/CDM/ (note the “data model”)

  28. FITS • FITS stands for ‘Flexible Image Transport System’ and is the standard astronomical data format endorsed by both NASA and the IAU. • FITS is much more than an image format (such as JPG or GIF) and is primarily designed to store scientific data sets consisting of multi-dimensional arrays (1-D spectra, 2-D images or 3-D data cubes) and 2-dimensional tables containing rows and columns of data. • Many APIs

  29. TIFF/GeoTIFF • Tagged Image File Format 24-bit support • http://www.libtiff.org/ • GeoTIFF is a public domain metadata standard which allows georeferencing information to be embedded within a TIFF file. • The potential additional information includes projections, coordinate systems, ellipsoids, datums, and everything else necessary to establish the exact spatial reference for the file. • The GeoTIFF format is fully compliant with TIFF 6.0, so software incapable of reading and interpreting the specialized metadata will still be able to open a GeoTIFF file.

  30. RDBMS (in one slide) • A Relational database management system (RDBMS) is a database management system (DBMS) that is based on the relational model as introduced by E. F. Codd. • Most popular commercial and open source databases currently in use are based on the relational model. • A short definition of an RDBMS may be a DBMS in which data is stored in the form of tables and the relationship among the data is also stored in the form of tables.
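
A minimal sketch of "data in tables, relationships in tables", using the `sqlite3` module from the Python standard library. The table names and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
    -- Two tables holding data (hypothetical schema).
    CREATE TABLE station    (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE instrument (id INTEGER PRIMARY KEY, kind TEXT NOT NULL);

    -- The relationship between them is itself stored as a table.
    CREATE TABLE deployment (
        station_id    INTEGER REFERENCES station(id),
        instrument_id INTEGER REFERENCES instrument(id)
    );

    INSERT INTO station    VALUES (1, 'Alpha'), (2, 'Bravo');
    INSERT INTO instrument VALUES (1, 'thermometer'), (2, 'barometer');
    INSERT INTO deployment VALUES (1, 1), (1, 2), (2, 2);
""")

# Joining through the relationship table answers "which instrument is where?".
rows = conn.execute("""
    SELECT s.name, i.kind
    FROM deployment d
    JOIN station s    ON s.id = d.station_id
    JOIN instrument i ON i.id = d.instrument_id
    ORDER BY s.name, i.kind
""").fetchall()
conn.close()

print(rows)
```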

  31. BUFR • Binary Universal Form for the Representation of meteorological data (BUFR) is a binary data format maintained by the World Meteorological Organization • The latest version is BUFR Edition 4 • BUFR Edition 3 is also considered current for operational use • http://www.wmo.ch/pages/prog/www/WMOCodes/OperationalCodes.html

  32. BUFR structure • A BUFR message is composed of six sections, numbered zero through five. • Sections 0, 1 and 5 contain static metadata, mostly for message identification. • Section 2 is optional; if used, it may contain arbitrary data in any form wished for by the creator of the message (this is only advisable for local use). • Section 3 contains a sequence of so-called descriptors that define the form and contents of the BUFR data product. • Section 4 is a bit-stream containing the message's core data and meta-data values as laid out by Section 3. The product description contained in Section 3 can be made sophisticated and non-trivial by the use of replication and/or operator descriptors.
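
A rough sketch of reading the identification metadata that frames a message. It assumes the layout as I understand it from the WMO specification (not stated on the slide): Section 0 is 8 octets, consisting of the characters "BUFR", a 3-octet big-endian total message length, and a 1-octet edition number, and Section 5 is the 4 characters "7777". The function name and the synthetic message are hypothetical, and Sections 1 to 4 are replaced by placeholder bytes, so this is not a valid BUFR product.

```python
def read_bufr_indicator(message: bytes) -> dict:
    """Check the framing of a BUFR message and return Section 0 metadata."""
    if message[:4] != b"BUFR":
        raise ValueError("missing 'BUFR' start marker (Section 0)")
    total_length = int.from_bytes(message[4:7], "big")  # octets 5-7
    edition = message[7]                                # octet 8
    if len(message) != total_length:
        raise ValueError("declared length does not match message size")
    if message[-4:] != b"7777":
        raise ValueError("missing '7777' end marker (Section 5)")
    return {"total_length": total_length, "edition": edition}


# Synthetic stand-in: 20 zero bytes take the place of Sections 1 to 4.
placeholder_body = bytes(20)
total = 8 + len(placeholder_body) + 4
synthetic = b"BUFR" + total.to_bytes(3, "big") + bytes([4]) + placeholder_body + b"7777"

info = read_bufr_indicator(synthetic)
print(info)
```

Decoding Sections 3 and 4 requires the WMO descriptor tables, which is why real work is normally done with an existing BUFR library rather than by hand.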

  33. GRIB • GRIB (GRIdded Binary) is a mathematically concise data format commonly used in meteorology to store historical and forecast weather data • Significant amount of software available • See Wikipedia page for more details

  34. ESML • ESML is an interchange technology that enables data (both structural and semantic) interoperability with applications without enforcing a standard format within the Earth science community. • Users can write external files using the ESML schema to describe the structure of the data file. • Applications can utilize the ESML Library to parse this description file and decode the data format. • Software developers can build data-format-independent scientific applications utilizing the ESML technology. • Semantic tags can be added to the ESML files by linking different domain ontologies to provide a complete machine-understandable data description. • An ESML description file allows the development of intelligent applications that can now understand and "use" the data.

  35. ESML • Earth Science Markup Language • http://sourceforge.net/projects/esml/ • Schema • Editor • Library • Tutorial • Application API - IDL

  36. CSML • http://csml.badc.rl.ac.uk/ • Climate Science Markup Language • CSML is a standards-based data model and GML (Geography Markup Language) application schema for atmospheric and oceanographic data with associated software tools developed at the Rutherford Appleton Laboratory. • Java library at: http://csml.badc.rl.ac.uk/java/

  37. CSML Java code
      ProfileCoverage cov = ...;
      PrintStream out = ...; // e.g. System.out
      RecordType rangeType = cov.getRangeType();
      out.println("<table>");
      for (Record record : cov.getRange()) {
          // Each Record is a row in the table
          out.print("<tr>");
          for (String memberName : rangeType.getMemberNames()) {
              // Each member represents a different Phenomenon and
              // is a column in the table.
              out.print("<td>" + record.getValue(memberName) + "</td>");
          }
          out.println("</tr>");
      }
      out.println("</table>");

  38. RDF • http://www.w3.org/RDF/ - Resource Description Framework • Read the introduction and overview • Graph representation and encoding • RDF the model and RDF/XML the encoding • Many tools, and very good language support • Is the foundation of ‘data on the web’, see www.linkeddata.org • JSON-LD (JSON for Linked Data) • We cover this more in a later class
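
To make "graph representation" concrete, here is a toy sketch of the RDF model as a set of (subject, predicate, object) triples with a simple pattern match. It deliberately avoids any RDF library; the `http://example.org/` namespace, the resources, and the `match` helper are hypothetical, while the Dublin Core element namespace is the real one.

```python
DC = "http://purl.org/dc/elements/1.1/"   # Dublin Core element namespace
EX = "http://example.org/"                # hypothetical namespace

# A graph is just a set of (subject, predicate, object) statements.
triples = {
    (EX + "dataset1", DC + "title", "Sea surface temperature"),
    (EX + "dataset1", DC + "creator", EX + "alice"),
    (EX + "alice", EX + "name", "Alice"),
}


def match(s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# Follow an edge: who created dataset1, and what is that creator's name?
creator = match(s=EX + "dataset1", p=DC + "creator")[0][2]
creator_name = match(s=creator, p=EX + "name")[0][2]
print(creator_name)
```

Because the object of one triple can be the subject of another, statements link into a graph; RDF/XML and JSON-LD are different serializations of the same set of triples.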

  39. Break?

  40. Metadata formats • Fall into three categories • Unstructured and disconnected • With the data • ‘Close’ to the data • See the ASCII example and contrast this with the netCDF example • Structure around metadata is very important • Vocabulary (constraints) are also very useful • We dream of contextual metadata…

  41. Dublin Core • DCMI is an open organization engaged in the development of interoperable online metadata standards that support a broad range of purposes and business models. • ISO Standard 15836:2003 of February 2003 • ANSI/NISO Standard Z39.85-2007 of May 2007 • IETF RFC 5013 of August 2007 • Metadata element set - http://dublincore.org/documents/dces/ • Metadata terms - http://dublincore.org/documents/dcmi-terms/
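
A minimal sketch of a Dublin Core description, built with Python's standard `xml.etree.ElementTree`. The element names (`title`, `creator`, `date`) come from the Dublin Core element set and the values describe this lecture; the wrapping `metadata` element is a hypothetical container, not something the standard prescribes.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)  # serialize with the conventional 'dc' prefix

# (element, value) pairs; elements may repeat, e.g. one creator per person.
fields = [
    ("title", "Data formats, metadata standards, conventions, "
              "reading and writing data and information (provenance)"),
    ("creator", "Peter Fox"),
    ("creator", "Greg Hughes"),
    ("date", "2016-09-13"),
]

record = ET.Element("metadata")  # hypothetical wrapper element
for name, value in fields:
    ET.SubElement(record, "{%s}%s" % (DC_NS, name)).text = value

xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)

# Reading it back: find every dc:creator by its namespace-qualified name.
creators = [e.text for e in ET.fromstring(xml_text).findall("{%s}creator" % DC_NS)]
```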

  42. DC Type Vocabulary - Sect. 7 • Collection • Dataset • Event • Image • InteractiveResource • MovingImage • PhysicalObject • Service • Software • Sound • StillImage • Text

  43. METS • The METS schema is a standard for encoding descriptive, administrative, and structural metadata regarding objects within a digital library, expressed using the XML schema language of the World Wide Web Consortium. The standard is maintained in the Network Development and MARC Standards Office of the Library of Congress, and is being developed as an initiative of the Digital Library Federation.

  44. METS example

  45. METS example profile

  46. Time • ISO 8601 specifies numeric representations of date and time. • Helps to avoid confusion in international communication due to different national notations • Increases the portability of computer user interfaces • Good read: http://www.cl.cam.ac.uk/~mgk25/iso-time.html • In XML encodings, see xsd:dateTime • http://www.w3.org/TR/NOTE-datetime • http://www.w3.org/TR/xmlschema-2/
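
A short sketch of writing and reading ISO 8601 timestamps with Python's standard `datetime` module. The date is the one on the title slide; the time of day and the extra dates are made up for illustration.

```python
from datetime import datetime, timezone

# Lecture date from the title slide; the 14:30 UTC time of day is hypothetical.
lecture = datetime(2016, 9, 13, 14, 30, tzinfo=timezone.utc)

# Write: year-month-day, 'T', time, and an explicit UTC offset.
stamp = lecture.isoformat()

# Read: the string parses back to the same instant.
parsed = datetime.fromisoformat(stamp)

# A side benefit: ISO 8601 dates sort chronologically as plain strings.
dates = ["2016-09-13", "2015-12-01", "2016-01-30"]
in_order = sorted(dates)

print(stamp, parsed == lecture, in_order)
```

Compare this with a notation such as 09/13/2016 versus 13/09/2016, where the reader has to know the national convention before the value can be interpreted.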
