Wade Sheldon GCE-LTER University of Georgia

Software Tools for Automated Metadata Creation, Metadata-mediated Data Processing and Quality Control Analysis – real-time processing solutions for real-time data


Presentation Transcript


  1. Software Tools for Automated Metadata Creation, Metadata-mediated Data Processing and Quality Control Analysis – real-time processing solutions for real-time data
     Wade Sheldon, GCE-LTER, University of Georgia

  2. Ecoinformatics Challenges
     • Ecologists use heterogeneous data from many sources for synthesis (often processed using multiple tools and technologies)
        • manually-collected data (spreadsheets, text files)
        • instrument data loggers (text files, telemetry)
        • WWW/network data stores (text/HTML/XML files, streams)
     • Data volumes increasing exponentially
     • Expectations and requirements for metadata quality/quantity increasing
     • Expectations for data accessibility increasing

  3. Problems
     • Data processing, QA/QC and metadata creation are often the limiting factors in information management (IM), not acquisition
     • Disparity between capacity and expectations forces trade-offs:
        • rapid posting of provisional data (no QA/QC, minimal metadata)
        • slow posting of finalized data (months to years)

  4. Potential Solutions
     • Improve “scope for automation” at all levels
     • Develop a dynamic, flexible QA/QC process to improve efficiency
     • Adopt a more unified approach to data processing, so that data processing, QA/QC and metadata creation occur simultaneously

  5. Typical Data Processing Scenario (diagram)
     • Processing stages: Raw/Unprocessed, Digitized/Acquired, Validated, Standardized, Quality-Controlled, Finalized, Customized/Modified
     • Metadata: labeled once, at the Finalized stage

  6. Ideal Data Processing Scenario (diagram)
     • Processing stages: Raw/Unprocessed, Digitized/Acquired, Validated, Standardized, Quality-Controlled, Finalized, Customized/Modified
     • Metadata: labeled at each stage from Digitized/Acquired onward
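The contrast between the two scenarios can be sketched in code. The following is a minimal Python illustration, not the GCE implementation: the `Dataset` class, the `run_stage` helper and the `temp_c` column are all hypothetical names chosen for the example. It shows a pipeline in which every processing stage updates the metadata and history alongside the data, rather than leaving metadata creation until the data are finalized.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """Hypothetical container: tabular rows plus the metadata that describes them."""
    rows: list
    metadata: dict = field(default_factory=dict)
    history: list = field(default_factory=list)


def run_stage(ds, name, transform, note):
    """Apply one processing stage and record it in metadata at the same time."""
    new_rows = transform(ds.rows)
    return Dataset(
        rows=new_rows,
        metadata={**ds.metadata, "stage": name},
        history=ds.history + [f"{name}: {note} ({len(new_rows)} rows)"],
    )


def drop_missing(rows):
    # Validation step (illustrative): discard rows with no temperature reading.
    return [r for r in rows if r["temp_c"] is not None]


def round_values(rows):
    # Standardization step (illustrative): fixed precision for temperature.
    return [{**r, "temp_c": round(r["temp_c"], 1)} for r in rows]


raw = Dataset(
    rows=[{"temp_c": 21.337}, {"temp_c": None}, {"temp_c": 19.052}],
    metadata={"stage": "Digitized/Acquired"},
)
validated = run_stage(raw, "Validated", drop_missing, "removed rows with missing temp_c")
standardized = run_stage(validated, "Standardized", round_values, "rounded temp_c to 0.1")

for entry in standardized.history:
    print(entry)
```

Because each stage returns a new `Dataset`, the metadata for any intermediate product is already complete at the moment it is produced, which is the property the ideal scenario asks for.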

  7. Approach at GCE
     • Developed a universal tabular data storage format (GCE Data Structure) and modular software (GCE Data Toolbox) for data processing
     • Used MATLAB®
        • Local expertise, large scientific user base
        • Cross-platform (Win32, Solaris, *nix, Mac OS X)
        • Rapid development environment
        • Supports multiple interfaces
        • Good interoperability with other technologies (Java, Perl, SQL)

  8. GCE Data Structures GCE Data Structure Specification (v1.1)

  9. GCE Data Toolbox
     • Toolbox functions support:
        • Importing data from all common formats (ASCII, ML, SQL)
        • Performing dynamic, rule-based QA/QC flagging (with support for inter-column dependencies) plus interactive manual flagging
        • Dynamically generating metadata using a combination of “templating”, automatic, and manual entry approaches
        • Exporting data and metadata in multiple ASCII/ML formats
        • Data transformation, including unit conversions, geographic coordinate re-projection, date/time conversions
        • Statistical analysis, sub-setting, super-setting, data visualization on plots and maps
     • Metadata queried for all operations (mediation)
     • All operations and data changes transparently logged and synchronized with metadata
     • Metadata from multiple structures “meshed” after merge/join to retain information during synthesis
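Rule-based flagging with inter-column dependencies can be illustrated with a short sketch. This is a Python approximation under stated assumptions, not the toolbox's MATLAB code or its rule syntax: the flag codes (`I` for invalid, `Q` for questionable), the column names and the thresholds are all hypothetical. Each rule is a predicate over the whole row, so a rule attached to one column can test the value of another.

```python
# Hypothetical rule table: column -> list of (flag code, predicate over the full row).
RULES = {
    "salinity": [
        ("I", lambda r: r["salinity"] < 0),    # invalid: negative salinity
        ("Q", lambda r: r["salinity"] > 40),   # questionable: unusually high
        ("Q", lambda r: r["depth_m"] < 0.1),   # inter-column rule: sensor may be exposed
    ],
    "depth_m": [
        ("I", lambda r: r["depth_m"] < 0),     # invalid: negative depth
    ],
}


def flag_rows(rows, rules):
    """Return copies of the rows with a per-column string of flag codes attached."""
    flagged = []
    for row in rows:
        flags = {}
        for column, column_rules in rules.items():
            codes = [code for code, test in column_rules if test(row)]
            # dict.fromkeys removes duplicate codes while keeping their order.
            flags[column] = "".join(dict.fromkeys(codes))
        flagged.append({**row, "flags": flags})
    return flagged


readings = [
    {"salinity": 28.5, "depth_m": 1.20},
    {"salinity": 45.0, "depth_m": 1.00},
    {"salinity": 30.1, "depth_m": 0.05},
    {"salinity": -1.0, "depth_m": 0.80},
]
flagged = flag_rows(readings, RULES)

for row in flagged:
    print(row["salinity"], row["depth_m"], row["flags"])
```

Keeping the rules in a table rather than in code means they can be stored with the metadata and re-evaluated whenever the data change, which is one way to read "dynamic" flagging.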

  10. Interfaces
     • Developed multiple interfaces for the software
        • Command line (supports unattended batch-mode and interactive processing)
        • Desktop GUI application (requires no MATLAB expertise, uses standard dialogs/controls)
        • Web application with HTML forms, query string input

  11. Current Applications
     • Processing, QA/QC of all GCE monitoring data
     • Data packaging for WWW distribution (linked to Metadata RDBMS)
     • WWW application for data set customization
     • Automatic near-real-time data harvesting, processing, WWW-posting:
        • USGS data (2 GCE stations)
        • CSI climate station
        • YSI hydrographic data logger
     • USGS Data Harvester for HydroDB (31 stations/7 LTER sites)

  12. Software Development
     • What resources were available
        • None – completely de novo project
     • Need for the tool
        • Efficiently process monitoring data from sensor networks and stations with near-zero manpower
     • Time to develop
        • Software is a core component of the GCE-IS, with development spread over 2.5 years (effort hard to quantify, but likely 3-4 months)
     • Scalability
        • Performs well with data sets <100k records (10-20k commonly used), but memory and speed may become limiting >100k
        • Some extensibility features incorporated (import filters, templates, metadata styles, unit conversions)
     • Portability
        • Requires MATLAB 5.3-6.5 (commercial), but both code and binary data files are fully compatible with any Java-supported platform

  13. Software Demo

  14. Availability
     • Description, screen-shots and fully-functional software available on WWW: http://gce-lter.marsci.uga.edu/lter/research/tools/data_toolbox.htm
     • Requires MATLAB 5.3+ (6.0+ recommended) on any supported platform (Win32, Solaris, *nix, Mac OS X)
     • “Public” version is compiled but includes command-line help and some user extensibility
     • Source code requests considered on a case-by-case basis

  15. Future Development Plans
     • EML 2.0 support
     • Fully automated metadata-mediated data set integration
     • Automatic unit conversions
     • Scaling (e.g. time frequency)
     • More WWW interface development
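As a rough picture of what metadata-mediated integration with automatic unit conversion might involve, the Python sketch below joins two data sets by reading the units recorded in their metadata and converting where they differ. Everything here is an assumption made for illustration: the dictionary layout, the `CONVERSIONS` table, the unit labels and the function names are hypothetical and do not describe the toolbox's planned design.

```python
# Hypothetical conversion table keyed by (source unit, target unit).
CONVERSIONS = {
    ("degF", "degC"): lambda v: (v - 32.0) * 5.0 / 9.0,
    ("ft", "m"): lambda v: v * 0.3048,
}


def convert_column(values, from_unit, to_unit):
    """Convert a list of values between units, using only the metadata labels."""
    if from_unit == to_unit:
        return list(values)
    if (from_unit, to_unit) not in CONVERSIONS:
        raise ValueError(f"no conversion defined from {from_unit} to {to_unit}")
    convert = CONVERSIONS[(from_unit, to_unit)]
    return [convert(v) for v in values]


def integrate(a, b):
    """Append data set b to data set a, converting b's columns to a's units."""
    merged = {"units": dict(a["units"]), "columns": {}, "history": []}
    for name, target_unit in a["units"].items():
        source_unit = b["units"][name]
        converted = convert_column(b["columns"][name], source_unit, target_unit)
        merged["columns"][name] = a["columns"][name] + converted
        if source_unit != target_unit:
            # Log the change so the merged metadata stays in step with the data.
            merged["history"].append(f"{name}: converted {source_unit} to {target_unit}")
    return merged


station_a = {
    "units": {"air_temp": "degC", "stage": "m"},
    "columns": {"air_temp": [20.0], "stage": [1.5]},
}
station_b = {
    "units": {"air_temp": "degF", "stage": "ft"},
    "columns": {"air_temp": [212.0], "stage": [10.0]},
}
merged = integrate(station_a, station_b)

print(merged["columns"])
print(merged["history"])
```

The point of the sketch is that no unit is hard-coded in the merge step: the decision to convert comes entirely from the metadata, and an unknown unit pair fails loudly instead of silently mixing units.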
