GCE Data Toolbox -- metadata-based tools for automated data processing and analysis
Wade Sheldon
University of Georgia
GCE-LTER
Rationale
• Data processing, quality control, data analysis and metadata generation are traditionally carried out as separate activities, often in different time frames using different technologies
• Problems:
  • Metadata may not reflect all processing steps
  • Much routine data analysis is done without Q/C or metadata
  • No economy of scale – leads to “one-off” solutions
• Metadata generation should ideally occur throughout the data cycle and “inform” data analysis
Design Goals
• Develop Integrated Storage Standard
  • Tabular Data
  • QA/QC Information
  • Metadata (overall data set & columns/attributes)
• Develop Software to Support Standard
  • Code Library/API
  • User Interfaces
• Apply Technology to Acquire, Manage, Distribute GCE-LTER Data
• Explore Use as Prototype Technology for Metadata-based Data Processing, Synthesis
Storage Standard
• Developed Using MATLAB®
  • Local expertise, large scientific user base
  • Cross-platform (Win32, Solaris, *nix, Mac OS X)
  • Rapid development environment
  • Supports multiple interfaces (interactive command line, batch-mode scripts, GUI, WWW)
  • Good interoperability with other technologies (Java, Perl, SQL)
• Defined “GCE Data Structure” Spec. (based on MATLAB/C structures)
  • Structure with 17 named fields
  • Specific content rules for each field (software validation)
  • Combines data, metadata, QA/QC, processing history
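To make the idea of a single validated structure concrete, here is a minimal sketch in Python rather than MATLAB. The field names (`title`, `columns`, `values`, `flags`, `history`) and the rules in `validate` are assumptions chosen for illustration; the slide does not list the 17 fields of the actual specification or their content rules.

```python
from dataclasses import dataclass, field


# Hypothetical field names; the real specification defines 17 named fields
# that are not listed on the slide.
@dataclass
class DataSet:
    title: str
    columns: list   # one dict of metadata per column: name, units, datatype
    values: list    # one list of values per column
    flags: list     # one list of QA/QC flag strings per column
    history: list = field(default_factory=list)  # processing log entries


def validate(ds):
    """Return a list of rule violations; an empty list means the structure passed."""
    errors = []
    if not (len(ds.columns) == len(ds.values) == len(ds.flags)):
        errors.append("columns, values and flags must have the same length")
        return errors
    for col, vals, flg in zip(ds.columns, ds.values, ds.flags):
        for key in ("name", "units", "datatype"):
            if key not in col:
                errors.append(f"column metadata missing '{key}'")
        if len(vals) != len(flg):
            errors.append(f"{col.get('name', '?')}: values/flags length mismatch")
    return errors


ds = DataSet(
    title="Example salinity record",
    columns=[{"name": "Salinity", "units": "PSU", "datatype": "float"}],
    values=[[31.2, 30.8, 29.9]],
    flags=[["", "", ""]],
)
problems = validate(ds)  # empty list for this well-formed example
```

Keeping data, column metadata, flags and history in one object is what lets every later step check and update all four together.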
Storage Standard
GCE Data Structure Specification (v1.1)
Software – GCE Data Toolbox
• Core Function Library
  • Create, Validate Structures
  • Import Data, Metadata (ASCII, MATLAB, SQL)
  • Manipulate Data, Metadata (unit conversions, add/delete/update)
  • Export Data, Metadata (various formats)
  • Dynamic, Rule-based QA/QC Flagging
• Self-documenting Processing
  • Operation Logging (Processing History)
  • Transparent Metadata Creation/Updating
  • Dynamic (JIT) Metadata Generation for Columns
• Support for Metadata “Templating”
  • Application of Boilerplate Metadata based on Parameter Matching
  • Supports Rapid Documentation of Routine Data Sources
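Rule-based flagging combined with operation logging might look roughly like the following Python sketch. The rule table, the flag codes `I` and `Q`, and the helper names `apply_flags` and `log_step` are hypothetical and are not taken from the toolbox.

```python
from datetime import datetime, timezone

# Hypothetical rules: (flag code, predicate that is True when a value is suspect).
RULES = {
    "Temperature": [
        ("I", lambda x: x < -5 or x > 45),  # invalid: outside plausible range
        ("Q", lambda x: x > 35),            # questionable: unusually high
    ],
}


def apply_flags(column_name, values, rules=RULES):
    """Return one flag string per value by evaluating every rule for the column."""
    flags = []
    for v in values:
        codes = [code for code, is_suspect in rules.get(column_name, []) if is_suspect(v)]
        flags.append("".join(codes))
    return flags


def log_step(history, message):
    """Append a timestamped entry so the processing history documents itself."""
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    history.append(f"{stamp}: {message}")


history = []
temps = [12.4, 36.0, 99.9, -7.1]
flags = apply_flags("Temperature", temps)
log_step(history, f"flagged Temperature: {sum(1 for f in flags if f)} of {len(temps)} values")
```

Because the rules are stored as data rather than hard-coded, flags can be recomputed whenever values or rules change, which is one plausible reading of "dynamic" flagging.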
Software – GCE Data Toolbox
• Support for Analysis
  • Descriptive Statistics, Reports
  • Visualization, Mapping
• Support for Synthesis
  • Composite Data Set Creation
    • Multiple Data Set Merge/Concatenation
    • Relational Join
    • Metadata Content Meshing
  • Data Set Summarization
    • Statistical Data Reduction/Re-sampling
  • Data Set Standardization
    • Unit Conversions (automatic, interactive)
    • Template-based Semantic Mapping
    • Automatic Semantic Mediation (prototype stage)
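A metadata-driven unit conversion, one of the standardization steps listed above, could be sketched as follows in Python. The conversion table and the function `convert_column` are illustrative assumptions; the point is only that the units recorded in the column metadata select the conversion, and that the metadata and processing history are updated in the same step.

```python
# Hypothetical conversion table keyed by (from_units, to_units).
CONVERSIONS = {
    ("ft", "m"): lambda x: x * 0.3048,
    ("degF", "degC"): lambda x: (x - 32.0) * 5.0 / 9.0,
}


def convert_column(column, values, to_units, history):
    """Convert values using the units recorded in the column metadata."""
    key = (column["units"], to_units)
    if key not in CONVERSIONS:
        raise ValueError(f"no conversion defined for {key[0]} -> {key[1]}")
    converted = [CONVERSIONS[key](v) for v in values]
    new_column = dict(column, units=to_units)  # copy with updated units
    history.append(f"converted {column['name']} from {key[0]} to {to_units}")
    return new_column, converted


history = []
col = {"name": "Gage_Height", "units": "ft"}
new_col, metres = convert_column(col, [1.0, 10.0], "m", history)
```

Returning a new metadata record instead of editing the original in place keeps the source column description intact if the conversion result is discarded.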
Software – User Interfaces
• Unattended Batch Mode Processing
• Interactive Command Line Processing (conventional MATLAB UI)
  • Full help text for each function
  • Well-defined input/output arguments
• GUI Applications
  • Standard Forms, Dialogs, Controls
  • No MATLAB Experience Required
• WWW – MATLAB Web Server
  • HTML Forms, Querystring Input
  • HTML Pages and/or Static File Output
Current Applications
• Automated Data Processing
  • Direct data import from data logger files, WWW data sources (USGS), SQL queries
  • Automatic metadata creation (templates, data mining)
  • Rule-based QA/QC flagging
• Data Set Packaging
  • Batch processing to create/update data, metadata products
  • On-demand generation of data, metadata, stat reports in custom formats (end-user scripts, GUI applications, WWW forms)
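One way to read template-based metadata creation is as a lookup that matches imported column names against stored boilerplate. The Python sketch below assumes a hypothetical `TEMPLATE` dictionary and `apply_template` helper; the toolbox's actual template format is not shown in the slides.

```python
# Hypothetical template: boilerplate column metadata keyed by parameter name.
TEMPLATE = {
    "Temp": {"units": "degC", "description": "Water temperature"},
    "Sal": {"units": "PSU", "description": "Salinity"},
}


def apply_template(column_names, template=TEMPLATE):
    """Build column metadata by matching imported column names to the template."""
    metadata = []
    for name in column_names:
        # Start from placeholder values, then overlay any matching boilerplate.
        entry = {"name": name, "units": "unspecified", "description": ""}
        entry.update(template.get(name, {}))
        metadata.append(entry)
    return metadata


meta = apply_template(["Temp", "Sal", "Depth"])
```

Columns with no match keep placeholder values, so undocumented parameters from a routine data source remain visible for manual follow-up.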
Current Applications
• Data Exploration/Analysis by PIs
  • Descriptive Statistics based on attribute metadata
  • Visualization with Interactive Filtering (Frequency Histograms, 2D Plots, Map Plots)
  • Data Reduction/Re-sampling to Provide Customized Data at Various “Scales”
    • Aggregated Statistics
    • Binned Statistics
    • Query/Filtering (sub-selection)
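Binned statistics of the kind listed above can be illustrated with a short Python sketch. The function `binned_means` and the sample values are made up for illustration; the toolbox's own reduction functions may bin and summarize differently.

```python
from collections import defaultdict
from statistics import mean


def binned_means(x, y, bin_width):
    """Group y by fixed-width bins of x and return {bin_start: mean of y}."""
    bins = defaultdict(list)
    for xi, yi in zip(x, y):
        start = (xi // bin_width) * bin_width  # lower edge of the bin holding xi
        bins[start].append(yi)
    return {start: mean(vals) for start, vals in sorted(bins.items())}


hours = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5]
temps = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
summary = binned_means(hours, temps, bin_width=3)  # two 3-hour bins
```

Changing `bin_width` is what produces the same record at different "scales", for example hourly observations reduced to 3-hour or daily means.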
Current Applications
• Data Harvesting (GCE)
  • USGS Data (WWW real-time, daily, finalized data)
  • Campbell Scientific Data Arrays (post-processing triggered after LoggerNet Retrieval)
  • Sea-Bird Hydrographic Data
• USGS Data Harvesting Service for HydroDB
  • Weekly harvest for 31 stations/7 LTER Sites
  • Automatic Resampling, Unit Conversions, Q/C
Availability
• Description, Screen-shots, Fully-functional Toolbox Available on WWW: http://gce-lter.marsci.uga.edu/lter/research/tools/data_toolbox.htm
• Requires MATLAB 5.3, 6.0, 6.5 (any platform)
• “Public” Version Compiled
• Source Code Requests Considered on Case-by-Case Basis
Future Development Plans
• EML 2.0 Support
• Metadata-mediated Data Set Integration
  • Unit conversions
  • Re-sampling
• More WWW Interface Development