Dynamic, Rule-based Quality Control Framework for Real-time Sensor Data

Wade Sheldon, Georgia Coastal Ecosystems LTER, University of Georgia


Presentation Transcript


  1. Dynamic, Rule-based Quality Control Framework for Real-time Sensor Data. Wade Sheldon, Georgia Coastal Ecosystems LTER, University of Georgia

  2. Introduction
     • Quality control of high-volume, real-time data from automated sensors is an emerging challenge
     • Traditional techniques (plotting, stats) often don’t scale well
     • Data validation and Q/C can be a limiting factor in getting data “online”
     • Difficulties lead to release delays or posting of provisional data
     • Software developed at Georgia Coastal Ecosystems LTER has proven useful for Q/C of real-time data
     • Designed to automate GCE data processing and metadata generation, but very generalized and supports any tabular data
     • Provides a dynamic, rule-based Q/C framework for data processing, analysis and synthesis

  3. Framework Components
     • Comprehensive data model
       • Implemented as hierarchical MATLAB ‘structure’ arrays
       • Package dataset & attribute metadata, data, Q/C rules, qualifier flags
     • Metadata-based MATLAB software (GCE Data Toolbox)
       • Automatic (rule-based) and manual assignment of Q/C qualifier flags
       • Transparent management of flags throughout all data manipulation
       • Q/C-aware data management and analysis tools
       • Q/C-aware data integration and synthesis tools
     • Modular implementation supports many scenarios
       • Interactive (command-line API and GUI forms)
       • Automated workflows (timed or triggered)
       • End-to-end (logger-to-scientist) or part of larger workflow
     • Runs natively on multiple platforms (PC, *nix, MacOS)

  4. GCE Data Toolbox Data Model

  5. Quality Control Rules
     • Basic syntax: [logical expression]='[flag code]'
     • Logical Expressions:
       • Any conditional statement or call to MATLAB function that returns logical array (0 = false, 1 = true)
       • Dataset columns referenced in statements as:
         • “x” – alias for current column (e.g. x<0)
         • “col_[name]” – any dataset column by name (e.g. “col_Depth<0”)
     • Flag Codes:
       • Alphanumeric character to assign when expression true (I, q, 9, *)
       • Codes defined in the dataset metadata (I = invalid value, …)
     • Unlimited rules per attribute, multiple flags per value
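The rule syntax above can be illustrated with a small parser. This is a minimal sketch in Python, not the GCE Data Toolbox's MATLAB implementation: the function name `parse_rules` is hypothetical, and it assumes rules are separated by semicolons, that each flag code is a single character in straight single quotes, and that no semicolon appears inside an expression.

```python
def parse_rules(criteria):
    """Split "expr='F';expr='F'" into (expression, flag_code) pairs.

    Hypothetical helper for illustration only; assumes ';' separates rules
    and never occurs inside an expression.
    """
    rules = []
    for part in criteria.split(";"):
        part = part.strip()
        if not part:
            continue
        # The flag code follows the LAST '=' so that comparison operators
        # such as '<=' inside the expression are left untouched.
        expression, _, quoted_flag = part.rpartition("=")
        flag = quoted_flag.strip().strip("'")
        if not expression or len(flag) != 1:
            raise ValueError(f"malformed rule: {part!r}")
        rules.append((expression.strip(), flag))
    return rules


rules = parse_rules("x<0='I';x>100='I';x<20='Q';x>80='Q'")
for expression, flag in rules:
    print(f"{expression!r} assigns flag {flag!r}")
```

Splitting on the last `=` is the one design choice worth noting: it keeps an expression such as `col_Precip<=0.1` intact while still isolating the quoted flag code.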

  9. Quality Control Rule Examples
     • Numeric Comparisons:
       • Simple:
         • x<0='I' (flags negative values)
         • x<0='I';x>100='I';x<20='Q';x>80='Q' (overlapping bounds checks)
       • Statistical:
         • x>(mean(x)+3*std(x))='Q';x<(mean(x)-3*std(x))='Q' (flags values more than 3 standard deviations from column mean)
       • Multi-column:
         • col_DOC>col_TOC='I' (in column DOC; flags DOC exceeding TOC)
         • col_Dry_Weight<(col_Wet_Weight-col_Ash_Weight)*0.90='I' (flags dry weights below 90% of wet weight minus ash weight)
         • col_Depth<0='I' (in column Salinity; flags Salinity when Depth < 0)
       • Compound (Boolean operators):
         • col_RH_Percent>100&col_Precip<=0.1='Q' (flags humidity > 100% except during significant precipitation events)
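The semantics of these rules (one boolean per row, with every matching rule contributing its flag code) can be sketched as follows. This is an illustrative Python analogue rather than toolbox code: `apply_rules`, `three_sigma` and the sample values are all made up for the example, and each rule is written as a Python callable instead of a MATLAB expression string.

```python
from statistics import mean, stdev


def apply_rules(column, rules, columns):
    """Return one flag string per row; a row collects the code of every rule it matches."""
    x = columns[column]
    flags = [""] * len(x)
    for test, code in rules:
        mask = test(x, columns)  # one boolean per row, like a MATLAB logical array
        for i, hit in enumerate(mask):
            if hit and code not in flags[i]:
                flags[i] += code
    return flags


def three_sigma(x, columns):
    """Analogue of x>(mean(x)+3*std(x)) or x<(mean(x)-3*std(x)), using the sample std."""
    m, s = mean(x), stdev(x)
    return [abs(v - m) > 3 * s for v in x]


# Made-up sample data for illustration
data = {
    "Depth": [1.2, -0.3, 0.8, 2.0],
    "Salinity": [30.1, 29.8, 150.0, 15.0],
}

salinity_rules = [
    (lambda x, c: [v < 0 for v in x], "I"),     # x<0='I'
    (lambda x, c: [v > 100 for v in x], "I"),   # x>100='I'
    (lambda x, c: [v < 20 for v in x], "Q"),    # x<20='Q'
    (lambda x, c: [v > 80 for v in x], "Q"),    # x>80='Q'
    (lambda x, c: [d < 0 for d in c["Depth"]], "I"),  # col_Depth<0='I'
]

flags = apply_rules("Salinity", salinity_rules, data)
print(flags)
```

Note that the third row matches both an 'I' and a 'Q' rule, which mirrors the "multiple flags per value" behaviour described for overlapping bounds checks.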

  12. Quality Control Rule Examples (cont.)
      • Text Comparisons:
        • “IS”, “NOT” for strings, “IN”, “NOT IN” for lists
        • flag_notinlist(x,'Spartina,Juncus,Zizaniopsis')='Q'
      • Algorithmic Criteria (custom functions):
        • fn(parameters)='Q'
        • Various included Q/C functions
          • pattern checks, geographic checks, specialized algorithms (O2 saturation, etc.)
        • User-defined functions:
          • Any MATLAB code or “wrapped” calls to FORTRAN, Java, Python, etc.
          • Unlimited scope
      • Full suite of MATLAB numeric analysis capabilities supported, and extensible to use other technology
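A custom Q/C function only has to return one boolean per value. The two Python functions below are hypothetical stand-ins written for illustration: `not_in_list` mimics the idea behind the list check shown above (it is not the toolbox's `flag_notinlist`, whose actual signature and options are not given here), and `flatline` is an assumed example of a "pattern check" that marks runs of repeated readings.

```python
def not_in_list(values, allowed_csv):
    """True for each value that is absent from a comma-separated list of allowed names."""
    allowed = {name.strip() for name in allowed_csv.split(",")}
    return [v not in allowed for v in values]


def flatline(values, min_run):
    """True for every value inside a run of at least min_run identical consecutive values."""
    mask = [False] * len(values)
    start = 0
    for i in range(1, len(values) + 1):
        # A run ends at the end of the data or when the value changes.
        if i == len(values) or values[i] != values[start]:
            if i - start >= min_run:
                for j in range(start, i):
                    mask[j] = True
            start = i
    return mask


species_mask = not_in_list(["Spartina", "Borrichia", "Juncus"],
                           "Spartina,Juncus,Zizaniopsis")
stuck_mask = flatline([1, 2, 2, 2, 3, 3], min_run=3)
print(species_mask, stuck_mask)
```

Either mask could then be paired with a flag code such as 'Q' in the same way as a logical expression.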

  13. Q/C Rule Management
      • Rules can be defined in metadata “templates” and applied automatically to attributes when raw data are imported
      • Rules can also be created and managed using a GUI form

  14. Q/C Flag Assignment
      • Q/C criteria evaluated to assign/clear flags when:
        • Metadata template applied or Q/C criteria edited
        • New data records, columns added
        • Values edited (GUI) or columns updated (CLI)
        • Evaluation function (dataflag) invoked directly
      • Flags can also be assigned/cleared manually by:
        • Clicking/dragging on plots with the mouse
        • Using a spreadsheet-like grid
        • Importing from text attributes (e.g. 3rd party codes)
        • Propagating flags from source column(s) to dependent column(s)
      • Manual assignment locks flags by inserting a “manual” token in the criteria; removing “manual” restores automatic evaluation
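The locking behaviour described in the last bullet can be sketched as follows. This Python example is an assumption-laden illustration: the slide does not say how the “manual” token is stored, so the sketch treats it as one more semicolon-separated element of the criteria string, and all function names are hypothetical.

```python
def is_locked(criteria):
    """True when the criteria string carries a 'manual' token (assumed ';'-separated)."""
    return "manual" in [token.strip() for token in criteria.split(";")]


def lock(criteria):
    """Add the 'manual' token so automatic evaluation is skipped."""
    if is_locked(criteria):
        return criteria
    return criteria + ";manual" if criteria else "manual"


def unlock(criteria):
    """Remove the 'manual' token so automatic evaluation resumes."""
    tokens = [token.strip() for token in criteria.split(";")]
    return ";".join(t for t in tokens if t and t != "manual")


def refresh_flags(criteria, current_flags, evaluate):
    """Re-evaluate rules unless the criteria are locked by a 'manual' token."""
    if is_locked(criteria):
        return list(current_flags)  # keep manually assigned flags untouched
    return evaluate(criteria)


criteria = "x<0='I'"
locked = lock(criteria)
kept = refresh_flags(locked, ["Q", ""], evaluate=lambda c: ["", ""])
recomputed = refresh_flags(unlock(locked), ["Q", ""], evaluate=lambda c: ["", ""])
print(locked, kept, recomputed)
```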

  15. Q/C-Aware Data Management & Analysis
      • Q/C flags can be visualized in data editor grid and plots
      • Flagged values can be selectively removed from data sets
      • Statistics can be generated with/without flagged values
      • Flags can be instantiated as coded text columns for export
      • Flagged, missing values can be summarized by parameter and date for metadata
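Generating statistics with or without flagged values amounts to filtering rows on their flag codes before computing. A minimal Python sketch under stated assumptions: flags are stored as one string of codes per value, missing values are represented as `None`, and `column_mean` is a hypothetical name, not a toolbox function.

```python
from statistics import mean


def column_mean(values, flags, exclude=""):
    """Mean of non-missing values, skipping rows that carry any code listed in `exclude`."""
    kept = [
        v for v, f in zip(values, flags)
        if v is not None and not (set(f) & set(exclude))
    ]
    return mean(kept) if kept else None


# Made-up sample data for illustration
values = [10.0, 12.0, 500.0, None]
flags = ["", "Q", "I", ""]

mean_all = column_mean(values, flags)                  # flagged values included
mean_valid = column_mean(values, flags, exclude="I")   # invalid values dropped
mean_strict = column_mean(values, flags, exclude="IQ") # questionable values dropped too
print(mean_all, mean_valid, mean_strict)
```

Passing the codes to exclude as a parameter reflects the point that no single strategy is imposed: the same data can be summarized at different levels of strictness.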

  16. Q/C-Aware Data Synthesis
      • Flagged, missing values summarized in re-sampled data (aggregated, binned, date-time resampled), with automatic Q/C rule creation
      • Flags automatically “locked” when merging multiple data sets (i.e. unions)
      • All Q/C operations logged to processing history, reported in metadata to document lineage
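One way to picture the summary of flagged and missing values in re-sampled data is a daily aggregation that carries counts alongside each mean. The Python sketch below is illustrative only: the record layout, the function names, and the threshold rule in `derived_flag` are assumptions, since the slide does not specify which rules the toolbox creates automatically.

```python
from collections import defaultdict
from statistics import mean


def daily_means(records):
    """Aggregate (date_string, value_or_None, flag_string) records by date.

    Flagged values are still included in the mean here; they are only counted.
    """
    bins = defaultdict(list)
    for day, value, flags in records:
        bins[day].append((value, flags))
    summary = {}
    for day, rows in sorted(bins.items()):
        present = [v for v, _ in rows if v is not None]
        summary[day] = {
            "mean": mean(present) if present else None,
            "n": len(rows),
            "missing": sum(1 for v, _ in rows if v is None),
            "flagged": sum(1 for v, f in rows if v is not None and f),
        }
    return summary


def derived_flag(stats, max_bad_fraction=0.5, code="Q"):
    """Hypothetical derived rule: flag a bin when too many inputs were missing or flagged."""
    bad = stats["missing"] + stats["flagged"]
    return code if bad / stats["n"] > max_bad_fraction else ""


# Made-up sample records for illustration
records = [
    ("2009-01-01", 1.0, ""),
    ("2009-01-01", 3.0, "Q"),
    ("2009-01-02", None, ""),
    ("2009-01-02", 4.0, ""),
]
summary = daily_means(records)
for day, stats in summary.items():
    print(day, stats, derived_flag(stats, max_bad_fraction=0.25))
```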

  17. Implementation Scenarios
      • End-to-End (logger-to-scientist)
        • Acquire raw data from logger or file system (standard or custom import filters)
        • Assign metadata from template or using forms to validate and flag data
        • Review data and fine-tune flag assignments
        • Generate distribution files & plots, archive data, index for searching
        • Desktop data management solution
      • Data Pre-processing
        • Acquire, validate and flag raw data (on demand or timed/triggered)
        • Upload processed data files (e.g. csv) or value & flag arrays to RDBMS
      • Workflow Step
        • Call toolbox functions as part of another workflow process or custom program
        • Kepler MATLAB actor?

  18. Suitability for Real-Time Sensor Data
      • Good Scalability
        • Data volumes only limited by computer memory (tested >2 GB data sets)
        • Multiple instances can be run on high-end, 64-bit, clustered workstations
        • Good flag evaluation performance in use, testing with diverse rule sets
      • Good scope for automation
        • Timed and triggered workflow implementations easy to deploy
        • Support for multiple I/O formats, transport protocols
          • Formats: ASCII, MATLAB, SQL, XML (partially implemented)
          • Transport: local file system, UNC paths, HTTP, FTP, SOAP
      • Already used for real-time GCE data, USGS data harvesting service (LTER HydroDB, CWT)

  19. Concluding Remarks
      • Benefits
        • Flexible, modular design
        • No qualifier vocabulary, semantics assumed – many purposes, standards
        • Many operations on flagged values – supports different strategies for archiving and distributing data at different processing levels
      • Limitations
        • Requires MATLAB
        • Rule syntax environment-specific – a more open standard would be ideal
        • Support for XML metadata immature (but more development planned)
      • More information and downloads at: http://gce-lter.marsci.uga.edu/public/im/tools/data_toolbox.htm
      This work was supported by the National Science Foundation under grant numbers OCE-9982133 and OCE-0620959.
