ameriflux scientific data server technology overview
Download
Skip this Video
Download Presentation
Ameriflux Scientific Data Server Technology Overview

Loading in 2 Seconds...

play fullscreen
1 / 49

Ameriflux Scientific Data Server Technology Overview - PowerPoint PPT Presentation


  • 80 Views
  • Uploaded on

Ameriflux Scientific Data Server Technology Overview. Catharine van Ingen, MSFT ([email protected]) 11 December 2006. Outline. Overview Database schema Data cube structure Data staging pipeline Work in progress and future plans. Overview. Ameriflux Collaboration Overview.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Ameriflux Scientific Data Server Technology Overview' - nau


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
outline
Outline
  • Overview
  • Database schema
  • Data cube structure
  • Data staging pipeline
  • Work in progress and future plans
ameriflux collaboration overview
Ameriflux Collaboration Overview

149 Sites across the Americas

Each site reports a minimum of 22 common measurements.

Communal science – each principle investigator acts independently to prepare and publish data.

Data published to and archived at Oak Ridge.

Total data reported to date on the order of 160M half-hourly measurements (includes 3 processing levels).

http://public.ornl.gov/ameriflux/

4

ameriflux scientific data server goals
Ameriflux Scientific Data Server - Goals
  • Act as a local repository for data and metadata assembled by a small group of scientists from a wide variety of sources
    • Simplify provenance by providing a common “safe deposit box” for assembled data
  • Interact simply with existing and emerging Internet portals for data and metadata download, and, over time, upload
    • Simplify data assembly by adding automation
    • Simplify name space confusion by adding explicit decode translation
  • Support basic analyses across the entire dataset for both data cleaning and science
    • Simplify mundane data handling tasks
    • Simplify quality checking and data selection by enabling data browsing
ameriflux scientific data server non goals
Ameriflux Scientific Data Server - Non-Goals
  • Replace the large Internet data source sites
    • The technology developed may be applicable, but the focus is on the group collaboration scale and usability
    • Very large datasets require different operational practices
  • Perform complex modeling and statistical analyses
    • There are a lot of existing tools with established trust based on long track records
    • Only part of a full LIMS (laboratory information management system)
  • Develop a new standard schema or controlled vocabulary
    • Other work on these is progressing independently
    • Due to the heterogeneity of the data, more than one such standard seems likely to be relevant
ameriflux scientific data server workflows
Large Data Archives

Local measurements

Ameriflux Scientific Data Server - Workflows
  • Staging: adding data or metadata
    • New downloaded or field measurements added
    • New derived measurements added
  • Editing: changing data or metadata
    • Existing older measurements re-calibrated or re-derived
    • Data cleaning or other algorithm changes
    • Gap filling
  • Sharing: making the latest acquired data available rapidly
    • Even before all the checks have been made
    • Browsing new data before more detailed analyses
  • Private Analysis: Supporting individual researchers (MyDB)
    • Stable location for personal calibrations, derivations, and other data transformations
    • Import/Export to analysis tools and models
  • Curating: data versioning and provenance
    • Simple parent:child versioning to track collections of data used for specific uses
databases sql server 2005
Databases: SQL Server 2005
  • All data, descriptive ancillary data and metadata held in relational databases
    • Access via SQL query
    • Metadata is important too but handled differently than data
  • While separate databases are shown, the datasets actually reside in a single database today
    • Mapping is transparent to the scientist due to dataset grouping and versioning
    • Separate databases used for performance
    • Unified databases used for simplicity
  • New ancillary data and data are staged with a temporary database
    • Minimal quality checks applied
    • All name and unit conversions
  • Data may be exported to flat file, copied to a private MyDb database, directly accessed programmatically via ODBC connection, or ?
  • Database may be replicated for availability and load balancing
data cubes sql server analysis services
Data cubes: SQL Server Analysis Services
  • A data cube is a database specifically for data mining (OLAP)
    • Simple aggregations (sum, min, max, count) can be pre-computed for speed
    • Additional calculations (median) can be computed dynamically
    • Both operate along dimensions such as time, site, or datumtype
    • Constructed from a relational database via Visual 2005 project
    • A specialized query language (MDX) is used
  • Client tool integration is evolving
    • Excel PivotTables allow simple data viewing
    • More powerful charting with Tableaux or ProClarity (commercial mining tools)
  • Management similar, but different to SQL Server
    • Most tools are shared, but actual usage differs due to functional and organizational differences
  • References:
    • Professional SQL Server Analysis Services 2005 with MDX (Programmer to Programmer) by Sivakumar Harinath and Stephen Quinn
    • MDX Solutions: With Microsoft SQL Server Analysis Services by George Spofford
simple plots sql server reporting services
Simple plots: SQL Server Reporting Services
  • Server-side rendering for simple tables and charts
    • Initially developed for static rendering of database reports such as quarterly sales
    • Report layout via Visual Studio 2005 project
    • Report description language is XML and stored in a relational database
    • Accessed via ASP.NET executing in IIS
  • Actual data source can be either relational database or data cube
  • Reports can be downloaded in various formats (eg Excel, jpg).
  • Not intended for large quantities of data (>50000 points) or highly customized graphics
  • References: SQL Books Online (online product documentation)
data import export sql server integration services
Data Import/Export: SQL Server Integration Services
  • Simple GUI-based tool generation for data migration, import, export
    • Good for exploration and/or rapid script development
    • Includes several format converters
    • Eliminates the need for much of the “glue” code
  • We’ve had mixed luck with for more complex operations such as custom selection for export to flat file and/or ingesting all data from the ORNL web site.
  • References:
    • Extending SSIS 2005 with Script by Donald Farmer or Professional SQL Server 2005
    • Integration Services (Programmer to Programmer) by Brian Knight et al
schema overview
Schema Overview

Subsets will be considered in turn

l2 data decode
L2 Data Decode

Discovered column headings are represented as:

[Datumtype] [repeat][_offset][_offset][extended datumtype][units]

  • Datumtype: the measurement primary datumtype.
    • Example: TA, PREC, or LE.
  • Repeat: an optional number indicating that multiple measurements were taken at the same site and offset.
    • Example: include TA2.
  • [_offset][_offset]: major and minor part of the z offset.
    • Example: SWC_10 (SWC at 10 cm) or TA_10_7 (TA at 10.7m).
  • Extended datumtype: any remaining column text.
    • Example: “fir”, “E”, “sfc”.
  • Units: measurement units (should only be one per datumtype)
    • Example: w/m2, or deg C.
l3 l4 european data decode
L3/L4 European Data Decode

European discovered column headings can be represented as:

[qualityprefix][Datumtype][count][extended datumtype][qualitypostfix]

  • QualityPrefix: “qf_” quality flags computed at level 3 processing
  • Datumtype: measurement primary datumtype, although other text can be used (such as G for soil heat flux)
  • Count: an optional number a site-specific offset depth
    • Replaces offset from previous column decode
    • We’ve asked this be changed; likely an artifact of the original ORNL column format
  • Extended datumtype: a datumtype-specific combination of additional modifying text OR deriviation algorithm identifier
  • QualityPrefix: “qc” quality flags computed at level 4 processing

Key issue: merging the quality data in columns with the correct data value column. Not all quality columns are 1:1 with data columns

data table
Data Table
  • Data values are stored in a “tall” normalized or atomized table
  • Only valid data values are retained
  • Each row contains exactly one value; all other columns are foreign keys to other tables
  • Data are stored as single precision floating point value
  • Horizontal partitioned tables are created for BigPlot
  • Various views created to for cube building and report generation
times repeat offset
Times, repeat, offset
  • Times table algorithmically generated on half-hour intervals
    • All times are site local
    • Considering adding UTC via view with site-specific conversions or additional column
  • Repeat table monitonically increases
  • Offset table generated by discovery of staged data
  • Views of offset and times used for cube and report menu generation
    • Multiple time represented to support hierarchies
    • Offsets ordered to simplify usage
datumtype and exdatumtype
Datumtype and Exdatumtype
  • Datumtype holds primary (not “other”) data datumtypes and all ancillary datumtypes (eg LAI)
  • Columns for cube building and per-datumtype handling
    • intoCube: determines whether datumtype flows to cube
    • dailyCalc: indicates whether daily value should be average or cumulative
    • repeatCalc: indicates whether repeats are averaged, ignored, or summed when building default cube
    • siteCalc: determines how numeric site ancillary data should flowed into cube. Eg LAI is max
  • Ordered views used for cube building and report generation
quality
Quality
  • Holds datumtype specific European quality indicators
  • Algorithmically generated given quality definitions
  • Intended to be extensible
    • qualityID indicates remaining column applicability
    • Additional columns can be added as new quality metric emerge
  • Flows into cube as a quality dimension
    • Will support selection and plotting by quality
    • Still being debugged
site ancillary data
Site Ancillary Data
  • Ancillary data holds site specific properties such as biome, climate, canopy height
    • May be text or numeric
  • Separate table necessary as multiple reported measurements exist due to different sources or changes over time
  • Source column tracks provenance
  • To flow into the cube, need an algorithm to select one and only one value
    • Leaf area index uses “max”
  • Over time, could use start/stop times to generate values and treat as data
site investigator
Site, Investigator
  • Site and investigator initially created by scrape of ORNL site
    • Now maintained by manual query
  • Spline connects the two and tracks the relevant start/stop time
  • Investigator could be removed with GUID identifier to alternate (membership) database
    • Deploying CasJobs or other workflow or authentication mechanism will necessitate this
datasets sitesets
Datasets, Sitesets
  • Datasets used to expose data versions such as NEE_or_MDS or September download
  • Sitesets used to build datasets and support incremental data staging or other changes to existing data
  • Original single level versioning scheme replaced by this two level scheme due to better mix/match ability with different data sources across datumtypes
decoding tables
Decoding Tables
  • The conversion from column header to decoded data is table driven
  • Sites and column headers are discovered, then decoded.
    • Allows for multiple different representations of the same site/datumtype.
  • Checksums used to suppress load if newly scraped data file has not changed
versioning sitesets and datasets
Versioning: Sitesets and Datasets
  • A siteset is a collection of similar data from a single site
    • May include actual observations, measurements derived from observations by known algorithm, derived from observations by private (investigator specific algorithm)
    • Derived from a single source – actual measurements, download from national data set, simulation or other
      • The Ameriflux site is only one such source
    • Sitesets are primarily internal bookkeeping
  • A dataset is a collection of sitesets
    • No restriction on sitesets in the collection, although some combinations won’t make scientific sense
    • Datasets are the external scientist-facing “version”

Reduces older data churn, maps to (likely) scientist data usage, avoids even more complex queries and views

server and researcher sets
Server and Researcher Sets
  • Sitesets and datasets may be maintained by the server or maintained by a researcher
  • Server Sets are maintained on behalf of all researchers in the collaboration
    • Accessible to all researchers
    • Set change policy decisions determined by the server (administrator) on behalf of the collaboration
  • Researcher Sets are maintained by a specific researcher or small subset group of researchers in the collaboration
    • Accessible only to owning researcher and/or collaborators selected by the owner
    • Set change decisions determined by the owner
fixed appendonly and dynamic sets
Fixed, AppendOnly, and Dynamic Sets
  • Fixed sets are used to define “known good” sets for long term data curation
    • Contains data of a known quality as of a specific point in time
    • Analogous to a software release
    • Once fixed, always fixed
    • A fixed dataset may only contain fixed sitesets
  • AppendOnly sets are used to define “latest good” sets for common accesses
    • Contains data of a known quality although additional data may be added at any time
    • May not be necessary, but offers a little protection against inadvertent staging side effects
    • An appendonly dataset may contain both fixed and append only sitesets
  • Dynamic sets are used to define “latest” sets for rapid data access, active analysis or cleaning
    • Contains data of a relatively well known quality.
    • No restriction on changes
    • A dynamic dataset may contain any siteset
dataset and siteset timestamps
Dataset and Siteset Timestamps
  • Significant changes to a dataset or siteset are tracked by timestamps
    • CreateTime: dataset or siteset creation
    • LastAppendTime: last addition of new siteset to a dataset or new data to a siteset (includes new derived measurements, new measurement types and recent data additions)
    • LastModifyTime: last deletion or substitution of a siteset member of a dataset or change to older previously existing data values of a siteset
    • FixTime: dataset or siteset converted to fixed
    • DeleteTime: all dataset or siteset data deleted
  • Notes:
    • Intended to be used as “something has changed, query for more details” indicators only
    • DeleteTime retained for some period to help the “oops” cases to help scientists determine what happened in inadvertent or forgotten scenarios
    • All timestamps are server local time
dataset and siteset provenance
Dataset and Siteset provenance
  • creatorid tracks investigator that created the dataset or siteset
    • includes any software automation tracked via investigator table entry
    • prepares for investigator private sets in MyDB/MyCube
  • name is (relatively) short text string for simple identification
    • Commonly used to select dataset when building a cube or other user interaction
  • description is longer text string for reference
  • Siteset howmade is a controlled vocabulary text string to track data origins
    • Text used to simplify data viewing.
    • Initial strings:
      • ORNLsite: non-gap filled data downloaded from ORNL website
      • ANN: gap-filled data using ANN method
      • gap: gap-filled data using unknown or unspecified method
  • Siteset path is the URL (if available) that was used when downloading data
    • NULL used when URL unavailable
    • Treated as a text string only, no sanity checking or format enforcement
siteset and dataset scenarios
Siteset and Dataset Scenarios
  • Sitesets and datasets are just mechanisms, we also need policies to determine the actual behavior
  • Usage cases of interest:
    • New staged measurements
    • Existing older measurements re-calibrated or re-derived
    • Data cleaning or other algorithm changes
    • New derived measurements added
    • Gap filling
    • Making the latest acquired data available rapidly
    • Supporting individual researchers (MyDB)
    • Data version provenance: parents and children
example data version provenance
Example: data version provenance
  • When making a new version, it’s good to track the original version
  • Full provenance is possible, but very easily leads to poor usability
  • Proposal: track siteset parents
    • A siteset has exactly one parent
    • A siteset may have any number of children due to gap filling, other algorithm change applied, or new data reported
    • When a new siteset is made by combining two different sitesets, the combination can be determined by annotation only
    • When a siteset is deleted, the youngest surviving ancestor become the parent to any surviving children
    • Relationships between datasets are more complex and should be determined by the constituent siteset relations
database schema learnings
Database schema learnings
  • The tall table has both advantages and disadvantages
    • Best for building a cube and most extensible as new datumtypes or exdatumtypes et al are encountered
    • Queries are fairly complex to write and join multiple tables to specify siteset, site, datumtype, exdatumtype, offset, and repeat values
    • Views can simplify the query, but not the performance
    • Materialized views and pivot tables too restricted for practical application here
    • Build infrastructure for machine schemas and use the right one for the right job
  • 100+M rows in any table takes care
    • Clustered unique index to improve performance of common queries
    • Updates affecting many rows can fill the log; allocate 2-5x storage
  • Full backups are far simpler operationally and footnetting often a good alternative
sample cube construction
Sample Cube Construction

Visual Studio UI for construction, deployment, and browsing

datacube structure
Datacube structure
  • Structure follows tall table organization
    • Each dimension corresponds to a foreign key
    • Time has multiple hierarchies
    • Dataset replaces siteset as it is the scientist-facing concept
  • Accumulating computed measures as we discover them
    • Median, standard deviation, variance can be expensive
    • HasDailyData, HasYearlyData, DailyValue, Average cheap and handy
  • Development uses smaller subset cubes
    • Harvard Forest for L2 data and SeptemberORNL development
    • VairaWillow for L2/L3/L4 data and NovemberORNL development
    • Spreadsheets for Q/A checking
minimalist mdx query
Minimalist MDX Query

SELECT

NON EMPTY { [Measures].[Average]} ON COLUMNS,

NON EMPTY {

[Timeline].[Year To Day].[Yydoy].&[2002-001]:

[Timeline].[Year To Day].[Yydoy].&[2002-365] }

ON ROWS

FROM [NovemberORNL]

WHERE( [Datumtype].[Datumtype].&[42],[Site].[Site].&[29],

[Exdatumtype].[Exdatumtype].&[1], [Dataset].[Dataset].&[6],

[Offset].[Offset].&[0 (cm)])

Retrieves daily average GPP values for Tonzi Ranch for 2002

report generation cube query
Report Generation Cube Query

SELECT NON EMPTY { [Measures].[Average] } ON COLUMNS,

NON EMPTY { ([Site].[Site].[Site].ALLMEMBERS * [Datumtype].[Datumtype].[Datumtype].ALLMEMBERS * [Timeline].[Year To Month].[Yymmdd].ALLMEMBERS ) }

DIMENSION PROPERTIES MEMBER_CAPTION, MEMBER_UNIQUE_NAME ON ROWS FROM (

SELECT ( { [Offset].[Offset].&[0 (cm)] } ) ON COLUMNS FROM (

SELECT ( { [Exdatumtype].[Exdatumtype].&[1] } ) ON COLUMNS FROM (

SELECT ( STRTOSET(@TimelineYear, CONSTRAINED) ) ON COLUMNS FROM ( SELECT ( STRTOSET(@SiteSite, CONSTRAINED) ) ON COLUMNS FROM (

SELECT ( STRTOSET(@DatumtypeDatumtype, CONSTRAINED) ) ON COLUMNS FROM [SeptemberORNL])))))

WHERE ( IIF( STRTOSET(@TimelineYear, CONSTRAINED).Count = 1, STRTOSET(@TimelineYear, CONSTRAINED), [Timeline].[Year].currentmember ), [Exdatumtype].[Exdatumtype].&[1], [Offset].[Offset].&[0 (cm)] )

Query used for report generation;

STRTOSET in the above retrieves report parameters

Still retrieves daily average GPP values for Tonzi Ranch for 2002

excel cube query
Excel Cube Query

SELECT NON EMPTY HIERARCHIZE(Except({AddCalculatedMembers(Except({AddCalculatedMembers(DrillDownLevel({[Datumtype].[Datumtype].[All]}))}, {[Datumtype].[Datumtype].&[56], [Datumtype].[Datumtype].&[40], [Datumtype].[Datumtype].&[39], [Datumtype].[Datumtype].&[38], [Datumtype].[Datumtype].&[37], [Datumtype].[Datumtype].&[36], [Datumtype].[Datumtype].&[33], [Datumtype].[Datumtype].&[32], [Datumtype].[Datumtype].&[31], [Datumtype].[Datumtype].&[59], [Datumtype].[Datumtype].&[55], [Datumtype].[Datumtype].&[29], [Datumtype].[Datumtype].&[58], [Datumtype].[Datumtype].&[57], [Datumtype].[Datumtype].&[26], [Datumtype].[Datumtype].&[25], [Datumtype].[Datumtype].&[24], [Datumtype].[Datumtype].&[23], [Datumtype].[Datumtype].&[22], [Datumtype].[Datumtype].&[54], [Datumtype].[Datumtype].&[21], [Datumtype].[Datumtype].&[20], [Datumtype].[Datumtype].&[19], [Datumtype].[Datumtype].&[18], [Datumtype].[Datumtype].&[17], [Datumtype].[Datumtype].&[13], [Datumtype].[Datumtype].&[60], [Datumtype].[Datumtype].&[11], [Datumtype].[Datumtype].&[9], [Datumtype].[Datumtype].&[7], [Datumtype].[Datumtype].&[6], [Datumtype].[Datumtype].&[5], [Datumtype].[Datumtype].&[2], [Datumtype].[Datumtype].&[1]}))}, {[Datumtype].[Datumtype].&[56], [Datumtype].[Datumtype].&[40], [Datumtype].[Datumtype].&[39], [Datumtype].[Datumtype].&[38], [Datumtype].[Datumtype].&[37], [Datumtype].[Datumtype].&[36], [Datumtype].[Datumtype].&[33], [Datumtype].[Datumtype].&[32], [Datumtype].[Datumtype].&[31], [Datumtype].[Datumtype].&[59], [Datumtype].[Datumtype].&[55], [Datumtype].[Datumtype].&[29], [Datumtype].[Datumtype].&[58], [Datumtype].[Datumtype].&[57], [Datumtype].[Datumtype].&[26], [Datumtype].[Datumtype].&[25], [Datumtype].[Datumtype].&[24], [Datumtype].[Datumtype].&[23], [Datumtype].[Datumtype].&[22], [Datumtype].[Datumtype].&[54], [Datumtype].[Datumtype].&[21], [Datumtype].[Datumtype].&[20], [Datumtype].[Datumtype].&[19], [Datumtype].[Datumtype].&[18], [Datumtype].[Datumtype].&[17], [Datumtype].[Datumtype].&[13], [Datumtype].[Datumtype].&[60], [Datumtype].[Datumtype].&[11], [Datumtype].[Datumtype].&[9], [Datumtype].[Datumtype].&[7], [Datumtype].[Datumtype].&[6], [Datumtype].[Datumtype].&[5], [Datumtype].[Datumtype].&[2], [Datumtype].[Datumtype].&[1]})) DIMENSION PROPERTIES PARENT_UNIQUE_NAME ON COLUMNS , NON EMPTY HIERARCHIZE(Except({AddCalculatedMembers(Except({AddCalculatedMembers(DrillDownMember({DrillDownLevel({[Timeline].[Year To Day].[All]})}, {[Timeline].[Year To Day].[Year].&[2004]}))}, {[Timeline].[Year To Day].[Year].&[2006], [Timeline].[Year To Day].[Year].&[2005], [Timeline].[Year To Day].[Year].&[2003], [Timeline].[Year To Day].[Year].&[2002], [Timeline].[Year To Day].[Year].&[2001], [Timeline].[Year To Day].[Year].&[2000], [Timeline].[Year To Day].[Year].&[1999], [Timeline].[Year To Day].[Year].&[1998], [Timeline].[Year To Day].[Year].&[1997], [Timeline].[Year To Day].[Year].&[1996], [Timeline].[Year To Day].[Year].&[1995], [Timeline].[Year To Day].[Year].&[1994], [Timeline].[Year To Day].[Year].&[1993], [Timeline].[Year To Day].[Year].&[1992], [Timeline].[Year To Day].[Year].&[1991], [Timeline].[Year To Day].[Year].&[1990]}))}, {[Timeline].[Year To Day].[Year].&[2006], [Timeline].[Year To Day].[Year].&[2005], [Timeline].[Year To Day].[Year].&[2003], [Timeline].[Year To Day].[Year].&[2002], [Timeline].[Year To Day].[Year].&[2001], [Timeline].[Year To Day].[Year].&[2000], [Timeline].[Year To Day].[Year].&[1999], [Timeline].[Year To Day].[Year].&[1998], [Timeline].[Year To Day].[Year].&[1997], [Timeline].[Year To Day].[Year].&[1996], [Timeline].[Year To Day].[Year].&[1995], [Timeline].[Year To Day].[Year].&[1994], [Timeline].[Year To Day].[Year].&[1993], [Timeline].[Year To Day].[Year].&[1992], [Timeline].[Year To Day].[Year].&[1991], [Timeline].[Year To Day].[Year].&[1990]})) DIMENSION PROPERTIES PARENT_UNIQUE_NAME ON ROWS FROM [NovemberORNL] WHERE ([Measures].[Average], [Dataset].[Dataset].&[6], [Site].[Site].&[29], [Offset].[Offset].&[0 (cm)], [Exdatumtype].[Exdatumtype].&[1])

Query as generated by Excel Pviot Table

Still retrieves daily average GPP values for Tonzi Ranch for 2002

datacube learnings
Datacube learnings
  • It’s fairly simple to build a cube
    • We’ve built lots
    • The GUI tool is straight forward and the tutorials help
  • It’s not so simple to build a good cube
    • Incorrectly specified hierarchies can give either incorrect answers or catastrophically poor performance or both
    • Ordering all dimensions for good usability
    • Build times can be hours and/or timeout
    • Aggregations are designed in and take extra configuration
    • Once aggregations built, still need to watch performance trace to ensure that they’re used
  • Data cubes best for daily or larger time periods. Use the database for half-hour (leaf) measurements.
  • It’s easy to write a very slow MDX query
    • Not Empty very important
  • It’s really easy to write an MDX query that is hard to understand
  • Knowing more about what queries are of interest to the scientists will help us tune the existing cube and/or build targeted cubes
data staging pipeline1
Data Staging Pipeline
  • Data can be downloaded from internet sites regularly
    • Sometimes the only way to detect changed data is to compare with the data already archived
    • The download is relatively cheap, the subsequent staging is expensive
  • New or changed data discovered during staging
    • Simple checksum before load
    • Chunk checksum after decode
    • Comparison query if requested
  • Decode stage critical to handle the uncontrolled vocabularies
    • Measurement type, location offset, quality indicators, units, derivation methods often encoded in column headers
    • Different sites use different units
  • Incremental copy moves staged data to one or more sitesets
    • Automated via siteset:site:source mapping
updates to basic data checks
Updates to Basic Data Checks
  • Applied checks will be constantly increasing
    • Waiting for descriptions of current ORNL checks
  • Investigations pending:
    • Per-site, per datumtype limits (eg Summer is hotter Florida than in Alaska)
    • Systematic time errors (eg GMT rather than local time)
    • Time dependent checks for datumtypes max, min, zero at night or day day
    • Unit conversions or other simple conversions (eg relative humidity)
  • Add a new table to log results of all checks
    • Enables standardized reporting longer term
    • Can be copied to active database as part of provenance metadata for user access
    • Schema TBD
copying only changed data
Copying Only Changed Data
  • Proposed approach: throw processing at the problem
  • File checksum can be used for older, retired sites or slowly reporting sites
    • Only addresses the “no new or changed data” case
    • Recommend as a filter to suppress load rather than suppress copy.
  • Time (yearly?) chunked count and checksum can be used for older, unlikely to change data
    • Reduces the total data that must be explicitly compared
    • Recommended given the likely query time for full comparison
  • Exhaustive timestep-by-timestep comparison between data in archive database and staging database used to determine the amount of change
    • Limit amount of returned data to bound query time
    • Compare only makes sense with“like” data (same processing and gap fill method)
  • Desired behavior for each site:
    • If the target siteset is fixed, do nothing
    • If the target siteset is append only and the data are not changed, append only the new data
    • If the target siteset is flexible, append new data and replace any changed data
    • Otherwise, create a new siteset and copy all data
l3 l4 european quality and gap filled data
L3/L4 European Quality and Gap-Filled Data
  • Additional column headings to denote quality flags, derived quantity algorithm and gap fill algorithm
    • Example: Rg_f, Rg_fqc, qf_NEE_st, NEE_st_fNDS, NEE_st_fMDSqc
      • _f indicates fill
      • qc indicates quality
      • st, or, MDS, and ANN indicate algorithms
  • Approach: load, decode, convert, delete
    • Loading unchanged: quality flags treated as data values.
    • Decoding upgraded: additional column headers decoded and flag/data indicator added
    • Convert: builds quality flag row and connect with the corresponding measurement
    • Delete: removes quality flag rows from the data table prior to data copy from the staging database
work in progress improving current capabilities
Work in progress: improving current capabilities
  • Debug and deploy quality flags on L3/L4 data
  • Implement incremental data staging to enable speedy and simple data editing by an actual scientist (rather than a programmer)
  • Implement expanded metadata handling to enable scientist to add site characteristics and sort sites on those expanded definitions
  • Investigating integration with ArcGIS to get spatial data analyses
  • Investigating MatLab, R2, and SPlus integration
    • Direct SQL queries available from MatLab and R2
    • Given the query complexity, should we / can we build a wizard?
  • Get BigPlot working again and invest in usability
    • Last summer’s effort was really research into feasiblity
future plans transitioning from prototype
Future Plans: transitioning from prototype
  • Add data download
    • Current tablular report intended as stopgap/technology demonstration only
    • Likely starting point: adapting Euroflux collaboration download to ftp zip file given the good usability and compression
    • Alternate would be to write CSV file converter via SSIS
  • CasJobs and MyDb to support individual scientist analysis including data editing for correction, recalibration, etc
  • Implement expanded metadata handling to enable scientists to add site characteristics and sort sites on those expanded definitions
  • Automate MyCube construction to pair with MyDb
    • Internal Microsoft experience with real life data suggests this is possible and may even be fairly straight forward for cubes with with common dimensions.
  • Review current deployment for true DMZ deployment
longer term futures
Longer Term Futures
  • Handling imagery and other remote sensing information
    • Curating images is different from curating time series data
    • Using both together enables new science and new insights
    • Graphical selection and display of data
  • Support for user specified calculations within the database
    • We’ve done LE_Pot, but there are clearly others
  • Support for direct connections to analysis and statistical packages to cube as well as database
    • Leverage stored procedure (SQL) and nice wizards (MDX/SQL) to simplify query generation
  • Linkage with models
    • Additional (emerging) data standards such as NetCDF
    • Handling “just in time” data delivery and model result curation
    • Need for workflow and single sign on credentials
  • Data mining subscription services
  • Handling of a broader array of data types
ad