Explore the importance of provenance in data management, focusing on practice and interoperability rather than standards. Discover how provenance benefits various industries and scientific research. Discuss solutions, differentiations, and hot topics in the field.
6th e-Infrastructure Concertation, Lyon, 24 Nov 2008 • DATA TRACK: “Provenance” • Chair: Krystyna Marek • Rapporteur: Wolfram Horstmann
Motivation • Last two meetings were on standards • It was proposed to have a more focussed discussion • Focus on practice and interoperability rather than standards • Select an arbitrary but important topic
Notions of Provenance • Where do data objects* originate? • Scientific Work -- examples • Instrumentation techniques • Manufacturers of hardware and software • Methodologies • Processes, e.g. gene sequencing • Technical/Local -- examples • (web)-identifiers • Database, repository name * Primary data, documents, metadata …
Why Provenance? • Quoting / Citing / Referencing as a global scientific principle • “Reproducible research” • Giving credit to authors / creators in distributed environments • Original location / context has to be known • Experienced in Grid-Environments [1]
Provenance & Interoperability • Re-Use / Sharing: “Addressing/Accessing” • Common view, common use • Unidirectional: No change of data objects! • Federation: “Discovering in Context” • Remote representation of distributed DOs • Aggregation: “Contextualizing” • Add unchanged object in a context • Processing/Annotation: “Changing” • Uni- vs. Bidirectional: Change of DOs and remote representation vs. back-storage (e.g. CVS)
IVOA • Astronomy area: Repositories use OAI-PMH to provide general • Provenance as a kind of metadata • “Observation data model” • History of data (process “lineage”) • Processing • Configuration: telescope, camera • Ambient conditions: temperature etc. • Versioning is included (also algorithms etc.)
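The elements listed on this slide (lineage, configuration, ambient conditions, versioning) can be pictured as one record per data object. The following is only a minimal sketch under stated assumptions: the class and field names (`ObservationProvenance`, `ProcessingStep`, `derive`, and the sample values) are hypothetical illustrations and are not taken from the IVOA observation data model.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class ProcessingStep:
    """One entry in the history ("lineage") of a data object."""
    name: str                # hypothetical step label
    algorithm_version: str   # algorithms are versioned as well


@dataclass
class ObservationProvenance:
    """Hypothetical provenance record attached to one data object."""
    dataset_id: str
    version: int
    configuration: Dict[str, str]   # e.g. telescope, camera
    ambient: Dict[str, float]       # e.g. temperature
    lineage: List[ProcessingStep] = field(default_factory=list)

    def derive(self, step: ProcessingStep) -> "ObservationProvenance":
        """Return the record for a new version; the original stays unchanged."""
        return ObservationProvenance(
            dataset_id=self.dataset_id,
            version=self.version + 1,
            configuration=dict(self.configuration),
            ambient=dict(self.ambient),
            lineage=self.lineage + [step],
        )


# Illustrative values only.
raw = ObservationProvenance(
    dataset_id="obs-001",
    version=1,
    configuration={"telescope": "T1", "camera": "C1"},
    ambient={"temperature_c": -5.0},
)
calibrated = raw.derive(ProcessingStep("calibration", "1.2"))
```

Deriving a new record instead of mutating the old one keeps every earlier version and its lineage available, which is one simple way to read "versioning is included".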
MetaFor • Data from numerical models • Descriptive information from model • Models are often transformed • Database / Registry for models in distributed repositories
D4Science • Framework for • More than simple import framework • Graphs representing provenance information • Thematic: fishing site / statistic /
DRIVER • Focus on document repositories • Some 100 … • Simple Provenance • OAI-PMH • Further (2nd order) Provenance • OAI-PMH (“about”): repository identifiers • Enhanced Publications >> OAI-ORE • Semantic Model (named graphs) representing packages of documents and data objects
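For the second-order case, an OAI-PMH record can carry an `about` container that states where the record was harvested from. The sketch below reads such a container with Python's standard library. It assumes the element names of the OAI provenance schema (`provenance`, `originDescription`, `baseURL`, `identifier`, `datestamp`, `metadataNamespace`); the sample record, its URLs, and the helper name `origin_chain` are invented for illustration, and whether DRIVER repositories expose exactly this structure is an assumption.

```python
import xml.etree.ElementTree as ET

# Namespace of the OAI provenance schema (assumed here, see lead-in).
PROV_NS = "http://www.openarchives.org/OAI/2.0/provenance"

# Invented sample record for illustration only.
SAMPLE_ABOUT = """\
<about xmlns="http://www.openarchives.org/OAI/2.0/">
  <provenance xmlns="http://www.openarchives.org/OAI/2.0/provenance">
    <originDescription harvestDate="2008-11-01T10:00:00Z" altered="false">
      <baseURL>http://repository.example.org/oai</baseURL>
      <identifier>oai:repository.example.org:1234</identifier>
      <datestamp>2008-10-15</datestamp>
      <metadataNamespace>http://www.openarchives.org/OAI/2.0/oai_dc/</metadataNamespace>
    </originDescription>
  </provenance>
</about>
"""


def origin_chain(about_xml):
    """Return the origin descriptions, most recent harvest first."""
    ns = {"p": PROV_NS}
    root = ET.fromstring(about_xml)
    chain = []
    node = root.find(".//p:provenance/p:originDescription", ns)
    while node is not None:
        chain.append({
            "baseURL": node.findtext("p:baseURL", namespaces=ns),
            "identifier": node.findtext("p:identifier", namespaces=ns),
            "datestamp": node.findtext("p:datestamp", namespaces=ns),
            "harvestDate": node.get("harvestDate"),
            "altered": node.get("altered"),
        })
        # An origin description may nest an earlier one (re-harvested records).
        node = node.find("p:originDescription", ns)
    return chain


for origin in origin_chain(SAMPLE_ABOUT):
    print(origin["baseURL"], origin["identifier"])
```

The `identifier` and `baseURL` values are what an aggregator would use to resolve a record back to its source repository.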
Solutions • Provenance • Registries for curator, publisher etc. • Resolving over registry • Diversity of approaches • CIDOC-CRM, OPM, EuroStats, • Languages: RDF / OAI-ORE
Differentiations • Expertise from Data-Centers as opposed to Data-Providers • Infrastructures should provide functions to add provenance information (but do not) • e.g. EGEE provides an additional module for recording provenance data
Hot topics • Propagating provenance: versioning • Disambiguation / Deduplication • distinct copies of identical objects • Who provides the data? • Each processing step should provide at least some metadata
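Two of these points lend themselves to a small illustration: recording at least some metadata for each processing step, and recognising identical objects that turn up in different places. The sketch below is one possible approach and not a method proposed in the session. It assumes data objects are available as bytes and uses a SHA-256 content hash as the comparison key; the names `content_id` and `run_step` are hypothetical.

```python
import hashlib
from datetime import datetime, timezone


def content_id(data: bytes) -> str:
    """Identify an object by its content, so identical copies compare equal."""
    return hashlib.sha256(data).hexdigest()


def run_step(name, func, data: bytes, log: list) -> bytes:
    """Apply one processing step and append a minimal provenance entry."""
    result = func(data)
    log.append({
        "step": name,
        "input": content_id(data),
        "output": content_id(result),
        "at": datetime.now(timezone.utc).isoformat(),
    })
    return result


# Illustrative use: two steps chained, each leaving a log entry.
log = []
stage1 = run_step("strip", bytes.strip, b"  acgt  ", log)
stage2 = run_step("uppercase", bytes.upper, stage1, log)
```

Because each entry stores the hash of its input and output, the output hash of one step matches the input hash of the next, which lets the log be read as a chain from the final object back to the original.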
Recommendations for Infrastructure • Standards for Provenance: Nonexistent? • Each processing step should provide at least some metadata • Look deeper into specific implementations in subject communities • Technical point-to-point organisation • Bilateral • Scheduling a meeting • 24/25th ESA: earth science meeting?