
Emergent Semantics: Towards Self-Organizing Scientific Metadata

Emergent Semantics: Towards Self-Organizing Scientific Metadata. Bill Howe, David Maier, Oregon Health and Science University.


Presentation Transcript


  1. Emergent Semantics: Towards Self-Organizing Scientific Metadata Bill Howe, David Maier Oregon Health and Science University

  2. “The file ‘anim-sal_estuary_7.gif’ is a data product derived from the output of the ELCIRC simulation program run for the period January 8-15, 2002. The image shows salinity (practical salinity units) in the estuary region of the domain. It’s actually an animation, where each frame is a horizontal slice 7 meters below the mean sea level. There are 96 frames, each representing 15 minutes.”
     program = ELCIRC
     simStart = 1/8/02
     simEnd = 1/15/02
     region = estuary
     variable = salinity
     timesteps = 96
     plottype = animation

  3. Environmental Observation and Forecasting System
     • Daily forecasts and 1000s of ad hoc hindcasts
     • One simulation involves ~20k files:
       • inputs, parameters, outputs, derived data products
     • This scale mandates:
       • query access rather than simple filesystem browsing
       • automation everywhere

  4. Tasks
     • Collect metadata.
     • Organize collected metadata.
     • Publish organized metadata for querying.

  5. Challenges
     • Metadata is scattered
       • in file paths
       • within file headers
       • in “nearby” files
     • Metadata requirements change frequently
       • new simulation codes
       • new data product types
       • new users, internal and external
     Example: the path …/anim-sal_estuary_7.gif encodes Type = “Animation”, Variable = “Salinity”, Region = “Estuary”, Depth = “7”
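The path-encoded descriptors in the example above could be harvested by a small script. This is a minimal sketch, assuming a hypothetical naming convention `<type>-<variable>_<region>_<depth>.<ext>` inferred from this single filename; the abbreviation tables and the function name `descriptors_from_path` are illustrative assumptions, not part of the original system.

```python
import os
import re

# Hypothetical lookup tables that expand abbreviations seen in the example filename.
TYPE_NAMES = {"anim": "Animation"}
VARIABLE_NAMES = {"sal": "Salinity"}

# Assumed pattern: <type>-<variable>_<region>_<depth>.<extension>
FILENAME_PATTERN = re.compile(
    r"^(?P<type>[a-z]+)-(?P<variable>[a-z]+)_(?P<region>[a-z]+)_(?P<depth>\d+)\.\w+$"
)


def descriptors_from_path(path):
    """Return the descriptors encoded in a file name, or {} if the name does not match."""
    name = os.path.basename(path)
    match = FILENAME_PATTERN.match(name)
    if match is None:
        return {}
    return {
        "Type": TYPE_NAMES.get(match["type"], match["type"]),
        "Variable": VARIABLE_NAMES.get(match["variable"], match["variable"]),
        "Region": match["region"].capitalize(),
        "Depth": match["depth"],
    }


example = descriptors_from_path("anim-sal_estuary_7.gif")
print(example)
```

Names that do not follow the assumed convention yield an empty dictionary, so a script like this can be run over a whole directory tree without failing on unrelated files.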

  6. “Obvious” Solution
     • Data managers work with domain experts
       • design a relational schema, load data, test, repeat
     • But:
       • Large up-front cost of DB design
       • Slow return on investment
       • Use cases unknown
       • Significant change is anticipated
       • DB languages/APIs not necessarily within scientists’ skill set
     [Diagram labels: file, data product, region]

  7. Alternative Solution: Steps 1-3
     • Harvest metadata via simple collection scripts written by the domain experts
     • Use RDF as a schema-independent metadata representation
     • Use RDBMS technology for storage and management
     [Diagram: filesystem; 1. collection scripts; 2. rdf; 3. db]
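Steps 1-3 can be sketched as a collection script that asserts triples into a single generic table. This is a rough illustration only: the table name `statements` and its columns `subject`, `property`, `object` follow the query shown on slide 10, while the use of SQLite and the helper names `create_store` and `assert_triple` are assumptions made for this example.

```python
import sqlite3


def create_store():
    """Create an in-memory database holding one generic triple table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE statements (subject TEXT, property TEXT, object TEXT)"
    )
    return conn


def assert_triple(conn, subject, prop, obj):
    """Record one (subject, property, object) statement."""
    conn.execute(
        "INSERT INTO statements (subject, property, object) VALUES (?, ?, ?)",
        (subject, prop, obj),
    )


conn = create_store()
fname = "anim-sal_estuary_7.gif"
# Descriptor values taken from the example on slide 2.
for prop, obj in [
    ("property:program", "ELCIRC"),
    ("property:region", "estuary"),
    ("property:variable", "salinity"),
    ("property:plottype", "animation"),
    ("property:timesteps", "96"),
]:
    assert_triple(conn, fname, prop, obj)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM statements").fetchone()[0]
print(count)
```

Because every descriptor lands in the same three columns, adding a new property requires no schema change, only another call in the collection script.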

  8. A Narrower Interface
     [Diagram: two interfaces between the filesystem and the database. Rich schema: SQL statements, database APIs, load strategies, data formats/models. Generic schema: collection scripts, RDF triples.]

  9. Generic RDF Schema

  10. Is Generic RDF Good Enough?
      “Find files with region, plottype, and variable descriptors”
        SELECT r.subject AS file, r.object AS region,
               p.object AS plottype, v.object AS variable
        FROM statements r, statements p, statements v
        WHERE r.subject = p.subject
          AND p.subject = v.subject
          AND r.property = 'property:region'
          AND p.property = 'property:plottype'
          AND v.property = 'property:variable'
      A three-way self-join!
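To see the query in action, the sketch below runs it against a small in-memory table. SQLite is an assumption here (the slides do not name the RDBMS), and the second file, `other.dat`, is a hypothetical example added to show that files lacking one of the three descriptors are excluded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE statements (subject TEXT, property TEXT, object TEXT)")

rows = [
    ("anim-sal_estuary_7.gif", "property:region", "estuary"),
    ("anim-sal_estuary_7.gif", "property:plottype", "animation"),
    ("anim-sal_estuary_7.gif", "property:variable", "salinity"),
    # Hypothetical file with no plottype descriptor: the query should skip it.
    ("other.dat", "property:region", "estuary"),
    ("other.dat", "property:variable", "salinity"),
]
conn.executemany("INSERT INTO statements VALUES (?, ?, ?)", rows)

# The query from the slide: one alias of the table per requested descriptor.
QUERY = """
SELECT r.subject AS file, r.object AS region,
       p.object AS plottype, v.object AS variable
FROM statements r, statements p, statements v
WHERE r.subject = p.subject
  AND p.subject = v.subject
  AND r.property = 'property:region'
  AND p.property = 'property:plottype'
  AND v.property = 'property:variable'
"""

results = conn.execute(QUERY).fetchall()
print(results)
```

Each additional descriptor in the question adds another alias of `statements` to the FROM clause, which is the formulation cost the slide points at.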

  11. Decomposed Data
      • So we can query the RDF directly, but there are no grouping structures to aid query formulation and processing.
      • Automatically infer groupings from the RDF data, observing that related files often share signatures.
      • Let users impose groupings using a web interface (like views)
      Example triples in the db: ... <isofar.gif, type, isoline>, <isofar.gif, region, far>, <animsal.gif, timesteps, 10>, <animsal.gif, var, salt>, ...
      [Diagram labels: filesystem, db, plot, animation]

  12. Alternative Solution: Steps 4-6
      • Partition descriptors into equivalence classes based on file signatures
      • Expose signatures via the web to facilitate browsing and querying
      • Recompute signature extents as new metadata is integrated
      [Diagram: db; 4. partition data; 5. publish to the web; website; 6. query and browse via profiles]

  13. The set of properties defined for a particular file

  14. Signatures
      • A file’s signature is just the set of properties used to describe it.
      • If signatures were fixed, we might derive a relational schema from them. Instead, we need to respond to changes.
      [Diagram: db; 4. partition data: find signatures, compute signature extents]
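Finding signatures and computing their extents can be sketched in a few lines. The first four triples come from slide 11; the third file, `animtemp.gif`, and the function names `compute_signatures` and `compute_extents` are hypothetical, added only so that one signature has more than one member.

```python
from collections import defaultdict

triples = [
    ("isofar.gif", "type", "isoline"),
    ("isofar.gif", "region", "far"),
    ("animsal.gif", "timesteps", "10"),
    ("animsal.gif", "var", "salt"),
    # Hypothetical extra file sharing the same properties as animsal.gif.
    ("animtemp.gif", "timesteps", "10"),
    ("animtemp.gif", "var", "temp"),
]


def compute_signatures(triples):
    """Map each file to its signature: the set of properties used to describe it."""
    props = defaultdict(set)
    for subject, prop, _obj in triples:
        props[subject].add(prop)
    return {subject: frozenset(p) for subject, p in props.items()}


def compute_extents(signatures):
    """Map each signature to its extent: the set of files that carry it."""
    extents = defaultdict(set)
    for subject, signature in signatures.items():
        extents[signature].add(subject)
    return dict(extents)


signatures = compute_signatures(triples)
extents = compute_extents(signatures)
for signature, files in extents.items():
    print(sorted(signature), sorted(files))
```

Since both functions depend only on the current triples, rerunning them after each load is enough to keep the groupings in step with changing metadata.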

  15. Example: Consolidate Files with Similar Signatures (DM = data manager, DE = domain expert)
      • Modify schema (DM)
      • Transfer tuples from A to B (DM)
      • Modify collection programs
        • Modify extraction routines (DE)
        • Modify internal organization (DE)
      • Modify SQL statements (DM)

  16. Alternative
      • Change two lines in a collection script (DE)
          Assert(fileA, "animation", "")
          Assert(fileA, "plottype", "animation")
          Assert(fileB, "plottype", "animation")
      • Reload data (Automatic)
      • Recompute signatures (Automatic)
      • Republish data (Automatic)
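The effect of the script change on the derived groupings can be illustrated as follows. This sketch assumes that the first `Assert` line is the old form being replaced by the second, which the transcript does not state explicitly; the function name `signature_extents` is also hypothetical.

```python
def signature_extents(triples):
    """Group files by the set of properties used to describe them."""
    props = {}
    for subject, prop, _obj in triples:
        props.setdefault(subject, set()).add(prop)
    extents = {}
    for subject, signature in props.items():
        extents.setdefault(frozenset(signature), set()).add(subject)
    return extents


# Assumed state before the edit: fileA uses "animation" as a property name.
before = [
    ("fileA", "animation", ""),
    ("fileB", "plottype", "animation"),
]

# Assumed state after the edit: both files use the "plottype" property.
after = [
    ("fileA", "plottype", "animation"),
    ("fileB", "plottype", "animation"),
]

print(len(signature_extents(before)))  # two separate signatures
print(len(signature_extents(after)))   # one shared signature
```

Under this reading, the two files fall into a single group after the reload with no schema migration or tuple transfer, because the grouping is recomputed from the triples themselves.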

  17. Benefits
      • Narrow interface between data creators and data managers
      • Metadata exploitable prior to finalizing a thorough schema
      • Derived schema can adapt to changing requirements automatically
      • Profiles constitute emergent semantics: meaning is assigned after data is collected.
