
Computing Provenance and Annotations of Derived Data

Computing Provenance and Annotations of Derived Data. Wang-Chiew Tan, UC Santa Cruz. Provenance of data: when you see some data on the Web, do you know where it came from and why it is there?


Presentation Transcript


  1. Computing Provenance and Annotations of Derived Data. Wang-Chiew Tan, UC Santa Cruz

  2. Provenance of data
     • When you see some data on the Web, do you know
        • where it came from?
        • why it is there?
     • This information (provenance) is typically lost in the process of copying/transcribing/transforming databases
     • Loss of provenance is an acute problem in some scientific databases

  3. Complex interdependencies (Example from scientific databases)
     • Various problems:
        • Trace provenance of data
        • Propagate annotations
     [Figure: flow of data among GERD, TRRD, EpoDB, BEAD, Swissprot, GAIA, EMBL, GenBank, Transfac, DDBJ]

  4. Two kinds of provenance
     • Why? (Why-provenance)
     • Where? (Where-provenance)

     NYRestaurants (Source table)
       Restaurant          Cost  Type      Zip
       Peacock Alley       $$$   French    10022
       Bull & Bear         $$$   Seafood   10022
       Pacifica            $     Chinese   10013
       Soho Kitchen & Bar  $     American  10022

     NYHotels (Source table)
       Hotel            Zip    Rating
       Waldorf Astoria  10022  4.5
       Holiday Inn DT   10013  4.0

     View (JOIN, PROJECT)
       Restaurant          Hotel            Rating  Cost
       Peacock Alley       Waldorf Astoria  4.5     $$$
       Bull & Bear         Waldorf Astoria  4.5     $$$
       Soho Kitchen & Bar  Waldorf Astoria  4.5     $
       Pacifica            Holiday Inn DT   4.0     $
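
The two kinds of provenance on this slide can be made concrete with a minimal Python sketch. It assumes the view is a join on Zip followed by a projection, and it treats why-provenance as the set of source tuples that witness a view row and where-provenance as the source location each value was copied from. The tuple ids (`r1`, `h1`, ...) and the function name are hypothetical, introduced only for illustration.

```python
# Source tables from the slide, keyed by hypothetical tuple ids (r1.., h1..).
NY_RESTAURANTS = {
    "r1": {"Restaurant": "Peacock Alley", "Cost": "$$$", "Type": "French", "Zip": "10022"},
    "r2": {"Restaurant": "Bull & Bear", "Cost": "$$$", "Type": "Seafood", "Zip": "10022"},
    "r3": {"Restaurant": "Pacifica", "Cost": "$", "Type": "Chinese", "Zip": "10013"},
    "r4": {"Restaurant": "Soho Kitchen & Bar", "Cost": "$", "Type": "American", "Zip": "10022"},
}
NY_HOTELS = {
    "h1": {"Hotel": "Waldorf Astoria", "Zip": "10022", "Rating": "4.5"},
    "h2": {"Hotel": "Holiday Inn DT", "Zip": "10013", "Rating": "4.0"},
}


def join_project_with_provenance(restaurants, hotels):
    """Join on Zip, project (Restaurant, Hotel, Rating, Cost), record provenance."""
    view = []
    for rid, r in restaurants.items():
        for hid, h in hotels.items():
            if r["Zip"] != h["Zip"]:
                continue
            row = {"Restaurant": r["Restaurant"], "Hotel": h["Hotel"],
                   "Rating": h["Rating"], "Cost": r["Cost"]}
            # Why-provenance: the source tuples that together witness this row.
            why = {("NYRestaurants", rid), ("NYHotels", hid)}
            # Where-provenance: the source location (R, t, A) each value came from.
            where = {
                "Restaurant": ("NYRestaurants", rid, "Restaurant"),
                "Hotel": ("NYHotels", hid, "Hotel"),
                "Rating": ("NYHotels", hid, "Rating"),
                "Cost": ("NYRestaurants", rid, "Cost"),
            }
            view.append((row, why, where))
    return view


VIEW = join_project_with_provenance(NY_RESTAURANTS, NY_HOTELS)
for row, why, where in VIEW:
    print(row["Restaurant"], "| why:", sorted(why), "| Rating from:", where["Rating"])
```

In this sketch the rating 4.5 shown next to Peacock Alley traces back to the Waldorf Astoria tuple in NYHotels, while the row as a whole exists because of one restaurant tuple and one hotel tuple.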

  5. SDSS - Sloan Digital Sky Server
       Select Specobj.z, photoobj.g, photoobj.r
       From Specobj, photoobj
       Where Specobj.objid = photoobj.objid
         and Specobj.specclass = 3
         and Specobj.zconf > .95

  6. Compute provenance
     • Question: Suppose a database is created by a query. Can we compute the why and where provenance of an element?
     • Answer: Computing provenance (both why and where) is NP-hard in general.

  7. Annotations
     • Adds value to data
        • knowledge sharing: annotations can be read & reviewed by independent parties
     • Annotations are loosely structured
     • Annotations on data at various levels of granularity, annotations on annotations
     • Source Data:
        • proprietary
        • fixed schema
     • A system that overlays annotations on existing data
        • Useful tool for scientific databases
     • Annotations should spread back to the source and forward to other databases

  8. Propagating annotations

     NYRestaurants (Source Table)
       Restaurant          Cost  Type      Zip
       Peacock Alley       $$$   French    10022
       Bull & Bear         $$$   Seafood   10022
       Pacifica            $     Chinese   10013
       Soho Kitchen & Bar  $     American  10022

     All Restaurants (View 1)
       Restaurant          Cost  Type
       Peacock Alley       $$$   French
       Bull & Bear         $$$   Seafood
       Pacifica            $     Chinese
       Soho Kitchen & Bar  $     American

     Cheap Restaurants (View 2)
       Restaurant          Cost  Type
       Pacifica            $     Chinese
       Soho Kitchen & Bar  $     American

     Annotations shown in the figure:
     • “Serves fine French Cuisine in elegant setting. Jackets required.”
     • “Extensive wine list!”
     • “Yummy chicken curry!!”

  9. Location and Propagation Rules
     • A location is a triple: (R, t, A)
        • R is a relation name
        • t is a tuple in R
        • A is an attribute in schema of R
     • Propagation Rules:
        • Select:
        • Project:
        • Join:
        • Union:
     [Figure: one diagram per rule, over relations R, R1, R2 with attributes A1, A2, A3]
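
One plausible reading of the four rules can be sketched in Python. This is not the author's definition: it assumes that an annotation follows the value it is attached to, that duplicate tuples merge by taking the union of their annotations, and that a natural join combines the annotations found on shared attributes. The helper names (`cell`, `_merge`) and the choice of which restaurant carries the "Extensive wine list!" note are hypothetical.

```python
def cell(value, *notes):
    """A value together with the set of annotations attached to its location."""
    return (value, frozenset(notes))


def _merge(tuples):
    """Collapse tuples with equal values, taking the union of their annotations."""
    merged = {}
    for t in tuples:
        key = tuple(sorted((a, v) for a, (v, _) in t.items()))
        if key in merged:
            old = merged[key]
            merged[key] = {a: (v, old[a][1] | n) for a, (v, n) in t.items()}
        else:
            merged[key] = dict(t)
    return list(merged.values())


def select(rel, pred):
    # Select: a surviving tuple keeps the annotations on all of its attributes.
    return [t for t in rel if pred({a: v for a, (v, _) in t.items()})]


def project(rel, attrs):
    # Project: kept attributes carry their annotations; duplicates are merged.
    return _merge({a: t[a] for a in attrs} for t in rel)


def union(rel1, rel2):
    # Union: a tuple present on both sides gathers annotations from both.
    return _merge(list(rel1) + list(rel2))


def join(rel1, rel2):
    # Natural join (assumption): shared attributes gather both sides' annotations.
    out = []
    for t1 in rel1:
        for t2 in rel2:
            shared = t1.keys() & t2.keys()
            if all(t1[a][0] == t2[a][0] for a in shared):
                row = {**t1, **t2}
                for a in shared:
                    row[a] = (t1[a][0], t1[a][1] | t2[a][1])
                out.append(row)
    return _merge(out)


# Hypothetical placement of one annotation from the restaurant example.
R = [
    {"Restaurant": cell("Peacock Alley", "Extensive wine list!"),
     "Cost": cell("$$$"), "Zip": cell("10022")},
    {"Restaurant": cell("Pacifica"), "Cost": cell("$"), "Zip": cell("10013")},
]
ALL = project(R, ["Restaurant", "Cost"])
CHEAP = select(ALL, lambda t: t["Cost"] == "$")
print(ALL[0]["Restaurant"])
print(CHEAP)
```

Under these assumptions the note on Peacock Alley reaches the projected view but not the selection of cheap restaurants, because the annotated tuple is filtered out.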

  10. Computing annotation propagation
      • Model:
         • Source: Relational Database
         • Query
         • View: result of query applied on source
      • Question: Suppose a database is created by a query over some source data, can we compute how to propagate an annotation on a data element back to the source with minimum side-effects?
      • Answer: Computing the minimum side-effect annotation is NP-hard in general
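
To illustrate what "minimum side-effects" means, here is a brute-force Python sketch on a tiny join view. It assumes a side effect is any view location, other than the intended one, that ends up carrying the annotation. The data is a trimmed version of the restaurant example, and the tuple ids and function names are hypothetical. Enumerating candidates like this is only practical for small, simple instances.

```python
RESTAURANTS = {
    "r1": ("Peacock Alley", "10022"),
    "r2": ("Bull & Bear", "10022"),
    "r3": ("Pacifica", "10013"),
}
HOTELS = {
    "h1": ("Waldorf Astoria", "10022", "4.5"),
    "h2": ("Holiday Inn DT", "10013", "4.0"),
}


def propagation_map():
    """Map each source location (R, t, A) to the view locations it reaches."""
    reach = {}
    for rid, (name, rzip) in RESTAURANTS.items():
        for hid, (hotel, hzip, _rating) in HOTELS.items():
            if rzip != hzip:
                continue
            row = (name, hotel)  # a view tuple, identified by (Restaurant, Hotel)
            reach.setdefault(("NYRestaurants", rid, "Restaurant"), set()).add((row, "Restaurant"))
            reach.setdefault(("NYHotels", hid, "Hotel"), set()).add((row, "Hotel"))
            reach.setdefault(("NYHotels", hid, "Rating"), set()).add((row, "Rating"))
    return reach


def best_placement(target, reach):
    """Return (side_effects, source_location) with the fewest side effects."""
    candidates = [(len(views) - 1, src)
                  for src, views in reach.items() if target in views]
    return min(candidates) if candidates else None


REACH = propagation_map()
print(best_placement((("Peacock Alley", "Waldorf Astoria"), "Rating"), REACH))
print(best_placement((("Pacifica", "Holiday Inn DT"), "Restaurant"), REACH))
```

Annotating the Waldorf Astoria rating in the source also annotates the rating shown next to Bull & Bear, so that placement has one side effect; annotating the Pacifica name has none.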

  11. Related Work on Annotations (not exhaustive!)
      • Superimposed Information (D. Maier, L. Delcambre [WebDB’99])
         • data “placed over” existing information, e.g. bookmark files, schema of a database
      • Annotation Systems
         • Annotea (W3C)
            • annotate web pages
         • Multivalent Browser (R. Wilensky, T. A. Phelps, UC Berkeley DL Project)
            • annotate PDF files, HTML, etc.
         • BioDAS (Distributed Annotation Server) (L. Stein et al.)
            • annotate genome sequences
      • No one has formally studied the annotation placement problem

  12. Provenance and Annotations
      • Where-provenance & annotation placement
         • where should the annotation be placed in the source in order to propagate the annotation to view data d?
         • Annotate the source data in one of the source locations in the where-provenance of d
      • Provenance & Archiving
         • trace a piece of data to its correct source version
      • Why-provenance & view deletion
         • which source data should be deleted in order to delete view data d?
         • A combination of source data that altogether “disable” every witness for d
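
The view-deletion bullet can be read as a hitting-set problem: pick source tuples so that every witness for d loses at least one member. The following Python sketch searches exhaustively for a smallest such set. It ignores side effects on other view data, the witness sets are a hypothetical example (a hotel name that appears in the view because of three restaurant/hotel pairs), and the function name is an assumption.

```python
from itertools import combinations


def smallest_disabling_set(witnesses):
    """Return a smallest set of source tuples that intersects every witness."""
    universe = sorted(set().union(*witnesses))
    for size in range(1, len(universe) + 1):
        for combo in combinations(universe, size):
            chosen = set(combo)
            # Every witness must lose at least one of its source tuples.
            if all(chosen & w for w in witnesses):
                return chosen
    return set()


# Hypothetical witnesses for the view value "Waldorf Astoria".
WITNESSES = [{"r1", "h1"}, {"r2", "h1"}, {"r4", "h1"}]
print(smallest_disabling_set(WITNESSES))
```

Here deleting the single hotel tuple disables all three witnesses, whereas deleting restaurants would require removing three tuples. The exhaustive search grows exponentially with the number of source tuples, so it is only meant to show the shape of the problem.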

  13. How do we attach annotations to data?
      • Relational tables: Identify a particular attribute (column) of a particular tuple of a particular relation: (R, t, A)
      • Tree-like data: Need a canonical path to the data element
      [Figure: relation R with tuple t and attribute A marked]
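
Both addressing schemes can be sketched with plain Python data structures. The sketch assumes a "canonical path" is simply the sequence of keys and positions from the root to the element; the slide does not define it further. The document layout, the tuple id `r3`, and the choice of which element carries the "Yummy chicken curry!!" note are hypothetical.

```python
def resolve(tree, path):
    """Follow a path of dict keys and list positions down to one data element."""
    node = tree
    for step in path:
        node = node[step]
    return node


# Tree-like data: annotations are keyed by the path to the element.
DOC = {
    "restaurants": [
        {"name": "Peacock Alley", "cost": "$$$"},
        {"name": "Pacifica", "cost": "$"},
    ]
}
TREE_ANNOTATIONS = {("restaurants", 1, "name"): ["Yummy chicken curry!!"]}

# Relational data: annotations are keyed by a location triple (R, t, A).
TABLES = {"NYRestaurants": {"r3": {"Restaurant": "Pacifica", "Cost": "$"}}}
TABLE_ANNOTATIONS = {("NYRestaurants", "r3", "Restaurant"): ["Yummy chicken curry!!"]}

for path, notes in TREE_ANNOTATIONS.items():
    print(path, "->", resolve(DOC, path), notes)
for (rel, tid, attr), notes in TABLE_ANNOTATIONS.items():
    print((rel, tid, attr), "->", TABLES[rel][tid][attr], notes)
```

A positional path such as `("restaurants", 1, "name")` breaks when list elements are reordered, which is one reason a key-based canonical path may be preferable in practice.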

  14. Lots more to do!
      • Further study on provenance for queries that involve negation, aggregates
           select sum(sal) from Employee where sal > 50K
      • Handle “irregular” annotations and annotations on tree-like data.
      • How about databases which are manually constructed and annotated?
         • Organize data with keys
      • Use of constraints and special cases to derive efficient algorithms for propagating annotations back
      • Language specific issues

  15. Inconsistencies in “annotation-aware” language(s)

      Emp
        Name  Sal  Dept
        Joe   50K  Marketing

      Department
        Dept       Manager
        Marketing  Jane

      • Equivalent queries in the same language, but different annotation behavior
           Q1 = SELECT e.Name, e.Sal FROM Emp e WHERE e.Sal = "50K"
                result: [Name:"Joe", Sal:50K]
           Q2 = SELECT e.Name, "50K" AS Sal FROM Emp e WHERE e.Sal = "50K"
                result: [Name:"Joe", Sal:50K]
      • The same query in different languages, but different annotation behavior
           Relational Algebra: Emp JOIN Department
                result: [Name:"Joe", Sal:50K, Dept:"Marketing", Manager:"Jane"]
           SQL: SELECT e.Name, e.Sal, e.Dept, d.Manager FROM Emp e, Department d WHERE e.Dept = d.Dept
                result: [Name:"Joe", Sal:50K, Dept:"Marketing", Manager:"Jane"]
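
The Q1/Q2 discrepancy can be reproduced with a small Python simulation. It assumes one particular propagation convention: a value copied from the source keeps its annotations, while a constant written in the query carries none. The annotation texts and function names are hypothetical.

```python
# Each attribute holds (value, set of annotations); the annotations are made up.
EMP = [{
    "Name": ("Joe", {"note on name"}),
    "Sal": ("50K", {"note on salary"}),
    "Dept": ("Marketing", set()),
}]


def q1(emp):
    # SELECT e.Name, e.Sal ...: Sal is copied, so its annotations come along.
    return [{"Name": t["Name"], "Sal": t["Sal"]}
            for t in emp if t["Sal"][0] == "50K"]


def q2(emp):
    # SELECT e.Name, "50K" AS Sal ...: Sal is a fresh constant with no annotations.
    return [{"Name": t["Name"], "Sal": ("50K", set())}
            for t in emp if t["Sal"][0] == "50K"]


def values(rel):
    """Strip annotations, leaving only the ordinary query result."""
    return [{a: v for a, (v, _) in t.items()} for t in rel]


print(values(q1(EMP)) == values(q2(EMP)))
print(q1(EMP)[0]["Sal"][1], q2(EMP)[0]["Sal"][1])
```

The two queries return the same values, yet only Q1 carries the salary annotation into the result, which is the kind of inconsistency the slide points to.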

  16. Do we need an “annotation-aware” QL?
      • Relational algebra suggests a natural set of propagation rules
      • SQL suggests another natural propagation rule
         • based on variable bindings
      • Question: Can we extend/design the query language(s) so that
         • Equivalent queries have the same annotation behavior
         • Translation of a query from one language (e.g. SQL) into another (e.g. relational algebra) yields the same annotation behavior
      • Perhaps a more fundamental question...
         • Should a query language be “annotation-aware”?
         • Perhaps we should have language constructs to allow the user to explicitly control annotation propagation?

  17. End
