
Approximated Provenance for Complex Applications


Presentation Transcript


  1. Approximated Provenance for Complex Applications. Susan B. Davidson (University of Pennsylvania); Eleanor Ainy, Daniel Deutch, Tova Milo (Tel Aviv University)

  2. Crowd Sourcing: the engagement of crowds of Web users for data procurement and knowledge creation.

  3. Why now? We are all connected, all the time!

  4. Complexity? • Many of the initial applications were quite simple • Specify a Human Intelligence Task (HIT) using e.g. Mechanical Turk, collect responses, and aggregate to form the result • Newer ideas are multi-phase and complex, e.g. mining frequent fact sets from the crowd (OASSIS) • Modeled as workflows with global state

  5. Outline • “State-of-the-art” in crowd data provenance • New challenges • A proposal for modeling crowd data provenance

  6. Outline • “State-of-the-art” in crowd data provenance • New challenges • A proposal for modeling crowd data provenance

  7. Crowd data provenance? • TripAdvisor: aggregates reviews and presents average ratings • Individual reviews are part of the provenance • Wikipedia: keeps extensive information about how pages are edited • ID of the user who generated the page, as well as changes to the page (when, who, summary) • Provides several views of this information, e.g. by page or by editor • Mainly used for presentation and explanation

  8. Outline • “State-of-the-art” in crowd data provenance • New challenges • A proposal for modeling crowd data provenance

  9. Challenges for crowd data provenance • Complexity of processes and number of user inputs involved • Provenance can be very large, leading to difficulties in viewing and understanding it • Need for • Summarization • Multidimensional views • Provenance mining • Compact representation for maintenance and cleaning

  10. Summarization • Large size of provenance → need for abstraction • E.g., in heavily edited Wikipedia pages: • “x1, x2, x3 are formatting changes; y1, y2, y3, y4 add content; z1, z2 represent divergent viewpoints” • “u1, u2, u3 represent edits by robots; v1, v2 represent edits by Wikipedia administrators” • E.g., in a movie-rating application, to summarize the provenance of the average rating for “MatchPoint”: • “Audience crowd members gave higher ratings (8-10) whereas critics gave lower ratings (3-5).”

  11. Multidimensional Views • “Perspective” through which provenance can be viewed or mined • E.g. in TripAdvisor, if there is an “outlier” review it would be useful to see other reviews by that person to “calibrate” it. • “Question” perspective could show which questions are bad/unclear

  12. Maintenance and Cleaning • May need update propagation to remove certain users, questions and/or answers • E.g. spammers or bad questions • Mining of provenance may lag behind the aggregate calculation • E.g., detecting a spammer may only be possible when they have answered enough questions, or when enough answers have been obtained from other users.

  13. Outline • “State-of-the-art” in crowd data provenance • New challenges • A proposal for modeling crowd data provenance

  14. Crowd Sourcing Workflow [Figure: movie reviews collected from the crowd on a crowdsourcing platform and combined by an aggregator]

  15. Provenance expression [Figure: the provenance expression for the aggregated movie rating]

  16. Propagating provenance annotations through joins [Figure: relations R(A, B, C) and S(D, B, E), joined on B into R ⋈ S with schema (A, B, C, D, E)] • The annotation p*r means joint use of data annotated by p and data annotated by r [Green, Karvounarakis, Tannen, Provenance Semirings. PODS 2007]
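
A minimal sketch of this join behavior, assuming annotated tuples are kept as (tuple, annotation-string) pairs; the relation contents are illustrative, and only the annotation symbols p and r come from the slide:

```python
# Semiring-annotated join in the style of Green et al. (PODS 2007):
# joining two tuples combines their annotations with "*" (joint use).
R = [({"A": "a1", "B": "b1", "C": "c1"}, "p")]   # (tuple, annotation)
S = [({"D": "d1", "B": "b1", "E": "e1"}, "r")]

def join_on_B(rel1, rel2):
    out = []
    for t1, ann1 in rel1:
        for t2, ann2 in rel2:
            if t1["B"] == t2["B"]:                        # join condition
                out.append(({**t1, **t2}, f"{ann1}*{ann2}"))  # multiply annotations
    return out

print(join_on_B(R, S))
# [({'A': 'a1', 'B': 'b1', 'C': 'c1', 'D': 'd1', 'E': 'e1'}, 'p*r')]
```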

  17. Propagating provenance annotations through unions and projections [Figure: relation R(A, B, C) projected onto A, B as πAB(R)] • + means alternative use of data, which arises in both PROJECT and UNION [Green, Karvounarakis, Tannen, Provenance Semirings. PODS 2007]
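
A companion sketch for projection, under the same assumed representation; tuples that coincide after projecting are merged and their annotations combined with "+" (the annotation symbols p and q are illustrative):

```python
# Projection over annotated tuples: duplicates created by dropping C
# are merged, and their annotations are added (alternative use).
R = [({"A": "a1", "B": "b1", "C": "c1"}, "p"),
     ({"A": "a1", "B": "b1", "C": "c2"}, "q")]   # same A, B; different C

def project(rel, attrs):
    merged = {}
    for t, ann in rel:
        key = tuple(t[a] for a in attrs)
        merged[key] = f"{merged[key]}+{ann}" if key in merged else ann
    return [(dict(zip(attrs, key)), ann) for key, ann in merged.items()]

print(project(R, ["A", "B"]))
# [({'A': 'a1', 'B': 'b1'}, 'p+q')]
```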

  18. Annotated Aggregate Expressions Q = select Dept, sum(Sal) from R group by Dept, over a relation R(Eid, Dept, Sal) • The summed salary for d1 could be represented by the expression (20 ⊗ p1) + (10 ⊗ p2) + (15 ⊗ p3) • This provenance-aware value “commutes” with deletion [Amsterdamer, Deutch, Tannen, Provenance for Aggregate Queries. PODS 2011]
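
A sketch of this aggregate representation, assuming each salary is kept as a (value, annotation) pair; the values 20, 10, 15 and annotations p1, p2, p3 follow the slide:

```python
# The SUM for d1 is kept as a formal expression of value⊗annotation
# terms rather than a plain number.
d1_salaries = [(20, "p1"), (10, "p2"), (15, "p3")]

expr = " + ".join(f"{v}⊗{ann}" for v, ann in d1_salaries)
print(expr)                                   # 20⊗p1 + 10⊗p2 + 15⊗p3

# "Commutes" with deletion: deleting the tuple annotated p2 just drops
# its term, and the remaining expression evaluates correctly.
value = sum(v for v, ann in d1_salaries if ann != "p2")   # 35
```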

  19. Provenance expression [Figure: the provenance expression, revisited]

  20. Provenance expression: Benefits • Can understand how movie ratings were computed. • Can be used for data maintenance and cleaning • E.g. if U2 is discovered to be a spammer, “map” its provenance annotation to 0
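
A sketch of that cleaning step, assuming ratings are kept as (value, annotation) terms; the annotation U2 is from the slide, while the other user names and the rating values are illustrative:

```python
# Cleaning via annotation mapping: "map" a spammer's annotation to 0,
# which zeroes out their terms before re-aggregating.
terms = [(8, "U1"), (10, "U2"), (9, "U3")]     # rating paired with user
mapping = {"U2": 0}                            # U2 discovered to be a spammer

kept = [(v, u) for v, u in terms if mapping.get(u, 1) != 0]
average = sum(v for v, _ in kept) / len(kept)
print(average)                                 # 8.5 instead of 9.0
```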

  21. Summarizing provenance • Map annotations to a corresponding “summary” • h: Ann → Ann’, where |Ann’| << |Ann| • E.g. in our example, let h(Ui) = h(Si) = 1, h(Ai) = A, h(Ci) = C • Applying h reduces the expression to a much smaller one, which then simplifies further [reduced expressions shown as figures]
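
A sketch of the mapping h from the slide, assuming annotations are strings whose first character identifies their kind (Ui users, Si sessions, Ai audience members, Ci critics):

```python
# Summarization mapping: user and session annotations collapse to 1,
# audience and critic annotations collapse to summary labels A and C.
def h(ann):
    if ann[0] in ("U", "S"):
        return "1"
    if ann[0] == "A":
        return "A"
    if ann[0] == "C":
        return "C"
    return ann

expression = ["U1", "A1", "U2", "A2", "U3", "C1"]
print([h(a) for a in expression])    # ['1', 'A', '1', 'A', '1', 'C']
```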

  22. Constructing mappings? • How do we define and find “good” mappings? • Provenance size • Semantic constraints (e.g. two annotations can only be mapped to the same annotation if they come from the same input table) • Distance between original provenance expression and the mapped expression (e.g. grouping all young French people and giving them an average rating for some movie)
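
A hedged sketch of one way to score candidate mappings against these criteria; the linear trade-off, the weight alpha, and the numeric distance are assumptions for illustration, as the talk names the criteria but no concrete formula:

```python
# Score a candidate mapping by trading off the size reduction it
# achieves against the distance between the original and the mapped
# provenance expression (both inputs assumed precomputed).
def score(original_size, mapped_size, distance, alpha=0.5):
    compression = 1.0 - mapped_size / original_size   # fraction of size saved
    return alpha * compression - (1.0 - alpha) * distance

print(score(original_size=1000, mapped_size=50, distance=0.2))  # 0.375
```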

  23. Conclusions • Provenance is needed for crowd-sourcing applications to help understand the results and reason about their quality. • Techniques from database/workflow provenance can be used, but there are special challenges and “opportunities”
