
Explaining Query Answers with Explanation-Ready Databases


Presentation Transcript


  1. Explaining Query Answers with Explanation-Ready Databases
  Sudeepa Roy, Laurel Orr, Dan Suciu

  2. This talk
  • Given:
    • a database D
    • an aggregate query Q
    • the output(s) Q[D]
    • a question q on the output(s): "why is x high/low/higher than y, etc.?"
  • How do we provide meaningful explanations?

  3. The NSF Example
  Database D [http://www.nsf.gov/awardsearch/download.jsp]:
  • Award(aid, amount, title, year, startdate, enddate, dir, div)
  • Institution(aid, inst, address)
  • Investigator(aid, name, email)
  (dir = Directorate, e.g. CS; div = Division, e.g. IIS for DB)
  Query Q: find the top-5 universities with the highest total award money in CS from 1990 on.
  SELECT TOP 5 B.inst, SUM(A.amount) AS totalAward
  FROM Award A, Institution B
  WHERE (A.aid = B.aid) AND (dir = 'CS') AND (year >= 1990)
  GROUP BY B.inst
  ORDER BY totalAward DESC
  Q[D]: the top-5 list. "But I remember a news article..."

  4. The NSF Example: Give me an explanation!
  Question q on the output: why is q[D] = Q[D, UIUC] - Q[D, CMU] high?

  5. In search of a "causal" explanation by "intervention"...
  (Pictured: Aristotle (384-322 BC), David Hume (1711-76), Karl Pearson (1857-1936), Carl Gustav Hempel (1905-97), Donald Rubin, Judea Pearl)
  Make changes to the world (a.k.a. the database). If it affects the outcome (a.k.a. the query answer), you have a good explanation (to your question).

  6. Explanation by Intervention
  • A candidate explanation Φ
  • The change it causes to the database is the intervention ΔΦ ⊆ D
  • Goal: find the top explanations ranked by q[D] - q[D - ΔΦ]
  (Figure: applying the intervention ΔΦ to D changes the UIUC vs. CMU answer from q[D] to q[D - ΔΦ])
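  To make the intervention concrete, here is a minimal sketch (not from the slides) that scores one candidate explanation against the query from slide 3; the predicate [div = 'ACI'] is the example used on slide 8, and everything else follows the assumed NSF schema:

    -- Sketch: Q[D - DeltaPhi] for the single candidate Phi = [div = 'ACI'].
    -- DeltaPhi is the set of Award tuples matching the predicate, so removing it
    -- amounts to adding the negated predicate to Q.
    SELECT TOP 5 B.inst, SUM(A.amount) AS totalAward
    FROM Award A, Institution B
    WHERE A.aid = B.aid AND A.dir = 'CS' AND A.year >= 1990
      AND (A.div IS NULL OR A.div <> 'ACI')   -- D - DeltaPhi
    GROUP BY B.inst
    ORDER BY totalAward DESC

  Ranking candidates then compares q[D] with the q value recomputed from this result; the rest of the talk is about doing this for a very large number of Φ-s without issuing one query per candidate.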

  7. Related Work on Explanations
  • [Scorpion, Wu-Madden, VLDB 2013] (Ack: thanks to Eugene Wu for the example)
    Q: SELECT time, AVG(Temp) FROM readings GROUP BY time
    Question: Why is AVG(temp) at 12 pm so high?
    Answer Φ: [sensor = 3]
    Q[D] = original avg(temp) at 12 pm; Q[D - ΔΦ] = new avg(temp) at 12 pm
  • [Roy-Suciu, SIGMOD 2014]
    Question: Why is there a peak in #SIGMOD papers from industry around 2000?
    Answer Φ: [inst = 'bell-labs']
    Allows dependency between relations (delete author ⇒ delete paper); materializes the join before looking for explanations
  • Both are restricted to
    • Φ: simple conjunctive predicates
    • single-block queries Q

  8. NSF example again...
  Award(aid, amount, title, year, startdate, enddate, dir, div)
  Institution(aid, inst, address)
  Investigator(aid, name, email)
  SELECT TOP 5 B.inst, SUM(A.amount) AS totalAward
  FROM Award A, Institution B
  WHERE (A.aid = B.aid) AND (dir = 'CS') AND (year >= 1990)
  GROUP BY B.inst
  ORDER BY totalAward DESC
  • Previous work considers predicates only from Award and Institution
    • [div = 'ACI'] (cyber-infrastructure): $893M for UIUC and $26M for CMU
  • What about the Investigator table? Investigator is not in the query, so it is ignored
    • [name = 'RP'] from UIUC received > $580M from NSF

  9. There are explanations beyond simple conjunctive predicates
  Explanations of the form:
  • Awards with >= 3 PIs
  • PIs with >= 5 awards
  • PIs with >= $10M in total award money
  • PIs with >= 10 co-PIs
  • Top-10 PIs in terms of money
  cannot be handled by previous approaches. The concept of intervention supports more general explanations.

  10. Our Contributions
  • Formalize richer explanations beyond simple conjunctive predicates, via explanation-ready databases
  • Rank these explanations efficiently
  • Support queries beyond single-block queries

  11. General Framework for Explanations
  Example Φ: "PIs with >= $10 million in awards"
  R1: Award(aid, amount, title, year, startdate, enddate, dir, div)
  R2: Institution(aid, inst, address)
  R3: Investigator(aid, name, email)
  • Step 1: Find the investigators with this property: Δ3,Φ
  • Step 2: Find the tuples from the other tables affected by Δ3,Φ
    • Δ1,Φ = awards by investigators in Δ3,Φ
    • Δ2,Φ = ∅
    • these can be computed by queries too
  • Φ is more complex than a conjunctive predicate and requires other tables, but it can be computed by a query (see the sketch below)
  Intervention of Φ = (Δ1,Φ, Δ2,Φ, ..., Δk,Φ)
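  A minimal sketch of steps 1 and 2 for this Φ, assuming the schema above; the table names Delta3_Phi and Delta1_Phi and the SELECT ... INTO syntax are illustrative, not the paper's implementation:

    -- Step 1 (sketch): Delta3_Phi = Investigator tuples of PIs with >= $10M in total awards.
    SELECT I.*
    INTO Delta3_Phi
    FROM Investigator I
    WHERE I.name IN (SELECT I2.name
                     FROM Investigator I2, Award A
                     WHERE I2.aid = A.aid
                     GROUP BY I2.name
                     HAVING SUM(A.amount) >= 10000000);

    -- Step 2 (sketch): Delta1_Phi = Award tuples held by any PI in Delta3_Phi
    -- (Delta2_Phi is empty).
    SELECT DISTINCT A.*
    INTO Delta1_Phi
    FROM Award A
    WHERE A.aid IN (SELECT aid FROM Delta3_Phi);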

  12. Explanation-Ready Databases (ERD): 1/2
  • Fix a small set of templates (NSF: ~8)
  • Get their candidate explanations by running SQL queries (NSF: ~188k)
  • For all candidate explanations Φ, pre-compute by SQL queries: ΔΦ = (Δ1,Φ, Δ2,Φ, ..., Δk,Φ)
  • Store them in k tables (Δ1, Δ2, ..., Δk), indexed by a new attribute Φ; descriptions are stored separately
  (Figure: example with three explanation ids Φ ∈ {1, 2, 3} and k = 2 relations R1, R2; Δ1 and Δ2 hold the interventions, e.g. for Φ = 2)
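  One way to picture the stored Δ tables (a sketch; the table name Delta_Award and the column name phi are assumed for illustration): each Δi repeats the schema of Ri plus the explanation id, so the interventions of all candidates sit side by side in one table per relation:

    -- Sketch: pre-computed interventions on Award for all candidate explanations,
    -- keyed by the explanation id phi; explanation descriptions are stored separately.
    CREATE TABLE Delta_Award (
        phi     INT,              -- candidate explanation id
        aid     INT,
        amount  DECIMAL(15, 2),
        year    INT,
        dir     VARCHAR(16),
        div     VARCHAR(16)
        -- remaining Award attributes omitted in this sketch
    );
    CREATE INDEX idx_delta_award_phi ON Delta_Award (phi);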

  13. Explanation-Ready Databases (ERD): 2/2
  • The new database (D, Δ) is ready to find good explanations = ERD
  • Given a query Q and a question q:
    • Compute q[D - ΔΦ] for all Φ
    • Output the top ones by their intervention on the question, q[D] - q[D - ΔΦ]
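  Continuing the sketch (the Delta_Award table and phi column are assumed names), scoring one candidate, say Φ = 2, amounts to re-running Q while skipping that candidate's rows:

    -- Sketch: Q[D - DeltaPhi] for one explanation id (phi = 2), using the
    -- pre-computed Delta_Award table from the previous sketch.
    SELECT B.inst, SUM(A.amount) AS totalAward
    FROM Award A, Institution B
    WHERE A.aid = B.aid AND A.dir = 'CS' AND A.year >= 1990
      AND NOT EXISTS (SELECT 1 FROM Delta_Award DA
                      WHERE DA.phi = 2 AND DA.aid = A.aid)
    GROUP BY B.inst;

  Doing this one Φ at a time is exactly the naïve method the next slides improve on.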

  14. An ERD (D, Δ) depends only on D
  • ΔΦ depends on Φ and the semantics of D
  • The ERD does not depend on the query Q or the user question q
  • The ERD for the NSF dataset can be used to answer other questions, e.g.:
    • Query 1, Question 1a: Why is the UIUC total > the CMU total?
    • Query 1, Question 1b: Why is the UCSD total > the MIT total?
    • Query 2, Question 2: Why is the average award money of ACI high?

  15. Explanation query q* on the ERD
  Goal: Given an ERD (D, Δ), compute q[D - ΔΦ] for all Φ (to rank them)
  • Naïve method: iterate over all Φ and compute q[D - ΔΦ], possibly using incremental view maintenance (IVM); but with a large number of Φ-s the "for loop" is slow
  • Our approach: simultaneously evaluate all Φ-s by running one explanation query q*[D, Δ]
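  To convey the flavor of evaluating all Φ-s at once (a simplified sketch for this particular SUM query, not the paper's q* algorithm, which handles general plans via annotated relations): a single grouped query over the assumed Delta_Award table gives, per explanation id, how much each institution's total would drop:

    -- Sketch: for every explanation id phi simultaneously, the amount by which the
    -- CS-since-1990 total of each institution drops when DeltaPhi is removed.
    SELECT DA.phi, B.inst, SUM(A.amount) AS drop_in_total
    FROM Delta_Award DA, Award A, Institution B
    WHERE DA.aid = A.aid AND A.aid = B.aid
      AND A.dir = 'CS' AND A.year >= 1990
    GROUP BY DA.phi, B.inst;

  Subtracting drop_in_total from the original per-institution Q[D] yields Q[D - ΔΦ] for every Φ in one pass; q* generalizes this idea to other operators and multi-block queries.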

  16. Tables as "Annotated Relations"
  • Each tuple maps to a numeric annotation (initially 1)
  • As the query progresses, annotations are propagated through the query plan, simultaneously for all Φ-s
  (Figure: tuples of D and Δ, each initially annotated with 1)
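  A rough SQL analogue of annotation propagation (purely illustrative; the paper defines precise rules per operator, including its own ⊖): keep an ann column next to each tuple, multiply annotations across a join, and weight aggregated values by the product. The table names Award_ann and Institution_ann are assumptions for this sketch:

    -- Sketch: Award_ann and Institution_ann are copies of the base tables with an
    -- extra numeric column ann (initially 1). The join multiplies annotations and
    -- the SUM aggregate weights each value by that product.
    SELECT B.inst, SUM(A.amount * A.ann * B.ann) AS totalAward
    FROM Award_ann A, Institution_ann B
    WHERE A.aid = B.aid AND A.dir = 'CS' AND A.year >= 1990
    GROUP BY B.inst;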

  17. Invariant maintained by q*
  For all explanations Φ and for all partial or final queries q:
  q[D ⊖ ΔΦ] = q[D] ⊖ q*Φ[D, Δ]
  where q*Φ is the restriction of q*[D, Δ] to Φ, and ⊖ is the difference considering both the set and the annotations.
  i.e., first applying the intervention of Φ and then evaluating the query ≡ first evaluating the query and the explanation query, then computing the intervention by restricting to Φ.

  18. In the paper...
  • Addition ⊕, difference ⊖, and intersection ⊙ for annotated relations
  • Rules and algorithms for all operators maintaining the invariants
  • Examples and necessary restrictions on the inputs

  19. ERDs can go beyond single-block queries
  • e.g., they can handle a trend: the slope of a linear regression on a derived measure
  WITH R AS
    (SELECT Year AS x, AVG(Revenue - Expenses) AS y
     FROM Sales
     WHERE Year >= 2007 AND Year <= 2010
     GROUP BY Year)
  SELECT ((COUNT(R.x) * SUM(R.x*R.y)) - (SUM(R.x) * SUM(R.y))) /
         ((COUNT(R.x) * SUM(R.x*R.x)) - (SUM(R.x) * SUM(R.x)))
  FROM R

  20. Experiments (IVM = Incremental View Maintenance)
  • Algorithms:
    • SingleQ-IVM = our explanation query
    • Iter-IVM = iteratively run a similar IVM
    • Naïve-Iter = iterative naïve approach
    • Naïve-SingleQ = run a single query, but with an expensive cross-product
  • Parameters:
    • |D| = n
    • number of explanations |Φ|
    • average intervention size for explanations: avgΦ |ΔΦ|
    • complexity of queries (group-by / single output, nested)

  21. Experiments
  • Datasets:
    • (performance) synthetic datasets built from the "Natality" dataset (4M rows, 233 attributes)
    • (qualitative evaluation) the "NSF" dataset: Award (400k), Institution (419k), and Investigator (621k); |Φ| = 188k of 8 different types; |Δ| = 1.74M
    • also the "Flights" dataset (in the paper)

  22. Experiments: # of tuples
  (Plot: running time vs. #tuples for naive, iterative IVM, our algorithm, and the original query Q[D])
  • The running time of our algorithms does not increase rapidly with #tuples, until the final join to compute the difference Q[D] - Q[D - Δ]

  23. Experiments: # of explanations
  (Plot: running time vs. #explanations for iterative IVM, which increases rapidly with #explanations, the original query Q[D], our algorithm for Q[D - Δ], and our algorithm for Q[D] - Q[D - Δ])
  • The running time of our algorithms does not increase rapidly with #explanations
  • Computing the difference Q[D] - Q[D - Δ] is faster than computing Q[D - Δ], and can even be faster than the original query

  24. Experiments: Group size |ΔΦ|
  • Running time is bad when the group size is ~100k for all 1k explanations: |D| = 4M, but |Δ| = 50M!
  • In practice, not all explanations are likely to have a huge ΔΦ

  25. Experiments: IVM (DBToaster [Ahmad et al. 2012]) vs. ERD
  • IVM: compute Q[D - Δ]; actual deletion, fixed query, dynamic updates
  • ERD (our algorithm): compute Q[D - Δ1], Q[D - Δ2], ..., Q[D - ΔP]; hypothetical deletion, dynamic queries, fixed updates
  • Baseline: iteratively run DBToaster, a state-of-the-art IVM system
  • ERD outperforms iterative IVM even when the time to read data in IVM and the compilation of the input query are ignored

  26. Explanations for the NSF example
  Top explanations:
  • Investigators with more than $100M in awards in total (there were none at CMU)
  • Investigators with ≥ 11 and with 7-8 awards
  • Investigators from UIUC, 'RP' and 'TD' (who share some awards)
  • Awards with 6 or more investigators
  Each contributes more than $400-850 million to the difference in funding between UIUC and CMU.
  • The above explanations are diverse
  • They give the insight that a few researchers at UIUC got a number of big awards (in cyber-infrastructure, ACI)
  • These insights could not be obtained if we were restricted to simple predicates from the Award and Institution tables and ignored Investigators altogether

  27. Summary
  • We need to put effort into drawing interesting insights from data that can answer our questions
  • We propose ERDs that can handle complex explanations and complex queries
  • However, there are restrictions and limitations (bag semantics, negation), space overhead, and templates must be defined manually: scope for future work
  Thank you
