
Efficient Management of Inconsistent and Uncertain Data


Presentation Transcript


  1. Efficient Management of Inconsistent and Uncertain Data Renée J. Miller University of Toronto

  2. Contributors • Ariel Fuxman, PhD Thesis • Microsoft Search Labs • SIGMOD Jim Gray Dissertation Award 2008 • Periklis Andritsos, PhD • Jiang Du, MS • Elham Fazli, MS • Diego Fuxman, Undergrad

  3. Dirty Databases [cartoon: "No, I don't see any problem with the data"] • The presence of dirty data is a major problem in enterprises • Traditional solution: data cleaning

  4. Limitations of Data Cleaning • Semi-automatic process • Requires highly-qualified domain experts • Time consuming • May not be possible to wait until the database is clean • Operational systems answer queries assuming clean data

  5. Our Work • Identify classes of queries for which we can obtain meaningful answers from potentially dirty databases • Show how to do it efficiently, reusing existing database technology

  6. Why is this Business Intelligence? • Business intelligence (BI) refers to technologies, applications and practices for the collection, integration, analysis, and presentation of information. • The goal of BI is to support better decision making, based on information. • DBMS should provide meaningful query answers even over data that is dirty

  7. Outline • Introduction • Semantics for dirty databases • Contributions • Conclusions

  8. Outline • Introduction • Semantics for dirty databases • Contributions • Conclusions

  9. A Data Integration Example Integrating customer data from multiple sources (Sales, Shipping, Customer Support, Web Forms, Demographic Data) into an Integrated Customer Database. [diagram]

  10. Matching and Merging • Matching and merging are two fundamental tasks in data integration [figure: Web and Sales source tables]

  11. True Disagreement Between Sources What is Peter's salary? [figure: the Web and Sales sources report different values]

  12. Inconsistent Integrated Databases In the absence of complete resolution rules, merging the Web and Sales sources yields an inconsistent integrated database: some tuples satisfy the custid key, while others violate it. [figure: integrated table with key violations]
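
  A minimal SQL sketch of such an inconsistent instance (the table, attribute names, and concrete salary values are assumptions for illustration; only the custid key, the names Peter, Paul, and Mary, and the 100K threshold come from the slides):

    -- Hypothetical integrated table; custid is the intended key, but the
    -- merged data violates it, so no key constraint is declared here.
    CREATE TABLE customer (
        custid  INTEGER,
        name    VARCHAR(20),
        salary  INTEGER          -- annual salary, in thousands
    );

    -- The Sales and Web sources agree on Paul and Mary but disagree on
    -- Peter's salary, producing two tuples with the same custid.
    INSERT INTO customer VALUES (1, 'Peter',  90);   -- from Sales
    INSERT INTO customer VALUES (1, 'Peter', 120);   -- from Web (key violation)
    INSERT INTO customer VALUES (2, 'Paul',  150);   -- consistent
    INSERT INTO customer VALUES (3, 'Mary',  110);   -- consistent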

  13. Querying Inconsistent Databases Example: offering a Platinum credit card. Query: "Get customers who make more than 100K". Run over the inconsistent database, the answer is Peter, Paul, Mary. Are we sure that we want to offer a card to Peter? [figure: tuples annotated with their sources (web, sales, or both)]

  14. Querying Inconsistent Databases • Aggressive: get customers who possibly make more than 100K (Peter, Paul, Mary) • Conservative: get customers who certainly make more than 100K (Paul, Mary)
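
  Over the hypothetical customer table sketched above, the two semantics can be expressed directly in SQL (a sketch of the idea only, not ConQuer's actual rewriting):

    -- Aggressive (possible) answers: some tuple for the customer reports
    -- a salary over 100K.
    SELECT DISTINCT name
    FROM customer
    WHERE salary > 100;                        -- Peter, Paul, Mary

    -- Conservative (certain) answers: no tuple in the same key group
    -- contradicts the answer, so it holds in every repair of the key.
    SELECT DISTINCT c.name
    FROM customer c
    WHERE NOT EXISTS (
        SELECT 1 FROM customer c2
        WHERE c2.custid = c.custid
          AND (c2.name <> c.name OR c2.salary <= 100)
    );                                         -- Paul, Mary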

  15. Formal Semantics • Related to semantics for querying incomplete data [Imielinski Lipski 84, Abiteboul Duschka 98] • Possible world: “complete” databases • Consistent answers • Proposed by Arenas, Bertossi, and Chomicki in 1999 • Corresponds to conservative semantics • Possible world: “consistent” databases

  16. Consistent Answers: Repairs A repair of the inconsistent database (key: custid) keeps exactly one tuple from each group of tuples that share a custid value. [figure: the inconsistent database and its repairs, with each tuple annotated by its source (web, sales, or both)]

  17. Consistent Answers Query: "Get customers who make more than 100K". Consistent answers are the answers obtained no matter which repair we choose; evaluating q over each repair yields the consistent answer {Paul, Mary}. [figure: q applied to every repair]
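
  In symbols, this is the standard definition behind the slide's phrase "answers obtained no matter which repair we choose":

    % A repair R is a maximal subset of the database D that satisfies the
    % key constraints; the consistent answers are those returned by q over
    % every repair.
    \mathrm{cons}(q, D) \;=\; \bigcap_{R \,\in\, \mathrm{repairs}(D)} q(R)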

  18. Outline • Introduction • Semantics for dirty databases • Contributions • Conclusions

  19. When We Started… • Semantics well understood • Problem • Potentially HUGE number of repairs! • Negative results [Chomicki et al. 02, Arenas et al. 01, Cali et al. 04] • Few tractability results [Arenas et al. 99, Arenas et al. 01] • Logic programming approaches [Bravo and Bertossi 03, Eiter et al. 03] • Expressive queries and constraints • Computationally expensive • Applicable only to small databases with a small number of inconsistencies

  20. Our Proposal: ConQuer Given a SQL query q and the key constraints, ConQuer's rewriting algorithm produces a rewritten SQL query Q*. Running Q* on a commercial database engine over the inconsistent database yields the consistent answers to q. [architecture diagram]
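
  To give the flavor of what a rewritten query Q* can look like, here is a hand-written sketch for a join query over two hypothetical relations, each with key custid (this illustrates the idea behind such rewritings, not ConQuer's actual output):

    -- Original query q: customers with salary > 100K who hold an account
    -- with balance > 50K (customer.custid and account.custid are keys).
    -- Sketch of a rewriting Q* that returns only consistent answers:
    SELECT DISTINCT c.name
    FROM customer c, account a
    WHERE c.custid = a.custid
      AND c.salary > 100 AND a.balance > 50
      -- every customer tuple in the same key group must agree on the
      -- returned name and satisfy the selection ...
      AND NOT EXISTS (
          SELECT 1 FROM customer c2
          WHERE c2.custid = c.custid
            AND (c2.name <> c.name OR c2.salary <= 100)
      )
      -- ... and every account tuple in the joined key group must satisfy
      -- its selection, so the join survives in every repair.
      AND NOT EXISTS (
          SELECT 1 FROM account a2
          WHERE a2.custid = a.custid AND a2.balance <= 50
      );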

  21. Class of Rewritable Queries • ConQuer handles a broad class of SPJ queries with • Set semantics • Bag semantics, grouping, and aggregation • No restrictions on • Number of relations • Number of joins • Conditions or built-in predicates • Key-to-key joins • The class is “maximal”

  22. Why not all SPJ queries? • Some SPJ queries cannot be rewritten into SQL • Consistent query answering is coNP-complete even for some SPJ queries and key constraints • Maximality of ConQuer’s class • Minimal relaxations lead to intractability • Restrictions only on • Nonkey-to-nonkey joins • Self joins • Nonkey-to-key joins that form a cycle
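
  For example, a query whose only join is nonkey-to-nonkey falls into one of the restricted patterns above and is therefore outside ConQuer's rewritable class (the emp/proj schema here is a made-up illustration):

    -- Hypothetical schema: emp(empid, dept) with key empid,
    --                      proj(projid, dept, budget) with key projid.
    -- The join below is nonkey-to-nonkey (dept = dept), one of the
    -- restricted patterns, so ConQuer does not rewrite this query.
    SELECT e.empid
    FROM emp e, proj p
    WHERE e.dept = p.dept
      AND p.budget > 500;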

  23. Example: A Rewritable Query • TPC-H Query 10

    SELECT c_custkey, c_name,
           sum(l_extendedprice * (1 - l_discount)) as revenue,
           c_acctbal, n_name, c_address, c_phone, c_comment
    FROM customer, orders, lineitem, nation
    WHERE c_custkey = o_custkey
      AND l_orderkey = o_orderkey
      AND o_orderdate >= '1993-10-01'
      AND o_orderdate < date('1993-10-01') + 3 MONTHS
      AND l_returnflag = 'R'
      AND c_nationkey = n_nationkey
    GROUP BY c_custkey, c_name, c_acctbal, c_phone, n_name, c_address, c_comment
    ORDER BY revenue desc

  24. Rewritings Can Get Quite Complex [figure: the SQL rewriting of TPC-H Query 10] Can this rewriting be executed efficiently? On a 20 GB database with 5% inconsistency, the overhead is 1.7x.

  25. Experimental Evaluation • Goals • Quantify the overhead of the rewritings • Assess the scalability of the approach • Determine sensitivity of the rewritten queries to level of inconsistency of the instance • Queries and databases • Representative decision support queries (TPC-H benchmark) • TPC-H databases, altered to introduce inconsistencies • Database parameters • database size • percentage of the database that is inconsistent • conflicts per key value (in inconsistent portion)

  26. Scalability [chart: running time vs. database size (GB), with 5% inconsistent tuples and 2 conflicts per inconsistent key value] • Worst case: 5.8x overhead (selectivity 98.56%) • Best case: 1.2x overhead (selectivity 0.001%)

  27. Contributions – Theory • Formal characterization of a broad class of queries • For which computing consistent answers is tractable under key constraints • That can be rewritten into first-order/SQL • Query rewriting algorithms for a class of Select-Project-Join queries • With set semantics • With bag semantics, grouping, and aggregation • Maximality of the class of queries

  28. Contributions – Practice • Implementation of ConQuer • Designed to compute consistent answers efficiently • Multiple rewriting strategies • Experimental validation of efficiency and scalability • Representative queries from TPC-H • Large databases

  29. Uncertain Data Provenance information (e.g., source reputation) can be attached to the integrated database. [figure: integrated database with tuples from the Web and Sales sources carrying scores such as 0.3, 0.7, and 1]
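
  A rough sketch of one way such provenance scores could be attached to the integrated data (the schema, the score column, and the ranking query are assumptions; only the idea of source-reputation scores such as 0.3 and 0.7 comes from the slide):

    -- Each integrated tuple carries a score derived from its source's
    -- reputation, instead of being kept or discarded outright.
    CREATE TABLE customer_u (
        custid  INTEGER,
        name    VARCHAR(20),
        salary  INTEGER,
        score   DECIMAL(2,1)     -- e.g., 0.3 for Web, 0.7 for Sales
    );

    -- One possible use: rank query answers by how strongly the sources
    -- support them, rather than returning a flat set.
    SELECT custid, name, MAX(score) AS support
    FROM customer_u
    WHERE salary > 100
    GROUP BY custid, name
    ORDER BY support DESC;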

  30. Publications and Demo • These and other contributions appear in • ICDT05/JCSS06 • SIGMOD05 • ICDE06 • PODS06/TODS06 • VLDB06 • Demo given at VLDB05 • http://queens.db.toronto.edu/project/conquer/demo2/

  31. Outline • Introduction • Semantics for dirty databases • Contributions • Conclusions

  32. A Virtuous Cycle [diagram: Data Integration and Query Answering feeding each other] • Data integration: recognize and characterize inconsistent data • Query answering: use knowledge about inconsistencies to give better answers and to suggest ways to clean the database

  33. Beyond the Enterprise • Can we apply principled models of inconsistency or uncertainty to the Web? • Different assumptions • Uncertainty in queries • There’s never a “true” answer • Challenge • Build models based on user preferences • Leverage massive repositories of user behavior data

  34. THANK YOU Plug: Discovering Data Quality Rules, Fei Chiang, Thursday 11:15am, Research Session 33
