1 / 41

Floris Geerts ( University of Antwerp )

ICDE2014. April, 1st. Floris Geerts ( University of Antwerp ) Giansalvatore Mecca, Donatello Santoro ( Università della Basilicata ) Paolo Papotti ( Qatar Computing Research Institute ). Overview. Motivations and Goals. Semantics. Experimental Results. Overview.

joshwa
Download Presentation

Floris Geerts ( University of Antwerp )

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. ICDE2014 April, 1st FlorisGeerts (University of Antwerp) Giansalvatore Mecca, DonatelloSantoro (Universitàdella Basilicata) Paolo Papotti (Qatar Computing Research Institute)

  2. Overview • Motivations and Goals • Semantics • Experimental Results

  3. Overview • Motivations and Goals • Semantics • Experimental Results

  4. Schema Mapping System A Mapping and Cleaning Task STRONGLY INTERRELATED PROBLEMS Data Cleaning Tools

  5. A Motivating Example Source #1 (Confidence 0.7) Target Source #2 (Confidence0.5) Source #3 (Confidence1.0)

  6. Step1: To exchange data from source to target A Motivating Example Source #1 (Confidence 0.7) Target Source #2 (Confidence0.5) Source #3 (Confidence1.0)

  7. A Motivating Example Source #1 (Confidence 0.7) ST-TGD [Popa et al., VLDB’02] Schema Mappingstrasformation can be expressedas a set of source to target tuplegeneratingdependencies(st-tgds) Target Source #2 (Confidence0.5) Source #3 (Confidence1.0)

  8. A Motivating Example Source #1 (Confidence 0.7) ST-TGD Target Source-to-Target TGD MedTreat(ssn, n, p, s, c, i, t, d) → ∃Y3, Y4 : Cust(ssn, n, p, 0.7, s, c, Y3), Treat(ssn, Y4, i, t, d)

  9. A Motivating Example Source #2 (Confidence0.5) ST-TGD Pre-Solution for the TGDs Target Source-to-Target TGD Pat(ssn, n, p, s, c), Surg(ssn, i, t, d) → ∃Y3, Y4 : Cust(ssn, n, p, 0.5, s, c, Y3), Treat(ssn, Y4, i, t, d)

  10. Step2: To ensure Data Quality A Motivating Example Source #1 (Confidence 0.7) Target Source #2 (Confidence0.5) Source #3 (Confidence1.0)

  11. A Motivating Example Functional Dependencies ST-TGD FD ID CFD ER fd1. Cust: SSN→ Name, Phone, Str, City, CC# fd2. Cust: Name, Str, City → SSN fd3. Treat: SSN → Salary Inclusion Dependencies id4. Treat[SSN] ⊆Customers[SSN] Conditional Functional Dependencies cfd5. Treat: Insur[‘Abx’] → Tr[‘Dental’] cfd6. IF Treat:Insur[‘Abx’] THEN Cust: City[‘SF’] Editing Rules er7. IF Cust.SSN = MD.SSN, Cust.Phone = MD.Phone → TAKE Name, Streetfrom MD

  12. A Motivating Example Functional Dependencies VIOLATIONS ST-TGD FD ID CFD ER fd1. Cust: SSN→ Name, Phone, Str, City, CC# fd2. Cust: Name, Str, City → SSN fd3. Treat: SSN → Salary Inclusion Dependencies id4. Treat[SSN] ⊆Customers[SSN] Conditional Functional Dependencies cfd5. Treat: Insur[‘Abx’] → Tr[‘Dental’] cfd6. IF Treat:Insur[‘Abx’] THEN Cust: City[‘SF’] Editing Rules er7. IF Cust.SSN = MD.SSN, Cust.Phone = MD.Phone → TAKE Name, Streetfrom MD

  13. PreviousSemantics? A Motivating Example Source #1 (Confidence 0.7) Target Source #2 (Confidence0.5) Source #3 (Confidence1.0)

  14. Data Exchange [Faginet al., TCS ’05] ST-TGD ID FD • Elegantsemantics • Scalable algorithms CFD ER fd1. Cust: SSN→ Name, Phone, Str, City, CC# Soft Violation Hard Violation

  15. Data Repairing INTERACTION! FD ID CFD ER TGD Hard Violation • Manyapproachesand techniques[Bohannon SIGMOD ’05] [Cong VLDB ’07] [KolahiICDT ’09] [Fan VLDB ’10] [Beskales VLDB ’10] • No support for mapping • No way to handleourexample • Main-memoryimplementationonly!

  16. Pipeline • Negative Result: There exist scenarios such that pipeline doesn’t return solutions • Even when it works, its quality is usually poor Data Repairing Data Exchange ✔ Mappings ✔ Cleaning Rules ✔ Cleaning Rules ✗ Cleaning Rules ✔ Mappings ✗ Mappings

  17. Pipeline Target Source #1 (Confidence 0.7) Source #2 (Confidence0.5)

  18. Pipeline Target Source #1 (Confidence 0.7) Source #2 (Confidence0.5) PreSolution for TGDs

  19. Pipeline Target Source #1 (Confidence 0.7) Source #2 (Confidence0.5) 123 fd2. Cust: Name, Str, City → SSN

  20. Pipeline Target Source #1 (Confidence 0.7) Source #2 (Confidence0.5) 123

  21. Contributions A Uniform Framework for Type 1 Type 2 Type 3 MD MD Schema Mapping Scenarios Data Repairing Scenarios Mapping and Cleaning Scenarios ST-TGD ID FD FD CFD ER TGD With a fast and general-purpose chase engine FD ID CFD ER

  22. Overview • Motivations and Goals • Semantics • Experimental Results

  23. Llunatic Data Repairing [Geerts et al., VLDB ‘13] • An extension of the data-repairing framework • Let’s see a quick summary… Partial Order Cell Groups LLUNs Upgrades

  24. Llunatic Data Repairing PREFERRED VALUE [Geerts et al., VLDB ‘13] • The Partial Order Π • Elegantway to model preferencerules • Standard preferencerulesOrderingattribute • No order

  25. Llunatic Data Repairing [Geerts et al., VLDB ‘13] • The Partial Order Π • Elegantway to model preferencerules • LLUNs • a new class of symbols • placeholders used to mark conflicts L0

  26. Llunatic Data Repairing [Geerts et al., VLDB ‘13] • The Partial Order Π • Elegantway to model preferencerules • LLUNs • a new class of symbols • Cell Groups • Represent the set of changes Sky • 122-1876 g1 = <122→ {t4.phn, t5.phn} > g2= <Sky→ {t4.str, t5.str} by {tm.strauth}>

  27. Upgrades • Upgrade: an improvementover J, sinceitcontainsbettervaluewrtΠ CardinalityMinimal Update 1 Update 2 Update 3 Update 4 Update 5 • g1 <L0→ {t4.cc, t5.cc}> g3<555→ {t4.cc, t5.cc}> g4 <777→ {t4.ssn}> g2 <L1→ {t4.ssn}> g5 <333→ {t4.cc, t5.cc}> Forward Backward Upgrades J Not an upgrade e1. Cust(ssn, n, ph, c , cc ) , Cust(ssn, n’, ph’, c’, cc’) → cc = cc’

  28. Upgrades over generalization Update 6 g6 <L2→ {allcells}> Update 1 Update 2 • g1 <L0→ {t4.cc, t5.cc}> g2 <L1→ {t4.ssn}> Forward Backward MinimalSolutions J e1. Cust(ssn, n, ph, c , cc ) , Cust(ssn, n’, ph’, c’, cc’) → cc = cc’

  29. ST-TGDs T-TGDs User Inputs + = Non trivial extension!

  30. Mapping and Cleaning Scenario M&C Scenario M={S, Sa,T,Σt,Σe,Π, User} • S: source schema, Sa: authoritative source tablesT: target schema, Σt: TGDs, Σe: EGDs • Π: the partial order specification • User: a partial function to abstract user interaction • Solution: Given M, an instance I of S, and an instance J of T, a solution is an instance J’ such that: • it is a repair, i.e., “I and J’ satisfy Σt∪ Σe” • and “J’ is an upgradeof J according to Π”

  31. How to handle TGDs Target Source #1 (Confidence 0.7) KEY INTUITION • We model it in terms of cell groups and updates m1: MedTreat(ssn, n, p, s, c, i, t, d) → ∃Y3, Y4 : Cust(ssn, n, p, 0.7, s, c, Y3), Treat(ssn, Y4, i, t, d) g1 = <124→ {t8.ssnnew, t9.ssnnew} by {t1.ssn}> • we do not disrupt key – fkey equality in the following g2= <W. Smith→ {t8.namenew} by {t1.name}> ... new cells

  32. UserInputs • In the presence of inconsistencies user inputs are crucial. User may • change the value of a cell group • refuse a cell group • We model user interaction using a partial function over cell groups 555 g1 <123→ {t4.ph, t5.ph}> g2 <L1→ {t4.ssn}>

  33. Non trivial extension • Data cleaning semantics has some nice properties • scenario C always has a solution for <I, J> • the chase always terminates (it never fails) • Adding TGDs and User Inputs • concept of upgrade change significantly • requires to completely rework upgrades

  34. Upgrades • Must take into account many issues • some target cells are “better” than others • source cells may be authoritative • compare instances with different new values • compare instances with different number of tuples • some cells may be changed by users

  35. A Few Results • Conservative extension of the data exchange • Every (core) solution of a data exchange scenario corresponds to a (minimal) solution of its associated mapping scenario, and vice versa • Given a MC scenario, if Σtis a set of weakly-acyclictgds, then the chaseterminates • in essencewemay re-use terminationconditions for data exchange

  36. Overview • Motivations and Goals • Semantics • Experimental Results

  37. Chase Tree Differentorders of application givedifferentresults • Chase algorithmfor chasingegds and tgds J e1, b2 e0, b1 e0, f e1, f e0, b2 e1, b1 R1 R2 R3 R10 R11 R12 e1, b1 e0, b1 e1, f e1, b2 e0, f e0, b2 R4 R5 R6 R13 R14 R15 the e0-e1 sequence the e1-e0 sequence

  38. ScalabilityTechniques • Chase implementationbased on equivalenceclasses • Delta Databases • a representationsystem for chasetrees • Costmanagers • pluggablestrategies to prune the chasetree

  39. Scalability Llunatic-FR-S5 Llunatic-FR-S1 Llunatic-FR-S10 Llunatic-FR-S50 Doctors-MC sec.

  40. Quality of Repairs Llunatic-FR-S1 Pipeline Hospital-MC Norm max. rep-rate(Rep, DBexp) 5k, 6%-10% 10k, 6%-10% 25k, 6%-10%

  41. That’s all Folks!

More Related