1 / 14

Data Cleaning and Transformation – Instance conflict resolution

Data Cleaning and Transformation – Instance conflict resolution. Helena Galhardas DEI IST

marty
Download Presentation

Data Cleaning and Transformation – Instance conflict resolution

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Data Cleaning and Transformation – Instance conflict resolution Helena Galhardas DEI IST (based on the slides: “A Survey of Data Quality Issues in Cooperative Information Systems”, Carlo Batini, Tiziana Catarci, Monica Scannapieco, 23rd International Conference on Conceptual Modelling (ER 2004))

  2. Instance conflicts resolution & merging • Occur due to poor data quality • Errors exist when collecting or entering data • Sources are not updated • Three types: • representation conflicts, e.g. dollar vs. euro • key equivalence conflicts, i.e. same real world objects with different identifiers • attribute value conflicts, i.e. instances corresponding to same real world objects and sharing an equivalent key, differ on other attributes

  3. Example Attribute conflicts Key conflicts

  4. Techniques • At design time or query time, even if actual conflicts occur at query time • Key conflicts require object identification techniques • There are several techniques for dealing with attribute conflicts

  5. General approach for solving attribute-level conflicts • Declaratively specify how to deal with conflicts: • Set of conflict resolution functions • E.g., count(), min(), max(), random(), etc • Set of strategies to deal with conflicts • Correspond to diff. Tolerance degrees • Query model that takes into account posible conflicts directly • Use SQL directly or extensions to it.

  6. Conflict resolution and result merging: list of proposals • SQL-based conflict resolution (Naumann et al 2002) • Aurora (Ozsu et al 1999) • Fan et al 2001 • FraSQL-based conflict resolution (Sattler et al 2003) • FusionPlex (Motro et al 2004) • OOra (Lim et al 1998) • DaQuinCIS (Scannapieco 2004) • (Wang et al 2000) • ConQuer (Fuxman 2005) • (Dayal 1985)

  7. Conflict resolution and result merging: list of proposals • SQL-based conflict resolution (Naumann et al 2002) • Aurora (Ozsu et al 1999) • Fan et al 2001 • FraSQL-based conflict resolution (Sattler et al 2003) • FusionPlex (Motro et al 2004) • OOra (Lim et al 1998) • DaQuinCIS (Scannapieco 2004) • (Wang et al 2000) • ConQuer (Fuxman 2005) • (Dayal 1985)

  8. Declarative (SQL-based) data merging with conflicts resolution [Naumann 2002] • Problem: Merging data in an integrated database or in a query against multiple sources • Solution: • proposal of a set of resolution functions (e.g. MAX, MIN, RANDOM; GROUP, AVG, LONGEST) • declarative merging of relational data sources by common queries through: • Grouping queries • Join queries (divide et impera) • nested joining (one at a time, hierarchical)

  9. Declarative (SQL-based) data merging with conflicts resolution [Naumann 2002] • Problem: Merging data in an integrated database or in a query against multiple sources • Solution: • proposal of a set of resolution functions (e.g. MAX, MIN, RANDOM; GROUP, AVG, LONGEST) • declarative merging of relational data sources by common queries through: • Grouping queries • Join queries(divide et impera) • nested joining (one at a time, hierarchical)

  10. Grouping queries SELECT isbn, LONGEST(title), MIN(price) FROM books GROUP BY isbn If multiple sources involved with potential duplicates: SELECT books.isbn,LONGEST(books.title), MIN(books.price) FROM ( SELECT * FROM books1 UNION SELECT * FROM books2) AS books GROUP BY books.isbn Advantages: good performance Inconvenient: user-defined aggregate functions not supported by most RDBMS

  11. Join queries (1) • Dismantle a query result that merges multiple sources into different parts, resolve conflicts within each part, and union them to a final result. • Partitions the union of the two sources into : • The intersection of the two: Query 1 SELECT books1.isbn, LONGEST(books1.title, books2.title), MIN(books1.price, books2.price) FROM books1, books2 WHERE books1.isbn = books2.isbn

  12. Join queries (2) • The part of books 1 that does not intersect with books2 (Query 2a or 2b) and vice-versa Query 2a SELECT isbn, title, price FROM books1 WHERE books1.isbn NOT IN (SELECT books1.isbn, FROM books1, books2 WHERE books1.isbn = books2.isbn) ) Advantages: Use of built-in scalar functions available by RDBMS Query 2b SELECT isbn, title, price FROM books1 WHERE books1.isbn NOT IN (SELECT books2.isbn, FROM books2 ) Inconvs: Length and complexity of queries

  13. Advantages and inconvenients • Pros: the proposal considers the usage of SQL, thus exploiting capabilities of existing database systems, rather than proposing proprietary solutions • Cons: limited applicability of conflict resolution functions that often need to be domain-specific

  14. References • “Data Quality: Concepts, Methodologies and Techniques”, C. Batini and M. Scannapieco, Springer-Verlag, 2006. • “Declarative data merging with conflict resolution”, F. Naumann and M. Haeussler, IQ2002.

More Related