1 / 33

Achieving Data Quality with AJAX

Achieving Data Quality with AJAX. (first version of AJAX designed and developed at INRIA Rocquencourt, France ). Existing technology. Ad-hoc programs written in a programming language like C or Java or using an RDBMS proprietary language Programs difficult to optimize and maintain

rhian
Download Presentation

Achieving Data Quality with AJAX

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Achieving Data Quality with AJAX (first version of AJAX designed and developed at INRIA Rocquencourt, France)

  2. Existing technology • Ad-hoc programs written in a programming language like C or Java or using an RDBMS proprietary language • Programs difficult to optimize and maintain • RDBMS mechanisms for guaranteeing integrity constraints • Do not address important data instance problems • Data transformation scripts using an ETL (Extraction-Transformation-Loading)or data quality tool

  3. App. Domain 1 App. Domain 2 App. Domain 3 Problems of data quality solutions (1) Data cleaning transformations ... The semantics of some data transformations is defined in terms of their implementation algorithms

  4. Clean data Rejected data Cleaning process Dirty Data Problems of data quality solutions (2) There is a lack of interactive facilities to tune a data cleaning application program

  5. Publications(pubKey, title, eventKey, url, volume, number, pages, city, month, year) Authors(authorKey, name) Events(eventKey, name) PubsAuthors(pubKey, authorKey) Data Cleaning & Transformation Motivating example (1) DirtyData(paper:String)

  6. Authors Publications DQua | Dallan Quass AGup | Ashish Gupta JWid | Jennifer Widom ….. QGMW96| Making Views Self-Maintainable for Data Warehousing |PDIS| null | null | null | null | Miami Beach | Florida, USA | 1996 Events PubsAuthors PDIS | Conference on Parallel and Distributed Information Systems Data Cleaning & Transformation QGMW96 | DQua QGMW96 | AGup …. DirtyData [1] Dallan Quass, Ashish Gupta, Inderpal Singh Mumick, and Jennifer Widom. Making Views Self-Maintainable for Data Warehousing. In Proceedings of the Conference on Parallel and Distributed Information Systems. Miami Beach, Florida, USA, 1996 [2] D. Quass, A. Gupta, I. Mumick, J. Widom, Making views self-maintianable for data warehousing, PDIS’95 Motivating example (2)

  7. Modeling a data quality process Authors Duplicate Elimination DirtyTitles... DirtyEvents DirtyAuthors • A data quality process is • modeled by a directed acyclic • graph of data transformations Extraction Standardization Cities Tags Formatting DirtyData

  8. AJAX features • An extensible data quality framework • Logical operators as extensions of relational algebra • Physical execution algorithms • A declarative language for logical operators • SQL extension • A debugger facility for tuning a data cleaning program application • Based on a mechanism of exceptions

  9. AJAX features • An extensible data quality framework • Logical operators as extensions of relational algebra • Physical execution algorithms • A declarative language for logical operators • SQL extension • A debugger facility for tuning a data cleaning program application • Based on a mechanism of exceptions

  10. View Cluster Match Map Merge Apply Logical level: parametric operators • View: arbitrary SQL query • Map: iterator-based one-to-many mapping with arbitrary user-defined functions • Match: iterator-based approximate join • Cluster: uses an arbitrary clustering function • Merge: extends SQL group-by with user-defined aggregate functions • Apply: executes an arbitrary user-defined algorithm

  11. Authors Duplicate Elimination DirtyTitles... DirtyAuthors Extraction Standardization Cities Tags Formatting DirtyData Logical level

  12. Physical level Authors Authors Merge Java Scan Duplicate Elimination Cluster TC Match NL DirtyTitles... DirtyTitles... DirtyAuthors DirtyAuthors Extraction Map Java Scan Standardization Map Java Scan Cities Tags Cities Tags Formatting Map SQL Scan DirtyData DirtyData Logical level

  13. Match • Input: 2relations • Finds data records that correspond to the same real object • Calls distance functions for comparing field values and computing the distance between input tuples • Output: 1 relation containing matching tuples and possibly 1 or 2 relations containing non-matching tuples

  14. Example Authors Merge Cluster MatchAuthors Match DirtyAuthors Duplicate Elimination

  15. Example Authors CREATE MATCH MatchDirtyAuthors FROM DirtyAuthors da1, DirtyAuthors da2 LET distance = editDistance(da1.name, da2.name) WHERE distance < maxDist INTO MatchAuthors Merge Cluster MatchAuthors Match DirtyAuthors Duplicate Elimination

  16. Authors Merge Cluster Match DirtyAuthors Duplicate Elimination Example CREATE MATCH MatchDirtyAuthors FROM DirtyAuthors da1, DirtyAuthors da2 LET distance = editDistance(da1.name, da2.name) WHERE distance < maxDist INTO MatchAuthors Input: DirtyAuthors(authorKey, name) 861|johann christoph freytag 822|jc freytag 819|j freytag 814|j-c freytag Output: MatchAuthors(authorKey1, authorKey2, name1, name2) 861|822|johann christoph freytag| jc freytag 822|814|jc freytag|j-c freytag ... MatchAuthors

  17. Implementation of the match operator  s1 S1,s2  S2 (s1, s2) is a match if editDistance (s1, s2) < maxDist

  18. S1 S2 ... Nested loop editDistance • Very expensive evaluation when handling large amounts of data • Need alternative execution algorithms for the same logical specification

  19. A database solution CREATE TABLE MatchAuthors AS SELECT authorKey1, authorKey2, distance FROM (SELECT a1.authorKey authorKey1, a2.authorKey authorKey2, editDistance (a1.name, a2.name) distance FROM DirtyAuthors a1, DirtyAuthors a2) WHERE distance < maxDist; • No optimization supported for a Cartesian product with external function calls

  20. Window scanning S n

  21. S Window scanning n

  22. S Window scanning n • May loose some matches

  23. length- 1 length length length + 1 editDistance String distance filtering S1 S2 John Smit John Smith Jogn Smith John Smithe maxDist = 1

  24. Annotation-based optimization • The user specifies types of optimization • The system suggests which algorithm to use Ex: CREATE MATCHING MatchDirtyAuthors FROM DirtyAuthors da1, DirtyAuthors da2 LET dist = editDistance(da1.name, da2.name) WHERE dist < maxDist % distance-filtering: map= length; dist = abs % INTO MatchAuthors

  25. AJAX features • An extensible data quality framework • Logical operators as extensions of relational algebra • Physical execution algorithms • A declarative language for logical operators • SQL extension • A debugger facility for tuning a data cleaning program application • Based on a mechanism of exceptions

  26. Declarative specification DEFINE FUNCTIONS AS Choose.uniqueString(OBJECT[]) RETURN STRING THROWS CiteSeerException Generate.generateId(INTEGER) RETURN STRING Normal.removeCitationTags(STRING) RETURN STRING (600) DEFINE ALGORITHMS AS TransitiveClosure SourceClustering(STRING) DEFINE INPUT DATA FLOWS AS TABLE DirtyData (paper STRING (400) ); TABLE City (city STRING (80), citysyn STRING (80) ) KEY city,citysyn; DEFINE TRANSFORMATIONS AS CREATE MAPPING mapKeDiDa FROM DirtyData Dd LET keyKdd = generateId(1) {SELECT keyKdd AS paperKey, Dd.paper AS paper KEY paperKey CONSTRAINT NOT NULL mapKeDiDa.paper }

  27. Graph of data transformations DEFINE FUNCTIONS AS Choose.uniqueString(OBJECT[]) RETURN STRING THROWS CiteSeerException Generate.generateId(INTEGER) RETURN STRING Normal.removeCitationTags(STRING) RETURN STRING (600) DEFINE ALGORITHMS AS TransitiveClosure SourceClustering(STRING) DEFINE INPUT DATA FLOWS AS TABLE DirtyData (paper STRING (400) ); TABLE City (city STRING (80), citysyn STRING (80) ) KEY city,citysyn; DEFINE TRANSFORMATIONS AS CREATE MAPPING mapKeDiDa FROM DirtyData Dd LET keyKdd = generateId(1) {SELECT keyKdd AS paperKey, Dd.paper AS paper KEY paperKey CONSTRAINT NOT NULL mapKeDiDa.paper } Declarative specification

  28. AJAX features • An extensible data quality framework • Logical operators as extensions of relational algebra • Physical execution algorithms • A declarative language for logical operators • SQL extension • A debugger facility for tuning a data cleaning program application • Based on a mechanism of exceptions

  29. Management of exceptions • Problem: to mark tuples not handledby the cleaning criteria of an operator • Solution: to specify the generation of exception tupleswithin a logical operator • exceptions are thrown by external functions • output constraints are violated

  30. Debugger facility • Supports the (backward and forward) data derivation of tuples wrt an operator to debug exceptions • Supports the interactive data modification and, in the future, the incremental execution of logical operators

  31. Debugging exceptions

  32. Architecture

  33. References • Helena Galhardas, Daniela Florescu, Dennis Shasha, Eric Simon, Cristian-Augustin Saita: “Declarative Data Cleaning: Language, Model, and Algorithms”. VLDB 2001: 371-380

More Related