A Framework for Data Quality Aware Query Systems

Naiem K. Yeganeh, Mohamed A. Sharaf. School of Information Technology and Electrical Engineering, The University of Queensland.

Presentation Transcript


  1. A Framework for Data Quality Aware Query Systems. Naiem K. Yeganeh, Mohamed A. Sharaf. School of Information Technology and Electrical Engineering, The University of Queensland.

  2. Data Quality Aware Query System
     • Example: Virtual Shop
     • Google product search returns 91345 results for “Cannon Powershot”.
     • Is the user going to check all results to find the best source of information that matches their own requirements?
     • How can we help the user find what they really want?

  3. Data Quality Aware Query System
     • Multiple Sources of Information
     • In a virtual shop, a user query can be answered from various sources of information (virtual shops).
     • The challenge is to find the best source(s) that satisfy the user's requirements on data quality, because Data Quality = Fitness for Use.

  4. Data Quality Aware Query System
     • The following questions should be answered:
     • How to measure the quality of data for each data source? -> DQ Profiling
     • How to model and capture user-specific data quality preferences? -> User Preferences on DQ
     • How to process a data quality aware query and rank the results so that the data sources that best satisfy the user come first? -> DQ Aware Query Processing

  5. Data Quality Profiling: Data Quality Metrics (Dimensions) [Wang 1996]
     • Accuracy (Erroneous): postcode “4107” is typed as “4017”.
     • Consistency (Inconsistent): “ITEE” vs. “Information Technology and Electrical Engineering”.
     • Completeness (Missing): students don’t have to declare a major until graduation, so major is missing in most enrolments.
     • Currency (Obsolete): old phone numbers.
     • Accessibility (Unavailable): server down, privacy concerns.
     • Reliability & Trust (Uncertainty)
     [Wang 1996] R.Y. Wang and D.M. Strong. Beyond accuracy: what data quality means to data consumers. Journal of Management Information Systems, 1996.

  6. Data Quality Profiling
     • Data Quality Profiling is the measurement of DQ metrics.
     • The literature considers DQ Profiling at different granularities:
     • Per Data Source
     • Per Schema Object (relations, attributes)
     • Per Query (subsets of a Schema Object)
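
As an illustration of profiling at the schema-object granularity, the sketch below measures one metric (completeness) per attribute of a single relation. The function name `completeness_profile`, the sample rows, and the rule that `None` or an empty string counts as missing are assumptions made for this example, not part of the presented framework.

```python
from typing import Any, Dict, List, Optional


def completeness_profile(rows: List[Dict[str, Optional[Any]]]) -> Dict[str, float]:
    """Return the fraction of non-missing values for each attribute.

    Assumption for this sketch: None and the empty string count as missing.
    """
    if not rows:
        return {}
    columns = sorted({key for row in rows for key in row})
    profile: Dict[str, float] = {}
    for col in columns:
        present = sum(1 for row in rows if row.get(col) not in (None, ""))
        profile[col] = present / len(rows)
    return profile


# Hypothetical rows from one data source of the virtual shop.
shop_items = [
    {"Title": "PowerShot A", "Price": 199.0},
    {"Title": "PowerShot B", "Price": None},
    {"Title": "", "Price": 249.0},
    {"Title": "PowerShot D", "Price": None},
]

profile = completeness_profile(shop_items)
print(profile)  # {'Price': 0.5, 'Title': 0.75}
```

Other metrics such as accuracy or currency need reference data or timestamps, so they cannot be measured from the rows alone in this way.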

  7. Data Quality Profiling
     • Manual assignment of Information Quality to Data Sources [Naumann 1999]
     • Each data source is assigned a set of IQ scores, which are mostly assigned by users. The figure on the slide shows an example from [Naumann 1999].
     • S1..S5 are data sources; QCAs are quality correspondence assertions, which are assigned by operators. EoY, Rep, … are IQ metrics, i.e. Ease of Understanding, Reputation, etc.
     • QCAs are what we now call a DQ Profile per Data Source.
     [Naumann 1999] Naumann, F. and Leser, U. and Freytag, J.C., Quality-driven integration of heterogeneous information systems, VLDB 1999

  8. Data Quality Profiling
     • Finer-grained data quality metrics [Mecella 2003]
     • A tree where each node represents a schema object, e.g. Data Source, Relation, Column, and Data Quality Metric.
     [Mecella 2003] Mecella, M. and Scannapieco, M. and Virgillito, A. and Baldoni, R. and Catarci, T. and Batini, C., The DaQuinCIS broker: Querying data and their quality in cooperative information systems, Journal of Data Semantics, 2003

  9. Data Quality Profiling
     • A typical Data Quality Profile
     • Data quality measurements are stored per schema object.
     • Such a profile cannot provide valid estimates for selection queries: the quality of information about Canon products on a Sony website may not be good even if the website has high-quality data in general.

  10. Data Quality Profiling
     • Data Quality Profiling for Selection Queries
     • Data quality is different for different selection queries.
     • The naïve approach is to pre-compute each data quality metric for every possible selection condition.
     • The search space is then prohibitively large.
     Figure legend: Brand: C = Canon, S = Sony; Model: S = SLR, N = Normal; Price: H = High, L = Low
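
To make the size of that search space concrete, the sketch below enumerates every conjunctive equality condition over the three attributes in the legend, where each attribute is either fixed to one value or left unconstrained. The helper names `selection_conditions` and `condition_count` are assumptions for this example; range predicates and disjunctions are ignored, so the real space is larger still.

```python
from itertools import product
from math import prod

# Attribute domains taken from the slide's legend.
domains = {
    "Brand": ["Canon", "Sony"],
    "Model": ["SLR", "Normal"],
    "Price": ["High", "Low"],
}


def selection_conditions(domains):
    """Yield each conjunctive equality condition as a dict.

    An attribute absent from the dict is unconstrained.
    """
    names = list(domains)
    choices = [[None] + list(domains[name]) for name in names]
    for combo in product(*choices):
        yield {name: value for name, value in zip(names, combo) if value is not None}


def condition_count(domains):
    """Closed form: product of (domain size + 1) over all attributes."""
    return prod(len(values) + 1 for values in domains.values())


conditions = list(selection_conditions(domains))
print(len(conditions))           # 27
print(condition_count(domains))  # 27

# Ten attributes with ten values each already give 11**10 conditions.
print(11 ** 10)                  # 25937424601
```

Each of these conditions would need its own pre-computed value for every metric, which is why the naïve approach does not scale.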

  11. User Data Quality Preferences
     • Preferences:
     • Preferences as Partial Orders: multi-criteria decision making
     • Preference queries in Database Systems
     • Data Quality Aware SQL
     • Handling Inconsistency

  12. User Data Quality Preferences
     • Preferences:
     • User preference is best modelled as sets of partial orders. [Saaty 1996]
     • E.g. I prefer Tea over Coffee, then I prefer No Sugar over Sugar.
     • Or: I prefer Price over Tax, then I prefer Accuracy over Completeness, etc.
     • I have the following preference matrix about the quality of the Price attribute. [Figure: preference matrix for the Price attribute]
     [Saaty 1996] T.L. Saaty. Multicriteria Decision Making: The Analytic Hierarchy Process: Planning, Priority Setting, Resource Allocation. RWS Publications, 1996.
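
A pairwise preference matrix of this kind can be turned into a weight per metric. The sketch below uses the row geometric-mean method, a common approximation to the principal-eigenvector weights of the Analytic Hierarchy Process. The matrix values and the choice of metrics are hypothetical, since the slide's matrix is not part of the transcript.

```python
from math import prod
from typing import List


def ahp_weights(matrix: List[List[float]]) -> List[float]:
    """Approximate AHP priority weights with normalized row geometric means.

    matrix[i][j] states how strongly item i is preferred over item j,
    and matrix[j][i] is expected to be its reciprocal.
    """
    n = len(matrix)
    geo_means = [prod(row) ** (1.0 / n) for row in matrix]
    total = sum(geo_means)
    return [g / total for g in geo_means]


# Hypothetical preferences about the quality of the Price attribute.
metrics = ["Accuracy", "Completeness", "Currency"]
pairwise = [
    [1.0, 3.0, 5.0],      # Accuracy vs. (Accuracy, Completeness, Currency)
    [1 / 3, 1.0, 3.0],    # Completeness vs. ...
    [1 / 5, 1 / 3, 1.0],  # Currency vs. ...
]

weights = ahp_weights(pairwise)
for name, weight in zip(metrics, weights):
    print(f"{name}: {weight:.3f}")
```

The weights sum to 1 and preserve the stated order, so Accuracy receives the largest share here.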

  13. User Data Quality Preferences
     • Preference queries [Govindarajan 2000]
     • Model user preference as SQL:
       SELECT X WHERE Q PREFER P1, P2
       SELECT X WHERE Q PREFER maximum WRT p
     • Problems:
     • Designed for deductive databases, not suitable for Data Quality Preferences.
     • Does not use the partial-order definition of preference.
     [Govindarajan 2000] K. Govindarajan, B. Jayaraman, and S. Mantha. Preference Queries in Deductive Databases. New Generation Computing, 2000

  14. Data Quality Aware SQL
     • A SQL extension that allows any metric in the data quality profile to be queried as [Column Name.Metric Name] within the SQL query. [Yeganeh 2009]
     • SELECT Title, Price FROM ShopItem WHERE Title.Completeness>0.8
     • SELECT Title, Title.Accuracy, Price FROM ShopItem ORDER BY Price.Accuracy
     • ORDER BY Price.Accuracy models a one-dimensional preference: sources with higher price accuracy are preferred. A two-dimensional preference cannot be expressed intuitively this way.
     [Yeganeh 2009] Yeganeh, N. and Sadiq, S. and Deng, K. and Zhou, X., Data quality aware queries in collaborative information systems, APWeb 2009
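
One way to read the two example queries is as a filter and a ranking over per-source data quality profiles. The sketch below follows that reading. The profile values, the source names S1..S3, and the function `rank_sources` are hypothetical, and the actual semantics in [Yeganeh 2009] may differ.

```python
from typing import Dict, List, Tuple

# A profile maps (column name, metric name) to a score in [0, 1].
Profile = Dict[Tuple[str, str], float]

# Hypothetical data quality profiles for three sources of ShopItem.
profiles: Dict[str, Profile] = {
    "S1": {("Title", "Completeness"): 0.95, ("Price", "Accuracy"): 0.70},
    "S2": {("Title", "Completeness"): 0.60, ("Price", "Accuracy"): 0.99},
    "S3": {("Title", "Completeness"): 0.85, ("Price", "Accuracy"): 0.90},
}


def rank_sources(
    profiles: Dict[str, Profile],
    where: Tuple[str, str],
    threshold: float,
    order_by: Tuple[str, str],
) -> List[str]:
    """Keep sources whose `where` metric exceeds the threshold,
    then order them by the `order_by` metric, best first."""
    eligible = [
        name for name, profile in profiles.items()
        if profile.get(where, 0.0) > threshold
    ]
    return sorted(
        eligible,
        key=lambda name: profiles[name].get(order_by, 0.0),
        reverse=True,
    )


# WHERE Title.Completeness > 0.8 ... ORDER BY Price.Accuracy
ranked = rank_sources(profiles, ("Title", "Completeness"), 0.8, ("Price", "Accuracy"))
print(ranked)  # ['S3', 'S1']
```

Sorting on a single key is exactly the one-dimensional preference the slide describes; combining several metrics requires some form of weighting.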

  15. Data Quality Aware SQL
     • Prioritized preferences (using the concept of preference as a partial order): e.g. from the sources with the highest data quality, sources with high currency of price are prioritized over sources with high completeness of price.
     • Hierarchy clause:
       SELECT Title AS t, Price AS p, [User Comments] AS u FROM ShopItem WHERE ...
       HIERARCHY(ShopItem) p OVER (t,u) 7, u OVER (t) 3
       HIERARCHY(ShopItem.p) p.Currency OVER (p.Completeness) 3
     • Generally: HIERARCHY(a) a.x OVER (a.x',...) n
     • Why a hierarchy? Humans intuitively define preferences as partial orders (pairs), e.g. I prefer coffee to tea.

  16. Data Quality Aware SQL
     • Preferences expressed as partial orders can be inconsistent.
     • For example: I prefer tea to coffee, I prefer coffee to milk, I prefer milk to tea.
     • Visual feedback helps the user define consistent preferences.
     • The size of a circle represents the weight of an item. Color represents the consistency of preferences (e.g. a darker color means possible inconsistency).
     • Inconsistencies are fixed automatically when possible.
     [Yeganeh 2010] Yeganeh, N.K. and Sadiq, S., Avoiding Inconsistency in User Preferences for Data Quality Aware Queries, BIS 2010
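
The tea, coffee, milk example is a cycle in the preference graph. The sketch below detects such a cycle with a depth-first search over "preferred to" edges. This is a generic technique chosen for illustration, not necessarily the method of [Yeganeh 2010], and it catches only strict cycles, not inconsistent weights. The function name `find_preference_cycle` is an assumption.

```python
from typing import Dict, List, Optional, Tuple


def find_preference_cycle(preferences: List[Tuple[str, str]]) -> Optional[List[str]]:
    """Return one cycle as a list of items (first item repeated at the end),
    or None if the (better, worse) pairs are consistent."""
    graph: Dict[str, List[str]] = {}
    for better, worse in preferences:
        graph.setdefault(better, []).append(worse)
        graph.setdefault(worse, [])

    UNVISITED, IN_PROGRESS, DONE = 0, 1, 2
    state = {node: UNVISITED for node in graph}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        state[node] = IN_PROGRESS
        path.append(node)
        for nxt in graph[node]:
            if state[nxt] == IN_PROGRESS:
                # Back edge: the cycle is the path from nxt to the current node.
                return path[path.index(nxt):] + [nxt]
            if state[nxt] == UNVISITED:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        path.pop()
        state[node] = DONE
        return None

    for node in graph:
        if state[node] == UNVISITED:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


inconsistent = [("tea", "coffee"), ("coffee", "milk"), ("milk", "tea")]
consistent = [("tea", "coffee"), ("coffee", "milk")]

print(find_preference_cycle(inconsistent))  # ['tea', 'coffee', 'milk', 'tea']
print(find_preference_cycle(consistent))    # None
```

A reported cycle tells the interface which pairs to highlight so the user can drop or reverse one of them.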

  17. DQ Aware Query Planning
     [Figure: possible join plans for Select * from join A,B,C,D on ... over data sources S1 … Sn; labels: Querying Interface, Communication Infrastructure]
     • Select the query plan that maximizes the quality of the query results.
     • How to estimate the quality of each data source for complex queries, i.e. joins, aggregates, etc.?
     • Considering the quality-of-service metrics of each source becomes necessary in addition to Data Quality.
     • The Data Quality of joins between different data sources is very hard to compute.

  18. Putting it all together
     • DQAQS: Data Quality Aware Query System, a Data Quality Aware Data Integration System
     • Data Quality Services (DQS): services to generate data quality profiles.
     • Data Quality Agents (DQA): workers that manage the generation and maintenance of data quality profiles.
     • Data Quality Aware Mediator (DQM): a mediator that is able to comprehend the Data Quality Aware SQL and orchestrate the query execution (i.e. Data Quality Aware Query Planning).
     [Figure: sources S1, S2, S3 with their DQS and DQA components connected to the DQM through a Network / Cloud]

  19. Questions?
