
Detecting Attribute Dependencies from Query Feedback


Presentation Transcript


  1. IDS Lab. Winter Seminar Detecting Attribute Dependencies from Query Feedback VLDB 2007 Peter J. Haas, Volker Markl (IBM Almaden Research Center, USA), Fabian Hueske (Univ. of Ulm, Germany) Summarized by Jung-yeon Yang Presented by Jung-yeon Yang 2008.01.25

  2. Introduction • Real-world datasets typically exhibit strong and complex dependencies between data attributes. • Discovering this dependency structure is important in a variety of settings. • Commercial database systems currently use knowledge about dependent attributes for statistics configuration. • Approaches to detection and ranking: • Proactive • Reactive

  3. The Problem • Detecting Dependent Attributes • Example: Color and Year are independent if F(Color = c AND Year = y) = F(Color = c) × F(Year = y) for all values c, y (a minimal check is sketched below) • F(P) = fraction of rows in the table that satisfy predicate P • Dependence = “significant” departure from independence • Detection needed for automatic statistics configuration in query optimizers • Which multivariate statistics should we keep? • Need to rank the dependencies (limited space budget) • Other uses include • Schema discovery for data integration • Data mining (dependency diagrams) • Root-cause analysis and system monitoring
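To make the independence condition concrete, here is a minimal Python sketch. The helper name `looks_independent` and the tolerance `eps` are illustrative choices, not part of the paper; the example selectivities are the ones that reappear on slide 7.

```python
# A minimal sketch (not from the paper): predicates p1 and p2 look independent
# if F(p1 AND p2) is close to F(p1) * F(p2). The tolerance `eps` is assumed.
def looks_independent(f_joint: float, f_p1: float, f_p2: float,
                      eps: float = 0.05) -> bool:
    """True if the joint selectivity is close to the product of marginals."""
    expected = f_p1 * f_p2
    return abs(f_joint - expected) <= eps * max(expected, 1e-12)

# Selectivities for Color = 'blue', Year = 2005 taken from slide 7:
print(looks_independent(0.02, 0.20, 0.103))   # True -> no strong dependence
```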

  4. Previous approaches: Proactive • CORDS [IMH+, SIGMOD ’04] • Sample the relation (or view) and compute a contingency table • Compute a (robust) chi-squared statistic • Declare dependency if the chi-squared statistic exceeds a threshold t • Both t and the sample size are chosen using chi-squared theory (an illustrative test is sketched below)
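As a rough illustration of this proactive, CORDS-style check (not the actual CORDS implementation, whose robust statistic, threshold t, and sample size come from chi-squared theory as noted above), one can run a standard chi-squared test on a contingency table built from a sample. The counts and the 0.995 level below are hypothetical.

```python
# Hypothetical sketch: build a contingency table from a sample and declare
# dependency if the chi-squared statistic exceeds a critical value t.
import numpy as np
from scipy.stats import chi2, chi2_contingency

sampled_counts = np.array([[120,  30,  10],    # rows: sampled Color values
                           [ 25, 140,  15],    # cols: sampled Year values
                           [ 10,  20, 130]])

stat, p_value, dof, _ = chi2_contingency(sampled_counts)
t = chi2.ppf(0.995, dof)                       # critical value (level assumed)
print("dependent" if stat > t else "independent")
```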

  5. Previous approaches: Reactive • Focus system resources on interesting attributes • Complement proactive approaches • Can exploit the DB2 feedback warehouse

  6. A Spectrum of Reactive Approaches

  7. Reactive approach • Correlation Analyzer • Use multiple observations (actuals) for each attribute pair • O1 = {(blue, 2005): 0.02, (blue): 0.2, (2005): 0.103} • O2 = {(red, 2006): 0.07, (red): 0.82, (2006): 0.11} • etc. • Computes the ratio F(p1 AND p2) / (F(p1) × F(p2)) for each observation and compares it to 1 • O1: 0.02 / (0.2 × 0.103) ≈ 0.97 • O2: 0.07 / (0.82 × 0.11) ≈ 0.78 (computed in the sketch below) • Declares attribute dependency if two or more observations look dependent • Problems • Ad hoc procedures, wasted information • Unstable: depends on amount and ordering of feedback
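The ratio test applied to each observation can be reproduced directly from the numbers on the slide; the comparison target of 1 follows from the independence condition on slide 3.

```python
# Ratio F(p1 AND p2) / (F(p1) * F(p2)) for the two feedback observations
# above; a value near 1 suggests independence for that observation.
observations = {
    "O1 (blue, 2005)": (0.02, 0.20, 0.103),
    "O2 (red, 2006)":  (0.07, 0.82, 0.11),
}
for name, (f_joint, f1, f2) in observations.items():
    print(f"{name}: ratio = {f_joint / (f1 * f2):.2f}")
# O1: ratio = 0.97   (close to 1)
# O2: ratio = 0.78   (farther from 1)
```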

  8. New Reactive approach • Dependency Discovery • Like CORDS, but uses incomplete contingency table with exact entries • Critical value u from extension of classical chi-squared theory • Normalize HM to get ranking metric

  9. New Reactive approach • The HM Statistic • Similar to chi-squared • Choosing the Threshold u (an illustrative sketch of the idea follows below)
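The slide only names the HM statistic and its threshold u; their exact formulas appear in the paper and are not reproduced here. Purely as an illustration of the idea, a chi-squared-style sum of deviations from independence, restricted to the cells for which feedback is available, might look like the following. The function name, scaling, and inputs are all assumptions, not the paper's definition.

```python
# Rough illustrative sketch only -- NOT the paper's HM formula. It accumulates
# scaled squared deviations from independence over the observed cells, in the
# spirit of a chi-squared statistic on an incomplete contingency table.
def hm_like_statistic(observations, n_rows):
    """observations: iterable of (f_joint, f_marg1, f_marg2) selectivities."""
    stat = 0.0
    for f_joint, f1, f2 in observations:
        expected = f1 * f2                       # value under independence
        stat += n_rows * (f_joint - expected) ** 2 / max(expected, 1e-12)
    return stat

# Using the two observations from slide 7 and a hypothetical 1M-row table:
print(hm_like_statistic([(0.02, 0.20, 0.103), (0.07, 0.82, 0.11)], 1_000_000))
```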

  10. New Reactive approach • Missing Feedback • Assume a (rough) upper bound on the relative error of each estimate • Can be obtained from feedback-warehouse records • Handling Inconsistency • Problem: no full multivariate frequency distribution is consistent with the feedback • Records collected at different time points • Inserts/deletes/updates occur between feedback observations • Solution method 1: use timestamps to resolve conflicts • Solution method 2: linear programming • Obtain the minimal adjustment of frequencies needed for consistency (a toy LP is sketched below)
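For the second solution method, a toy version of the linear-programming idea follows. The constraint set here (a joint selectivity cannot exceed either of its marginals, and all selectivities stay nonnegative) is a simplified stand-in for the paper's full consistency conditions; splitting each adjustment into positive and negative parts keeps the total-absolute-adjustment objective linear. All numbers are illustrative.

```python
# Toy LP: find the minimal total adjustment that makes three inconsistent
# feedback selectivities mutually consistent. Constraints/numbers are assumed.
import numpy as np
from scipy.optimize import linprog

f12, f1, f2 = 0.30, 0.25, 0.40      # joint > marginal f1 -> inconsistent
# Variables [p12, n12, p1, n1, p2, n2]; adjusted value = f + p - n, p, n >= 0.
c = np.ones(6)                       # minimize sum of absolute adjustments
A_ub = np.array([
    [ 1, -1, -1,  1,  0,  0],        # adjusted f12 <= adjusted f1
    [ 1, -1,  0,  0, -1,  1],        # adjusted f12 <= adjusted f2
    [-1,  1,  0,  0,  0,  0],        # adjusted f12 >= 0
    [ 0,  0, -1,  1,  0,  0],        # adjusted f1  >= 0
    [ 0,  0,  0,  0, -1,  1],        # adjusted f2  >= 0
])
b_ub = np.array([f1 - f12, f2 - f12, f12, f1, f2])
res = linprog(c, A_ub=A_ub, b_ub=b_ub)          # default bounds are [0, inf)
adjust = res.x[0::2] - res.x[1::2]
print("adjusted selectivities:", np.array([f12, f1, f2]) + adjust)
```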

  11. New Reactive approach • Ranking Attribute Pairs • Heuristic normalizations HM / z, where z is one of: • Table cardinality • Minimal number of distinct values • Degrees of freedom of the chi-squared distribution • 0.995 quantile of the chi-squared distribution (illustrated below)
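For the last normalization in the list, a one-line sketch: dividing HM by the 0.995 quantile of a chi-squared distribution with the pair's degrees of freedom (the usual (d1 − 1)(d2 − 1) for a d1 × d2 contingency table) makes scores for pairs with different numbers of distinct values comparable. The HM value and distinct-value counts below are made up.

```python
# Hypothetical example of the quantile normalization HM / z with
# z = 0.995-quantile of chi-squared with (d1 - 1)(d2 - 1) degrees of freedom.
from scipy.stats import chi2

hm = 412.7                     # assumed HM value for some attribute pair
d1, d2 = 7, 7                  # assumed numbers of distinct values
z = chi2.ppf(0.995, (d1 - 1) * (d2 - 1))
print(round(hm / z, 2))        # normalized ranking score
```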

  12. Experiment • Normalization Constants

  13. Experiment • Ranking vs Amount of Feedback

  14. Experiment • Dependency Measure vs Amount of Feedback

  15. Experiment • Execution Time

  16. Conclusions • Query feedback is an effective way to detect dependence between attributes • A chi-squared extension implements the detection • Attributes can be in multiple tables • Effective ranking methods • Practical solutions for handling inconsistent or missing feedback • Acceptable performance using sampling and incremental maintenance
