
The Power-Method: A Comprehensive Estimation Technique for Multi-Dimensional Queries



Presentation Transcript


  1. The Power-Method: A Comprehensive Estimation Technique for Multi-Dimensional Queries Yufei Tao U. Hong Kong Christos Faloutsos CMU Dimitris Papadias Hong Kong UST

  2. Roadmap • Problem – motivation • Survey • Proposed method – main idea • Proposed method – details • Experiments • Conclusions

  3. Target query types DB = set of m-dimensional points. • Range search (RS) • k nearest neighbor (KNN) • Regional distance (self-) join (RDJ) • in Louisiana, find all pairs of music stores closer than 1 mi to each other

  4. Target problem Estimate • Query selectivity • Query (I/O) cost • for any Lp metric • using a single method

  5. Target Problem • for any Lp metric • using a single method

  6. Roadmap • Problem – motivation • Survey • Proposed method – main idea • Proposed method – details • Experiments • Conclusions

  7. Older query estimation approaches • Vast literature • Sampling, kernel estimation, singular value decomposition, compressed histograms, sketches, maximal independence, Euler formula, etc. • BUT: They target specific cases (mostly range search selectivity under the L∞ norm), and their extensions to other problems are unclear

  8. Main competitors • Local method • Representative methods: Histograms • Global method • Provides a single estimate corresponding to the average selectivity/cost of all queries, independently of their locations • Representative methods: Fractal and power law

  9. Rationale and problems of histograms • Partition the data space into a set of buckets and assume (local) uniformity • Problems • uniformity • tricky/slow estimations, for all but the L∞ norm

  10. Roadmap • Problem – motivation • Survey • Proposed method – main idea • Proposed method – details • Experiments • Conclusions

  11. Inherent defect of histograms • Density trap – what is the density in the vicinity of q? • diameter=10: 10/100 = 0.1 • diameter=100: 100/10,000 = 0.01 • Q: What is going on?

  12. Inherent defect of histograms • Density trap – what is the density in the vicinity of q? • diameter=10: 10/100 = 0.1 • diameter=100: 100/10,000 = 0.01 • Q: What is going on? • A: we ask a silly question: ~ “what is the area of a line?”

  13. “Density Trap” • Not caused by a mathematical oddity like the Hilbert curve, but by a line, a perfectly behaving Euclidean object! • This ‘trap’ will appear for any non-uniform dataset • Almost ALL real point-sets are non-uniform -> the trap is real
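The trap from slides 11–12 is easy to reproduce. A minimal sketch (the point set and square sizes are made up for illustration): points spaced one unit apart along a line; the measured “density” around a query point drops roughly 10× every time the measuring square grows 10×, so no single density value describes the neighborhood of q.

```python
# Points spaced 1 unit apart along the x-axis; query point q at the origin.
points = [(float(x), 0.0) for x in range(-500, 500)]
q = (0.0, 0.0)

def density(side):
    """Count the points inside an axis-aligned square of the given side,
    centered at q, and divide by the square's area."""
    half = side / 2.0
    count = sum(1 for (x, y) in points
                if abs(x - q[0]) <= half and abs(y - q[1]) <= half)
    return count / (side * side)

# density(10)  -> 11 points / area 100  = 0.11
# density(100) -> 101 points / area 10,000 = 0.0101
```

The “density” shrinks without bound as the square grows, which is exactly the slide’s point: for a 1-dimensional object embedded in 2-d space, count/area is the wrong question to ask.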

  14. “Density Trap” • In short: ‘density’ (= count / area) is meaningless • What should we do instead?

  15. “Density Trap” • In short: ‘density’ (= count / area) is meaningless • What should we do instead? • A: plot log(count_of_neighbors) vs log(area)

  16. Local power law • In more detail, the ‘local power law’: nb_p(r) = c_p · r^(n_p) • nb_p(r): # neighbors of point p, within radius r • c_p: ‘local constant’ • n_p: ‘local exponent’ (= local intrinsic dimensionality)

  17. Local power law • Intuitively: to avoid the ‘density trap’, use • n_p: the local intrinsic dimensionality • instead of density

  18. Does LPL make sense? • For point ‘q’: LPL gives nb_q(r) = constant · r^1 (no need for ‘density’, nor uniformity)

  19. Local power law and Lx • if a point obeys the LPL under L∞, ditto for any other Lx metric, with the same ‘local exponent’ • -> LPL works easily, for ANY Lx metric

  20. Examples • [Figure: log-log plot of #neighbors(≤ r) vs. radius for points p1 and p2] • p1 has a higher ‘local exponent’ = ‘local intrinsic dimensionality’ than p2

  21. Roadmap • Problem – motivation • Survey • Proposed method – main idea • Proposed method – details • Experiments • Conclusions

  22. Proposed method • Main idea: if we know (or can approximate) the cp and np of every point p, we can solve all the problems:

  23. Target Problem • for any Lp metric • using a single method

  24. Target Problem • for any Lp metric (Lemma3.2) • using a single method

  25. Theoretical results • Interesting observation (Thm 3.4): the cost of a kNN query q depends • only on the ‘local exponent’ • and NOT on the ‘local constant’ • nor on the cardinality of the dataset
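The kNN distance itself follows directly by inverting the local power law: solving c_p · r^(n_p) = k for r gives the estimated distance to the k-th nearest neighbor. (The cost result above is Thm 3.4 in the paper; this sketch only shows the inversion step.)

```python
def knn_radius(c_p, n_p, k):
    """Solve the local power law c_p * r**n_p = k for r:
    the LPL estimate of the distance from a point to its
    k-th nearest neighbor, r_k = (k / c_p) ** (1 / n_p)."""
    return (k / c_p) ** (1.0 / n_p)

# e.g. on a line with c_p = 2, n_p = 1: the 10th neighbor lies at radius 5
```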

  26. Implementation • Given a query point q, we need its local exponent and constants to perform estimation • but: too expensive to store, for every point. • Q: What to do?

  27. Implementation • Given a query point q, we need its local exponent and constants to perform estimation • but: too expensive to store, for every point. • Q: What to do? • A: exploit locality:

  28. Implementation • nearby points: usually have similar local constants and exponents. Thus, one solution: • ‘anchors’: pre-compute the LPLaw for a set of representative points (anchors) – use nearest ‘anchor’ to q
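The anchor scheme above can be sketched as follows. This is an illustrative mock-up, not the paper’s implementation: the function names, the (point, c, n) anchor representation, and the use of L∞ for the anchor lookup are assumptions for the example.

```python
def nearest_anchor(anchors, q):
    """anchors: list of (point, c, n) triples with precomputed LPL
    coefficients; return the anchor closest to q under L-infinity."""
    return min(anchors, key=lambda a: max(abs(x - y) for x, y in zip(a[0], q)))

def estimate_selectivity(anchors, q, r, N):
    """Range-search selectivity of a radius-r query around q, borrowing
    the LPL coefficients of q's nearest anchor:
    sel ~= nb_q(r) / N = c * r**n / N, where N = dataset cardinality."""
    _, c, n = nearest_anchor(anchors, q)
    return c * (r ** n) / N
```

The locality assumption does the work here: because nearby points tend to share similar local constants and exponents, the nearest anchor’s coefficients stand in for those of the (unstored) query point.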

  29. Implementation • choose anchors: with sampling, DBS, or any other method.

  30. Implementation • (In addition to ‘anchors’, we also tried to use ‘patches’ of near-constant cp and np – it gave similar accuracy, for more complicated implementation)

  31. Experiments - Settings • Datasets • SC: 40k points representing the coastlines of Scandinavia • LB: 53k points corresponding to locations in Long Beach county • Structure: R*-tree • Compare the Power method to • Minskew • Global method (fractal)

  32. Experiments - Settings • The LPLaw coefficients of each anchor point are computed using L∞ 0.05-neighborhoods • Queries: biased (following the data distribution) • A query workload contains 500 queries • We report the average error Σ_i |act_i − est_i| / Σ_i act_i
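The reported error metric is a one-liner; a sketch for concreteness:

```python
def avg_relative_error(act, est):
    """Workload error as reported in the experiments:
    sum_i |act_i - est_i| / sum_i act_i, over all queries i."""
    return sum(abs(a - e) for a, e in zip(act, est)) / sum(act)

# e.g. actuals [10, 20] vs estimates [8, 24] -> (2 + 4) / 30 = 0.2
```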

  33. Target Problem • for any Lp metric (Lemma3.2) • using a single method

  34. Range search selectivity • the LPL method wins

  35. Target Problem • for any Lp metric (Lemma3.2) • using a single method

  36. Regional distance join selectivity • No known global method in this case • The LPL method wins, by a higher margin

  37. Target Problem • for any Lp metric (Lemma3.2) • using a single method

  38. Range search query cost

  39. k nearest neighbor cost

  40. Regional distance join cost

  41. Conclusions • We spotted the “density trap” problem of the local uniformity assumption (<- histograms) • we showed how to resolve it, using the ‘local intrinsic dimension’ instead (-> ‘Local Power Law’) • and we solved all posed problems:

  42. Conclusions – cont’d • for any Lp metric • using a single method

  43. Conclusions – cont’d • for any Lp metric (Lemma3.2) • using a single method (LPL & ‘anchors’)
