Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data

Presentation Transcript


  1. Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data • Wenjie Zhang, University of New South Wales & NICTA, Australia • Joint work with Shuxiang Yang, Ying Zhang, Xuemin Lin (UNSW & NICTA)

  2. Outline • Background and Preliminaries • Probabilistic Threshold Range Aggregate Query • Exact query processing • Approximate query processing: Simple Sampling & Double Sampling • Experiments • Conclusion DB@UNSW

  3. Applications • Many applications involve data that is imperfect due to • data randomness and incompleteness • limitations of equipment • delay or loss in data transfer • … • Application areas • Sensor networks • Environmental surveillance • Moving objects • Data cleaning and integration • …

  4. Applications • Sensor Networks: • Sensor readings are often imprecise due to equipment limitations and the periodic reporting mechanism. (Figures are borrowed from Jian et al., SIGMOD08.)

  5. Applications • Mobile Equipment / Moving Objects • A mobile object reports its location only periodically, so its exact location is often uncertain.

  6. Applications • Satellite data

  7. Applications • Data Quality • Social Data Collection: errors and estimation are inherent in customer surveys and sampling

  8. Outline • Background and Preliminaries • Modeling Uncertainty & Related Work • Probabilistic Threshold Range Query • Conclusion

  9. Modeling Uncertainty (cont.) • Uncertain Objects Model • Continuous case: an object U is described using a probability density function (PDF) $f_U$ such that $\int f_U(x)\,dx = 1$ over the object's uncertainty region. E.g., uniform distribution, normal distribution.

  10. Modeling Uncertainty (cont.) • Uncertain Objects Model • Discrete case: an object U is described using a set of instances; each instance $u$ has an occurrence probability $p_u$, with $\sum_{u \in U} p_u = 1$.

  11. Possible World Semantics • Given a set of uncertain objects $\{U_1, U_2, \ldots, U_n\}$, a possible world $W = \{u_1, u_2, \ldots, u_n\}$ is a set of n instances, one instance per uncertain object • The probability of a possible world is $P(W) = \prod_{i=1}^{n} p_{u_i}$ • Let Ω be the set of all possible worlds; clearly, $\sum_{W \in \Omega} P(W) = 1$
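
The possible-world semantics on this slide can be checked by brute force on a tiny discrete dataset. The Python sketch below is illustrative only: the object names, instance coordinates and probabilities are made up, and objects are assumed to be independent, so that a world's probability is the product of its instances' probabilities.

```python
from itertools import product
from math import prod, isclose

# Hypothetical discrete uncertain objects: each is a list of
# (instance, occurrence probability) pairs whose probabilities sum to 1.
objects = {
    "U1": [((1.0, 2.0), 0.5), ((1.5, 2.5), 0.5)],
    "U2": [((4.0, 1.0), 0.2), ((4.2, 1.1), 0.3), ((3.9, 0.8), 0.5)],
}

def possible_worlds(objs):
    """Yield (world, probability) pairs: one instance per object.

    Assumes the objects are independent of each other.
    """
    names = list(objs)
    for choice in product(*(objs[n] for n in names)):
        world = {n: inst for n, (inst, _) in zip(names, choice)}
        prob = prod(p for _, p in choice)
        yield world, prob

worlds = list(possible_worlds(objects))
total = sum(p for _, p in worlds)
print(len(worlds), total)   # number of worlds and their total probability
```

The number of worlds is the product of the instance counts, which is why enumerating possible worlds directly does not scale beyond toy inputs.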

  12. Probabilistic Queries • Query Evaluation [CKP03, CXPSV04, DS04, DS05, DS07, SD07] • Aggregate Queries [BDJR05, MJ07, CG07] • Join Queries [CSP06, AW07] • Top-k Queries [SIC07, YLSK08, RDS07, HJZL08] • Nearest Neighbor Queries [KKR07, CCMC08] • Skyline Queries [PJLY07] • …

  13. Range Query • Uncertain objects, exact query • A probability threshold is often assigned

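  14. Related Work • Range Queries [TCXNKP05, BPS06, AY08]: given a rectangle r and a probability threshold t, find all objects that appear in r with probability at least t. • Appearance probability: the probability that an object lies inside r, e.g. $\int_{U \cap r} f_U(x)\,dx$ for an object U with PDF $f_U$.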

  15. U-tree • Probabilistically Constrained Region (PCR) [TCXNKP05] • Figure: PCR(0.2) and multiple PCRs

  16. Outline • Introduction • Modeling Uncertainty & Related Work • Probabilistic Threshold Range Aggregate Query (PTRA) • Conclusion

  17. Contributions • Formal definition of the PTRA query • aU-Tree structure for exact PTRA query processing • singleSample and doubleSample techniques for approximate answers

  18. Problem Statement • Given a set of uncertain objects and a range query q with probability threshold $p_q$, return the number of uncertain objects whose appearance probability in q is no less than $p_q$

  19. Problem Definition • Assume the threshold is 0.5. If the appearance probability computed for b is greater than 0.5 and that for c is less than 0.5, then the aggregate returned is 2 (a and b).
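
A brute-force version of the PTRA query for the discrete model can make the definition concrete. This is a minimal sketch, not the method proposed in the talk: the function names are hypothetical, and the three objects are invented so that a lies fully inside the query, b mostly inside, and c mostly outside, loosely mirroring the slide's example.

```python
def appearance_probability(instances, rect):
    """Sum the probabilities of the instances that fall inside rect.

    instances: list of ((x, y), p); rect: (xmin, ymin, xmax, ymax).
    """
    xmin, ymin, xmax, ymax = rect
    return sum(p for (x, y), p in instances
               if xmin <= x <= xmax and ymin <= y <= ymax)

def ptra_count(objects, rect, threshold):
    """Count objects whose appearance probability is no less than threshold."""
    return sum(1 for instances in objects.values()
               if appearance_probability(instances, rect) >= threshold)

# Hypothetical data: appearance probabilities are 1.0 (a), 0.6 (b), 0.3 (c).
objects = {
    "a": [((1, 1), 0.5), ((2, 2), 0.5)],
    "b": [((3, 3), 0.6), ((6, 3), 0.4)],
    "c": [((4, 4), 0.3), ((7, 7), 0.7)],
}
query = (0, 0, 5, 5)
print(ptra_count(objects, query, 0.5))
```

Every instance of every object is inspected here, which is exactly the cost that the index-based and sampling-based techniques aim to avoid.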

  20. Exact Query Processing (aU-Tree) • Main idea: add aggregate information to the U-tree • Advantage: the search can stop at an intermediate level if a node is pruned or fully covered by the query • Disadvantage: otherwise, it still needs to drill down to the leaf nodes • For a large portion of uncertain objects, the appearance probability still needs to be computed • This is expensive when each object has a massive number of instances!
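
The stop-early behaviour described on this slide can be sketched with a heavily simplified aggregate tree. This is not the aU-tree itself: the `Node` class and helper names are hypothetical, nodes carry only a bounding rectangle and an object count, and the U-tree's PCR-based pruning is not modelled. The sketch also assumes a threshold in (0, 1], so that a node disjoint from the query contributes nothing and a node fully covered by the query contributes its whole count.

```python
from dataclasses import dataclass, field

def covers(q, r):
    """True if rectangle q = (xmin, ymin, xmax, ymax) fully contains r."""
    return q[0] <= r[0] and q[1] <= r[1] and q[2] >= r[2] and q[3] >= r[3]

def disjoint(q, r):
    """True if rectangles q and r do not intersect."""
    return r[2] < q[0] or r[0] > q[2] or r[3] < q[1] or r[1] > q[3]

def in_rect(pt, q):
    return q[0] <= pt[0] <= q[2] and q[1] <= pt[1] <= q[3]

@dataclass
class Node:
    mbr: tuple                                     # bounds all instances below
    count: int                                     # aggregate: objects below
    children: list = field(default_factory=list)   # internal node
    objects: list = field(default_factory=list)    # leaf: lists of ((x, y), p)

def ptra_exact(node, q, threshold, stats):
    """Count objects under node with appearance probability >= threshold."""
    if disjoint(q, node.mbr):
        return 0                   # pruned: all probabilities are 0
    if covers(q, node.mbr):
        return node.count          # fully covered: all probabilities are 1
    if node.children:
        return sum(ptra_exact(c, q, threshold, stats) for c in node.children)
    total = 0
    for instances in node.objects: # leaf: compute each appearance probability
        stats["computed"] += 1
        if sum(p for pt, p in instances if in_rect(pt, q)) >= threshold:
            total += 1
    return total

# Hypothetical three-leaf tree: one leaf inside, one straddling, one outside.
leaf1 = Node((0, 0, 2, 2), 2, objects=[[((0, 0), 0.5), ((2, 2), 0.5)],
                                       [((1, 1), 1.0)]])
leaf2 = Node((3, 3, 6, 6), 2, objects=[[((3, 3), 0.6), ((6, 6), 0.4)],
                                       [((4, 4), 0.3), ((6, 3), 0.7)]])
leaf3 = Node((8, 8, 9, 9), 1, objects=[[((8, 8), 0.5), ((9, 9), 0.5)]])
root = Node((0, 0, 9, 9), 5, children=[leaf1, leaf2, leaf3])

stats = {"computed": 0}
result = ptra_exact(root, (0, 0, 5, 5), 0.5, stats)
print(result, stats["computed"])
```

In this toy tree only the straddling leaf forces appearance probabilities to be computed; the covered leaf is answered from its stored count and the disjoint leaf is skipped.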

  21. Exact Query Processing (aU-Tree)

  22. singleSample • Sample the instances of each uncertain object. • If m' out of m sampled instances are inside the query region, then the approximate appearance probability is m'/m

  23. singleSample (cont.) • The accuracy guarantee is an immediate application of the Chernoff-Hoeffding bound
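
The m'/m estimator and its link to the Chernoff-Hoeffding bound can be sketched as follows. The exact form of the bound used in the talk is not preserved in this transcript; the sketch assumes the standard two-sided Hoeffding inequality for the mean of m independent 0/1 samples, $P(|m'/m - p| \ge \epsilon) \le 2e^{-2m\epsilon^2}$, which gives a sample size of $m \ge \ln(2/\delta)/(2\epsilon^2)$ for error ε with failure probability δ. Function names and the test object are hypothetical, and instances are assumed equally likely and sampled with replacement.

```python
import math
import random

def hoeffding_sample_size(eps, delta):
    """Smallest m with 2*exp(-2*m*eps^2) <= delta (two-sided Hoeffding)."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps * eps))

def single_sample(instances, rect, m, rng):
    """Estimate the appearance probability as m'/m from m sampled instances.

    instances: list of (x, y) points, assumed equally likely.
    """
    xmin, ymin, xmax, ymax = rect
    hits = 0
    for _ in range(m):
        x, y = rng.choice(instances)   # sample with replacement
        if xmin <= x <= xmax and ymin <= y <= ymax:
            hits += 1
    return hits / m

rng = random.Random(0)
# Hypothetical object: 10,000 instances spread uniformly over the unit square.
instances = [(rng.random(), rng.random()) for _ in range(10_000)]
query = (0.0, 0.0, 0.5, 1.0)           # left half of the square

m = hoeffding_sample_size(eps=0.05, delta=0.01)
estimate = single_sample(instances, query, m, rng)
exact = sum(1 for x, y in instances if x <= 0.5) / len(instances)
print(m, estimate, exact)
```

The required sample size depends only on ε and δ, not on the number of instances per object, which is what makes sampling attractive when each object has many instances.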

  24. doubleSample • singleSample is expensive when there is a massive number of objects! • Idea: sample the uncertain objects as well. • Naive approach: sample objects uniformly from all uncertain objects.
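
The naive variant can be sketched as two nested sampling steps followed by a scale-up. This is an illustrative reading of the slide, not the authors' code: the function names, the sample sizes and the synthetic objects are all assumptions, and instances are again treated as equally likely.

```python
import random

def appearance_estimate(instances, rect, m, rng):
    """Estimate one object's appearance probability from m sampled instances."""
    xmin, ymin, xmax, ymax = rect
    hits = 0
    for _ in range(m):
        x, y = rng.choice(instances)
        if xmin <= x <= xmax and ymin <= y <= ymax:
            hits += 1
    return hits / m

def double_sample_naive(objects, rect, threshold, n_objects, m, rng):
    """Estimate the PTRA count by sampling objects uniformly, then instances."""
    sampled = rng.sample(objects, n_objects)        # objects, no replacement
    qualifying = sum(1 for instances in sampled
                     if appearance_estimate(instances, rect, m, rng) >= threshold)
    return qualifying * len(objects) / n_objects    # scale up to all objects

rng = random.Random(1)
# Hypothetical data: 1,000 objects, each a tight cloud of 50 instances.
objects = []
for _ in range(1000):
    cx, cy = rng.uniform(0, 10), rng.uniform(0, 10)
    objects.append([(cx + rng.uniform(-0.1, 0.1), cy + rng.uniform(-0.1, 0.1))
                    for _ in range(50)])

estimate = double_sample_naive(objects, (0, 0, 5, 10), 0.5,
                               n_objects=100, m=30, rng=rng)
print(estimate)
```

Only 100 of the 1,000 objects are touched here, at the price of a noisier estimate when qualifying objects are unevenly spread.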

  25. doubleSample: Accuracy • Note: assuming that the "appearance probability" of each object follows a uniform distribution means that the objects' spatial locations are uniformly distributed. • The accuracy analysis uses the Chernoff-Hoeffding bound.

  26. doubleSample: Our Approach • Problem: skew! • Aim: select K disjoint groups that cover all objects with the minimum "skew", i.e. the objects in each group follow a "uniform" distribution. (Then sample objects uniformly within each group.) • The optimization problem is NP-hard. • Observations: • Min-skew is a good heuristic for constructing such groups. • The aU-tree groups objects by a principle similar to min-skew.

  27. doubleSample: Our Approach • Step 1: choose K subtrees that cover all objects with the minimum total skew. NP-hard! • Find a level L such that the number of nodes at level L is smaller than K but the number of nodes at level L-1 is larger than K. • Feed the min-skew algorithm with the subtrees at level L. (Note: if the number of nodes at some level L equals K, then these K subtrees are chosen.) • Step 2: sample objects in each subtree. • Step 3: sample instances in each sampled object.
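
Steps 2 and 3 amount to stratified sampling once the K groups are fixed. The sketch below takes the groups as given and does not implement Step 1 (level selection and min-skew partitioning over the aU-tree); the function names, the per-group sampling rate and the two synthetic clusters are assumptions made for illustration.

```python
import random

def appearance_estimate(instances, rect, m, rng):
    """Estimate one object's appearance probability from m sampled instances."""
    xmin, ymin, xmax, ymax = rect
    hits = 0
    for _ in range(m):
        x, y = rng.choice(instances)
        if xmin <= x <= xmax and ymin <= y <= ymax:
            hits += 1
    return hits / m

def double_sample_grouped(groups, rect, threshold, rate, m, rng):
    """Stratified PTRA estimate: sample objects per group, then instances.

    groups: list of groups, each a list of objects (lists of (x, y) points).
    rate: fraction of each group to sample (at least one object per group).
    """
    total = 0.0
    for group in groups:
        if not group:
            continue
        k = max(1, round(rate * len(group)))
        sampled = rng.sample(group, k)                    # Step 2
        qualifying = sum(
            1 for instances in sampled
            if appearance_estimate(instances, rect, m, rng) >= threshold)  # Step 3
        total += qualifying * len(group) / k              # scale up per group
    return total

def make_cluster(n, lo, hi, rng):
    """Hypothetical helper: n objects with centres in [lo, hi]^2, 20 instances each."""
    cluster = []
    for _ in range(n):
        cx, cy = rng.uniform(lo, hi), rng.uniform(lo, hi)
        cluster.append([(cx + rng.uniform(-0.1, 0.1), cy + rng.uniform(-0.1, 0.1))
                        for _ in range(20)])
    return cluster

rng = random.Random(2)
# Skewed data: a dense cluster inside the query, a sparse one outside it.
groups = [make_cluster(900, 1, 4, rng), make_cluster(100, 6, 9, rng)]
estimate = double_sample_grouped(groups, (0, 0, 5, 5), 0.5, rate=0.1, m=20, rng=rng)
print(estimate)
```

Because each group in this toy example is homogeneous with respect to the query, the per-group scale-up recovers the true count of 900, whereas uniform sampling over all 1,000 objects would fluctuate around it; this is the motivation for grouping by low skew.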

  28. Experiments • Algorithms: exact, singleSample, doubleSample • Data sets: • LB: 53k objects in Long Beach County • CA: 62k objects in California • Synthetic aircraft dataset in 3D • 10k instances for each point, following a Uniform or constrained-Gaussian distribution • Setting: C++, P4 2.8 GHz, 2 GB memory, Debian Linux, page size 8K

  29. Efficiency

  30. Accuracy

  31. Accuracy (cont.)

  32. Conclusion • Definition of PTRA • aU-Tree technique • Sampling techniques • Future work: any approach with a theoretical guarantee?

  33. Thanks

  34. Min-Skew technique
