Evaluating Probability Threshold k-Nearest-Neighbor Queries over Uncertain Data

International Conference on Extending Database Technology 2009. Evaluating Probability Threshold k-Nearest-Neighbor Queries over Uncertain Data. Reynold Cheng (University of Hong Kong), Lei Chen (Hong Kong University of Science & Tech), Jinchuan Chen (Hong Kong Polytechnic University)

Presentation Transcript


  1. International Conference on Extending Database Technology 2009 • Evaluating Probability Threshold k-Nearest-Neighbor Queries over Uncertain Data • Reynold Cheng (University of Hong Kong), Lei Chen (Hong Kong University of Science & Tech), Jinchuan Chen (Hong Kong Polytechnic University), Xike Xie (University of Hong Kong) Cheng, Chen, Chen, Xie

  2. Agenda • 1. Introduction • 2. Problem Definition • 3. Basic Solution • 4. Efficient Solution • 5. Results

  3. Data Uncertainty • Inherent in various applications • Location-based services (e.g., using GPS, RFID) [TDRP98, SSDBM99] • Natural habitat monitoring with sensor networks [VLDB04a] • Biomedical and biometric databases [ICDE06, ICDE07]

  4. Attribute Uncertainty Model [TDRP98, ISSD99, VLDB04b] • (Figure: an object's uncertainty region with its pdf) • We represent an uncertainty pdf as a histogram

  5. k-NN Queries • k-NN Query over Precise Data - application in LBS [VLDB03] - natural habitat monitoring system [VLDB04a] - network traffic analysis [ICDCS07] - pattern matching in CAM [VLDB04c] • k-NN over Uncertain Objects - [VLDB08a] ranks objects by the probability that each is the NN of the query point. - [ICDE07a] uses expected distance and does not discuss the probability.

  6. Agenda • 1. Introduction • 2. Problem Definition • 3. Basic Solution • 4. Efficient Solution • 5. Results

  7. Probability Threshold k-Nearest-Neighbor Query (T-k-PNN) • INPUT - A query point q, parameter k, threshold T - A set of n objects with uncertainty regions and pdfs • OUTPUT - A number of k-subsets S, each with p(S) ≥ T - p(S) is the qualification probability of the k-subset S, i.e. the probability that S contains the k nearest neighbors of q

  8. Example of a k-PNN query (k=3) • (Figure: uncertain objects O1 to O8 around the query point q) • Possible 3-subsets: {O1, O2, O3}, {O1, O2, O4}, …

  9. Example of a k-PNN query (k=3) • (Figure: uncertain objects O1 to O8 around the query point q, with the k-bound drawn as a circle) • Without the k-bound: {O1, O2, O3}, {O1, O2, O4}, …, {O6, O7, O8} • With the k-bound: {O1, O2, O3}, {O1, O2, O4}, …, {O4, O5, O6}

  10. k-bound Filtering (k=3) • fk (the k-bound) is the k-th smallest of the objects' maximum distances from q. • (Figure: objects O1 to O8 around q, with bounds f1, f2, f3) • Since min(r7) > f3, O7 cannot be a 3-NN of q, because there are always 3 objects with distances smaller than f3. • We apply k-bound filtering on an index (e.g. R-tree) to prune unqualified objects.
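The k-bound rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes each object is summarized only by its minimum and maximum distance from q, it scans a plain dictionary instead of an R-tree, and the distance values in `example` are made up.

```python
def k_bound(max_dists, k):
    """Return f_k: the k-th smallest of the objects' maximum distances from q."""
    return sorted(max_dists)[k - 1]


def k_bound_filter(objects, k):
    """Keep objects whose minimum distance does not exceed f_k.

    `objects` maps a name to (min_dist, max_dist); this layout is an
    assumption made for the sketch, not something the slides prescribe.
    """
    fk = k_bound([hi for _, hi in objects.values()], k)
    kept = {name for name, (lo, _) in objects.items() if lo <= fk}
    return kept, fk


# Hypothetical distances: O7 starts beyond f_3, so it is pruned.
example = {
    "O1": (1.0, 2.0),
    "O2": (1.5, 3.0),
    "O3": (2.0, 4.0),
    "O4": (3.5, 6.0),
    "O7": (5.0, 7.0),
}
kept, f3 = k_bound_filter(example, k=3)
```

An object survives as long as its closest possible position is within f_k, which matches the slide's argument that k other objects are always at least that close.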

  11. Agenda • 1. Introduction • 2. Problem Definition • 3. Basic Solution • 4. Efficient Solution • 5. Results

  12. Basic solution for a T-k-PNN query (k=3, T=0.1)
      • Step 1: k-bound filtering
      • Step 2: QP calculation for each 3-subset
      • Step 3: Accept S if qp(S) ≥ T
      • (Figure: objects O1 to O6 around q, inside the k-bound)

      3-subset        QP
      {O1, O2, O3}    0.2
      {O1, O2, O4}    0.1
      {O1, O3, O4}    0.1
      {O2, O3, O4}    0.1
      {O1, O2, O5}    0.05
      {O1, O3, O5}    0.05
      {O2, O3, O5}    0.05
      ……

      • Too many k-subsets!
      • Exact QP is expensive to compute!
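To make steps 2 and 3 concrete, here is a rough Python sketch. It deliberately swaps in Monte Carlo sampling for the exact QP computation the slide refers to, and it assumes each object's distance from q is uniform over a known interval. Both choices, and the names `estimate_qp` and `t_k_pnn`, are assumptions for illustration only.

```python
import random
from collections import Counter


def estimate_qp(dist_ranges, k, trials=10_000, seed=0):
    """Estimate qp(S) for every k-subset S that ever shows up as the k-NN set.

    `dist_ranges` maps an object name to (lo, hi); the object's distance
    from q is ASSUMED uniform on [lo, hi] for this sketch.
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(trials):
        # Draw one possible world: a concrete distance for every object.
        sample = {n: rng.uniform(lo, hi) for n, (lo, hi) in dist_ranges.items()}
        nearest = sorted(sample, key=sample.get)[:k]
        counts[frozenset(nearest)] += 1
    return {s: c / trials for s, c in counts.items()}


def t_k_pnn(dist_ranges, k, T, **kwargs):
    """Step 3: keep only the k-subsets whose estimated QP reaches T."""
    qp = estimate_qp(dist_ranges, k, **kwargs)
    return {s: p for s, p in qp.items() if p >= T}
```

Even in this simplified form the cost grows with the number of distinct k-subsets, which is the bottleneck the slide points out.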

  13. Agenda • 1. Introduction • 2. Problem Definition • 3. Basic Solution • 4. Efficient Solution • 5. Results

  14. Efficient Solution Framework (GVR: Generation, Verification, Refinement) • k-subset Generation - 1. k-bound Filtering, producing candidate objects - 2. Probabilistic Candidate Selection, producing k-subsets • k-subset Verification and Refinement - 3. Verification with upper and lower bounds, producing accepted and rejected k-subsets - 4. Refinement of the remaining k-subsets

  15. Probabilistic Candidates Selection • Cutoff probability of Oi: Pr(ri ≤ fk) • (Figure: objects O1 to O6 around q inside the k-bound, with cutoff probabilities O4 = 0.5, O5 = 0.2, O6 = 0.1) • S1 = {O4, O5, O6}: cp(S1) = 0.5 × 0.2 × 0.1 = 0.01 • S2 = {O4, O5}: cp(S2) = 0.5 × 0.2 = 0.1 • Given T = 0.2: since cp(S2) < T, we have qp(S1) ≤ cp(S1) ≤ cp(S2) < T, so S1 can be pruned.
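The arithmetic on this slide can be checked with a short Python sketch. The cutoff probabilities are the ones shown on the slide (O1 to O3 take the value 1 from the following slide); the helper names `cp` and `can_prune` are hypothetical.

```python
from math import prod

# Cutoff probabilities Pr(r_i <= f_k), as given on the slides.
cutoff = {"O1": 1.0, "O2": 1.0, "O3": 1.0, "O4": 0.5, "O5": 0.2, "O6": 0.1}


def cp(subset, cutoff):
    """Product of the members' cutoff probabilities: an upper bound on qp(subset)."""
    return prod(cutoff[o] for o in subset)


def can_prune(subset, cutoff, T):
    """A subset whose cp is below T cannot reach qp >= T, nor can any superset."""
    return cp(subset, cutoff) < T


s1 = {"O4", "O5", "O6"}
s2 = {"O4", "O5"}
```

Because every cutoff probability is at most 1, adding an object can only shrink the product, so pruning S2 also rules out every superset such as S1.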

  16. Probabilistic Candidates Selection (T=0.2, k=3)

      1-subset  CP   | 2-subset  CP   | 3-subset      CP
      {O1}      1    | {O1,O2}   1    | {O1,O2,O3}    1
      {O2}      1    | {O1,O3}   1    | {O1,O2,O4}    0.5
      {O3}      1    | {O1,O4}   0.5  | {O1,O2,O5}    0.2
      {O4}      0.5  | {O1,O5}   0.2  | {O1,O3,O4}    0.5
      {O5}      0.2  | {O2,O3}   1    | {O1,O3,O5}    0.2
      {O6}      0.1  | {O2,O4}   0.5  | {O1,O4,O5}    0.1
                     | {O2,O5}   0.2  | {O2,O3,O4}    0.5
                     | {O3,O4}   0.5  | {O2,O3,O5}    0.2
                     | {O3,O5}   0.2  | {O2,O4,O5}    0.1
                     | {O4,O5}   0.1  | {O3,O4,O5}    0.1
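One plausible way to build these tables level by level is sketched below in Python. It is an assumption about how the selection could be organized, not the paper's algorithm: objects are sorted by descending cutoff probability, each subset is extended only with later objects, and extension stops as soon as the product drops below T.

```python
def generate_candidates(cutoff, k, T):
    """Return [(subset, cp)] for all k-subsets whose cp is at least T."""
    # Stable sort: objects with equal cutoff probability keep their input order.
    names = sorted(cutoff, key=cutoff.get, reverse=True)
    probs = [cutoff[n] for n in names]
    level = [((), 1.0)]  # (indices into `names`, running product)
    for _ in range(k):
        next_level = []
        for idxs, p in level:
            start = idxs[-1] + 1 if idxs else 0
            for j in range(start, len(names)):
                q = p * probs[j]
                if q < T:
                    break  # later objects have even smaller cutoff probabilities
                next_level.append((idxs + (j,), q))
        level = next_level
    return [(tuple(names[i] for i in idxs), p) for idxs, p in level]


cutoff = {"O1": 1.0, "O2": 1.0, "O3": 1.0, "O4": 0.5, "O5": 0.2, "O6": 0.1}
candidates = generate_candidates(cutoff, k=3, T=0.2)
```

With the slide's values this keeps the seven 3-subsets whose CP is 0.2 or more and never materializes the ones with CP 0.1.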

  17. Storage Efficient Compression
      • Subsets are sorted in descending order of their CPs.
      • For subsets that share a common prefix, store the prefix together with the last element of the subset that has the minimum product of cutoff probabilities not less than T.

      Original subsets   CP    Compressed subsets (Size-2 Set)
      {O1,O2}            1
      {O1,O3}            1
      {O1,O4}            0.5
      {O1,O5}            0.2   {O1,O5}
      {O2,O3}            1
      {O2,O4}            0.5
      {O2,O5}            0.2   {O2,O5}
      {O3,O4}            0.5
      {O3,O5}            0.2   {O3,O5}
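The prefix idea can be sketched in Python as follows. This is a guess at the encoding based on the size-2 example above, under the assumption that objects are numbered in descending order of cutoff probability, so that the qualifying last elements for a given prefix form a contiguous run; `compress` and `decompress` are hypothetical names.

```python
def compress(subsets):
    """Keep, for every prefix, only the subset with the largest last element.

    Each subset is a tuple of 1-based object indices in ascending order.
    """
    last = {}
    for s in subsets:
        prefix, tail = s[:-1], s[-1]
        last[prefix] = max(tail, last.get(prefix, tail))
    return sorted(prefix + (tail,) for prefix, tail in last.items())


def decompress(compressed):
    """Expand each stored subset back into the run of subsets it stands for."""
    out = []
    for s in compressed:
        prefix, tail = s[:-1], s[-1]
        first = prefix[-1] + 1 if prefix else 1
        out.extend(prefix + (j,) for j in range(first, tail + 1))
    return out


original = [(1, 2), (1, 3), (1, 4), (1, 5), (2, 3), (2, 4), (2, 5), (3, 4), (3, 5)]
stored = compress(original)
```

Here nine size-2 subsets shrink to three stored entries, matching the compressed column of the slide.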

  18. Storage Efficient Compression (T=0.2, k=3) • The 1-subset, 2-subset and 3-subset CP tables are the same as on slide 16. • Size-1 Set: {O1}, {O2}, {O3}, {O4}, {O5} • Size-2 Set: {O1,O5}, {O2,O5}, {O3,O5}

  19. Seeds Pruning (k=3) • Seeds: o1, o2, o3, with max(r1) = f1, max(r2) = f2, max(r3) = f3 • (Figure: objects O1 to O4 around q, with the bounds f1, f2, f3 and min(r4)) • min(r4) > f2 > f1, so r4 > r2 and r4 > r1 • If o4 belongs to a 3-NN set S, o1 and o2 must also belong to S. • For example, we can prune the set {o1, o3, o4} according to the above rule. No CP calculation is needed. • Can prune more candidate k-sets
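The seed rule lends itself to a small Python check. The sketch below assumes each seed is described by its maximum distance and each object by its minimum distance; the numeric values and the function names are invented for illustration.

```python
def required_seeds(obj_min_dist, seed_bounds):
    """Seeds that are certainly closer to q than an object with this minimum distance."""
    return {s for s, f in seed_bounds.items() if obj_min_dist > f}


def survives(subset, min_dist, seed_bounds):
    """A candidate k-set survives only if every member's required seeds are in it."""
    members = set(subset)
    for o in subset:
        if not required_seeds(min_dist[o], seed_bounds) <= members:
            return False
    return True


# Hypothetical numbers consistent with the slide: min(r4) > f2 > f1.
seed_bounds = {"o1": 2.0, "o2": 3.0, "o3": 5.0}            # f1, f2, f3
min_dist = {"o1": 1.0, "o2": 1.5, "o3": 1.8, "o4": 4.0}
```

With these numbers `survives(("o1", "o3", "o4"), min_dist, seed_bounds)` is false because o4 forces o2 into the set, and no cutoff probability is ever computed.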

  20. Verifiers: Upper and Lower Bounds (T=0.2) • (Figure: candidate k-subsets S1, S2, S3 obtained after PCS pass through a series of verifiers that tighten the upper and lower bounds of their QPs; a classifier then accepts or rejects each subset, or passes it on to incremental refinement.)

  21. Verification and Refinement • Divide the range [min(r1), fk] into a series of partitions. • Build a data structure, i.e. the stair-case model, to store the distance cdf of each object. • Derive the lower and upper bounds of a k-set’s QP based on the stair-case model. • Reject (accept) a k-set once its QP must be lower (larger) than the threshold. • Extended from the probabilistic verifiers in [ICDE08b]
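The accept/reject logic can be sketched independently of how the bounds are derived. In the Python below, each verifier is a hypothetical function returning a (lower, upper) pair for a k-set's QP; the stair-case model itself is not reproduced, and the bound values in the usage note are made up.

```python
def classify(lower, upper, T):
    """Decide a k-set's fate from its QP bounds."""
    if lower >= T:
        return "accept"
    if upper < T:
        return "reject"
    return "refine"


def run_verifiers(subset, verifiers, T):
    """Apply verifiers in order, tightening the bounds until a decision is possible.

    `verifiers` is a list of callables subset -> (lower, upper); they are
    placeholders here, assumed to be ordered from cheapest to most expensive.
    """
    lo, hi = 0.0, 1.0
    for verify in verifiers:
        new_lo, new_hi = verify(subset)
        lo, hi = max(lo, new_lo), min(hi, new_hi)
        label = classify(lo, hi, T)
        if label != "refine":
            return label, (lo, hi)
    return "refine", (lo, hi)
```

For instance, a single verifier reporting bounds (0.0, 0.19) is enough to reject a subset at T = 0.2, while a subset whose bounds stay at (0.1, 0.6) would still need refinement.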

  22. Agenda • 1. Introduction • 2. Problem Definition • 3. Basic Solution • 4. Efficient Solution • 5. Results

  23. Experiment Setup

  24. 1. k-bound Filtering

  25. 2. Performance of GVR

  26. 3. k-subset Generation

  27. 3. k-subset Generation

  28. 4. Verification and Refinement

  29. 5. Time Analysis

  30. 6. Gaussian Distribution

  31. Conclusion • We proposed an efficient evaluation framework for T-k-PNN queries • We proposed various techniques: - k-bound filtering to remove unqualified objects - PCS to reduce the number of k-subsets - verification/refinement methods to avoid exact QP calculation • Future Work - extend the techniques to other queries

  32. Reference
      • [TDRP98] A. P. Sistla, O. Wolfson, S. Chamberlain, and S. Dao, “Querying the uncertain position of moving objects,” in Temporal Databases: Research and Practice, 1998.
      • [SSDBM99] D. Pfoser and C. Jensen, “Capturing the uncertainty of moving-objects representations,” in Proc. SSDBM, 1999.
      • [VLDB04a] A. Deshpande, C. Guestrin, S. Madden, J. Hellerstein, and W. Hong, “Model-driven data acquisition in sensor networks,” in Proc. VLDB, 2004.
      • [ICDE06] C. Böhm, A. Pryakhin, and M. Schubert, “The Gauss-tree: Efficient object identification in databases of probabilistic feature vectors,” in Proc. ICDE, 2006.
      • [ICDE07a] V. Ljosa and A. K. Singh, “APLA: Indexing arbitrary probability distributions,” in Proc. ICDE, 2007.
      • [SIGMOD03] R. Cheng, D. Kalashnikov, and S. Prabhakar, “Evaluating probabilistic queries over imprecise data,” in Proc. ACM SIGMOD, 2003.
      • [ICDE07b] J. Chen and R. Cheng, “Efficient evaluation of imprecise location-dependent queries,” in Proc. ICDE, 2007.
      • [VLDB06a] M. Mokbel, C. Chow, and W. G. Aref, “The new Casper: Query processing for location services without compromising privacy,” in Proc. VLDB, 2006.
      • [TKDE92] D. Barbara, H. Garcia-Molina, and D. Porter, “The management of probabilistic data,” IEEE TKDE, vol. 4, no. 5, 1992.
      • [VLDB04b] N. Dalvi and D. Suciu, “Efficient query evaluation on probabilistic databases,” in Proc. VLDB, 2004.
      • [VLDB06b] P. Agrawal, O. Benjelloun, A. D. Sarma, C. Hayworth, S. Nabar, T. Sugihara, and J. Widom, “Trio: A system for data, uncertainty, and lineage,” in Proc. VLDB, 2006.
      • [VLDB03] G. Iwerks, H. Samet, and K. Smith, “Continuous k-nearest neighbor queries for continuously moving points with updates,” in Proc. VLDB, 2003.
      • [ICDCS07] S. Ganguly, M. Garofalakis, R. Rastogi, and K. Sabnani, “Streaming algorithms for robust, real-time detection of DDoS attacks,” in Proc. ICDCS, 2007.
      • [AKDDM96] U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Advances in Knowledge Discovery and Data Mining. AAAI Press/MIT Press, 1996.
      • [VLDB04c] N. Koudas, B. Ooi, K. Tan, and R. Zhang, “Approximate NN queries on streams with guaranteed error/performance bounds,” in Proc. VLDB, 2004.
      • [VLDB08a] G. Beskales, M. Soliman, and I. Ilyas, “Efficient search for the top-k probable nearest neighbors in uncertain databases,” in Proc. VLDB, 2008.
      • [VLDB06c] O. Benjelloun, A. D. Sarma, A. Halevy, and J. Widom, “ULDBs: Databases with uncertainty and lineage,” in Proc. VLDB, 2006.

  33. Reference
      • [VLDB07a] L. Antova, C. Koch, and D. Olteanu, “Query language support for incomplete information in the MayBMS system,” in Proc. VLDB, 2007.
      • [SIGMOD08a] S. Singh et al., “Orion 2.0: Native support for uncertain data,” in Proc. ACM SIGMOD, 2008.
      • [ICDE08a] S. Singh et al., “Database support for pdf attributes,” in Proc. ICDE, 2008.
      • [TKDE04] R. Cheng, D. V. Kalashnikov, and S. Prabhakar, “Querying imprecise data in moving object environments,” IEEE TKDE, vol. 16, no. 9, Sept. 2004.
      • [DASFAA07] H. Kriegel, P. Kunath, and M. Renz, “Probabilistic nearest-neighbor query on uncertain objects,” in Proc. DASFAA, 2007.
      • [MUD08] Y. Qi, S. Singh, R. Shah, and S. Prabhakar, “Indexing probabilistic nearest-neighbor threshold queries,” in Proc. Workshop on Management of Uncertain Data, 2008.
      • [TKDE08] X. Lian and L. Chen, “Probabilistic group nearest neighbor queries in uncertain databases,” IEEE TKDE, vol. 20, no. 6, 2008.
      • [ICDE08b] R. Cheng, J. Chen, M. Mokbel, and C. Chow, “Probabilistic verifiers: Evaluating constrained nearest-neighbor queries over uncertain data,” in Proc. ICDE, 2008.
      • [VLDB05] Y. Tao, R. Cheng, X. Xiao, W. K. Ngai, B. Kao, and S. Prabhakar, “Indexing multi-dimensional uncertain data with arbitrary probability density functions,” in Proc. VLDB, 2005.
      • [VLDB07b] J. Pei, B. Jiang, X. Lin, and Y. Yuan, “Probabilistic skylines on uncertain data,” in Proc. VLDB, 2007.
      • [SIGMOD08b] X. Lian and L. Chen, “Monochromatic and bichromatic reverse skyline search over uncertain databases,” in Proc. ACM SIGMOD, 2008.
      • [ICDE07c] M. Soliman, I. Ilyas, and K. Chang, “Top-k query processing in uncertain databases,” in Proc. ICDE, 2007.
      • [SIGMOD08c] M. Hua, J. Pei, W. Zhang, and X. Lin, “Ranking queries on uncertain data: A probabilistic threshold approach,” in Proc. ACM SIGMOD, 2008.
      • [VLDB08b] V. Rastogi, D. Suciu, and E. Welbourne, “Access control over uncertain data,” in Proc. VLDB, 2008.
      • [VLDB08c] C. Koch and D. Olteanu, “Conditioning probabilistic databases,” in Proc. VLDB, 2008.
      • [VLDB08d] R. Cheng, J. Chen, and X. Xie, “Cleaning uncertain data with quality guarantees,” in Proc. VLDB, 2008.
      • [SIGMOD84] A. Guttman, “R-trees: A dynamic index structure for spatial searching,” in Proc. ACM SIGMOD, 1984.

  34. Q & A • Thanks!