Evaluating Probability Threshold k-Nearest-Neighbor Queries over Uncertain Data

International Conference on Extending Database Technology 2009
Evaluating Probability Threshold k-Nearest-Neighbor Queries over Uncertain Data
Reynold Cheng (University of Hong Kong)
Lei Chen (Hong Kong University of Science & Tech)
Jinchuan Chen (Hong Kong Polytechnic University)
Xike Xie (University of Hong Kong)
Cheng, Chen, Chen, Xie

Agenda
• 1. Introduction
• 2. Problem Definition
• 3. Basic Solution
• 4. Efficient Solution
• 5. Results

Data Uncertainty
• Inherent in various applications
• Location-based services (e.g., using GPS, RFID) [TDRP98, SSDBM99]
• Natural habitat monitoring with sensor networks [VLDB04a]
• Biomedical and biometric databases [ICDE06, ICDE07]

Attribute Uncertainty Model [TDRP98, ISSD99, VLDB04b]
[Figure: an uncertainty region and its probability density function (pdf)]
We represent an uncertainty pdf as a histogram.

k-NN Queries
• k-NN Query over Precise Data
  - application in LBS [VLDB03]
  - natural habitat monitoring system [VLDB04a]
  - network traffic analysis [ICDCS07]
  - pattern matching in CAM [VLDB04c]
• k-NN over Uncertain Objects
  - [VLDB08a] ranks objects by the probability that each is the NN of the query point.
  - [ICDE07a] uses expected distance and does not discuss the probability.

Probability Threshold k-Nearest-Neighbor Query (T-k-PNN)
INPUT
• A query point q, parameter k, threshold T
• A set of n objects with uncertainty regions and pdfs
OUTPUT
• The k-subsets S whose qualification probability satisfies qp(S) ≥ T
• qp(S) is the qualification probability of the k-subset S, i.e., the probability that S is exactly the set of the k nearest neighbors of q

Example of a k-PNN query (k=3)
[Figure: query point q surrounded by uncertain objects O1 to O8; example 3-subsets {O1, O2, O3} and {O1, O2, O4}]

Example of a k-PNN query (k=3)
[Figure: query point q, uncertain objects O1 to O8, and the k-bound around q]
• Candidate 3-subsets without the k-bound: {O1, O2, O3}, {O1, O2, O4}, …, {O6, O7, O8}
• Candidate 3-subsets with the k-bound: {O1, O2, O3}, {O1, O2, O4}, …, {O4, O5, O6}

k-bound Filtering (k=3)
• fk (k-bound): the k-th smallest of the objects' maximum distances from q
• Since min(r7) > f3, O7 cannot be one of the 3 nearest neighbors of q, because there are always 3 objects with distances smaller than f3.
• We apply k-bound filtering on an index (e.g., R-tree) to prune unqualified objects.
[Figure: query point q with the bounds f1, f2, f3 (the k-bound) and objects O1 to O8]
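The filtering rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes each object is already summarized by a hypothetical (min distance, max distance) pair measured from q, it scans all objects linearly instead of using an R-tree, and the distance values are invented for the example.

```python
import heapq

def k_bound(ranges, k):
    """Return f_k, the k-th smallest maximum distance among all objects."""
    return heapq.nsmallest(k, (mx for _, mx in ranges.values()))[-1]

def k_bound_filter(ranges, k):
    """Drop every object whose minimum distance exceeds f_k."""
    fk = k_bound(ranges, k)
    kept = {name: r for name, r in ranges.items() if r[0] <= fk}
    return kept, fk

# Hypothetical (min, max) distances from q, invented for illustration.
ranges = {
    "O1": (0.5, 1.0), "O2": (1.0, 2.0), "O3": (1.5, 3.0),
    "O4": (2.0, 4.0), "O7": (3.5, 5.0),
}
candidates, f3 = k_bound_filter(ranges, k=3)
print(f3, sorted(candidates))  # O7 is pruned because min(r7) = 3.5 > f3 = 3.0
```

A linear scan keeps the sketch short; with a spatial index the same test would be applied to index nodes so that whole subtrees can be skipped.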

Basic solution for a T-k-PNN query (k=3, T=0.1)
Step 1: k-bound filtering
Step 2: QP calculation
Step 3: Accept S if qp(S) ≥ T

3-subset        QP
{O1, O2, O3}    0.2
{O1, O2, O4}    0.1
{O1, O3, O4}    0.1
{O2, O3, O4}    0.1
{O1, O2, O5}    0.05
{O1, O3, O5}    0.05
{O2, O3, O5}    0.05
……

• Too many k-subsets!
• Exact QP is expensive to compute!
[Figure: query point q, the k-bound, and the remaining objects O1 to O6]
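To make the cost of the basic solution concrete, the sketch below enumerates every k-subset and estimates its QP. It is only an illustration under loud assumptions: the exact QP computation from the talk is replaced by a plain Monte Carlo estimate, each object's distance from q is assumed to be uniform over a hypothetical (min, max) range, and all names and values are invented.

```python
import itertools
import random

def estimate_qp(ranges, k, trials=20000, seed=0):
    """Monte Carlo estimate of qp(S) for every k-subset S (not the exact QP)."""
    rng = random.Random(seed)
    names = sorted(ranges)
    counts = {s: 0 for s in itertools.combinations(names, k)}
    for _ in range(trials):
        # Draw one possible distance per object (uniform is an assumption).
        dist = {n: rng.uniform(*ranges[n]) for n in names}
        nearest = tuple(sorted(sorted(names, key=dist.get)[:k]))
        counts[nearest] += 1
    return {s: c / trials for s, c in counts.items()}

def basic_t_k_pnn(ranges, k, threshold, **kwargs):
    """Accept every k-subset whose estimated QP reaches the threshold."""
    qp = estimate_qp(ranges, k, **kwargs)
    return {s: p for s, p in qp.items() if p >= threshold}

# Hypothetical distance ranges: O1 and O2 are always closer than O3.
ranges = {"O1": (0.0, 1.0), "O2": (0.0, 1.0), "O3": (2.0, 3.0)}
answers = basic_t_k_pnn(ranges, k=2, threshold=0.1)
print(answers)
```

The dictionary of counts has one entry per k-subset, so its size grows as C(n, k); that growth is the "too many k-subsets" problem the slide points out.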

Efficient Solution Framework (GVR: k-subset Generation, Verification, Refinement)
1. k-bound Filtering: produces the candidate objects
2. Probabilistic Candidate Selection (k-subset generation): produces the candidate k-subsets
3. k-subset Verification: uses upper and lower bounds to separate accepted and rejected k-subsets
4. Refinement: processes the k-subsets that verification leaves undecided

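Probabilistic Candidates Selection
• Cutoff probability of Oi: Pr(ri ≤ fk)
• Cutoff probabilities in the example: O4 = 0.5, O5 = 0.2, O6 = 0.1
• S1 = {O4, O5, O6}: cp(S1) = 0.5 * 0.2 * 0.1 = 0.01
• S2 = {O4, O5}: cp(S2) = 0.5 * 0.2 = 0.1
• Given T = 0.2, if cp(S2) < T, then qp(S1) ≤ cp(S1) ≤ cp(S2) < T, because S2 is a subset of S1. S1 can be pruned.
[Figure: query point q, the k-bound, and objects O1 to O6 with their cutoff probabilities]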
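The cutoff probability and its product over a subset can be sketched as follows. This is an assumption-laden illustration: the distance pdf is taken to be a histogram of hypothetical (lo, hi, mass) bars, mass is assumed to be spread uniformly inside each bar, and the function names are invented here rather than taken from the talk.

```python
from math import prod

def cutoff_probability(hist, fk):
    """Pr(r <= fk) for a distance pdf given as histogram bars (lo, hi, mass)."""
    total = 0.0
    for lo, hi, mass in hist:
        if fk >= hi:
            total += mass                          # bar lies fully inside the k-bound
        elif fk > lo:
            total += mass * (fk - lo) / (hi - lo)  # uniform within a bar (assumption)
    return total

def subset_cp(cps, subset):
    """Product of the members' cutoff probabilities, an upper bound on qp."""
    return prod(cps[o] for o in subset)

# Hypothetical distance histogram: half of the mass lies within fk = 3.0.
hist_o4 = [(2.0, 3.0, 0.5), (3.0, 4.0, 0.5)]
cp_o4 = cutoff_probability(hist_o4, fk=3.0)

# Cutoff probabilities from the slide.
cps = {"O4": 0.5, "O5": 0.2, "O6": 0.1}
cp_s1 = subset_cp(cps, ["O4", "O5", "O6"])
cp_s2 = subset_cp(cps, ["O4", "O5"])
print(cp_o4, cp_s1, cp_s2)
```

Because every factor is at most 1, adding a member can never raise the product, which is why pruning S2 also prunes every superset such as S1.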

Probabilistic Candidates Selection (T=0.2, k=3)

1-subset  CP   | 2-subset  CP   | 3-subset      CP
{O1}      1    | {O1,O2}   1    | {O1, O2, O3}  1
{O2}      1    | {O1,O3}   1    | {O1, O2, O4}  0.5
{O3}      1    | {O1,O4}   0.5  | {O1, O2, O5}  0.2
{O4}      0.5  | {O1,O5}   0.2  | {O1, O3, O4}  0.5
{O5}      0.2  | {O2,O3}   1    | {O1, O3, O5}  0.2
{O6}      0.1  | {O2,O4}   0.5  | {O1, O4, O5}  0.1
               | {O2,O5}   0.2  | {O2, O3, O4}  0.5
               | {O3,O4}   0.5  | {O2, O3, O5}  0.2
               | {O3,O5}   0.2  | {O2, O4, O5}  0.1
               | {O4,O5}   0.1  | {O3, O4, O5}  0.1
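The level-wise generation suggested by this table can be sketched as below. It is one plausible reading of the slide, not the authors' code: subsets are grown one member at a time in descending order of cutoff probability, and a branch is abandoned as soon as the CP product falls below T. The function name `pcs` and the early `break` are assumptions of this sketch; the cutoff probabilities are the ones from the table.

```python
def pcs(cps, k, threshold):
    """Generate the k-subsets whose cutoff-probability product is >= threshold."""
    # Sort objects by descending cutoff probability; ties keep name order.
    order = sorted(cps, key=lambda o: (-cps[o], o))
    level = [((), 1.0, -1)]  # (subset, CP product, index of its last member)
    for _ in range(k):
        next_level = []
        for subset, cp, last in level:
            for i in range(last + 1, len(order)):
                new_cp = cp * cps[order[i]]
                if new_cp < threshold:
                    break  # later objects have even smaller cutoff probabilities
                next_level.append((subset + (order[i],), new_cp, i))
        level = next_level
    return {subset: cp for subset, cp, _ in level}

cps = {"O1": 1.0, "O2": 1.0, "O3": 1.0, "O4": 0.5, "O5": 0.2, "O6": 0.1}
survivors = pcs(cps, k=3, threshold=0.2)
for subset, cp in survivors.items():
    print(subset, cp)
```

With these inputs the sketch keeps the seven 3-subsets whose CP in the table is at least 0.2 and never expands {O6} or {O4, O5}.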

Storage Efficient Compression
• Subsets are sorted in descending order of their CPs.
• For each group of subsets that share a common prefix, store only the common prefix and the last element of the subset whose product of cutoff probabilities is the smallest one that is still at least T.

Original subsets (2-subset, CP):
{O1,O2} 1, {O1,O3} 1, {O1,O4} 0.5, {O1,O5} 0.2
{O2,O3} 1, {O2,O4} 0.5, {O2,O5} 0.2
{O3,O4} 0.5, {O3,O5} 0.2

Compressed subsets (Size-2 Set):
{O1,O5}, {O2,O5}, {O3,O5}
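The compression idea can be illustrated with a short round-trip sketch. It assumes that subsets are stored as tuples whose members follow one global order (descending cutoff probability), so that all subsets sharing a prefix end in a contiguous run of that order; the helper names `compress` and `decompress` are hypothetical.

```python
def compress(subsets, order):
    """Keep, for each prefix, only the subset whose last member comes latest in order."""
    pos = {o: i for i, o in enumerate(order)}
    last_for_prefix = {}
    for s in subsets:
        prefix, last = s[:-1], s[-1]
        best = last_for_prefix.get(prefix)
        if best is None or pos[last] > pos[best]:
            last_for_prefix[prefix] = last
    return [prefix + (last,) for prefix, last in last_for_prefix.items()]

def decompress(compressed, order):
    """Rebuild every subset implied by a stored (prefix, last member) pair."""
    pos = {o: i for i, o in enumerate(order)}
    out = []
    for s in compressed:
        prefix, last = s[:-1], s[-1]
        start = pos[prefix[-1]] + 1 if prefix else 0
        out.extend(prefix + (order[i],) for i in range(start, pos[last] + 1))
    return out

order = ["O1", "O2", "O3", "O4", "O5"]  # descending cutoff probability
subsets = [
    ("O1", "O2"), ("O1", "O3"), ("O1", "O4"), ("O1", "O5"),
    ("O2", "O3"), ("O2", "O4"), ("O2", "O5"),
    ("O3", "O4"), ("O3", "O5"),
]
compressed = compress(subsets, order)
restored = decompress(compressed, order)
print(compressed)
```

Nine subsets shrink to three stored entries here, and decompression recovers the original list because every member between the prefix and the stored last element also satisfies the threshold.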

Storage Efficient Compression (T=0.2, k=3)
• The 1-subsets, 2-subsets, and 3-subsets and their CPs are the same as in the Probabilistic Candidates Selection table.
• Size-1 Set: {O1}, {O2}, {O3}, {O4}, {O5}
• Size-2 Set: {O1,O5}, {O2,O5}, {O3,O5}

Seeds Pruning (k=3)
• max(r1) = f1, max(r2) = f2, max(r3) = f3
• Seeds: o1, o2, o3
• min(r4) > f2 > f1, so r4 > r2 and r4 > r1.
• If o4 belongs to a 3-NN set S, o1 and o2 must also belong to S.
• For example, we can prune the set {o1, o3, o4} according to the above rule. No CP calculation is needed.
• Seeds pruning can prune more candidate k-sets.
[Figure: query point q, seeds o1, o2, o3 with bounds f1, f2, f3, and object o4 with min(r4) between f2 and f3]
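The seed rule can be checked without any probability computation, as the following sketch shows. It assumes hypothetical (min, max) distance ranges per object and treats the k objects with the smallest maximum distances as the seeds; the function name and the numeric ranges are invented for illustration.

```python
def violates_seed_rule(subset, ranges, k):
    """True if the subset contains an object but omits a seed that is certainly closer."""
    # Seeds: the k objects with the smallest maximum distances.
    seeds = sorted(ranges, key=lambda o: ranges[o][1])[:k]
    members = set(subset)
    for o in subset:
        for s in seeds:
            # max(r_s) < min(r_o): seed s is always closer than o,
            # so any k-NN set containing o must also contain s.
            if ranges[s][1] < ranges[o][0] and s not in members:
                return True
    return False

# Hypothetical ranges with min(r4) = 2.5 > f2 = 2.0 > f1 = 1.0, as on the slide.
ranges = {
    "o1": (0.5, 1.0), "o2": (0.8, 2.0),
    "o3": (1.0, 3.0), "o4": (2.5, 4.0),
}
print(violates_seed_rule(("o1", "o3", "o4"), ranges, k=3))  # pruned: o2 is missing
print(violates_seed_rule(("o1", "o2", "o4"), ranges, k=3))  # kept
```

Only range comparisons are involved, which matches the slide's remark that no CP calculation is needed for this kind of pruning.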

Verifiers: Upper and Lower Bounds (T=0.2)
[Figure: candidate k-subsets S1, S2, S3 (after PCS) pass through a verifier that computes upper and lower bounds on their QPs; a classifier compares the bounds with T, and undecided k-subsets go on to incremental refinement]

Verification and Refinement
• Divide the range [min(r1), fk] into a series of partitions.
• Build a data structure, the stair-case model, to store the distance cdf of each object.
• Derive the lower and upper bounds of a k-set’s QP based on the stair-case model.
• Reject (accept) a k-set once its QP must be lower than (at least) the threshold.
• Extended from the probabilistic verifiers in [ICDE08b].
[Figure: partitions and the stair-case model]
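The accept/reject decision driven by the bounds can be sketched as follows. This covers only the classification step: how the stair-case model produces the bounds is not shown, the sequence of successively tighter bounds is a hypothetical input, and the function names are invented for this illustration.

```python
def classify(lower, upper, threshold):
    """Decide a k-subset from bounds on its QP, or ask for more refinement."""
    if lower >= threshold:
        return "accept"   # QP is certainly at least the threshold
    if upper < threshold:
        return "reject"   # QP is certainly below the threshold
    return "refine"       # the bounds still straddle the threshold

def verify(bound_steps, threshold):
    """Apply successively tighter (lower, upper) bounds until a decision is reached."""
    for lower, upper in bound_steps:
        decision = classify(lower, upper, threshold)
        if decision != "refine":
            return decision
    return "refine"

# Hypothetical bounds that tighten over three verification steps.
steps = [(0.0, 1.0), (0.1, 0.6), (0.25, 0.5)]
print(verify(steps, threshold=0.2))
```

Stopping at the first conclusive pair of bounds is what lets verification avoid the exact QP computation for most k-subsets.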

Experiment Setup

1. k-bound Filtering

2. Performance of GVR

3. k-subset Generation

4. Verification and Refinement

5. Time Analysis

6. Gaussian Distribution

Conclusion
• We proposed an efficient evaluation framework for the T-k-PNN query
• We proposed various techniques:
  - k-bound to filter out unqualified objects
  - PCS to reduce the number of k-subsets
  - verification/refinement methods to avoid exact QP calculation
• Future Work
  - extend the techniques to other queries

Reference
• [TDRP98] P. A. Sistla, O. Wolfson, S. Chamberlain, and S. Dao, “Querying the uncertain position of moving objects,” in Temporal Databases: Research and Practice, 1998.
• [SSDBM99] D. Pfoser and C. Jensen, “Capturing the uncertainty of moving-objects representations,” in Proc. SSDBM, 1999.
• [VLDB04a] A. Deshpande, C. Guestrin, S. Madden, J. Hellerstein, and W. Hong, “Model-driven data acquisition in sensor networks,” in Proc. VLDB, 2004.
• [ICDE06] C. Böhm, A. Pryakhin, and M. Schubert, “The Gauss-tree: Efficient object identification in databases of probabilistic feature vectors,” in Proc. ICDE, 2006.
• [ICDE07a] V. Ljosa and A. K. Singh, “APLA: Indexing arbitrary probability distributions,” in Proc. ICDE, 2007.
• [SIGMOD03] R. Cheng, D. Kalashnikov, and S. Prabhakar, “Evaluating probabilistic queries over imprecise data,” in Proc. ACM SIGMOD, 2003.
• [ICDE07b] J. Chen and R. Cheng, “Efficient evaluation of imprecise location-dependent queries,” in Proc. ICDE, 2007.
• [VLDB06a] M. Mokbel, C. Chow, and W. G. Aref, “The new Casper: Query processing for location services without compromising privacy,” in Proc. VLDB, 2006.
• [TKDE92] D. Barbara, H. Garcia-Molina, and D. Porter, “The management of probabilistic data,” IEEE TKDE, vol. 4, no. 5, 1992.
• [VLDB04b] N. Dalvi and D. Suciu, “Efficient query evaluation on probabilistic databases,” in Proc. VLDB, 2004.
• [VLDB06b] P. Agrawal, O. Benjelloun, A. D. Sarma, C. Hayworth, S. Nabar, T. Sugihara, and J. Widom, “Trio: A system for data, uncertainty, and lineage,” in Proc. VLDB, 2006.
• [VLDB03] G. Iwerks, H. Samet, and K. Smith, “Continuous k-nearest neighbor queries for continuously moving points with updates,” in Proc. VLDB, 2003.
• [ICDCS07] S. Ganguly, M. Garofalakis, R. Rastogi, and K. Sabnani, “Streaming algorithms for robust, real-time detection of DDoS attacks,” in Proc. ICDCS, 2007.
• [AKDDM96] U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Advances in Knowledge Discovery and Data Mining. AAAI Press/MIT Press, 1996.
• [VLDB04c] N. Koudas, B. Ooi, K. Tan, and R. Zhang, “Approximate NN queries on streams with guaranteed error/performance bounds,” in Proc. VLDB, 2004.
• [VLDB08a] G. Beskales, M. Soliman, and I. Ilyas, “Efficient search for the top-k probable nearest neighbors in uncertain databases,” in Proc. VLDB, 2008.
• [VLDB06c] O. Benjelloun, A. Sarma, A. Halevy, and J. Widom, “ULDBs: Databases with uncertainty and lineage,” in Proc. VLDB, 2006.

Reference
• [VLDB07a] L. Antova, C. Koch, and D. Olteanu, “Query language support for incomplete information in the MayBMS system,” in Proc. VLDB, 2007.
• [SIGMOD08a] S. Singh et al., “Orion 2.0: Native support for uncertain data,” in Proc. ACM SIGMOD, 2008.
• [ICDE08a] S. Singh et al., “Database support for pdf attributes,” in Proc. ICDE, 2008.
• [TKDE04] R. Cheng, D. V. Kalashnikov, and S. Prabhakar, “Querying imprecise data in moving object environments,” IEEE TKDE, vol. 16, no. 9, Sept. 2004.
• [DASFAA07] H. Kriegel, P. Kunath, and M. Renz, “Probabilistic nearest-neighbor query on uncertain objects,” in Proc. DASFAA, 2007.
• [MUD08] Y. Qi, S. Singh, R. Shah, and S. Prabhakar, “Indexing probabilistic nearest-neighbor threshold queries,” in Proc. Workshop on Management of Uncertain Data, 2008.
• [TKDE08] X. Lian and L. Chen, “Probabilistic group nearest neighbor queries in uncertain databases,” IEEE TKDE, vol. 20, no. 6, 2008.
• [ICDE08b] R. Cheng, J. Chen, M. Mokbel, and C. Chow, “Probabilistic verifiers: Evaluating constrained nearest-neighbor queries over uncertain data,” in Proc. ICDE, 2008.
• [VLDB05] Y. Tao, R. Cheng, X. Xiao, W. K. Ngai, B. Kao, and S. Prabhakar, “Indexing multi-dimensional uncertain data with arbitrary probability density functions,” in Proc. VLDB, 2005.
• [VLDB07b] J. Pei, B. Jiang, X. Lin, and Y. Yuan, “Probabilistic skylines on uncertain data,” in Proc. VLDB, 2007.
• [SIGMOD08b] X. Lian and L. Chen, “Monochromatic and bichromatic reverse skyline search over uncertain databases,” in Proc. ACM SIGMOD, 2008.
• [ICDE07c] M. Soliman, I. Ilyas, and K. Chang, “Top-k query processing in uncertain databases,” in Proc. ICDE, 2007.
• [SIGMOD08c] M. Hua, J. Pei, W. Zhang, and X. Lin, “Ranking queries on uncertain data: A probabilistic threshold approach,” in Proc. ACM SIGMOD, 2008.
• [VLDB08b] V. Rastogi, D. Suciu, and E. Welbourne, “Access control over uncertain data,” in Proc. VLDB, 2008.
• [VLDB08c] C. Koch and D. Olteanu, “Conditioning probabilistic databases,” in Proc. VLDB, 2008.
• [VLDB08d] R. Cheng, J. Chen, and X. Xie, “Cleaning uncertain data with quality guarantees,” in Proc. VLDB, 2008.
• [SIGMOD84] A. Guttman, “R-trees: A dynamic index structure for spatial searching,” in Proc. ACM SIGMOD, 1984.

Q & A
Thanks!
