International Conference on Extending Database Technology 2009
Evaluating Probability Threshold k-Nearest-Neighbor Queries over Uncertain Data
Reynold Cheng (University of Hong Kong)
Lei Chen (Hong Kong University of Science & Technology)
Jinchuan Chen (Hong Kong Polytechnic University)
Xike Xie (University of Hong Kong)
Cheng, Chen, Chen, Xie
Agenda
• 1. Introduction
• 2. Problem Definition
• 3. Basic Solution
• 4. Efficient Solution
• 5. Results
Data Uncertainty
• Inherent in various applications
  - Location-based services (e.g., using GPS, RFID) [TDRP98, SSDBM99]
  - Natural habitat monitoring with sensor networks [VLDB04a]
  - Biomedical and biometric databases [ICDE06, ICDE07]
Attribute Uncertainty Model [TDRP98, ISSD99, VLDB04b]
[Figure: an uncertainty region with its uncertainty pdf]
• We represent an uncertainty pdf as a histogram.
k-NN Queries
• k-NN query over precise data
  - applications in LBS [VLDB03]
  - natural habitat monitoring systems [VLDB04a]
  - network traffic analysis [ICDCS07]
  - pattern matching in CAM [VLDB04c]
• k-NN over uncertain objects
  - [VLDB08a] ranks objects by the probability that each is the NN of the query point.
  - [ICDE07a] uses expected distance and does not discuss the probability.
Probability Threshold k-Nearest-Neighbor Query (T-k-PNN)
INPUT
• A query point q, a parameter k, and a threshold T
• A set of n objects with uncertainty regions and pdfs
OUTPUT
• Every k-subset S (a set of k objects) with p(S) ≥ T
• p(S) is the qualification probability of the k-subset S, i.e., the probability that S is the set of the k nearest neighbors of q
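The definition above can be written as a formula. This is a sketch under assumed notation: D for the object set and r_i for the (random) distance of O_i from q are symbols introduced here for illustration, and ties between distances are assumed to have probability zero.

```latex
\[
  \text{T-}k\text{-PNN}(q, k, T)
    = \bigl\{\, S \subseteq D \;:\; |S| = k,\ p(S) \ge T \,\bigr\},
  \qquad
  p(S) = \Pr\Bigl[\, \max_{O_i \in S} r_i \;\le\; \min_{O_j \in D \setminus S} r_j \,\Bigr]
\]
```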
Example of a k-PNN query (k=3)
[Figure: query point q surrounded by uncertain objects O1 to O8, with the k-bound drawn around q]
• 3-subsets over all objects: {O1, O2, O3}, {O1, O2, O4}, …, {O6, O7, O8}
• 3-subsets after applying the k-bound: {O1, O2, O3}, {O1, O2, O4}, …, {O4, O5, O6}
k-bound Filtering (k=3)
[Figure: bounds f1, f2, f3 around q; objects O1 to O8]
• fk (the k-bound) is the k-th smallest of the objects' maximum distances from q.
• Since min(r7) > f3, O7 cannot be a 3-NN of q, because there are always 3 objects whose distances are at most f3.
• We apply k-bound filtering on an index (e.g., an R-tree) to prune unqualified objects.
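A minimal sketch of the k-bound rule, assuming each object's minimum and maximum distance from q are already known. The function names and the dictionary layout are hypothetical, and a plain scan stands in for the R-tree traversal mentioned on the slide.

```python
def k_bound(max_dists, k):
    """Return f_k: the k-th smallest maximum distance from q."""
    return sorted(max_dists.values())[k - 1]


def k_bound_filter(min_dists, max_dists, k):
    """Keep only objects whose minimum distance does not exceed f_k.

    min_dists / max_dists map an object name to its nearest / farthest
    possible distance from q (hypothetical input layout).
    """
    fk = k_bound(max_dists, k)
    survivors = [o for o in sorted(min_dists) if min_dists[o] <= fk]
    return fk, survivors
```

For example, with maximum distances 1.5, 2.0, 3.0, 3.5, 5.0 for O1, O2, O3, O4, O7 and k=3, f3 is 3.0, so an object O7 with minimum distance 4.0 is pruned.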
Basic solution for a T-k-PNN query (k=3, T=0.1)
[Figure: query point q, objects O1 to O6, and the k-bound]
Step 1: k-bound filtering
Step 2: QP calculation
Step 3: Accept S if qp(S) ≥ T

3-subset        QP
{O1, O2, O3}    0.2
{O1, O2, O4}    0.1
{O1, O3, O4}    0.1
{O2, O3, O4}    0.1
{O1, O2, O5}    0.05
{O1, O3, O5}    0.05
{O2, O3, O5}    0.05
……

• Too many k-subsets!
• Exact QP is expensive to compute!
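To make the cost of the basic solution concrete, here is a toy brute-force sketch. It is not the authors' QP computation: it assumes each object's distance from q has been discretised into a few (distance, probability) pairs, that objects are independent, and that no two objects share a distance value. All names are hypothetical.

```python
from collections import defaultdict
from itertools import product


def exact_qp(dist_pmfs, k):
    """Enumerate every joint outcome and accumulate the probability of
    each k-subset being the k nearest neighbours.

    dist_pmfs maps an object name to a list of (distance, probability)
    pairs (an assumed discretisation of the distance pdf).
    """
    names = sorted(dist_pmfs)
    qp = defaultdict(float)
    for outcome in product(*(dist_pmfs[n] for n in names)):
        prob = 1.0
        for _, p in outcome:
            prob *= p
        # Rank objects by their distance in this outcome.
        ranked = sorted(zip((d for d, _ in outcome), names))
        knn = frozenset(name for _, name in ranked[:k])
        qp[knn] += prob
    return dict(qp)


def t_k_pnn(dist_pmfs, k, T):
    """Step 3 of the basic solution: keep k-subsets with qp(S) >= T."""
    return {S: p for S, p in exact_qp(dist_pmfs, k).items() if p >= T}
```

The loop visits every combination of distance values, so its cost grows exponentially with the number of objects, which is the problem the slide points out.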
Efficient Solution Framework (GVR): k-subset Generation, Verification, Refinement
• k-subset Generation
  1. k-bound Filtering → candidate objects
  2. Probabilistic Candidate Selection → k-subsets
• k-subset Verification and Refinement
  3. Verification with upper and lower bounds → accepted k-subsets, rejected k-subsets
  4. Refinement
Probabilistic Candidates Selection
[Figure: O4, O5, O6 overlap the k-bound with cutoff probabilities 0.5, 0.2, 0.1]
• Cutoff probability of Oi: Pr(ri ≤ fk)
• S1 = {O4, O5, O6}: cp(S1) = 0.5 × 0.2 × 0.1 = 0.01
• S2 = {O4, O5}: cp(S2) = 0.5 × 0.2 = 0.1
• Given T = 0.2: since cp(S2) < T, we have qp(S1) ≤ cp(S1) ≤ cp(S2) < T, so S1 can be pruned.
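A small sketch of how a cutoff probability could be read off a histogram, assuming the distance pdf of an object is stored as bin edges plus per-bin probabilities and is uniform inside each bin. The function names and the input layout are assumptions made for illustration.

```python
def cutoff_probability(bin_edges, bin_probs, fk):
    """Pr(r <= fk) for a distance pdf stored as a histogram.

    bin_edges has one more entry than bin_probs; the density is assumed
    to be uniform inside each bin.
    """
    total = 0.0
    for lo, hi, p in zip(bin_edges, bin_edges[1:], bin_probs):
        if fk >= hi:
            total += p                          # whole bin lies inside the k-bound
        elif fk > lo:
            total += p * (fk - lo) / (hi - lo)  # partial overlap
    return total


def subset_cp(cps, subset):
    """Product of the members' cutoff probabilities."""
    prod = 1.0
    for obj in subset:
        prod *= cps[obj]
    return prod
```

With the slide's values (0.5, 0.2, 0.1 for O4, O5, O6), `subset_cp` gives 0.1 for {O4, O5}, which is already below T = 0.2, so no superset of {O4, O5} needs to be generated.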
Probabilistic Candidates Selection (T=0.2, k=3)

1-subset  CP     2-subset   CP     3-subset        CP
{O1}      1      {O1,O2}    1      {O1, O2, O3}    1
{O2}      1      {O1,O3}    1      {O1, O2, O4}    0.5
{O3}      1      {O1,O4}    0.5    {O1, O2, O5}    0.2
{O4}      0.5    {O1,O5}    0.2    {O1, O3, O4}    0.5
{O5}      0.2    {O2,O3}    1      {O1, O3, O5}    0.2
{O6}      0.1    {O2,O4}    0.5    {O1, O4, O5}    0.1
                 {O2,O5}    0.2    {O2, O3, O4}    0.5
                 {O3,O4}    0.5    {O2, O3, O5}    0.2
                 {O3,O5}    0.2    {O2, O4, O5}    0.1
                 {O4,O5}    0.1    {O3, O4, O5}    0.1
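The level-wise growth in the table can be sketched as a depth-first extension that stops as soon as the product of cutoff probabilities drops below T. This is an illustrative sketch rather than the authors' exact procedure; `generate_candidates` and its arguments are hypothetical names.

```python
def generate_candidates(cps, k, T):
    """Return all k-subsets whose cutoff-probability product is >= T.

    cps maps an object name to its cutoff probability. Objects are taken
    in descending order of cutoff probability, so once an extension falls
    below T every later extension of the same prefix does too.
    """
    order = sorted(cps, key=lambda o: -cps[o])  # stable sort keeps input order on ties
    results = []

    def extend(start, chosen, prod):
        if len(chosen) == k:
            results.append(tuple(chosen))
            return
        for i in range(start, len(order)):
            p = prod * cps[order[i]]
            if p < T:
                break  # remaining objects have equal or smaller cutoff probability
            extend(i + 1, chosen + [order[i]], p)

    extend(0, [], 1.0)
    return results
```

Run on the cutoff probabilities of O1 to O6 from the table with k=3 and T=0.2, the sketch keeps the seven 3-subsets whose CP is 0.2 or more and never builds a subset that contains both O4 and O5.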
Storage Efficient Compression

Original subsets        Compressed subsets
2-subset   CP           Size-2 Set
{O1,O2}    1            {O1,O5}
{O1,O3}    1            {O2,O5}
{O1,O4}    0.5          {O3,O5}
{O1,O5}    0.2
{O2,O3}    1
{O2,O4}    0.5
{O2,O5}    0.2
{O3,O4}    0.5
{O3,O5}    0.2

• Subsets are sorted in descending order of their CPs.
• For subsets that share a common prefix, store only the prefix together with the last element of the subset that has the smallest product of cutoff probabilities still reaching T.
Storage Efficient Compression (T=0.2, k=3)
For the 1-subset, 2-subset and 3-subset candidates listed under Probabilistic Candidates Selection, the compressed sets are:
• Size-1 Set: {O1}, {O2}, {O3}, {O4}, {O5}
• Size-2 Set: {O1,O5}, {O2,O5}, {O3,O5}
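The prefix compression can be sketched as follows, assuming every subset is stored as a tuple whose elements follow the global object order (descending cutoff probability) and that, for each prefix, all subsets up to the stored last element qualify. `compress` and `decompress` are hypothetical helper names.

```python
def compress(subsets):
    """Keep one subset per common prefix: the one listed last.

    subsets is a list of tuples, grouped by prefix and ordered so that
    the last entry of each group has the smallest CP still reaching T.
    """
    last_by_prefix = {}
    for s in subsets:
        last_by_prefix[s[:-1]] = s[-1]  # later entries overwrite earlier ones
    return [prefix + (last,) for prefix, last in last_by_prefix.items()]


def decompress(compressed, order):
    """Rebuild the original subsets from the compressed form.

    order is the global object order used when the subsets were built.
    """
    pos = {obj: i for i, obj in enumerate(order)}
    out = []
    for s in compressed:
        prefix, last = s[:-1], s[-1]
        start = pos[prefix[-1]] + 1 if prefix else 0
        out.extend(prefix + (order[i],) for i in range(start, pos[last] + 1))
    return out
```

Applied to the nine qualifying 2-subsets of the example, `compress` returns the three entries of the Size-2 Set, and `decompress` restores the nine originals.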
Seeds Pruning (k=3)
[Figure: bounds f1, f2, f3 around q; objects O1 to O4, with min(r4) lying between f2 and f3]
• Seeds: o1, o2, o3, with max(r1) = f1, max(r2) = f2, max(r3) = f3.
• Example: min(r4) > f2 > f1, so r4 > r2 and r4 > r1 always hold.
• Hence, if o4 belongs to a 3-NN set S, o1 and o2 must also belong to S.
• For example, we can prune the set {o1, o3, o4} by this rule. No CP calculation is needed.
• Seeds pruning can remove more candidate k-sets.
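The seeds rule can be checked without any probability computation. The sketch below assumes minimum and maximum distances are available per object; the function name and argument layout are hypothetical.

```python
def violates_seed_rule(subset, seeds, min_dist, max_dist):
    """Return True if the k-set can be pruned by the seeds rule.

    If a member o always lies farther from q than a seed s
    (min_dist[o] > max_dist[s]), then any k-NN set containing o must
    also contain s. A k-set that contains o but not s is impossible.
    """
    members = set(subset)
    for o in members:
        for s in seeds:
            if s not in members and min_dist[o] > max_dist[s]:
                return True
    return False
```

With max distances 1, 2, 3 for o1, o2, o3 and min(r4) = 2.5, the set {o1, o3, o4} is pruned because o2 is missing, while {o1, o2, o4} is kept.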
Verifiers: Upper and Lower Bounds (T=0.2)
[Figure: candidate k-subsets S1, S2, S3 (after PCS) pass through verifiers that produce upper and lower bounds on their QPs; a classifier compares the bounds with T, and undecided k-subsets go to incremental refinement]
Verification and Refinement
[Figure: partitions; stair-case model]
• Divide the range [min(r1), fk] into a series of partitions.
• Build a data structure, the stair-case model, to store the distance cdf of each object.
• Derive lower and upper bounds of a k-set's QP from the stair-case model.
• Reject (accept) a k-set once its QP must be lower (no lower) than the threshold.
• Extended from the probabilistic verifiers in [ICDE08b].
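The accept/reject decision driven by the bounds can be sketched as below. The sketch does not implement the stair-case model: each verifier is assumed to be a callable that returns a (lower, upper) bound pair for a k-set, and all names are hypothetical.

```python
def classify(lower, upper, T):
    """Decide a k-set from bounds on its qualification probability."""
    if lower >= T:
        return "accept"   # QP cannot be below the threshold
    if upper < T:
        return "reject"   # QP cannot reach the threshold
    return "refine"       # bounds are still too loose


def verify(subset, verifiers, T):
    """Apply verifiers in turn, tightening the bounds until decided."""
    lower, upper = 0.0, 1.0
    for verifier in verifiers:
        lo, hi = verifier(subset)
        lower, upper = max(lower, lo), min(upper, hi)
        label = classify(lower, upper, T)
        if label != "refine":
            return label, lower, upper
    return "refine", lower, upper
```

Only k-sets that come back as "refine" need further refinement, which is where the savings over the basic solution would come from.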
Experiment Setup
[Result slides; figures not recoverable]
1. k-bound Filtering
2. Performance of GVR
3. k-subset Generation
4. Verification and Refinement
5. Time Analysis
6. Gaussian Distribution
Conclusion
• We proposed an efficient evaluation framework for the T-k-PNN query.
• We proposed various techniques:
  - k-bound filtering to remove unqualified objects
  - PCS to reduce the number of k-subsets
  - verification/refinement methods to avoid exact calculation
• Future work
  - extend the techniques to other queries
References
• [TDRP98] P. A. Sistla, O. Wolfson, S. Chamberlain, and S. Dao, “Querying the uncertain position of moving objects,” in Temporal Databases: Research and Practice, 1998.
• [SSDBM99] D. Pfoser and C. Jensen, “Capturing the uncertainty of moving-objects representations,” in Proc. SSDBM, 1999.
• [VLDB04a] A. Deshpande, C. Guestrin, S. Madden, J. Hellerstein, and W. Hong, “Model-driven data acquisition in sensor networks,” in Proc. VLDB, 2004.
• [ICDE06] C. Böhm, A. Pryakhin, and M. Schubert, “The Gauss-tree: Efficient object identification in databases of probabilistic feature vectors,” in Proc. ICDE, 2006.
• [ICDE07a] V. Ljosa and A. K. Singh, “APLA: Indexing arbitrary probability distributions,” in Proc. ICDE, 2007.
• [SIGMOD03] R. Cheng, D. Kalashnikov, and S. Prabhakar, “Evaluating probabilistic queries over imprecise data,” in Proc. ACM SIGMOD, 2003.
• [ICDE07b] J. Chen and R. Cheng, “Efficient evaluation of imprecise location-dependent queries,” in Proc. ICDE, 2007.
• [VLDB06a] M. Mokbel, C. Chow, and W. G. Aref, “The new Casper: Query processing for location services without compromising privacy,” in Proc. VLDB, 2006.
• [TKDE92] D. Barbara, H. Garcia-Molina, and D. Porter, “The management of probabilistic data,” IEEE TKDE, vol. 4, no. 5, 1992.
• [VLDB04b] N. Dalvi and D. Suciu, “Efficient query evaluation on probabilistic databases,” in Proc. VLDB, 2004.
• [VLDB06b] P. Agrawal, O. Benjelloun, A. D. Sarma, C. Hayworth, S. Nabar, T. Sugihara, and J. Widom, “Trio: A system for data, uncertainty, and lineage,” in Proc. VLDB, 2006.
• [VLDB03] G. Iwerks, H. Samet, and K. Smith, “Continuous k-nearest neighbor queries for continuously moving points with updates,” in Proc. VLDB, 2003.
• [ICDCS07] S. Ganguly, M. Garofalakis, R. Rastogi, and K. Sabnani, “Streaming algorithms for robust, real-time detection of DDoS attacks,” in Proc. ICDCS, 2007.
• [AKDDM96] U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Advances in Knowledge Discovery and Data Mining. AAAI Press/MIT Press, 1996.
• [VLDB04c] N. Koudas, B. Ooi, K. Tan, and R. Zhang, “Approximate NN queries on streams with guaranteed error/performance bounds,” in Proc. VLDB, 2004.
• [VLDB08a] G. Beskales, M. Soliman, and I. Ilyas, “Efficient search for the top-k probable nearest neighbors in uncertain databases,” in Proc. VLDB, 2008.
• [VLDB06c] O. Benjelloun, A. D. Sarma, A. Halevy, and J. Widom, “ULDBs: Databases with uncertainty and lineage,” in Proc. VLDB, 2006.
• [VLDB07a] L. Antova, C. Koch, and D. Olteanu, “Query language support for incomplete information in the MayBMS system,” in Proc. VLDB, 2007.
• [SIGMOD08a] S. Singh et al., “Orion 2.0: Native support for uncertain data,” in Proc. ACM SIGMOD, 2008.
• [ICDE08a] S. Singh et al., “Database support for pdf attributes,” in Proc. ICDE, 2008.
• [TKDE04] R. Cheng, D. V. Kalashnikov, and S. Prabhakar, “Querying imprecise data in moving object environments,” IEEE TKDE, vol. 16, no. 9, Sept. 2004.
• [DASFAA07] H. Kriegel, P. Kunath, and M. Renz, “Probabilistic nearest-neighbor query on uncertain objects,” in Proc. DASFAA, 2007.
• [MUD08] Y. Qi, S. Singh, R. Shah, and S. Prabhakar, “Indexing probabilistic nearest-neighbor threshold queries,” in Proc. Workshop on Management of Uncertain Data, 2008.
• [TKDE08] X. Lian and L. Chen, “Probabilistic group nearest neighbor queries in uncertain databases,” IEEE TKDE, vol. 20, no. 6, 2008.
• [ICDE08b] R. Cheng, J. Chen, M. Mokbel, and C. Chow, “Probabilistic verifiers: Evaluating constrained nearest-neighbor queries over uncertain data,” in Proc. ICDE, 2008.
• [VLDB05] Y. Tao, R. Cheng, X. Xiao, W. K. Ngai, B. Kao, and S. Prabhakar, “Indexing multi-dimensional uncertain data with arbitrary probability density functions,” in Proc. VLDB, 2005.
• [VLDB07b] J. Pei, B. Jiang, X. Lin, and Y. Yuan, “Probabilistic skylines on uncertain data,” in Proc. VLDB, 2007.
• [SIGMOD08b] X. Lian and L. Chen, “Monochromatic and bichromatic reverse skyline search over uncertain databases,” in Proc. ACM SIGMOD, 2008.
• [ICDE07c] M. Soliman, I. Ilyas, and K. Chang, “Top-k query processing in uncertain databases,” in Proc. ICDE, 2007.
• [SIGMOD08c] M. Hua, J. Pei, W. Zhang, and X. Lin, “Ranking queries on uncertain data: A probabilistic threshold approach,” in Proc. ACM SIGMOD, 2008.
• [VLDB08b] V. Rastogi, D. Suciu, and E. Welbourne, “Access control over uncertain data,” in Proc. VLDB, 2008.
• [VLDB08c] C. Koch and D. Olteanu, “Conditioning probabilistic databases,” in Proc. VLDB, 2008.
• [VLDB08d] R. Cheng, J. Chen, and X. Xie, “Cleaning uncertain data with quality guarantees,” in Proc. VLDB, 2008.
• [SIGMOD84] A. Guttman, “R-trees: A dynamic index structure for spatial searching,” in Proc. ACM SIGMOD, 1984.
Q & A
Thanks!