
An Efficient Distance Calculation Method for Uncertain Objects



Presentation Transcript


  1. An Efficient Distance Calculation Method for Uncertain Objects Edward Hung csehung@comp.polyu.edu.hk Hong Kong Polytechnic University 2007 CIDM, Hawaii, USA, Apr 1-5, 2007

  2. Outline • Why do we care about uncertain objects and their distances? • Analytic Solutions for Uniform and Gaussian Distributions • Five Approximation Methods (DM, PRS, GAPS, PGM, ASG) for Arbitrary Distributions • Equivalence of PRS, PGM and ASG • Performance Study • Conclusion

  3. Uncertain Objects: From Where? • Sources • Readings from sensors • Classification results of image processing using statistical classifiers • Results from predictive programs used for the stock market • Weather prediction • Etc.

  4. Uncertain Objects: How to Represent? • Representation • An exact value with margins of error • E.g., 156±0.5, [23.8, 24.9] • An uncertainty domain with a probability distribution/density function (PDF/pdf) • Discrete: e.g., for object o1, UD(o1) = {5.1, 5.2, 5.3}, P1(5.1) = 0.3, P1(5.2) = 0.4, P1(5.3) = 0.3 • Continuous: e.g., for object o2 with uniform distribution, UD(o2) = [6, 11], p2(x) = 0.2 where 6 ≤ x ≤ 11
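The two representations above can be written down directly; a minimal Python sketch (the variable names are illustrative, not from the paper):

```python
# Discrete representation: explicit (value -> probability) pairs for o1.
o1 = {5.1: 0.3, 5.2: 0.4, 5.3: 0.3}
assert abs(sum(o1.values()) - 1.0) < 1e-9   # probabilities must sum to 1

# Continuous representation: a pdf over the uncertainty domain UD(o2) = [6, 11].
def p2(x):
    """Uniform density 1/(11 - 6) = 0.2 inside the domain, 0 outside."""
    return 0.2 if 6 <= x <= 11 else 0.0

# Expected value of the discrete object (used later by the mean-based methods).
mean_o1 = sum(v * p for v, p in o1.items())
print(mean_o1)  # ~5.2
```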

  5. How Uncertain Objects Are Traditionally Handled • Transformed into exact values to store in traditional databases • Weighted average or mean • Value of highest frequency or possibility • Why is this bad? • Intermediate and final results of mining or queries will also be approximate and may be wrong • E.g., deviation of cluster centroids and wrong assignment of some data • Shown in experimental results later

  6. Distance: Why Important? • Various queries and data mining tasks, e.g., • Nearest-neighbor queries • Clustering (e.g., K-means clustering)

  7. Distance: Why Expensive? • An uncertain object has more than one possible location • Discrete: e.g., o1 (o2) has n1 (n2) possible locations • n1 × n2 possible pair-wise combinations of their locations to calculate distances • Probability of each location may be different
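For the discrete case, the n1 × n2 blow-up is easy to see in code; a brute-force Python sketch (toy data of my own, not from the slides):

```python
# Brute force over all n1*n2 location pairs of two discrete uncertain objects.
# Objects are lists of (location, probability) pairs; locations are 2-D tuples.
o1 = [((0.0, 0.0), 0.5), ((2.0, 0.0), 0.5)]
o2 = [((4.0, 0.0), 0.5), ((6.0, 0.0), 0.5)]

def expected_sq_dist(oi, oj):
    total = 0.0
    for xi, pi in oi:                 # n1 possible locations of oi
        for xj, pj in oj:             # n2 possible locations of oj
            sq = sum((a - b) ** 2 for a, b in zip(xi, xj))
            total += pi * pj * sq     # weight by the pair's joint probability
    return total

print(expected_sq_dist(o1, o2))  # 18.0
```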

  8. Distance: Why Expensive? • Continuous: e.g., take n samples on each uncertain object • More samples in regions of higher probability density • Each sample has the same probability

  9. Distance: Why Expensive? • Approximation by a grid of a finite number of cells formed on the uncertainty domain (region)¹ • A grid of 14 × 14 cells • Probability of each cell determined by sampling • All combinations of cells of two objects → 196 × 196 distance calculations • ¹ E.g., used in Ngai et al., “Efficient clustering of uncertain data”, in the 2006 IEEE International Conference on Data Mining (ICDM).

  10. Why Expected Distance? • All possible pair-wise combinations → a distance function di,j(x) that returns the probability (or density) that the distance between objects oi and oj is x • VERY expensive (previous slides) • Expected distance: weighted average of all combinations’ distances • Could be much cheaper IF we do NOT need to try all combinations • Squared Euclidean distance chosen • Easier integration compared with Euclidean distance or Manhattan distance

  11. Analytic Solutions • Uniform pdf • Gaussian pdf

  12. Uniform pdf • (1) c² + (a² − ab + b²)/3 • (2) c² + r²/2 • (3) c² + 3r²/5 • (4) c² + (r₁² + r₂²)/3 • (5) c² + (r₁² + r₂²)/2 • (6) c² + r₁²/2 + 3r₂²/5 • (7) c² + 3(r₁² + r₂²)/5 • (c: distance between the two objects’ centers; a, b and r, r₁, r₂ parameterize the uncertainty regions shown in the slide’s figures)
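These closed forms can be sanity-checked by Monte Carlo. A Python sketch for one case, two objects uniform over 2-D disks of radii r1, r2 with centers distance c apart, where the expected squared distance should be c² + (r1² + r2²)/2, as in (5) above (the concrete numbers are my own illustration):

```python
import math
import random

random.seed(0)

def sample_disk(cx, cy, r):
    """Uniform point in a disk: the sqrt-radius trick keeps the density uniform."""
    t, u = random.uniform(0, 2 * math.pi), math.sqrt(random.random())
    return cx + r * u * math.cos(t), cy + r * u * math.sin(t)

c, r1, r2, n = 3.0, 1.0, 2.0, 200_000
mc = sum(
    (x1 - x2) ** 2 + (y1 - y2) ** 2
    for (x1, y1), (x2, y2) in (
        (sample_disk(0, 0, r1), sample_disk(c, 0, r2)) for _ in range(n)
    )
) / n
analytic = c ** 2 + (r1 ** 2 + r2 ** 2) / 2   # = 11.5
print(mc)  # close to 11.5
```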

  13. Gaussian pdf • For objects oi with Gaussian pdf N(μi, Σi), where μi is a d × 1 mean vector and Σi is a d × d covariance matrix, the expected distance between objects oi, oj is • EDAS(oi, oj) = ||μi − μj||² + trace(Σi) + trace(Σj) • where trace(Σi) is the sum of all diagonal elements of Σi
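The Gaussian closed form is cheap to evaluate; a small Python sketch with plain-list matrices (the parameter values are illustrative):

```python
# ED(oi, oj) = ||mu_i - mu_j||^2 + trace(Sigma_i) + trace(Sigma_j),
# evaluated without any pairwise sampling work.
def ed_gaussian(mu_i, sigma_i, mu_j, sigma_j):
    sq_mean_dist = sum((a - b) ** 2 for a, b in zip(mu_i, mu_j))
    trace = lambda m: sum(m[k][k] for k in range(len(m)))
    return sq_mean_dist + trace(sigma_i) + trace(sigma_j)

mu1, cov1 = (0.0, 0.0), [[1.0, 0.2], [0.2, 2.0]]   # d x d covariance matrices
mu2, cov2 = (3.0, 4.0), [[0.5, 0.0], [0.0, 0.5]]
print(ed_gaussian(mu1, cov1, mu2, cov2))  # 25 + 3 + 1 = 29.0
```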

  14. Approximation Methods for Arbitrary pdf • 5 methods proposed: • Distance between Means (DM) • Pair-wise between Random Samples (PRS) • Grid Approximation and Pair-wise between Samples (GAPS) • Pair-wise between Gaussian Mixture (PGM) • Approximation by Single Gaussian (ASG)

  15. 1. Distance between Means (DM) • EDDM(oi, oj) = ||μi − μj||²

  16. 2. Pair-wise between Random Samples (PRS) • Take n samples on each uncertain object • More samples in regions of higher probability density; each sample has the same probability
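PRS reduces to averaging the squared distance over all sample pairs; a Python sketch on toy Gaussian samples (the test data is my own, and the true expected distance for N((0,0), I) vs N((5,0), I) is 25 + 2 + 2 = 29):

```python
import random

random.seed(1)

def ed_prs(samples_i, samples_j):
    """Average squared distance over all n_i * n_j equally weighted sample pairs."""
    total = sum(
        sum((a - b) ** 2 for a, b in zip(xi, xj))
        for xi in samples_i for xj in samples_j
    )
    return total / (len(samples_i) * len(samples_j))

s1 = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(400)]
s2 = [(random.gauss(5, 1), random.gauss(0, 1)) for _ in range(400)]
print(ed_prs(s1, s2))  # close to 29
```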

  17. 3. Grid Approximation and Pair-wise between Samples (GAPS) • Approximation by a grid of √s × √s cells formed on the uncertainty domain • Probability of each cell determined by sampling
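One possible way to build the GAPS grid from samples (an illustrative Python sketch, not the paper's implementation):

```python
import random

random.seed(2)

def grid_cells(samples, g):
    """Bin 2-D samples into a g x g grid over their bounding box;
    return [(cell_center, probability), ...] for the non-empty cells."""
    xs, ys = [p[0] for p in samples], [p[1] for p in samples]
    x0, x1, y0, y1 = min(xs), max(xs), min(ys), max(ys)
    counts = {}
    for x, y in samples:
        cx = min(int((x - x0) / (x1 - x0) * g), g - 1)
        cy = min(int((y - y0) / (y1 - y0) * g), g - 1)
        counts[(cx, cy)] = counts.get((cx, cy), 0) + 1
    center = lambda c, lo, hi: lo + (c + 0.5) * (hi - lo) / g
    return [((center(cx, x0, x1), center(cy, y0, y1)), k / len(samples))
            for (cx, cy), k in counts.items()]

samples = [(random.uniform(0, 14), random.uniform(0, 14)) for _ in range(5000)]
cells = grid_cells(samples, 14)      # at most 14 * 14 = 196 weighted cells
print(len(cells))  # up to 196
```

The resulting weighted cells then play the role of a discrete uncertain object, and the pairwise step is the same brute-force double loop as before.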

  18. 4. Pair-wise between Gaussian Mixture (PGM) • Approximate an uncertain object oi by a mixture of Gaussian distributions: ∑u∈Ci Ai,u N(μi,u, Σi,u) • (use K-means to cluster samples into a few clusters) • EDPGM(oi, oj) = ∑u∈Ci ∑v∈Cj Ai,u Aj,v (||μi,u − μj,v||² + trace(Σi,u) + trace(Σj,v))
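Once the mixture components are fitted (the slide uses K-means on the samples; the fitting step is omitted here), the PGM double sum is straightforward. A Python sketch with hypothetical component parameters:

```python
# ED_PGM = sum_u sum_v A_iu * A_jv * (||mu_iu - mu_jv||^2 + tr(S_iu) + tr(S_jv)).
def ed_pgm(mix_i, mix_j):
    """Each mixture is a list of (weight, mean, trace_of_covariance) triples."""
    total = 0.0
    for a_u, mu_u, tr_u in mix_i:
        for a_v, mu_v, tr_v in mix_j:
            sq = sum((a - b) ** 2 for a, b in zip(mu_u, mu_v))
            total += a_u * a_v * (sq + tr_u + tr_v)
    return total

# Two-component mixture vs a single Gaussian; weights sum to 1 in each mixture.
mix1 = [(0.5, (0.0, 0.0), 1.0), (0.5, (2.0, 0.0), 1.0)]
mix2 = [(1.0, (5.0, 0.0), 0.5)]
print(ed_pgm(mix1, mix2))  # 18.5
```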

  19. 5. Approximation by Single Gaussian (ASG) • Approximate an uncertain object oi by a single Gaussian distribution N(μi, Σi) • EDASG(oi, oj) = ||μi − μj||² + trace(Σi) + trace(Σj) • Complexity = O((ni + nj)d)
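ASG needs only one pass over each object's samples, which is where the O((ni + nj)d) bound comes from: the formula uses only the sample means and trace(Σ), i.e., the sum of per-dimension variances. A Python sketch (biased 1/n covariance; toy samples of my own):

```python
def asg_stats(samples):
    """Sample mean and trace of the (biased, 1/n) sample covariance."""
    n, d = len(samples), len(samples[0])
    mean = [sum(x[k] for x in samples) / n for k in range(d)]
    tr = sum(sum((x[k] - mean[k]) ** 2 for x in samples) / n for k in range(d))
    return mean, tr

def ed_asg(samples_i, samples_j):
    mu_i, tr_i = asg_stats(samples_i)
    mu_j, tr_j = asg_stats(samples_j)
    return sum((a - b) ** 2 for a, b in zip(mu_i, mu_j)) + tr_i + tr_j

s1 = [(0.0, 0.0), (2.0, 0.0)]    # tiny sample sets for illustration
s2 = [(4.0, 0.0), (6.0, 0.0)]
print(ed_asg(s1, s2))  # 16 + 1 + 1 = 18.0
```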

  20. Equivalence of PRS, PGM and ASG • Theorem: Given any uncertain objects oi, oj and their samples xi,1, …, xi,ni, xj,1, …, xj,nj, EDPRS(oi, oj) = EDPGM(oi, oj) = EDASG(oi, oj) • Theoretically, ASG is the cheapest of all the methods except DM, while producing the same results as PRS and PGM • What about compared with DM and GAPS?
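The equivalence can be checked numerically: on the same sample sets, the pairwise average (PRS) and the single-Gaussian formula (ASG) agree up to floating-point rounding, not just approximately. A Python sketch on random data of my own:

```python
import random

random.seed(3)
s1 = [(random.gauss(0, 1), random.gauss(0, 2)) for _ in range(50)]
s2 = [(random.gauss(4, 1), random.gauss(1, 1)) for _ in range(60)]

def ed_prs(si, sj):
    """Average squared distance over all sample pairs."""
    return sum(sum((a - b) ** 2 for a, b in zip(x, y))
               for x in si for y in sj) / (len(si) * len(sj))

def ed_asg(si, sj):
    """||mean_i - mean_j||^2 + trace terms, from one pass over each sample set."""
    def stats(s):
        n, d = len(s), len(s[0])
        mu = [sum(x[k] for x in s) / n for k in range(d)]
        tr = sum(sum((x[k] - mu[k]) ** 2 for x in s) / n for k in range(d))
        return mu, tr
    (mi, ti), (mj, tj) = stats(si), stats(sj)
    return sum((a - b) ** 2 for a, b in zip(mi, mj)) + ti + tj

print(abs(ed_prs(s1, s2) - ed_asg(s1, s2)))  # ~0: floating-point noise only
```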

  21. Performance Study • Experimental results show that ASG is: • much more accurate than DM, with comparable speed • much faster than GAPS, with higher or comparable accuracy • (# grid cells = # samples)
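Why DM is fast but inaccurate follows from the formulas themselves: EDDM drops the trace terms, so it underestimates the expected squared distance by exactly trace(Σi) + trace(Σj). A tiny Python illustration (the numbers are mine, not from the experiments):

```python
# DM keeps only the squared distance between means; the true expected squared
# distance adds trace(Sigma_i) + trace(Sigma_j) on top.
def ed_dm(mu_i, mu_j):
    return sum((a - b) ** 2 for a, b in zip(mu_i, mu_j))

# Hypothetical case: wide objects (trace 5 each) whose means are only 1 apart.
mu1, tr1 = (0.0, 0.0), 5.0
mu2, tr2 = (1.0, 0.0), 5.0
true_ed = ed_dm(mu1, mu2) + tr1 + tr2
print(ed_dm(mu1, mu2), true_ed)  # 1.0 11.0 -- DM is off by the trace terms
```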

  22. Experiment 1 • 100 uncertain objects (4 Gaussian pdfs, variances in [1,10])

  23. Experiment 1

  24. Experiment 1 • ASG: ~ 0.02ms

  25. Experiment 2 • Data generated in the same way as Ngai et al., “Efficient clustering of uncertain data”, in the 2006 IEEE International Conference on Data Mining (ICDM) • A grid of 14 × 14 cells • Probability of each cell randomly generated and normalized • GAPS produces the correct solution, but how close is ASG?

  26. Experiment 2

  27. Experiment 2 • ASG: ~ 0.02ms

  28. Experiment 3 • ASG also approximates well objects with uniform pdf • 10 objects with radii in [1, 5], randomly located in a 100 × 100 2-D space • ASG takes 100 samples, repeated 6 times • Accuracy • Worst case > 0.98 • Average > 0.99

  29. Experiment 4 • Scalability w.r.t. # Dimensions • 2/3/4-D • 256/216/256 samples/cells • ASG • Accuracy: 0.97 – 0.99 • Time: ~ 0.02ms or less

  30. Experiment 4

  31. Experiment 4

  32. Conclusion • Importance of expected distance calculation in queries and data mining applications on uncertain data • Analytic solutions for special cases (uniform/Gaussian pdf) • ASG obtains highly accurate results quickly • ASG can replace GAPS used in recent research work
