
An Efficient Distance Calculation Method for Uncertain Objects



Presentation Transcript


  1. An Efficient Distance Calculation Method for Uncertain Objects Edward Hung csehung@comp.polyu.edu.hk Hong Kong Polytechnic University 2007 CIDM, Hawaii, USA, Apr 1-5, 2007

  2. Outline • Why do we care about uncertain objects and their distances? • Analytic Solutions for Uniform and Gaussian Distributions • Five Approximation Methods (DM, PRS, GAPS, PGM, ASG) for Arbitrary Distributions • Equivalence of PRS, PGM and ASG • Performance Study • Conclusion

  3. Uncertain Objects: From Where? • Sources • Readings from sensors • Classification results of image processing using statistical classifiers • Results from predictive programs used for the stock market • Weather prediction • Etc.

  4. Uncertain Objects: How to Represent? • Representation • An exact value with margins of error • E.g., 156±0.5, [23.8, 24.9] • An uncertainty domain with a probability distribution/density function (PDF/pdf) • Discrete: e.g., for object o1, UD(o1) = {5.1, 5.2, 5.3}, P1(5.1) = 0.3, P1(5.2) = 0.4, P1(5.3) = 0.3 • Continuous: e.g., for object o2 with uniform distribution, UD(o2) = [6, 11], p2(x) = 0.2 where 6 ≤ x ≤ 11
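The two representations above can be written down directly; a minimal Python sketch (the variable names are illustrative, not from the paper):

```python
# Discrete representation: explicit (value -> probability) pairs for o1.
o1 = {5.1: 0.3, 5.2: 0.4, 5.3: 0.3}
assert abs(sum(o1.values()) - 1.0) < 1e-9   # probabilities must sum to 1

# Continuous representation: a pdf over the uncertainty domain UD(o2) = [6, 11].
def p2(x):
    """Uniform density 1/(11 - 6) = 0.2 inside the domain, 0 outside."""
    return 0.2 if 6 <= x <= 11 else 0.0

# Expected value of the discrete object (used later by the mean-based methods).
mean_o1 = sum(v * p for v, p in o1.items())
print(mean_o1)  # ~5.2
```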

  5. How Uncertain Objects Are Traditionally Handled • Transformed into exact values to store in traditional databases • Weighted average or mean • Value of highest frequency or possibility • Why is this bad? • Intermediate and final results of mining or queries will also be approximate and may be wrong • E.g., deviation of cluster centroids and wrong assignment of some data • Shown in experimental results later

  6. Distance: Why Important? • Various queries and data mining tasks, e.g., • Nearest-neighbor queries • Clustering (e.g., K-means clustering)

  7. Distance: Why Expensive? • An uncertain object has more than one possible location • Discrete: e.g., o1 (o2) has n1 (n2) possible locations • n1 × n2 possible pair-wise combinations of their locations to calculate distances • Probability of each location may be different
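For the discrete case, the n1 × n2 blow-up is easy to see in code; a brute-force Python sketch (toy data of my own, not from the slides):

```python
# Brute force over all n1*n2 location pairs of two discrete uncertain objects.
# Objects are lists of (location, probability) pairs; locations are 2-D tuples.
o1 = [((0.0, 0.0), 0.5), ((2.0, 0.0), 0.5)]
o2 = [((4.0, 0.0), 0.5), ((6.0, 0.0), 0.5)]

def expected_sq_dist(oi, oj):
    total = 0.0
    for xi, pi in oi:                 # n1 possible locations of oi
        for xj, pj in oj:             # n2 possible locations of oj
            sq = sum((a - b) ** 2 for a, b in zip(xi, xj))
            total += pi * pj * sq     # weight by the pair's joint probability
    return total

print(expected_sq_dist(o1, o2))  # 18.0
```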

  8. Distance: Why Expensive? • Continuous: e.g., take n samples on each uncertain object • More samples in regions of higher probability density • Each sample has the same probability

  9. Distance: Why Expensive? • Approximation by a grid of a finite number of cells formed on the uncertainty domain (region)¹ • A grid of 14 × 14 cells • Probability of each cell determined by sampling • All combinations of cells of two objects → 196 × 196 distance calculations • ¹ E.g., used in Ngai et al., “Efficient clustering of uncertain data”, in the 2006 IEEE International Conference on Data Mining (ICDM).

  10. Why Expected Distance? • All possible pair-wise combinations → a distance function di,j(x) that returns the probability (or density) that the distance between objects oi and oj is x • VERY expensive (previous slides) • Expected distance: weighted average of all combinations’ distances • Could be much cheaper IF we do NOT need to try all combinations • Squared Euclidean distance chosen • Easier integration compared with Euclidean distance or Manhattan distance

  11. Analytic Solutions • Uniform pdf • Gaussian pdf

  12. Uniform pdf • (1) c² + (a² − ab + b²)/3 • (2) c² + r²/2 • (3) c² + 3r²/5 • (4) c² + (r₁² + r₂²)/3 • (5) c² + (r₁² + r₂²)/2 • (6) c² + r₁²/2 + 3r₂²/5 • (7) c² + 3(r₁² + r₂²)/5 • (c: distance between the two objects’ centers; a, b and r, r₁, r₂ parameterize the uncertainty regions shown in the slide’s figures)
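These closed forms can be sanity-checked by Monte Carlo. A Python sketch for one case, two objects uniform over 2-D disks of radii r1, r2 with centers distance c apart, where the expected squared distance should be c² + (r1² + r2²)/2, as in (5) above (the concrete numbers are my own illustration):

```python
import math
import random

random.seed(0)

def sample_disk(cx, cy, r):
    """Uniform point in a disk: the sqrt-radius trick keeps the density uniform."""
    t, u = random.uniform(0, 2 * math.pi), math.sqrt(random.random())
    return cx + r * u * math.cos(t), cy + r * u * math.sin(t)

c, r1, r2, n = 3.0, 1.0, 2.0, 200_000
mc = sum(
    (x1 - x2) ** 2 + (y1 - y2) ** 2
    for (x1, y1), (x2, y2) in (
        (sample_disk(0, 0, r1), sample_disk(c, 0, r2)) for _ in range(n)
    )
) / n
analytic = c ** 2 + (r1 ** 2 + r2 ** 2) / 2   # = 11.5
print(mc)  # close to 11.5
```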

  13. Gaussian pdf • For objects oi with Gaussian pdf N(μi, Σi), where μi is a d × 1 mean vector and Σi is a d × d covariance matrix, the expected distance between objects oi, oj is • EDAS(oi, oj) = ||μi − μj||² + trace(Σi) + trace(Σj) • where trace(Σi) is the sum of all diagonal elements of Σi
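The Gaussian closed form is cheap to evaluate; a small Python sketch with plain-list matrices (the parameter values are illustrative):

```python
# ED(oi, oj) = ||mu_i - mu_j||^2 + trace(Sigma_i) + trace(Sigma_j),
# evaluated without any pairwise sampling work.
def ed_gaussian(mu_i, sigma_i, mu_j, sigma_j):
    sq_mean_dist = sum((a - b) ** 2 for a, b in zip(mu_i, mu_j))
    trace = lambda m: sum(m[k][k] for k in range(len(m)))
    return sq_mean_dist + trace(sigma_i) + trace(sigma_j)

mu1, cov1 = (0.0, 0.0), [[1.0, 0.2], [0.2, 2.0]]   # d x d covariance matrices
mu2, cov2 = (3.0, 4.0), [[0.5, 0.0], [0.0, 0.5]]
print(ed_gaussian(mu1, cov1, mu2, cov2))  # 25 + 3 + 1 = 29.0
```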

  14. Approximation Methods for Arbitrary pdf • 5 methods proposed: • Distance between Means (DM) • Pair-wise between Random Samples (PRS) • Grid Approximation and Pair-wise between Samples (GAPS) • Pair-wise between Gaussian Mixture (PGM) • Approximation by Single Gaussian (ASG)

  15. 1. Distance between Means (DM) • EDDM(oi, oj) = ||μi − μj||²

  16. 2. Pair-wise between Random Samples (PRS) • Take n samples on each uncertain object • More samples in regions of higher probability density; each sample has the same probability
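PRS reduces to averaging the squared distance over all sample pairs; a Python sketch on toy Gaussian samples (the test data is my own, and the true expected distance for N((0,0), I) vs N((5,0), I) is 25 + 2 + 2 = 29):

```python
import random

random.seed(1)

def ed_prs(samples_i, samples_j):
    """Average squared distance over all n_i * n_j equally weighted sample pairs."""
    total = sum(
        sum((a - b) ** 2 for a, b in zip(xi, xj))
        for xi in samples_i for xj in samples_j
    )
    return total / (len(samples_i) * len(samples_j))

s1 = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(400)]
s2 = [(random.gauss(5, 1), random.gauss(0, 1)) for _ in range(400)]
print(ed_prs(s1, s2))  # close to 29
```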

  17. 3. Grid Approximation and Pair-wise between Samples (GAPS) • Approximation by a grid of √s × √s cells formed on the uncertainty domain • Probability of each cell determined by sampling
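One possible way to build the GAPS grid from samples (an illustrative Python sketch, not the paper's implementation):

```python
import random

random.seed(2)

def grid_cells(samples, g):
    """Bin 2-D samples into a g x g grid over their bounding box;
    return [(cell_center, probability), ...] for the non-empty cells."""
    xs, ys = [p[0] for p in samples], [p[1] for p in samples]
    x0, x1, y0, y1 = min(xs), max(xs), min(ys), max(ys)
    counts = {}
    for x, y in samples:
        cx = min(int((x - x0) / (x1 - x0) * g), g - 1)
        cy = min(int((y - y0) / (y1 - y0) * g), g - 1)
        counts[(cx, cy)] = counts.get((cx, cy), 0) + 1
    center = lambda c, lo, hi: lo + (c + 0.5) * (hi - lo) / g
    return [((center(cx, x0, x1), center(cy, y0, y1)), k / len(samples))
            for (cx, cy), k in counts.items()]

samples = [(random.uniform(0, 14), random.uniform(0, 14)) for _ in range(5000)]
cells = grid_cells(samples, 14)      # at most 14 * 14 = 196 weighted cells
print(len(cells))  # up to 196
```

The resulting weighted cells then play the role of a discrete uncertain object, and the pairwise step is the same brute-force double loop as before.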

  18. 4. Pair-wise between Gaussian Mixture (PGM) • Approximate an uncertain object oi by a mixture of Gaussian distributions: ∑u∈Ci Ai,u N(μi,u, Σi,u) • (use K-means to cluster samples into a few clusters) • EDPGM(oi, oj) = ∑u∈Ci ∑v∈Cj Ai,u Aj,v (||μi,u − μj,v||² + trace(Σi,u) + trace(Σj,v))
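Once the mixture components are fitted (the slide uses K-means on the samples; the fitting step is omitted here), the PGM double sum is straightforward. A Python sketch with hypothetical component parameters:

```python
# ED_PGM = sum_u sum_v A_iu * A_jv * (||mu_iu - mu_jv||^2 + tr(S_iu) + tr(S_jv)).
def ed_pgm(mix_i, mix_j):
    """Each mixture is a list of (weight, mean, trace_of_covariance) triples."""
    total = 0.0
    for a_u, mu_u, tr_u in mix_i:
        for a_v, mu_v, tr_v in mix_j:
            sq = sum((a - b) ** 2 for a, b in zip(mu_u, mu_v))
            total += a_u * a_v * (sq + tr_u + tr_v)
    return total

# Two-component mixture vs a single Gaussian; weights sum to 1 in each mixture.
mix1 = [(0.5, (0.0, 0.0), 1.0), (0.5, (2.0, 0.0), 1.0)]
mix2 = [(1.0, (5.0, 0.0), 0.5)]
print(ed_pgm(mix1, mix2))  # 18.5
```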

  19. 5. Approximation by Single Gaussian (ASG) • Approximate an uncertain object oi by a single Gaussian distribution N(μi, Σi) • EDASG(oi, oj) = ||μi − μj||² + trace(Σi) + trace(Σj) • Complexity = O((ni + nj)d)
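ASG needs only one pass over each object's samples, which is where the O((ni + nj)d) bound comes from: the formula uses only the sample means and trace(Σ), i.e., the sum of per-dimension variances. A Python sketch (biased 1/n covariance; toy samples of my own):

```python
def asg_stats(samples):
    """Sample mean and trace of the (biased, 1/n) sample covariance."""
    n, d = len(samples), len(samples[0])
    mean = [sum(x[k] for x in samples) / n for k in range(d)]
    tr = sum(sum((x[k] - mean[k]) ** 2 for x in samples) / n for k in range(d))
    return mean, tr

def ed_asg(samples_i, samples_j):
    mu_i, tr_i = asg_stats(samples_i)
    mu_j, tr_j = asg_stats(samples_j)
    return sum((a - b) ** 2 for a, b in zip(mu_i, mu_j)) + tr_i + tr_j

s1 = [(0.0, 0.0), (2.0, 0.0)]    # tiny sample sets for illustration
s2 = [(4.0, 0.0), (6.0, 0.0)]
print(ed_asg(s1, s2))  # 16 + 1 + 1 = 18.0
```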

  20. Equivalence of PRS, PGM and ASG • Theorem: Given any uncertain objects oi, oj and their samples xi,1, …, xi,ni, xj,1, …, xj,nj, EDPRS(oi, oj) = EDPGM(oi, oj) = EDASG(oi, oj) • Theoretically, ASG is the cheapest of all the methods except DM, while producing the same results as PRS and PGM • What about compared with DM and GAPS?
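The equivalence can be checked numerically: on the same sample sets, the pairwise average (PRS) and the single-Gaussian formula (ASG) agree up to floating-point rounding, not just approximately. A Python sketch on random data of my own:

```python
import random

random.seed(3)
s1 = [(random.gauss(0, 1), random.gauss(0, 2)) for _ in range(50)]
s2 = [(random.gauss(4, 1), random.gauss(1, 1)) for _ in range(60)]

def ed_prs(si, sj):
    """Average squared distance over all sample pairs."""
    return sum(sum((a - b) ** 2 for a, b in zip(x, y))
               for x in si for y in sj) / (len(si) * len(sj))

def ed_asg(si, sj):
    """||mean_i - mean_j||^2 + trace terms, from one pass over each sample set."""
    def stats(s):
        n, d = len(s), len(s[0])
        mu = [sum(x[k] for x in s) / n for k in range(d)]
        tr = sum(sum((x[k] - mu[k]) ** 2 for x in s) / n for k in range(d))
        return mu, tr
    (mi, ti), (mj, tj) = stats(si), stats(sj)
    return sum((a - b) ** 2 for a, b in zip(mi, mj)) + ti + tj

print(abs(ed_prs(s1, s2) - ed_asg(s1, s2)))  # ~0: floating-point noise only
```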

  21. Performance Study • Experimental results show that ASG is: • much more accurate than DM, with comparable speed • much faster than GAPS, with higher or comparable accuracy • (# grid cells = # samples)
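Why DM is fast but inaccurate follows from the formulas themselves: EDDM drops the trace terms, so it underestimates the expected squared distance by exactly trace(Σi) + trace(Σj). A tiny Python illustration (the numbers are mine, not from the experiments):

```python
# DM keeps only the squared distance between means; the true expected squared
# distance adds trace(Sigma_i) + trace(Sigma_j) on top.
def ed_dm(mu_i, mu_j):
    return sum((a - b) ** 2 for a, b in zip(mu_i, mu_j))

# Hypothetical case: wide objects (trace 5 each) whose means are only 1 apart.
mu1, tr1 = (0.0, 0.0), 5.0
mu2, tr2 = (1.0, 0.0), 5.0
true_ed = ed_dm(mu1, mu2) + tr1 + tr2
print(ed_dm(mu1, mu2), true_ed)  # 1.0 11.0 -- DM is off by the trace terms
```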

  22. Experiment 1 • 100 uncertain objects (4 Gaussian pdfs, variances in [1,10])

  23. Experiment 1

  24. Experiment 1 • ASG: ~ 0.02ms

  25. Experiment 2 • Data generated in the same way as Ngai et al., “Efficient clustering of uncertain data”, in the 2006 IEEE International Conference on Data Mining (ICDM) • A grid of 14 × 14 cells • Probability of each cell randomly generated and normalized • GAPS produces the correct solution, but how close is ASG?

  26. Experiment 2

  27. Experiment 2 • ASG: ~ 0.02ms

  28. Experiment 3 • ASG also approximates well objects with uniform pdf • 10 objects with radii in [1, 5], randomly located in a 100 × 100 2-D space • ASG takes 100 samples, repeated 6 times • Accuracy • Worst case > 0.98 • Average > 0.99

  29. Experiment 4 • Scalability w.r.t. # Dimensions • 2/3/4-D • 256/216/256 samples/cells • ASG • Accuracy: 0.97 – 0.99 • Time: ~ 0.02ms or less

  30. Experiment 4

  31. Experiment 4

  32. Conclusion • Importance of expected distance calculation in queries and data mining applications on uncertain data • Analytic solutions for special cases (uniform/Gaussian pdf) • ASG obtains highly accurate results quickly • ASG can replace GAPS used in recent research work
