
Geometric Approach


mbonds


Presentation Transcript


  1. Geometric Approach • Geometric interpretation: each node holds a statistics vector • Coloring the vector space: grey where function > threshold, white where function <= threshold • Goal: determine the color of the global data vector (the average of the local vectors).
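
The coloring can be sketched in a few lines. This is only an illustration: the monitored function (`squared_norm`), the threshold, and the sample vectors are hypothetical and do not come from the slides.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]

def color(f: Callable[[Vector], float], x: Vector, threshold: float) -> str:
    """Return 'grey' where f(x) > threshold, otherwise 'white'."""
    return "grey" if f(x) > threshold else "white"

def average(vectors: Sequence[Vector]) -> List[float]:
    """Component-wise mean of the nodes' statistics vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def squared_norm(x: Vector) -> float:
    """Hypothetical monitored function: squared Euclidean norm."""
    return sum(c * c for c in x)

# Hypothetical local statistics vectors, one per node.
local_vectors = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
global_vector = average(local_vectors)
print(color(squared_norm, global_vector, threshold=1.5))
```

The monitoring task is to track the color of `global_vector` without shipping every `local_vectors` update to a central point.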

  2. Bounding the Convex Hull • Observation: the average lies in the convex hull of the local vectors • If the convex hull is monochromatic, then so is the average • But the convex hull may become large
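
The observation follows from the definition of the average. Written out, with $v_i$ denoting the vector of node $i$ and $n$ the number of nodes (notation assumed here, not taken from the slides):

```latex
\[
  \bar{v} \;=\; \sum_{i=1}^{n} \lambda_i v_i,
  \qquad \lambda_i = \tfrac{1}{n} \ge 0,
  \qquad \sum_{i=1}^{n} \lambda_i = 1
  \quad\Longrightarrow\quad
  \bar{v} \in \operatorname{conv}\{v_1, \dots, v_n\}.
\]
```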

  3. Drift Vectors • Periodically calculate an estimate vector: the current global statistics vector • Each node maintains a drift vector: the estimate vector offset by the change in the node's local statistics vector since the estimate vector was last calculated • The global average statistics vector is also the average of the drift vectors
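
A small numeric check of the last bullet, assuming the convention used in the geometric monitoring papers: each drift vector is the estimate plus the node's local change since the last synchronization. The sample vectors are made up for illustration.

```python
from typing import List, Sequence

def mean(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

# Hypothetical local statistics vectors at the last synchronization.
at_sync = [[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]]
estimate = mean(at_sync)

# Hypothetical local statistics vectors now.
current = [[2.0, 2.0], [3.0, 1.0], [7.0, 3.0]]

# Drift vector of each node: estimate + (current local - local at sync).
drifts = [
    [e + (c - s) for e, c, s in zip(estimate, cur, old)]
    for cur, old in zip(current, at_sync)
]

# The average of the drift vectors matches the current global average.
print(mean(drifts), mean(current))
```

Each node can compute its own drift vector locally, because it only needs the shared estimate and its own history.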

  4. The Bounding Theorem [SIGMOD’06] • A reference point is known to all nodes • Each vertex of the convex hull constructs a sphere • Theorem: the convex hull is bounded by the union of the spheres • Local constraints!
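
The theorem can be probed empirically. The sketch below assumes each sphere has the segment from the reference point to one vertex as its diameter, and samples random points of the convex hull to check that each falls inside some sphere. The coordinates and helper names are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Sphere = Tuple[List[float], float]  # (center, radius)

def drift_sphere(reference: Vector, vertex: Vector) -> Sphere:
    """Sphere whose diameter is the segment from the reference point to a vertex."""
    center = [(r + v) / 2 for r, v in zip(reference, vertex)]
    return center, math.dist(reference, vertex) / 2

def covered(point: Vector, spheres: Sequence[Sphere], tol: float = 1e-9) -> bool:
    """True if the point lies in at least one sphere (up to a float tolerance)."""
    return any(math.dist(point, c) <= r + tol for c, r in spheres)

def random_hull_point(points: Sequence[Vector], rng: random.Random) -> List[float]:
    """Random convex combination of the given points."""
    weights = [rng.random() + 1e-12 for _ in points]
    total = sum(weights)
    return [sum(w * p[i] for w, p in zip(weights, points)) / total
            for i in range(len(points[0]))]

# Hypothetical reference point and hull vertices.
reference = [0.0, 0.0]
vertices = [[4.0, 0.0], [0.0, 3.0], [-2.0, -2.0], [1.0, 5.0]]
spheres = [drift_sphere(reference, v) for v in vertices]

rng = random.Random(0)
samples = [random_hull_point([reference] + vertices, rng) for _ in range(1000)]
print(all(covered(p, spheres) for p in samples))
```

Sampling is only a sanity check, not a proof. What matters for monitoring is that each sphere depends on one node's data plus the shared reference point, so every node can test its own sphere without communication.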

  5. Basic Algorithm • An initial estimate vector is calculated • Each node checks the color of its drift sphere • The node's drift (its displacement from the estimate vector) is the diameter of the drift sphere • If any sphere is not monochromatic, its node triggers a re-calculation of the estimate vector
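
How a node checks its sphere depends on the monitored function. As a minimal sketch, assume a linear function f(x) = w·x (an assumption made here for illustration; the slides do not fix the function), for which the minimum and maximum over a sphere have a closed form.

```python
import math
from typing import Sequence

Vector = Sequence[float]

def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))

def sphere_is_monochromatic(w: Vector, estimate: Vector, drift: Vector,
                            threshold: float) -> bool:
    """Check one drift sphere for the assumed linear function f(x) = w . x.

    The sphere's diameter runs from the estimate vector to the drift vector.
    Over a sphere, f ranges over [f(center) - r*|w|, f(center) + r*|w|].
    """
    center = [(e + d) / 2 for e, d in zip(estimate, drift)]
    radius = math.dist(estimate, drift) / 2
    spread = radius * math.sqrt(dot(w, w))
    lo, hi = dot(w, center) - spread, dot(w, center) + spread
    # All grey (f > threshold everywhere) or all white (f <= threshold everywhere).
    return lo > threshold or hi <= threshold

def needs_sync(w: Vector, estimate: Vector, drifts: Sequence[Vector],
               threshold: float) -> bool:
    """True if any node would trigger a re-calculation of the estimate vector."""
    return any(not sphere_is_monochromatic(w, estimate, d, threshold)
               for d in drifts)

# Hypothetical values: estimate is white (f = 5 <= 10).
w, estimate, threshold = [1.0, 1.0], [3.0, 2.0], 10.0
print(needs_sync(w, estimate, [[4.0, 2.0], [3.0, -1.0]], threshold))
print(needs_sync(w, estimate, [[4.0, 2.0], [9.0, 6.0]], threshold))
```

In the second call, the sphere of the node that drifted to (9, 6) straddles the threshold, so that node would trigger the re-calculation.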

  6. Reuters Corpus (RCV1-v2) • 800,000+ news stories • Aug 20 1996 to Aug 19 1997 • Corporate/Industrial tagging • n = 10 nodes, random data distribution

  7. Trade-off: Accuracy vs. Performance • Inefficiency: arises when the value of the function at the average is close to the threshold • Performance can be improved at the cost of a less accurate result: • Set an error margin around the threshold value
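
One possible reading of the error margin, sketched under assumptions: the threshold a node tests against is shifted away from the side the estimate is currently on. The function names and the exact shifting rule are hypothetical; the slides do not spell them out.

```python
def effective_threshold(f_estimate: float, threshold: float, margin: float) -> float:
    """Shift the threshold away from the estimate's side by the error margin.

    Assumed rule: a white estimate (f <= threshold) is tested against
    threshold + margin, a grey one against threshold - margin.
    """
    if f_estimate <= threshold:
        return threshold + margin
    return threshold - margin

def straddles(lo: float, hi: float, threshold: float) -> bool:
    """True if function values in [lo, hi] fall on both sides of the threshold."""
    return lo <= threshold < hi

# Hypothetical range of function values over one node's drift sphere.
lo, hi = 9.5, 10.2
threshold, margin = 10.0, 0.5
f_estimate = 9.8

print(straddles(lo, hi, threshold))
print(straddles(lo, hi, effective_threshold(f_estimate, threshold, margin)))
```

With the exact threshold the node would have to communicate. With the margin it stays silent, at the price of possibly reporting "white" while the true value is slightly above the threshold.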

  8. Performance Analysis

  9. Performance Analysis (cntd.)

  10. Balancing • Calculating the global average is costly • It is often possible to average only some of the data vectors.

  11. Shape Sensitivity [PODS’08] • Fitting the cover to the data • Fitting the cover to the threshold surface • Specific function classes

  12. Fitting Cover to Data (using the covariance matrix)

  13. Fitting Cover to Threshold Surface: Reference Vector Selection

  14. Distance Fields Skeleton, Medial Axis

  15. Results: Shape Sensitivity

  16. Prediction-Based Geometric Monitoring [SIGMOD’12] • Stricter local constraints if local predictions remain accurate • Keeping up with v(t) movement • [Figure labels: ΔV1 to ΔV5, ΔVp1 to ΔVp5, e, ep, v(t), f(v(t)) > T]

  17. Local Constraints • Safe Zones! • Let the nodes communicate only when “something happens” • “Tell me only if your measurement is larger than 50!” rather than “Send me your current measurements!”
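
The "larger than 50" rule can be sketched as a one-dimensional safe zone. The assumption here is that the coordinator wants to know whether the average measurement exceeds 50; the slides only give the local rule. The names and sample values are hypothetical.

```python
from typing import List, Sequence

def nodes_to_report(measurements: Sequence[float],
                    local_limit: float = 50.0) -> List[int]:
    """Indices of nodes whose measurement left the safe zone (-inf, local_limit]."""
    return [i for i, m in enumerate(measurements) if m > local_limit]

def average(values: Sequence[float]) -> float:
    return sum(values) / len(values)

# Quiet round: nobody reports, so the average cannot exceed 50.
quiet = [10.0, 49.9, 50.0]
print(nodes_to_report(quiet), average(quiet) <= 50.0)

# One node reports, yet the average is still below 50:
# a local violation does not imply a global one, it only forces communication.
noisy = [10.0, 60.0, 20.0]
print(nodes_to_report(noisy), average(noisy) <= 50.0)
```

As long as every node stays inside its safe zone, silence itself tells the coordinator that the global condition holds.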

  18. Local Distributions • Reasonable to assume future data will behave similarly… • These Safe Zones save more communication!

  19. Optimal Safe Zones

  20. Example: Air quality monitoring • What are the optimal Safe Zones?

  21. The Optimization Problem • Is this convex? • Is this linear? • How many constraints are these? • BAD NEWS: this problem is NP-hard.

  22. The Optimization Problem • Step 3: Use non-convex optimization toolboxes (e.g. Matlab’s “fmincon”). • These toolboxes use sophisticated gradient-based algorithms and return close-to-optimal results.

  23. Data Set • What the data looks like

  24. Ratio Queries • Example of triangular Safe Zones

  25. Improvement over convex-hull cover method • 5,000 hours • Up to 200 nodes were involved in the experiment • The average improvement was by a factor of 17.5 • Why do we improve so much?

  26. Higher Dimensions

  27. Chi-Square Monitoring (5D) • Examples of axis-aligned boxes as Safe Zones

  28. Improvement over GM • 1,000 hours • 90 nodes • The improvement over the Geometric Method (GM) becomes more substantial in higher dimensions.

  29. Safe Zones - Example

  30. Biclique: Non-Convex Safe Zones • Safe Zone algorithm (for 2 nodes): • Take the data points • Build a bipartite graph (how?) • Find the maximal biclique • These are your Safe Zones!

  31. Conclusions • Local filtering for large-scale distributed data systems • Savings in communication are unlimited • Bounded only by the aggregate over system lifetime • Saves bandwidth, central resources, and power • Not necessary to sacrifice precision or latency • Less communication means more privacy
