
Geometric and combinatorial issues in data depth

Greg Aloupis, Université Libre de Bruxelles. Geometric and combinatorial issues in data depth. What is data depth? A quantitative measurement of how central a point is with respect to a data set. Goals: to be able to rank data points, and to find the center of the data cloud.


Presentation Transcript


  1. Greg Aloupis, Université Libre de Bruxelles. Geometric and combinatorial issues in data depth

  2. What is data depth? A quantitative measurement of how central a point is with respect to a data set. • Goals: to be able to rank data points, and to find the center of the data cloud.

  3. Some geometric bivariate medians • Convex hull peeling (Tukey ’70s) • ’85 Chazelle O(n log n) • Halfspace median (Tukey ’74) • ’01 Langerman-Steiger O(n log^3 n), ’03 Chan O(n log n), randomized • Oja median (Oja ’83) • ’01 G.A.-Langerman-Soss-Toussaint O(n log^3 n) • Simplicial median (Liu ’88) • ’01 ALST O(n^4)

  4. Convex hull peeling
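A minimal sketch of convex hull peeling depth, assuming the convention that the outermost hull has depth 1 and that points lying on a hull edge belong to that layer. It recomputes a hull for every layer, so it is a naive illustration and not Chazelle's O(n log n) algorithm; the function names are made up for this example.

```python
def cross(o, a, b):
    """Signed area test: > 0 if o -> a -> b turns left."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def hull_points(pts):
    """Points on the convex hull boundary of sorted, distinct pts (monotone chain)."""
    if len(pts) <= 2:
        return list(pts)

    def half(seq):
        chain = []
        for p in seq:
            # Pop only on strict right turns, so collinear boundary points stay.
            while len(chain) >= 2 and cross(chain[-2], chain[-1], p) < 0:
                chain.pop()
            chain.append(p)
        return chain

    return half(pts) + half(list(reversed(pts)))

def peeling_depths(points):
    """Map each point to the index of the convex layer it lies on (1 = outermost)."""
    remaining = set(points)
    depth = {}
    layer = 1
    while remaining:
        hull = set(hull_points(sorted(remaining)))
        for p in hull:
            depth[p] = layer
        remaining -= hull
        layer += 1
    return depth

depths = peeling_depths([(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)])
```

In this usage example the four corners are peeled first and the center point is left for the second layer, so it is the peel median of the set.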

  8. Halfspace, simplicial and Oja depths of a point in a bivariate data set S. Each median is a point with max/min depth.

  9. Halfspace, simplicial and Oja depths of a point in a bivariate data set S. (Tukey) halfspace depth: For every line through the point, count the points above and below. Return the minimum number counted over all lines.

  12. Halfspace, simplicial and Oja depths of a point in a bivariate data set S. (Tukey) halfspace depth: For every line through the point, count the points above. Return the minimum number counted over all lines.
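The halfspace depth definition can be sketched by brute force in O(n^2) time. This is only an illustration under stated assumptions, not one of the O(n log n) algorithms from the literature: it counts data points in closed halfplanes, treats data points coinciding with the query as always counted, and only tries lines through the query and some data point (rotated slightly either way). The name `halfspace_depth` is hypothetical.

```python
def halfspace_depth(p, points):
    """Fewest data points in any closed halfplane whose boundary passes through p."""
    def cross(u, v):
        return u[0] * v[1] - u[1] * v[0]

    def dot(u, v):
        return u[0] * v[0] + u[1] * v[1]

    vecs = [(x - p[0], y - p[1]) for x, y in points]
    at_p = sum(1 for v in vecs if v == (0, 0))   # points equal to p lie in every halfplane
    others = [v for v in vecs if v != (0, 0)]
    if not others:
        return at_p

    best = len(others)
    for d in others:
        left = right = fwd = back = 0
        for v in others:
            c = cross(d, v)
            if c > 0:
                left += 1
            elif c < 0:
                right += 1
            elif dot(d, v) > 0:
                fwd += 1      # on the line, same side of p as d
            else:
                back += 1     # on the line, opposite side of p
        # Tilting the line slightly sends collinear points to one side or the other.
        best = min(best, min(left, right) + min(fwd, back))
    return at_p + best

square = [(1, 1), (-1, 1), (-1, -1), (1, -1)]
center_depth = halfspace_depth((0, 0), square)
```

For the square above, the center has depth 2, a corner has depth 1, and any query outside the convex hull has depth 0.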

  13. Halfspace, simplicial and Oja depths of a point in a bivariate data set S. (Liu) simplicial depth: Count the closed triangles with vertices in S that contain the point.

  18. Halfspace, simplicial and Oja depths of a point in a bivariate data set S. (Liu) simplicial depth: Count the closed triangles with vertices in S that contain the point. …etc
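A direct O(n^3) sketch of the simplicial depth count, assuming the data points are in general position (no three collinear), so that every triple forms a proper triangle. The function names are illustrative only.

```python
from itertools import combinations

def cross(o, a, b):
    """Orientation of b relative to the directed line o -> a."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def simplicial_depth(p, points):
    """Number of closed triangles with vertices in `points` that contain p."""
    count = 0
    for a, b, c in combinations(points, 3):
        signs = (cross(a, b, p), cross(b, c, p), cross(c, a, p))
        # p is outside exactly when the orientations have strictly mixed signs;
        # zeros mean p is on an edge or vertex, which a closed triangle includes.
        if not (min(signs) < 0 < max(signs)):
            count += 1
    return count

data = [(0, 2), (-2, -1), (2, -1), (5, 5)]
depth_at_origin = simplicial_depth((0, 0), data)
```

Here two of the four triangles contain the origin, so its simplicial depth is 2.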

  19. Halfspace, simplicial and Oja depths of a point in a bivariate data set S. Oja depth: Sum the areas of all triangles with vertices (the point, s_i, s_j).

  22. Halfspace, simplicial and Oja depths of a point in a bivariate data set S. Oja depth: Sum the areas of all triangles with vertices (the point, s_i, s_j). …etc
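The Oja depth sum can be written down directly in O(n^2) time. This sketch returns the unnormalized sum of areas; a smaller value means a more central point, so the Oja median minimizes it. The name `oja_depth` is chosen for this example.

```python
from itertools import combinations

def oja_depth(p, points):
    """Sum of the areas of all triangles (p, s_i, s_j) over pairs of data points."""
    total = 0.0
    for a, b in combinations(points, 2):
        # Half the absolute cross product of (a - p) and (b - p) is the triangle area.
        twice_area = (a[0] - p[0]) * (b[1] - p[1]) - (a[1] - p[1]) * (b[0] - p[0])
        total += abs(twice_area) / 2
    return total

value = oja_depth((0, 0), [(1, 0), (0, 1), (1, 1)])
```

In this example each of the three triangles has area 0.5, so the sum is 1.5.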

  23. Halfspace, simplicial and Oja depths of a point in a bivariate data set S. O(n log n): Khuller-Mitchell ’89, Gil-Steiger-Wigderson ’92, Rousseeuw-Ruts ’96. Ω(n log n): G.A.-Cortes-Gomez-Soss-Toussaint ’01, Langerman-Steiger ’01, G.A.-McLeish ’05

  24. Issue 1: What is the complexity of computing the depth k of a point if k is known to be small/large? • If the peel median has depth k>1 then can we compute it faster? (GSW ’92) • !!! This just in: simplicial depth in O(n + n log(1 + k/n)) • Elmasry-Elbassioni, CCCG last week • Is there a lower bound, sensitive to parameter k? • Something similar for halfspace depth? • Current attempts for O(n log k)

  25. Issue 2: (Improve) simplicial median computation • Remember that horrible O(n^4) result a few slides back

  26. Easy observation • I : the set of line segments between pairs in S. • The simplicial median is on an intersection of two segments in I.
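The observation suggests a brute-force search: evaluate the simplicial depth at every intersection of two segments (shared endpoints included, which covers the data points) and keep the deepest. The sketch below does exactly that, far more slowly than the methods discussed in the talk. It assumes integer coordinates so that `fractions.Fraction` keeps intersections exact, and all function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def cross(o, a, b):
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def simplicial_depth(p, points):
    """Closed triangles on `points` containing p (general position assumed)."""
    count = 0
    for a, b, c in combinations(points, 3):
        signs = (cross(a, b, p), cross(b, c, p), cross(c, a, p))
        if not (min(signs) < 0 < max(signs)):
            count += 1
    return count

def segment_intersection(a, b, c, d):
    """Exact intersection of segments ab and cd, or None if parallel or disjoint."""
    denom = (b[0] - a[0]) * (d[1] - c[1]) - (b[1] - a[1]) * (d[0] - c[0])
    if denom == 0:
        return None
    t = Fraction((c[0] - a[0]) * (d[1] - c[1]) - (c[1] - a[1]) * (d[0] - c[0]), denom)
    u = Fraction((c[0] - a[0]) * (b[1] - a[1]) - (c[1] - a[1]) * (b[0] - a[0]), denom)
    if 0 <= t <= 1 and 0 <= u <= 1:
        return (a[0] + t * (b[0] - a[0]), a[1] + t * (b[1] - a[1]))
    return None

def simplicial_median_bruteforce(points):
    """Return (depth, point) for a deepest candidate among segment intersections."""
    segments = list(combinations(points, 2))
    candidates = set(points)
    for (a, b), (c, d) in combinations(segments, 2):
        x = segment_intersection(a, b, c, d)
        if x is not None:
            candidates.add(x)
    return max(((simplicial_depth(q, points), q) for q in candidates),
               key=lambda pair: pair[0])

best_depth, best_point = simplicial_median_bruteforce([(1, 1), (-1, 1), (-1, -1), (1, -1)])
```

For the four corners of a square, the deepest candidate is the crossing of the two diagonals, which lies in all four closed triangles.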

  27. Outline of a method • Preprocessing: O(n^3) brute-force, actually O(n^2) • Count number of points above/below each segment. • Compute depth of all points. • For each segment, • sort all intersections with other segments. • O(n^2 log n). • Calculate depth of each intersection in O(1) time: • O(n^2) • Overall O(n^4 log n)

  28. Constant time to update depth as we walk

  29. Instead of sorting intersection points and processing each segment alone, we can use topological sweep. The time complexity becomes O(n^4) and the space used is O(n^2). Can we improve this? I.e., find some structure in this depth function.

  30. Conjecture: • A point of maximum simplicial depth can always be found on the intersection of two halving segments • (weak) experiments have not contradicted this

  31. Desirable properties of data depth functions • Affine invariance (at the very least) • Robustness: Outliers should not influence the center. • Monotonicity: Center should move in “same” direction as perturbations

  32. monotonicity

  33. Robustness to outliers • breakdown point: fraction of data that must be moved/added so that median is placed at infinity. • Oja median: was considered to be robust, but finally it was shown that the breakdown point can be near zero for certain configurations. (planar case) (Niinimaa, Oja, Tableman ’90) • simplicial median: don’t know. But the data point of maximum depth can be moved away with few corrupting points (GSW ’92) (planar case) • halfspace median: great! … 1/(d+1) (Donoho, Gasko ’92)

  34. Robustness to outliers • breakdown point: fraction of data that must be moved/added so that median is placed at infinity. • Max breakdown = ½ • In 1D, only the median is affine invariant, monotonic and has max breakdown • Is there such an estimator in higher dimensions?
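A small 1D illustration of the breakdown point, assuming corruption means replacing the k largest values with one far-away value. The helper `corrupt` and the outlier magnitude are made up for this example.

```python
from statistics import mean, median

def corrupt(data, k, far=1e9):
    """Replace the k largest values of data with a far-away outlier."""
    kept = sorted(data)[:len(data) - k]
    return kept + [far] * k

data = list(range(1, 12))                 # 11 points: 1, 2, ..., 11

mean_one_outlier = mean(corrupt(data, 1))       # a single outlier drags the mean away
median_five_bad = median(corrupt(data, 5))      # fewer than half corrupted: median stays put
median_six_bad = median(corrupt(data, 6))       # more than half corrupted: median breaks down
```

With 11 points the median survives 5 corrupted values and fails at 6, which matches the maximum breakdown of ½, while the mean already fails with 1.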

  35. Issue 3: How does the breakdown point depend on the depth of the median? • Convex peeling: breakdown is … zero, unless depth is linear (GSW ’92) • Halfspace breakdown is higher (1/3) for centrosymmetric data distributions, where depth is roughly 1/2 • Instead of 1/(d+1) • So what can we say about other estimators? • For deepest point in plane • For deepest data point

  36. Issue 4: Non-strategic breakdown. All work so far involved carefully placing outliers (erroneous or corrupt data) to move an estimator far away. (Is corrupt data really placed carefully in practice?) What about: • average outliers (random or evenly spaced placement) • strong breakdown (should work regardless of direction at infinity) • special-case outliers (axis-parallel, or radial extension, or ?)

  37. Issue 5: Computing/analyzing other estimators • Projection outlyingness of q (Donoho-Gasko ’92): Take the max of the following, over all projections: |q - Median| / (median deviation from median) • Find an algorithm for the least outlying point. • Gil-Steiger-Wigderson: • superposition of unit vectors to data points = v(ai) • median is a (data) point with || v(ai) R || < 1 ??? • computation in o(n^2)? Properties? • Zonoid depth, Delaunay depth …
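The projection outlyingness formula can be approximated by sampling directions. The sketch below is an assumption-laden illustration, not an exact algorithm: it tries evenly spaced directions over a half-circle, reads "median deviation from median" as the median absolute deviation, and skips directions where that deviation is zero. The function name and the `num_directions` parameter are hypothetical.

```python
import math
from statistics import median

def projection_outlyingness(q, points, num_directions=180):
    """Approximate max over directions of |proj(q) - median| / MAD of the projected data."""
    worst = 0.0
    for k in range(num_directions):
        angle = math.pi * k / num_directions      # a half-circle covers all lines
        ux, uy = math.cos(angle), math.sin(angle)
        proj = [ux * x + uy * y for x, y in points]
        med = median(proj)
        mad = median(abs(v - med) for v in proj)  # median deviation from the median
        if mad == 0:
            continue                              # degenerate direction, skipped here
        worst = max(worst, abs(ux * q[0] + uy * q[1] - med) / mad)
    return worst

data = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1), (2, 1), (-2, -1)]
central = projection_outlyingness((0, 0), data)
far_away = projection_outlyingness((10, 10), data)
```

The sample data is symmetric about the origin, so the origin projects onto the median in every direction and gets outlyingness 0, while the far point scores at least 10.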

  38. Issue 6: Points of high depth • A point with Tukey depth ≥ n/(d+1) is a centerpoint. • Guaranteed to exist, by Helly’s theorem. • O(n) time computation (Jadhav-Mukhopadhyay ’94) • Can be considered to be a median generalization. • ¼ (n choose 3) ≥ maximum simplicial depth ≥ 2/9 (n choose 3) (Boros-Füredi ’82) • (in R^2, ignoring quadratic terms) • Can we compute a “high” depth point quickly? • Tverberg points in R^2 have depth ≥ 1/27 (n choose 3) and can be computed in O(n) time. Anything better? • Is there a point with “high” Oja depth? (normalized)

  39. Things I may have mentioned in the abstract but forgot to include here: • Is it faster to locate a deep point without computing its depth? • How many points have depth>k ? • When do simplicial depth levels become disconnected?

  40. Thank you
