Computational Geometry and Spatial Data Mining

Marc van Kreveld (and Giri Narasimhan), Department of Information and Computing Sciences, Utrecht University

Clustering? • Are the people clustered in this room? • How do we define a cluster? • In spatial data mining we have objects/entities with a location given by coordinates • Cluster definitions involve distance between locations • How do we define distance?

Clustering - options • Determine whether clustering occurs • Determine the degree of clustering • Determine the clusters • Determine the largest cluster • Determine the largest empty region • Determine the outliers

Co-location • Are the men clustered? • Are the women clustered? • Is there a co-location of men and women? • Determine regions favored exclusively by women. Men? Loners? Couples? Families? • Determine empty regions.

Co-location • As before, we may be interested in: • whether there is co-location • the degree of co-location • the largest co-location • the co-locations themselves • the objects not involved in co-location • regions with no (or little) co-location
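
One way to make "the degree of co-location" concrete is sketched below. This is only an illustration under an assumed definition that the slides do not give: the degree is taken to be the fraction of points of one group that have a point of the other group within a distance d. The function name `colocation_degree`, the threshold `d`, and the sample coordinates are all hypothetical.

```python
from math import hypot


def colocation_degree(group_a, group_b, d):
    """Fraction of points in group_a with some point of group_b within distance d.

    Assumed (illustrative) measure; brute force, O(|A| * |B|) distance tests.
    """
    if not group_a:
        return 0.0
    hits = sum(
        1
        for (ax, ay) in group_a
        if any(hypot(ax - bx, ay - by) <= d for (bx, by) in group_b)
    )
    return hits / len(group_a)


# Hypothetical sample data: two of the three "men" have a "woman" nearby.
men = [(0.0, 0.0), (5.0, 5.0), (9.0, 1.0)]
women = [(0.5, 0.0), (5.0, 6.0), (20.0, 20.0)]
degree = colocation_degree(men, women, 1.5)
```

The measure is asymmetric (A near B is not the same as B near A), which matches questions such as "regions favored exclusively by women".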

Spatio-temporal data • Locations have a time stamp • Interesting patterns involve space and time • Anomalies?

Trajectory data • Entities with a trajectory (time-stamped motion path) • Interesting patterns involve subgroups with similar heading, expected arrival, joint motion, ... • n entities = n trajectories; n = 10 – 100,000 • t time steps; t = 10 – 100,000; input size is nt • m = size of subgroup (unknown); m = 10 – 100,000

Examples of trajectory data • Tracked animals (buffalo, birds, ...) • Tracked people (potential terrorists) • Tracked GSMs (e.g. for traffic purposes) • Trajectories of tornadoes • Sports scene analysis (players on a soccer field)

Example pattern in trajectories • What is the location visited by most entities? location = circular region of specified radius

Example pattern in trajectories • What is the location visited by most entities? location = circular region of specified radius (figure: one circular region visited by 4 entities, another visited by 3 entities)

Example pattern in trajectories • Compute buffer of each trajectory

Example pattern in trajectories • Compute buffer of each trajectory • Compute the arrangement of the buffers and the cover count of each cell (figure: cells labeled with cover counts 0, 1, 2)

Example pattern in trajectories • One trajectory has t time stamps; its buffer can be computed in O(t log t) time • All buffers can be computed in O(nt log t) time • The arrangement can be computed in O(nt log(nt) + k) time, where k = O((nt)^2) is the complexity of the arrangement • Cell cover counts are determined in O(k) time

Example pattern in trajectories • Total: O(nt log(nt) + k) time • If the most visited location is visited by m entities, this is O(nt log(nt) + ntm) • Note: input size is nt; n entities, each with a location at t moments
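
The cover count of a single candidate location can be illustrated with a brute-force sketch. This is not the arrangement-based algorithm above: it only counts, for one given center, how many trajectories come within the specified radius, which is the same as asking how many buffers cover that point. It takes O(nt) time per candidate. The function names and sample trajectories are hypothetical.

```python
from math import hypot


def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment ab."""
    px, py = p
    ax, ay = a
    bx, by = b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        return hypot(px - ax, py - ay)
    # Parameter of the projection of p onto the line, clamped to the segment.
    u = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    u = max(0.0, min(1.0, u))
    return hypot(px - (ax + u * dx), py - (ay + u * dy))


def visits(trajectory, center, radius):
    """True if the polygonal trajectory passes within radius of center."""
    if len(trajectory) == 1:
        return hypot(trajectory[0][0] - center[0], trajectory[0][1] - center[1]) <= radius
    return any(
        point_segment_distance(center, a, b) <= radius
        for a, b in zip(trajectory, trajectory[1:])
    )


def cover_count(trajectories, center, radius):
    """Number of trajectories whose buffer of the given radius covers center."""
    return sum(1 for tr in trajectories if visits(tr, center, radius))


# Hypothetical sample: two trajectories cross near (5, 0), a third stays far away.
trajectories = [
    [(0.0, 0.0), (10.0, 0.0)],
    [(5.0, -5.0), (5.0, 5.0)],
    [(0.0, 10.0), (10.0, 10.0)],
]
count_at_crossing = cover_count(trajectories, (5.0, 0.0), 1.0)
count_near_third = cover_count(trajectories, (5.0, 9.5), 1.0)
```

Evaluating this at every candidate center would be far slower than computing the arrangement once; the sketch is only meant to pin down what "cover count" means.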

Patterns in entity data • Spatial data: n points (locations); distance is important (clustering pattern); presence of attributes, e.g. man/woman (co-location patterns) • Spatio-temporal data: n trajectories, each with t time steps; distance is time-dependent (flock pattern, meet pattern); heading and speed are important and are also time-dependent

Entities in subdivisions • Also a co-location pattern • Discovered simply by overlay, e.g., occurrences of oaks on different soil types
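
A minimal sketch of such an overlay, assuming the subdivision is given as labeled simple polygons and the entities as points: each point is assigned to the first polygon that contains it, using a standard ray-casting test, and the occurrences per label are counted. The names `point_in_polygon` and `overlay_counts`, the soil labels, and the coordinates are hypothetical, and points exactly on a boundary are not treated specially.

```python
from collections import Counter


def point_in_polygon(p, polygon):
    """Ray-casting test: True if p lies strictly inside the simple polygon."""
    x, y = p
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does the edge cross the horizontal line through p?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def overlay_counts(points, regions):
    """Count points per region label; regions is a list of (label, polygon)."""
    counts = Counter()
    for p in points:
        for label, polygon in regions:
            if point_in_polygon(p, polygon):
                counts[label] += 1
                break
    return counts


# Hypothetical sample: two adjacent soil regions and four oak locations.
soil = [
    ("clay", [(0, 0), (10, 0), (10, 10), (0, 10)]),
    ("sand", [(10, 0), (20, 0), (20, 10), (10, 10)]),
]
oaks = [(2, 3), (4, 8), (7, 1), (15, 5)]
counts = overlay_counts(oaks, soil)
```

A strongly uneven count relative to the region areas would suggest a co-location of oaks with one soil type.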

Clustering entities in subdivisions • What if it is known that the entities only occur in regions of a certain type? Situation without subdivision (figure labels: radius of cluster, bird nests)

Clustering entities in subdivisions • What if it is known that the entities only occur in regions of a certain type? Situation with land-water subdivision (figure labels: radius of cluster, bird nests)

Clustering entities in subdivisions (figure labels: house, car, burglary)

Region-restricted clustering • Joint research with Joachim Gudmundsson (NICTA, Sydney) and Giri Narasimhan (U of F, Miami), 2006 • Determine clusters in point sets that are sensitive to the geographic context (at least, for the relevant aspects) • Assume that a set of regions is given in which the points can only lie. How should we define clusters?

Region-restricted clustering • Given a set P of points, a set F of regions, a radius r and a subset size m, a region-restricted cluster is a subset P' ⊆ P inside a circle C where • P' has size at least m • C has radius at most 2r • C contains at most πr^2 area of regions of F (figure labels: r, radius ≤ 2r, sum of area ≤ πr^2)

Region-restricted clustering • Given a set P of n points, a set F of polygons with n_f edges in total, and values for r and m, report all region-restricted clusters of exactly m points • Exactly m points? • "Real" clustering (partition)? • Outliers?

Region-restricted clustering • Exactly m points? Every cluster with more than m points consists of clusters with m points that have smaller circles • "Real" clustering (partition)? • Outliers? (figure: example with m = 5)

Region-restricted clustering • Determine all smallest circles with m points of P inside • Test if the radius is ≤ r (report) or > 2r (discard) • If the radius is in between, determine the area of regions of F inside
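
The three-way test on a candidate circle can be sketched as follows. The function name `is_region_restricted_cluster` is hypothetical, the area bound is passed in as a parameter `max_area` rather than fixed, and the area computation is supplied by the caller as a callable so that it is only evaluated for circles whose radius lies between r and 2r.

```python
def is_region_restricted_cluster(radius, r, max_area, area_inside):
    """Decide whether a smallest circle around m points is a cluster.

    radius      -- radius of the candidate circle
    r           -- the radius parameter of the cluster definition
    max_area    -- allowed area of regions of F inside the circle (assumed given)
    area_inside -- zero-argument callable returning the area of F inside the
                   circle; only called when r < radius <= 2r
    """
    if radius <= r:
        return True  # small enough: report without looking at F
    if radius > 2 * r:
        return False  # too large: discard
    return area_inside() <= max_area  # in between: the area of F decides


# Hypothetical usage with made-up numbers.
small = is_region_restricted_cluster(0.8, 1.0, 3.0, lambda: 99.0)
large = is_region_restricted_cluster(2.5, 1.0, 3.0, lambda: 0.0)
middle_ok = is_region_restricted_cluster(1.5, 1.0, 3.0, lambda: 2.0)
middle_bad = is_region_restricted_cluster(1.5, 1.0, 3.0, lambda: 4.0)
```

Passing the area as a callable keeps the O(1) tests cheap and defers the expensive step to the circles that actually need it.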

Region-restricted clustering: Step 1 • Determine all minimal circles with m points of P inside • Determine all minimal circles with 3 points of P inside

ordinary = order-1 VD

Region-restricted clustering • Determine all smallest circles with m points of P inside • Use (m-2)-th order Voronoi diagram: cells where the same (m-2) points are closest • Its vertices are centers of smallest circles around exactly m points

ordinary = order-1 VD

order-2 VD

order-3 VD

Region-restricted clustering • The m-th order Voronoi diagram (or (m-2)-th) has O(nm) cells, edges, and vertices • It can be constructed in O(nm log n) time • We get O(nm) smallest circles with m points inside; for each we also know the radius

Region-restricted clustering 2. Test if the radius is ≤ r (report) or > 2r (discard) • Trivial in O(1) time per circle, so in O(nm) time overall

Region-restricted clustering 3. Determine the area of regions of F inside • Brute force: O(n_f) time per circle, so in O(nm n_f) time overall

Region-restricted clustering • Complication: This need not give all region-restricted clusters! • Need to compute area of F inside a circle with moving center • Requires solving high-degree polynomials

Region-restricted clusters • The anti-climax: we cannot give an exact algorithm! • If we take squares instead of circles, we can deal with the problem ...

Region-restricted clustering 3. Determine the area of regions of F inside • Brute force: O(n_f) time per square, so in O(nm n_f) time overall • The total time for steps 1, 2, and 3 is O(nm log n) + O(nm) + O(nm n_f) = O(nm log n + nm n_f) time
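
A brute-force area computation for one axis-parallel square might look like the sketch below. It assumes the regions of F are given as disjoint simple polygons (vertex lists), clips each polygon against the four sides of the square with Sutherland-Hodgman clipping, and sums the shoelace areas, touching every edge of F a constant number of times. The function names and the sample regions are hypothetical; this is an illustration, not the authors' implementation.

```python
def clip_halfplane(polygon, inside, intersect):
    """Keep the part of polygon on the inside of one clipping line."""
    result = []
    n = len(polygon)
    for i in range(n):
        cur, nxt = polygon[i], polygon[(i + 1) % n]
        cur_in, nxt_in = inside(cur), inside(nxt)
        if cur_in:
            result.append(cur)
        if cur_in != nxt_in:
            result.append(intersect(cur, nxt))
    return result


def clip_to_square(polygon, xmin, ymin, side):
    """Sutherland-Hodgman clipping of a polygon to an axis-parallel square."""
    xmax, ymax = xmin + side, ymin + side

    def cut_x(x0):
        return lambda a, b: (x0, a[1] + (b[1] - a[1]) * (x0 - a[0]) / (b[0] - a[0]))

    def cut_y(y0):
        return lambda a, b: (a[0] + (b[0] - a[0]) * (y0 - a[1]) / (b[1] - a[1]), y0)

    steps = [
        (lambda p: p[0] >= xmin, cut_x(xmin)),
        (lambda p: p[0] <= xmax, cut_x(xmax)),
        (lambda p: p[1] >= ymin, cut_y(ymin)),
        (lambda p: p[1] <= ymax, cut_y(ymax)),
    ]
    for inside, intersect in steps:
        if not polygon:
            break
        polygon = clip_halfplane(polygon, inside, intersect)
    return polygon


def polygon_area(polygon):
    """Shoelace formula; returns 0.0 for an empty vertex list."""
    n = len(polygon)
    twice_area = 0.0
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0


def area_of_regions_in_square(regions, xmin, ymin, side):
    """Total area of the (assumed disjoint) regions inside the square."""
    return sum(polygon_area(clip_to_square(poly, xmin, ymin, side)) for poly in regions)


# Hypothetical sample: a 4 x 4 region and a triangle, queried with the square [2, 6] x [2, 6].
regions = [
    [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)],
    [(5.0, 5.0), (7.0, 5.0), (5.0, 7.0)],
]
area = area_of_regions_in_square(regions, 2.0, 2.0, 4.0)
```

Clipping against a square only needs line intersections, which is one reason squares are easier to handle exactly than circles.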

Region-restricted clustering 3. Determine the area of regions of F inside • Using a suitable data structure (only possible for squares): O(log^2 n_f) time per square, so in O(nm log^2 n_f) time overall • The total time becomes O(nm log n + n_f log^2 n_f + nm log^2 n_f), where nm log n is the order-(m-2) VD construction, n_f log^2 n_f is the preprocessing of the data structure, and nm log^2 n_f is the total query time in the data structure

Region-restricted clustering • The squares solution generalizes to regular polygons (e.g. 20-gons) • An approximation of the radius within (1+ε)r gives an O(n/ε^2 + n_f log^2 n_f + n log n_f /(m ε^2)) time algorithm (figure: 16-gon)

Region-restricted clustering • Open problems: • Develop a region-restricted version of k-means clustering, single link clustering, ... • Region-restricted co-location? • Replace region-restricted by gradual model (figure labels: typical; clusters; 0 /unit, 2 /unit, 5 /unit, 8 /unit)

Patterns in trajectories • n trajectories, each with t time steps, i.e. n polygonal lines with t vertices • Already looked at most visited location

Patterns in trajectories • Flock: near positions of (sub)trajectories for some subset of the entities during some time • Convergence: same destination region for some subset of the entities • Encounter: same destination region with same arrival time for some subset of the entities • Similarity of trajectories • Same direction of movement, leadership, ... (figure labels: flock, convergence)
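
To make the flock pattern concrete, the sketch below checks a given subset of entities under a simplified, assumed definition: the subset is "together" at a time step if all its positions lie within distance r of their centroid, and the flock duration is the longest run of consecutive time steps for which this holds. The centroid test is a sufficient but not necessary condition for fitting in a disc of radius r, and finding the subset itself is not addressed. All names and data are hypothetical.

```python
from math import hypot


def together_at(trajectories, subset, step, r):
    """True if all entities in subset are within r of their centroid at step."""
    pts = [trajectories[i][step] for i in subset]
    cx = sum(x for x, _ in pts) / len(pts)
    cy = sum(y for _, y in pts) / len(pts)
    return all(hypot(x - cx, y - cy) <= r for x, y in pts)


def longest_flock_run(trajectories, subset, r):
    """Longest number of consecutive time steps the subset stays together."""
    steps = min(len(trajectories[i]) for i in subset)
    best = run = 0
    for step in range(steps):
        run = run + 1 if together_at(trajectories, subset, step, r) else 0
        best = max(best, run)
    return best


# Hypothetical sample: three entities, four time steps each.
trajectories = [
    [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)],
    [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0), (3.0, 5.0)],
    [(9.0, 9.0), (1.0, 0.5), (2.0, 0.5), (3.0, 0.5)],
]
run_all = longest_flock_run(trajectories, [0, 1, 2], 1.0)
run_pair = longest_flock_run(trajectories, [0, 1], 1.0)
```

Here the pair of entities 0 and 1 moves together for the first three steps, while all three entities are together only during the two middle steps.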
