SSCP: Mining Statistically Significant Co-location Patterns

Sajib Barua and Jörg Sander

Dept. of Computing Science

University of Alberta, Canada

- Introduction
  - Related work
  - Motivation
- Proposed Method
- Experimental evaluation
  - Synthetic data
  - Real data
- Conclusions

- Co-location patterns are subsets of Boolean spatial features whose instances are often located in close spatial proximity.
- Examples: {Shopping mall, parking} and {Nile crocodile, Egyptian plover}

[Figure: spatial instances A2, B1, B2, C1, C2, C3, and D1]

{A2, B1, C1, D1} form a clique under a relation R.

- {A2, B1, C1} is an instance of co-location {A, B, C}
- {A2, B1, D1} is an instance of co-location {A, B, D}
- {A2, C1, D1} is an instance of co-location {A, C, D}
- {B1, C1, D1} is an instance of co-location {B, C, D}
- {A2, B1, C1, D1} is an instance of co-location {A, B, C, D}

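- Co-location is defined based on a spatial relationship R.
- A co-location type C is a set of n different spatial features f1, f2, …, fn.
- An instance of C is a set of feature instances, one for each feature in C, that form a clique under R.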
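As a minimal sketch of these definitions, the following Python code enumerates the instances of a co-location type as cliques under a distance-based relation R. The coordinates, the radius, and the helper names (`neighbors`, `colocation_instances`) are hypothetical and chosen only for illustration; the slides do not give them.

```python
from itertools import product
from math import dist

# Hypothetical instance coordinates keyed by (feature, id); not taken from the slides.
points = {
    ("A", 1): (0.90, 0.90), ("A", 2): (0.50, 0.50),
    ("B", 1): (0.55, 0.50), ("B", 2): (0.10, 0.80),
    ("C", 1): (0.50, 0.55), ("C", 2): (0.12, 0.82), ("C", 3): (0.90, 0.10),
    ("D", 1): (0.55, 0.55),
}
RADIUS = 0.1  # assumed neighborhood distance that defines the relation R


def neighbors(p, q):
    """R holds when two instances lie within RADIUS of each other."""
    return dist(points[p], points[q]) <= RADIUS


def colocation_instances(features):
    """Return every tuple of instances, one per feature, that forms a clique under R."""
    groups = [[key for key in points if key[0] == f] for f in features]
    found = []
    for combo in product(*groups):
        # Clique check: every pair inside the candidate tuple must be related.
        if all(neighbors(p, q) for i, p in enumerate(combo) for q in combo[i + 1:]):
            found.append(combo)
    return found


print(colocation_instances(["A", "B", "C"]))
print(colocation_instances(["A", "B", "C", "D"]))
```

Enumerating the full Cartesian product is only practical for tiny examples; it is used here because it mirrors the definition directly.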

[Figure: spatial instances A1, A2, B1, B2, C1, C2, and C3]

PI({A, B}) = min{1/2, 1/2} = 0.5

PI({B, C}) = min{1, 2/3} = 0.67

PI({A, C}) = min{1/2, 1/3} = 0.33

PI({A, B, C}) = min{1/2, 1/2, 1/3} = 0.33

PI({A, B, C}) is no larger than any of PI({A, B}), PI({B, C}), and PI({A, C}).

- The participation ratio (PR) of a feature in a co-location type C is the fraction of its instances that participate in any instance of C.
- The participation index (PI) of C is the minimum participation ratio over the features in C.

PR and PI are anti-monotonic: neither can increase when a feature is added to C.
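The PR and PI definitions can be sketched in a few lines of Python. The neighbor pairs below are an assumption: the figure itself is not recoverable, so the pairs were picked to be consistent with the PI-values of the example (two A, two B, and three C instances). The names `edges`, `related`, and `participation_index` are illustrative only.

```python
from itertools import product

# Hypothetical neighbor pairs under R, chosen to agree with the example's PI-values.
edges = {("A2", "B1"), ("A2", "C1"), ("B1", "C1"), ("B2", "C2")}
instances = {"A": ["A1", "A2"], "B": ["B1", "B2"], "C": ["C1", "C2", "C3"]}


def related(p, q):
    """Symmetric lookup in the neighbor relation."""
    return (p, q) in edges or (q, p) in edges


def participation_index(features):
    """Return (PI, {feature: PR}) for the co-location type given by `features`."""
    participating = {f: set() for f in features}
    for combo in product(*(instances[f] for f in features)):
        # A co-location instance is a clique: every pair must be related.
        if all(related(p, q) for i, p in enumerate(combo) for q in combo[i + 1:]):
            for f, inst in zip(features, combo):
                participating[f].add(inst)
    pr = {f: len(participating[f]) / len(instances[f]) for f in features}
    return min(pr.values()), pr


for colocation in (["A", "B"], ["B", "C"], ["A", "C"], ["A", "B", "C"]):
    pi, pr = participation_index(colocation)
    print(colocation, round(pi, 2), pr)
```

Under these assumed pairs, the PI of {A, B, C} cannot exceed the PI of any of its 2-size subsets, which is the anti-monotonicity the pruning later relies on.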

- Spatial statistics
- Ripley’s K function, distance-based measures, co-variogram function.

- Spatial data mining
- Koperski et al. [4] mine spatial association rules.
- Morimoto [5] also looks for frequently occurring patterns.
- Shekhar et al. [2] introduce three models to materialize transactions.
- Further approaches: Huang et al. [3], Yoo et al. [6,7], and Xiao et al. [8].

- Spatial statistics
- Defined only for pairs.

- Co-location mining
- Only one global threshold for PI is used.
- There is no guideline for setting the PI-threshold.
- Spatial auto-correlation and feature abundance effects are not addressed.

A simple threshold can report meaningless patterns or can miss meaningful patterns.

A has fewer instances

B is abundant

A & B have true spatial dependency.

Assume PI-threshold = 0.4

Existing co-location mining algorithms will not report {A, B}.

A & B are abundant.

Both randomly distributed.

Do not have any true spatial dependency.

Assume PI-threshold = 0.4

Existing co-location mining algorithms will report {A, B}.

A & B are auto-correlated.

Do not have any true spatial dependency.

Assume PI-threshold = 0.4

Existing co-location mining algorithms will report {A, B}.

- Our approach uses a statistical test.
- Spatial dependency is measured using PI.

#○ = 12

#∆ = 12

If features ○ and ∆ were spatially independent of each other, what is the chance of seeing a PI-value of {○, ∆} equal to or higher than the observed PI-value (0.41)?

[Figure: the observed data and artificial data sets generated under the null model]

PIobs = 0.41

p-value = 0.163

α = 0.05

If p <= α, PIobs is statistically significant at level α. Here p = 0.163 > 0.05, so the observed PI-value is not statistically significant.

A & B are auto-correlated.

Do not have any true spatial dependency.

- Auto-correlation is modeled as a cluster process: Poisson cluster process [9].
- Auto-correlation is measured in terms of the intensity and type of distribution of a parent process and of the offspring process around each parent.

- Estimate the summary statistics.
- Auto-correlated feature: intensity of the parent and offspring processes (κ and µ values).
- Randomly distributed feature: Poisson intensity, either homogeneous (a constant) or non-homogeneous (a function of x and y).
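To illustrate what the parameters κ and µ control, here is a rough sketch of simulating one kind of Poisson cluster process, the Matérn cluster process, on the unit square using only the Python standard library. The function names and the simplified edge handling (parents are drawn only inside the square, offspring outside it are dropped) are assumptions of this sketch, not details taken from the slides.

```python
import math
import random


def poisson(lam, rng):
    """Draw a Poisson(lam) count with Knuth's multiplication method (fine for small lam)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def matern_cluster(kappa, mu, r, rng):
    """Simulate a Matern cluster process on the unit square.

    Parents follow a homogeneous Poisson process with intensity kappa. Each parent
    gets Poisson(mu) offspring placed uniformly in a disc of radius r around it.
    Only offspring are returned; offspring outside the square are dropped.
    """
    points = []
    for _ in range(poisson(kappa, rng)):
        px, py = rng.random(), rng.random()
        for _ in range(poisson(mu, rng)):
            # The square root makes offspring uniform over the disc area.
            rad = r * math.sqrt(rng.random())
            ang = rng.uniform(0, 2 * math.pi)
            x, y = px + rad * math.cos(ang), py + rad * math.sin(ang)
            if 0 <= x <= 1 and 0 <= y <= 1:
                points.append((x, y))
    return points


pts = matern_cluster(kappa=40, mu=5, r=0.05, rng=random.Random(0))
print(len(pts))
```

Because the number of offspring is random, the sketch does not fix the total number of instances; a null-model generator that must match the observed instance count would need an extra conditioning step.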

- The artificial data sets maintain the following properties of the observed data:
- same number of instances for each feature, and
- similar spatial distribution for each individual feature.

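- Estimate the p-value of the observed PI-value, i.e. the probability under the null model of seeing a PI-value equal to or higher than PIobs.
- Use randomization tests, where a large number of data sets conforming to the null hypothesis are generated.

- How many simulations do we need?
- Diggle suggested 500 simulations for α = 0.01 [10].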
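A randomization test of this kind can be sketched as follows for a pair of features. The sketch assumes the simplest null model (both features uniformly and independently distributed on the unit square, with the observed instance counts) and a distance-based neighbor relation. It uses the common Monte Carlo estimate (count + 1) / (simulations + 1); the slides do not show the exact estimator, so treat that choice, the radius, and all function names as assumptions.

```python
import random
from math import dist


def pi_pair(a_pts, b_pts, radius):
    """Participation index of a 2-feature co-location under a distance-based relation."""
    a_in = sum(any(dist(a, b) <= radius for b in b_pts) for a in a_pts)
    b_in = sum(any(dist(a, b) <= radius for a in a_pts) for b in b_pts)
    return min(a_in / len(a_pts), b_in / len(b_pts))


def uniform_points(n, rng):
    """n points drawn uniformly at random from the unit square."""
    return [(rng.random(), rng.random()) for _ in range(n)]


def p_value(a_pts, b_pts, radius, simulations=500, seed=0):
    """Estimate the chance of a PI-value >= the observed one under independence."""
    rng = random.Random(seed)
    pi_obs = pi_pair(a_pts, b_pts, radius)
    at_least = 0
    for _ in range(simulations):
        # Each simulated data set keeps the observed number of instances per feature.
        sim_a = uniform_points(len(a_pts), rng)
        sim_b = uniform_points(len(b_pts), rng)
        if pi_pair(sim_a, sim_b, radius) >= pi_obs:
            at_least += 1
    # The observed data set is counted as one more sample, hence the +1 terms.
    return pi_obs, (at_least + 1) / (simulations + 1)


demo_rng = random.Random(42)
obs_a, obs_b = uniform_points(12, demo_rng), uniform_points(12, demo_rng)
print(p_value(obs_a, obs_b, radius=0.1))
```

For auto-correlated features, `uniform_points` would be replaced by a generator for the fitted cluster process, so that each simulated feature keeps a spatial distribution similar to the observed one.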

- In a simulation, we generate feature instances only for those clusters that are close enough to instances of other features (auto-correlated or not).

This reduces the time spent on artificial data generation in each simulation.

- In a simulation Ri, the PI-value of a co-location C, written PI_Ri(C) here, does not always need to be computed.
- Procedure:
- In each simulation, compute the PI-values of all possible 2-size subsets.
- For a co-location C of size k (> 2), look up the PI-values of its 2-size subsets. If a subset C' is found for which PI_Ri(C') < PIobs(C), then PI_Ri(C) < PIobs(C) by anti-monotonicity, and PI_Ri(C) is not required to be computed.
- Otherwise, PI_Ri(C) is computed for simulation Ri.

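Four features A, B, C, D (PI_Ri denotes the PI-value computed in simulation Ri)

- {A,B,C}: If PI_Ri{A,B} < PIobs{A,B,C}, then PI_Ri{A,B,C} < PIobs{A,B,C}. No need to compute PI_Ri{A,B,C}.
- PI_Ri{A,B,C} < PIobs{A,B,C} does not imply PI_Ri{A,B,C,D} < PIobs{A,B,C,D}.
- {A,B,C,D}: decided by checking its 2-size subsets against PIobs{A,B,C,D}.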
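The pruning rule can be sketched as below. For each co-location C the code counts the simulations in which the simulated PI-value reaches the observed one, and skips the expensive PI computation whenever some 2-size subset already falls below PIobs(C). The function `count_exceedances`, the callback `compute_pi`, and the toy numbers are hypothetical; a real `compute_pi` would evaluate the PI-value on a generated data set.

```python
from itertools import combinations


def count_exceedances(pi_obs, simulations, compute_pi):
    """Count, per co-location C, the simulations whose PI-value of C is >= PIobs(C).

    pi_obs: {frozenset(features): observed PI-value}
    simulations: iterable of simulated data sets (opaque to this function)
    compute_pi: callback (data, co-location) -> PI-value in that data set
    """
    counts = {c: 0 for c in pi_obs}
    # All 2-size subsets of the candidate co-locations.
    pairs = {frozenset(p) for c in pi_obs for p in combinations(sorted(c), 2)}
    evaluations = 0  # number of PI computations actually performed
    for data in simulations:
        pair_pi = {p: compute_pi(data, p) for p in pairs}
        evaluations += len(pairs)
        for c, observed in pi_obs.items():
            if len(c) == 2:
                value = pair_pi[c]
            elif any(pair_pi[frozenset(p)] < observed
                     for p in combinations(sorted(c), 2)):
                continue  # pruned: anti-monotonicity puts the PI of C below PIobs(C)
            else:
                value = compute_pi(data, c)  # no 2-size subset rules C out
                evaluations += 1
            if value >= observed:
                counts[c] += 1
    return counts, evaluations


# Toy usage: each "simulation" is a lookup table of PI-values (hypothetical numbers).
AB, AC, BC, ABC = (frozenset(s) for s in ("AB", "AC", "BC", "ABC"))
observed = {AB: 0.5, AC: 0.5, BC: 0.5, ABC: 0.3}
sims = [
    {AB: 0.2, AC: 0.6, BC: 0.6},  # ABC is pruned here, so no entry is needed
    {AB: 0.6, AC: 0.6, BC: 0.6, ABC: 0.4},
]
counts, evaluations = count_exceedances(observed, sims, lambda data, c: data[c])
print(counts[ABC], evaluations)
```

In the first toy simulation the lookup for {A, B, C} is never attempted, which is exactly the saving the rule is meant to provide.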

The worst-case complexity is O(2^n) for n features, since every subset of features is a candidate co-location.

- The size of the largest co-location is usually much smaller.
- The largest co-location size is predictable.
- If PIobs(C) = 0, we do not compute the p-value of C.
- Our pruning strategies.

All of these keep the actual cost in practice below the worst-case cost.
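As a worked count behind the exponential worst case (a standard identity, not a figure from the slides), the number of candidate co-locations of size at least two among n features is:

```latex
\[
  \sum_{k=2}^{n} \binom{n}{k} \;=\; 2^{n} - n - 1,
  \qquad \text{e.g. } n = 5:\quad 2^{5} - 5 - 1 = 26 .
\]
```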

Negative association:

- Features ○ and ∆, with 40 instances each.
- This synthetic data set is generated using a multi-type Strauss process to impose a negative association (inhibition) between the two features.

Result: PIobs = 0.55 and p-value = 0.931 > 0.05 (α), hence {○, ∆} will not be reported.

Autocorrelation:

- #○ = 100 and #∆ = 120.
- ∆: independently and uniformly distributed over the space.
- ○: spatially auto-correlated.
- In our generated data, ∆ is found in most clusters of ○.
- The summary statistics of ○ are estimated by fitting the model of a Matérn cluster process [9] (κ = 40, µ = 5, r = 0.05).

Results:

- PIobs{○, ∆} = 0.49; existing algorithms will report the pattern if a threshold <= 0.49 is chosen.
- p-value = 0.383 > 0.05 (α); {○, ∆} is not reported.

Multiple features:

#○ = 40, #∆ = 40, #+ = 118, #x = 40, and # = 30.

- Study area = unit square, co-location neighborhood radius = 0.1
- Features ○ and ∆ are negatively associated.
- Feature + is spatially auto-correlated.
- Features +, ○, and x are positively associated.
- Feature is randomly distributed.

Significant co-location patterns = {○, +}, {○, x}, {+, x}, {○, +, x}, {○, +, }, {○, x, }, {+, x, }, and {○, +, x, }.

- Features ○, ∆, and + are auto-correlated and strongly associated. Each has 400 instances.
- Feature x is randomly distributed and has 20 instances.
- Our algorithm finds all co-locations of features ○, ∆, and x.
- The number of instances of each auto-correlated feature is increased:
- the number of clusters is kept the same;
- the number of instances per cluster is increased by a factor k.

[Figures: runtime comparison; speedup]

- The number of clusters for features ○, ∆, and + is increased by a factor k, while the number of instances per cluster is kept the same.
- The total number of instances of x is increased by the same factor k.

[Figures: runtime comparison; speedup]

- ○ = Cataglyphis ants (29) and ∆ = Messor ants (68).
- PIobs{Cataglyphis, Messor} = min{24/29, 30/68} = 0.44.
- p-value = 0.142 > 0.05 (α); co-location {○, ∆} is not significant.
- R. D. Harkness also did not find any clear association between these two species.
- Existing algorithms will report {○, ∆} if the PI-threshold <= 0.44.

- A new definition of co-location patterns.
- Does not depend on a global threshold.
- Statistically meaningful.
- The runtime cost of randomization tests is reduced.
- Future work: investigate other prevalence measures to check whether they allow additional pruning techniques.
- Future work: remove redundant patterns.

- 1. Agrawal, R., Srikant, R.: Fast Algorithms for Mining Association Rules in Large Databases. In: Proc. VLDB, pp. 487-499 (1994)
- 2. Shekhar, S. et al.: Discovering Spatial Co-location Patterns: A Summary of Results. In: Proc. SSTD, pp. 236-256 (2001)
- 3. Huang, Y. et al.: Discovering Colocation Patterns from Spatial Data Sets: A General Approach. IEEE TKDE 16(12), 1472-1485 (2004)
- 4. Koperski, K. et al.: Discovery of Spatial Association Rules in Geographic Information Databases. In: Proc. SSD, pp. 47-66 (1995)
- 5. Morimoto, Y.: Mining Frequent Neighboring Class Sets in Spatial Databases. In: Proc. SIGKDD, pp. 353-358 (2001)
- 6. Yoo, J. S. et al.: A Partial Join Approach for Mining Co-location Patterns. In: Proc. GIS, pp. 241-249 (2004)
- 7. Yoo, J. S. et al.: A Joinless Approach for Mining Spatial Co-location Patterns. IEEE TKDE 18(10), 1323-1337 (2006)
- 8. Xiao, X. et al.: Density Based Co-location Pattern Discovery. In: Proc. GIS, pp. 250-259 (2008)
- 9. Illian, J. et al.: Statistical Analysis and Modelling of Spatial Point Patterns.
- 10. Diggle, P. J.: Statistical Analysis of Spatial Point Patterns (2003)

Questions?