


SSCP: Mining Statistically Significant Co-location Patterns

Sajib Barua and Jörg Sander

Dept. of Computing Science

University of Alberta, Canada



Outline

  • Introduction

    • Related work

    • Motivation

  • Proposed Method

  • Experimental evaluation

    • Synthetic data

    • Real data

  • Conclusions


Definition

  • Co-location patterns are subsets of Boolean spatial features whose instances are often located in close spatial proximity.

  • Examples:

{Shopping mall, parking}

{Nile crocodile, Egyptian plover}


Event Centric Model

  • Co-location is defined based on a spatial relationship R.

  • A co-location type C is a set of n different spatial features f1, f2, …, fn.

Example (figure with instances A2, B1, B2, C1, C2, C3, D1): {A2, B1, C1, D1} form a clique under a relation R, so

  • {A2, B1, C1} is an instance of co-location {A, B, C}

  • {A2, B1, D1} is an instance of co-location {A, B, D}

  • {A2, C1, D1} is an instance of co-location {A, C, D}

  • {B1, C1, D1} is an instance of co-location {B, C, D}

  • {A2, B1, C1, D1} is an instance of co-location {A, B, C, D}


Prevalence Measure

  • Participation ratio (PR) of a feature in a co-location type C is the fraction of its instances participating in any instance of C.

  • Participation index (PI) is the minimum participation ratio in C.

  • PR and PI are anti-monotonic: PI({A, B, C}) ≤ PI({A, B}), PI({B, C}), and PI({A, C}).

Example (figure with instances A1, A2, B1, B2, C1, C2, C3):

PI({A, B}) = min {1/2, 1/2} = 0.5

PI({B, C}) = min {1, 2/3} = 0.67

PI({A, C}) = min {1/2, 1/3} = 0.33

PI({A, B, C}) = min {1/2, 1/2, 1/3} = 0.33
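The PR and PI definitions can be illustrated with a small brute-force sketch. The coordinates, the neighborhood radius, and the function name `participation_index` are all hypothetical; the points are simply chosen so that the resulting values agree with the PI-values on this slide. An instance of a co-location is taken to be one instance per feature forming a clique under the relation "Euclidean distance ≤ radius".

```python
from itertools import product
from math import dist


def participation_index(instances, features, radius):
    """Return (PI, {feature: PR}) for the co-location given by `features`.

    instances: {feature: {instance_name: (x, y)}}
    Two instances are neighbors when their distance is <= radius.
    """
    participating = {f: set() for f in features}
    # Brute force: try one instance per feature and keep the cliques.
    for combo in product(*(instances[f].items() for f in features)):
        coords = [xy for _, xy in combo]
        is_clique = all(
            dist(p, q) <= radius
            for i, p in enumerate(coords)
            for q in coords[i + 1:]
        )
        if is_clique:
            for f, (name, _) in zip(features, combo):
                participating[f].add(name)
    # PR = fraction of a feature's instances that occur in some clique.
    ratios = {f: len(participating[f]) / len(instances[f]) for f in features}
    return min(ratios.values()), ratios


# Hypothetical layout: A2, B1, C3 are mutually close; B2 and C1 are close;
# A1 and C2 are isolated.
instances = {
    "A": {"A1": (10.0, 0.0), "A2": (0.0, 0.0)},
    "B": {"B1": (0.5, 0.0), "B2": (5.0, 5.0)},
    "C": {"C1": (5.5, 5.0), "C2": (0.0, 10.0), "C3": (0.25, 0.5)},
}

pi_ab, _ = participation_index(instances, ["A", "B"], radius=1.0)
pi_bc, _ = participation_index(instances, ["B", "C"], radius=1.0)
pi_ac, _ = participation_index(instances, ["A", "C"], radius=1.0)
pi_abc, pr_abc = participation_index(instances, ["A", "B", "C"], radius=1.0)
```

With this layout the four PI-values come out as 1/2, 2/3, 1/3, and 1/3, and the value for {A, B, C} does not exceed that of any of its 2-size subsets. The enumeration is exponential in the pattern size and is only meant to make the definitions concrete.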


Related Work

  • Spatial statistics

    • Ripley’s K function, distance-based measures, co-variogram function.

  • Spatial data mining

    • Koperski et al. [4] mine spatial association rules.

    • Morimoto [5] also looks for frequently occurring patterns.

    • Shekhar et al. [2] introduce three models to materialize transactions.

    • Huang et al. [3], Yoo et al. [6,7], and Xiao et al. [8].


Limitations of the Existing Methods

  • Spatial statistics

    • Defined only for pairs.

  • Co-location mining

    • Only one global threshold for PI is used.

    • No guideline for setting the PI-threshold.

    • Does not address the effects of spatial auto-correlation and feature abundance.

A simple threshold can report meaningless patterns or miss meaningful ones.


Motivation

  • A has fewer instances; B is abundant.

  • A & B have a true spatial dependency.

  • Assume PI-threshold = 0.4.

  • Existing co-location mining algorithms will not report {A, B}.


Motivation

  • A & B are abundant, and both are randomly distributed.

  • They do not have any true spatial dependency.

  • Assume PI-threshold = 0.4.

  • Existing co-location mining algorithms will report {A, B}.


Motivation

  • A & B are auto-correlated.

  • They do not have any true spatial dependency.

  • Assume PI-threshold = 0.4.

  • Existing co-location mining algorithms will report {A, B}.


Our Idea

  • Our approach uses a statistical test.

  • Spatial dependency is measured using PI.

#○ = 12

#∆ = 12

If features ○ and ∆ were spatially independent of each other, what is the chance of seeing a PI-value of {○, ∆} equal to or higher than the observed PI-value (0.41)?


Generate Artificial Data Sets

Observed data

Artificial data sets generated under null model


p-value computation

  • If p ≤ α, PIobs is statistically significant at level α.

  • Example: PIobs = 0.41, p-value = 0.163, α = 0.05. Since 0.163 > 0.05, the observed PI-value is not statistically significant.


Auto-correlated Feature

  • A & B are auto-correlated.

  • They do not have any true spatial dependency.


Modeling Auto-correlation

  • Auto-correlation is modeled as a cluster process.

Poisson Cluster Process [9]

  • Autocorrelation is measured in terms of intensity and type of distribution of a parent process and offspring process around each parent.
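One common Poisson cluster process is the Matérn cluster process, which the experiments later in this deck fit with parameters κ, µ, and r. The sketch below simulates it on the unit square under stated assumptions: parents follow a homogeneous Poisson process with intensity κ, each parent gets a Poisson(µ) number of offspring placed uniformly in a disc of radius r, parents are not reported as points, and offspring falling outside the unit square are simply dropped (no edge correction). The function names are hypothetical, and the Poisson sampler is Knuth's textbook method, not anything taken from the slides.

```python
import math
import random


def poisson(lam, rng):
    """Draw a Poisson(lam) count with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def matern_cluster(kappa, mu, r, rng):
    """Simulate a Matérn cluster process on the unit square.

    Returns (parents, points); only `points` are feature instances.
    """
    parents = [(rng.random(), rng.random()) for _ in range(poisson(kappa, rng))]
    points = []
    for px, py in parents:
        for _ in range(poisson(mu, rng)):
            # Uniform position inside the disc of radius r around the parent.
            radius = r * math.sqrt(rng.random())
            angle = 2 * math.pi * rng.random()
            x = px + radius * math.cos(angle)
            y = py + radius * math.sin(angle)
            if 0.0 <= x <= 1.0 and 0.0 <= y <= 1.0:
                points.append((x, y))
    return parents, points


rng = random.Random(42)
parents, points = matern_cluster(kappa=40, mu=5, r=0.05, rng=rng)
```

The expected number of points is roughly κ · µ, slightly less because of the dropped offspring near the border.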


Estimating Summary Statistics

  • Estimate the summary statistics.

    • Auto-correlated feature: intensity of the parent and offspring processes (κ and µ values).

    • Randomly distributed feature: Poisson intensity, either homogeneous (a constant) or non-homogeneous (a function of x and y).


Null Model Design

  • The artificial data sets maintain the following properties of the observed data:

    • same number of instances for each feature, and

    • similar spatial distribution for each individual feature.


p-value computation

  • Estimate the p-value of PIobs under the null model.

  • Use randomization tests, where a large number of datasets conforming to the null hypothesis is generated.

  • How many simulations do we need?

    • Diggle suggested 500 simulations for α = 0.01 [10].
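A randomization test of this kind can be sketched for a single pair of features. Everything specific here is an assumption made for illustration: the null model places both features uniformly at random in the unit square (it keeps the instance counts but ignores auto-correlation, unlike the null model in these slides), the estimator is the widely used Monte Carlo form (count + 1) / (R + 1), and the names `pair_pi` and `randomization_p_value` are hypothetical.

```python
import random
from math import dist


def pair_pi(a_pts, b_pts, radius):
    """Participation index of a 2-size co-location {A, B}."""
    a_hit = sum(any(dist(a, b) <= radius for b in b_pts) for a in a_pts)
    b_hit = sum(any(dist(a, b) <= radius for a in a_pts) for b in b_pts)
    return min(a_hit / len(a_pts), b_hit / len(b_pts))


def uniform_points(n, rng):
    """n points placed independently and uniformly in the unit square."""
    return [(rng.random(), rng.random()) for _ in range(n)]


def randomization_p_value(a_pts, b_pts, radius, n_sim=500, seed=0):
    """Estimate P(PI >= PIobs) under the simplified independence null model."""
    rng = random.Random(seed)
    pi_obs = pair_pi(a_pts, b_pts, radius)
    at_least = 0
    for _ in range(n_sim):
        # Same number of instances per feature as in the observed data.
        sim_a = uniform_points(len(a_pts), rng)
        sim_b = uniform_points(len(b_pts), rng)
        if pair_pi(sim_a, sim_b, radius) >= pi_obs:
            at_least += 1
    return pi_obs, (at_least + 1) / (n_sim + 1)


# Hypothetical observed data: every B sits right next to an A.
a_obs = [(0.05 + 0.08 * i, 0.5) for i in range(12)]
b_obs = [(x + 0.01, y) for x, y in a_obs]
pi_obs, p_value = randomization_p_value(a_obs, b_obs, radius=0.05)
significant = p_value <= 0.05
```

In this constructed case every instance participates, so PIobs is 1, and independent uniform placement almost never reaches that value, which makes the pattern significant at α = 0.05.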


Improving Runtime: Data Generation

  • In a simulation, we only generate the feature instances of those clusters that are close enough to instances of other features (either auto-correlated or non-auto-correlated).

This reduces the time spent in the artificial data generation step of a simulation.
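One way to read "close enough" is geometric, and the sketch below is an assumption rather than the rule used by the authors. If offspring lie within a cluster radius r of their parent and the co-location neighborhood distance is d, then an offspring can only be a neighbor of a point q when the parent is within r + d of q. Clusters whose parent fails this test for every instance of the other features cannot contribute to any co-location instance, so their offspring need not be generated. The function name and the sample coordinates are hypothetical.

```python
from math import dist


def clusters_worth_generating(parents, cluster_radius, other_points, neighbor_dist):
    """Keep only the parents whose offspring could neighbor another feature."""
    reach = cluster_radius + neighbor_dist
    return [
        parent
        for parent in parents
        if any(dist(parent, q) <= reach for q in other_points)
    ]


parents = [(0.1, 0.1), (0.9, 0.9)]
other_points = [(0.2, 0.1)]
kept = clusters_worth_generating(
    parents, cluster_radius=0.05, other_points=other_points, neighbor_dist=0.1
)
```

Here only the first cluster is kept, because the second parent is much farther than r + d from the only instance of the other feature. When the other feature is itself auto-correlated, the same test could be applied between parents with the reach extended by the other feature's cluster radius.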


Improving Runtime: PI-value Computation

  • In a simulation Ri, for a co-location C, there is no need to compute the PI-value of C in Ri if it can be shown to be less than PIobs(C).

  • Procedure:

    • In each simulation, compute the PI-values of all possible 2-size subsets.

    • For a co-location C of size k (> 2), look up the PI-values of its 2-size subsets. If a subset C' is found whose PI-value in Ri is less than PIobs(C), the PI-value of C in Ri does not need to be computed.

    • Otherwise, the PI-value of C is computed for simulation Ri.
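The lookup step can be sketched as follows. Because PI is anti-monotonic, the PI-value of C in a simulation cannot exceed the PI-value of any of its 2-size subsets, so one subset below PIobs(C) is enough to decide that the simulation does not count toward the p-value. The function name `can_skip` and the per-simulation pair values are hypothetical.

```python
from itertools import combinations


def can_skip(colocation, pi_obs, pair_pi_in_sim):
    """True if some 2-size subset already has a PI-value below PIobs(C).

    pair_pi_in_sim maps frozenset({f1, f2}) to that pair's PI-value in the
    current simulation.
    """
    return any(
        pair_pi_in_sim[frozenset(pair)] < pi_obs
        for pair in combinations(colocation, 2)
    )


# Hypothetical PI-values of the 2-size subsets in one simulation Ri.
pair_pi_in_sim = {
    frozenset("AB"): 0.20,
    frozenset("AC"): 0.60,
    frozenset("BC"): 0.70,
}

skip_high = can_skip("ABC", pi_obs=0.33, pair_pi_in_sim=pair_pi_in_sim)
skip_low = can_skip("ABC", pi_obs=0.15, pair_pi_in_sim=pair_pi_in_sim)
```

With PIobs = 0.33 the pair {A, B} (0.20) rules the simulation out, so the full computation is skipped. With PIobs = 0.15 no pair is below the observed value, and the PI-value of {A, B, C} still has to be computed for this simulation.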


An Example

Four features A, B, C, D

  • {A,B,C}: if the PI-value of {A,B} in Ri is less than PIobs{A,B,C}, then the PI-value of {A,B,C} in Ri is also less than PIobs{A,B,C}. No need to compute the PI-value of {A,B,C} in Ri.

  • The PI-value of {A,B,C} in Ri being less than PIobs{A,B,C} does not imply that the PI-value of {A,B,C,D} in Ri is less than PIobs{A,B,C,D}.

  • {A,B,C,D}: decided by checking its 2-size subsets.

The worst-case complexity is O(2^n), but:

  • The size of the largest co-location is much smaller.

  • The largest co-location size is predictable.

  • If PIobs(C) = 0, we do not compute the PI-value of C in the simulations.

  • Our pruning strategies apply.

    All of these keep the actual cost in practice below the worst-case cost.


Experimental Results (1)

Negative association:

  • Features ○ and ∆ with 40 instances of each.

  • This synthetic data set is generated using a multi-type Strauss process to impose a negative association (inhibition) between these two features.

    Result

    PIobs = 0.55 and p-value = 0.931 > 0.05 (α), hence {○, ∆} is not reported.


Experimental Results (2)

Autocorrelation:

  • #○ = 100, and #∆ = 120.

  • ∆: independently and uniformly distributed over the space

    ○: spatially auto-correlated

    In our generated data, ∆ is found in most clusters of ○.

  • The summary statistics of ○ are estimated by fitting the model of a Matérn cluster process [9] (κ = 40, µ = 5, r = 0.05).

    Results:

  • PIobs {○, ∆} = 0.49; an existing algorithm will report the pattern if a threshold <= 0.49 is chosen.

  • p-value = 0.383 > 0.05 (α); {○, ∆} is not reported.


Experimental Results (3)

Multiple features:

#○ = 40, #∆ = 40, #+ = 118, #x = 40, and = #30.

  • Study area = Unit square, co-location neighborhood radius = 0.1

  • Features ○ and ∆ are negatively associated.

  • Feature + is spatially auto-correlated.

    Features +, ○, and x are positively associated.

  • Feature is randomly distributed.

Significant co-location patterns = {○, +}, {○, x}, {+, x}, {○, +, x}, {○, +, }, {○, x, },

{+, x, }, and {○, +, x, }.


Runtime Comparison (1)

  • Features ○, ∆, and + are auto-correlated and strongly associated. Each has 400 instances.

  • Feature x is randomly distributed and has 20 instances.

  • Our algorithm finds all co-locations of features ○, ∆, and x.

  • The number of instances of each auto-correlated feature is increased:

    • the number of clusters is kept the same

    • the number of instances per cluster is increased by a factor k.

Figures: runtime comparison and speedup.


Runtime Comparison (2)

  • The number of clusters for features ○, ∆, and + is increased by a factor k, but the number of instances per cluster is kept the same.

  • The total number of instances of x is increased by the same factor k.

Figures: runtime comparison and speedup.


Ants Data

  • ○ = Cataglyphis ants (29) and ∆ = Messor ants (68).

  • PIobs {Cataglyphis, Messor} = min {24/29, 30/68} = 0.44.

  • p-value = 0.142 > 0.05 (α); Co-location {○, ∆} is not significant.

  • R. D. Harkness also did not find any clear association between these two species.

  • Existing algorithms will report {○, ∆} if the PI-threshold <= 0.44.


Toronto Address Repository Data


Found Co-locations


Conclusions

  • A new definition for co-location pattern.

  • Does not depend on a global threshold.

  • Statistically meaningful.

  • Runtime cost of randomization tests is reduced.

  • Investigate other prevalence measures to check if they allow additional pruning techniques.

  • Removing redundant patterns.


References

  • 1. Agrawal, R., Srikant, R.: Fast Algorithms for Mining Association Rules in Large Databases. In: Proc. VLDB, pp. 487-499 (1994)

  • 2. Shekhar, S. et al.: Discovering Spatial Co-location Patterns: A Summary of Results, In Proc. SSTD, pp. 236-256 (2001)

  • 3. Huang, Y. et al.: Discovering Colocation Patterns from Spatial Data Sets: A General Approach. IEEE TKDE 16(12), 1472-1485 (2004)

  • 4. Koperski, K. et al.: Discovery of Spatial Association Rules in Geographic Information Databases. In SSD, pp. 47-66 (1995)

  • 5. Morimoto, Y.: Mining Frequent Neighboring Class Sets in Spatial Databases. In SIGKDD, pp. 353-358 (2001)

  • 6. Yoo, J. S. et al.: A Partial Join Approach for Mining Co-location Patterns. In Proc. GIS, pp. 241-249 (2004)

  • 7. Yoo, J. S. et al.: A Joinless Approach for Mining Spatial Co-location Patterns. IEEE TKDE 18(10), 1323-1337 (2006)

  • 8. Xiao, X. et al.: Density Based Co-location Pattern Discovery. In Proc. GIS, pp. 250-259 (2008).

  • 9. Illian et al.: Statistical Analysis and Modeling of Spatial Point Patterns.

  • 10. Diggle, P. J.: Statistical Analysis of Spatial Point Patterns (2003)


Questions?
