SSCP: Mining Statistically Significant Co-location Patterns


Sajib Barua and Jörg Sander

Dept. of Computing Science

University of Alberta, Canada



Outline

  • Introduction

    • Related work

    • Motivation

  • Proposed Method

  • Experimental evaluation

    • Synthetic data

    • Real data

  • Conclusions


Definition

  • Co-location patterns are subsets of Boolean spatial features whose instances are often located in close spatial proximity.

  • Examples:

{Shopping mall, parking}

{Nile crocodile, Egyptian plover}


Event Centric Model

  • Co-location is defined based on a spatial relationship R.

  • A co-location type C is a set of n different spatial features f1, f2, …, fn.

[Figure: spatial instances A2, B1, B2, C1, C2, C3, and D1]

{A2, B1, C1, D1} form a clique under a relation R.

{A2, B1, C1} is an instance of co-location {A, B, C}

{A2, B1, D1} is an instance of co-location {A, B, D}

{A2, C1, D1} is an instance of co-location {A, C, D}

{B1, C1, D1} is an instance of co-location {B, C, D}

{A2, B1, C1, D1} is an instance of co-location {A, B, C, D}
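
The clique-based definition of an instance can be sketched in Python. This is only an illustration: the coordinates, the distance d, and the function names are assumptions (the slides give no coordinates), and R is assumed to mean "Euclidean distance at most d".

```python
from itertools import combinations, product
from math import dist


def neighbors(p, q, d):
    """Assumed relation R: two instances are related if they lie within distance d."""
    return dist(p, q) <= d


def colocation_instances(points, features, d):
    """List instances of a co-location type: one instance per feature,
    with every pair of chosen instances related under R (a clique)."""
    candidates = [sorted(points[f].items()) for f in features]
    found = []
    for combo in product(*candidates):
        coords = [xy for _, xy in combo]
        if all(neighbors(p, q, d) for p, q in combinations(coords, 2)):
            found.append(tuple(name for name, _ in combo))
    return found


# Hypothetical coordinates, chosen so that A2, B1, C1, D1 are mutually close.
points = {
    "A": {"A2": (0.50, 0.50)},
    "B": {"B1": (0.55, 0.50), "B2": (0.10, 0.90)},
    "C": {"C1": (0.52, 0.55), "C2": (0.90, 0.10), "C3": (0.10, 0.10)},
    "D": {"D1": (0.53, 0.46)},
}

print(colocation_instances(points, ["A", "B", "C"], d=0.1))
print(colocation_instances(points, ["A", "B", "C", "D"], d=0.1))
```

Enumerating the full Cartesian product is fine for a toy example; the join-based algorithms cited in the related work exist precisely because this does not scale.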


Prevalence Measure

  • Participation ratio (PR) of a feature in a co-location type C is the fraction of its instances participating in any instance of C.

  • Participation index (PI) is the minimum participation ratio in C.

[Figure: spatial instances A1, A2, B1, B2, C1, C2, and C3]

PI({A, B}) = min{1/2, 1/2} = 0.5

PI({B, C}) = min{1, 2/3} = 0.67

PI({A, C}) = min{1/2, 1/3} = 0.33

PI({A, B, C}) = min{1/2, 1/2, 1/3} = 0.33

PI({A, B, C}) ≤ min{PI({A, B}), PI({B, C}), PI({A, C})}

PR and PI are anti-monotonic.
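
A minimal Python sketch of the two measures follows. The instance counts (2 A, 2 B, 3 C) match the slide, but the concrete row instances are hypothetical: they are chosen only so that the participation ratios reproduce the fractions shown, since the figure itself is not recoverable. The function names are likewise illustrative.

```python
def participation_ratio(position, instances, total):
    """Fraction of a feature's instances that take part in any instance of C.
    `position` is the feature's index inside each row instance."""
    participating = {row[position] for row in instances}
    return len(participating) / total


def participation_index(features, instances, counts):
    """Minimum participation ratio over the features of co-location C."""
    if not instances:
        return 0.0
    return min(
        participation_ratio(i, instances, counts[f])
        for i, f in enumerate(features)
    )


counts = {"A": 2, "B": 2, "C": 3}

# Hypothetical row instances consistent with the fractions on the slide.
rows = {
    ("A", "B"): [("A2", "B1")],
    ("B", "C"): [("B1", "C3"), ("B2", "C1")],
    ("A", "C"): [("A2", "C3")],
    ("A", "B", "C"): [("A2", "B1", "C3")],
}

for colocation, instances in rows.items():
    print(colocation, round(participation_index(colocation, instances, counts), 2))
```

In this toy data the PI of {A, B, C} does not exceed the PI of any of its 2-size subsets, which is the anti-monotonic property stated above.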


Related Work

  • Spatial statistics

    • Ripley’s K function, distance-based measures, co-variogram function.

  • Spatial data mining

    • Koperski et al. [4] mine spatial association rules.

    • Morimoto [5] also looks for frequently occurring patterns.

    • Shekhar et al. [2] introduce three models to materialize transactions.

    • Huang et al. [3], Yoo et al. [6,7], and Xiao et al. [8].


Limitations of the Existing Methods

  • Spatial statistics

    • Defined only for pairs.

  • Co-location mining

    • Only one global threshold for PI is used.

    • No guideline for setting the PI-threshold.

    • Do not address the spatial auto-correlation and feature abundance effects.

A simple threshold can report meaningless patterns or can miss meaningful patterns.


Motivation

A has fewer instances.

B is abundant.

A & B have true spatial dependency.

Assume PI-threshold = 0.4

Existing co-location mining algorithms will not report {A, B}.


Motivation

A & B are abundant.

Both are randomly distributed.

They do not have any true spatial dependency.

Assume PI-threshold = 0.4

Existing co-location mining algorithms will report {A, B}.


Motivation

A & B are auto-correlated.

They do not have any true spatial dependency.

Assume PI-threshold = 0.4

Existing co-location mining algorithms will report {A, B}.


Our Idea

  • Our approach uses a statistical test.

  • Spatial dependency is measured using PI.

#○ = 12

#∆ = 12

If features ○ and ∆ were spatially independent of each other, what is the chance of seeing a PI-value of {○, ∆} equal to or higher than the observed PI-value (0.41)?


Generate Artificial Data Sets

Observed data

Artificial data sets generated under null model


p-value computation

If p ≤ α, PIobs is statistically significant at level α.

p-value = 0.163

α = 0.05

PIobs = 0.41


Auto-correlated Feature

A & B are auto-correlated.

They do not have any true spatial dependency.


Modeling Auto-correlation

  • Auto-correlation is modeled as a cluster process.

Poisson Cluster Process [9]

  • Auto-correlation is measured in terms of the intensity and type of distribution of a parent process and of the offspring process around each parent.
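
One common Poisson cluster process is the Matérn cluster process, which the experiments later in the deck fit to an auto-correlated feature. The sketch below simulates it on the unit square using only the Python standard library. The function names, the Knuth-style Poisson sampler, and the choice to simulate parents on a window enlarged by r are assumptions made for illustration; the parameter values are the ones quoted on the Experimental Results (2) slide.

```python
import math
import random


def poisson(lam, rng):
    """Sample a Poisson count with Knuth's multiplication method
    (adequate for the small means used here)."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def matern_cluster(kappa, mu, r, rng):
    """Simulate a Matérn cluster process on the unit square.

    Parents: homogeneous Poisson process with intensity kappa, simulated on the
    square enlarged by r so that clusters centred just outside still contribute.
    Offspring: Poisson(mu) points per parent, uniform in a disc of radius r.
    Only offspring falling inside the unit square are returned."""
    side = 1 + 2 * r
    points = []
    for _ in range(poisson(kappa * side * side, rng)):
        px = rng.uniform(-r, 1 + r)
        py = rng.uniform(-r, 1 + r)
        for _ in range(poisson(mu, rng)):
            rho = r * math.sqrt(rng.random())  # sqrt makes the density uniform over the disc
            theta = rng.uniform(0, 2 * math.pi)
            x = px + rho * math.cos(theta)
            y = py + rho * math.sin(theta)
            if 0 <= x <= 1 and 0 <= y <= 1:
                points.append((x, y))
    return points


rng = random.Random(0)
pts = matern_cluster(kappa=40, mu=5, r=0.05, rng=rng)
print(len(pts))  # about kappa * mu = 200 points on average
```

Note that the null model described in the deck keeps the number of instances of each feature fixed, so a simulation used for the test would additionally need to condition on that count; this sketch does not.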


Estimating Summary Statistics

  • Estimate the summary statistics.

    • Auto-correlated feature: intensity of the parent and offspring processes (κ and µ values).

    • Randomly distributed feature: Poisson intensity, either homogeneous (a constant) or non-homogeneous (a function of x and y).


Null Model Design

  • The artificial data sets maintain the following properties of the observed data:

    • same number of instances for each feature, and

    • similar spatial distribution for each individual feature.


p-value computation

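  • Estimate the p-value: the probability, under the null model, of seeing a PI-value equal to or higher than PIobs.

  • Use randomization tests, where a large number of data sets conforming to the null hypothesis are generated.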

  • How many simulations do we need?

    • Diggle suggested 500 simulations for α = 0.01 [10].
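
The randomization test can be sketched for a single pair {A, B} as follows. Several things here are assumptions rather than the authors' implementation: both features are treated as randomly (uniformly) distributed, so the null model just redraws the same number of points for each feature; the estimator (count + 1) / (simulations + 1) is a common Monte Carlo convention; and the observed data are synthetic, built so that every B sits next to an A.

```python
import random
from math import dist


def pair_pi(a_pts, b_pts, d):
    """Participation index of {A, B}: the smaller of the two participation ratios."""
    a_in = sum(any(dist(a, b) <= d for b in b_pts) for a in a_pts)
    b_in = sum(any(dist(a, b) <= d for a in a_pts) for b in b_pts)
    return min(a_in / len(a_pts), b_in / len(b_pts))


def uniform_points(n, rng):
    """n points uniformly distributed over the unit square."""
    return [(rng.random(), rng.random()) for _ in range(n)]


def randomization_p_value(a_pts, b_pts, d, simulations, rng):
    """Estimate Pr(PI under the null >= PIobs) from simulated data sets that
    keep the observed number of instances of each feature."""
    pi_obs = pair_pi(a_pts, b_pts, d)
    at_least = 0
    for _ in range(simulations):
        sim_a = uniform_points(len(a_pts), rng)
        sim_b = uniform_points(len(b_pts), rng)
        if pair_pi(sim_a, sim_b, d) >= pi_obs:
            at_least += 1
    return pi_obs, (at_least + 1) / (simulations + 1)


rng = random.Random(1)
# Hypothetical observed data: each B is placed 0.01 to the right of an A.
a_obs = uniform_points(12, rng)
b_obs = [(min(1.0, x + 0.01), y) for x, y in a_obs]

pi_obs, p = randomization_p_value(a_obs, b_obs, d=0.05, simulations=500, rng=rng)
print(pi_obs, p)
print("significant at alpha = 0.05" if p <= 0.05 else "not significant")
```

For an auto-correlated feature, the uniform generator would be replaced by a simulation of the fitted cluster process, as described on the preceding slides.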


Improving Runtime: Data Generation

  • In a simulation, we only generate feature instances of those clusters which are close enough to other, different features (either auto-correlated or not auto-correlated).

This saves time in the artificial data generation step of a simulation.


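Improving Runtime: PI-value Computation

  • In a simulation Ri, for a co-location C, there is no need to compute PI^Ri(C), the PI-value of C in Ri, when it is already known to be below PIobs(C).

  • Procedure:

  • In each simulation, compute the PI^Ri-values of all possible 2-size subsets.

  • For a co-location C of size k (> 2), look up the PI^Ri-values of the 2-size subsets of C. If a subset C' is found for which PI^Ri(C') < PIobs(C), then PI^Ri(C) is not required to be computed: PI is anti-monotonic, so PI^Ri(C) ≤ PI^Ri(C') < PIobs(C).

  • Otherwise PI^Ri(C) is computed for simulation Ri.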
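
The lookup step can be sketched as a small predicate. The function name and the PI-values in the dictionary are hypothetical; the only property relied on is the anti-monotonicity of PI, which guarantees that the PI-value of C in a simulation cannot exceed the PI-value of any of its 2-size subsets in that simulation.

```python
from itertools import combinations


def needs_full_computation(colocation, pi_obs, pair_pi_sim):
    """Return False when some 2-size subset of the co-location already has a
    simulated PI-value below PIobs(C); the simulated PI-value of C is then
    below PIobs(C) as well and need not be computed."""
    for pair in combinations(sorted(colocation), 2):
        if pair_pi_sim[pair] < pi_obs:
            return False
    return True


# Hypothetical PI-values of the 2-size subsets in one simulation Ri.
pair_pi_sim = {
    ("A", "B"): 0.20,
    ("A", "C"): 0.60,
    ("B", "C"): 0.70,
}

# PIobs({A, B, C}) = 0.33: pruned, because the simulated PI of {A, B} is 0.20 < 0.33.
print(needs_full_computation(("A", "B", "C"), 0.33, pair_pi_sim))

# PIobs({A, B, C}) = 0.15: no pair rules it out, so the PI-value must be computed.
print(needs_full_computation(("A", "B", "C"), 0.15, pair_pi_sim))
```

A pruned simulation simply does not count toward the number of simulations with a PI-value at least as high as the observed one.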


An Example

Four features A, B, C, D

  • {A,B,C}: If PI^Ri{A,B} < PIobs{A,B,C}, then PI^Ri{A,B,C} < PIobs{A,B,C}. No need to compute PI^Ri{A,B,C}.

  • PI^Ri{A,B,C} < PIobs{A,B,C} does not imply PI^Ri{A,B,C,D} < PIobs{A,B,C,D}.

  • {A,B,C,D}: by checking 2-size subsets

The worst case complexity is O(2^n) for n features.

  • The size of the largest co-location is much smaller.

  • Largest co-location size is predictable.

  • If PIobs(C) = 0, we do not compute the PI^Ri-value of C.

  • Our pruning strategies.

    All these keep the actual cost in practice less than the worst case cost.


Experimental Results (1)

Negative association:

  • Features ○ and ∆ with 40 instances of each.

  • This synthetic data set is generated using a multi-type Strauss process to impose a negative association (inhibition) between these two features.

    Result

    PIobs = 0.55 and p-value = 0.931 > 0.05 (α), hence {○, ∆} will not be reported.


Experimental Results (2)

Autocorrelation:

  • #○ = 100, and #∆ = 120.

  • ∆: independently and uniformly distributed over the space

    ○: spatially auto-correlated

    In our generated data, ∆ is found in most clusters of ○.

  • The summary statistics of ○ are estimated by fitting the model of a Matérn cluster process [9] (κ = 40, µ = 5, r = 0.05).

    Results:

  • PIobs {○, ∆} = 0.49; an existing algorithm will report the pattern if a threshold <= 0.49 is chosen.

  • p-value = 0.383 > 0.05 (α); {○, ∆} is not reported.


Experimental Results (3)

Multiple features:

#○ = 40, #∆ = 40, #+ = 118, #x = 40, and # = 30.

  • Study area = Unit square, co-location neighborhood radius = 0.1

  • Features ○ and ∆ are negatively associated.

  • Feature + is spatially auto-correlated.

    Features +, ○, and x are positively associated.

  • Feature is randomly distributed.

Significant co-location patterns = {○, +}, {○, x}, {+, x}, {○, +, x}, {○, +, }, {○, x, }, {+, x, }, and {○, +, x, }.


Runtime Comparison (1)

  • Features ○, ∆, and + are auto-correlated and strongly associated. Each has 400 instances.

  • Feature x is randomly distributed and has 20 instances.

  • Our algorithm finds all co-locations of features ○, ∆, and x.

  • The number of instances of each auto-correlated feature is increased:

    • the number of clusters is kept the same

    • the number of instances per cluster is increased by a factor k.

[Figure panels: Runtime comparison; Speedup]


Runtime Comparison (2)

  • The number of clusters for features ○, ∆, and + is increased by a factor k, but the number of instances per cluster is kept the same.

  • The total number of instances of x is increased by the same factor k.

[Figure panels: Runtime comparison; Speedup]


Ants Data

  • ○ = Cataglyphis ants (29) and ∆ = Messor ants (68).

  • PIobs {Cataglyphis, Messor} = min{24/29, 30/68} = 0.44.

  • p-value = 0.142 > 0.05 (α); co-location {○, ∆} is not significant.

  • R. D. Harkness also did not find any clear association between these two species.

  • An existing algorithm will report {○, ∆} if PI-threshold <= 0.44.


Toronto Address Repository Data


Found Co-locations


Conclusions

  • A new definition for co-location pattern.

  • Does not depend on a global threshold.

  • Statistically meaningful.

  • Runtime cost of randomization tests is reduced.

  • Investigate other prevalence measures to check if they allow additional pruning techniques.

  • Removing redundant patterns.


References

  • 1. Agrawal, R., Srikant, R.: Fast Algorithms for Mining Association Rules in Large Databases. In: Proc. VLDB, pp. 487-499 (1994)

  • 2. Shekhar, S. et al.: Discovering Spatial Co-location Patterns: A Summary of Results, In Proc. SSTD, pp. 236-256 (2001)

  • 3. Huang, Y. et al.: Discovering Colocation Patterns from Spatial Data Sets: A General Approach. IEEE TKDE 16(12), 1472-1485 (2004)

  • 4. Koperski, K. et al.: Discovery of Spatial Association Rules in Geographic Information Databases. In SSD, pp. 47-66 (1995)

  • 5. Morimoto, Y.: Mining Frequent Neighboring Class Sets in Spatial Databases. In SIGKDD, pp. 353-358 (2001)

  • 6. Yoo, J. S. et al.: A Partial Join Approach for Mining Co-location Patterns. In Proc. GIS, pp. 241-249 (2004)

  • 7. Yoo, J. S. et al.: A Joinless Approach for Mining Spatial Co-location Patterns. IEEE TKDE 18(10), 1323-1337 (2006)

  • 8. Xiao, X. et al.: Density Based Co-location Pattern Discovery. In Proc. GIS, pp. 250-259 (2008)

  • 9. Illian, J. et al.: Statistical Analysis and Modelling of Spatial Point Patterns.

  • 10. Diggle, P. J.: Statistical Analysis of Spatial Point Patterns (2003)


Questions?
