Clustering and Partitioning for Spatial and Temporal Data Mining

Vasilis Megalooikonomou

Data Engineering Laboratory (DEnLab)

Dept. of Computer and Information Sciences

Temple University

Philadelphia, PA

www.cis.temple.edu/~vasilis

V. Megalooikonomou, Temple University

Outline
  • Introduction
    • Motivation – Problems:
      • Spatial domain
      • Time domain
    • Challenges
  • Spatial data
    • Partitioning and Clustering
    • Detection of discriminative patterns
    • Results
  • Temporal data
    • Partitioning
    • Vector Quantization
    • Results
  • Conclusions - Discussion

Introduction
  • Large spatial and temporal databases
  • Meta-analysis of data pooled from multiple studies
  • Goal: To understand patterns and discover associations, regularities and anomalies in spatial and temporal data

Problem

Spatial Data Mining:

Given a large collection of spatial data, e.g., 2D or 3D images, and other data, find interesting things, i.e.:

  • associations among image data or among image and non-image data
  • discriminative areas among groups of images
  • rules/patterns
  • similar images to a query image (queries by content)

Challenges
  • How to apply data mining techniques to images?
  • Learning from images directly
  • Heterogeneity and variability of image data
  • Preprocessing (segmentation, spatial normalization, etc.)
  • Exploration of high correlation between neighboring objects
  • Large dimensionality
  • Complexity of associations
  • Efficient management of topological/distance information
  • Spatial knowledge representation / Spatial Access Methods (SAMs)

Example: Association Mining – Spatial Data
  • Discover associations among spatial and non-spatial data:
    • Images {i1, i2,…, iL}
    • Spatial regions {s1, s2,…, sK}
    • Non-spatial variables {c1, c2,…, cM}

Example: fMRI contrast maps

[Figure: fMRI contrast maps for a patient and a control]

Applications

Medical Imaging, Bioinformatics, Geography, Meteorology, etc.

Voxel-based Analysis
  • No model on the image data
  • Each voxel’s changes are analyzed independently; a map of statistical significance is built
  • Discriminatory significance measured by statistical tests (t-test, rank-sum test, F-test, etc.)
  • Statistical Parametric Mapping (SPM)
  • Significance of associations measured by chi-squared test or Fisher’s exact test (a contingency table for each pair of variables)
  • Cluster voxels by findings

[V. Megalooikonomou, C. Davatzikos, E. Herskovits, SIGKDD 1999]
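
As a rough illustration of the voxel-wise idea, the sketch below computes one two-sample t statistic per voxel across two groups of flattened images. It is a minimal sketch, not the method from the cited paper: the function names and toy data are hypothetical, Welch's form of the t statistic is an assumption, and a real analysis would convert the statistics to p-values and correct for multiple comparisons.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's two-sample t statistic for two lists of values."""
    diff = mean(a) - mean(b)
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0:
        # Degenerate case: no spread in either group.
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / se

def voxelwise_t_map(group_a, group_b):
    """Return one t statistic per voxel; each image is a flat list of voxel values."""
    n_voxels = len(group_a[0])
    return [
        welch_t([img[v] for img in group_a], [img[v] for img in group_b])
        for v in range(n_voxels)
    ]

# Toy data: three images per group, three voxels each; only voxel 1 differs.
patients = [[1.0, 5.0, 2.0], [1.2, 5.5, 2.1], [0.9, 6.0, 1.9]]
controls = [[1.1, 2.0, 2.0], [1.0, 2.5, 2.2], [0.8, 1.5, 1.8]]
t_map = voxelwise_t_map(patients, controls)
```

Each voxel is tested without reference to its neighbors, which is why the number of tests equals the number of voxels.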

Analysis by grouping of voxels
  • Grouping of voxels (atlas-based)
    • Prior knowledge increases sensitivity
    • Data reduction: 10^7 voxels → R regions (structures)
    • Map a ROI onto at least one region
    • As good as the atlas being used
  • M non-spatial variables, R regions
  • Analysis
    • Categorical structural variables
      • M x R contingency tables, chi-squared / Fisher's exact test
      • Multiple comparison problem
      • Log-linear analysis, multivariate Bayesian
    • Continuous structural variables
      • Logistic regression, Mann-Whitney
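
For the categorical case, each (non-spatial variable, region) pair yields a contingency table. The following minimal sketch computes Pearson's chi-squared statistic for one such table. The function name and the example counts are hypothetical, and turning the statistic into a p-value (which needs the chi-squared distribution) is left out.

```python
def chi_squared_statistic(table):
    """Pearson's chi-squared statistic for a contingency table (list of rows of counts)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of the two variables.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Rows: region abnormal / normal; columns: variable present / absent (made-up counts).
independent = chi_squared_statistic([[10, 10], [10, 10]])
associated = chi_squared_statistic([[20, 0], [0, 20]])
```

With M variables and R regions this test is repeated M x R times, which is where the multiple comparison problem comes from.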

Dynamic Recursive Partitioning
  • Adaptive partitioning of a 3D volume
  • Partitioning criterion:
    • discriminative power of feature(s) of hyper-rectangle and
    • size of hyper-rectangle
  • Extract features from discriminative regions
  • Reduce multiple comparison problem
    • (# tests = # partitions < # voxels)
  • Tests are downward closed

[V. Megalooikonomou, D. Pokrajac, A. Lazarevic, and Z. Obradovic, SPIE Conference on Visualization and Data Analysis, 2002]
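
The partitioning idea can be sketched in two dimensions as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: it uses 2D images and quadrant splits instead of a 3D volume, the region feature is the mean intensity, discriminative power is measured by a Welch t statistic compared against a hypothetical threshold `t_thresh` (rather than a p-value), and splitting stops at `max_depth` or at single cells.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's two-sample t statistic."""
    diff = mean(a) - mean(b)
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0:
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / se

def region_mean(img, box):
    """Mean intensity inside box = (r0, r1, c0, c1), with half-open ranges."""
    r0, r1, c0, c1 = box
    vals = [img[r][c] for r in range(r0, r1) for c in range(c0, c1)]
    return sum(vals) / len(vals)

def drp(group_a, group_b, box, depth, max_depth, t_thresh, stats):
    """Return discriminative boxes; split a box only if it is not discriminative."""
    r0, r1, c0, c1 = box
    stats["tests"] += 1
    fa = [region_mean(img, box) for img in group_a]
    fb = [region_mean(img, box) for img in group_b]
    if abs(welch_t(fa, fb)) >= t_thresh:
        return [box]                       # discriminative: keep, do not split
    if depth >= max_depth or (r1 - r0 < 2 and c1 - c0 < 2):
        return []                          # too deep or too small to split
    rm, cm = (r0 + r1) // 2, (c0 + c1) // 2
    found = []
    for ra, rb in ((r0, rm), (rm, r1)):
        for ca, cb in ((c0, cm), (cm, c1)):
            if ra < rb and ca < cb:
                found += drp(group_a, group_b, (ra, rb, ca, cb),
                             depth + 1, max_depth, t_thresh, stats)
    return found

def make_image(offset, bump):
    """8x8 toy image: global offset plus a bump in the top-left 4x4 quadrant."""
    return [[offset + (bump if r < 4 and c < 4 else 0.0) for c in range(8)]
            for r in range(8)]

group_a = [make_image(o, 10.0) for o in (0.0, 3.0, 6.0)]
group_b = [make_image(o, 0.0) for o in (0.0, 3.0, 6.0)]
stats = {"tests": 0}
regions = drp(group_a, group_b, (0, 8, 0, 8), 0, 2, 4.0, stats)
```

On this toy data the whole image is not discriminative, the top-left quadrant is, and the procedure performs 17 tests where a voxel-wise analysis would perform 64.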


Other Methods for Spatial Data Classification

Distinguishing among distributions:

  • Distributional distances:
    • Mahalanobis distance
    • Kullback-Leibler divergence (parametric, non-parametric)
  • Maximum likelihood:
    • Estimate probability densities and compute likelihood
      • EM (Expectation-Maximization) method to model spatial regions using some base function (Gaussian)
  • Static partitioning:
    • Reduction of the # of attributes as compared to voxel-wise analysis
    • Space partitioned into 3D hyper-rectangles (variables: properties of voxels inside hyper-rectangles); incrementally increase discretization

[D. Pokrajac, V. Megalooikonomou, A. Lazarevic, D. Kontos, Z. Obradovic, Artificial Intelligence in Medicine, Vol. 33, No. 3, pp. 261-280, Mar. 2005]


Experimental Results

Areas discovered by DRP with t-test: significance threshold = 0.05, maximum tree depth = 3. Colorbar shows significance.

Comparison of number of tests performed:

Threshold   Depth   DRP tests   Voxel-wise tests
0.05        3       569         201774
0.05        4       4425        201774
0.01        4       4665        201774

[D. Kontos, V. Megalooikonomou, D. Pokrajac, A. Lazarevic, Z. Obradovic, O. B. Boyko, J. Ford, F. Makedon, A. J. Saykin, MICCAI 2004]

Experimental Results

Discriminative sub-regions detected when applying (a) DRP and (b) voxel-wise analysis with ranksum test and significance threshold 0.05 to the real fMRI volume data.

Impact:

  • Assist in interpretation of images (e.g., facilitating diagnosis)
  • Enable researchers to integrate, manipulate and analyze large volumes of image data

Time Sequence Analysis

Time Sequence: A sequence (ordered collection) of real values: X = x1, x2, …, xn

  • Time series data abound in many applications …
  • Challenges:
    • High dimensionality
    • Large number of sequences
    • Similarity metric definition
  • Similarity analysis (e.g., find stocks similar to that of IBM)
  • Goals: high accuracy, (high speed) in similarity searches among time series and in discovering interesting patterns
  • Applications: clustering, classification, similarity searches, summarization

Dimensionality Reduction Techniques
  • DFT: Discrete Fourier Transform
  • DWT: Discrete Wavelet Transform
  • SVD: Singular Value Decomposition
  • APCA: Adaptive Piecewise Constant Approximation
  • PAA: Piecewise Aggregate Approximation
  • SAX: Symbolic Aggregate approXimation

Similarity distances for time series
  • Euclidean Distance:
    • most common, sensitive to shifts
  • Dynamic Time Warping:
    • improves accuracy but is slow: O(n^2)
  • Envelope-based DTW:
    • faster: O(n)

A more intuitive idea:

two series should be considered similar if they have enough non-overlapping time-ordered pairs of subsequences that are similar (Agrawal et al. VLDB, 1995)
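
To make the sensitivity to shifts concrete, here is a minimal sketch of the textbook forms of Euclidean distance and Dynamic Time Warping. The envelope-based variant is not shown, and the two toy series are made up for illustration.

```python
import math

def euclidean(x, y):
    """Point-by-point Euclidean distance between two equal-length series."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def dtw(x, y):
    """Classic O(n*m) dynamic-programming DTW with squared point costs."""
    n, m = len(x), len(y)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (x[i - 1] - y[j - 1]) ** 2
            # Extend the cheapest of the three admissible warping steps.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return math.sqrt(cost[n][m])

# The same bump, shifted by one time step.
a = [0, 0, 1, 2, 1, 0, 0]
b = [0, 1, 2, 1, 0, 0, 0]
d_euclid = euclidean(a, b)
d_dtw = dtw(a, b)
```

Euclidean distance penalizes the shift, while DTW aligns the two bumps and reports zero distance.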

Partitioning – Piecewise Constant Approximations

  • Original time series (n points)
  • Piecewise constant approximation (PCA) or Piecewise Aggregate Approximation (PAA) [Yi and Faloutsos ’00, Keogh et al. ’00] (n' segments)
  • Adaptive Piecewise Constant Approximation (APCA) [Keogh et al. ’01] (n" segments)
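
A minimal sketch of PAA: the series is cut into segments and each segment is replaced by its mean. The function name is hypothetical, and the handling of lengths that do not divide evenly (near-equal segments via integer boundaries) is an assumption rather than something the slides specify.

```python
def paa(series, n_segments):
    """Piecewise Aggregate Approximation: one mean value per segment.

    Assumes 1 <= n_segments <= len(series).
    """
    n = len(series)
    means = []
    for k in range(n_segments):
        start = k * n // n_segments
        end = (k + 1) * n // n_segments
        segment = series[start:end]
        means.append(sum(segment) / len(segment))
    return means

reduced = paa([1, 2, 3, 4, 5, 6, 7, 8], 4)
```

APCA differs in that segment boundaries adapt to the data, so segments may have different lengths.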

Multiresolution Vector Quantized approximation (MVQ)

Partitions a sequence into equal-length segments and uses VQ to represent each sequence by the appearance frequencies of key subsequences

1) Uses a ‘vocabulary’ of subsequences (codebook); training is involved

2) Takes multiple resolutions into account, keeping both local and global information

3) Unlike wavelets, partially ignores the ordering of ‘codewords’

4) Can exploit prior knowledge about the data

5) Employs a new distance metric

[V. Megalooikonomou, Q. Wang, G. Li, C. Faloutsos, ICDE 2005]


Methodology

[Figure: MVQ pipeline with the stages Codebook Generation (s = 16), Series Transformation, and Series Encoding, with example 16-element encodings such as 1121000000001000]


Methodology
  • Creating a ‘vocabulary’: frequently appearing patterns in subsequences
    • Q: How to create?
    • A: Use Vector Quantization, in particular the Generalized Lloyd Algorithm (GLA)
  • Output: a codebook with s codewords
  • Representing time series: X = x1, x2, …, xn is encoded with a new representation f = (f1, f2, …, fs), where fi is the frequency of the i-th codeword in X
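
The codebook and frequency encoding can be sketched as below. This is a minimal, single-resolution illustration under assumptions, not the MVQ implementation: the Generalized Lloyd iteration is written as the usual nearest-codeword / centroid-update loop, the initial codewords are simply evenly spaced training segments, the iteration count is fixed, and all function names and the toy data are hypothetical.

```python
def split(series, length):
    """Cut a series into non-overlapping segments of the given length."""
    return [tuple(series[i:i + length])
            for i in range(0, len(series) - length + 1, length)]

def sqdist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def nearest(vec, codebook):
    """Index of the codeword closest to vec."""
    return min(range(len(codebook)), key=lambda i: sqdist(vec, codebook[i]))

def gla(vectors, s, iters=20):
    """Lloyd-style codebook training; assumes len(vectors) >= s."""
    step = max(1, len(vectors) // s)
    codebook = [vectors[i * step] for i in range(s)]
    for _ in range(iters):
        cells = [[] for _ in range(s)]
        for v in vectors:
            cells[nearest(v, codebook)].append(v)
        # Move each codeword to the centroid of its cell; keep it if the cell is empty.
        codebook = [
            tuple(sum(col) / len(cell) for col in zip(*cell)) if cell else codebook[i]
            for i, cell in enumerate(cells)
        ]
    return codebook

def encode(series, codebook, length):
    """Frequency vector f = (f1, ..., fs): how often each codeword is the nearest one."""
    f = [0] * len(codebook)
    for seg in split(series, length):
        f[nearest(seg, codebook)] += 1
    return f

train = [0, 0, 5, 5, 0, 0, 5, 5, 0, 1, 5, 4]
codebook = gla(split(train, 2), s=2)
f = encode([0, 0, 5, 5, 5, 5, 5, 4], codebook, 2)
```

Here the series contains one "low" segment and three "high" segments, so the frequency vector is [1, 3]. The order in which the segments occurred is not recorded.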


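Methodology

New distance metric:

The histogram model is used to calculate similarity at each resolution level, based on the codeword frequencies f_{i,t} and f_{i,q} of the two series being compared.

[Figure: histograms of the codeword frequencies f_{i,t} and f_{i,q} over codewords 1, 2, …, s]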

Methodology
  • Time series summarization:
  • High level information (frequently appearing patterns) is more useful
  • The new representation can provide this kind of information

Example: codewords (patterns) 3 and 5 each appear twice

Methodology

Problems of frequency based encoding:

  • It is hard to define an appropriate resolution (codeword length)
  • It may lose global information

Methodology

Solution: Use multiple resolutions. This avoids committing to a single codeword length and retains global information.

Methodology

Proposed distance metric:

Weighted sum of similarities at all resolution levels:

S(q, t) = sum_{i=1}^{c} w_i * S_i(q, t)

  • where c is the number of resolution levels, w_i is the weight of level i, and S_i(q, t) is the similarity at level i
  • lacking any prior knowledge, equal weights for all resolution levels work well most of the time
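
The weighted combination across resolution levels can be sketched as follows. The per-level similarity used here, 1 / (1 + L1 distance between frequency vectors), is only a hypothetical stand-in so that the example runs; it is not claimed to be the histogram model from the talk. The function names and toy histograms are likewise assumptions.

```python
def level_similarity(f_q, f_t):
    """Stand-in per-level similarity in (0, 1]; NOT the metric from the talk."""
    l1 = sum(abs(a - b) for a, b in zip(f_q, f_t))
    return 1.0 / (1.0 + l1)

def multires_similarity(hists_q, hists_t, weights=None):
    """Weighted sum of per-level similarities; equal weights when none are given."""
    c = len(hists_q)                      # number of resolution levels
    if weights is None:
        weights = [1.0 / c] * c
    return sum(w * level_similarity(fq, ft)
               for w, fq, ft in zip(weights, hists_q, hists_t))

# Two series with identical level-1 histograms but different level-2 histograms.
q = [[1, 0], [1, 1, 0, 0]]
t = [[1, 0], [0, 0, 1, 1]]
score = multires_similarity(q, t)
```

The coarse level alone cannot separate the two series, while the finer level lowers the combined score.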

MVQ: Example of Codebooks
  • Codebook for the first level
  • Codebook for the second level (more codewords since there are more details)

Experiments

Datasets

  • SYNDATA (control chart data): synthetic
  • CAMMOUSE: 3 × 5 sequences obtained using the Camera Mouse Program
  • RTT: round-trip time (RTT) measurements from UCR to CMU, taken at 50 msec sending intervals for a day

Experiments

Best Match Searching:

Matching accuracy: % of the k nearest neighbors (found by each approach) that are in the same class as the query
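
A minimal sketch of how such a matching accuracy could be computed for one query. The helper names, the squared Euclidean distance, and the toy database are assumptions for illustration; any of the compared representations and distances could be plugged in as `dist`.

```python
def sq_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def knn_labels(query, database, labels, k, dist):
    """Class labels of the k database series closest to the query."""
    order = sorted(range(len(database)), key=lambda i: dist(query, database[i]))
    return [labels[i] for i in order[:k]]

def matching_accuracy(query_label, neighbor_labels):
    """Percentage of retrieved neighbors that share the query's class."""
    same = sum(1 for lab in neighbor_labels if lab == query_label)
    return 100.0 * same / len(neighbor_labels)

database = [[0, 0, 0], [0, 1, 0], [5, 5, 5], [5, 6, 5], [0, 0, 1]]
labels = ["flat", "flat", "high", "high", "flat"]
neighbors = knn_labels([0, 0, 0.5], database, labels, k=4, dist=sq_euclidean)
accuracy = matching_accuracy("flat", neighbors)
```

With k = 4 the query retrieves the three "flat" series and one "high" series, giving 75%.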

Experiments

Best Match Searching

[Figure: best match searching results on the SYNDATA and CAMMOUSE datasets]

Experiments

Best Match Searching

[Figure: precision-recall for different methods, including MVQ, (a) on the SYNDATA dataset and (b) on the CAMMOUSE dataset]

Experiments

Clustering experiments

Given two clusterings, G = G1, G2, …, GK (the true clusters) and A = A1, A2, …, Ak (the clustering result of a given method), the clustering accuracy is evaluated with the cluster similarity measure of [Gavrilov, M., Anguelov, D., Indyk, P. and Motwani, R., KDD 2000].
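
The formula itself is not reproduced in this transcript. The sketch below assumes the form commonly attributed to the cited paper: Sim(Gi, Aj) = 2 |Gi ∩ Aj| / (|Gi| + |Aj|), averaged over the best-matching result cluster for each true cluster. Treat both the formula and the function names as assumptions.

```python
def pair_similarity(g, a):
    """Overlap between one true cluster and one result cluster (both sets)."""
    return 2 * len(g & a) / (len(g) + len(a))

def cluster_similarity(true_clusters, result_clusters):
    """Average, over true clusters, of the best overlap with any result cluster."""
    best = [max(pair_similarity(g, a) for a in result_clusters)
            for g in true_clusters]
    return sum(best) / len(true_clusters)

G = [{1, 2, 3}, {4, 5, 6}]
A = [{1, 2}, {3, 4, 5, 6}]
perfect = cluster_similarity(G, G)
imperfect = cluster_similarity(G, A)
```

Under this definition the measure is 1 for identical clusterings, drops as items are misassigned, and is not symmetric in its two arguments.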

Experiments

Clustering experiments

[Figure: clustering results on the SYNDATA and RTT datasets]

MVQ: Example: Two Time Series
  • Given two time series t1 and t2:
  • In the first level, they are encoded with the same codeword (3), so they are not distinguishable
  • In the second level, more details are recorded. The two series have different encoded forms: the first series is encoded with codewords 1 and 4, the second with codewords 9 and 12.

Analysis of images by projection to 1D
  • Hilbert Space Filling Curve
  • Binning
  • Statistical tests of significance on groups of points
  • Identification of discriminative areas by back-projection

(a) linear mapping of a 3D fMRI scan, (b) effect of binning by representing each bin with its V_mean measurement, (c) the discriminative voxels after applying the t-test with θ=0.05

[D. Kontos, V. Megalooikonomou, N. Ghubade, and C. Faloutsos. IEEE Engineering in Medicine and Biology Society (EMBS), 2003]
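
The projection and binning steps can be sketched in two dimensions. This is an illustration under assumptions: the slides work with 3D volumes, whereas the code below uses the standard 2D Hilbert index computation on an n x n grid (n a power of two), and the bin summary is the plain mean. Function names are hypothetical.

```python
def rotate(n, x, y, rx, ry):
    """Rotate/flip a quadrant so the curve keeps its orientation."""
    if ry == 0:
        if rx == 1:
            x, y = n - 1 - x, n - 1 - y
        x, y = y, x
    return x, y

def xy2d(n, x, y):
    """Hilbert-curve index of cell (x, y) in an n x n grid, n a power of two."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        x, y = rotate(n, x, y, rx, ry)
        s //= 2
    return d

def linearize(image):
    """Map a square 2D image (list of rows) to a 1D sequence in Hilbert order."""
    n = len(image)
    seq = [0.0] * (n * n)
    for y in range(n):
        for x in range(n):
            seq[xy2d(n, x, y)] = image[y][x]
    return seq

def bin_means(seq, bin_size):
    """Represent each run of bin_size consecutive values by its mean."""
    return [sum(seq[i:i + bin_size]) / len(seq[i:i + bin_size])
            for i in range(0, len(seq), bin_size)]

image = [[float(4 * r + c) for c in range(4)] for r in range(4)]
sequence = linearize(image)
bins = bin_means(sequence, 4)
```

Because consecutive positions on the curve are always neighboring cells, each bin corresponds to a compact spatial area, which is what makes back-projection of significant bins meaningful.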

Applying time series techniques

Areas discovered: (a) θ=0.05, (b) θ=0.01. The colorbar shows significance.

Results: 87%-98% classification accuracy (t-test, CATX)

  • Variation: concatenate the values of statistically significant areas → spatial sequences
  • Pattern analysis using the similarity between spatial sequences and time sequences
    • SVD, DFT, DWT, PCA (clustering accuracy: 89-100%)

[Q. Wang, D. Kontos, G. Li and V. Megalooikonomou, ICASSP 2004]

Conclusions
  • ‘Find patterns/interesting things’ efficiently and robustly in spatial and temporal data
  • Use of partitioning and clustering
  • Analysis at multiple resolutions
  • Reduction of the number of tests performed
  • Intelligent exploration of the space to find discriminative areas
  • Reduction of dimensionality
  • Symbolic representation
  • Nice summarization

Collaborators

Faculty:

  • Zoran Obradovic
  • Orest Boyko
  • James Gee
  • Andrew Saykin
  • Christos Faloutsos
  • Christos Davatzikos
  • Edward Herskovits
  • Fillia Makedon
  • Dragoljub Pokrajac

Students:

  • Despina Kontos
  • Qiang Wang
  • Guo Li

Others:

  • James Ford
  • Alexandar Lazarevic


Thank you!

Acknowledgements

This research has been funded by:

  • National Science Foundation CAREER award 0237921
  • National Science Foundation Grant 0083423
  • National Institutes of Health Grant R01 MH68066 funded by NIMH, NINDS, and NIA
