Clustering and Partitioning for Spatial and Temporal Data Mining

### Clustering and Partitioning for Spatial and Temporal Data Mining

Vasilis Megalooikonomou

Data Engineering Laboratory (DEnLab)

Dept. of Computer and Information Sciences

Temple University

Philadelphia, PA

www.cis.temple.edu/~vasilis

V. Megalooikonomou, Temple University

Outline

- Introduction
- Motivation – Problems:
  - Spatial domain
  - Time domain
- Challenges
- Spatial data
  - Partitioning and Clustering
  - Detection of discriminative patterns
  - Results
- Temporal data
  - Partitioning
  - Vector Quantization
  - Results
- Conclusions - Discussion

Introduction

- Large spatial and temporal databases
- Meta-analysis of data pooled from multiple studies
- Goal: To understand patterns and discover associations, regularities and anomalies in spatial and temporal data

Problem

Spatial Data Mining:

Given a large collection of spatial data, e.g., 2D or 3D images, and other data, find interesting things, i.e.:

- associations among image data or among image and non-image data
- discriminative areas among groups of images
- rules/patterns
- similar images to a query image (queries by content)

Challenges

- How to apply data mining techniques to images?
- Learning from images directly
- Heterogeneity and variability of image data
- Preprocessing (segmentation, spatial normalization, etc.)
- Exploration of high correlation between neighboring objects
- Large dimensionality
- Complexity of associations
- Efficient management of topological/distance information
- Spatial knowledge representation / Spatial Access Methods (SAMs)

Example: Association Mining – Spatial Data

- Discover associations among spatial and non-spatial data:
  - Images {i1, i2,…, iL}
  - Spatial regions {s1, s2,…, sK}
  - Non-spatial variables {c1, c2,…, cM}

[Figure: example images i2–i7 annotated with non-spatial variables (c1, c2, c3, c6, c7, c9)]

Applications

Medical Imaging, Bioinformatics, Geography, Meteorology, etc.

Voxel-based Analysis

- No model on the image data
- Each voxel’s changes analyzed independently - a map of statistical significance is built
- Discriminatory significance measured by statistical tests (t-test, ranksum test, F-test, etc.)
- Statistical Parametric Mapping (SPM)
- Significance of associations measured by chi-squared test or Fisher’s exact test (a contingency table for each pair of variables)
- Cluster voxels by findings

[V. Megalooikonomou, C. Davatzikos, E. Herskovits, SIGKDD 1999]
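The per-pair contingency-table test mentioned above can be sketched as follows. This is a minimal illustration, assuming a 2x2 table for one voxel and one binary non-spatial variable; the function name and the counts are hypothetical, and cells with zero expected count are not handled.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-squared test for a 2x2 contingency table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    chi2 = 0.0
    for i, observed_row in enumerate(table):
        for j, observed in enumerate(observed_row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, chi2 is the square of a standard normal,
    # so the upper-tail p-value equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical counts: rows = voxel abnormal / normal,
# columns = non-spatial variable present / absent.
chi2, p = chi_square_2x2([[20, 5], [10, 25]])
print(f"chi2={chi2:.3f}, p={p:.2e}")
```

In a voxel-based analysis one such test would be run per voxel and per variable, which is where the map of statistical significance comes from.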

Analysis by grouping of voxels

- Grouping of voxels (atlas-based)
  - Prior knowledge increases sensitivity
  - Data reduction: 10^7 voxels → R regions (structures)
  - Map a ROI onto at least one region
  - As good as the atlas being used
- M non-spatial variables, R regions
- Analysis
  - Categorical structural variables
    - M x R contingency tables, Chi-square/Fisher exact test
    - multiple comparison problem
    - log-linear analysis, multivariate Bayesian
  - Continuous structural variables
    - Logistic regression, Mann-Whitney
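To make the multiple comparison problem concrete: with M variables and R regions there are M x R tests, so a per-test threshold has to shrink accordingly. The sketch below uses the Bonferroni correction, which is one common choice and an assumption here (the slides do not name a specific correction); the function names and p-values are hypothetical.

```python
def bonferroni_threshold(alpha, num_variables, num_regions):
    """Per-test significance threshold when M x R tests are performed."""
    return alpha / (num_variables * num_regions)

def significant_pairs(p_values, alpha):
    """p_values maps (variable, region) -> p-value; keep pairs that survive correction."""
    threshold = alpha / len(p_values)
    return sorted(pair for pair, p in p_values.items() if p < threshold)

# Hypothetical p-values for 2 variables x 2 regions (4 tests).
p_values = {("c1", "s1"): 0.001, ("c1", "s2"): 0.03,
            ("c2", "s1"): 0.2, ("c2", "s2"): 0.012}
print(bonferroni_threshold(0.05, 2, 2))   # per-test threshold for 4 tests
print(significant_pairs(p_values, 0.05))  # only pairs below that threshold
```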

Dynamic Recursive Partitioning

- Adaptive partitioning of a 3D volume
- Partitioning criterion:
  - discriminative power of feature(s) of hyper-rectangle and
  - size of hyper-rectangle
- Extract features from discriminative regions
- Reduce multiple comparison problem
  - (# tests = # partitions < # voxels)
  - tests downward closed

[V. Megalooikonomou, D. Pokrajac, A. Lazarevic, and Z. Obradovic, SPIE Conference on Visualization and Data Analysis, 2002]
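A minimal sketch of the recursive partitioning idea above, not the authors' implementation. Assumptions: the feature of a hyper-rectangle is its mean voxel value, discriminative power is a Welch t statistic compared against a fixed cutoff (instead of a p-value), a box is split into octants only while it is not yet discriminative, and volumes are nested lists indexed `volume[x][y][z]`. All names are hypothetical.

```python
import statistics
from itertools import product

def box_mean(volume, box):
    """Mean voxel value inside box = ((x0, x1), (y0, y1), (z0, z1)), half-open ranges."""
    (x0, x1), (y0, y1), (z0, z1) = box
    return statistics.fmean(volume[x][y][z]
                            for x in range(x0, x1)
                            for y in range(y0, y1)
                            for z in range(z0, z1))

def welch_t(a, b):
    """Absolute Welch t statistic between two samples of per-subject features."""
    se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    diff = abs(statistics.fmean(a) - statistics.fmean(b))
    if se == 0.0:
        return 0.0 if diff == 0.0 else float("inf")
    return diff / se

def split(box):
    """Split a box into up to 8 octants; axes that are 1 voxel wide are not split."""
    halves = []
    for lo, hi in box:
        mid = (lo + hi) // 2
        halves.append([(lo, mid), (mid, hi)] if hi - lo > 1 else [(lo, hi)])
    return list(product(*halves))

def drp(group_a, group_b, box, t_threshold, max_depth, depth=0):
    """Return (discriminative boxes, number of statistical tests performed)."""
    t = welch_t([box_mean(v, box) for v in group_a],
                [box_mean(v, box) for v in group_b])
    if t >= t_threshold:
        return [box], 1            # discriminative: keep the box, stop splitting
    children = split(box)
    if depth >= max_depth or len(children) == 1:
        return [], 1               # too deep or too small: give up on this box
    found, tests = [], 1
    for child in children:
        sub_found, sub_tests = drp(group_a, group_b, child,
                                   t_threshold, max_depth, depth + 1)
        found += sub_found
        tests += sub_tests
    return found, tests

def make_volume(offset, bumps=()):
    """Toy 4x4x4 volume: constant offset plus optional (box, delta) bumps."""
    vol = [[[offset for _ in range(4)] for _ in range(4)] for _ in range(4)]
    for ((x0, x1), (y0, y1), (z0, z1)), delta in bumps:
        for x in range(x0, x1):
            for y in range(y0, y1):
                for z in range(z0, z1):
                    vol[x][y][z] += delta
    return vol

offsets = [0.0, 0.25, -0.25, 0.5]            # per-subject baseline
corner_lo = ((0, 2), (0, 2), (0, 2))
corner_hi = ((2, 4), (2, 4), (2, 4))
group_a = [make_volume(o, [(corner_lo, 5.0), (corner_hi, -5.0)]) for o in offsets]
group_b = [make_volume(o) for o in offsets]
regions, n_tests = drp(group_a, group_b, ((0, 4), (0, 4), (0, 4)),
                       t_threshold=4.0, max_depth=1)
print(regions, n_tests)
```

In this toy data the two bumps cancel in the whole-volume mean, so the root box is not discriminative and only the split reveals the two corners, using far fewer tests than the 64 a voxel-wise analysis would need.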

Other Methods for Spatial Data Classification

Distinguishing among distributions:

- Distributional Distances:
  - Mahalanobis distance
  - Kullback-Leibler divergence (parametric, non-parametric)
- Maximum Likelihood:
  - Estimate probability densities and compute likelihood
- EM (Expectation-Maximization) method to model spatial regions using some base function (Gaussian)
- Static partitioning:
  - Reduction of the # of attributes as compared to voxel-wise analysis
  - Space partitioned into 3D hyper-rectangles (variables: properties of voxels inside hyper-rectangles) - incrementally increase discretization

[D. Pokrajac, V. Megalooikonomou, A. Lazarevic, D. Kontos, Z. Obradovic, Artificial Intelligence in Medicine, Vol. 33, No. 3, pp. 261-280, Mar. 2005]
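As an illustration of a distributional distance, here is a small sketch of the non-parametric Kullback-Leibler divergence between two histograms over the same bins. The bin counts, the function name, and the `eps` floor for empty bins are assumptions made for this example.

```python
import math

def kl_divergence(p_counts, q_counts, eps=1e-9):
    """KL(P || Q) for two histograms given as raw counts over the same bins."""
    p_total = sum(p_counts)
    q_total = sum(q_counts)
    kl = 0.0
    for p_count, q_count in zip(p_counts, q_counts):
        p = p_count / p_total
        q = max(q_count / q_total, eps)   # floor avoids division by zero
        if p > 0:
            kl += p * math.log(p / q)
    return kl

# Hypothetical two-bin histograms for two groups of subjects.
print(kl_divergence([3, 1], [1, 1]))
print(kl_divergence([1, 1], [3, 1]))  # KL is not symmetric
```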

Experimental Results

Areas discovered by DRP with t-test: significance threshold=0.05, maximum tree depth=3. Colorbar shows significance.

Comparison of number of tests performed:

| Thresh. | Depth | DRP | Voxel-wise |
|---|---|---|---|
| 0.05 | 3 | 569 | 201774 |
| 0.05 | 4 | 4425 | 201774 |
| 0.01 | 4 | 4665 | 201774 |

[D. Kontos, V. Megalooikonomou, D. Pokrajac, A. Lazarevic, Z. Obradovic, O. B. Boyko, J. Ford, F. Makedon, A. J. Saykin, MICCAI 2004]

Experimental Results

Discriminative sub-regions detected when applying (a) DRP and (b) voxel-wise analysis with ranksum test and significance threshold 0.05 to the real fMRI volume data

Impact:

- Assist in interpretation of images (e.g., facilitating diagnosis)
- Enable researchers to integrate, manipulate and analyze large volumes of image data

Time Sequence Analysis

Time Sequence: A sequence (ordered collection) of real values: X = x1, x2,…, xn

- Time series data abound in many applications …
- Challenges:
  - High dimensionality
  - Large number of sequences
  - Similarity metric definition
- Similarity analysis (e.g., find stocks similar to that of IBM)
- Goals: high accuracy (and high speed) in similarity searches among time series and in discovering interesting patterns
- Applications: clustering, classification, similarity searches, summarization

Dimensionality Reduction Techniques

- DFT: Discrete Fourier Transform
- DWT: Discrete Wavelet Transform
- SVD: Singular Value Decomposition
- APCA: Adaptive Piecewise Constant Approximation
- PAA: Piecewise Aggregate Approximation
- SAX: Symbolic Aggregate approXimation
- …

Similarity distances for time series

- Euclidean Distance:
  - most common, sensitive to shifts
- Dynamic Time Warping:
  - improves accuracy but is slow: O(n^2)
- Envelope-based DTW:
  - faster: O(n)
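The difference between the first two distances can be sketched as below: plain Euclidean distance and the textbook O(n^2) dynamic-programming form of DTW, with squared point-wise cost. This is an illustrative sketch; the envelope-based variant is not shown, and the two toy series are hypothetical.

```python
def euclidean(x, y):
    """Point-by-point Euclidean distance between two equal-length series."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def dtw(x, y):
    """Dynamic Time Warping distance via the full O(n*m) cost table."""
    n, m = len(x), len(y)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (x[i - 1] - y[j - 1]) ** 2
            # extend the cheapest of: insertion, deletion, match
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m] ** 0.5

x = [0, 0, 1, 2, 1, 0]
y = [0, 1, 2, 1, 0, 0]   # same shape as x, shifted by one step
print(euclidean(x, y))   # penalizes the shift
print(dtw(x, y))         # warping absorbs the shift
```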

A more intuitive idea:

two series should be considered similar if they have enough non-overlapping time-ordered pairs of subsequences that are similar (Agrawal et al. VLDB, 1995)

Partitioning – Piecewise Constant Approximations

- Original time series (n points)
- Piecewise constant approximation (PCA) or Piecewise Aggregate Approximation (PAA) [Yi and Faloutsos ’00, Keogh et al. ’00] (n' segments)
- Adaptive Piecewise Constant Approximation (APCA) [Keogh et al. ’01] (n" segments)
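A minimal sketch of PAA as described above: the series is cut into equal-length segments and each segment is replaced by its mean. For simplicity this version assumes the series length is a multiple of the number of segments; the function name is hypothetical. APCA, which lets segment lengths vary, is not shown.

```python
def paa(series, num_segments):
    """Piecewise Aggregate Approximation: mean of each equal-length segment."""
    n = len(series)
    if n % num_segments != 0:
        raise ValueError("series length must be a multiple of num_segments")
    width = n // num_segments
    return [sum(series[i:i + width]) / width for i in range(0, n, width)]

# 8 points reduced to 4 segment means.
print(paa([1, 3, 5, 7, 2, 2, 8, 0], 4))
```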

Multiresolution Vector Quantized approximation (MVQ)

Partitions a sequence into equal-length segments and uses VQ to represent each sequence by appearance frequencies of key-subsequences

1) Uses a ‘vocabulary’ of subsequences (codebook) – training is involved

2) Takes multiple resolutions into account – keeps both local and global information

3) Unlike wavelets, partially ignores the ordering of ‘codewords’

4) Can exploit prior knowledge about the data

5) Employs a new distance metric

[V. Megalooikonomou, Q. Wang, G. Li, C. Faloutsos, ICDE 2005]

Methodology

[Figure: MVQ pipeline with three stages: codebook generation (s = 16 codewords), series encoding (each series becomes a string of codeword symbols, e.g. "c mdbca i fajbb"), and series transformation (each series becomes a vector of s codeword frequencies, e.g. 1121000000001000)]

Methodology – Creating a ‘vocabulary’

- Goal: find frequently appearing patterns in subsequences
- Approach: use Vector Quantization, in particular the Generalized Lloyd Algorithm (GLA)
- Output: a codebook with s codewords

Methodology – Representing time series

A time series X = x1, x2,…, xn is encoded with a new representation f = (f1, f2,…, fs), where fi is the frequency of the i-th codeword in X.
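The codebook training and frequency encoding can be sketched with plain Lloyd iterations, as below. This is a simplified illustration under stated assumptions, not the authors' code: segments are non-overlapping, the codebook is initialized naively from the first s training segments (GLA implementations often use a splitting initialization instead), and all function names are hypothetical.

```python
def segments(series, length):
    """Cut a series into non-overlapping segments of the given length."""
    return [tuple(series[i:i + length])
            for i in range(0, len(series) - length + 1, length)]

def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def nearest(codebook, segment):
    """Index of the codeword closest to the segment."""
    return min(range(len(codebook)), key=lambda k: sq_dist(codebook[k], segment))

def train_codebook(training_segments, s, iterations=20):
    """Lloyd iterations: assign segments to codewords, then recenter codewords."""
    codebook = [list(seg) for seg in training_segments[:s]]   # naive init (assumption)
    for _ in range(iterations):
        clusters = [[] for _ in range(s)]
        for seg in training_segments:
            clusters[nearest(codebook, seg)].append(seg)
        for k, members in enumerate(clusters):
            if members:   # keep the old codeword if its cluster is empty
                codebook[k] = [sum(col) / len(members) for col in zip(*members)]
    return codebook

def encode(series, codebook, length):
    """Frequency vector f = (f1, ..., fs): how often each codeword appears in the series."""
    freq = [0] * len(codebook)
    for seg in segments(series, length):
        freq[nearest(codebook, seg)] += 1
    return freq

training = segments([0.0, 1.0, 1.0, 0.0, 0.1, 0.9, 0.9, 0.1], 2)
codebook = train_codebook(training, s=2)
print(codebook)
print(encode([0, 1, 0, 1, 1, 0], codebook, 2))
```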

Methodology

New distance metric:

The histogram model is used to calculate similarity at each resolution level.

[Equation: similarity between a query q and a series, computed from the codeword frequencies fi,q for i = 1, 2, …, s]

Methodology

- Time series summarization:
  - High level information (frequently appearing patterns) is more useful
  - The new representation can provide this kind of information

Codewords (patterns) 3 and 5 each show up 2 times

Methodology

Problems of frequency-based encoding:

- It is hard to define an appropriate resolution (codeword length)

- It may lose global information

Solution: use multiple resolutions

Methodology

Proposed distance metric:

Weighted sum of similarities at all resolution levels:

$$S(q, t) = \sum_{i=1}^{c} w_i \, S_i(q, t)$$

- where S_i is the similarity at level i, w_i is its weight, and c is the number of resolution levels
- lacking any prior knowledge, equal weights for all resolution levels work well most of the time
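A sketch of the weighted sum over resolution levels. The per-level histogram similarity used here, 1 / (1 + sum of |a - b| / (1 + a + b)), is an assumption chosen for illustration and should not be read as the exact formula from the slides; the function names and the toy frequency vectors are hypothetical.

```python
def histogram_similarity(f_q, f_t):
    """ASSUMED per-level similarity between two codeword-frequency vectors."""
    dis = sum(abs(a - b) / (1 + a + b) for a, b in zip(f_q, f_t))
    return 1.0 / (1.0 + dis)

def multiresolution_similarity(levels_q, levels_t, weights=None):
    """Weighted sum of per-level similarities; equal weights by default."""
    c = len(levels_q)
    if weights is None:
        weights = [1.0 / c] * c
    return sum(w * histogram_similarity(f_q, f_t)
               for w, f_q, f_t in zip(weights, levels_q, levels_t))

# Toy example: two series that share a codeword at the coarse level
# but use disjoint codewords at the finer level.
coarse = [0, 0, 1, 0]
fine_q = [1, 0, 0, 1] + [0] * 12
fine_t = [0] * 8 + [1, 0, 0, 1] + [0] * 4
print(multiresolution_similarity([coarse, fine_q], [coarse, fine_t]))
```

With equal weights the coarse level cannot separate the two series, while the fine level pulls the overall similarity down.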

MVQ: Example of Codebooks

- Codebook for the first level
- Codebook for the second level (more codewords since there are more details)

Experiments

Datasets

- SYNDATA (control chart data): synthetic

- CAMMOUSE: 3 × 5 sequences obtained using the Camera Mouse Program

- RTT: round-trip time (RTT) measurements from UCR to CMU, with a probe sent every 50 msec for a day

Experiments

Best Match Searching:

Matching accuracy: percentage of the k nearest neighbors (kNNs) found by each approach that are in the same class as the query
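The matching accuracy for one query can be sketched as follows, assuming the distances from the query to every candidate have already been computed with whichever method is being evaluated and that the query itself is not among the candidates. The function name and the toy labels are hypothetical.

```python
def matching_accuracy(query_label, labels, distances, k):
    """Percentage of the k nearest candidates that share the query's class."""
    ranked = sorted(range(len(labels)), key=lambda i: distances[i])[:k]
    hits = sum(1 for i in ranked if labels[i] == query_label)
    return 100.0 * hits / k

labels = ["a", "a", "b", "b", "a"]
distances = [0.1, 0.5, 0.2, 0.9, 0.3]   # distance of each candidate to the query
print(matching_accuracy("a", labels, distances, k=3))
```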

Experiments

Best Match Searching

[Figure: precision-recall for different methods, including MVQ, (a) on SYNDATA dataset, (b) on CAMMOUSE dataset]

Experiments

Clustering experiments

Given two clusterings, G = G1, G2, …, GK (the true clusters) and A = A1, A2, …, AK (the clustering result of a certain method), the clustering accuracy is evaluated with the cluster similarity defined as:

$$\mathrm{Sim}(G, A) = \frac{1}{K} \sum_{i=1}^{K} \max_{j} \mathrm{Sim}(G_i, A_j)$$

with

$$\mathrm{Sim}(G_i, A_j) = \frac{2\,|G_i \cap A_j|}{|G_i| + |A_j|}$$

[Gavrilov, M., Anguelov, D., Indyk, P. and Motwani, R., KDD 2000]
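The cluster similarity of Gavrilov et al. is usually written as the average, over the true clusters, of the best pairwise overlap 2|Gi ∩ Aj| / (|Gi| + |Aj|). A minimal sketch under that assumption, with clusters given as sets of item ids and hypothetical function names:

```python
def pair_similarity(g, a):
    """Overlap of two clusters: 2 |g ∩ a| / (|g| + |a|)."""
    g, a = set(g), set(a)
    return 2.0 * len(g & a) / (len(g) + len(a))

def cluster_similarity(true_clusters, found_clusters):
    """Average over true clusters of the best-matching found cluster."""
    best = [max(pair_similarity(g, a) for a in found_clusters)
            for g in true_clusters]
    return sum(best) / len(true_clusters)

G = [{1, 2, 3}, {4, 5, 6}]
print(cluster_similarity(G, [{1, 2, 3}, {4, 5, 6}]))   # perfect agreement
print(cluster_similarity(G, [{1, 2}, {3, 4, 5, 6}]))   # item 3 misplaced
```

Note that the measure is not symmetric: swapping the roles of G and A can change the value.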

MVQ: Example: Two Time Series

- Given two time series t1 and t2 as follows:

- In the first level, they are encoded with the same codeword (3), so they are not distinguishable

- In the second level, more details are recorded. The two series have different encoded forms: the first series is encoded with codewords 1 and 4, the second with codewords 9 and 12.

Analysis of images by projection to 1D

- Hilbert Space Filling Curve
- Binning
- Statistical tests of significance on groups of points
- Identification of discriminative areas by back-projection

[Figure: (a) linear mapping of a 3D fMRI scan, (b) effect of binning by representing each bin with its V_mean measurement, (c) the discriminative voxels after applying the t-test with θ=0.05]

[D. Kontos, V. Megalooikonomou, N. Ghubade, and C. Faloutsos. IEEE Engineering in Medicine and Biology Society (EMBS), 2003]
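The projection and binning steps can be illustrated in 2D, which is a simplification of the 3D case on this slide. The sketch uses the standard iterative index-to-coordinate conversion for the Hilbert curve on an n x n grid (n a power of two), then averages consecutive points into bins; function names and the toy image are hypothetical.

```python
def hilbert_d2xy(n, d):
    """Map position d along the Hilbert curve to (x, y) on an n x n grid."""
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                 # rotate/flip the quadrant
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

def linearize(image):
    """Read a square image (indexed image[y][x]) along the Hilbert curve."""
    n = len(image)
    return [image[y][x] for x, y in (hilbert_d2xy(n, d) for d in range(n * n))]

def bin_means(sequence, bin_size):
    """Represent each bin of consecutive points by its mean value."""
    bins = [sequence[i:i + bin_size] for i in range(0, len(sequence), bin_size)]
    return [sum(b) / len(b) for b in bins]

image = [[1, 2],
         [3, 4]]
sequence = linearize(image)
print(sequence, bin_means(sequence, 2))
```

Because consecutive curve positions are always neighboring cells, each bin corresponds to a compact spatial area, which is what makes back-projection of significant bins meaningful.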

Applying time series techniques

Areas discovered: (a) θ=0.05, (b) θ=0.01. The colorbar shows significance.

Results: 87%-98% classification accuracy (t-test, CATX)

- Variation: concatenate the values of statistically significant areas into spatial sequences
- Pattern analysis using the similarity between spatial sequences and time sequences
- SVD, DFT, DWT, PCA (clustering accuracy: 89-100%)

[Q. Wang, D. Kontos, G. Li and V. Megalooikonomou, ICASSP 2004]

Conclusions

- ‘Find patterns/interesting things’ efficiently and robustly in spatial and temporal data
- Use of partitioning and clustering
- Analysis at multiple resolutions
- Reduction of the number of tests performed
- Intelligent exploration of the space to find discriminative areas
- Reduction of dimensionality
- Symbolic representation
- Nice summarization

Collaborators

Faculty:

- Zoran Obradovic
- Orest Boyko
- James Gee
- Andrew Saykin
- Christos Faloutsos
- Christos Davatzikos
- Edward Herskovits
- Fillia Makedon
- Dragoljub Pokrajac

Students:

- Despina Kontos
- Qiang Wang
- Guo Li

Others:

- James Ford
- Alexandar Lazarevic

Acknowledgements

This research has been funded by:

- National Science Foundation CAREER award 0237921
- National Science Foundation Grant 0083423
- National Institutes of Health Grant R01 MH68066 funded by NIMH, NINDS, and NIA
