Pairwise Nearest Neighbor Method Revisited

Olli Virmajoki, University of Joensuu, Department of Computer Science, Joensuu, Finland, 11.12.2004 (Finnish title: Parittainen yhdistelymenetelmä uudistettuna)

Clustering • Important combinatorial optimization problem that must often be solved as a part of more complicated tasks in • data analysis • pattern recognition • data mining • other fields of science and engineering • Entails partitioning a data set so that similar objects are grouped together and dissimilar objects are placed in separate groups

Example of data sets [Figures: Employment statistics; RGB data]

Summary of data sets

Data sets

An example of clustering

Clustering • Given a set of N data vectors X={x1, x2, ..., xN} in K-dimensional space, clustering aims at solving the partition P={p1, p2, ..., pN}, which defines for each data vector the index of the cluster to which it is assigned. • Cluster sa = {xi | pi=a} • Clustering S={s1, s2, ..., sM} • Codebook C={c1, c2, ..., cM} • Cost function • Combinatorial optimization problem
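The cost function itself did not survive on this slide. A common choice in vector quantization is the mean squared error between each vector and the centroid of its cluster, and the sketch below assumes that choice. The names `sq_dist`, `centroids` and `mse_distortion` are illustrative, and cluster indices are 0-based here.

```python
def sq_dist(x, c):
    """Squared Euclidean distance between two K-dimensional vectors."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))

def centroids(X, P, M):
    """Codebook C: the centroid of each cluster s_a = {x_i | p_i = a}."""
    K = len(X[0])
    sums = [[0.0] * K for _ in range(M)]
    counts = [0] * M
    for x, p in zip(X, P):
        counts[p] += 1
        for k in range(K):
            sums[p][k] += x[k]
    # Empty clusters keep a zero vector instead of dividing by zero.
    return [[s / n for s in row] if n else row for row, n in zip(sums, counts)]

def mse_distortion(X, P, C):
    """Assumed cost function: mean squared error of partition P with codebook C."""
    return sum(sq_dist(x, C[p]) for x, p in zip(X, P)) / len(X)

X = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
P = [0, 0, 1, 1]                      # cluster index of each vector
C = centroids(X, P, M=2)
print(C, mse_distortion(X, P, C))
```

Under this assumption, clustering is the search for the pair (P, C) that minimizes `mse_distortion` for a fixed number of clusters M.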

Clustering algorithms • Heuristic methods • Optimization methods • K-means • Genetic algorithms • Graph-theoretical methods • Hierarchical methods • Divisive • Agglomerative (merging)

Agglomerative clustering: N = 22 (number of data points), M = 3 (number of final clusters)

Ward’s method (PNN in VQ) Merge cost: Local optimization strategy: • Nearest neighbor search is needed: • finding the cluster pair to be merged • updating of NN pointers
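The merge-cost formula is missing from the slide text. Ward's criterion is usually written as d(a,b) = n_a·n_b / (n_a + n_b) · ||c_a − c_b||², the increase in total squared error caused by merging clusters a and b. The sketch below assumes that form and applies the local strategy of always merging the cheapest pair. It is a naive version with a full pair search in every step, and the function names are illustrative.

```python
def merge_cost(na, ca, nb, cb):
    """Ward's criterion: increase in total squared error if a and b are merged."""
    d2 = sum((u - v) ** 2 for u, v in zip(ca, cb))
    return na * nb / (na + nb) * d2

def merge(na, ca, nb, cb):
    """Size and centroid of the merged cluster (weighted mean of the centroids)."""
    n = na + nb
    return n, tuple((na * u + nb * v) / n for u, v in zip(ca, cb))

def pnn(X, M):
    """Naive PNN: start from singletons, merge the cheapest pair until M remain."""
    clusters = [(1, tuple(map(float, x))) for x in X]
    while len(clusters) > M:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                cost = merge_cost(*clusters[a], *clusters[b])
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        clusters[a] = merge(*clusters[a], *clusters[b])
        del clusters[b]
    return clusters

print(pnn([(0, 0), (0, 1), (5, 5), (5, 6), (9, 0)], M=3))
```

Each returned entry is a (size, centroid) pair. The double loop is the nearest neighbor search that the speed-up methods on the following slides aim to reduce.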

The PNN method [Figure: clusters merged one pair at a time, from M=5000 through M=4999, M=4988, ..., M=50, ..., M=16 down to M=15]

Nearest neighbor pointers • Fast exact PNN method: reduces the number of nearest neighbor searches in each iteration, lowering the time complexity from O(N^3) to Ω(N^2)
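A minimal sketch of the pointer idea, assuming Ward's merge cost and a plain dictionary for the pointers (both are choices made for this illustration, not details taken from the slide). Every cluster stores its cheapest merge partner. After a merge, the search is repeated only for the merged cluster and for clusters whose pointer referred to one of the two merged clusters.

```python
def merge_cost(na, ca, nb, cb):
    """Ward's criterion for merging clusters a and b."""
    d2 = sum((u - v) ** 2 for u, v in zip(ca, cb))
    return na * nb / (na + nb) * d2

def find_nn(i, clusters):
    """Full O(N) search for the cheapest merge partner of cluster i."""
    best_cost, best_j = float("inf"), None
    for j in clusters:
        if j != i:
            cost = merge_cost(*clusters[i], *clusters[j])
            if cost < best_cost:
                best_cost, best_j = cost, j
    return best_cost, best_j

def fast_pnn(X, M):
    """PNN with nearest neighbor pointers; also counts the searches performed."""
    clusters = {i: (1, tuple(map(float, x))) for i, x in enumerate(X)}
    nn = {i: find_nn(i, clusters) for i in clusters}   # i -> (cost, pointer)
    searches = len(nn)
    while len(clusters) > M:
        a = min(nn, key=lambda i: nn[i][0])            # cheapest pair, O(N) scan
        b = nn[a][1]
        (na, ca), (nb, cb) = clusters[a], clusters[b]
        n = na + nb
        clusters[a] = (n, tuple((na * u + nb * v) / n for u, v in zip(ca, cb)))
        del clusters[b], nn[b]
        # Re-search only the merged cluster and clusters that pointed at a or b.
        stale = [i for i in nn if i == a or nn[i][1] in (a, b)]
        for i in stale:
            nn[i] = find_nn(i, clusters)
            searches += 1
    return list(clusters.values()), searches

print(fast_pnn([(0, 0), (0, 1), (5, 5), (5, 6), (9, 0)], M=3))
```

Keeping the untouched pointers is safe for Ward's cost, because merging the globally cheapest pair cannot make the merged cluster a cheaper partner for a third cluster than both of its parts were.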

Combining the PNN and k-means

PNN as a crossover method in the genetic algorithm [Figure: two random initial codebooks (Initial1 and Initial2, M=15) are combined by union into a codebook of M=30, which PNN reduces to the final codebook of M=15]

Publication 1: Speed-up methods • Partial distortion search (PDS) • Mean-distance-ordered partial search (MPS) • Uses the component means of the vectors • Derives a precondition for the distance calculations • Run time reduced to 2-15%
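The two tests can be sketched for a plain nearest-code-vector search as follows. The precondition used here is the bound (Σx_k − Σc_k)² / K ≤ ||x − c||², which follows from the Cauchy-Schwarz inequality. This sketch applies the bound as a filter only and omits the ordering of candidates by mean, and the function name is illustrative.

```python
def nearest_pds_mps(x, codebook):
    """Nearest code vector to x, using the MPS precondition and the PDS early exit.

    Returns (index, squared distance, number of candidates that passed the precondition).
    """
    K = len(x)
    sx = sum(x)
    best_d, best_i, evaluated = float("inf"), -1, 0
    for i, c in enumerate(codebook):
        # MPS precondition: the component sums give a lower bound on the distance,
        # so c cannot win if the bound already reaches the best distance so far.
        if (sx - sum(c)) ** 2 / K >= best_d:
            continue
        evaluated += 1
        # PDS: accumulate the distance and stop as soon as it reaches best_d.
        d = 0.0
        for xk, ck in zip(x, c):
            d += (xk - ck) ** 2
            if d >= best_d:
                break
        else:
            best_d, best_i = d, i
    return best_i, best_d, evaluated

codebook = [(0, 0), (10, 10), (1, 1), (5, -5)]
print(nearest_pds_mps((1, 2), codebook))
```

In a real implementation the component sums of the code vectors would be precomputed once rather than recomputed inside the loop.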

Example of the MPS method [Figure labels: Input vector; Best candidate]

Publication 2: Graph-based PNN • Based on the exact PNN method • NN search is limited to the k clusters that are connected by the graph structure • Reduces the time complexity of every search from O(N) to O(k) • Run time reduced to 1-4%

Why graph structure? • Only O(k) searches with the graph structure (k = 3) • O(N) searches with the full search (N = 4096)
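The difference between the two searches can be sketched as follows, assuming Ward's merge cost and a graph built by brute force. Both `knn_graph` and `graph_nn` are illustrative names, and the graph maintenance that is needed after each merge is left out.

```python
def merge_cost(na, ca, nb, cb):
    """Ward's criterion for merging clusters a and b."""
    d2 = sum((u - v) ** 2 for u, v in zip(ca, cb))
    return na * nb / (na + nb) * d2

def knn_graph(clusters, k):
    """Brute-force graph: each cluster is linked to its k cheapest merge partners."""
    graph = {}
    for i in clusters:
        ranked = sorted((merge_cost(*clusters[i], *clusters[j]), j)
                        for j in clusters if j != i)
        graph[i] = [j for _, j in ranked[:k]]
    return graph

def graph_nn(i, clusters, graph):
    """O(k) search: only the clusters linked to i in the graph are examined."""
    return min((merge_cost(*clusters[i], *clusters[j]), j) for j in graph[i])

points = [(0, 0), (0, 1), (5, 5), (5, 6), (9, 0), (9, 1)]
clusters = {i: (1, tuple(map(float, p))) for i, p in enumerate(points)}
graph = knn_graph(clusters, k=2)
print(graph_nn(0, clusters, graph))
```

Building the graph by brute force is itself quadratic, so the saving comes from the many searches performed during the merge phase, each of which touches k candidates instead of all remaining clusters.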

Sample graph

Publication 3: Multilevel thresholding • Can be considered as a special case of vector quantization (VQ), where the vectors are 1-dimensional • Existing method: O(N^2) • PNN thresholding can be implemented in O(N·logN) • The proposed method works in real time for any number of thresholds
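One way to reach O(N·logN) in the 1-dimensional case is sketched below. It assumes that the input is a gray-level histogram, that only adjacent classes are candidates for merging, and that a threshold is the highest gray level of a class. Merge costs are kept in a binary heap and outdated heap entries are skipped by means of version stamps. The function name and these design details are assumptions of this sketch.

```python
import heapq

def pnn_thresholds(hist, M):
    """Return M-1 thresholds by merging adjacent histogram classes until M remain."""
    levels = [g for g, cnt in enumerate(hist) if cnt > 0]
    end = len(levels)
    n = [float(hist[g]) for g in levels]      # class sizes
    mean = [float(g) for g in levels]         # class means
    last = list(levels)                       # highest gray level in each class
    nxt = list(range(1, end + 1))             # linked list over surviving classes
    prv = list(range(-1, end - 1))
    ver = [0] * end                           # version stamps; -1 marks a removed class

    def cost(a, b):                           # Ward's merge cost in one dimension
        return n[a] * n[b] / (n[a] + n[b]) * (mean[a] - mean[b]) ** 2

    heap = [(cost(i, i + 1), i, i + 1, 0, 0) for i in range(end - 1)]
    heapq.heapify(heap)
    remaining = end
    while remaining > M:
        _, a, b, va, vb = heapq.heappop(heap)          # O(log N)
        if ver[a] != va or ver[b] != vb:
            continue                                   # outdated entry, skip it
        mean[a] = (n[a] * mean[a] + n[b] * mean[b]) / (n[a] + n[b])
        n[a] += n[b]
        last[a] = last[b]
        ver[a] += 1
        ver[b] = -1
        nxt[a] = nxt[b]
        if nxt[a] != end:
            prv[nxt[a]] = a
        remaining -= 1
        # Only the two pairs around the merged class get new costs.
        for left, right in ((prv[a], a), (a, nxt[a])):
            if left >= 0 and right != end:
                heapq.heappush(heap, (cost(left, right), left, right, ver[left], ver[right]))
    thresholds, i = [], 0
    while nxt[i] != end:
        thresholds.append(last[i])
        i = nxt[i]
    return thresholds

hist = [0] * 21
for level, count in {0: 3, 1: 4, 2: 3, 10: 5, 11: 5, 20: 2}.items():
    hist[level] = count
print(pnn_thresholds(hist, M=3))
```

Each merge pushes at most two new heap entries, so the heap never holds more than about 3N entries and every heap operation costs O(log N).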

Distances in heap structure [Figure: heap of merge distances, annotated with O(log N) and O(1)]

Publication 4: Iterative shrinking (IS) • Generates the clustering by a sequence of cluster removal operations • In the IS method the vectors can be reassigned more freely than in the PNN method • Can be applied as a crossover method in the genetic algorithm (GAIS) • In the reported comparisons, GAIS outperforms all the other clustering algorithms tested
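The removal operation can be sketched as follows. This is a simplified illustration: the removal cost is evaluated with the centroids held fixed, and the centroids are recomputed only after a cluster has been removed, so it is not necessarily the exact cost function of the publication. All function names are illustrative.

```python
def sq_dist(x, c):
    """Squared Euclidean distance."""
    return sum((u - v) ** 2 for u, v in zip(x, c))

def centroids(X, P):
    """Centroid of every cluster label that occurs in partition P."""
    groups = {}
    for x, p in zip(X, P):
        groups.setdefault(p, []).append(x)
    return {a: tuple(sum(col) / len(pts) for col in zip(*pts))
            for a, pts in groups.items()}

def nearest_other(x, a, C):
    """(distance, label) of the closest centroid other than that of cluster a."""
    return min((sq_dist(x, c), b) for b, c in C.items() if b != a)

def removal_cost(a, X, P, C):
    """Increase in squared error if each vector of cluster a moves to its next-best cluster."""
    return sum(nearest_other(x, a, C)[0] - sq_dist(x, C[a])
               for x, p in zip(X, P) if p == a)

def iterative_shrinking(X, P, M):
    """Remove the cheapest cluster repeatedly until M clusters remain."""
    P = list(P)
    while len(set(P)) > M:
        C = centroids(X, P)
        victim = min(C, key=lambda a: removal_cost(a, X, P, C))
        # Each vector of the removed cluster chooses its own new cluster.
        P = [nearest_other(x, p, C)[1] if p == victim else p for x, p in zip(X, P)]
    return P

X = [(0, 0), (0, 1), (5, 5), (5, 6), (2, 3)]
print(iterative_shrinking(X, [0, 0, 1, 1, 2], M=2))
```

The difference from PNN is visible in the reassignment line: the vectors of the removed cluster may be spread over several neighboring clusters, whereas a PNN merge moves all of them to a single partner.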

Example of the PNN method

Example of the iterative shrinking method

The PNN and IS in the search of the number of clusters

Time-distortion performance

Publication 5: Optimal clustering • Can be found by considering all possible merge sequences and finding the one that minimizes the optimization function • Can be implemented as a branch-and-bound (BB) technique • Two suboptimal but polynomial-time variants: • Piecewise optimization • Look-Ahead optimization
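A small sketch of the exhaustive search with a bound, assuming the total squared error as the optimization function. Merge sequences are explored recursively, a branch is abandoned once its cost reaches the best complete solution found so far, and a clustering that has already been expanded through another merge order is not expanded again. The run time is exponential, so this is meant for very small inputs only, and the names are illustrative.

```python
def sse(X, members):
    """Sum of squared distances from the members of one cluster to their centroid."""
    pts = [X[i] for i in members]
    c = [sum(col) / len(pts) for col in zip(*pts)]
    return sum((v - m) ** 2 for p in pts for v, m in zip(p, c))

def optimal_clustering(X, M):
    """Return (cost, clustering) with minimal total squared error for M clusters."""
    best = {"cost": float("inf"), "parts": None}
    seen = set()                       # clusterings that have already been expanded

    def search(parts, cost):
        if cost >= best["cost"]:       # bound: a merge never decreases the cost
            return
        if len(parts) == M:
            best["cost"], best["parts"] = cost, parts
            return
        if parts in seen:              # same clustering reached by another merge order
            return
        seen.add(parts)
        plist = list(parts)
        for a in range(len(plist)):
            for b in range(a + 1, len(plist)):
                merged = plist[a] | plist[b]
                inc = sse(X, merged) - sse(X, plist[a]) - sse(X, plist[b])
                search((parts - {plist[a], plist[b]}) | {merged}, cost + inc)

    search(frozenset(frozenset([i]) for i in range(len(X))), 0.0)
    return best["cost"], best["parts"]

X = [(0,), (1,), (5,), (6,), (9,)]
cost, parts = optimal_clustering(X, M=2)
print(cost, sorted(sorted(p) for p in parts))
```

Clusters are represented as frozensets of vector indices, which makes a whole clustering hashable and lets the `seen` set detect redundant branches.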

Example of non-redundant search tree • Branches that do not have any valid clustering have been cut out

Illustration of the Piecewise optimization

Comparative results

Example of clustering [Figure panels: k-means; agglomerative clustering]

Conclusions • Several speed-up methods • Projection-based search • Partial distortion search • k nearest neighbor graph • Efficient O(N·logN) time implementation for the 1-dimensional case • Generalization of the merge phase by the cluster removal philosophy (IS) for better quality • Optimal clustering based on the PNN method
