Explore the Pairwise Nearest Neighbor (PNN) method in clustering algorithms for data analysis and pattern recognition. Discover optimization strategies, heuristic methods, genetic algorithms, and more to efficiently partition data sets. Learn about PNN improvements, speed-up methods, graph-based approaches, and optimal clustering solutions.
Pairwise Nearest Neighbor Method Revisited (Finnish title: Parittainen yhdistelymenetelmä uudistettuna) Olli Virmajoki UNIVERSITY OF JOENSUU DEPARTMENT OF COMPUTER SCIENCE JOENSUU, FINLAND 11.12.2004
Clustering • Important combinatorial optimization problem that must often be solved as a part of more complicated tasks in • data analysis • pattern recognition • data mining • other fields of science and engineering • Entails partitioning a data set so that similar objects are grouped together and dissimilar objects are placed in separate groups
Examples of data sets [Figure panels: employment statistics, RGB data]
Clustering • Given a set of N data vectors X={x1, x2, ..., xN} in K-dimensional space, clustering aims at solving the partition P={p1, p2, ..., pN}, which defines for each data vector the index of the cluster to which it is assigned. • Cluster sa = {xi | pi = a} • Clustering S={s1, s2, ..., sM} • Codebook C={c1, c2, ..., cM} • Cost function • Combinatorial optimization problem
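The notation above can be sketched in a few lines of Python. This is a minimal sketch, not the author's code: it assumes the cost function is the mean squared error (MSE) between each vector and the centroid of its cluster, which is the usual choice in vector quantization but is not stated in the text. The function names `centroids` and `distortion` are hypothetical, and cluster indices run from 0 to M-1 rather than 1 to M.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two K-dimensional vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def centroids(X: Sequence[Sequence[float]], P: Sequence[int], M: int) -> List[List[float]]:
    """Codebook C: the mean vector of each cluster s_a = {x_i | p_i = a}."""
    K = len(X[0])
    sums = [[0.0] * K for _ in range(M)]
    counts = [0] * M
    for x, p in zip(X, P):
        counts[p] += 1
        for k in range(K):
            sums[p][k] += x[k]
    # An empty cluster keeps a zero vector (assumption made for simplicity).
    return [[s / counts[a] for s in sums[a]] if counts[a] else sums[a]
            for a in range(M)]


def distortion(X, P, C) -> float:
    """Assumed cost function: mean squared error of partition P and codebook C."""
    return sum(squared_distance(x, C[p]) for x, p in zip(X, P)) / len(X)


X = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
P = [0, 0, 1, 1]                 # partition: cluster index of each vector
C = centroids(X, P, M=2)         # codebook derived from the partition
print(C, distortion(X, P, C))
```

Because the codebook is derived from the partition here, the search space is the set of partitions, which is what makes the problem combinatorial.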
Clustering algorithms • Heuristic methods • Optimization methods • K-means • Genetic algorithms • Graph-theoretical methods • Hierarchical methods • Divisive • Agglomerative (merging)
Agglomerative clustering N = 22 (number of data points) M = 3 (number of final clusters)
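Ward’s method (PNN in VQ) Merge cost: d(a,b) = na·nb/(na+nb) · ||ca − cb||², where na and nb are the cluster sizes and ca and cb the centroids. Local optimization strategy: merge the cluster pair (a,b) with the smallest merge cost. • Nearest neighbor search is needed: • finding the cluster pair to be merged • updating of NN pointers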
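The merge step can be illustrated with a straightforward, unoptimized sketch. It uses the standard Ward merge cost, n_a·n_b/(n_a+n_b)·||c_a − c_b||², and repeatedly merges the cheapest pair until M clusters remain. The function names `merge_cost` and `pnn` are hypothetical, and the full pairwise search in every iteration is the naive variant, not any of the speed-ups discussed later.

```python
def merge_cost(na, ca, nb, cb):
    """Ward merge cost: increase in total squared error when a and b are merged."""
    d2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return na * nb / (na + nb) * d2


def pnn(X, M):
    """Naive PNN: start with one cluster per vector, merge until M clusters remain."""
    sizes = [1] * len(X)
    cents = [[float(v) for v in x] for x in X]
    while len(cents) > M:
        best = None
        # Full search over all cluster pairs for the cheapest merge.
        for a in range(len(cents)):
            for b in range(a + 1, len(cents)):
                cost = merge_cost(sizes[a], cents[a], sizes[b], cents[b])
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        n = sizes[a] + sizes[b]
        # The merged centroid is the size-weighted mean of the two centroids.
        cents[a] = [(sizes[a] * x + sizes[b] * y) / n
                    for x, y in zip(cents[a], cents[b])]
        sizes[a] = n
        del cents[b]
        del sizes[b]
    return cents, sizes


codebook, cluster_sizes = pnn([(0,), (1,), (10,), (11,)], M=2)
print(codebook, cluster_sizes)
```

Each iteration scans all pairs, so this version is only meant to show the merge rule; its cost is what motivates the nearest neighbor pointers.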
The PNN method [Figure: the clustering at successive stages of merging, from M=5000 through M=4999, M=4988, ..., M=50, ..., M=16 to M=15]
Nearest neighbor pointers Fast exact PNN method: Reduces the number of nearest neighbor searches in each iteration: O(N³) → Ω(N²)
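The idea of nearest neighbor pointers can be sketched as follows. Every cluster stores a pointer to its nearest neighbor together with the merge cost; after a merge, only the merged cluster and the clusters whose pointer referred to one of the two merged clusters are searched again. This is a hedged sketch, not the author's implementation: the names `find_nn` and `fast_pnn` are hypothetical, M is assumed to be at least 1, and the sketch relies on the known property of the Ward cost that merging two clusters never brings the result closer to a third cluster than the nearer of the two was.

```python
def merge_cost(na, ca, nb, cb):
    """Ward merge cost between two clusters given sizes and centroids."""
    d2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return na * nb / (na + nb) * d2


def find_nn(a, sizes, cents):
    """Full search for the nearest neighbor of cluster a: returns (index, cost)."""
    best_b, best_cost = None, float("inf")
    for b in cents:
        if b != a:
            cost = merge_cost(sizes[a], cents[a], sizes[b], cents[b])
            if cost < best_cost:
                best_b, best_cost = b, cost
    return best_b, best_cost


def fast_pnn(X, M):
    """PNN with NN pointers; dictionaries keep cluster ids stable across merges."""
    sizes = {i: 1 for i in range(len(X))}
    cents = {i: [float(v) for v in x] for i, x in enumerate(X)}
    nn = {a: find_nn(a, sizes, cents) for a in cents}
    while len(cents) > M:
        a = min(nn, key=lambda i: nn[i][1])   # cluster with the cheapest pointer
        b = nn[a][0]
        n = sizes[a] + sizes[b]
        cents[a] = [(sizes[a] * x + sizes[b] * y) / n
                    for x, y in zip(cents[a], cents[b])]
        sizes[a] = n
        del cents[b], sizes[b], nn[b]
        # Only pointers that involved a or b can be out of date.
        for c in cents:
            if c == a or nn[c][0] in (a, b):
                nn[c] = find_nn(c, sizes, cents)
    return list(cents.values()), list(sizes.values())


print(fast_pnn([(0,), (1,), (10,), (11,)], M=2))
```

In this sketch the number of full searches per iteration depends on how many pointers were invalidated by the merge, rather than on the total number of clusters.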
PNN as a crossover method in the genetic algorithm [Figure: two random initial codebooks (M=15 each) are combined by union into a codebook of M=30, which PNN reduces to the final codebook of M=15]
Publication 1: Speed-up methods • Partial distortion search (PDS) • Mean-distance-ordered partial search (MPS) • Uses the component means of the vectors • Derives a precondition for the distance calculations • Reduces the run time to 2 to 15% of the original
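Partial distortion search can be sketched inside a nearest neighbor search: the squared distance is accumulated one component at a time, and the candidate is abandoned as soon as its partial merge cost can no longer beat the best cost found so far. This is a minimal sketch under assumptions; the function name `find_nn_pds` and the list-based cluster representation are hypothetical, and the Ward merge cost is assumed as the distance measure.

```python
def find_nn_pds(a, sizes, cents):
    """Nearest neighbor of cluster a under the Ward merge cost, with PDS.

    sizes[i] is the size and cents[i] the centroid of cluster i.
    Returns (index of nearest cluster, merge cost).
    """
    best_b, best_cost = None, float("inf")
    for b in range(len(cents)):
        if b == a:
            continue
        weight = sizes[a] * sizes[b] / (sizes[a] + sizes[b])
        partial = 0.0
        for xa, xb in zip(cents[a], cents[b]):
            partial += (xa - xb) ** 2
            if weight * partial >= best_cost:
                break          # partial cost already too large: skip candidate
        else:
            # Loop finished without a break, so b is the new best candidate.
            best_b, best_cost = b, weight * partial
    return best_b, best_cost


sizes = [1, 1, 1, 4]
cents = [[0.0, 0.0], [3.0, 4.0], [1.0, 1.0], [0.5, 0.5]]
print(find_nn_pds(0, sizes, cents))
```

The early exit never changes the result, because the partial sum can only grow as more components are added; it only saves arithmetic.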
Example of the MPS method [Figure labels: input vector, best candidate]
Publication 2: Graph-based PNN • Based on the exact PNN method • NN search is limited to the k clusters that are connected by the graph structure • Reduces the time complexity of every search from O(N) to O(k) • Reduces the run time to 1 to 4% of the original
Why graph structure? Only O(k) searches with the graph structure (k=3), compared with O(N) searches with the full search (N=4096)
Publication 3: Multilevel thresholding • Can be considered a special case of vector quantization (VQ), where the vectors are 1-dimensional • Existing method: O(N²) • PNN thresholding can be implemented in O(N·logN) • The proposed method works in real time for any number of thresholds
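In one dimension only neighboring clusters on the gray-level axis need to be considered for merging, so the candidate merges can be kept in a heap. The sketch below illustrates this idea under assumptions: it uses the Ward merge cost on a histogram, a doubly linked list of clusters, and lazy deletion of outdated heap entries via version counters. The function name `pnn_thresholds` and all data-structure choices are hypothetical and are not taken from the publication.

```python
import heapq


def pnn_thresholds(histogram, num_classes):
    """Merge neighboring gray-level clusters until num_classes remain.

    Returns the thresholds: the highest gray level of every class but the last.
    """
    levels = [(g, h) for g, h in enumerate(histogram) if h > 0]
    n = len(levels)
    if n == 0:
        return []
    count = [h for _, h in levels]
    mean = [float(g) for g, _ in levels]
    last = [g for g, _ in levels]            # highest gray level in each cluster
    prev = list(range(-1, n - 1))            # linked list over living clusters
    nxt = list(range(1, n + 1))
    nxt[-1] = -1
    version = [0] * n                        # bumped whenever a cluster changes
    alive = [True] * n

    def cost(a, b):
        return count[a] * count[b] / (count[a] + count[b]) * (mean[a] - mean[b]) ** 2

    heap = [(cost(a, a + 1), a, a + 1, 0, 0) for a in range(n - 1)]
    heapq.heapify(heap)
    clusters = n
    while clusters > num_classes and heap:
        _, a, b, va, vb = heapq.heappop(heap)
        if not (alive[a] and alive[b]) or version[a] != va or version[b] != vb:
            continue                         # outdated entry, skip it
        total = count[a] + count[b]
        mean[a] = (count[a] * mean[a] + count[b] * mean[b]) / total
        count[a] = total
        last[a] = last[b]
        alive[b] = False
        version[a] += 1
        nxt[a] = nxt[b]
        if nxt[a] != -1:
            prev[nxt[a]] = a
            heapq.heappush(heap, (cost(a, nxt[a]), a, nxt[a],
                                  version[a], version[nxt[a]]))
        if prev[a] != -1:
            heapq.heappush(heap, (cost(prev[a], a), prev[a], a,
                                  version[prev[a]], version[a]))
        clusters -= 1
    return [last[a] for a in range(n) if alive[a] and nxt[a] != -1]


print(pnn_thresholds([5, 5, 0, 0, 0, 0, 5, 5, 0, 0, 0, 0, 5, 5], num_classes=3))
```

Each merge pushes at most two new heap entries, so the heap never holds more than a constant multiple of N entries and every heap operation costs O(log N), which is consistent with the O(N·logN) bound stated above.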
Distances in heap structure O(log N) O(1)
Publication 4: Iterative shrinking (IS) • Generates the clustering by a sequence of cluster removal operations • In the IS method the vectors can be reassigned more freely than in the PNN method • Can be applied as a crossover method in the genetic algorithm (GAIS) • GAIS outperforms all other clustering algorithms compared
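The difference from PNN can be illustrated with a simplified sketch of cluster removal: instead of merging two clusters as a whole, one cluster is removed and each of its vectors is reassigned individually to the nearest remaining cluster. The removal cost used here (the summed increase in squared distance, measured against the current centroids) is a simplifying assumption and not necessarily the cost function of the publication; the function name `iterative_shrinking` is hypothetical, and M is assumed to be at least 1.

```python
def iterative_shrinking(X, M):
    """Simplified IS: remove the cheapest cluster until M clusters remain."""

    def sqdist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    def centroid(s):
        return [sum(col) / len(s) for col in zip(*s)]

    # Start with one cluster per vector, as in PNN.
    clusters = [[[float(v) for v in x]] for x in X]
    while len(clusters) > M:
        cents = [centroid(s) for s in clusters]
        best = None
        for a in range(len(clusters)):
            # Assumed removal cost: total increase in squared distance when
            # every vector of cluster a moves to its nearest other centroid.
            cost = 0.0
            for x in clusters[a]:
                nearest = min(sqdist(x, cents[b])
                              for b in range(len(cents)) if b != a)
                cost += nearest - sqdist(x, cents[a])
            if best is None or cost < best[0]:
                best = (cost, a)
        _, a = best
        removed = clusters.pop(a)
        cents.pop(a)
        # Each vector is reassigned on its own, not the cluster as a whole.
        for x in removed:
            j = min(range(len(cents)), key=lambda b: sqdist(x, cents[b]))
            clusters[j].append(x)
    return clusters


print(iterative_shrinking([(0,), (1,), (10,), (11,)], M=2))
```

Because the vectors of the removed cluster may end up in different clusters, a removal can produce partitions that no single pairwise merge could reach.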
Publication 5: Optimal clustering • Can be found by considering all possible merge sequences and finding the one that minimizes the optimization function • Can be implemented as a branch-and-bound (BB) technique • Two suboptimal but polynomial-time variants: • Piecewise optimization • Look-ahead optimization
Example of a non-redundant search tree: branches that do not lead to any valid clustering have been cut out
Example of clustering [Figure panels: k-means, agglomerative clustering]
Conclusions • Several speed-up methods • Projection-based search • Partial distortion search • k nearest neighbor graph • Efficient O(N·logN) time implementation for the 1-dimensional case • Generalization of the merge phase by the cluster removal philosophy (IS) for better quality • Optimal clustering based on the PNN method