Fully Automatic Clustering System



  1. Fully Automatic Clustering System Advisor: Dr. Hsu Graduate: Sheng-Hsuan Wang Authors: Giuseppe Patanè, Marco Russo Department of Information Management IEEE Transactions on Neural Networks, vol. 13, no. 6, November 2002

  2. Outline • Motivation • Objective • Introduction • VQ • Previous Works: ELBG • FACS • Results • Conclusion • Personal Opinion • Review

  3. Motivation • Can clustering be made fully automatic, so that the right number of codewords is found without the user fixing it in advance? • Can the number of computations per iteration be reduced?

  4. Objective • In this paper, the fully automatic clustering system (FACS) is presented. • The objective is the automatic calculation of the codebook of the right dimension once the desired error is fixed. • In order to save on the number of computations per iteration, greedy techniques are adopted.

  5. Introduction • Cluster analysis (CA, or clustering). • Vector quantization (VQ). • Data are partitioned into groups (or cells). • Each cell is represented by a vector called a codeword. • The set of the codewords is called the codebook. • The difference between CA and VQ. • Both group data into a certain number of groups so that a loss (or error) function is minimized.

  6. Clustering and VQ

  7. VQ-Definition • The objective of VQ is the representation of a set $X = \{x_1, \ldots, x_{N_P}\}$ of feature vectors in $\mathbb{R}^K$ by a set $Y = \{y_1, \ldots, y_{N_C}\}$ of reference vectors (codewords) in $\mathbb{R}^K$, with $N_C \ll N_P$.

  8. VQ-Quantization Error (QE) • Square error (SE): $d(x, y) = \sum_{k=1}^{K} (x_k - y_k)^2$ • Weighted square error (WSE): $d_w(x, y) = \sum_{k=1}^{K} w_k (x_k - y_k)^2$, with one weight $w_k$ per component.
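
The two distortion measures can be written down directly; a minimal sketch in Python, assuming plain NumPy arrays (the function names are ours, not the paper's):

```python
import numpy as np

def square_error(x, y):
    """Square error (SE) between an input pattern x and a codeword y."""
    return float(np.sum((x - y) ** 2))

def weighted_square_error(x, y, w):
    """Weighted square error (WSE); w holds one weight per component."""
    return float(np.sum(w * (x - y) ** 2))
```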

  9. VQ-Nearest neighbor condition (NNC) • Nearest neighbor condition (NNC): given a fixed codebook Y, the NNC consists of assigning each input vector to its nearest codeword.
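
Under the SE distortion, the NNC can be sketched as follows (a minimal illustration with assumed names: X is an (N_P, K) array of patterns, Y an (N_C, K) codebook):

```python
import numpy as np

def nearest_neighbor_partition(X, Y):
    """NNC: assign each input vector to the cell of its nearest codeword.

    Returns, for every pattern in X, the index of the winning codeword.
    """
    # Pairwise squared distances between all patterns and all codewords.
    d = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=2)
    return np.argmin(d, axis=1)
```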

  10. VQ-Centroid condition (CC) • Centroid condition (CC): given a fixed partition S, the CC gives the optimal codebook; with the SE distortion, the optimal codeword of each cell is its centroid.
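
Since the optimal codeword of a cell is its centroid under SE, the CC reduces to a per-cell mean (again a sketch with assumed names, reusing the labels produced by the NNC helper above):

```python
import numpy as np

def centroid_codebook(X, labels, n_codewords):
    """CC: replace each codeword by the centroid (mean) of its cell."""
    Y = np.empty((n_codewords, X.shape[1]))
    for i in range(n_codewords):
        Y[i] = X[labels == i].mean(axis=0)  # assumes no cell is empty
    return Y
```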

  11. Previous Works: ELBG • The starting point of the research reported in this paper was the authors' previous work, the ELBG [39]. • 1) Initialization. • 2) Partition calculation, according to the NNC (6). • 3) Termination condition check. • 4) ELBG-block execution. • 5) New codebook calculation, according to the CC (9). • 6) Return to Step 2.
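
The loop above can be sketched by reusing the NNC and CC helpers from the previous slides; elbg_block is a placeholder with an assumed (X, Y, labels) -> (Y, labels) signature for the ELBG-block of the next slides, and the tolerance-based termination test is our simplification of the paper's condition:

```python
import numpy as np

def mqe(X, Y, labels):
    """Mean quantization error of the current partition (SE distortion)."""
    return float(np.mean(np.sum((X - Y[labels]) ** 2, axis=1)))

def elbg(X, Y, elbg_block, tol=1e-4, max_iter=100):
    prev = np.inf
    for _ in range(max_iter):
        labels = nearest_neighbor_partition(X, Y)   # 2) NNC
        err = mqe(X, Y, labels)
        if prev - err < tol * prev:                 # 3) termination check
            break
        prev = err
        Y, labels = elbg_block(X, Y, labels)        # 4) ELBG block
        Y = centroid_codebook(X, labels, len(Y))    # 5) CC
    return Y
```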

  12. A. ELBG-Block • The basic idea of the ELBG-block: • join a low-distortion cell with a cell adjacent to it; • split a high-distortion cell into two smaller ones. • If we define the mean distortion per cell as $D_{\text{mean}} = \frac{1}{N_C} \sum_{i=1}^{N_C} D_i$, where $D_i$ is the distortion of cell $S_i$, then cells with distortion below $D_{\text{mean}}$ are candidates for union and cells above it are candidates for splitting.

  13. A. ELBG-Block

  14. A. ELBG-Block

  15. A. ELBG-Block • 1) SoCA (shift of codeword attempt): the codeword of a low-distortion cell is tentatively shifted into a high-distortion cell. • The destination cell is looked for in a stochastic way, with a probability that grows with its distortion (see the sketch below).
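
One natural realization of this stochastic search is roulette-wheel selection with probabilities proportional to the cells' distortions; the exact probability law is the paper's, so treat this as our sketch of the idea:

```python
import numpy as np

def pick_high_distortion_cell(cell_distortions, rng=None):
    """Pick a cell index with probability proportional to its distortion."""
    if rng is None:
        rng = np.random.default_rng()
    d = np.asarray(cell_distortions, dtype=float)
    return int(rng.choice(len(d), p=d / d.sum()))
```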

  16. A. ELBG-Block • Splitting: we place both codewords on the principal diagonal of the hyperbox enclosing the destination cell; in this sense, we can say that the two codewords are near each other. • Some local rearrangements are then executed. • Union: the cell left without a codeword is joined with an adjacent one.
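
The diagonal placement can be illustrated on the hyperbox spanned by the cell's patterns (a sketch; the symmetric offset frac is our choice, not a value from the paper):

```python
import numpy as np

def split_on_principal_diagonal(cell_patterns, frac=0.25):
    """Place two codewords on the principal diagonal of the hyperbox
    that encloses a cell, symmetrically around its midpoint."""
    lo = cell_patterns.min(axis=0)   # one corner of the hyperbox
    hi = cell_patterns.max(axis=0)   # the opposite corner
    return lo + frac * (hi - lo), hi - frac * (hi - lo)
```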

  17. A. ELBG-Block • 2) Mean quantization error estimation and eventual SoC: • After the shift we have a new codebook (Y') and a new partition (S'), so we can calculate the new MQE. • If it is lower than the value we had before the SoCA, the shift is confirmed (SoC); otherwise, it is rejected.
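
The eventual SoC is therefore a purely greedy accept/reject test on the MQE (a sketch reusing the mqe helper defined earlier):

```python
def maybe_confirm_shift(X, Y_old, labels_old, Y_new, labels_new):
    """Confirm the shift of codeword (SoC) only if it lowers the MQE."""
    if mqe(X, Y_new, labels_new) < mqe(X, Y_old, labels_old):
        return Y_new, labels_new   # shift confirmed
    return Y_old, labels_old       # shift rejected
```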

  18. B. Considerations Regarding the ELBG • Insertions are effected in the regions where the error is higher; deletions where it is lower. • Operations are executed locally. • Several insertions or deletions can be effected during the same iteration, always working locally.

  19. FACS • Introduction. • A CA/VQ technique whose objective is to automatically find the codebook of the right dimension. • In FACS, the increase or decrease of the codebook size happens smartly: • new codewords are inserted where the QE is higher; • codewords are eliminated where the error is lower (see the loop sketched below).
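
Under the stated objective, the FACS iteration can be sketched as a loop that grows the codebook while the error is above the target and shrinks it otherwise; grow_step and shrink_step are placeholders (with an assumed (X, Y, labels) -> Y signature) for the smart growing and reduction phases of the next slides, and the real termination condition (slide 25) is more refined than this skeleton:

```python
def facs(X, Y0, desired_error, grow_step, shrink_step, max_iter=200):
    """Adapt the codebook size until the MQE matches the desired error."""
    Y = Y0
    for _ in range(max_iter):
        labels = nearest_neighbor_partition(X, Y)
        if mqe(X, Y, labels) > desired_error:
            Y = grow_step(X, Y, labels)     # insert codewords where QE is high
        elif len(Y) > 1:
            Y = shrink_step(X, Y, labels)   # remove codewords where QE is low
        labels = nearest_neighbor_partition(X, Y)
        Y = centroid_codebook(X, labels, len(Y))
    return Y
```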

  20. FACS iteration

  21. Smart growing phase.

  22. p versus the number of iterations

  23. Smart reduction phase.

  24. FACS • The cell to eliminate is chosen with a probability that is a decreasing function of its distortion.
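
A probability that decreases with the distortion can be obtained, for instance, from inverse-distortion weights (our choice of decreasing function; the paper only requires that it be decreasing):

```python
import numpy as np

def pick_cell_to_eliminate(cell_distortions, eps=1e-12, rng=None):
    """Pick a cell index with probability decreasing in its distortion."""
    if rng is None:
        rng = np.random.default_rng()
    w = 1.0 / (np.asarray(cell_distortions, dtype=float) + eps)
    return int(rng.choice(len(w), p=w / w.sum()))
```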

  25. Behavior of FACS Versus the Number of Iterations and Termination Condition

  26. Discussion about outliers

  27. Results • Introduction. • Comparison With ELBG. • Comparison With GNG and GNG-U. • Comparison With FOSART. • Comparison With the Competitive Agglomeration Algorithm. • Classification.

  28. B. Comparison with ELBG

  29. C. Comparison With GNG and GNG-U • GNG and GNG-U insert codewords until: • the prefixed number of codewords is reached, or • a given "performance measure" is fulfilled. • In our case, the termination is driven by the desired quantization error.

  30. D. Comparison With FOSART • FOSART belongs to the family of ART algorithms. • It is also used for VQ tasks.

  31. E. Comparison With the Competitive Agglomeration Algorithm.

  32. F. Classification • Comparison between FACS and the GCS algorithm on a supervised classification problem, the two spirals. • Mode 1: • The input consists of 194 two-dimensional vectors representing the two spirals. • The output is the related membership class (0 or 1). • We employed the WSE. • Mode 2: • The clustering phase uses only the input part of the patterns, with the SE.
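
The 194 points match the classic Lang–Witbrock two-spirals benchmark (97 points per spiral); a common construction, which we assume is the one used here, is:

```python
import numpy as np

def two_spirals(n_per_spiral=97):
    """Generate the classic two-spirals benchmark: 194 labeled 2-D points."""
    i = np.arange(n_per_spiral)
    phi = i * np.pi / 16.0
    r = 6.5 * (104 - i) / 104.0
    spiral = np.column_stack((r * np.sin(phi), r * np.cos(phi)))
    X = np.vstack((spiral, -spiral))  # the second spiral mirrors the first
    y = np.hstack((np.zeros(n_per_spiral), np.ones(n_per_spiral)))
    return X, y
```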

  33. F. Classification (cont.)

  34. Conclusion • FACS is a new algorithm for CA/VQ that is able to autonomously find the number of codewords once the desired quantization error is specified. • In comparison to previous similar works, a significant improvement in running time has been obtained. • Further studies will be made regarding the use of different distortion measures.

  35. Personal Opinion • The starting point of the research reported in this paper was the authors' previous work: the ELBG. • The QE is a key index.

  36. Review • Clustering vs. VQ. • Previous works: ELBG. • FACS • Smart Growing • Smart Reduction
