
Philippe Biela – Journée ClasSpec - EMD – 6/07/2007


Presentation Transcript


  1. Philippe Biela – Journée ClasSpec - EMD – 6/07/2007

  2. ABSTRACT. The paper presents a general framework for time series clustering based on spectral decomposition of the affinity matrix. A Gaussian function is used to construct the affinity matrix, and a gradient-based method is developed for self-tuning the variance of the Gaussian function. The approach can be used to cluster both constant- and variable-length time series. The algorithm is able to discover the optimal number of clusters automatically. Experimental results are presented to show the effectiveness of the method.

  3. Theoretical Background. We consider a set of M time series, all of the same length d. The data matrix collecting these time series is X. Assuming there are K clusters, we can suppose that a permutation matrix E exists that reorders the columns of X by cluster, where A_i represents the i-th cluster and s_i is the number of time series in the i-th cluster. (See the notation sketch below.)
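A minimal notation sketch, in LaTeX, of what the image-only formulas on this slide likely express; the column-wise stacking convention is an assumption:

```latex
% Data matrix: M time series of length d, stacked as columns (assumed convention)
\[ X = [\,x_1, x_2, \dots, x_M\,] \in \mathbb{R}^{d \times M} \]

% Assumed cluster ordering: a permutation matrix E groups the columns by cluster
\[ X E = [\,A_1, A_2, \dots, A_K\,], \qquad A_k \in \mathbb{R}^{d \times s_k}, \qquad \sum_{k=1}^{K} s_k = M \]
```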

  4. We consider the within-cluster scatter (dispersion) matrix of cluster k, where m_k is the mean vector of the k-th cluster. The total within-cluster scatter matrix is S_w. The goal of clustering is to achieve high within-cluster similarity and low between-cluster similarity; that is, we should minimize trace(S_w) and maximize trace(S_b). (The scatter matrices are written out below.)
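The scatter matrices referenced above can be written in their standard form; this is a sketch, since the slide's own formulas were images and are not reproduced verbatim:

```latex
% Within-cluster scatter of cluster k (m_k is the mean vector of cluster k)
\[ S_k = \sum_{x \in A_k} (x - m_k)(x - m_k)^{\top} \]

% Total within-cluster scatter
\[ S_w = \sum_{k=1}^{K} S_k \]

% Between-cluster scatter (m is the global mean); minimize trace(S_w), maximize trace(S_b)
\[ S_b = \sum_{k=1}^{K} s_k\, (m_k - m)(m_k - m)^{\top} \]
```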

  5. Since the total scatter is fixed, maximisation of trace(S_b) is equivalent to minimization of trace(S_w), and the optimization criterion can be rewritten as a trace maximization. If we define the block-diagonal matrix Q, where e_k is a column vector containing s_k ones, we can show that the trace criterion over Q is equivalent to the preceding criterion, provided we relax the constraint that Q has this exact block-indicator form and keep only its orthonormality. (A sketch of Q and of the relaxed criterion follows.)
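A hedged reconstruction of Q and of the relaxed criterion, following the usual spectral-relaxation argument; the exact normalization used on the slide may differ:

```latex
% Block-diagonal indicator matrix: e_k is the column vector of s_k ones
\[ Q = \operatorname{diag}\!\left(\tfrac{e_1}{\sqrt{s_1}},\, \tfrac{e_2}{\sqrt{s_2}},\, \dots,\, \tfrac{e_K}{\sqrt{s_K}}\right) \in \mathbb{R}^{M \times K} \]

% Relaxed trace form of the criterion (S is the similarity matrix),
% keeping only the orthonormality constraint on Q
\[ \max_{Q}\ \operatorname{trace}\!\left(Q^{\top} S\, Q\right) \quad \text{s.t.} \quad Q^{\top} Q = I_K \]
```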

  6. The optimization problem then becomes a trace maximization over orthonormal matrices Q. The optimal Q can be obtained by taking the top K eigenvectors of the similarity matrix appearing in the criterion. (A minimal computation is sketched below.)
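A minimal Python sketch of this step, assuming the relaxed solution is read off the top K eigenvectors of a symmetric similarity matrix S; the helper name is illustrative, not from the paper:

```python
import numpy as np

def top_k_eigenvectors(S: np.ndarray, k: int) -> np.ndarray:
    """Return the k eigenvectors of the symmetric matrix S with largest eigenvalues."""
    eigvals, eigvecs = np.linalg.eigh(S)       # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]      # indices of the top-k eigenvalues
    return eigvecs[:, order]                   # M x k matrix: the relaxed Q
```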

  7. After normalization, if we assume that the data objects are ordered by cluster, where A_i represents the data in cluster i, the similarity matrix S and the normalized similarity matrix S' become block-diagonal.
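The slide does not show which normalization is meant; a common choice, assumed here for illustration, is the symmetric normalization used in spectral clustering:

```latex
% Assumed symmetric normalization; D is the diagonal degree matrix, D_ii = sum_j S_ij
\[ S' = D^{-1/2}\, S\, D^{-1/2} \]
```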

  8. To find a "good" similarity matrix, one which is almost block-diagonal, we use the Gaussian function applied to the pairwise distances. We then consider the resulting affinity matrix (sketched below).
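A short Python sketch of building the Gaussian affinity matrix from pairwise squared Euclidean distances; the variance sigma is the quantity the paper self-tunes with a gradient-based method, so the fixed argument here is only a placeholder:

```python
import numpy as np

def gaussian_affinity(X: np.ndarray, sigma: float) -> np.ndarray:
    """Gaussian (RBF) affinity S_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)).

    X is an (M, d) array of M time series of equal length d.
    """
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))
```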

  9. Algorithm: Clustering Time Series via Spectral Decomposition
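The slide shows the algorithm only as a figure; as a reading aid, here is a hedged end-to-end sketch that chains the steps described on the previous slides (Gaussian affinity, symmetric normalization, top-K eigenvectors, then a final assignment step, for which k-means on the eigenvector rows is assumed). The self-tuning of sigma and the automatic choice of K mentioned in the abstract are omitted:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def spectral_cluster_time_series(X: np.ndarray, k: int, sigma: float) -> np.ndarray:
    """Cluster M equal-length time series (the rows of X) into k groups."""
    # 1. Gaussian affinity from pairwise squared Euclidean distances
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    S = np.exp(-sq_dists / (2.0 * sigma ** 2))

    # 2. Symmetric normalization S' = D^{-1/2} S D^{-1/2} (assumed)
    d_inv_sqrt = 1.0 / np.sqrt(S.sum(axis=1))
    S_norm = S * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

    # 3. Top-k eigenvectors of the normalized similarity matrix (relaxed Q)
    eigvals, eigvecs = np.linalg.eigh(S_norm)
    Q = eigvecs[:, np.argsort(eigvals)[::-1][:k]]

    # 4. Row-normalize and cluster the embedded points (assumed final step)
    Q = Q / (np.linalg.norm(Q, axis=1, keepdims=True) + 1e-12)
    _, labels = kmeans2(Q, k, minit='++', seed=0)
    return labels
```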

  10. In this experiment a real EEG dataset is used, extracted from the 2nd Wadsworth BCI dataset of the BCI2003 competition. The data objects come from 3 classes: the EEG signals evoked by flashes containing targets, the EEG signals evoked by flashes adjacent to targets, and other EEG signals. All data objects have the same length, 144.

  11. 50 EEG signals are randomly chosen from each class. Since all the time series have the same length, the Euclidean distance is used to measure the pairwise distances between them. The results are compared with those of hierarchical agglomerative clustering (HAC). There are 3 kinds of HAC approaches, depending on how the inter-cluster similarity is measured: complete-linkage HAC (CHAC), single-linkage HAC (SHAC), and average-linkage HAC (AHAC). (A sketch of these baselines follows.)
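A small Python sketch of the three HAC baselines using SciPy; `X` stands for the (150, 144) array of the selected EEG series, which is an assumption since the data loading itself is not shown:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

def hac_baselines(X: np.ndarray, k: int = 3) -> dict:
    """Complete-, single- and average-linkage HAC on equal-length time series."""
    distances = pdist(X, metric='euclidean')   # condensed pairwise Euclidean distances
    labels = {}
    for name, method in [('CHAC', 'complete'), ('SHAC', 'single'), ('AHAC', 'average')]:
        Z = linkage(distances, method=method)                  # build the dendrogram
        labels[name] = fcluster(Z, t=k, criterion='maxclust')  # cut it into k clusters
    return labels
```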
