Presentation Transcript


  1. A Comparative Analysis of Clustering Algorithms & Self Organizing Maps as Data Mining & Pattern Recognition Tools Using Real World Data Sets Lavneet Singh, A.M.I.E.T.E (I.E.T.E), M.C.A., M.B.A (IT), PGDPM, MIEEE Faculty of Management & Computer Application, R.B.S. College, Agra Lavneet_agra@yahoo.co.in

  2. Introduction • Partitioning a large set of objects into homogeneous clusters is a fundamental operation in data mining. • In today's scenario, artificial intelligence, expert systems, and decision support systems play a crucial role in assessing performance and increasing efficiency, even during a financial crisis. • Data mining and data warehousing are emerging trends and tools across industry domains, underpinning business intelligence applications such as ERP, CRM, and SCM. • The partitioning operation is needed in a number of data mining tasks, such as unsupervised classification and data summarization, as well as segmentation of large heterogeneous data sets into smaller homogeneous subsets that can be easily managed, separately modeled, and analyzed.

  3. In this study we apply the well-known k-means algorithm (MacQueen 1967) to cluster the data. Compared to other clustering methods, the k-means algorithm and its variants (Anderberg 1973) are efficient on large data sets and thus very suitable for data mining. • We also present a comparison among several non-hierarchical and hierarchical clustering algorithms, including SOM (Self-Organizing Map) neural network methods. Tested on a telecommunication-users data set and the Iris flower data set, the compared algorithms demonstrated very good classification performance.

  4. The K-Means Clustering • The k-means algorithm (MacQueen, 1967) is one of a group of algorithms called partitioning methods. It consists of four operations (a minimal sketch follows this list): • Selection of the initial k means for k clusters. • Calculation of the dissimilarity between an object and the mean of a cluster. • Allocation of an object to the cluster whose mean is nearest to the object. • Re-calculation of the mean of a cluster from the objects allocated to it, so that the intra-cluster dissimilarity is minimized. • Except for the first operation, the other three operations are repeated until the algorithm converges.
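
To make the four operations concrete, here is a minimal NumPy sketch of the k-means loop. It is our own illustration under the usual assumptions (Euclidean dissimilarity, random initial means), not the exact procedure used in the study.

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Minimal k-means on an (n, d) array X with k clusters."""
    rng = np.random.default_rng(seed)
    # 1. Select the initial k means (here: k distinct random objects).
    means = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # 2. Dissimilarity between every object and every cluster mean.
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        # 3. Allocate each object to the cluster with the nearest mean.
        labels = dists.argmin(axis=1)
        # 4. Re-calculate each mean from the objects allocated to it
        #    (an emptied cluster keeps its previous mean).
        new_means = np.array([X[labels == j].mean(axis=0)
                              if np.any(labels == j) else means[j]
                              for j in range(k)])
        if np.allclose(new_means, means):  # operations 2-4 repeat until convergence
            break
        means = new_means
    return labels, means
```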

  5. Hierarchical Clustering • In hierarchical clustering the data are not partitioned into particular clusters in a single step. Instead, a series of partitions takes place, which may run from a single cluster containing all objects to n clusters each containing a single object. • Hierarchical clustering is subdivided into agglomerative methods, which proceed by a series of fusions of the n objects into groups, and divisive methods, which separate the n objects successively into finer groupings.
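
As a compact illustration of the agglomerative variant, the sketch below uses SciPy's linkage routines (our choice of library; the study itself worked with statistical packages). Each fusion step merges the two closest groups until a single cluster remains, and the resulting tree can be cut into any number of flat clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))      # 20 toy objects with 3 features

# Agglomerative: start from 20 singletons, repeatedly fuse the closest pair.
Z = linkage(X, method="ward")     # Z records the n-1 fusion steps

# Cut the fusion tree to obtain, e.g., a flat 3-cluster partition.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```

scipy.cluster.hierarchy.dendrogram(Z) would draw the full hierarchy, from the single all-inclusive cluster down to the n singletons.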

  6. Self Organizing Maps (SOM) • The self-organizing map (SOM) is a method of exploratory data analysis used for clustering and for projecting multi-dimensional data into a lower-dimensional space to reveal hidden structure in the data. The algorithm retains local similarity and neighborhood relations between the data items; because of this, the resulting maps can be compared in terms of their corresponding neighborhood relations. • The SOM is thus also a method to visualize multidimensional data.

  7. The SOM is an artificial neural network that uses an unsupervised learning algorithm, requiring no prior knowledge of how the system's inputs and outputs are connected. For visualization of the self-organizing map, a unified distance matrix (U-matrix) is used. The analysis was performed with the SOM & Neural Networks toolbox in Matlab.
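
For concreteness, below is a minimal NumPy sketch of SOM training (our own simplification, not the Matlab toolbox used in the analysis). Each input is matched to its best-matching unit, and that unit and its grid neighbors are pulled toward the input with a decaying learning rate and neighborhood radius, which is what preserves local similarity and neighborhood relations.

```python
import numpy as np

def train_som(X, rows=8, cols=8, iters=2000, lr0=0.5, seed=0):
    """Train a rows x cols SOM on data X of shape (n, d)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(size=(rows, cols, d))                 # codebook vectors
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                indexing="ij"), axis=-1)  # unit coordinates
    sigma0 = max(rows, cols) / 2.0                        # initial radius
    for t in range(iters):
        x = X[rng.integers(n)]
        # Best-matching unit: the codebook vector closest to x.
        bmu = np.unravel_index(
            np.argmin(((W - x) ** 2).sum(axis=2)), (rows, cols))
        # Decaying learning rate and neighborhood radius.
        frac = t / iters
        lr = lr0 * (1 - frac)
        sigma = sigma0 * (1 - frac) + 1e-3
        # Gaussian neighborhood on the map grid (preserves local topology).
        dist2 = ((grid - np.array(bmu)) ** 2).sum(axis=-1)
        h = np.exp(-dist2 / (2 * sigma ** 2))[..., None]
        W += lr * h * (x - W)                             # pull units toward x
    return W
```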

  8. Experimental Results • We used the mobile-usage data to test the classification performance of the algorithms, and another large data set selected from a health insurance database to test computational efficiency. The first data set consists of 1000 records, each described by 32 categorical attributes. • This is a hypothetical data file concerning a telecommunications company's efforts to reduce churn in its customer base. Each case corresponds to a separate customer and records various demographic and service-usage information. The provider wants to segment its customer base by service-usage patterns: if customers can be classified by usage, the company can offer more attractive packages. Using the K-Means Cluster Analysis procedure, we find subsets of "similar" customers.
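
Since the statistical-package procedure itself is not reproduced here, the sketch below shows a typical Python equivalent of the preparation and clustering steps; the usage matrix is synthetic, and its shape (1000 customers, 8 usage variables) is our assumption for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical usage matrix: rows = customers, columns = usage variables.
rng = np.random.default_rng(0)
usage = rng.gamma(shape=2.0, scale=10.0, size=(1000, 8))

X = StandardScaler().fit_transform(usage)   # put all variables on one scale

# Compare within-cluster dispersion for several k (a simple "elbow" check).
for k in (2, 3, 4, 5):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))
```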

  9. Results & Discussions • The F tests should be used only for descriptive purposes, because the clusters have been chosen to maximize the differences among cases in different clusters. • Variables with large F values provide the greatest separation between clusters. The final cluster centers are computed as the mean of each variable within each final cluster, and they reflect the characteristics of the typical case for each cluster.
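
The per-variable F value referred to above is a one-way ANOVA of each variable across cluster membership; a small sketch (the function name is ours):

```python
import numpy as np
from scipy.stats import f_oneway

def separation_f_values(X, labels):
    """One-way ANOVA F per variable, grouping cases by cluster label.
    Descriptive only: the clusters were chosen to maximize these differences."""
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    return [f_oneway(*[X[labels == c, j] for c in clusters]).statistic
            for j in range(X.shape[1])]
```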

  10. • Customers in cluster 1 tend to be big spenders who purchase a lot of services. • Customers in cluster 2 tend to be moderate spenders who purchase the "calling" services. • Customers in cluster 3 tend to spend very little and do not purchase many services. • Clusters 1 and 2 are the most similar, which makes sense because they were combined into one cluster in the three-cluster solution. • The table of Euclidean distances between the final cluster centers shows the dissimilarities: greater distances between clusters correspond to greater dissimilarities. • Clusters 1 and 3 are the most different. • Cluster 2 is approximately equally similar to clusters 1 and 3.
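
The distance table mentioned above is just the pairwise Euclidean distances between the k final cluster centers; a short sketch (the center values shown are illustrative, not the study's):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# centers: (k, d) array of final cluster centers, e.g. KMeans.cluster_centers_.
centers = np.array([[4.2, 3.9, 5.1],
                    [2.0, 3.5, 1.2],
                    [0.4, 0.3, 0.2]])   # illustrative values only

D = squareform(pdist(centers))   # k x k matrix; larger entry = more dissimilar
print(np.round(D, 2))
```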

  11. This is a diagnostic plot that helps to find outliers within clusters. There is a lot of variability in cluster 2, but all the distances are within reason. The table shows that an important grouping is missed in the three-cluster solution: members of clusters 1 and 2 are largely drawn from cluster 3 of the three-cluster solution, and they are unlikely to be big spenders. • However, members of cluster 1 are highly likely to purchase Internet-related services, which establishes them as a distinct and possibly profitable group. Clusters 3 and 4 seem to correspond to clusters 1 and 2 from the three-cluster solution, and the distances between the clusters have not changed greatly. • Nearly 25% of cases belong to the newly created group of "E-service" customers, which can be very significant to profits.
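
The quantity behind such a diagnostic plot is the distance of every case from its own cluster center; cases with unusually large values are the outlier candidates. A sketch (function name ours):

```python
import numpy as np

def case_distances(X, labels, centers):
    """Distance of each case from the center of its assigned cluster;
    unusually large values within a cluster flag potential outliers."""
    return np.linalg.norm(X - centers[labels], axis=1)
```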

  12. Hierarchical Clustering Analysis • A telecommunications provider wants to better understand service-usage patterns in its customer base. If services can be clustered by usage, the company can offer more attractive packages to its customers. We use the Hierarchical Cluster Analysis procedure to study the relationships between the various services.

  13. The clustering falls into the following categories: • One cluster includes WIRELESS, PAGER, and VOICE. • Another includes EQUIP, INTERNET, and EBILL. • The last contains TOLLFREE, CALLWAIT, CALLID, FORWARD, and CONFER. • The WIRELESS cluster is closer to the INTERNET cluster than to the CALLWAIT cluster.

  14. Self Organizing Map (SOM) Analysis • In pattern recognition problems, we want a neural network to classify inputs into a set of target categories. • The Neural Network Pattern Recognition Tool helps us select data, create and train a network, and evaluate its performance using mean squared error and confusion matrices. A two-layer feed-forward network with sigmoid hidden and output neurons (newpr) can classify vectors arbitrarily well, given enough neurons in its hidden layer. The network is trained with scaled conjugate gradient backpropagation (trainscg). In this analysis, we use the Iris flower dataset, consisting of 150 samples of flower data.
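
As a rough open-source analogue of the newpr/trainscg workflow (not the Matlab toolbox itself; scikit-learn has no scaled-conjugate-gradient trainer, so its default solver stands in), a two-layer feed-forward classifier on Iris might look like this:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import confusion_matrix

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# One hidden layer of sigmoid ("logistic") units, mirroring the two-layer
# feed-forward network with sigmoid hidden and output neurons described above.
net = MLPClassifier(hidden_layer_sizes=(10,), activation="logistic",
                    max_iter=2000, random_state=0).fit(X_tr, y_tr)

print(confusion_matrix(y_te, net.predict(X_te)))   # rows = true class
```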

  15. Self Organizing Map (SOM) Analysis • The data set consists of 150 samples. "irisInputs" is a 4x150 matrix, whose rows are: • 1. Sepal length in cm • 2. Sepal width in cm • 3. Petal length in cm • 4. Petal width in cm • "irisTargets" is a 3x150 matrix, whose ith column indicates which category the ith iris belongs to, with a 1 in one element (and zeros in the other two): • 1. Iris Setosa • 2. Iris Versicolour • 3. Iris Virginica • One category of iris is linearly separable from the other two; the other two are not linearly separable from each other. The following diagrams and results show the clustering of the Iris dataset using SOM analysis.
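
Tying this back to the SOM sketch after slide 7, the Iris inputs can be mapped onto a small grid as follows (illustrative only; the grid size and iteration count are our choices):

```python
import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
X = (X - X.mean(axis=0)) / X.std(axis=0)       # standardize the 4 features

W = train_som(X, rows=6, cols=6, iters=3000)   # train_som: the earlier sketch

# Map each flower to its best-matching unit on the 6x6 grid.
flat = W.reshape(-1, X.shape[1])
bmu = np.argmin(((X[:, None, :] - flat[None, :, :]) ** 2).sum(-1), axis=1)
print(bmu[:10], y[:10])   # same-species flowers should land on nearby units
```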

  16. Results • Table 2 shows the model summary of the regression for the sample banks. The R-square of the model equals 67.1% and the adjusted R-square equals 58.7%, which are consistent with each other. • This means that 58.7% of the variation in the dependent variable (ROA) is explained by the independent variables used in this model, supporting the appropriate selection of proxies. • Table 3 shows the result of the ANOVA. The F test of the model equals 7.965, considerably higher than the critical value at the 1% level of significance for 10 degrees of freedom, which is 2.79. • Thus, it can be concluded that at least one independent variable has a significant effect on the dependent variable.
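
The critical-value comparison can be checked with SciPy's F distribution. The slide gives only the numerator degrees of freedom (10), so the denominator value below is our assumption:

```python
from scipy.stats import f

# 1% significance; dfn = 10 from the slide, dfd assumed for illustration.
critical = f.ppf(0.99, dfn=10, dfd=40)
print(round(critical, 2))   # compare with the model F of 7.965
```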

  17. Scope of Future Research • Our future work is to develop and implement a parallel k-modes algorithm on distributed databases, using parallel computing techniques to cluster data sets with millions of objects. Such an algorithm is required in a number of data mining applications, such as partitioning very large heterogeneous sets of objects into a number of smaller, more manageable homogeneous subsets that can be more easily modeled and analyzed, and detecting under-represented concepts, e.g., fraud in a very large number of insurance claims.

  18. Conclusions & Future Work • The biggest advantage of the k-means algorithm in data mining applications is its efficiency in clustering large data sets. However, its use is limited to numeric values. Further work can be done to improve the k-means algorithm by adding extensions that allow the k-means paradigm to be applied directly to categorical data, without any need for data conversion. Because data mining deals with very large data sets, scalability is a basic requirement of data mining algorithms.

  19. Our experimental results have demonstrated that the k-means algorithm and the SOM neural network are indeed scalable to very large and complex data sets, in terms of both the number of records and the number of clusters. In fact, the hierarchical clustering algorithm was faster than the k-means algorithm, because our experiments showed that the former often needs fewer iterations to converge than the latter.

  20. REFERENCES • Anderberg, M. R. (1973) Cluster Analysis for Applications, Academic Press. • Ball, G. H. and Hall, D. J. (1967) A Clustering Technique for Summarizing Multivariate Data, Behavioral Science, 12, pp. 153-155. • Bobrowski, L. and Bezdek, J. C. (1991) c-Means Clustering with the l1 and l∞ Norms, IEEE Transactions on Systems, Man and Cybernetics, 21(3), pp. 545-554. • Fisher, D. H. (1987) Knowledge Acquisition Via Incremental Conceptual Clustering, Machine Learning, 2(2), pp. 139-172. • Goldberg, D. E. (1989) Genetic Algorithms in Search, Optimisation, and Machine Learning, Addison-Wesley. • Gowda, K. C. and Diday, E. (1991) Symbolic Clustering Using a New Dissimilarity Measure, Pattern Recognition, 24(6), pp. 567-578. • Hand, D. J. (1981) Discrimination and Classification, John Wiley & Sons.

  21. Huang, Z. (1997) Clustering Large Data Sets with Mixed Numeric and Categorical Values, In Proceedings of The First Pacific-Asia Conference on Knowledge Discovery and Data Mining, Singapore, World Scientific. • Jain, A. K. and Dubes, R. C. (1988) Algorithms for Clustering Data, Prentice Hall. • Kirkpatrick, S., Gelatt, C. D. and Vecchi, M. P. (1983) Optimisation by Simulated Annealing, Science, 220(4598), pp. 671-680. • MacQueen, J. B. (1967) Some Methods for Classification and Analysis of Multivariate Observations, In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, pp. 281-297. • Michalski, R. S. and Stepp, R. E. (1983) Automated Construction of Classifications: Conceptual Clustering Versus Numerical Taxonomy, IEEE Transactions on Pattern Analysis and Machine Intelligence, 5(4), pp. 396-410. • Murtagh, F. (1992) Comments on “Parallel Algorithms for Hierarchical Clustering and Cluster Validity”, IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(10), pp. 1056-1057. • Murthy, C. A. and Chowdhury, N. (1996) In Search of Optimal Clusters Using Genetic Algorithms, Pattern Recognition Letters, 17, pp. 825-832. • Ralambondrainy, H. (1995) A Conceptual Version of the k-Means Algorithm, Pattern Recognition Letters, 16, pp. 1147-1157.

  22. lavneetagra@gmail.com 9411083787 0562-4012926
