
Presentation Transcript


  1. Optimizing DivKmeans for Multicore Architectures: a status report. Jiahu Deng and Beth Plale, Department of Computer Science, Indiana University. CICC quarterly meeting.

  2. Acknowledgements • David Wild • Rajarshi Guha • Digital Chemistry • Work funded in part by CICC and Microsoft

  3. Problem Statements
  1. Clustering is an important method for organizing thousands of data items into meaningful groups. It is widely applied in chemistry, chemical informatics, biology, drug discovery, etc. However, for large datasets, clustering is a slow process even when it is parallelized and executed on powerful compute clusters.
  2. Multi-core architectures provide large degrees of parallelism. Taking advantage of this requires re-examining traditional parallelization approaches. We apply that examination to the DivKmeans clustering method.

  4. Multi-core Architectures
  Multi-core processors combine two or more independent processor cores into a single package.
  Diagram: an Intel Core 2 dual-core processor, with CPU-local Level 1 caches and a shared, on-die Level 2 cache.

  5. Clustering Algorithm
  1. Hierarchical clustering: a series of partitioning steps takes place, generating a hierarchy of clusters. It includes two families: agglomerative methods, which work from the leaves upward, and divisive methods, which decompose from the root downward. http://www.digitalchemistry.co.uk/prod_clustering.html

  6. Clustering Algorithm
  2. Non-hierarchical clustering: clusters form around centroids, the number of which can be specified by the user. All clusters rank equally and there is no particular relationship between them. http://www.digitalchemistry.co.uk/prod_clustering.html

  7. Divisive KMeans (DivKmeans) Clustering Algorithm
  Kmeans method: K is the number of clusters, which can be specified. The items are initially assigned to clusters at random. Kmeans clustering then proceeds by repeated application of a two-step process:
  1. The mean vector for all items in each cluster is computed.
  2. Items are reassigned to the cluster whose center is closest to the item.
  Features: The K-means algorithm is stochastic, so its results are subject to a random component. It works very well for well-defined clusters with a clear cluster center.
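  To make the two-step process concrete, here is a minimal C sketch of one k-means pass over a dense nrows x ncols data matrix. The variable names and the squared Euclidean distance are assumptions for illustration only; this is not the Cluster 3.0 implementation.

    #include <float.h>

    /* One pass of k-means: recompute centroids, then reassign items.
     * centroid is k x ncols scratch space, count has length k,
     * clusterid[r] holds the current cluster of row r. */
    void kmeans_pass(double **data, int nrows, int ncols,
                     double **centroid, int *count, int *clusterid, int k)
    {
        /* Step 1: mean vector of every cluster. */
        for (int c = 0; c < k; c++) {
            count[c] = 0;
            for (int j = 0; j < ncols; j++)
                centroid[c][j] = 0.0;
        }
        for (int r = 0; r < nrows; r++) {
            int c = clusterid[r];
            count[c]++;
            for (int j = 0; j < ncols; j++)
                centroid[c][j] += data[r][j];
        }
        for (int c = 0; c < k; c++)
            if (count[c] > 0)
                for (int j = 0; j < ncols; j++)
                    centroid[c][j] /= count[c];

        /* Step 2: reassign each item to the nearest centroid
         * (squared Euclidean distance). */
        for (int r = 0; r < nrows; r++) {
            double best = DBL_MAX;
            for (int c = 0; c < k; c++) {
                double d = 0.0;
                for (int j = 0; j < ncols; j++) {
                    double diff = data[r][j] - centroid[c][j];
                    d += diff * diff;
                }
                if (d < best) { best = d; clusterid[r] = c; }
            }
        }
    }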

  8. Divisive KMeans (DivKmeans) Clustering Algorithm
  Diagram: the original cluster is split by the Kmeans method into cluster1 and cluster2, and each resulting cluster is split again by further applications of the Kmeans method.
  Divisive KMeans: a hierarchical kmeans method. In the following discussion we consider k = 2, i.e. each clustering process accepts one cluster as input and generates two partitioned clusters as outputs.
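  A minimal sketch of this divisive scheme with k = 2, assuming a hypothetical helper kmeans2() that partitions one cluster's member list in place and returns the size of the first half; in the report itself that step is the Cluster 3.0 kmeans() routine.

    /* Hypothetical splitter: partitions members[0..n-1] in place into two
     * clusters and returns how many ids ended up in the first cluster. */
    int kmeans2(double **data, int ncols, int *members, int n);

    /* Recursively bisect clusters until only singletons remain. */
    void divkmeans(double **data, int ncols, int *members, int n)
    {
        if (n <= 1)
            return;                      /* a single item cannot be split */

        int nleft = kmeans2(data, ncols, members, n);
        if (nleft == 0 || nleft == n)
            return;                      /* degenerate split: stop here */

        divkmeans(data, ncols, members, nleft);              /* cluster 1 */
        divkmeans(data, ncols, members + nleft, n - nleft);  /* cluster 2 */
    }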

  9. Parallelization of the DivKmeans Algorithm for Multicore
  • Proceeding without the Digital Chemistry DivKmeans: once agreement was reached (Nov 2006), we could not obtain a version of the source code isolated so that it communicated through public rather than private interfaces.
  • Naive parallelization of DivKmeans: chose to work with Cluster 3.0 from the Open Source Clustering Software, Laboratory of DNA Information Analysis, Human Genome Center, Institute of Medical Science, University of Tokyo. The C clustering library is released under the "Python License". Parallelized this Kmeans code with decomposition.
  • Gather performance results on the naive parallelization.
  • Suggest multicore-sensitive parallelizations.
  • Early performance results of these parallelizations.

  10. Naive Parallelization of Cluster 3.0 Kmeans
  • Treat each kmeans clustering process as a black box, which takes one cluster as input and generates two clusters as outputs.
  • When a newly generated cluster has more than one element in it, assign it to a free processor for further clustering.
  • A master node maintains the status of each node (a master/worker sketch follows below).
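  The master/worker scheme could look roughly like the following MPI sketch. This is an illustrative assumption, not the authors' code: clusters are represented only by their element counts, and the real 2-means split is stubbed out on the worker side.

    #include <mpi.h>

    #define TAG_WORK 1
    #define TAG_DONE 2
    #define TAG_STOP 3

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        if (rank == 0) {                       /* master node */
            int pending[1024], top = 0;        /* clusters still to be split */
            int freew[1024], nfree = 0;        /* ranks of idle workers */
            for (int w = 1; w < nprocs; w++)
                freew[nfree++] = w;
            pending[top++] = 1000;             /* the original cluster */

            while (top > 0 || nfree < nprocs - 1) {
                if (top > 0 && nfree > 0) {    /* hand a cluster to a free node */
                    int size = pending[--top];
                    MPI_Send(&size, 1, MPI_INT, freew[--nfree],
                             TAG_WORK, MPI_COMM_WORLD);
                } else {                       /* wait for any worker to finish */
                    int halves[2];
                    MPI_Status st;
                    MPI_Recv(halves, 2, MPI_INT, MPI_ANY_SOURCE,
                             TAG_DONE, MPI_COMM_WORLD, &st);
                    freew[nfree++] = st.MPI_SOURCE;
                    for (int i = 0; i < 2; i++)    /* re-queue splittable clusters;
                                                      singletons are final */
                        if (halves[i] > 1)
                            pending[top++] = halves[i];
                }
            }
            int stop = 0;
            for (int w = 1; w < nprocs; w++)   /* tell the workers to stop */
                MPI_Send(&stop, 1, MPI_INT, w, TAG_STOP, MPI_COMM_WORLD);
        } else {                               /* working node */
            for (;;) {
                int size, halves[2];
                MPI_Status st;
                MPI_Recv(&size, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
                if (st.MPI_TAG == TAG_STOP)
                    break;
                halves[0] = size / 2;          /* stand-in for the real 2-means split */
                halves[1] = size - halves[0];
                MPI_Send(halves, 2, MPI_INT, 0, TAG_DONE, MPI_COMM_WORLD);
            }
        }
        MPI_Finalize();
        return 0;
    }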

  11. Naive Parallelization of Cluster 3.0 Kmeans
  Diagram: the master node assigns the original cluster to working node 1; of the two resulting clusters, cluster1 is reassigned to node 1 and cluster2 is assigned (or reassigned) to node 2; further clusters are assigned to node 3, and so on.

  12. Quality of the Cluster 3.0 Kmeans Naive Parallelization
  Pros: no need to worry about the details of the DivKmeans method; the Kmeans functions of other libraries can be used directly.
  Cons: what about speedup and scalability? What is the parallelization overhead?

  13. Profiling the Naive Parallelization
  • Platform: a Linux cluster; each node has two 2 GHz AMD Opteron(TM) CPUs, and each CPU has dual cores; Linux RHEL WS release 4.
  • Algorithm: Cluster 3.0, parallelized and made divisive.
  • Dataset: PubChem datasets of 24,000 and 96,000 elements.
  • Additional libraries: LAM 7.1.2/MPI.

  14. Speedup: Naive Parallelization of Cluster 3.0
  Speedup is defined by Sp = T1 / Tp, where:
  • p is the number of processors
  • T1 is the execution time of the sequential algorithm
  • Tp is the execution time of the parallel algorithm with p processors
  Conclusion: maximum benefit is reached at 17 nodes; there is a significant decrease in speedup after only 5 nodes.

  15. CPU Utilization
  Conclusion: node 1 maxes out at 100% utilization, a likely limiter on overall performance.

  16. Memory Utilization
  Conclusion: nothing outstanding.

  17. Process Behaviors
  Observed with XMPI, a graphical user interface for running, debugging, and visualizing MPI programs.

  18. Conclusions on the Naive Parallelization from Profiling
  • Poor scalability beyond 5 nodes.
  • Performance likely inhibited by the 100% utilization of node 1.
  Proposed Solution
  • Multi-core solution: use multiple threads on each node, with each thread running on one core.
  • How this solution explicitly addresses the two problems identified above.

  19. Proposed Solution
  Diagram: the original cluster goes through some pre-processing, the work is split across threads 1-4, the results are merged, and after other processing the two output clusters (cluster1 and cluster2) are produced.
  Instead of treating each kmeans clustering process as a black box, each clustering process is decomposed into several threads.

  20. Step 1: Identify Parts to Decompose (Parallelize)
  Diagram: calling sequence of the kmeans clustering process. DivKmeans calls Kmeans(); inside Kmeans(), a while loop repeatedly executes "Finding Centroids" and "Calculating Distance", each of which is a do loop.
  Profiling shows:
  -> About 93% of total execution time is spent in the kmeans() functions.
  -> Inside the kmeans() function, almost all time is spent in "Finding Centroids" and "Calculating Distance".
  -> Hence, parallelize these two.

  21. Simplified Code of "Finding Centroids"

    // sum up all elements
    for (k = 0; k < nrows; k++) {
        i = clusterid[k];
        for (j = 0; j < ncolumns; j++)
            cdata[i][j] += data[k][j];
    }

    // calculate mean values
    for (i = 0; i < nclusters; i++) {
        for (j = 0; j < ncolumns; j++)
            cdata[i][j] /= total_number[i][j];
    }

  22. Parallelized Code of "Finding Centroids"

  Before parallelization:

    // sum up all elements
    for (k = 0; k < nrows; k++) {
        i = clusterid[k];
        for (j = 0; j < ncolumns; j++)
            cdata[i][j] += data[k][j];
    }
    // calculate mean values
    ...

  After parallelization:

    // sum up elements assigned to current thread
    for (k = nrows * index / n_thread;
         k < nrows * (index + 1) / n_thread; k++) {
        i = clusterid[k];
        for (j = 0; j < ncolumns; j++) {
            if (mask[k][j] != 0) {
                t_data[i][j] += data[k][j];
                t_mask[i][j]++;
            }
        }
    }
    // merge data
    ...
    // calculate mean values
    ...
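  A sketch of how the per-thread partial sums above might be launched and merged with Pthreads. The slice bounds and the t_data / t_mask names follow the slide; the task struct, function names, and create/join-per-call structure are assumptions for illustration, not the authors' implementation.

    #include <pthread.h>

    struct centroid_task {
        int index, n_thread;          /* which slice of rows this thread owns */
        int nrows, ncolumns, nclusters;
        double **data, **t_data;      /* t_data: this thread's private sums */
        int **mask, **t_mask;         /* t_mask: this thread's private counts */
        int *clusterid;
    };

    /* Worker: sum the rows assigned to this thread into private buffers. */
    static void *sum_slice(void *arg)
    {
        struct centroid_task *t = arg;
        for (int k = t->nrows * t->index / t->n_thread;
             k < t->nrows * (t->index + 1) / t->n_thread; k++) {
            int i = t->clusterid[k];
            for (int j = 0; j < t->ncolumns; j++)
                if (t->mask[k][j] != 0) {
                    t->t_data[i][j] += t->data[k][j];
                    t->t_mask[i][j]++;
                }
        }
        return NULL;
    }

    /* Launch one thread per core, wait for all of them, then merge the
     * private sums into the shared cdata / cmask arrays. */
    void find_centroids_parallel(struct centroid_task *tasks, int n_thread,
                                 double **cdata, int **cmask)
    {
        pthread_t tid[n_thread];

        for (int t = 0; t < n_thread; t++)
            pthread_create(&tid[t], NULL, sum_slice, &tasks[t]);
        for (int t = 0; t < n_thread; t++)
            pthread_join(tid[t], NULL);

        for (int t = 0; t < n_thread; t++)        /* merge data */
            for (int i = 0; i < tasks[t].nclusters; i++)
                for (int j = 0; j < tasks[t].ncolumns; j++) {
                    cdata[i][j] += tasks[t].t_data[i][j];
                    cmask[i][j] += tasks[t].t_mask[i][j];
                }
    }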

  23. Mapping of Algorithms onto Multi-core Architectures
  Diagram: the original cluster goes through some pre-processing, the work is split across cores 1-4, the results are merged, and after other processing the two output clusters (cluster1 and cluster2) are produced.
  Each thread uses one core.

  24. Mapping of Algorithms onto Multi-core Architectures
  How can we further benefit from multi-core architectures?
  • Data locality
  • Cache-aware algorithms
  • Architecture-aware algorithms

  25. Mapping of Algorithms onto Multi-core Architectures
  Example 1: AMD Opteron. There is no cache sharing between the two cores in this architecture.
  Diagram: AMD Opteron.

  26. Mapping of Algorithms onto Multi-core Architectures
  Example 2: Intel Core 2. Improve cache re-use: if two threads share common data, assign them to cores on the same die (an affinity sketch follows below).
  Diagram: an Intel Core 2 dual-core processor.
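  One way to realize that placement on Linux is to pin threads explicitly. The sketch below uses the GNU pthread_setaffinity_np() extension and assumes cores 0 and 1 sit on the same die; the actual core numbering is platform-specific.

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    /* Pin a thread to a single core; returns 0 on success. */
    static int pin_thread_to_core(pthread_t thread, int core)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        return pthread_setaffinity_np(thread, sizeof(set), &set);
    }

    /* e.g. keep the two threads that share the centroid data on one die:
     *   pin_thread_to_core(t1, 0);
     *   pin_thread_to_core(t2, 1);
     */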

  27. Mapping of Algorithms onto Multi-core Architectures
  Example 3: Dell PowerEdge 6950, a NUMA (Non-Uniform Memory Access) machine. Improve data locality: keep data in local memory so that each thread uses local rather than remote memory as much as possible.
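  On a NUMA machine this could be done with libnuma (compile with -lnuma). The helper below is a sketch under the assumption that each thread allocates, and later processes, its own slice of the data matrix; whether the project uses libnuma at all is an assumption.

    #include <numa.h>
    #include <stdlib.h>

    /* Allocate a buffer on the NUMA node the calling thread runs on, so the
     * thread that fills and scans this slice of rows reads local memory.
     * Falls back to malloc() when the machine is not NUMA-capable.
     * Buffers from numa_alloc_local() must be released with numa_free(). */
    void *alloc_slice_local(size_t nbytes)
    {
        if (numa_available() < 0)
            return malloc(nbytes);
        return numa_alloc_local(nbytes);
    }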

  28. Early Results on the Multi-core Platform
  Experimental environment:
  • Platform: 3 nodes of a Linux cluster; each node has two 2 GHz AMD Opteron(TM) CPUs, and each CPU has dual cores; Linux RHEL WS release 4.
  • Libraries: LAM 7.1.2/MPI; Pthreads for Linux RHEL WS release 4.
  Degree of parallelization: only the "Finding Centroids" code is parallelized for this early study. Four threads are used for "Finding Centroids" on each node, and each thread runs on one core.

  29. Results of Parallelizing "Finding Centroids"
  Conclusion: modest improvement. DivKmeans runs about 12% faster after parallelization.

  30. Parallelizing "Finding Centroids" with Different Numbers of Threads per Node
  Total number of cores per node: 4.
  Conclusion: there is little benefit from using more threads than cores.

  31. Optimizations for the Next Step
  • Reduce the overhead of managing threads, e.g. use a thread pool instead of creating new threads for each call to "Finding Centroids" (a sketch follows below).
  • Parallelize the "Calculating Distance" part, which consumes twice the time of "Finding Centroids".
  • More cores (4, 8, 32, ...) on a single computer are on the way; we should see further performance gains with more cores if the program scales well.
  • The platform we used (AMD Opteron(TM)) does not support cache sharing between two cores on the same die. However, L2 and even L1 cache sharing among cores is becoming available.
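  As noted in the first bullet, thread-management overhead could be cut by keeping the workers alive between calls. The sketch below reuses four workers with POSIX barriers instead of creating and joining threads on every call to "Finding Centroids"; the names, structure, and barrier-based approach are assumptions, not the planned implementation.

    #include <pthread.h>
    #include <stdbool.h>

    #define N_WORKERS 4

    static pthread_barrier_t start_barrier, done_barrier;
    static pthread_t workers[N_WORKERS];
    static int worker_index[N_WORKERS];
    static bool shutting_down = false;

    /* Each worker sleeps at the start barrier until the next task arrives,
     * sums its slice of rows, then meets the main thread at the done barrier. */
    static void *worker(void *arg)
    {
        int index = *(int *)arg;
        for (;;) {
            pthread_barrier_wait(&start_barrier);
            if (shutting_down)
                break;
            /* ... sum rows [nrows*index/N_WORKERS, nrows*(index+1)/N_WORKERS)
                   into this worker's t_data / t_mask, as on slide 22 ... */
            pthread_barrier_wait(&done_barrier);
        }
        return NULL;
    }

    /* Create the pool once, up front. */
    void pool_start(void)
    {
        pthread_barrier_init(&start_barrier, NULL, N_WORKERS + 1);
        pthread_barrier_init(&done_barrier, NULL, N_WORKERS + 1);
        for (int t = 0; t < N_WORKERS; t++) {
            worker_index[t] = t;
            pthread_create(&workers[t], NULL, worker, &worker_index[t]);
        }
    }

    /* One call to "Finding Centroids": release the pool, wait for it to
     * finish, then merge the per-worker sums on the calling thread. */
    void find_centroids_pooled(void)
    {
        pthread_barrier_wait(&start_barrier);   /* workers start */
        pthread_barrier_wait(&done_barrier);    /* workers finished */
        /* ... merge t_data into cdata ... */
    }

    /* Tear the pool down when clustering is complete. */
    void pool_stop(void)
    {
        shutting_down = true;
        pthread_barrier_wait(&start_barrier);
        for (int t = 0; t < N_WORKERS; t++)
            pthread_join(workers[t], NULL);
    }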

  32. The Multi-core Project in the Distributed Data Everywhere (DDE) Lab and the Extreme Lab
  • Multi-core processors represent a major evolution in today's computing technology.
  • We are exploring programming styles and challenges on multi-core platforms, and potential applications in both academic and commercial areas, including chemical informatics, XML parsing, data streaming, Web Services, etc.

  33. References
  1. Open Source Clustering Software, Laboratory of DNA Information Analysis, Human Genome Center, Institute of Medical Science, University of Tokyo. http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/
  2. http://www.nsc.liu.se/rd/enacts/Smith/img1.htm
  3. http://www.mhpcc.edu/training/workshop/parallel_intro/
  4. http://www.digitalchemistry.co.uk/prod_clustering.html
  5. David Morse, Dell Inc. Performance Benchmarking on the Dell PowerEdge(TM) 6950.
