Scalable Clustering with Multiple GPUs for Enhanced Performance in Data Mining
This paper presents an efficient and scalable clustering algorithm utilizing multiple GPUs to handle large datasets in high-dimensional spaces. Focused on K-Means clustering, the proposed approach addresses the challenges of high computational complexity and resource constraints. By exploiting intra-vector parallelism and innovative data organization, the method enhances mean evaluation and membership determination. The findings demonstrate significant improvements over traditional CPU-bound methods, making it suitable for applications in data mining and computer vision, such as image classification and document retrieval.
Presentation Transcript
Scalable Clustering using Multiple GPUs K Wasif Mohiuddin, P J Narayanan Center for Visual Information Technology, International Institute of Information Technology (IIIT), Hyderabad
Introduction • Classification of data is desired for a meaningful representation. • Data within a subset ideally shares common traits. • Unsupervised learning is used to find hidden structure. • Applications in data mining and computer vision, with • Image Classification • Document Retrieval • Simple K-Means algorithm HiPC - 2011
Need for High Performance Clustering • Time complexity of O(n^(dk+1) log n), where n = input vectors, d = dimension, k = centers. • A fast, efficient clustering implementation is needed to deal with large data, high dimensionality and many centers. • In computer vision, 128-dim SIFT and 512-dim GIST descriptors are common; features can run into several millions. • Bag of Words for vocabulary generation using SIFT [Lowe, IJCV-2004] vectors
Challenges and Contributions • Data: storage format allowing quick and repeated access. • Computational: O(n^(dk+1) log n). • Contributions: a complete GPU-based implementation with • Exploitation of intra-vector parallelism • Efficient mean evaluation • Data organization • Multi-GPU framework
Related Work • General improvements • KD-trees [Moore et al., SIGKDD-1999] • Triangle inequality [Elkan, ICML-2003] • Pre-CUDA GPU efforts • Fragment shader [Hart et al., SIGGRAPH-2004]
Related Work (cont.) • Recent GPU efforts • Mean on CPU [Che et al., JPDC-2008] • Mean on CPU + GPU [Hong et al., WCCSIE-2009] • GPU Miner [Ren et al., HP Labs-2009] • HPK-Means [Wu et al., UCHPC-2009] • Divide & Rule [Li et al., ICCIT-2010] • Parallelism not exploited within a data object. • Lacking efficiency in mean evaluation on the GPU. • Proposed techniques are parameter dependent.
K-Means • Objective function: ∑i∑j ‖xi(j) − cj‖², 1 ≤ i ≤ n, 1 ≤ j ≤ k • Euclidean distance: L2 norm • Steps: • Membership evaluation • New mean evaluation • Convergence check
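The objective above can be evaluated directly once memberships are known; a minimal NumPy sketch (function and variable names are ours, not from the paper):

```python
import numpy as np

def kmeans_objective(X, centers, labels):
    # Sum of squared L2 distances from each vector to its assigned center.
    return float(np.sum((X - centers[labels]) ** 2))
```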
Algorithm • k random centers are initially chosen from the input. • The data is partitioned into k clusters. • Each observation belongs to the cluster with the nearest mean. • Re-evaluate the new centers and continue the process until convergence is attained.
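The loop described above (Lloyd's algorithm) can be sketched as follows; this is an illustrative CPU-side NumPy version, not the paper's GPU implementation:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # k random centers chosen from the input
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # Membership: assign each vector to its nearest center (squared L2)
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # New mean per cluster; keep the old center if a cluster is empty
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):  # convergence
            break
        centers = new
    return centers, labels
```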
K-Means on GPU: Membership Evaluation • Involves distance and minimum evaluation. • A single thread per component of a vector. • Parallel computation over the d components of input and center vectors, stored in row-major format. • Logarithmic-step summation for distance evaluation. • For each input vector we traverse all centers, stored in the L2 cache.
K-Means on GPU (cont.): Membership Evaluation • Data objects stored in row-major format • Provides coalesced access • Distance evaluation using shared memory • Square-root evaluation avoided, since squared distances suffice for finding the nearest center
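The square root can be skipped because the nearest center under d and under d² is the same, so membership can work entirely on squared distances. A sketch of this in NumPy, using the common expansion ‖x − c‖² = ‖x‖² − 2x·c + ‖c‖² (the expansion is our illustration, not necessarily the paper's kernel):

```python
import numpy as np

def membership(X, C):
    # Squared distances via ||x||^2 - 2 x.c + ||c||^2; no sqrt is taken,
    # since argmin over squared distances gives the same nearest center.
    d2 = (X ** 2).sum(1)[:, None] - 2.0 * (X @ C.T) + (C ** 2).sum(1)[None, :]
    return d2.argmin(axis=1)
```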
K-Means on GPU (cont.) • Mean evaluation issues • Rearranging data on the CPU according to membership is time-consuming. • Concurrent writes • Random reads and writes • Non-uniform distribution of labels across data objects.
Mean Evaluation on GPU • Store label and index in 64-bit records. • Group data objects with the same membership using the SplitSort operation. • We split using labels as the key. • The gather primitive is used to rearrange the input in order of labels. • A sorted global index of the input vectors is generated. SplitSort: Suryakant & Narayanan, IIIT-H TR 2009
Splitsort & Transpose Operation
Mean Evaluation on GPU (cont.) • Row-major storage of vectors enables coalesced access. • CUDPP segmented scan followed by a compact operation for the histogram count. • A transpose operation before rearranging the input vectors. • Using segmented scan again, we evaluate the mean of the rearranged vectors according to their labels.
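The sort-then-segmented-reduce idea above can be mimicked on the CPU; this NumPy sketch (names ours) stands in for the GPU SplitSort plus segmented scan, and assumes every cluster is non-empty:

```python
import numpy as np

def mean_by_splitsort(X, labels, k):
    # Split-sort stand-in: stably sort indices by label so vectors with the
    # same membership become contiguous, then gather.
    order = np.argsort(labels, kind="stable")
    Xs, ls = X[order], labels[order]
    # Start of each label's run, then a segmented sum per cluster.
    # Assumes every cluster is non-empty (reduceat misbehaves on empty runs).
    starts = np.searchsorted(ls, np.arange(k))
    sums = np.add.reduceat(Xs, starts, axis=0)
    counts = np.bincount(ls, minlength=k).astype(float)
    return sums / counts[:, None]
```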
Implementation Details • Tesla • 2 vectors per block, 2 centers at a time • Centers accessed via shared memory • Fermi • 2 vectors per block, 4 centers at a time • Centers accessed via global memory through the L2 cache • More shared memory for distance evaluation • Occupancy of 83% using 5136 KB of shared memory on Fermi.
Issues • Too many distance evaluations. • Convergence is highly dependent on the initial cluster centers. • Prior seeding using K-Means++ can reduce the number of iterations. • Parameters such as dimension and number of cluster centers affect performance, apart from the input size.
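K-Means++ seeding, mentioned above, picks each new center with probability proportional to its squared distance from the nearest already-chosen center. A small illustrative NumPy sketch (not the paper's code):

```python
import numpy as np

def kmeanspp_seed(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # First center: uniform at random from the input.
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        C = np.array(centers)
        # Squared distance of each point to its nearest chosen center.
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        # Next center drawn with probability proportional to d2.
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)
```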
Limitations of a GPU Device • The algorithm is highly compute- and memory-intensive. • Global and shared memory on a GPU device are limited. • The computational load must be divided if more than one device is available. • Every available resource should be utilized. • Scalability of the algorithm
Multi-GPU Approach • Partition the input data into chunks proportional to the number of cores. • Broadcast the k centers to all nodes. • Perform membership evaluation and partial mean computation on each of the GPUs attached to their respective nodes. • Nodes send partial sums to the master node. • New means are evaluated by the master node for the next iteration.
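The master-node reduction in the last two steps can be sketched as follows; the per-worker message format of (partial_sums, partial_counts) is our assumption, chosen for illustration:

```python
import numpy as np

def combine_partial_means(partials, k, d):
    # Each worker reports (partial_sums[k, d], partial_counts[k]) for its
    # chunk; summing them and dividing gives the new global means.
    sums, counts = np.zeros((k, d)), np.zeros(k)
    for s, c in partials:
        sums += s
        counts += c
    return sums / np.maximum(counts, 1.0)[:, None]  # guard empty clusters
```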
Results • Generated Gaussian SIFT vectors. • Variation in the parameters n, d, k. • Performance on CPU (32-bit, 2.7 GHz), Tesla T10 and GTX 480, tested up to n_max = 4 million, k_max = 8,000, d_max = 256. • Multi-GPU (4×T10 + GTX 480): n_max = 32 million, k_max = 8,000, d_max = 256. • Comparison with previous GPU implementations.
Overall Results Times of K-Means on CPU and GPUs in seconds, for d=128.
Overall Performance • Mean evaluation reduced to 6% of the total time for large inputs of high-dimensional data. • Multi-GPU provided near-linear speedup. • Speedup of up to 170× on the GTX 480. • 6 million vectors of 128 dimensions clustered in just 136 seconds per iteration. • Achieved up to a 2× speedup over the best previous GPU implementation.
Performance vs ‘n’ Linear performance for variation in n, with d=128 and k=4,000.
Performance vs ‘d’ Performance for variation in d, with n=1M and k=8,000.
Performance vs ‘k’ Linear performance for variation in k, with n=50k and d=128.
Comparison Running time of K-Means in seconds on GTX 280.
Performance on GPUs Performance of 8600, Tesla, GTX 480 for d=128 and k=1,000.
Conclusions • Achieved a speedup of over 170× on a single NVIDIA Fermi GPU. • A complete GPU-based implementation. • High performance for large d due to processing each vector in parallel. • Scalable in problem size (n, d, k) and number of cores. • Use of operations such as SplitSort and transpose for coalesced memory access. • Overcame memory limitations using a multi-GPU framework. • Code will be available at http://cvit.iiit.ac.in soon.
Thank You Questions?