
Scalable Clustering using Multiple GPUs

K Wasif Mohiuddin, P J Narayanan. Center for Visual Information Technology, International Institute of Information Technology (IIIT), Hyderabad.


Presentation Transcript


  1. Scalable Clustering using Multiple GPUs
     K Wasif Mohiuddin, P J Narayanan
     Center for Visual Information Technology, International Institute of Information Technology (IIIT), Hyderabad

  2. Introduction
     • Classification of data is desired for meaningful representation.
     • Data in a subset ideally shares common traits.
     • Unsupervised learning finds hidden structure.
     • Applications in data mining and computer vision:
       • Image classification
       • Document retrieval
     • Simple K-Means algorithm
     HiPC - 2011

  3. Need for High Performance Clustering
     • Time complexity of $O(n^{dk+1} \log n)$, where n is the number of input vectors, d the dimension, and k the number of centers.
     • A fast, efficient clustering implementation is needed to deal with large data, high dimensionality, and a large number of centers.
     • In computer vision, 128-dim SIFT and 512-dim GIST features are common. Features can run into several millions.
     • Bag of Words for vocabulary generation using SIFT [Lowe, IJCV04] vectors.

  4. Challenges and Contributions
     • Data: a storage format that allows quick and repeated access.
     • Computational: $O(n^{dk+1} \log n)$
     • Contributions: a complete GPU-based implementation with
       • Exploitation of intra-vector parallelism
       • Efficient mean evaluation
       • Data organization
       • Multi GPU framework

  5. Related Work
     • General Improvements
       • KD-trees [Moore et al, SIGKDD-1999]
       • Triangle Inequality [Elkan, ICML-2003]
     • Pre-CUDA GPU Efforts
       • Fragment Shader [Hart et al, SIGGRAPH-2004]

  6. Related Work (cont)
     • Recent GPU efforts
       • Mean on CPU [Che et al, JPDC-2008]
       • Mean on CPU + GPU [Hong et al, WCCSIE-2009]
       • GPU Miner [Ren et al, HP Labs-2009]
       • HPK-Means [Wu et al, UCHPC-2009]
       • Divide & Rule [Li et al, ICCIT-2010]
     • Parallelism is not exploited within a data object.
     • Mean evaluation on the GPU lacks efficiency.
     • Proposed techniques are parameter dependent.

  7. K-Means
     • Objective function: $\sum_{i} \sum_{j} \lVert x_i^{(j)} - c_j \rVert^2$, with $1 \le i \le n$, $1 \le j \le k$
     • Euclidean distance: L2 norm
     • Steps:
       • Membership Evaluation
       • New Mean Evaluation
       • Convergence
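The objective on this slide can be evaluated directly once every vector has a label. Below is a minimal CPU sketch in plain Python, not the GPU code from the talk. The function name `kmeans_objective` and the toy data are illustrative assumptions.

```python
def kmeans_objective(points, centers, labels):
    """Sum of squared L2 distances from each point to its assigned center."""
    total = 0.0
    for x, j in zip(points, labels):
        c = centers[j]
        # Squared Euclidean distance between x and its center c_j.
        total += sum((xi - ci) ** 2 for xi, ci in zip(x, c))
    return total


# Toy example (assumed data): two points share center 0, one sits on center 1.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0)]
centers = [(0.0, 1.0), (10.0, 0.0)]
labels = [0, 0, 1]
objective = kmeans_objective(points, centers, labels)
```

Each of the first two points lies at squared distance 1 from its center and the third coincides with its center, so the objective for this assignment is 2.0.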

  8. Algorithm
     • K random centers are initially chosen from the input.
     • The data is partitioned into k clusters.
     • Each observation belongs to the cluster with the nearest mean.
     • The new centers are re-evaluated and the process continues until convergence is attained.
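The steps on this slide correspond to the standard Lloyd iteration. Here is a small sequential sketch in plain Python, assuming convergence means that no label changes between iterations. The names `kmeans`, `squared_distance`, `max_iters`, and `seed` are hypothetical and not taken from the presentation.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans(points, k, max_iters=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # k random centers chosen from the input
    d = len(points[0])
    labels = None
    for _ in range(max_iters):
        # Membership evaluation: each point joins the cluster with the nearest mean.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(x, centers[j]))
            for x in points
        ]
        if new_labels == labels:  # convergence: no label changed
            break
        labels = new_labels
        # New mean evaluation: average the members of each cluster.
        sums = [[0.0] * d for _ in range(k)]
        counts = [0] * k
        for x, j in zip(points, labels):
            counts[j] += 1
            for t in range(d):
                sums[j][t] += x[t]
        # An empty cluster keeps its previous center (an assumption of this sketch).
        centers = [
            tuple(s / counts[j] for s in sums[j]) if counts[j] else tuple(centers[j])
            for j in range(k)
        ]
    return centers, labels


# Toy data (assumed): two well-separated groups of three points each.
toy_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
              (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
toy_centers, toy_labels = kmeans(toy_points, 2)
```

The membership loop dominates the cost, since it performs n·k distance evaluations of d components each per iteration.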

  9. K-Means on GPU: Membership Evaluation
     • Involves distance and minima evaluation.
     • Single thread per component of a vector.
     • Parallel computation is done on the ‘d’ components of input and center vectors stored in row major format.
     • Log-step (tree) summation for distance evaluation.
     • For each input vector we traverse across all centers stored in L2 cache.

  10. K-Means on GPU (Cont): Membership Evaluation
      • Data objects stored in row major format
        • Provides coalesced access
      • Distance evaluation using shared memory.
      • Square root avoided: squared distances are sufficient for finding the nearest center.
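The membership step described on these two slides can be imitated on the CPU to show the data flow. In this sketch each per-component squared difference stands for the work of one GPU thread, and the pairwise summation follows the log-step pattern of a parallel reduction. It is a sequential Python illustration under those assumptions, not the authors' kernel; `tree_sum` and `nearest_center` are hypothetical names.

```python
def tree_sum(values):
    """Sum values in log2(size) pairwise steps, as a parallel reduction would."""
    buf = list(values)
    size = 1
    while size < len(buf):  # pad to the next power of two
        size *= 2
    buf += [0.0] * (size - len(buf))
    stride = size // 2
    while stride > 0:
        for t in range(stride):  # on a GPU, each t would be one thread
            buf[t] += buf[t + stride]
        stride //= 2
    return buf[0]


def nearest_center(x, centers):
    """Return (index, squared distance) of the center closest to x."""
    best_j, best_d2 = -1, float("inf")
    for j, c in enumerate(centers):
        # One "thread" per component computes a squared difference.
        sq = [(xi - ci) ** 2 for xi, ci in zip(x, c)]
        d2 = tree_sum(sq)  # squared distance; no square root is needed
        if d2 < best_d2:
            best_j, best_d2 = j, d2
    return best_j, best_d2


label, dist2 = nearest_center((1.0, 1.0), [(0.0, 0.0), (1.0, 2.0), (5.0, 5.0)])
```

Because the square root is monotonic, comparing squared distances selects the same center as comparing true L2 distances.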

  11. K-Means on GPU (Cont)
      • Mean Evaluation Issues
        • Data rearrangement on the CPU as per membership is time consuming.
        • Concurrent writes
        • Random reads and writes
        • Non-uniform distribution of labels for data objects.

  12. Mean Evaluation on GPU
      • Store labels and index in 64-bit records.
      • Group data objects with the same membership using the Splitsort operation.
      • We split using labels as the key.
      • Gather primitive used to rearrange the input in order of labels.
      • A sorted global index of input vectors is generated.
      Splitsort: [Suryakant & Narayanan, IIITH TR 2009]
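The grouping step can be illustrated with a short Python sketch. It packs each label and index into one 64-bit integer, sorts the records, and gathers the vectors in label order. The bit layout (label in the high 32 bits, index in the low 32 bits) is an assumption, and Python's built-in sort is only a stand-in for the GPU Splitsort primitive named on the slide.

```python
def pack_record(label, index):
    """Pack a label and an index into one 64-bit record (assumed layout)."""
    return (label << 32) | index


def unpack_record(record):
    """Return (label, index) from a packed 64-bit record."""
    return record >> 32, record & 0xFFFFFFFF


def group_by_label(vectors, labels):
    records = [pack_record(lab, i) for i, lab in enumerate(labels)]
    records.sort()  # stand-in for Splitsort: orders records by label key
    sorted_labels = [unpack_record(r)[0] for r in records]
    order = [unpack_record(r)[1] for r in records]  # sorted global index
    gathered = [vectors[i] for i in order]  # gather: rearrange input by label
    return sorted_labels, order, gathered


sorted_labels, order, gathered = group_by_label(["a", "b", "c", "d"], [2, 0, 2, 1])
```

After this step all vectors with the same label are contiguous, which removes the concurrent and random writes that make mean evaluation awkward on a GPU.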

  13. Splitsort & Transpose Operation

  14. Mean Evaluation on GPU (cont)
      • Row major storage of vectors enables coalesced access.
      • CUDPP segmented scan followed by a compact operation gives the histogram count.
      • Transpose operation before rearranging input vectors.
      • Using segmented scan again, we evaluate the mean of the rearranged vectors as per labels.
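Once vectors are sorted by label, the counts and sums per cluster fall out of a segmented scan followed by a compaction that keeps the last element of each segment. The following Python sketch mimics that sequence sequentially. It does not call CUDPP, and the names `segmented_inclusive_scan` and `means_from_sorted` are hypothetical.

```python
def segmented_inclusive_scan(values, head_flags):
    """Inclusive prefix sum that restarts wherever head_flags is True."""
    out, running = [], 0.0
    for v, head in zip(values, head_flags):
        running = v if head else running + v
        out.append(running)
    return out


def means_from_sorted(sorted_labels, gathered, k):
    """Per-cluster means from label-sorted vectors; empty clusters give None."""
    n, d = len(sorted_labels), len(gathered[0])
    heads = [i == 0 or sorted_labels[i] != sorted_labels[i - 1] for i in range(n)]
    tails = [i == n - 1 or sorted_labels[i] != sorted_labels[i + 1] for i in range(n)]
    # Scanning a vector of ones yields the running member count per segment.
    counts = segmented_inclusive_scan([1] * n, heads)
    # Transpose so each dimension is one contiguous row, then scan each row.
    columns = list(zip(*gathered))
    scans = [segmented_inclusive_scan(col, heads) for col in columns]
    means = [None] * k
    for i in range(n):
        if tails[i]:  # compact: the last element of a segment holds its total
            j = sorted_labels[i]
            means[j] = tuple(scans[t][i] / counts[i] for t in range(d))
    return means


means = means_from_sorted([0, 0, 2], [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)], 3)
```

In this toy input cluster 1 has no members, so its entry stays `None`. A full implementation would need a policy for empty clusters, which the slides do not specify.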

  15. Implementation Details
      • Tesla
        • 2 vectors per block, 2 centers at a time
        • Centers accessed via shared memory
      • Fermi
        • 2 vectors per block, 4 centers at a time
        • Centers accessed via global memory using L2 cache
        • More shared memory for distance evaluation
      • Occupancy of 83% using 5136 KB of shared memory in case of Fermi.

  16. Issues
      • Too many distance evaluations.
      • Convergence is highly dependent on the cluster centers chosen.
      • Prior seeding using K-Means++ can reduce the number of iterations.
      • Parameters such as the dimension and the number of cluster centers affect performance, apart from the number of input vectors.
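K-Means++ seeding, mentioned above, picks the first center uniformly at random and each later center with probability proportional to its squared distance from the nearest center chosen so far. The sketch below is a plain Python illustration of that rule, not part of the presented implementation. The function name `kmeans_pp_seeds` and the `seed` parameter are assumptions.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans_pp_seeds(points, k, seed=0):
    rng = random.Random(seed)
    centers = [rng.choice(points)]  # first center: uniform over the input
    # d2[i] is the squared distance from points[i] to its nearest chosen center.
    d2 = [squared_distance(x, centers[0]) for x in points]
    while len(centers) < k:
        if sum(d2) == 0.0:
            # Every point coincides with a center; fall back to a uniform pick.
            nxt = rng.choice(points)
        else:
            # Sample proportionally to squared distance (D^2 weighting).
            nxt = rng.choices(points, weights=d2, k=1)[0]
        centers.append(nxt)
        d2 = [min(old, squared_distance(x, nxt)) for x, old in zip(points, d2)]
    return centers


pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
       (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
seeds = kmeans_pp_seeds(pts, 2)
```

Points that already serve as centers have zero weight, so with distinct inputs the seeds are distinct and tend to be spread across clusters.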

  17. Limitations of GPU device
      • The algorithm is highly compute and memory intensive.
      • Limited global and shared memory on a GPU device.
      • Division of the computational load if more than one device is available.
      • Utilization of every resource available.
      • Scalability of the algorithm.

  18. Multi GPU Approach
      • Partition the input data into chunks proportional to the number of cores.
      • Broadcast the ‘k’ centers to all the nodes.
      • Perform membership and partial mean evaluation on each GPU for the chunk sent to its node.
      • Nodes send their partial sums to the master node.
      • New means are evaluated by the master node for the next iteration.
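One iteration of this scheme can be simulated on a single machine. In the sketch below each chunk plays the role of one GPU, `weights` stands for per-device core counts (integer values assumed), and the final loop is the master's combine step. All names are hypothetical, and the devices are processed sequentially rather than in parallel.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def partition(points, weights):
    """Split points into contiguous chunks sized in proportion to integer weights."""
    total, acc, start, chunks = sum(weights), 0, 0, []
    for w in weights:
        acc += w
        end = len(points) * acc // total
        chunks.append(points[start:end])
        start = end
    return chunks


def partial_sums(chunk, centers):
    """Work of one device: label its chunk, return per-cluster sums and counts."""
    k, d = len(centers), len(centers[0])
    sums = [[0.0] * d for _ in range(k)]
    counts = [0] * k
    for x in chunk:
        j = min(range(k), key=lambda c: squared_distance(x, centers[c]))
        counts[j] += 1
        for t in range(d):
            sums[j][t] += x[t]
    return sums, counts


def multi_device_step(points, centers, weights):
    """Master: broadcast centers, collect partial sums, compute the new means."""
    k, d = len(centers), len(centers[0])
    total_sums = [[0.0] * d for _ in range(k)]
    total_counts = [0] * k
    for chunk in partition(points, weights):
        sums, counts = partial_sums(chunk, centers)
        for j in range(k):
            total_counts[j] += counts[j]
            for t in range(d):
                total_sums[j][t] += sums[j][t]
    # A cluster with no members keeps its old center (an assumption of this sketch).
    return [
        tuple(s / total_counts[j] for s in total_sums[j]) if total_counts[j]
        else tuple(centers[j])
        for j in range(k)
    ]


data = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]
start_centers = [(0.0,), (10.0,)]
new_centers = multi_device_step(data, start_centers, [1, 2])
```

Only k sums and k counts travel from each device to the master per iteration, so the communication volume depends on k and d rather than on n.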

  19. Results
      • Generated Gaussian SIFT vectors.
      • Variation in parameters n, d, k.
      • Performance on CPU (32 bit, 2.7 GHz), Tesla T10, and GTX 480 tested up to n_max: 4 million, k_max: 8,000, d_max: 256.
      • Multi GPU (4xT10 + GTX 480): n_max: 32 million, k_max: 8,000, d_max: 256.
      • Comparison with previous GPU implementations.

  20. Overall Results
      Times of K-Means on CPU and GPUs, in seconds, for d=128.

  21. Overall Performance
      • Mean evaluation reduced to 6% of the total time for large inputs of high dimensional data.
      • Multi GPU provided linear speedup.
      • Speedup of up to 170x on GTX 480.
      • 6 million vectors of 128 dimensions clustered in just 136 sec per iteration.
      • Achieved up to twice the speedup of the best previous GPU implementation.

  22. Performance vs ‘n’
      Linear performance for variation in n, with d=128 and k=4,000.

  23. Performance vs ‘d’
      Performance for variation in d, with n=1M and k=8,000.

  24. Performance vs ‘k’
      Linear performance for variation in k, with n=50k and d=128.

  25. Comparison
      Running time of K-Means in seconds on GTX 280.

  26. Performance on GPUs
      Performance of 8600, Tesla, GTX 480 for d=128 and k=1,000.

  27. Conclusions
      • Achieved a speedup of over 170x on a single NVIDIA Fermi GPU.
      • Complete GPU-based implementation.
      • High performance for large ‘d’ due to processing of each vector in parallel.
      • Scalable in problem size (n, d, k) and in the number of cores.
      • Use of operations like Splitsort and Transpose for coalesced memory access.
      • Overcame memory limitations using the Multi GPU framework.
      • Code will be available at http://cvit.iiit.ac.in soon.

  28. Thank You
      Questions?
