Co-clustering using CUDA


Co-Clustering Explained • Problem: • Large binary matrix of samples (rows) and features (columns) • Which samples should be grouped together? Why? • What are the shared features? • Co-clustering gives you the “why” explicitly • Correlated sample/feature pair • Example (from the slide figure): Row cluster: s1 and s3 are in a group; Column cluster: distinguishing features are 2, 3, and 5

Co-Clustering - Details • Using Information Theoretic Co-clustering, as parallelized for Hadoop architecture in: • Disco: Distributed co-clustering with Map-Reduce: A case study towards petabyte-scale end-to-end mining, Papadimitriou et al., Data Mining 2008 • Partition entire matrix into row groups, col groups • Minimize length of encoding of resulting partitioned matrix • Competing code length factors: number of row groups & col groups, homogeneity of clusters • Iterate over rows, rearrange and sub-partition to find better encoding using heuristic • Repeat for columns, then rows again, until local optimum is found • Complexity: O(n*fp*(row_groups+col_groups)^2*iters) Credit: Chakrabarti et al., KDD 2004

Implementation - Basics • Initial matrix generation : CPU • Initial random row/column group assignment: CPU • Memory structures very simple, arrays of ints

Implementation – Stats step 1 • Statistics calculations: • Calculates statistics for each row of each column group • Statistic is number of 1’s in a column group • Straightforward parallelization (each thread works on one row at a time), global memory • Example (from the slide figure): Column Groups: 2 3 1 3 2; Row Groups: 3 5 1 1 4; Stat(Row 3, ColumnGroup 3) = 1
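The per-row statistics step could be sketched as follows. This is a minimal illustration, not the presenter's actual kernel: it assumes a dense row-major matrix of ints, 0-based group ids, and the hypothetical names `rowStatsKernel` and `computeRowStats`. Error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// row_stats[r * num_col_groups + g] = number of 1s of row r inside column group g.
__global__ void rowStatsKernel(const int* matrix, const int* col_group,
                               int* row_stats, int n_rows, int n_cols,
                               int num_col_groups) {
    // Grid-stride loop: each thread works on one row at a time, in global memory.
    for (int r = blockIdx.x * blockDim.x + threadIdx.x; r < n_rows;
         r += gridDim.x * blockDim.x) {
        int* out = row_stats + r * num_col_groups;
        for (int g = 0; g < num_col_groups; ++g) out[g] = 0;
        for (int c = 0; c < n_cols; ++c)
            out[col_group[c]] += matrix[r * n_cols + c];
    }
}

// Hypothetical host wrapper (assumed layout: row-major ints, 0-based group ids).
std::vector<int> computeRowStats(const std::vector<int>& matrix,
                                 const std::vector<int>& col_group,
                                 int n_rows, int n_cols, int num_col_groups,
                                 int blocks, int threads) {
    int *d_matrix, *d_group, *d_stats;
    size_t stats_bytes = size_t(n_rows) * num_col_groups * sizeof(int);
    cudaMalloc(&d_matrix, matrix.size() * sizeof(int));
    cudaMalloc(&d_group, col_group.size() * sizeof(int));
    cudaMalloc(&d_stats, stats_bytes);
    cudaMemcpy(d_matrix, matrix.data(), matrix.size() * sizeof(int),
               cudaMemcpyHostToDevice);
    cudaMemcpy(d_group, col_group.data(), col_group.size() * sizeof(int),
               cudaMemcpyHostToDevice);
    rowStatsKernel<<<blocks, threads>>>(d_matrix, d_group, d_stats,
                                        n_rows, n_cols, num_col_groups);
    std::vector<int> stats(size_t(n_rows) * num_col_groups);
    cudaMemcpy(stats.data(), d_stats, stats_bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_matrix); cudaFree(d_group); cudaFree(d_stats);
    return stats;
}
```

Because each thread writes only its own row's counters, no atomics are needed in this step.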

Room For Improvement • Calculate row statistics according to the histogram algorithm from the textbook • Block columns • Assign one thread block to each column block • Compute shared memory histograms within block • Merge back to global memory when finished

Implementation – Stats step 2 • Calculates cost for each row group of each column group • Essentially a reduce on the per-row data • Block the rows, assign block to thread block • Use shared memory and atomics to build histogram of all rows in a given row group • Merge shared histogram with global histogram for that row group • Iterate over all row groups • Example (from the slide figure): Column Groups: 2 3 1 3 2; Row Groups: 3 5 1 1 4; Stat(RowGroup 1, ColumnGroup 3) = 2
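The reduce described on this slide might look roughly like the sketch below. It assumes the per-row counts are already on hand as a row-major array, uses one kernel launch per row group, and sizes the shared histogram with dynamic shared memory. The names `groupStatsKernel` and `computeGroupStats` are hypothetical, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Adds the per-row counts of every row in target_row_group into
// group_stats[target_row_group * num_col_groups + g].
__global__ void groupStatsKernel(const int* row_stats, const int* row_group,
                                 int* group_stats, int n_rows,
                                 int num_col_groups, int target_row_group) {
    extern __shared__ int hist[];  // one counter per column group
    for (int g = threadIdx.x; g < num_col_groups; g += blockDim.x) hist[g] = 0;
    __syncthreads();
    // Each thread block covers a slice of the rows; shared atomics build its histogram.
    for (int r = blockIdx.x * blockDim.x + threadIdx.x; r < n_rows;
         r += gridDim.x * blockDim.x) {
        if (row_group[r] != target_row_group) continue;
        for (int g = 0; g < num_col_groups; ++g)
            atomicAdd(&hist[g], row_stats[r * num_col_groups + g]);
    }
    __syncthreads();
    // Merge the block-local histogram into the global one for this row group.
    for (int g = threadIdx.x; g < num_col_groups; g += blockDim.x)
        atomicAdd(&group_stats[target_row_group * num_col_groups + g], hist[g]);
}

// Hypothetical host wrapper: iterates over all row groups, as on the slide.
std::vector<int> computeGroupStats(const std::vector<int>& row_stats,
                                   const std::vector<int>& row_group,
                                   int num_row_groups, int num_col_groups,
                                   int blocks, int threads) {
    int n_rows = static_cast<int>(row_group.size());
    int *d_rows, *d_group, *d_out;
    size_t out_bytes = size_t(num_row_groups) * num_col_groups * sizeof(int);
    cudaMalloc(&d_rows, row_stats.size() * sizeof(int));
    cudaMalloc(&d_group, row_group.size() * sizeof(int));
    cudaMalloc(&d_out, out_bytes);
    cudaMemcpy(d_rows, row_stats.data(), row_stats.size() * sizeof(int),
               cudaMemcpyHostToDevice);
    cudaMemcpy(d_group, row_group.data(), row_group.size() * sizeof(int),
               cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, out_bytes);
    for (int rg = 0; rg < num_row_groups; ++rg)
        groupStatsKernel<<<blocks, threads, num_col_groups * sizeof(int)>>>(
            d_rows, d_group, d_out, n_rows, num_col_groups, rg);
    std::vector<int> out(size_t(num_row_groups) * num_col_groups);
    cudaMemcpy(out.data(), d_out, out_bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_rows); cudaFree(d_group); cudaFree(d_out);
    return out;
}
```

Keeping only one row group's histogram in shared memory at a time keeps the shared footprint at `num_col_groups` ints, at the price of one pass over the rows per row group.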

Implementation – Row/Col Group Optimization • For each row, find optimal group it could belong to • Parallelized straight-forwardly, one row per thread, loop and stride to get all rows • Each row calculation goes through all row groups, determines global cost of moving to that row group • Move all rows to their optimal group • Recompute statistics • Repeat for column groups • Continue alternating row/column groupings until convergence
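A one-row-per-thread reassignment pass could be sketched as below. The slides do not give the exact cost formula, so this sketch assumes a standard per-cell encoding cost: for each candidate row group, the row's ones and zeros in each column group are charged -log2(p) and -log2(1-p), where p is the current density of that block. The kernel name, array names, and the clamping constant are all hypothetical, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <cfloat>
#include <vector>

// best_group[r] = row group with the lowest assumed encoding cost for row r.
__global__ void bestRowGroupKernel(const int* row_stats, const int* group_ones,
                                   const int* row_group_size,
                                   const int* col_group_size, int* best_group,
                                   int n_rows, int num_row_groups,
                                   int num_col_groups) {
    // One row per thread, loop and stride to cover all rows.
    for (int r = blockIdx.x * blockDim.x + threadIdx.x; r < n_rows;
         r += gridDim.x * blockDim.x) {
        float best_cost = FLT_MAX;
        int best = 0;
        for (int i = 0; i < num_row_groups; ++i) {
            float cost = 0.0f;
            for (int j = 0; j < num_col_groups; ++j) {
                float cells = float(row_group_size[i]) * col_group_size[j];
                float p = cells > 0.0f
                              ? group_ones[i * num_col_groups + j] / cells
                              : 0.5f;
                p = fminf(fmaxf(p, 1e-6f), 1.0f - 1e-6f);  // avoid log2(0)
                int ones = row_stats[r * num_col_groups + j];
                int zeros = col_group_size[j] - ones;
                cost -= ones * log2f(p) + zeros * log2f(1.0f - p);
            }
            if (cost < best_cost) { best_cost = cost; best = i; }
        }
        best_group[r] = best;
    }
}

static int* toDevice(const std::vector<int>& v) {
    int* d = nullptr;
    cudaMalloc(&d, v.size() * sizeof(int));
    cudaMemcpy(d, v.data(), v.size() * sizeof(int), cudaMemcpyHostToDevice);
    return d;
}

// Hypothetical host wrapper returning the proposed group for every row.
std::vector<int> bestRowGroups(const std::vector<int>& row_stats,
                               const std::vector<int>& group_ones,
                               const std::vector<int>& row_group_size,
                               const std::vector<int>& col_group_size,
                               int n_rows, int blocks, int threads) {
    int nrg = static_cast<int>(row_group_size.size());
    int ncg = static_cast<int>(col_group_size.size());
    int* d_rs = toDevice(row_stats);
    int* d_go = toDevice(group_ones);
    int* d_rsz = toDevice(row_group_size);
    int* d_csz = toDevice(col_group_size);
    int* d_best = nullptr;
    cudaMalloc(&d_best, n_rows * sizeof(int));
    bestRowGroupKernel<<<blocks, threads>>>(d_rs, d_go, d_rsz, d_csz, d_best,
                                            n_rows, nrg, ncg);
    std::vector<int> best(n_rows);
    cudaMemcpy(best.data(), d_best, n_rows * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_rs); cudaFree(d_go); cudaFree(d_rsz); cudaFree(d_csz); cudaFree(d_best);
    return best;
}
```

After this pass the rows would be moved to their proposed groups and the statistics recomputed before the same procedure is repeated for columns.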

Room For Improvement • Parallelization could be more sophisticated • Could block the rows and compute the cost of the row joining each row group in parallel • Using shared memory atomics to identify minimum cost • In practice, this algorithm heavily favors a small number of row and column groups • The gain from this parallelization would therefore be small

Implementation Outer Loop • After local minimum is found, change initial number of row and column groups and retry • Change number of row groups or number of column groups, up or down • Continue changing number of row or column groups in that direction until cost fails to decrease • Try both directions in both dimensions before stopping • Outer loop performed on CPU
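The host-side search could be sketched as a simple hill climb over the two group counts. This is only an illustration: `searchGroupCounts` and the `innerCost` callback (standing in for one full run of the inner row/column loop that returns the final encoding cost) are hypothetical names, and the order in which directions are tried is an assumption.

```cuda
#include <functional>
#include <utility>

// Starting from (k row groups, l column groups), keep stepping in a direction
// while the cost decreases; stop once no direction in either dimension helps.
std::pair<int, int> searchGroupCounts(
    int k, int l, const std::function<double(int, int)>& innerCost) {
    double best = innerCost(k, l);
    const int steps[4][2] = {{1, 0}, {-1, 0}, {0, 1}, {0, -1}};
    bool improved = true;
    while (improved) {
        improved = false;
        for (const auto& s : steps) {
            // Continue in this direction until the cost fails to decrease.
            while (k + s[0] >= 1 && l + s[1] >= 1) {
                double c = innerCost(k + s[0], l + s[1]);
                if (c >= best) break;
                best = c;
                k += s[0];
                l += s[1];
                improved = true;
            }
        }
    }
    return {k, l};
}
```

In a real run `innerCost` would launch the GPU kernels for the inner loop, so each evaluation is expensive and the number of probes matters more than the bookkeeping here.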

Room for Improvement • Outer loop could parallelize inner loop actions over different GPUs • Each could explore the different dimensions and directions in parallel

Implementation – CPU + Validation • CPU implementation performed all steps described earlier, but sequentially • Validation • Used CPU implementation of statistics calculations to validate GPU stats calculations • CPU and GPU log implementations differ, so validated cost calculations by allowing for a tolerance of 5% between results • Did not have time to validate the overall algorithm or to visualize its outputs to see whether the co-clusters produced were reasonable
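A tolerance check of this kind could be as small as the sketch below. The slides do not say how the 5% was measured, so the relative-difference definition and the name `withinTolerance` are assumptions.

```cuda
#include <algorithm>
#include <cmath>

// True when the CPU and GPU costs differ by at most `tol` relative to the
// larger magnitude of the two (assumed definition of the 5% tolerance).
bool withinTolerance(double cpu_cost, double gpu_cost, double tol = 0.05) {
    double scale = std::max(std::fabs(cpu_cost), std::fabs(gpu_cost));
    return std::fabs(cpu_cost - gpu_cost) <= tol * scale;
}
```

The integer statistics, by contrast, can be compared for exact equality, since they do not depend on the log implementation.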

Timing Measurements • Time was measured by clock_t/CLOCKS_PER_SEC under CPU implementation • Measured by cuda events under GPU implementation
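The two timing methods could be sketched as follows. The kernel and the CPU loop are placeholder workloads (`busyKernel`, `timeGpuMs`, and `timeCpuSeconds` are hypothetical names), and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <ctime>
#include <vector>

// Placeholder workload standing in for the co-clustering kernels.
__global__ void busyKernel(float* data, int n) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        data[i] = data[i] * 2.0f + 1.0f;
}

// GPU side: elapsed milliseconds between two CUDA events around the launch.
float timeGpuMs(int n, int blocks, int threads) {
    float* d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    busyKernel<<<blocks, threads>>>(d_data, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until the kernel and stop event finish
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return ms;
}

// CPU side: clock() ticks divided by CLOCKS_PER_SEC.
double timeCpuSeconds(int n) {
    std::vector<float> data(n, 0.0f);
    clock_t t0 = clock();
    for (int i = 0; i < n; ++i) data[i] = data[i] * 2.0f + 1.0f;
    clock_t t1 = clock();
    return double(t1 - t0) / CLOCKS_PER_SEC;
}
```

Note that `clock()` reports processor time on POSIX systems while CUDA events report elapsed time on the GPU timeline, so the two numbers are comparable only when the CPU run is single-threaded and not waiting on I/O.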

Development Lessons Learned • CUDA and structured data is a bad idea • Even structs of arrays are impossible to deal with • Host-side pointer math on device pointers does not work • CUDA API has REALLY unfriendly error messages • Take care to do very, very little through that API • __device__ variables declared globally must be passed to kernels • Runtime errors otherwise • You can malloc and free memory in device code as of CUDA 3.2 (the allocation comes from a device heap in global memory, not from shared memory)

Development Lessons Learned Cont • Visual Studio CUDA integration leaves a lot to be desired • All optimizations removed, still can’t set breakpoints everywhere • Many variables show as freed • No in-IDE, real-time, in editor compile errors • But, Visual Studio does give nice auto-complete, auto-definition navigation • No CUDA linker => separate files must be directly #include’d

Experiment - Environment • Float.cs.drexel.edu • CPU: 4 quad-core Intel Xeon L5360 processors @ 2.13 GHz • GPU: 2 NVIDIA GeForce GTX 580 GPUs @ 1544 MHz

Experiment - Description • Sequential (CPU) and Parallel (GPU) tested on square matrices of order 100, 1000, and 10000 • Larger matrices caused memory problems • GPU tested with varying block and thread counts • Num blocks: 10, 100, 5000 • Num threads: 10, 100, 1024 (max) • Resulting co-clusters usually stayed in the 50-200 row/column group range, regardless of matrix order • Row and column groupings are important in the calculation of matrix statistics, rows and columns are blocked by these

Experiment Results [chart not reproduced; axis label: Num Threads]

Experiment Results • For a small number of blocks, 100-thread performance peaks at num_blocks * num_threads = matrix_order • I would expect this to be the optimal configuration when num_blocks ~= num_row_groups ~= num_col_groups • Slowdown occurs when matrix order exceeds total number of threads and more must be done serially

Experiment - Results [chart not reproduced; axis label: Num Threads]

Experiment Results • Interestingly, the maximum speedup was the same in all block counts • Roughly speaking, as long as num_blocks * num_threads >= matrix order, max speedup of ~70 is achieved • 10 threads never got there, due to block scheduling overhead? Possibly cost of copying to shared memory for block processing was not recouped in 10 thread case? • Maxing out thread count is counter-productive in smaller matrices • Hypothesis: When block count is excessive (as for small matrices), scheduling of large blocks of threads that return immediately is costly

Experiment Results • Efficiency is consistently highest for the smaller numbers of blocks and smaller numbers of threads within those blocks • Hypothesis: Overhead of starting blocks and threads must be high enough to result in diminishing returns when adding blocks and threads
