## Co-clustering using CUDA


**Co-Clustering Explained**

- Problem:
  - Large binary matrix of samples (rows) and features (columns)
  - Which samples should be grouped together? Why?
  - What are the shared features?
- Co-clustering gives you the “why” explicitly
  - Correlated sample/feature pair
- Example (figure): row cluster: s1 and s3 are in a group; column cluster: the distinguishing features are 2, 3, and 5

**Co-Clustering - Details**

- Uses Information Theoretic Co-clustering, as parallelized for the Hadoop architecture in:
  - Disco: Distributed co-clustering with Map-Reduce: A case study towards petabyte-scale end-to-end mining, Papadimitriou et al., Data Mining 2008
- Partition the entire matrix into row groups and column groups
- Minimize the length of the encoding of the resulting partitioned matrix
  - Competing code-length factors: number of row groups and column groups, homogeneity of clusters
- Iterate over the rows, rearranging and sub-partitioning to find a better encoding using a heuristic
- Repeat for columns, then rows again, until a local optimum is found
- Complexity: `O(n * fp * (row_groups + col_groups)^2 * iters)`

Credit: Chakrabarti et
al., KDD 2004

**Implementation - Basics**

- Initial matrix generation: CPU
- Initial random row/column group assignment: CPU
- Memory structures are very simple: arrays of ints

**Implementation – Stats step 1**

- Statistics calculations:
  - Calculates statistics for each row of each column group
  - The statistic is the number of 1’s in a column group
  - Straightforward parallelization (each thread works on one row at a time), using global memory
- Example (figure): column group assignments 2, 3, 1, 3, 2; row group assignments 3, 5, 1, 1, 4; Stat(Row 3, ColumnGroup 3) = 1

**Room For Improvement**

- Calculate row statistics according to the histogram algorithm from the textbook
  - Block the columns
  - Assign one thread block to each block
  - Compute shared memory histograms within each block
  - Merge back to global memory when finished

**Implementation – Stats step 2**

- Calculates the cost for each row group of each column group
- Essentially a reduce on the per-row data
  - Block the rows, assign each block to a thread block
  - Use shared memory and atomics to build a histogram of all rows in a given row group
  - Merge the shared histogram with the global histogram for that row group
  - Iterate over all row groups
- Example (figure): column group assignments 2, 3, 1, 3, 2; row group assignments 3, 5, 1, 1, 4; Stat(RowGroup 1, ColumnGroup 3) = 2

**Implementation – Row/Col Group Optimization**

- For each row, find the optimal group it could belong to
  - Parallelized straightforwardly: one row per thread, loop and stride to cover all rows
  - Each row calculation goes through all row groups and determines the global cost of moving to that row group
- Move all rows to their optimal group
- Recompute statistics
- Repeat for column groups
- Continue alternating row/column groupings until convergence

**Room For Improvement**

- Parallelization could be more sophisticated
  - Could block the rows and compute the cost of the row joining each row group in parallel
  - Could use shared memory atomics to identify the minimum cost
- In practice, this algorithm heavily favors a small number of row and column groups
  - The available parallelism would therefore be small

**Implementation Outer
Loop**

- After a local minimum is found, change the initial number of row and column groups and retry
  - Change the number of row groups or the number of column groups, up or down
  - Continue changing the number of row or column groups in that direction until the cost fails to decrease
  - Try both directions in both dimensions before stopping
- Outer loop performed on the CPU

**Room for Improvement**

- Outer loop could parallelize inner loop actions over different GPUs
  - Each could explore the different dimensions and directions in parallel

**Implementation – CPU + Validation**

- CPU implementation performed all steps described earlier, but sequentially
- Validation
  - Used the CPU implementation of the statistics calculations to validate the GPU statistics calculations
  - CPU and GPU log implementations differ, so cost calculations were validated by allowing a tolerance of 5% between results
  - Did not have time to validate the overall algorithm or to visualize its outputs to see whether the co-clusters produced were reasonable

**Timing Measurements**

- Time was measured with `clock_t`/`CLOCKS_PER_SEC` in the CPU implementation
- Time was measured with CUDA events in the GPU implementation

**Development Lessons Learned**

- CUDA and structured data is a bad idea
  - Even structs of arrays are impossible to deal with
  - Host-side pointer math on device pointers does not work
- CUDA API has REALLY unfriendly error messages
  - Take care to do very, very little through that API
- `__device__` variables declared globally must be passed to kernels
  - Runtime errors otherwise
- As of CUDA 3.2 you can call `malloc` and `free` in device code; the allocations come from the device heap in global memory, not from shared memory

**Development Lessons Learned Cont**

- Visual Studio CUDA integration leaves a lot to be desired
  - Even with all optimizations removed, you still can’t set breakpoints everywhere
  - Many variables show as freed
  - No in-IDE, real-time, in-editor compile errors
- But Visual Studio does give nice auto-complete and go-to-definition navigation
- No CUDA linker, so separate files must be directly `#include`’d

**Experiment - Environment**

- Float.cs.drexel.edu
- CPU: 4 quad-core Intel Xeon L5360 processors @ 2.13 GHz
- GPU: 2 Nvidia GeForce GTX 580 GPUs @ 1544 MHz

**Experiment - Description**

- Sequential (CPU) and parallel (GPU) versions tested on square matrices of order 100, 1000, and 10000
  - Larger matrices caused memory problems
- GPU tested with varying block and thread counts
  - Num blocks: 10, 100, 5000
  - Num threads: 10, 100, 1024 (max)
- Resulting co-clusters usually stayed in the 50-200 row/column group range, regardless of matrix order
- Row and column groupings are important in the calculation of matrix statistics: rows and columns are blocked by these

**Experiment Results**

(Chart not recovered; axis label: Num Threads.)

- For a small number of blocks, 100-thread performance peaks at `num_blocks * num_threads = matrix_order`
- I would expect this to be the optimal configuration when `num_blocks ~= num_row_groups ~= num_col_groups`
- Slowdown occurs when the matrix order exceeds the total number of threads and more must be done serially

**Experiment Results**

(Two charts not recovered; axis label: Num Threads.)

- Interestingly, the maximum speedup was the same for all block counts
- Roughly speaking, as long as `num_blocks * num_threads >= matrix_order`, a max speedup of ~70 is achieved
- The 10-thread configurations never got there. Due to block scheduling overhead? Possibly the cost of copying to shared memory for block processing was not recouped in the 10-thread case?
- Maxing out the thread count is counter-productive for smaller matrices
- Hypothesis: when the block count is excessive (as for small matrices), scheduling large blocks of threads that return immediately is costly

**Experiment Results**

(Three charts not recovered; axis label: Num Threads.)

- Efficiency is consistently highest for the smaller numbers of blocks and the smaller numbers of threads within those blocks
- Hypothesis: the overhead of starting blocks and threads must be high enough to result in diminishing returns when adding blocks and threads
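
The thread-count observations above can be made concrete with a little arithmetic. The sketch below is not taken from the original implementation: `serial_passes` and `efficiency` are hypothetical helper names, and efficiency is assumed here to mean speedup divided by the total number of threads launched, which the slides do not define. It also assumes one row per thread per pass, as in the loop-and-stride scheme described earlier.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: number of rows each thread must process one after
// another when matrix_order rows are spread over num_blocks * num_threads
// threads in a loop-and-stride scheme (assumption: one row per thread per pass).
std::int64_t serial_passes(std::int64_t matrix_order,
                           std::int64_t num_blocks,
                           std::int64_t num_threads) {
    assert(matrix_order > 0 && num_blocks > 0 && num_threads > 0);
    const std::int64_t total_threads = num_blocks * num_threads;
    // Ceiling division: a partial final pass still costs a full pass.
    return (matrix_order + total_threads - 1) / total_threads;
}

// Assumed definition (not stated in the slides):
// efficiency = speedup / total number of threads launched.
double efficiency(double speedup,
                  std::int64_t num_blocks,
                  std::int64_t num_threads) {
    assert(num_blocks > 0 && num_threads > 0);
    return speedup / static_cast<double>(num_blocks * num_threads);
}
```

For a matrix of order 10000, `serial_passes(10000, 10, 100)` gives 10 passes per thread, while `serial_passes(10000, 100, 100)` gives a single pass. If the speedup is capped near the ~70 reported above, then under the assumed definition every thread added beyond that point only lowers `efficiency`, which is consistent with the smaller configurations scoring best.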