Weekly Report- Kmeans

Weekly Report-Kmeans Ph.D. Student: Leo Leedate: Nov. 13, 2009

Outline • K-means • CPU-based algorithm workflow; • Reading Kaiyong’s code; • Some naïve thoughts; • Work plan

K-means • CPU-based algorithm workflow; N data and K centers, dim dimension; Compute D[N][K] Compute MinD[N] Compute NewCenter[K] If NewCenter == center No Yes

K-means • Pseudocode: While(!bFlag && ++i <= nIterationsTime) { ComputeDis(&dis, data, centers); FindMinDis(dis, &index); ComputeNewCen(&newCen, data, index); if(newCen-centers < b) bFlag = true else centers = newCen; }

K-means • Since each iteration relays on the previous one, we would sequential run each iteration but parallel each function inside the iteration. • Compute distance; • Find the nearest center; • Computer new centers;

K-means-compute the distance • Data[N][Dim], Centers[Dim][K] • Dis[N][K] • Nearly the same as Matrix mulitipication • Only replace A[i][k]*B[K][j]->(A[i][k]-B[K][j])2 • Using the so called tiles, increase the compute to memory access ratio Centers Data Distances

K-means-compute the distance • From Kaiyong • dim3 threads(16, 2, 1); • dim3 grid(k/32, n/32, 1); • ComputeDistance<32,32><<<grid,threads>>>(…) • template <unsigned int B_WIDTH, unsigned int C_HIGH> __global__ void ComputeDistance(….) • { • }

K-means-compute the distance • float* indexQ = Query + threadIdx.x + (blockIdx.y*C_HIGH + threadIdx.y) * dim; • float* indexR = Ref + blockIdx.x*B_WIDTH + threadIdx.y * blockDim.x + threadIdx.x; • float* indexC = C + blockIdx.x * C_HIGH + threadIdx.y * blockDim.x + threadIdx.x + blockIdx.y *C_HIGH* wB ;

K-means-compute the distance • __shared__ float as[16][C_HIGH+1]; • Do • for(int i = 0; i < C_HIGH; i += 2) • as[threadIdx.x][threadIdx.y + i] = indexQ[i*dim]; • indexQ += 16; • __syncthreads(); • for(int i = 0; i < 16; i++, indexR += wB) • for( int j = 0; j < C_HIGH; j++) • { • c_temp = indexR[0]-as[i][j]; • c[j] += c_temp*c_temp; • } • __syncthreads(); • while(indexQ < Alast);

K-means-compute the distance for(int i = 0; i < C_HIGH; i++, indexC += wB) { indexC[0] = c[i]; }

K-means - compute the distance • Questions • Why template <unsigned int B_WIDTH, unsigned int C_HIGH>, not parameters? • Why load sub matrix in that way? Sth to do with WARP? If we use 16*16 tile instead of 32*32, should load method change? • This algorithm is nearly the same as the so-called most efficient Matrix Mulitiplication. • Thread(16, 4), grid(wc/4, hc/16)

Compute the distance • Very useful in data mining • K-means; • K-nn; • Hieratical clustering; • …

K-means-Find the nearest center K N • N Reductions • Sum, max, min… • Sequential addressing • Completely unroll • n/logn threads, logn steps;

K-means-Find the nearest center dim3 threads_find(16,1,1); dim3 grid_find(1, data_height, 1); template <unsigned int blockSize>__global__ voidcpu_FindSmallDistance( float* Dist, int* D_index, int k) { __shared__ float sdata[blockSize]; __shared__ int d_index[blockSize]; // perform first level of reduction, reading from g-memory, writing to s-memory unsigned int tid = threadIdx.x; unsigned int i = blockSize + threadIdx.x; float* p_data= Dist + blockIdx.y*k; sdata[tid] = p_data[tid]; d_index[tid] = tid; if (i < k) if( sdata[tid] > p_data[i]){ sdata[tid] = p_data[i]; d_index[tid] = i; } EMUSYNC;

K-means-Find the nearest center if( sdata[tid] > sdata[tid + 8] ) {sdata[tid] = sdata[tid + 8]; d_index[tid] = d_index[tid+8];} EMUSYNC; if( sdata[tid] > sdata[tid + 4] ) {sdata[tid] = sdata[tid + 4]; d_index[tid] = d_index[tid+4];} EMUSYNC; if( sdata[tid] > sdata[tid + 2] ) {sdata[tid] = sdata[tid + 2]; d_index[tid] = d_index[tid+2];} EMUSYNC; if( sdata[tid] > sdata[tid + 1] ) {d_index[tid] = d_index[tid+1];} EMUSYNC; // write result for this block to global mem if (tid == 0) D_index[blockIdx.y] = d_index[0]; } • Since the K is presumed to be equal or smaller than 32, this implementation is optimal.

K-means-Computer new centers; • CPU-based Algorithm • For each Data • C[ Index[i] ] += Data[i] • Not direct addressing • Not as beautiful as Matrix Mul and Reduction • // 把最近的点都加起来，分成100组，每组512个 • dim3 threads_collect(32,1,1); • dim3 grid_collect(100, 1, 1); Index dim N

K-means-Computer new centers; __shared__ int index[32]; __shared__ float as[32*34]; int idx = threadIdx.x; float* p_Data= Data + blockIdx.x * GroupSize*Dim; int* p_index= D_index + blockIdx.x * GroupSize; int* p_index_last= p_index + GroupSize; float* p_centro= Centro + blockIdx.x * k*Dim; int* p_counter= Counter + blockIdx.x*32; int index_i = 0; int centro_count = 0; // initial shared mem centero if(idx < k) for( int i = 0; i < Dim; i++) as[i*k+idx] = 0; EMUSYNC; Index dim N

K-means-Computer new centers; //每次取32个index，所以如果一共有512个数据，，512/32=16轮 for(; p_index < p_index_last; p_index += blockDim.x) { //取32个index放入到shared mem中，这样可以让IO结合 index[idx] = p_index[idx]; EMUSYNC; //循环次，每次计算一个数据 for(int i = 0; i < 32; i++, p_Data += Dim) { index_i = index[i]; // 每一个thread对应一个centro，所以当处理一个对应的数据时，这里++ if(idx == index_i) centro_count++; for(int j = idx; j < Dim; j += blockDim.x) { as[j*k+index_i] += p_Data[j]; } EMUSYNC; } } Index dim N

K-means-Computer new centers; • //只有在centro范围内的线程才回写数据，每个线程里面负责一个centro if(idx < k) { p_counter[idx] = centro_count; for(int j = 0; j < Dim; j++, p_centro += k) { p_centro[idx] = as[j*k+idx]; } } • Now we got 100 Dim*K Matrix • Matrix adding-reduction, each element is a matrix. • Kaiyong’s code reduces to 10, and gets the final. Dim Index N

Work plan - K-means • Test the program • Each function, GPU VS CPU; • Compare with other papers.

Work plan • K-means; • Learn data mining, prepare for final exam; • Go on reading parallel computing books.

Thanks

Weekly Report- Kmeans

Weekly Report- Kmeans

Presentation Transcript

Weekly Report

Weekly Report

Weekly Report

Weekly Report

Weekly Report

Weekly Report

Weekly Report

Weekly Report

ARA Weekly Report

Weekly Report

Weekly Report

Weekly Report

Weekly Report

Weekly Report

SOEPD Weekly Report

Weekly Progress Report

Weekly report

WEEKLY PROGRESS REPORT

Weekly Report

Weekly Report

Weekly Progress Report

Weekly Equity Report