
MapReduce As A Language for Parallel Computing


Presentation Transcript


  1. MapReduce As A Language for Parallel Computing. Wenguang CHEN, Dehao CHEN, Tsinghua University

  2. Future Architecture
  • Many alternatives:
    • A few powerful cores (Intel/AMD: 2, 3, 4, 6 …)
    • Many simple cores (nVidia, ATI, Larrabee: 32, 128, 196, 256 …)
    • Heterogeneous (CELL, 1/8; FPGA speedup …)
  • But programming them is not easy:
    • They all use different programming models; some are (relatively) easy, some are extremely difficult
    • OpenMP, MPI, MapReduce
    • CUDA, Brook
    • Verilog, SystemC

  3. What makes parallel computing so difficult?
  • Parallelism identification and expression
    • Automatic parallelization has failed so far
  • Complex synchronization may be required
    • Data races and deadlocks, which are difficult to debug
  • Load balance …

  4. Map-Reduce is promising
  • It can only solve a subset of problems
    • But an important and fast-growing subset, such as indexing
  • Easy to use
    • Programmers only need to write sequential code (see the sketch after this list)
    • Perhaps the simplest practical parallel programming paradigm?
  • The dominant programming paradigm in Internet companies
  • Originally built for distributed systems; now ported to GPUs, CELL, and multicore
    • But there are many dialects, which hurts portability
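  To make the "sequential code only" point concrete, here is a minimal word-count sketch of what a Map-Reduce user writes. The emit_intermediate/emit hooks echo the Phoenix-style code on slide 8, but the exact signatures here are illustrative assumptions, not any particular library's API:

    /* Hypothetical user code for word count: all of it is sequential.
       emit_intermediate()/emit() are assumed to be provided by the
       MapReduce runtime (illustrative prototypes shown). */
    extern void emit_intermediate(char *key, int keylen, int count);
    extern void emit(char *key, int keylen, int total);

    /* Called by the runtime once per input split. */
    void map(char *split, int len) {
        int i = 0;
        while (i < len) {
            while (i < len && split[i] == ' ') i++;    /* skip blanks   */
            int start = i;
            while (i < len && split[i] != ' ') i++;    /* scan one word */
            if (i > start)
                emit_intermediate(&split[start], i - start, 1);
        }
    }

    /* Called by the runtime once per distinct word, with all its counts. */
    void reduce(char *word, int wordlen, int *counts, int n) {
        int total = 0;
        for (int j = 0; j < n; j++)
            total += counts[j];
        emit(word, wordlen, total);
    }

  Note that there is no thread creation, locking, or partitioning anywhere in the user code; the runtime supplies all of it.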

  5. Limitations on GPUs
  • GPUs rely on the CPU to allocate memory
    • How to support variable-length data?
      • Combine size and offset information with the key/value pair
    • How to allocate output buffers on the GPU?
      • Two-pass scan: get the counts first, then do the real execution (a sketch follows this list)
  • Lack of lock support
    • How to synchronize to avoid write conflicts?
      • Memory is pre-allocated, so every thread knows where it should write
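  A minimal CUDA sketch of that two-pass pattern (names such as bytes_emitted_by and emit_at are hypothetical stand-ins for user code, and the prefix sum runs on the CPU for brevity; this is not the actual Mars API):

    /* Stub user functions; a real word count would measure and write
       actual key/value pairs here (hypothetical, for illustration). */
    __device__ int  bytes_emitted_by(int x) { return (int)sizeof(int); }
    __device__ void emit_at(char *dst, int x) { *(int *)dst = x; }

    /* Pass 1: each thread only counts the bytes it will output. */
    __global__ void count_pass(const int *in, int n, int *counts) {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        if (t < n)
            counts[t] = bytes_emitted_by(in[t]);
    }

    /* Host side: an exclusive prefix sum over the counts turns them
       into per-thread starting offsets in the output buffer. */
    void exclusive_scan(const int *counts, int *offsets, int n) {
        int sum = 0;
        for (int i = 0; i < n; i++) { offsets[i] = sum; sum += counts[i]; }
    }

    /* Pass 2: lock-free writes, because every thread owns a disjoint
       range of the pre-allocated output buffer. */
    __global__ void write_pass(const int *in, int n,
                               const int *offsets, char *out) {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        if (t < n)
            emit_at(out + offsets[t], in[t]);
    }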

  6. MapReduce on Multi-core CPU (Phoenix [HPCA'07])
  Pipeline: Input → Split → Map → Partition → Reduce → Merge → Output

  7. MapReduce on GPU (Mars [PACT'08])
  Pipeline: Input → MapCount → Prefix sum → Allocate intermediate buffer on GPU → Map → Sort and Group → ReduceCount → Prefix sum → Allocate output buffer on GPU → Reduce → Output

  8. Program Example
  • Word Count (Phoenix implementation), the core of the user map function:

      …
      for (i = 0; i < args->length; i++)
      {
         curr_ltr = toupper(data[i]);
         switch (state)
         {
         case IN_WORD:
            data[i] = curr_ltr;
            /* a non-letter (other than an apostrophe) ends the word */
            if ((curr_ltr < 'A' || curr_ltr > 'Z') && curr_ltr != '\'')
            {
               data[i] = 0;   /* NUL-terminate the word in place */
               emit_intermediate(curr_start, (void *)1,
                                 &data[i] - curr_start + 1);
               state = NOT_IN_WORD;
            }
            break;
      …

  9. Program Example
  • Word Count (Mars implementation): note that the traversal logic must be written twice, once for the count pass and once for the real map pass:

      /* Map pass: emits the actual intermediate key/value pairs */
      __device__ void GPU_MAP_FUNC //(void *key, void *val, int keySize, int valSize)
      { …
        do { …
          if (*line != ' ') line++;
          else {
            line++;
            GPU_EMIT_INTER_FUNC(word, &wordSize, wordSize - 1, sizeof(int));
            while (*line == ' ') line++;
            wordSize = 0;
          }
        } while (*line != '\n');
      … }

      /* Count pass: identical traversal, but only records output sizes */
      __device__ void GPU_MAP_COUNT_FUNC //(void *key, void *val, int keySize, int valSize)
      { …
        do { …
          if (*line != ' ') line++;
          else {
            line++;
            GPU_EMIT_INTER_COUNT_FUNC(wordSize - 1, sizeof(int));
            while (*line == ' ') line++;
            wordSize = 0;
          }
        } while (*line != '\n');
      … }

  10. Pros and Cons
  • Load balance
    • Phoenix: static + dynamic scheduling
    • Mars: static only; it assigns the same amount of map/reduce work to each thread
  • Pre-allocation
    • Lock-free, but requires the two-pass scan, which is not an efficient solution
  • Sorting: the bottleneck of Mars
    • Phoenix sorts incrementally, using insertion sort as pairs are emitted
    • Mars uses bitonic sort, which costs O(n log² n) comparisons (a sketch of the pattern follows this list)
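  For reference, the compare-exchange pattern behind that cost, as a minimal host-side sketch (our illustration, not the Mars kernel code): log n phases of up to log n sweeps, each touching all n elements, which is where O(n log² n) comes from. On the GPU, each sweep maps naturally to one kernel launch with one thread per element.

    #include <stdio.h>

    /* Bitonic sort for a power-of-two array. */
    void bitonic_sort(int *a, int n) {
        for (int k = 2; k <= n; k <<= 1)          /* log n phases           */
            for (int j = k >> 1; j > 0; j >>= 1)  /* log n sweeps per phase */
                for (int i = 0; i < n; i++) {     /* n compare-exchanges    */
                    int partner = i ^ j;
                    /* sort ascending if bit k of i is 0, else descending */
                    int up = ((i & k) == 0);
                    if (partner > i &&
                        (( up && a[i] > a[partner]) ||
                         (!up && a[i] < a[partner]))) {
                        int t = a[i]; a[i] = a[partner]; a[partner] = t;
                    }
                }
    }

    int main(void) {
        int a[8] = { 5, 1, 7, 3, 2, 8, 6, 4 };
        bitonic_sort(a, 8);
        for (int i = 0; i < 8; i++) printf("%d ", a[i]);  /* 1 2 3 4 5 6 7 8 */
        printf("\n");
        return 0;
    }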

  11. Map-Reduce as a Language, not a Library
  • Can we have a portable Map-Reduce that runs efficiently across different architectures?
  • Promising:
    • Map-Reduce already specifies the parallelism well
    • No complex synchronization in user code
  • But still difficult:
    • Different architectures provide different features, leading to either portability or performance problems
    • Use the compiler and runtime to hide the architectural differences, as is done for high-level languages such as C

  12. Compiler, Library & Runtime
  (Diagram: an analogy between the C tool chain and a Map-Reduce tool chain.)
  • C: one source language, with a compiler, library, and runtime per target (X86, Power, Sparc, …)
  • Map-Reduce: one source language, with a cluster library & runtime targeting clusters, a multicore library & runtime targeting general multicore, and a GPU library & runtime targeting GPUs

  13. Case study on nVidia GPU
  • Portability
    • Host function support: annotate libc functions and inline them
    • Dynamic memory allocation: a big problem; simply disallow it in user code?
  • Performance
    • Memory hierarchy optimization (identifying global, shared, and read-only memory)
    • A typed language is preferable (e.g., int4 type acceleration; see the sketch after this list)
    • Dynamic memory allocation (again!)
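  A small CUDA illustration of the int4 point (our sketch, not the slides' compiler output): with vector types, each thread moves 128 bits per memory transaction instead of 32, and a typed source language gives the compiler the information to do this safely.

    /* Scalar copy: each thread issues one 32-bit load and store. */
    __global__ void copy_scalar(const int *in, int *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];
    }

    /* Vectorized copy: each thread moves four ints in a single
       128-bit transaction, making better use of memory bandwidth.
       n4 is the element count divided by 4. */
    __global__ void copy_vec4(const int4 *in, int4 *out, int n4) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n4)
            out[i] = in[i];
    }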

  14. More to explore • …
