Presentation Transcript

  1. MapReduce As A Language for Parallel Computing
     Wenguang CHEN, Dehao CHEN (Tsinghua University)

  2. Future Architecture
     • Many alternatives:
       • A few powerful cores (Intel/AMD: 2, 3, 4, 6, ...)
       • Many simple cores (nVidia, ATI, Larrabee: 32, 128, 196, 256, ...)
       • Heterogeneous (CELL, 1 PPE + 8 SPEs; FPGA speedups, ...)
     • But programming them is not easy:
       • Each uses a different programming model; some are (relatively) easy, some are extremely difficult
         • OpenMP, MPI, MapReduce
         • CUDA, Brook
         • Verilog, SystemC

  3. What makes parallel computing so difficult?
     • Parallelism identification and expression
       • Automatic parallelization has failed so far
     • Complex synchronization may be required
       • Data races and deadlocks, which are difficult to debug (see the sketch below)
     • Load balance ...
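
     The synchronization pitfalls above are easy to demonstrate. Below is a
     minimal illustration (added for this transcript, not from the talk): two
     POSIX threads increment a shared counter without a lock, so updates are
     silently lost and the final value is usually wrong, yet the program
     compiles and appears to work.

        #include <pthread.h>
        #include <stdio.h>

        static long counter = 0;                 /* shared, unprotected    */

        static void *worker(void *arg)
        {
            (void)arg;
            for (int i = 0; i < 1000000; i++)
                counter++;                       /* racy read-modify-write */
            return NULL;
        }

        int main(void)
        {
            pthread_t t1, t2;
            pthread_create(&t1, NULL, worker, NULL);
            pthread_create(&t2, NULL, worker, NULL);
            pthread_join(t1, NULL);
            pthread_join(t2, NULL);
            printf("%ld\n", counter);            /* often far below 2000000 */
            return 0;
        }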

  4. Map-Reduce is promising
     • Can only solve a subset of problems
       • But an important and fast-growing subset, e.g. indexing
     • Easy to use
       • Programmers only need to write sequential code (see the word-count sketch below)
       • Perhaps the simplest practical parallel programming paradigm?
     • The dominant programming paradigm in Internet companies
     • Originally built for distributed systems; now ported to GPUs, CELL, and multicore
       • But many dialects exist, which hurts portability
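
     To make "programmers only need to write sequential code" concrete, here
     is a minimal word-count sketch. The emit_intermediate/emit hooks and all
     signatures are hypothetical placeholders (each Map-Reduce dialect names
     them differently); the point is that the user writes only two sequential
     functions and the runtime parallelizes everything else.

        /* Hypothetical runtime hooks; real dialects differ. */
        void emit_intermediate(const char *key, int key_len, int val);
        void emit(const char *key, int key_len, int val);

        /* Map: runs on one input chunk; emits (word, 1) per word. */
        void word_count_map(const char *doc)
        {
            const char *p = doc;
            while (*p) {
                while (*p == ' ')
                    p++;                          /* skip separators */
                const char *start = p;
                while (*p && *p != ' ')
                    p++;                          /* scan one word   */
                if (p > start)
                    emit_intermediate(start, (int)(p - start), 1);
            }
        }

        /* Reduce: runs once per distinct word; sums its counts. */
        void word_count_reduce(const char *word, int word_len,
                               const int *counts, int n)
        {
            int sum = 0;
            for (int i = 0; i < n; i++)
                sum += counts[i];
            emit(word, word_len, sum);
        }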

  5. Limitations on GPUs
     • GPUs rely on the CPU to allocate memory
       • How to support variable-length data?
         • Combine size and offset information with each key/value pair (see the record sketch below)
       • How to allocate output buffers on the GPU?
         • Two-pass scan: get the counts first, then do the real execution
     • Lack of lock support
       • How to synchronize to avoid write conflicts?
         • Memory is pre-allocated, so every thread knows where it should write
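
     One way to realize the "size and offset" idea above (a sketch with
     hypothetical names; Mars's actual record layout may differ): keys and
     values live in flat buffers allocated up front by the CPU, and each pair
     is a fixed-size record of coordinates into those buffers, so GPU threads
     never allocate memory or take locks.

        /* Fixed-size descriptor for one variable-length key/value pair. */
        typedef struct {
            int key_offset;   /* start of the key in the flat key buffer   */
            int key_size;     /* key length in bytes                       */
            int val_offset;   /* start of the value in the flat val buffer */
            int val_size;     /* value length in bytes                     */
        } kv_record_t;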

  6. MapReduce on Multi-core CPU (Phoenix [HPCA'07])
     Pipeline: Input → Split → Map → Partition → Reduce → Merge → Output

  7. MapReduce on GPU (Mars [PACT'08])
     Pipeline: Input → MapCount → Prefixsum → Allocate intermediate buffer on GPU → Map → Sort and Group → ReduceCount → Prefixsum → Allocate output buffer on GPU → Reduce → Output
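
     A host-side sketch of the MapCount → Prefixsum → Allocate steps above
     (hypothetical function; Mars actually runs the scan on the GPU): the
     per-thread counts from the first pass become disjoint write offsets, so
     the second pass writes without conflicts and without locks.

        #include <stdlib.h>

        /* Exclusive prefix sum over per-thread output sizes. Thread i later
         * writes its Map output starting at offsets[i]; *total_out is the
         * exact size of the intermediate buffer to allocate on the GPU. */
        int *offsets_from_counts(const int *counts, int n, int *total_out)
        {
            int *offsets = (int *)malloc(n * sizeof(int));
            int running = 0;
            for (int i = 0; i < n; i++) {
                offsets[i] = running;
                running += counts[i];
            }
            *total_out = running;
            return offsets;
        }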

  8. Program Example
     • Word Count (Phoenix Implementation)

        ...
        for (i = 0; i < args->length; i++) {
            curr_ltr = toupper(data[i]);
            switch (state) {
            case IN_WORD:
                data[i] = curr_ltr;
                if ((curr_ltr < 'A' || curr_ltr > 'Z') && curr_ltr != '\'') {
                    data[i] = 0;
                    emit_intermediate(curr_start, (void *)1,
                                      &data[i] - curr_start + 1);
                    state = NOT_IN_WORD;
                }
                break;
        ...

  9. Program Example
     • Word Count (Mars Implementation)

        __device__ void GPU_MAP_FUNC // (void *key, void *val, int keySize, int valSize)
        {
            ...
            do {
                ...
                if (*line != ' ')
                    line++;
                else {
                    line++;
                    GPU_EMIT_INTER_FUNC(word, &wordSize, wordSize - 1, sizeof(int));
                    while (*line == ' ')
                        line++;
                    wordSize = 0;
                }
            } while (*line != '\n');
            ...
        }

        __device__ void GPU_MAP_COUNT_FUNC // (void *key, void *val, int keySize, int valSize)
        {
            ...
            do {
                ...
                if (*line != ' ')
                    line++;
                else {
                    line++;
                    GPU_EMIT_INTER_COUNT_FUNC(wordSize - 1, sizeof(int));
                    while (*line == ' ')
                        line++;
                    wordSize = 0;
                }
            } while (*line != '\n');
            ...
        }

  10. Pros and Cons
      • Load balance
        • Phoenix: static + dynamic
        • Mars: static only; assigns the same amount of map/reduce work to every thread
      • Pre-allocation
        • Lock-free
        • But requires the two-pass scan, which is not an efficient solution
      • Sorting: the bottleneck of Mars
        • Phoenix uses insertion sort incrementally, while pairs are emitted
        • Mars uses bitonic sort: O(n log² n) work (see the sketch below)
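
      For reference, a sequential sketch of a bitonic sorting network
      (illustrative, not Mars's kernel): every sweep of the inner loop is a
      data-parallel, lock-free compare-and-swap pass, which suits GPUs, but
      O(log² n) such passes over n elements are needed, giving the
      O(n log² n) total work noted above.

         /* Bitonic sort; n must be a power of two. On a GPU, each i-loop
          * sweep would be one kernel launch (or one synchronized step). */
         void bitonic_sort(int *a, int n)
         {
             for (int k = 2; k <= n; k <<= 1)           /* sorted-run size  */
                 for (int j = k >> 1; j > 0; j >>= 1)   /* compare distance */
                     for (int i = 0; i < n; i++) {
                         int partner = i ^ j;
                         if (partner > i) {
                             int ascending = ((i & k) == 0);
                             if ((a[i] > a[partner]) == ascending) {
                                 int t = a[i];
                                 a[i] = a[partner];
                                 a[partner] = t;
                             }
                         }
                     }
         }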

  11. Map-Reduce as a Language, not a Library
      • Can we have a portable Map-Reduce that runs efficiently across different architectures?
      • Promising:
        • Map-Reduce already specifies the parallelism well
        • There is no complex synchronization in user code
      • But still difficult:
        • Different architectures provide different features
        • A pure library approach forces a choice between portability and performance
      • Use the compiler and runtime to hide architectural differences, as is done for high-level languages such as C

  12. Compiler, Library & Runtime
      [Diagram] Just as one C program is compiled for X86, Power, Sparc, ..., a single Map-Reduce program would be mapped by per-architecture libraries and runtimes onto clusters, general multicore CPUs, and GPUs.

  13. Case study on nVidia GPU
      • Portability
        • Host function support: annotate libc functions and inline them
        • Dynamic memory allocation: a big problem; simply forbid it in user code?
      • Performance
        • Memory hierarchy optimization (identifying global, shared, and read-only memory)
        • A typed language is preferable (e.g., int4 type acceleration, ...; see the kernel sketch below)
        • Dynamic memory allocation (again!)
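
      As an illustration of the "typed language" point, a small CUDA kernel
      (hypothetical, not from the talk): declaring the input as int4 lets each
      thread fetch four ints in a single 128-bit transaction, an optimization
      that is hard to derive automatically from untyped void* key/value
      buffers.

         /* Each thread loads one int4 (four ints at once) and reduces it. */
         __global__ void sum4(const int4 *in, int *out, int n4)
         {
             int i = blockIdx.x * blockDim.x + threadIdx.x;
             if (i < n4) {
                 int4 v = in[i];              /* one vectorized load */
                 out[i] = v.x + v.y + v.z + v.w;
             }
         }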

  14. More to explore
      • ...