Lecture 4: MapReduce Software Frameworks and CUDA GPU Architectures
From MapReduce to Hadoop and Spark • MapReduce is a software framework • Designed for bipartite graph computing • Built with a master-worker model • Supports parallel and distributed computing on large data sets • Abstracts the data flow of running a parallel program on a distributed computing system • By providing users with two interfaces in the form of two functions, i.e., Map and Reduce • Users can override these two functions to interact with and manipulate the data flow of running the programs
From MapReduce to Hadoop and Spark (cont.) • MapReduce offers dynamic execution, fault tolerance, and easy-to-use APIs • Performs Map and Reduce functions in a pipelined fashion • The MapReduce software framework was first proposed and implemented by Google • Google's MapReduce paradigm is written in C++ • Evolved from use in a search engine to the Google App Engine cloud • Initially, Google's MapReduce was applied only in fast search engines • Then MapReduce enabled cloud computing
From MapReduce to Hadoop and Spark (cont.) • Apache Hadoop has made MapReduce • Practical for big data processing on large server clusters or clouds • Apache Spark frees up many of the constraints imposed by MapReduce and Hadoop programming • In general-purpose batch or streaming applications • The MapReduce framework is only for batch processing of large data sets • It deals with a static data set that does not change during execution • Streaming data or real-time data cannot be handled well in batch mode
The MapReduce Compute Engine • The MapReduce framework provides an abstraction layer for data and control flow • Users see the logical data flow from the Map stage to the Reduce stage • The underlying control flow is hidden from users
The MapReduce Compute Engine (cont.) • The MapReduce library is essentially the controller of the MapReduce pipeline • Coordinates the data flow from the input end to the output end in a synchronous manner • The API tools provide an abstraction that hides the MapReduce software framework from random user intervention • The data flow in a MapReduce framework is predefined • Data partitioning, mapping and scheduling, synchronization, communication, and output of results • Partitioning is controlled in user programs • By specifying the partitioning block size and data fetch patterns
The MapReduce Compute Engine (cont.) • The abstraction layer provides two well-defined interfaces in two functions: Map and Reduce • These mapper and reducer functions can be defined by the user to achieve specific objectives • The user overrides the Map and Reduce functions • The Map and Reduce functions take a specification object, called Spec • First initialized inside the user's program • The user writes code to fill it with the names of input and output files, as well as other tuning parameters • Also filled with the names of the Map and the Reduce functions • The user then invokes the provided MapReduce(Spec, &Results) function from the library to start the flow of data
The MapReduce Compute Engine (cont.) • The overall structure of a user’s MapReduce program
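A minimal, self-contained C++ sketch of this program structure, assuming a toy in-memory framework: the Spec object and the MapReduce(Spec, &Results) entry point mirror the description above, but every name here is an illustrative stand-in, not Google's actual API.

    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    using KV = std::pair<std::string, int>;

    // The specification object, first initialized inside the user's program.
    // A real Spec would also carry input/output file names and tuning parameters.
    struct Spec {
      std::vector<std::pair<std::string, std::string>> inputs;  // input (key, value) pairs
      std::vector<KV> (*map)(const std::string& key, const std::string& value);
      KV (*reduce)(const std::string& key, const std::vector<int>& values);
    };

    // Library entry point: runs Map on every input pair, sorts and groups the
    // intermediate pairs by key, then runs Reduce on each group.
    bool MapReduce(const Spec& spec, std::vector<KV>* results) {
      std::map<std::string, std::vector<int>> groups;  // sorted, grouped intermediates
      for (const auto& [key, value] : spec.inputs)
        for (const auto& [k, v] : spec.map(key, value))
          groups[k].push_back(v);
      for (const auto& [k, vs] : groups)
        results->push_back(spec.reduce(k, vs));
      return true;
    }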
Logical Dataflow • The input data to both the Map and the Reduce functions has a particular structure • The same goes for the output data • The input data to the Map function is arranged in the form of (key, value) pairs • The value is the actual data • The key part is only used to control the data flow • e.g., the key is the line offset within the input file and the value is the content of the line • The output data from the Map function is structured as (key, value) pairs • Called intermediate (key, value) pairs
Logical Dataflow (cont.) • The Map function processes each input (key, value) pair • To produce a few intermediate (key, value) pairs • The aim is to process all input (key, value) pairs to the Map function in parallel • e.g., the Map function emits each word w plus an associated count of occurrences • Just a 1 is recorded per occurrence in the pseudo-code sketched below • The Reduce function receives the intermediate (key, value) pairs
Logical Dataflow (cont.) • In the form of a group of intermediate values (key, [set of values]) associated with one intermediate key • The MapReduce framework forms these groups by first sorting the intermediate (key, value) pairs • Then grouping values with the same key • Sorting the data is done to simplify the grouping process • The Reduce function processes each (key, [set of values]) group • Produces a set of (key, value) pairs as output • e.g., the Reduce function merges the word counts emitted by different Map workers • Into a total count as output
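The word-count Map and Reduce functions referenced above, written as a minimal C++ sketch that plugs into the Spec framework sketched earlier (the signatures are illustrative, not a fixed API):

    #include <sstream>
    #include <string>
    #include <utility>
    #include <vector>

    using KV = std::pair<std::string, int>;

    // Map: the input key is the line offset, the value is the line content.
    // Emits one intermediate (word, 1) pair per word occurrence.
    std::vector<KV> Map(const std::string& /*offset*/, const std::string& line) {
      std::vector<KV> out;
      std::istringstream words(line);
      std::string w;
      while (words >> w) out.emplace_back(w, 1);  // just a 1 per occurrence
      return out;
    }

    // Reduce: receives one intermediate key (a word) and its grouped values,
    // and merges the counts into a single total.
    KV Reduce(const std::string& word, const std::vector<int>& counts) {
      int total = 0;
      for (int c : counts) total += c;
      return {word, total};
    }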
Logical Dataflow (cont.) • Word Count Using MapReduce over a Partitioned Data Set • One of the well-known MapReduce problems • The word count problem for a simple input file containing only two lines • most people ignore most poetry • most poetry ignores most people • The Map function simultaneously produces a number of intermediate (key, value) pairs for each line of content • Each word is the intermediate key with 1 as its intermediate value, e.g., (ignore, 1) • The MapReduce library collects all the generated intermediate (key, value) pairs
Logical Dataflow (cont.) • Sorts them to group the 1s for identical words, e.g., (people, [1,1]) • Groups are then sent to the Reduce function in parallel • It can sum up the 1 values for each word • Generate the actual number of occurrences for each word in the file, e.g., (people, 2)
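Wiring the two functions above into the earlier Spec sketch with the two poem lines as input reproduces this example end to end (the line-offset keys "0" and "31" are made up for illustration):

    #include <cstdio>

    // Uses Spec, MapReduce, Map, and Reduce from the sketches above.
    int main() {
      Spec spec;
      spec.inputs = {{"0",  "most people ignore most poetry"},
                     {"31", "most poetry ignores most people"}};
      spec.map = Map;
      spec.reduce = Reduce;

      std::vector<KV> results;
      MapReduce(spec, &results);
      for (const auto& [word, count] : results)
        std::printf("(%s, %d)\n", word.c_str(), count);
      // prints: (ignore, 1) (ignores, 1) (most, 4) (people, 2) (poetry, 2)
    }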
Logical Dataflow (cont.) • Hadoop Implementation of a MapReduce WebVisCounter Program • WebVisCounter counts the number of times that users connect to or visit a given website using a particular operating system • The input data is a typical web server log file
Logical Dataflow (cont.) • Data flow in WebVisCounter program execution • The Map function parses each line to extract the type of OS used as a key and assigns the value 1 to it • The Reduce function in turn sums up the number of 1s for each unique key
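A hedged C++ sketch of the WebVisCounter Map function, assuming Apache-style log lines whose user-agent field names the OS in parentheses, e.g. "Mozilla/4.08 (Windows 98; I)"; real log formats vary, so the parsing here is illustrative only. The same summing Reduce as in word count completes the job.

    #include <string>
    #include <utility>
    #include <vector>

    // Map for WebVisCounter: emits (OS, 1) for the first parenthesized token
    // of the user-agent field in the log line (an assumed, simplified format).
    std::vector<std::pair<std::string, int>> MapVisit(const std::string& /*offset*/,
                                                      const std::string& log_line) {
      std::vector<std::pair<std::string, int>> out;
      auto open = log_line.find('(');
      auto close = log_line.find_first_of(";)", open);
      if (open != std::string::npos && close != std::string::npos)
        out.emplace_back(log_line.substr(open + 1, close - open - 1), 1);  // (OS, 1)
      return out;
    }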
Logical Dataflow (cont.) • Each Map server applies the Map function to each input data split • Many mapper functions run concurrently on hundreds or thousands of machine instances • Many intermediate key-value pairs are generated • Stored on local disks for subsequent use • The original MapReduce is slow on large clusters • Due to disk-based handling of intermediate results • The Reduce server collates the values using the Reduce function • The reducer function can be max, min, average, the dot product of two vectors, etc.
Formal MapReduce Model • The Map function is applied in parallel to every input (key, value) pair • Produces a new set of intermediate (key, value) pairs • The MapReduce library collects all the produced intermediate pairs from all input pairs • Sorts them based on the key part • Groups the values of all occurrences of the same key • The Reduce function is applied in parallel to each group • To produce the collection of values as output
Formal MapReduce Model (cont.) • After grouping all the intermediate data • The values of all occurrences of the same key are sorted and grouped together • Each key becomes unique in all intermediate data • Finding unique keys is the starting point to solving a typical MapReduce problem • The intermediate (key, value) pairs, as the output of the Map function, are then produced automatically • Examples of how to define keys and values • Counting the number of occurrences of each word in a collection of documents, as in the above example
Formal MapReduce Model (cont.) • Count the number of occurrences of anagrams in a collection of documents • Anagrams are words that are formed by rearranging the letters of another word • e.g., listen can be reworked into the word silent • The unique keys are an alphabetically sorted sequence of letters for each word, e.g., eilnst (see the sketch below) • The intermediate value is the number of occurrences • The main responsibility of the MapReduce framework • To efficiently run a user's program on a distributed computing system • It carefully handles all partitioning, mapping, synchronization, communication, and scheduling details of such data flows
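A small C++ sketch of the key construction for the anagram count: the Map function can emit (AnagramKey(w), 1) for each word w, and the same summing Reduce as in word count then applies.

    #include <algorithm>
    #include <string>

    // Builds the unique key for the anagram count: the alphabetically sorted
    // letters of the word, so "listen" and "silent" both map to "eilnst".
    std::string AnagramKey(std::string word) {
      std::sort(word.begin(), word.end());
      return word;
    }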
Formal MapReduce Model (cont.) • The intermediate (key, value) pairs produced are partitioned into R regions • R is equal to the number of Reduce tasks • This guarantees that (key, value) pairs with identical keys are stored in the same region • Reduce workers may face network congestion • Caused by the reduction or merging operations performed
Compute-Data Locality • The MapReduce implementation takes advantage of the Google File System (GFS) as the underlying layer • MapReduce can perfectly adapt itself to GFS • GFS is a distributed file system • Files are divided into fixed-size blocks (chunks) • Blocks are distributed and stored on cluster nodes • The MapReduce library splits the input data (files) into fixed-size blocks • Ideally performs the Map function in parallel on each block
Compute-Data Locality (cont.) • GFS has already stored files as a set of blocks • MapReduce just needs to send a copy of the user's program containing the Map function to the nodes where the data blocks are already stored • This is the notion of sending computation toward data rather than sending data toward computation
MapReduce for Parallel Matrix Multiplication • In multiplying two n×n matrices A = (aij) and B = (bij) • Need to perform n² dot product operations to produce an output matrix C = (cij) • Each dot product produces an output element cij = ai1 × b1j + ai2 × b2j + ∙ ∙ ∙ + ain × bnj • Corresponding to the i-th row vector of matrix A multiplied by the j-th column vector of matrix B • Mathematically, each dot product takes n multiply-and-add time units to complete • The total matrix multiply complexity equals n × n² = n³, since there are n² output elements • In theory, the n² dot products are totally independent of each other
MapReduce for Parallel Matrix Multiplication (cont.) • Can be done on n² servers in n time units • When n is very large, say millions or higher • Too expensive to build a cluster with n² servers • In practice, only N << n² servers are used • The ideal speedup is then expected to be N • MapReduce Multiplication of Two Matrices • Apply the MapReduce method to multiply two 2×2 matrices: A = (aij) and B = (bij) • With two mappers and one reducer
MapReduce for Parallel Matrix Multiplication (cont.) • Map the first and second rows of matrix A, along with the entire matrix B, to the first and second Map servers, respectively • Four keys are used to identify the four blocks of data processed • K11, K12, K21, and K22 • Simply denoted by the matrix element indices • Partition matrix A and matrix BT by rows into two blocks, horizontally • BT is the transpose of matrix B • The data blocks are read into the two mappers • All intermediate computing results are identified by their <key, value> pairs
MapReduce for Parallel Matrix Multiplication (cont.) • The generation, sorting, and grouping of four <key, value> pairs by each mapper proceeds in two stages • Each short <key, value> pair holds a single partial-product value identified by its key • Each long pair holds two partial products identified by their block key • The reducer is used to sum up the output matrix elements using the four long <key, value(s)> pairs • Now consider six mappers and two reducers • Each mapper handles n/6 adjacent rows of the input matrix • Each reducer generates n/2 rows of the output matrix C
MapReduce for Parallel Matrix Multiplication (cont.) • When the matrix order becomes very large • The time to multiply very large matrices becomes cost prohibitive • A dataflow graph for the above example
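A compact, self-contained C++ sketch of this dataflow for the 2×2 case (the matrix values are made up for illustration): mapper i holds row i of A plus all of B and emits partial products keyed by the output element Kij; the reducer sums each group.

    #include <cstdio>
    #include <map>
    #include <string>

    int main() {
      double A[2][2] = {{1, 2}, {3, 4}};   // illustrative values
      double B[2][2] = {{5, 6}, {7, 8}};
      std::map<std::string, double> sums;  // reducer state, keyed by Kij

      for (int i = 0; i < 2; ++i)          // mapper for row i of A
        for (int j = 0; j < 2; ++j)
          for (int k = 0; k < 2; ++k) {
            std::string key = "K" + std::to_string(i + 1) + std::to_string(j + 1);
            sums[key] += A[i][k] * B[k][j];  // emit <key, partial product>
          }

      for (const auto& [key, cij] : sums)  // reducer output: C = A x B
        std::printf("%s = %g\n", key.c_str(), cij);  // K11=19 K12=22 K21=43 K22=50
    }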
GPU Computing to Exascale and Beyond • Multicore CPUs may increase from the tens of cores to hundreds or more in the future • The CPU has reached its limit in terms of exploiting massive parallelism due to limited memory speed • This triggered the development of many-core GPUs with hundreds or more thin cores • x86 processors have been extended to serve HPC systems in some high-end server processors • Many RISC processors have been replaced with multicore x86 processors and many-core GPUs • This trend indicates that x86 upgrades will dominate in data centers and supercomputers • The GPU has also been applied in large clusters to build supercomputers as massively parallel processors (MPPs)
GPU Computing to Exascale and Beyond (cont.) • In the future, the processor industry will develop asymmetric/heterogeneous chip multiprocessors • With both fat CPU cores and thin GPU cores on the same chip • Internal to each node of the cloud • Multithreading is practiced with a large number of cores in many-core GPU clusters • Four challenges for exascale computing • Energy and power, and memory and storage • Need to optimize the storage hierarchy and tailor the memory to the applications • Concurrency and locality, and system resiliency • Need to promote self-aware OS and runtime support and build locality-aware compilers and auto-tuners • Self-aware OS/RT systems have the ability to adapt to the current situation and react to runtime events
GPU Computing to Exascale and Beyond (cont.) • A graphics processing unit (GPU) is a graphics coprocessor or accelerator • Mounted on a computer’s graphics/video card • Offloads the CPU from tedious graphics tasks in video editing applications • The world’s first GPU, the GeForce 256, was marketed by NVIDIA in 1999 • Modern GPU chips can process a minimum of 10 million polygons per second • Used in nearly every computer on the market today • Some features were also integrated into certain CPUs • Traditional CPUs are structured with only a few cores
GPU Computing to Exascale and Beyond (cont.) • e.g., the Xeon X5670 CPU has six cores • Modern GPU chips can be built with hundreds of cores • GPUs have a throughput architecture that exploits massive parallelism by executing many concurrent threads slowly • Instead of executing a single long thread in a conventional microprocessor very quickly • Parallel GPUs or GPU clusters have been adopted • Against the use of CPUs with limited parallelism • General-purpose computing on GPUs has appeared in the HPC field, known as GPGPU • NVIDIA's CUDA (Compute Unified Device Architecture) model is for HPC using GPGPUs
How GPUs Work • Early GPUs functioned as coprocessors attached to the CPU • Today, an NVIDIA GPU can be built with 128 cores on a single chip • Each core can handle eight threads of instructions • This translates to having up to 1,024 threads executed concurrently on a single GPU • True massive parallelism, compared to only a few threads that can be handled by a conventional CPU • A step on the path toward exascale computing (Eflops, or 10^18 flops) • The GPU is optimized to deliver much higher throughput with explicit management of on-chip memory • The CPU is optimized for lower latency through caches
How GPUs Work (cont.) • Modern GPUs are not restricted to accelerated graphics or video coding • Also used in HPC systems • To power supercomputers with massive parallelism at multicore and multithreading levels • Designed to handle large numbers of floating-point operations in parallel • In a way, the GPU offloads the CPU from all data-intensive calculations, not just those related to video processing • GPUs are widely used in mobile phones, game consoles, PCs, servers, etc. • e.g., the NVIDIA CUDA Tesla or Fermi is used in GPU clusters or in HPC systems for parallel processing of massive floating-point data
How GPUs Work (cont.) • The interaction between a CPU and GPU • In performing parallel execution of floating-point operations concurrently • The CPU is the conventional multicore processor with limited parallelism to exploit • The GPU has a many-core architecture • Hundreds of simple processing cores organized as multiprocessors • Each core can have one or more threads • The CPU’s floating-point kernel computation role is largely offloaded to the many-core GPU • The CPU instructs the GPU to perform massive data processing
How GPUs Work (cont.) • The bandwidth must be matched between the on-board main memory and the on-chip GPU memory • This process is carried out in NVIDIA’s CUDA programming using the GeForce 8800 or Tesla and Fermi GPUs • The NVIDIA Fermi GPU Chip with 512 CUDA Cores
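A minimal CUDA sketch of this CPU-GPU interaction: the host copies operands from main memory to GPU memory, launches a kernel that runs one thread per element, and copies the results back. This is a generic SAXPY example, not tied to any particular GPU generation.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Kernel: each GPU thread computes one element of y = a*x + y in parallel.
    __global__ void saxpy(int n, float a, const float* x, float* y) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) y[i] = a * x[i] + y[i];
    }

    int main() {
      const int n = 1 << 20;
      size_t bytes = n * sizeof(float);
      float *hx = new float[n], *hy = new float[n];
      for (int i = 0; i < n; ++i) { hx[i] = 1.0f; hy[i] = 2.0f; }

      float *dx, *dy;                                     // GPU (device) memory
      cudaMalloc(&dx, bytes);
      cudaMalloc(&dy, bytes);
      cudaMemcpy(dx, hx, bytes, cudaMemcpyHostToDevice);  // main memory -> GPU memory
      cudaMemcpy(dy, hy, bytes, cudaMemcpyHostToDevice);

      saxpy<<<(n + 255) / 256, 256>>>(n, 3.0f, dx, dy);   // CPU launches the GPU kernel
      cudaMemcpy(hy, dy, bytes, cudaMemcpyDeviceToHost);  // results back to the CPU

      std::printf("y[0] = %g\n", hy[0]);                  // 3*1 + 2 = 5
      cudaFree(dx); cudaFree(dy);
      delete[] hx; delete[] hy;
    }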
How GPUs Work (cont.) • In 2010, three of the five fastest supercomputers in the world used large numbers of GPU chips to accelerate floating-point computations • i.e., the Tianhe-1A, Nebulae, and Tsubame • The architecture of the Fermi GPU • A next-generation GPU from NVIDIA • Built from streaming multiprocessor (SM) modules • Multiple SMs can be built on a single GPU chip • The Fermi chip has 16 SMs implemented with 3 billion transistors • The chip comprises up to 512 streaming processors (SPs), known as CUDA cores • The Tesla GPUs used in the Tianhe-1A have a similar architecture, with 448 CUDA cores
How GPUs Work (cont.) • The Fermi GPU is a newer generation of GPU • Can be used in desktop workstations to accelerate floating-point calculations • Or for building large-scale data centers • There are 32 CUDA cores per SM • Each CUDA core has a simple pipelined integer ALU and an FPU that can be used in parallel • Each SM has 16 load/store units • Allowing source and destination addresses to be calculated for 16 threads per clock • There are four special function units (SFUs) for executing transcendental instructions • These instructions perform trigonometric and logarithmic operations on floating-point operands
How GPUs Work (cont.) • All functional units and CUDA cores are interconnected by an NoC (network on chip) to a large number of SRAM banks (L2 caches) • Each SM has a 64 KB L1 cache • The 768 KB unified L2 cache is shared by all SMs and serves all load and store operations • Memory controllers are used to connect to 6 GB of off-chip DRAMs • The SM schedules threads in groups of 32 parallel threads called warps • In total, 512 FMA (fused multiply-add) operations can be done in parallel for 32-bit results, or 256 FMAs for 64-bit results • The 512 CUDA cores on the chip can work in parallel to deliver up to 515 Gflops of double-precision results
How GPUs Work (cont.) • With 16 SMs, a single Fermi GPU thus peaks at roughly half a Tflops in double precision • On the order of 2,000 such GPUs are therefore needed to reach Pflops performance • In the future, thousand-core GPUs may appear in exascale systems • Reflecting a trend toward building future MPPs with hybrid architectures of both types of chips • The progress of GPUs goes along with CPU advances in power efficiency, performance, and programmability • Systems using the hybrid CPU/GPU architecture consume much less power
GPU Clusters for Massive Parallelism • Commodity GPUs have become high-performance accelerators for data-parallel computing • Modern GPU chips contain hundreds of processor cores per chip • Each GPU chip is capable of achieving up to 1 Tflops for single-precision (SP) arithmetic • And more than 80 Gflops for double-precision (DP) calculations • Recent HPC-optimized GPUs contain up to 4 GB of on-board memory • Capable of sustaining memory bandwidths exceeding 100 GB/second
GPU Clusters for Massive Parallelism (cont.) • GPU clusters are built with a large number of GPU chips • With the capability to achieve Pflops performance in some of the Top 500 systems • Most GPU clusters are structured with homogeneous GPUs of the same hardware class, make, and model • The software used in a GPU cluster includes the OS, GPU drivers, and a clustering API such as MPI • The high performance of a GPU cluster is attributed mainly to • Its massively parallel multicore architecture and high throughput in multithreaded floating-point arithmetic • Significantly reduced time in massive data movement using large on-chip cache memory
GPU Clusters for Massive Parallelism (cont.) • GPU clusters are already more cost-effective than traditional CPU clusters • They result in a quantum jump in speed performance • With highly reduced space, power, and cooling demands • A GPU cluster can operate with a reduced number of OS images, compared with CPU-based clusters • These reductions in power, environment, and management complexity • Make GPU clusters very attractive for use in future HPC applications • A GPU cluster is often built as a heterogeneous system • Consisting of three major components
GPU Clusters for Massive Parallelism (cont.) • The CPU host nodes, the GPU nodes, and the cluster interconnect between them • The GPU nodes are formed with general-purpose GPUs (GPGPUs) to carry out numerical calculations • The host node controls program execution • The cluster interconnect handles inter-node communications • To guarantee performance • Multiple GPUs must be fully supplied with data streams over a high-bandwidth network and memory • Host memory should be optimized to match the on-chip cache bandwidths on the GPUs
Echelon GPU Cluster Architecture • The architecture of a GPU accelerator chip suggested for exascale computing • For use in building an NVIDIA GPU cluster • An Echelon GPU chip incorporates 1,024 stream cores and 8 latency-optimized CPU-like cores • Eight stream cores form a streaming multiprocessor (SM) • There are 128 SMs in the Echelon GPU chip • Each SM is designed with 8 processor cores to yield a 160 Gflops peak speed • In total, the chip has a peak speed of 20.48 Tflops • These SMs are interconnected by an NoC to 1,024 SRAM banks (L2 caches) • Each cache bank has a 256 KB capacity
Echelon GPU Cluster Architecture (cont.) • The MCs (memory controllers) are used to connect to off-chip DRAMs • The NI (network interface) is used to scale the size of the GPU cluster hierarchically • The architecture of the NVIDIA Echelon GPU system
Echelon GPU Cluster Architecture (cont.) • The entire system is built with N cabinets • Labeled C0, C1, ..., CN−1 • Each cabinet is built with 16 compute modules • Labeled M0, M1, ..., M15 • Each compute module is built with 8 GPU nodes • Labeled N0, N1, ..., N7 • Each GPU node is the innermost block, labeled PC • A single cabinet can house 128 GPU nodes • Each compute module features a performance of 160 Tflops and 12.8 TB/s over 2 TB of memory • Each cabinet has the potential to deliver 2.6 Pflops over 32 TB of memory and 205 TB/s of bandwidth • The N cabinets are interconnected by a Dragonfly network with optical fiber