
Matrix Computation Using Compute Unified Device Architecture (CUDA)


Presentation Transcript


  1. Matrix Computation Using Compute Unified Device Architecture (CUDA) Yagnavalkya N. Satakarni Bommuluri CSCI 699 Thesis Defense Committee Members: Dr. Andrew A. Anda Dr. Dennis Guster Dr. Pranava K. Jha Date: 1/12/2010

  2. OUTLINE

  3. Brief Review of the Literature

  4. Evolution of Computer Architecture • Von Neumann Architecture: stored-program concept; single instruction stream; single system bus. • Harvard Architecture: physically separated storage and signal pathways for instructions and data. [Figure: structure of the IAS computer, showing the central processing unit (arithmetic-logic unit and program control unit), main memory, and I/O equipment.]

  5. Evolution of Computer Architecture • Limitation: the speed of light. • Solution: parallelism. • Consequence: parallelism is adopted in various forms, at different levels, within the von Neumann architecture. Legend: I/E: Instruction Fetch and Execute; ILP: Instruction Level Parallelism; MFU: Multiple Functional Units; RISC: Reduced Instruction Set Computer; VLIW: Very Long Instruction Word

  6. Evolution of Computer Architecture • Classification of computer architectures based on Flynn’s classical taxonomy: SISD (Single Instruction, Single Data), SIMD (Single Instruction, Multiple Data), MISD (Multiple Instruction, Single Data), and MIMD (Multiple Instruction, Multiple Data).

  7. Evolution of Computer Architecture Vector Processors • Vector processors are coprocessors that operate on a linear array of operands, called a vector, using a single instruction. • In the early 1970s, vector architectures were considered the favorable choice for building supercomputers. • The golden era of vector supercomputers began with the introduction of the CRAY-1 in 1976. • Advancements in CMOS VLSI technology improved the performance of low-cost commodity microprocessors. • In the early nineties, the high cost of vector supercomputers diminished their popularity.

  8. Evolution of Computer Architecture

  9. On-CPU Graphics Advancements (FMA, MMX, SSE) • A CPU is a general-purpose processor for performing various arithmetic, logical, and control operations. • It is not built exclusively for graphics applications, although it includes certain limited features, such as FMA, MMX, and SSE, that support vector and/or graphics applications. • A system architecture designed for graphics applications off-CPU is referred to as a GPU.

  10. On-CPU Graphics Advancements – FMA • A “fused multiply-add (FMA)” is a compound floating-point operation that computes the scalar affine operation (x × y) + z to infinite precision and rounds once to the result format.

  11. On-CPU Graphics Advancements – FMA • Motivation - Accuracy Enhancement • Floating-point arithmetic operations are inexact. • Higham analyzed various summation methods for floating-point numbers and concluded that no method guarantees accuracy in floating-point summation [HIGH93]. • Inaccuracies are quite ordinary in the computation of scalar affine operations of the type f(x, y, z) := (x × y) ± z. • High-precision software (such as the LBNL quad-double package) can achieve greater accuracy but has performance issues due to overheads. • Therefore, to enhance the accuracy along with the speed of floating-point computations, manufacturers are beginning to provide a reliably rounded affine operation called the fused multiply-add (FMA), illustrated below.
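
A minimal host-side sketch (mine, not from the slides) contrasting the doubly rounded expression x * y + z with the singly rounded fma() from <math.h>. The operand values are chosen so the difference is visible in double precision.

```c
#include <stdio.h>
#include <math.h>

int main(void)
{
    /* Pick operands whose exact product carries more bits than a double
       can hold, so rounding x*y before the add loses information. */
    double x = 1.0 + pow(2.0, -27);
    double y = 1.0 + pow(2.0, -27);
    double z = -(1.0 + pow(2.0, -26));

    /* Two roundings: fl(fl(x*y) + z). (Assumes the compiler does not
       itself contract this expression into an FMA.) */
    double separate = x * y + z;
    /* One rounding of the exact x*y + z. */
    double fused = fma(x, y, z);

    printf("separate multiply-add: %.17e\n", separate); /* 0.0            */
    printf("fused multiply-add:    %.17e\n", fused);    /* 2^-54, nonzero */
    return 0;
}
```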

  12. On-CPU Graphics Advancements – FMA Downside • The IEEE 754 standard neither anticipated nor recommended FMA instructions. Why? • Inconsistent results: in (x + iy)(x - iy) = x² + y² + i(xy - yx), contracting the imaginary part with an FMA can yield a nonzero value for xy - yx. • Nonmonotonicity: we expect that if (a × b) - (c × d) ≥ 0, then a × b ≥ c × d, whence fl(a × b) ≥ fl(c × d); FMA contraction can violate this. • In conclusion, although the fused multiply-add can maximize performance, one may be advised not to use it indiscriminately in programs, owing to these undesirable effects [HIGH02, p. 47].
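
A small sketch (my own, not from the slides) of the complex-product anomaly above: calling fma(x, y, -(y * x)) returns the rounding error of the product, so the "imaginary part" xy - yx comes out nonzero even though it is mathematically zero.

```c
#include <stdio.h>
#include <math.h>

int main(void)
{
    double x = 1.0 + pow(2.0, -27);
    double y = 1.0 + pow(2.0, -26);

    /* Without FMA, x*y - y*x evaluates to exactly zero.  A compiler
       contracting it to fma(x, y, -fl(y*x)) instead produces the
       rounding error of the product. */
    double imag_part = fma(x, y, -(y * x));

    printf("xy - yx with FMA contraction: %.17e\n", imag_part); /* 2^-53 */
    return 0;
}
```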

  13. On-CPU Graphics Advancements – MMX MMX Technology - Introduction • The Intel MMX technology comprises a set of extensions to the Intel Architecture (IA) designed to accelerate multimedia and communication applications. • Backward compatible with previous versions of software and operating systems. • New extensions: • Eight 64-bit MMX registers (MM0 through MM7), aliased with the FPU data register stack. • Four 64-bit MMX data types. • The MMX instruction set consists of 57 new instructions.

  14. On-CPU Graphics Advancements – MMX MMX Technology - Motivation • The goal was not to endow Pentium processors with 64-bit capability. • Instead, MMX uses the single-instruction, multiple-data (SIMD) technique to perform arithmetic and logical operations on bytes, words, or doublewords packed into the 64-bit MMX registers. • This speeds up software performance.

  15. On-CPU Graphics Advancements – MMX Downside - MMX Register Aliasing • MMX technology is compatible with IA technology. How is this made possible? • By aliasing the MMX registers and state onto the IA floating-point registers and state. • Consequently, once an MMX instruction has started executing, no FPU instruction can execute until EMMS is issued (see the sketch below). • Moreover, MMX operates only on integer data types.
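
A minimal sketch (mine, not from the slides) of packed 16-bit MMX addition followed by the mandatory EMMS. It assumes a compiler and target that still support MMX intrinsics (e.g., 32-bit x86 GCC); 64-bit MSVC does not.

```c
#include <stdio.h>
#include <string.h>
#include <mmintrin.h>

int main(void)
{
    /* Pack four 16-bit integers into each 64-bit MMX register and add
       them element-wise with one SIMD instruction (PADDW). */
    __m64 a = _mm_set_pi16(4, 3, 2, 1);     /* highest word listed first */
    __m64 b = _mm_set_pi16(40, 30, 20, 10);
    __m64 c = _mm_add_pi16(a, b);

    short out[4];
    memcpy(out, &c, sizeof out);

    /* EMMS clears the aliased x87/MMX state; omitting it before later
       FPU code is exactly the register-aliasing pitfall noted above. */
    _mm_empty();

    printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]); /* 11 22 33 44 */
    return 0;
}
```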

  16. On-CPU Graphics Advancements – SSE Streaming SIMD Extension (SSE) Technology • Goal: greatly enhance the performance of advanced media and communications applications. • Goal: address the pitfalls of Intel MMX technology. • New extensions: • Eight SIMD floating-point registers (XMM0 through XMM7). • The SSE data type packs four 32-bit single-precision (SP) floating-point numbers [IEEE 754]. • SSE exploits the SIMD technique to perform ALU operations on the single-precision floating-point values held in the 128-bit SIMD floating-point registers.
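
A small illustrative sketch (not from the slides) of the packed single-precision style SSE introduced: a single intrinsic adds four floats at once.

```c
#include <stdio.h>
#include <xmmintrin.h>

int main(void)
{
    /* One ADDPS instruction adds four packed single-precision floats
       held in 128-bit XMM registers. */
    __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f); /* highest lane first */
    __m128 b = _mm_set_ps(0.4f, 0.3f, 0.2f, 0.1f);
    __m128 c = _mm_add_ps(a, b);

    float out[4];
    _mm_storeu_ps(out, c);  /* unaligned store back to memory */

    printf("%.1f %.1f %.1f %.1f\n", out[0], out[1], out[2], out[3]);
    /* prints: 1.1 2.2 3.3 4.4 */
    return 0;
}
```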

  17. On-CPU Graphics Advancements – SSE2 Second iteration: Intel SSE2 • Improves the capabilities of both the MMX and SSE instruction sets. • New extensions: • 144 additional instructions. • MMX operations have been extended from 64 to 128 bits. • Double-precision (64-bit) floating-point SIMD instruction support added to the SSE technology. • Enhancements to the original SSE instructions to support arithmetic operations on more data types (e.g., double words and quad words).

  18. On-CPU Graphics Advancements – SSE3 Third iteration: Intel SSE3 • In 2004, Intel further extended the SSE2 technology to create SSE3. • New extensions: • Packed horizontal add instructions that operate horizontally across a data register. • Support for Hyper-Threading (HT) Technology. • Prescott New Instructions (PNI), including instructions to convert floating-point values to integers without changing the global rounding mode. • Deeper pipelining to enable higher frequencies. • Useful in areas such as complex algorithms and video decoding.

  19. On-CPU Graphics Advancements– SSSE3 Fourth iteration: Intel SSSE3 • Supplemental Streaming SIMD Extension 3 (SSSE3). • 16 new discrete instructions over SSE3 • Each can act on 64-bit MMX or 128-bit XMM registers

  20. On-CPU Graphics Advancements – SSE4 Fifth iteration: Intel SSE4 • Released in September 2007. • SSE4 is the largest Instruction Set Architecture (ISA) extension, with respect to capacity and impact, since SSE2. • SSE4 has several compiler vectorization primitives along with new and innovative string-processing instructions. • 54 new instructions. • Lacks support for operations on 64-bit MMX registers. • Supports 128-bit XMM registers only.

  21. On-CPU Graphics Advancements– SSE5 AMD SSE5 • Announced by AMD on August 30, 2007 • 170 new instructions with greater benefits in domains like HPC, multimedia, and security applications than previously released SSE instruction sets. • It is not a superset of SSE4 but a competitor to it • The new instructions include • Fused multiply accumulate (FMACxx) instructions • Integer multiply accumulate (PMAC, PMADC) instructions • Permutation and conditional move instructions • Vector compare and test instructions • Precision control, rounding, and conversion instructions

  22. Off-CPU Graphics Advancements - GPU Graphics Processing Unit (GPU) – Introduction • A GPU is a single-chip processor that can render 2D and 3D graphics. • Technically, a GPU executes the graphics pipeline to render 3D graphics. • The Graphics Processing Unit (GPU) acts as a coprocessor to the CPU. [Figure: the host CPU connects over PCI-E to the GPU's host interface; a memory controller gives the GPU its own memory.]

  23. Graphics Processing Unit (GPU) – History Off-CPU Graphics Advancements - GPU

  24. Graphics Processing Unit (GPU) – History Off-CPU Graphics Advancements - GPU

  25. Off-CPU Graphics Advancements - GPU Difference – CPU & GPU • Over the last decade, GPUs have been getting faster rapidly, with tremendous memory bandwidth and computational horsepower. • GPU performance has grown exponentially, at a rate far exceeding that predicted by Moore’s law. • Why has the CPU not kept pace, given the same fabrication technology? • The difference lies in how the transistors are employed: the CPU devotes more of its die to caches and control logic, while the GPU devotes more of it to ALUs. [Figure: comparative transistor allocation on a CPU and a GPU.]

  26. GPGPU • The term General-Purpose Computation on GPUs (GPGPU) refers to computation on graphics hardware for non-graphics applications. • The GPGPU community of developers and researchers (http://www.gpgpu.org/) started the early groundwork of exploiting graphics hardware resources for non-graphics applications. • Why do we need GPGPU? • Multiple programmable shaders • Powerful (memory bandwidth, computational horsepower (GFLOPS)) • Accuracy • Affordable • Low power consumption

  27. GPGPU

  28. GPGPU Graphics API Programming Model • Graphics API languages such as OpenGL and Direct3D. • Requires correlating computational concepts between the CPU and GPU graphics domains. • Even the procedure for adding two numbers must be expressed as a rendering pass (figure elided).

  29. GPGPU NVIDIA CUDA • Compute Unified Device Architecture (CUDA). • A novel hardware architecture and software model for issuing and managing computations on the GPU as a massively data-parallel computing device. • Does not require mapping computations onto graphics API concepts. • CUDA uses a C-like language with minimal extensions for general-purpose computation on the GPU (see the sketch below). • Available on the GeForce 8 Series, Quadro FX 5600/4600, and Tesla solutions. [Figure: software stack, in which the CPU application calls CUDA libraries, the CUDA runtime, and the CUDA drivers, which manage the GPU.]
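
A minimal vector-addition sketch (mine, not from the thesis) of that C-like style: one __global__ kernel plus ordinary host C, with no graphics API involved. The array size and names are illustrative.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

/* __global__ marks a kernel: plain C extended with a thread index. */
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                 /* guard: the grid may be larger than n */
        c[i] = a[i] + b[i];
}

int main(void)
{
    const int n = 1024;
    size_t bytes = n * sizeof(float);
    float *a = (float *)malloc(bytes), *b = (float *)malloc(bytes),
          *c = (float *)malloc(bytes), *da, *db, *dc;
    for (int i = 0; i < n; ++i) { a[i] = (float)i; b[i] = 2.0f * i; }

    cudaMalloc((void **)&da, bytes);
    cudaMalloc((void **)&db, bytes);
    cudaMalloc((void **)&dc, bytes);
    cudaMemcpy(da, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, b, bytes, cudaMemcpyHostToDevice);

    /* Launch enough 256-thread blocks to cover n elements. */
    vecAdd<<<(n + 255) / 256, 256>>>(da, db, dc, n);

    cudaMemcpy(c, dc, bytes, cudaMemcpyDeviceToHost);
    printf("c[10] = %f\n", c[10]);   /* 30.0 */

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(a); free(b); free(c);
    return 0;
}
```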

  30. GPGPU CUDA – Shared Memory • Enables general read and write access to data by the threads of a block. • Brings data closer to the ALUs (see the sketch below).
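
An illustrative sketch (not from the thesis) of __shared__ memory: each block stages its slice of the array in fast on-chip memory, synchronizes, and writes it back reversed. The block size and names are my own, and n is assumed to be a multiple of the block size.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

#define B 256  /* threads per block */

__global__ void reverseBlock(float *d)
{
    /* __shared__ places the tile in on-chip memory that every thread
       of the block can read and write. */
    __shared__ float tile[B];
    int i = blockIdx.x * B + threadIdx.x;

    tile[threadIdx.x] = d[i];           /* gather the block's slice   */
    __syncthreads();                    /* writes visible to all reads */
    d[i] = tile[B - 1 - threadIdx.x];   /* scatter it back, reversed  */
}

int main(void)
{
    const int n = 2 * B;
    float h[n], *dev;
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    cudaMalloc((void **)&dev, n * sizeof(float));
    cudaMemcpy(dev, h, n * sizeof(float), cudaMemcpyHostToDevice);
    reverseBlock<<<n / B, B>>>(dev);
    cudaMemcpy(h, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);

    printf("h[0] = %g (originally 0; block 0 is now reversed)\n", h[0]); /* 255 */
    return 0;
}
```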

  31. GPGPU CUDA – Programming Model • A kernel is launched with an execution configuration: Func<<<Dg, Db, Ns>>>(parameter); where Dg is the grid dimension, Db is the block dimension, and Ns is the per-block dynamic shared memory in bytes (see the sketch below).
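
A short sketch (mine) showing the three launch parameters together; Ns sizes the extern __shared__ array seen inside the kernel.

```cuda
#include <cuda_runtime.h>

__global__ void kernel(float *out)
{
    /* Ns bytes of dynamic shared memory appear as this unsized array. */
    extern __shared__ float buf[];
    buf[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    out[blockIdx.x * blockDim.x + threadIdx.x] = buf[threadIdx.x];
}

int main(void)
{
    float *d;
    dim3   Dg(4);                      /* grid:  4 blocks               */
    dim3   Db(128);                    /* block: 128 threads            */
    size_t Ns = 128 * sizeof(float);   /* dynamic shared mem per block  */

    cudaMalloc((void **)&d, 4 * 128 * sizeof(float));
    kernel<<<Dg, Db, Ns>>>(d);         /* the execution configuration   */
    cudaThreadSynchronize();
    cudaFree(d);
    return 0;
}
```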

  32. GPGPU CUDA – Memory Model [Figure: per-thread registers and local memory, per-block shared memory, and per-grid global, constant, and texture memory.]

  33. CUDA – Hardware Model • GPGPU

  34. GPGPU CUDA – Restrictions • Threads and thread blocks can be created only by launching a parallel kernel, not from within one. • CUDA requires that thread blocks be independent; communication between thread blocks within the same grid is not allowed. • CUDA allows blocks to execute in any order. Results computed by multiple blocks are therefore combined by invoking a second kernel (see the sketch below). • Recursive functions are not allowed in CUDA. • The overhead of CPU-GPU coordination and data transfers can reduce the overall performance of a CUDA-enabled application.
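
A hedged sketch (my own) of the two-kernel pattern above: pass 1 reduces each independent block to a partial sum, and pass 2 launches a second kernel to combine the partials, since blocks cannot communicate directly.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

#define B 256

/* Pass 1: each block reduces its slice to one partial sum. */
__global__ void partialSum(const float *in, float *out, int n)
{
    __shared__ float s[B];
    int i = blockIdx.x * B + threadIdx.x;
    s[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    for (int stride = B / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            s[threadIdx.x] += s[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[blockIdx.x] = s[0];
}

int main(void)
{
    const int n = 4096, blocks = n / B;
    float h[n], total, *din, *dpart;
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    cudaMalloc((void **)&din,   n * sizeof(float));
    cudaMalloc((void **)&dpart, blocks * sizeof(float));
    cudaMemcpy(din, h, n * sizeof(float), cudaMemcpyHostToDevice);

    partialSum<<<blocks, B>>>(din, dpart, n);   /* pass 1: per-block sums  */
    partialSum<<<1, B>>>(dpart, din, blocks);   /* pass 2: combine partials */

    cudaMemcpy(&total, din, sizeof(float), cudaMemcpyDeviceToHost);
    printf("sum = %g (expected %d)\n", total, n);

    cudaFree(din); cudaFree(dpart);
    return 0;
}
```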

  35. BLAS Basic Linear Algebra Subprograms (BLAS) • An API standard for publishing libraries that perform basic linear algebra operations such as vector and matrix multiplication. • Originally implemented in Fortran. • A C interface is also available (referred to as CBLAS; a call sketch follows below). • Not to be confused with BLACS. • BLAS levels: Level 1 (vector-vector operations, e.g., SDOT, SAXPY), Level 2 (matrix-vector operations, e.g., SGEMV), and Level 3 (matrix-matrix operations, e.g., SGEMM).
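
A minimal CBLAS usage sketch (not from the thesis); note the header name varies by distribution (Intel MKL ships mkl_cblas.h, while the generic interface uses cblas.h).

```c
#include <stdio.h>
#include <mkl_cblas.h>  /* or <cblas.h> for a generic CBLAS */

int main(void)
{
    /* Level-1 BLAS: single-precision dot product x . y */
    float x[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float y[4] = {1.0f, 1.0f, 1.0f, 1.0f};

    float d = cblas_sdot(4, x, 1, y, 1);  /* n, x, incx, y, incy */
    printf("sdot = %g\n", d);             /* 10 */
    return 0;
}
```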

  36. BLAS Netlib BLAS • Netlib hosts the repository of reference BLAS implementations compiled for various operating systems and hardware. • Also known as generic BLAS. • Not considered an optimized BLAS. • High-Performance Non-vendor BLAS • Automatically Tuned Linear Algebra Software (ATLAS), Goto BLAS, and μBLAS. • More efficient than the generic Netlib BLAS. • Ideal for new system architectures, or for systems that lack an architecture-specific, often vendor-specific, BLAS. • Vendor-specific BLAS • Built exclusively for a specific hardware architecture. • Usually more efficient than non-vendor BLAS.

  37. Vendor-specific BLAS • BLAS

  38. Krylov Subspace • Given a square matrix A in C^(n x n) and a vector v in C^n. • The order-m Krylov subspace is defined as the linear subspace spanned by the images of v under the first m powers of A. • Assuming A^0 = I, the order-m Krylov subspace, denoted K_m, equals the span K_m(A, v) = span{v, Av, A^2 v, ..., A^(m-1) v}. • All algorithms that work this way, or use a Krylov subspace span as their foundation, are called Krylov subspace methods.

  39. Statement of the Problem

  40. Motivation for the Study • The purpose of this thesis is to study the performance enhancement of dense matrix and vector computations on a GPU using CUDA. • Immediate benefits for scientific computation, engineering, and high-performance numerical computation (HPNC). • We intend to study the impact on performance by implementing certain Level-1, Level-2, and Level-3 BLAS routines on a Tesla C870 processor board (GPU) using CUBLAS 2.0, and on a 2.3 GHz quad-core AMD Phenom processor (CPU) using Intel MKL 10.1. • Furthermore, to understand the reusability of data under CUDA, we also compute Krylov subspaces for various orders of matrices and base vectors using iterated Level-2 BLAS calls.

  41. Objective of the Study

  42. Assumptions • The CUDA programming model achieves performance efficiencies comparable to those of prior models for programming graphics hardware. • CUBLAS 2.0 and Intel MKL 10.1 BLAS are efficient (in exploiting the hierarchical hardware architecture), optimized, and threaded BLAS implementations for the GPU and CPU architectures, respectively. • The CUDA programming model is a scalable and efficient model for general-purpose computation on a GPU.

  43. Constraints • We include only the practical implementation of the CUDA programming model, using CUBLAS, and of the CPU model, using Intel MKL BLAS. • We do not include other GPU programming models, due to a lack of testing facilities. • We include implementations of SGEMM (BLAS Level 3), SGEMV (BLAS Level 2), SDOT (BLAS Level 1), and SAXPY (BLAS Level 1). Furthermore, we find Krylov subspaces of various orders using Level-2 BLAS routines. Implementations of other BLAS routines and/or algorithms are not included.

  44. Hypotheses • Under which contexts can CUDA outperform the CPU model for dense matrix computations? • Is CUDA a scalable and efficient parallel programming model?

  45. Hardware Description • Methodology

  46. Software Description • Methodology

  47. Methodology BLAS Implemented • Additionally, we computed the Krylov subspace using Level-2 BLAS (a sketch follows below).
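
A hedged sketch (mine, not the thesis code) of the Krylov iteration using the legacy, CUBLAS 2.0-era API: the matrix A stays resident on the device while repeated SGEMV calls generate each basis vector from the previous one, so only vectors cross the PCI-E bus. Sizes, names, and the omitted fill step are illustrative.

```c
#include <cublas.h>   /* legacy CUBLAS API, as in CUBLAS 2.0 */

int main(void)
{
    const int n = 512, m = 8;  /* matrix order and Krylov order */
    float *A, *K;              /* device matrix and Krylov basis */

    cublasInit();
    cublasAlloc(n * n, sizeof(float), (void **)&A);
    cublasAlloc(n * m, sizeof(float), (void **)&K);
    /* ... fill A with cublasSetMatrix() and K's first column (v)
           with cublasSetVector(), omitted for brevity ... */

    /* Each Level-2 SGEMV reuses A already on the device:
       column j+1 of K = A * column j of K. */
    for (int j = 0; j < m - 1; ++j)
        cublasSgemv('N', n, n, 1.0f,   /* y = alpha*A*x + beta*y  */
                    A, n,              /* A and its leading dim   */
                    K + j * n, 1,      /* x = j-th basis vector   */
                    0.0f,
                    K + (j + 1) * n, 1);

    cublasFree(A);
    cublasFree(K);
    cublasShutdown();
    return 0;
}
```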

  48. Methodology • Install Windows Vista Business 64-bit Edition. • For the Intel MKL BLAS implementation: • Install Microsoft Visual Studio 2008 (MSVS 2008) Professional Edition. • Install the Intel Math Kernel Library. • Link the program with Intel MKL from Microsoft Visual C++. • Create, build, and run the program(s). • For the CUBLAS implementation: • Install the Tesla C870 processor board as per the manual. • Install the CUDA Toolkit, SDK, and drivers, version 2.0 for Vista 64-bit. • Install Microsoft Visual Studio 2005 (MSVS 2005) Professional Edition. • Create a new CUDA project from the template, set the Configuration Manager to the x64 platform, then build and execute the application in 64-bit mode.

  49. Results - Interpretations/Analysis • The following three variables are measured in each experiment, for various sizes of square matrices and/or vectors, using the clock_t clock(void) function from the header file <time.h> (a timing sketch follows below): • The Intel MKL BLAS computation time in milliseconds. • The CUBLAS computation time, measured as the number of clock ticks elapsed during execution of the CUBLAS kernel on the GPU device. • CUBLAS total time = host-to-GPU data transfer time + CUBLAS computation time + GPU-to-host data transfer time.
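
A minimal harness (my own sketch, with a hypothetical stand-in kernel) showing the clock()-based measurement; the synchronization call matters because kernel launches return before the GPU finishes.

```cuda
#include <stdio.h>
#include <time.h>
#include <cuda_runtime.h>

/* Hypothetical kernel standing in for a CUBLAS call. */
__global__ void busy(float *x) { x[threadIdx.x] *= 2.0f; }

int main(void)
{
    float *d;
    cudaMalloc((void **)&d, 256 * sizeof(float));

    clock_t t0 = clock();
    busy<<<1, 256>>>(d);
    cudaThreadSynchronize();   /* wait for the GPU before re-reading the clock */
    clock_t t1 = clock();

    double ms = 1000.0 * (double)(t1 - t0) / CLOCKS_PER_SEC;
    printf("elapsed: %.3f ms\n", ms);

    cudaFree(d);
    return 0;
}
```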

  50. Results - Interpretations/Analysis Graphs: • The Y-axis is time in milliseconds. • The X-axis is the size of the square matrix and/or vector. • The maximum range of the X-axis equals the largest matrix and/or vector size for which the experiment runs to completion. • Each experiment is repeated 5 times. • Symbols used to represent the time variables:
