
CUDA Lecture 1 Introduction to Massively Parallel Computing


Presentation Transcript


  1. CUDA Lecture 1 Introduction to Massively Parallel Computing Prepared 6/4/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.

  2. CPUs and GPUs • A quiet revolution and potential buildup • Computation: TFLOPs vs. 100 GFLOPs • CPU in every PC – massive volume and potential impact [Chart: peak performance over time for NVIDIA GPUs (NV30, NV40, G70, G80, GT200, T12) versus Intel CPUs (3GHz Core2 Duo, 3GHz Xeon Quad, Westmere)] Introduction to Massively Parallel Processing – Slide 2

  3. Topic 1: The Demand for Computational Speed Introduction to Massively Parallel Processing – Slide 3

  4. Why Faster Computers? • Solve computation-intensive problems faster • Make infeasible problems feasible • Reduce design time • Solve larger problems in the same amount of time • Improve an answer’s precision • Gain competitive advantage Introduction to Massively Parallel Processing – Slide 4

  5. Why Faster Computers? • Another factor: the modern scientific method [Diagram: the modern scientific method, linking Nature, Observation, Theory, Physical Experimentation and Numerical Simulation] Introduction to Massively Parallel Processing – Slide 5

  6. The Need for Speed • A growing number of applications in science, engineering, business and medicine are requiring computing speeds that cannot be achieved by current conventional computers because • Calculations are too computationally intensive • Problems are too complicated to model • Experiments are too hard, expensive, slow or dangerous for the laboratory. • Parallel processing is an approach that helps make these computations feasible. Introduction to Massively Parallel Processing – Slide 6

  7. The Need for Speed • One example: grand challenge problems that cannot be solved in a reasonable amount of time with today’s computers. • Modeling large DNA structures • Global weather forecasting • Modeling motion of astronomical bodies • Another example: real-time applications, which have time constraints. • If the data are not processed within a certain time, the results become meaningless. Introduction to Massively Parallel Processing – Slide 7

  8. One Example: Weather Forecasting • Atmosphere modeled by dividing it into 3-dimensional cells. Calculations for each cell are repeated many times to model the passage of time. • For example • Suppose the whole global atmosphere is divided into cells of size 1 mile × 1 mile × 1 mile to a height of 10 miles (10 cells high) – about 5 × 10^8 cells. • If each calculation requires 200 floating point operations, 10^11 floating point operations are necessary in one time step. Introduction to Massively Parallel Processing – Slide 8

  9. One Example: Weather Forecasting • Example continued • To forecast the weather over 10 days using 10-minute intervals, a computer operating at 100 megaflops (10^8 floating point operations per second) would take 10^7 seconds or over 100 days. • To perform the calculation in 10 minutes would require a computer operating at 1.7 teraflops (1.7 × 10^12 floating point operations per second). Introduction to Massively Parallel Processing – Slide 9
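
A hedged sketch of the arithmetic behind these two slides (taking the total work to be on the order of 10^15 floating point operations, as the quoted times imply):

```latex
% Work per time step, from the cell count and per-cell cost above:
\[
  5 \times 10^{8}\ \text{cells} \;\times\; 200\ \tfrac{\text{flops}}{\text{cell}}
  \;=\; 10^{11}\ \text{flops per time step}
\]
% Time to solution for a total of W flops at a sustained rate R is T = W / R:
\[
  \frac{10^{15}\ \text{flops}}{10^{8}\ \text{flops/s}} = 10^{7}\ \text{s}
  \;\;(\text{over 100 days}),
  \qquad
  \frac{10^{15}\ \text{flops}}{600\ \text{s}} \approx 1.7 \times 10^{12}\ \text{flops/s}
  \;\;(\text{to finish in 10 minutes}).
\]
```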

  10. Concurrent Computing • Two main types • Parallel computing: Using a computer with more than one processor to solve a problem. • Distributed computing: Using more than one computer to solve a problem. • Motives • Usually faster computation • Very simple idea – that n computers operating simultaneously can achieve the result n times faster • It will not be n times faster for various reasons. • Other motives include: fault tolerance, larger amount of memory available, … Introduction to Massively Parallel Processing – Slide 10
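
A brief aside (standard definition, not from the slides): "n times faster" is usually quantified as the speedup of a parallel run over the best sequential run,

```latex
\[
  S(n) \;=\; \frac{t_s}{t_p(n)} \;\le\; n ,
\]
```

where t_s is the sequential execution time and t_p(n) the time on n processors; communication, synchronization and inherently sequential portions of the work keep S(n) below n in practice.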

  11. Parallel Programming No New Concept “... There is therefore nothing new in the idea of parallel programming, but its application to computers. The author cannot believe that there will be any insuperable difficulty in extending it to computers. It is not to be expected that the necessary programming techniques will be worked out overnight. Much experimenting remains to be done. After all, the techniques that are commonly used in programming today were only won at considerable toil several years ago. In fact the advent of parallel programming may do something to revive the pioneering spirit in programming which seems at the present to be degenerating into a rather dull and routine occupation …” - S. Gill, “Parallel Programming”, The Computer Journal, April 1958 Introduction to Massively Parallel Processing – Slide 11

  12. The Problems of Parallel Programs • Applications are typically written from scratch (or manually adapted from a sequential program) assuming a simple model, in a high-level language (e.g. C) with explicit parallel computing primitives. • Which means • Components are difficult to reuse, so similar components are coded again and again • Code is difficult to develop, maintain or debug • Not really developing a long-term software solution Introduction to Massively Parallel Processing – Slide 12

  13. Another Complication: Memory [Diagram: the memory hierarchy – processor registers at the highest level (small, fast, expensive), then memory levels 1, 2 and 3, down to main memory at the lowest level (large, slow, cheap)] Introduction to Massively Parallel Processing – Slide 13

  14. Review of the Memory Hierarchy • Execution speed relies on exploiting data locality • temporal locality: a data item just accessed is likely to be used again in the near future, so keep it in the cache • spatial locality: neighboring data is also likely to be used soon, so load it into the cache at the same time using a ‘wide’ bus (like a multi-lane motorway) • from a programmer’s point of view, this is all handled automatically – good for simplicity but maybe not for best performance Introduction to Massively Parallel Processing – Slide 14
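
To make the locality point concrete, here is a small illustrative sketch (not from the slides; plain C, valid as CUDA host code): summing a 2-D array row by row follows C's row-major layout, so the cache lines fetched for spatial locality are fully used, while swapping the loop order strides through memory and wastes most of each line.

```cuda
#define N 1024
static float a[N][N];          /* C stores this row by row (row-major) */

/* Cache-friendly: consecutive j values touch consecutive addresses,
 * so each cache line brought in is used completely (spatial locality). */
float sum_row_major(void)
{
    float s = 0.0f;
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            s += a[i][j];
    return s;
}

/* Cache-unfriendly: the inner loop jumps N*sizeof(float) bytes per step,
 * so almost every access pulls in a new cache line. */
float sum_column_major(void)
{
    float s = 0.0f;
    for (int j = 0; j < N; ++j)
        for (int i = 0; i < N; ++i)
            s += a[i][j];
    return s;
}
```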

  15. The Problems of the Memory Hierarchy • Useful work (i.e. floating point ops) can only be done at the top of the hierarchy. • So data stored lower must be transferred to the registers before we can work on it. • Possibly displacing other data already there. • This transfer among levels is slow. • This then becomes the bottleneck in most computations. • Spend more time moving data than doing useful work. • The result: speed depends to a large extent on problem size. Introduction to Massively Parallel Processing – Slide 15

  16. The Problems of the Memory Hierarchy • Good algorithm design then requires • Keeping active data at the top of the hierarchy as long as possible. • Minimizing movement between levels. • Need enough work to do at the top of the hierarchy to mask the transfer time at the lower levels. • The more processors you have to work with, the larger the problem has to be to accomplish this. Introduction to Massively Parallel Processing – Slide 16

  17. Topic 2: CPU Architecture Introduction to Massively Parallel Processing – Slide 17

  18. Starting Point: von Neumann Processor • Each cycle, CPU takes data from registers, does an operation, and puts the result back • Load/store operations (memory ↔ registers) also take one cycle • CPU can do different operations each cycle • Output of one operation can be input to next • CPUs haven’t been this simple for a long time! Introduction to Massively Parallel Processing – Slide 18

  19. Why? Moore’s Law “The complexity for minimum component costs has increased at a rate of roughly a factor of two per year... Certainly over the short term this rate can be expected to continue, if not to increase. Over the longer term, the rate of increase is a bit more uncertain, although there is no reason to believe it will not remain nearly constant for at least 10 years. That means by 1975, the number of components per integrated circuit for minimum cost will be 65,000. I believe that such a large circuit can be built on a single wafer.” - Gordon E. Moore, co-founder, Intel Electronics Magazine, 4/19/1965 Photo credit: Wikipedia Introduction to Massively Parallel Processing – Slide 19

  20. Moore’s Law • Translation: The number of transistors on an integrated circuit doubles every 1½ - 2 years. Data credit: Wikipedia Introduction to Massively Parallel Processing – Slide 20

  21. Moore’s Law • As the chip size (amount of circuitry) continues to double every 18-24 months, the speed of a basic microprocessor also doubles over that time frame. • This happens because, as we develop tricks, they are built into the newest generation of CPUs by the manufacturers. Introduction to Massively Parallel Processing – Slide 21

  22. So What To Do With All The Extra Transistors • Instruction-level parallelism (ILP) • Out-of-order execution, speculation, … • Vanishing opportunities in power-constrained world • Data-level parallelism • Vector units, SIMD execution, … • Increasing opportunities; SSE, AVX, Cell SPE, Clearspeed, GPU, … • Thread-level parallelism • Increasing opportunities; multithreading, multicore, … • Intel Core2, AMD Phenom, Sun Niagara, STI Cell, NVIDIA Fermi, … Introduction to Massively Parallel Processing – Slide 22
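
As a small illustrative sketch (assumed example, not from the slides; plain C, valid as CUDA host code), the loop below is data-parallel: every iteration is independent, so a vectorizing compiler can map groups of iterations onto SSE/AVX lanes (data-level parallelism), and the iteration range can also be divided across cores or threads (thread-level parallelism). The same loop reappears later as a CUDA kernel.

```cuda
// Host-side C: a data-parallel loop with no dependences between iterations.
// A vectorizing compiler can execute several iterations per instruction,
// and the range 0..n can be split among cores/threads.
void saxpy_host(int n, float a, const float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];   // y = a*x + y, element by element
}
```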

  23. Modern Processor: Superscalar • Most processors have multiple pipelines for different tasks and can start a number of different operations each cycle. • For example, consider the Intel Core architecture. Introduction to Massively Parallel Processing – Slide 23

  24. Modern Processor: Superscalar • Each core in an Intel Core 2 Duo chip has • 14 stage pipeline • 3 integer units (ALU) • 1 floating-point addition unit (FPU) • 1 floating-point multiplication unit (FPU) • FP division is very slow • 2 load/store units • In principle capable of producing 3 integer and 2 FP results per cycle Introduction to Massively Parallel Processing – Slide 24

  25. Modern Processor: Superscalar • Technical challenges • Compiler to extract best performance, reordering instructions if necessary • Out-of-order CPU execution to avoid delays waiting for read/write or earlier operations • Branch prediction to minimize delays due to conditional branching (loops, if-then-else) • Memory hierarchy to deliver data to registers fast enough to feed the processor • These all limit the number of pipelines that can be used and increase the chip complexity • 90% of Intel chip devoted to control and data Introduction to Massively Parallel Processing – Slide 25

  26. Tradeoff: Performance vs. Power [Chart: power plotted against performance, courtesy Mark Horowitz and Kevin Skadron] Introduction to Massively Parallel Processing – Slide 26

  27. Current Trends • CPU clock stuck at about 3 GHz since 2006 due to problems with power consumption (up to 130W per chip) • Chip circuitry still doubling every 18-24 months • More on-chip memory and memory management units (MMUs) • Specialized hardware (e.g. multimedia, encryption) • Multi-core (multiple CPUs on one chip) • Thus peak performance of chip still doubling every 18-24 months Introduction to Massively Parallel Processing – Slide 27

  28. A Generic Multicore Chip • Handful of processors, each supporting ~1 hardware thread • On-chip memory near processors (cache, RAM or both) • Shared global memory space (external DRAM) [Diagram: a few processors, each paired with local memory, connected to a shared global memory] Introduction to Massively Parallel Processing – Slide 28

  29. A Generic Manycore Chip • Many processors, each supporting many hardware threads • On-chip memory near processors (cache, RAM or both) • Shared global memory space (external DRAM) [Diagram: many processors, each paired with local memory, connected to a shared global memory] Introduction to Massively Parallel Processing – Slide 29

  30. Example: Intel Nehalem 4-core processor • Four 3.33 GHz cores, each of which can run 2 threads • Integrated MMU Introduction to Massively Parallel Processing – Slide 30

  31. So The Days of Serial Performance Scaling are Over • Cannot continue to scale processor frequencies • No 10 GHz chips • Cannot continue to increase power consumption • Can’t melt chips • Can continue to increase transistor density • As per Moore’s Law Introduction to Massively Parallel Processing – Slide 31

  32. Future Applications Reflect a Concurrent World • Exciting applications in future mass computing market have been traditionally considered “supercomputing applications” • Molecular dynamic simulation, video and audio coding and manipulation, 3D imaging and visualization, consumer game physics, virtual reality products, … • These “super apps” represent and model physical, concurrent world • Various granularities of parallelism exist, but… • Programming model must not hinder parallel implementation • Data delivery needs careful management Introduction to Massively Parallel Processing – Slide 32

  33. The “New” Moore’s Law • Computers no longer get faster, just wider • You must rethink your algorithms to be parallel! • Data-parallel computing is most scalable solution • Otherwise: refactor code for 2 cores, 4 cores, 8 cores, 16 cores, … • You will always have more data than cores – build the computation around the data Introduction to Massively Parallel Processing – Slide 33
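
A minimal sketch of what "building the computation around the data" looks like in CUDA (assumed example, not taken from the slides): the saxpy loop from earlier becomes a kernel in which each thread handles one element, so the same code runs unchanged on 2, 4 or hundreds of cores with no refactoring.

```cuda
// One thread per element: the launch supplies as many threads as there are
// data items, and the kernel does not change as core counts grow.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global element index
    if (i < n)                                      // guard the tail block
        y[i] = a * x[i] + y[i];
}

// Host-side launch (d_x and d_y are hypothetical device pointers):
// saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);
```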

  34. Stretching Traditional Architectures • Traditional parallel architectures cover some super apps • DSP, GPU, network and scientific applications, … • The game is to grow mainstream architectures “out” or domain-specific architectures “in” • CUDA is the latter Introduction to Massively Parallel Processing – Slide 34

  35. Topic 3: Graphics Processing Units (GPUs) Introduction to Massively Parallel Processing – Slide 35

  36. GPUs: The Big Development • Massive economies of scale • Massively parallel Introduction to Massively Parallel Processing – Slide 36

  37. GPUs: The Big Development • Produced in vast numbers for computer graphics • Increasingly being used for • Computer game “physics” • Video (e.g. HD video decoding) • Audio (e.g. MP3 encoding) • Multimedia (e.g. Adobe software) • Computational finance • Oil and gas • Medical imaging • Computational science Introduction to Massively Parallel Processing – Slide 37

  38. GPUs: The Big Development • The GPU sits on a PCIe graphics card inside a standard PC/server with one or two multicore CPUs Introduction to Massively Parallel Processing – Slide 38

  39. GPUs: The Big Development • Up to 448 cores on a single chip • Simplified logic (no out-of-order execution, no branch prediction) means much more of the chip is devoted to computation • Arranged as multiple units with each unit being effectively a vector unit, all cores doing the same thing at the same time • Very high bandwidth (up to 140GB/s) to graphics memory (up to 4GB) • Not general purpose – for parallel applications like graphics and Monte Carlo simulations • Can also build big clusters out of GPUs Introduction to Massively Parallel Processing – Slide 39

  40. GPUs: The Big Development • Four major vendors: • NVIDIA • AMD: bought ATI several years ago • IBM: co-developed Cell processor with Sony and Toshiba for Sony Playstation, but now dropped it for high-performance computing • Intel: was developing “Larrabee” chip as GPU, but now aimed for high-performance computing Introduction to Massively Parallel Processing – Slide 40

  41. CPUs and GPUs • A quiet revolution and potential buildup • Computation: TFLOPs vs. 100 GFLOPs • CPU in every PC – massive volume and potential impact [Chart: peak performance over time for NVIDIA GPUs (NV30, NV40, G70, G80, GT200, T12) versus Intel CPUs (3GHz Core2 Duo, 3GHz Xeon Quad, Westmere)] Introduction to Massively Parallel Processing – Slide 41

  42. CPUs and GPUs • A quiet revolution and potential buildup • Bandwidth: ~10x • CPU in every PC – massive volume and potential impact [Chart: memory bandwidth over time for NVIDIA GPUs (NV30, NV40, G70, G80, GT200, T12) versus Intel CPUs (3GHz Core2 Duo, 3GHz Xeon Quad, Westmere)] Introduction to Massively Parallel Processing – Slide 42

  43. GPU Evolution • High throughput computation • GeForce GTX 280: 933 GFLOPs/sec • High bandwidth memory • GeForce GTX 280: 140 GB/sec • High availability to all • 180M+ CUDA-capable GPUs in the wild Introduction to Massively Parallel Processing – Slide 43

  44. GPU Evolution [Timeline, 1995–2010: RIVA 128 (3M transistors), GeForce 256 (23M), GeForce 3 (60M), GeForce FX (125M), GeForce 8800 (681M), “Fermi” (3B)] Introduction to Massively Parallel Processing – Slide 44

  45. Lessons from the Graphics Pipeline • Throughput is paramount • Must paint every pixel within frame time • Scalability • Create, run and retire lots of threads very rapidly • Measured 14.8 Gthreads/sec on increment() kernel • Use multithreading to hide latency • One stalled thread is o.k. if 100 are ready to run Introduction to Massively Parallel Processing – Slide 45
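
A hedged guess at what an increment()-style microbenchmark might look like (the actual kernel behind the 14.8 Gthreads/sec figure is not shown in the slides): each thread does a single add, so the measurement is dominated by how quickly threads can be created, scheduled and retired rather than by arithmetic.

```cuda
// Trivial per-thread work: one load, one add, one store.
__global__ void increment(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1;
}
```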

  46. Why is this different from a CPU? • Different goals produce different designs • GPU assumes the workload is highly parallel • CPU must be good at everything, parallel or not • CPU: minimize latency experienced by one thread • Big on-chip caches, sophisticated control logic • GPU: maximize throughput of all threads • Number of threads in flight limited by resources → provide lots of resources (registers, bandwidth, etc.) • Multithreading can hide latency → skip the big caches • Share control logic across many threads Introduction to Massively Parallel Processing – Slide 46

  47. NVIDIA GPU Architecture • GeForce 8800 (2007) [Block diagram: host and input assembler feeding a thread execution manager, an array of processing units each with a parallel data cache and texture units, load/store units, and global memory] Introduction to Massively Parallel Processing – Slide 47

  48. G80 Characteristics • 16 highly threaded streaming multiprocessors (SMs) • > 128 floating point units (FPUs) • 367 GFLOPs peak performance (25-50 times that of current high-end microprocessors) • 265 GFLOPs sustained for applications such as visual molecular dynamics (VMD) • 768 MB DRAM • 86.4 GB/sec memory bandwidth • 4 GB/sec bandwidth to CPU Introduction to Massively Parallel Processing – Slide 48

  49. G80 Characteristics • Massively parallel, 128 cores, 90 watts • Massively threaded, sustains 1000s of threads per application • 30-100 times speedup over high-end microprocessors on scientific and media applications • Medical imaging, molecular dynamics, … Introduction to Massively Parallel Processing – Slide 49

  50. NVIDIA GPU Architecture • Fermi GF100 (2010) • ~ 1.5 TFLOPs (single precision), ~800 GFLOPs (double precision) • 230 GB/sec DRAM bandwidth Introduction to Massively Parallel Processing – Slide 50
