Feeding the Multicore Beast: It’s All About the Data!

Michael Perrone, IBM Master Inventor, Mgr, Cell Solutions Dept.


Presentation Transcript


  1. Feeding the Multicore Beast: It’s All About the Data! Michael Perrone, IBM Master Inventor, Mgr, Cell Solutions Dept.

  2. Outline • History: Data challenge • Motivation for multicore • Implications for programmers • How Cell addresses these implications • Examples • 2D/3D FFT (medical imaging, petroleum, general HPC…) • Green’s functions (seismic imaging for petroleum) • String matching (network processing: DPI & intrusion detection) • Neural networks (finance) mpp@us.ibm.com

  3. Chapter 1: The Beast is Hungry!

  4. The Hungry Beast [Figure: data (“food”) reaches the processor (“beast”) through a data pipe] • Pipe too small = starved beast • Pipe big enough = well-fed beast • Pipe too big = wasted resources

  5. The Hungry Beast [Figure: data (“food”) reaches the processor (“beast”) through a data pipe] • Pipe too small = starved beast • Pipe big enough = well-fed beast • Pipe too big = wasted resources • If flops grow faster than pipe capacity, the beast gets hungrier!
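
The pipe analogy can be made concrete with a roofline-style bound: the attainable rate is the smaller of the processor's peak rate and the pipe bandwidth multiplied by the flops performed per byte moved. The sketch below is only an illustration; the function name and all numbers are hypothetical and do not come from the slides.

```python
def attainable_gflops(peak_gflops, bandwidth_gb_s, flops_per_byte):
    """Roofline-style bound: the beast eats no faster than the pipe delivers."""
    return min(peak_gflops, bandwidth_gb_s * flops_per_byte)


# Hypothetical numbers for illustration only (not from the slides).
pipe_gb_s = 10.0   # data pipe bandwidth in GB/s
intensity = 2.0    # flops performed per byte moved

# Peak flops double each step while the pipe stays fixed.
for peak in (10.0, 20.0, 40.0, 80.0):
    fed = attainable_gflops(peak, pipe_gb_s, intensity)
    print(f"peak {peak:5.1f} GFLOPS -> attainable {fed:5.1f} GFLOPS "
          f"({100 * fed / peak:.0f}% of peak)")
```

Once the peak exceeds what the pipe can feed, every further increase in flops lowers the fraction of peak that is achieved.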

  6. Move the food closer • Example: Intel Tulsa • Xeon MP 7100 series • 65nm, 349 mm², 2 Cores • 3.4 GHz @ 150W • ~54.4 SP GFlops • http://www.intel.com/products/processor/xeon/index.htm • Large cache on chip • ~50% of area • Keeps data close for efficient access • If the data is local, the beast is happy! • True for many algorithms
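
As a rough check on the peak figure quoted above, peak throughput is clock rate times core count times flops per cycle per core. The value of 8 single-precision flops per cycle per core is an assumption chosen because it reproduces the slide's number; the slides do not state it.

```python
def peak_gflops(clock_ghz, cores, flops_per_cycle_per_core):
    """Peak throughput = clock rate x number of cores x flops per cycle per core."""
    return clock_ghz * cores * flops_per_cycle_per_core


# Clock and core count are from the slide; 8 flops/cycle/core is an assumption.
tulsa_peak = peak_gflops(clock_ghz=3.4, cores=2, flops_per_cycle_per_core=8)
print(f"Estimated peak: {tulsa_peak:.1f} SP GFLOPS")
```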

  7. What happens if the beast is still hungry? • If the data set doesn’t fit in cache • Cache misses • Memory latency exposed • Performance degraded • Several important application classes don’t fit • Graph searching algorithms • Network security • Natural language processing • Bioinformatics • Many HPC workloads [Figure labels: Cache, Data]

  8. Make the food bowl larger • Cache size steadily increasing • Implications • Chip real estate reserved for cache • Less space on chip for computes • More power required for fewer FLOPS [Figure labels: Cache, Data]

  9. Make the food bowl larger • Cache size steadily increasing • Implications • Chip real estate reserved for cache • Less space on chip for computes • More power required for fewer FLOPS • But… • Important application working sets are growing faster • Multicore even more demanding on cache than uni-core [Figure labels: Cache, Data]

  10. Chapter 2: The Beast Has Babies

  11. Power Density – The fundamental problem

  12. What’s causing the problem? [Figure: power density (W/cm²) vs. gate length (microns), with the 65 nm point marked; inset: gate stack] Gate dielectric approaching a fundamental limit (a few atomic layers). Power, signal jitter, etc.

  13. Diminishing Returns on Frequency: In a power-constrained environment, raising chip clock speed yields diminishing returns. The industry has moved to lower-frequency multicore architectures. [Figure: frequency-driven design points]

  14. Power vs. Performance Trade-Offs [Figure: power vs. performance chart; data labels .85, 1, 1.3, 1.45, 1.7] We need to adapt our algorithms to get performance out of multicore

  15. Implications of Multicore • There are more mouths to feed • Data movement will take center stage • Complexity of cores will stop increasing … and has started to decrease in some cases • Complexity increases will center around communication • Assumption • Achieving a significant % of peak performance is important

  16. Chapter 3: The Proper Care and Feeding of Hungry Beasts

  17. Cell/B.E. Processor: 200 GFLOPS (SP) @ ~70W

  18. Feeding the Cell Processor • 8 SPEs each with • LS • MFC • SXU • PPE • OS functions • Disk IO • Network IO [Block diagram: eight SPEs (SPU with SXU and LS, plus MFC) and the PPE (PPU with PXU and L1, plus L2; 64-bit Power Architecture with VMX) attached to the EIB (up to 96B/cycle), together with the MIC (Dual XDR™) and BIC (FlexIO™); link labels: 16B/cycle, 16B/cycle (2x), 32B/cycle]

  19. Cell Approach: Feed the beast more efficiently • Explicitly “orchestrate” the data flow between main memory and each SPE’s local store • Use SPE’s DMA engine to gather & scatter data between main memory and local store • Enables detailed programmer control of data flow • Get/Put data when & where you want it • Hides latency: Simultaneous reads, writes & computes • Avoids restrictive HW cache management, which is • Unlikely to determine optimal data flow • Potentially very inefficient • Allows more efficient use of the existing bandwidth

  20. Cell Approach: Feed the beast more efficiently • Explicitly “orchestrate” the data flow between main memory and each SPE’s local store • Use SPE’s DMA engine to gather & scatter data between main memory and local store • Enables detailed programmer control of data flow • Get/Put data when & where you want it • Hides latency: Simultaneous reads, writes & computes • Avoids restrictive HW cache management, which is • Unlikely to determine optimal data flow • Potentially very inefficient • Allows more efficient use of the existing bandwidth • BOTTOM LINE: It’s all about the data!
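
The "simultaneous reads, writes & computes" idea is usually realized with double buffering: while the code computes on one buffer, the transfer of the next one is already in flight. The sketch below only mimics that structure in portable Python. `dma_get` is a hypothetical stand-in for an asynchronous DMA get, a worker thread plays the role of the DMA engine, and none of this is the Cell SDK API.

```python
from concurrent.futures import ThreadPoolExecutor


def dma_get(memory, start, size):
    """Hypothetical stand-in for an asynchronous DMA get into a local buffer."""
    return memory[start:start + size]


def process_double_buffered(memory, chunk, compute):
    """Compute on chunk i while the transfer of chunk i+1 is in flight."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as dma:
        pending = dma.submit(dma_get, memory, 0, chunk)
        for start in range(0, len(memory), chunk):
            current = pending.result()        # wait for the buffer needed now
            nxt = start + chunk
            if nxt < len(memory):             # start fetching the next buffer...
                pending = dma.submit(dma_get, memory, nxt, chunk)
            results.append(compute(current))  # ...while computing on this one
    return results


data = list(range(10))
print(process_double_buffered(data, 4, sum))
```

In Python the overlap is structural rather than a real speedup; on hardware with an independent DMA engine the same loop shape is what hides the memory latency.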

  21. Cell Comparison: ~4x the FLOPS @ ~½ the power Both 65nm technology (to scale)
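
The comparison can be reproduced from the figures quoted on the earlier slides (~54.4 SP GFLOPS at 150W for the Intel Tulsa example, 200 SP GFLOPS at ~70W for Cell/B.E.). The short calculation below is just that arithmetic; the variable names are illustrative.

```python
# Figures quoted on the slides: (peak SP GFLOPS, watts).
tulsa = (54.4, 150.0)   # Intel Xeon MP 7100 series "Tulsa"
cell = (200.0, 70.0)    # Cell/B.E.

flops_ratio = cell[0] / tulsa[0]
power_ratio = cell[1] / tulsa[1]
efficiency_ratio = (cell[0] / cell[1]) / (tulsa[0] / tulsa[1])

print(f"FLOPS ratio:    {flops_ratio:.2f}x")
print(f"Power ratio:    {power_ratio:.2f}x")
print(f"GFLOPS/W ratio: {efficiency_ratio:.2f}x")
```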

  22. Memory-Managing Processor vs. Traditional General-Purpose Processor [Figure labels: Cell BE, AMD, IBM, Intel]

  23. Examples of Feeding Cell • 2D and 3D FFTs • Seismic Imaging • String Matching • Neural Networks (function approximation)

  24. Feeding FFTs to Cell • SIMDized data • DMAs double buffered • Pass 1: For each buffer • DMA Get buffer • Do four 1D FFTs in SIMD • Transpose tiles • DMA Put buffer • Pass 2: For each buffer • DMA Get buffer • Do four 1D FFTs in SIMD • Transpose tiles • DMA Put buffer [Figure labels: Input Image, Buffer, Tile, Transposed Tile, Transposed Buffer, Transposed Image]
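
The two passes above follow the row-column structure of a 2D FFT: transform the rows, transpose, transform the rows again, transpose back. The sketch below shows only that structure in plain Python and checks it against a direct 2D DFT. It does not model SIMD, tiling, or DMA, and the function names are illustrative.

```python
import cmath


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def transpose(m):
    return [list(col) for col in zip(*m)]


def fft2d(image):
    """Pass 1: row FFTs + transpose. Pass 2: row FFTs + transpose back."""
    pass1 = transpose([fft(row) for row in image])
    return transpose([fft(row) for row in pass1])


def dft2d(image):
    """Direct O(N^4) 2D DFT, used only to check fft2d."""
    h, w = len(image), len(image[0])
    return [[sum(image[y][x] * cmath.exp(-2j * cmath.pi * (u * y / h + v * x / w))
                 for y in range(h) for x in range(w))
             for v in range(w)] for u in range(h)]


img = [[1, 2, 3, 4], [0, -1, 2, 5], [7, 0, 0, 1], [2, 2, -3, 0]]
fast, slow = fft2d(img), dft2d(img)
print(max(abs(fast[u][v] - slow[u][v]) for u in range(4) for v in range(4)))
```

Because of the transpose, both passes read contiguous rows, which is the access pattern that suits streaming buffers in and out.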

  25. 3D FFTs • Long stride thrashes cache • Cell DMA allows prefetch [Figure labels: Stride 1, N, Stride N², Single Element, Data envelope]
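
The stride problem is easy to see from the flat memory offsets. Assuming an N×N×N cube stored with x varying fastest (an assumption, since the slide does not state the layout), a 1D FFT along x touches consecutive elements, while one along z jumps N² elements each step. The helper names below are illustrative.

```python
def flat_index(x, y, z, n):
    """Offset of element (x, y, z) in an n*n*n cube stored with x varying fastest."""
    return x + n * y + n * n * z


def pencil_offsets(axis, n, x=0, y=0, z=0):
    """Offsets touched by a 1D FFT along one axis through the point (x, y, z)."""
    if axis == "x":
        return [flat_index(i, y, z, n) for i in range(n)]
    if axis == "y":
        return [flat_index(x, i, z, n) for i in range(n)]
    return [flat_index(x, y, i, n) for i in range(n)]


n = 4
for axis in "xyz":
    offsets = pencil_offsets(axis, n)
    print(axis, "stride", offsets[1] - offsets[0], offsets)
```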

  26. Feeding Seismic Imaging to Cell • New G at each (x,y) • Radial symmetry of G reduces BW requirements [Figure labels: Data, Green’s Function, (X,Y)]

  27. Feeding Seismic Imaging to Cell [Figure labels: Data; SPE 0 through SPE 7]

  28. Feeding Seismic Imaging to Cell [Figure labels: Data; SPE 0 through SPE 7]

  29. Feeding Seismic Imaging to Cell • For each X • Load next column of data • Load next column of indices • For each Y • Load Green’s functions • SIMDize Green’s functions • Compute convolution at (X,Y) • Cycle buffers [Figure labels: 2R+1, R, H, (X,Y), Data buffer, Green’s Index buffer, 1, 2]
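
The loop above can be sketched as a sliding window of 2R+1 data columns: each step in X loads one new column and retires the oldest, so every column is fetched only once. Everything in the sketch is an assumption made for illustration: `green` is a hypothetical symmetric kernel, the index buffer is omitted, and the output is a plain weighted sum over the window.

```python
from collections import deque


def green(x, y, r):
    """Hypothetical position-dependent, symmetric (2r+1) x (2r+1) kernel."""
    return [[1.0 / (1 + abs(dx) + abs(dy) + (x + y) % 3)
             for dy in range(-r, r + 1)] for dx in range(-r, r + 1)]


def image_columns(data, r):
    """data[x][y] is sample y of column x; returns results for interior points."""
    width, height = len(data), len(data[0])
    window = deque(maxlen=2 * r + 1)      # buffer holding 2R+1 columns
    out = {}
    for x_in in range(width):
        window.append(data[x_in])         # load next column; the oldest drops out
        if len(window) < 2 * r + 1:
            continue
        x = x_in - r                      # the window is centred on column x
        for y in range(r, height - r):
            g = green(x, y, r)            # new G at each (x, y)
            out[(x, y)] = sum(g[i][j] * window[i][y - r + j]
                              for i in range(2 * r + 1)
                              for j in range(2 * r + 1))
    return out


data = [[float((3 * x + 5 * y) % 7) for y in range(6)] for x in range(7)]
result = image_columns(data, 1)
print(len(result), "interior points computed")
```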

  30. Feeding String Matching to Cell • Find (lots of) substrings in (long) string • Build graph of words & represent as DFA • Problem: Graph doesn’t fit in LS [Figure: sample word list: “the”, “that”, “math”]
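
One common way to build such a word graph is the Aho-Corasick automaton, used here as an illustrative choice; the slides do not name the algorithm. The sketch below builds the transition, failure, and output tables for the sample word list and scans a text in a single pass. Those tables are the structure that would have to be streamed from main memory when they exceed the local store.

```python
from collections import deque


def build_automaton(words):
    """Build Aho-Corasick tables: goto transitions, failure links, outputs."""
    goto, fail, out = [{}], [0], [[]]
    for w in words:
        s = 0
        for ch in w:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(w)
    queue = deque(goto[0].values())       # depth-1 states fail to the root
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] = out[t] + out[fail[t]]
    return goto, fail, out


def find_all(text, words):
    """Return (start_index, word) for every occurrence of every word in text."""
    goto, fail, out = build_automaton(words)
    s, hits = 0, []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        hits.extend((i - len(w) + 1, w) for w in out[s])
    return hits


print(find_all("mathethat", ["the", "that", "math"]))
```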

  31. Feeding String Matching to Cell

  32. Hiding Main Memory Latency

  33. Software Multithreading

  34. Feeding Neural Networks to Cell • Neural net function F(X) • RBF, MLP, KNN, etc. • If too big for LS, BW bound [Figure: input X with D input dimensions; D×N matrix of parameters; N basis functions (dot product + nonlinearity); output F]

  35. Convert BW Bound to Compute Bound • Split function over multiple SPEs • Avoids unnecessary memory traffic • Reduce compute time per SPE • Minimal merge overhead [Figure label: Merge]
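
The split-and-merge idea can be sketched by giving each worker a slice of the N basis functions, so its share of the parameter matrix can stay resident, and then adding the partial results. The functional form below (a weighted sum of tanh of dot products) and the output weights are assumptions for illustration; the slides only say "dot product + nonlinearity".

```python
import math


def basis_sum(params, weights, x):
    """Sum of w_j * tanh(p_j . x) over the given basis functions."""
    return sum(w * math.tanh(sum(p_i * x_i for p_i, x_i in zip(p, x)))
               for p, w in zip(params, weights))


def split_evaluate(params, weights, x, n_workers):
    """Give each worker a contiguous slice of basis functions, then merge."""
    n = len(params)
    bounds = [n * k // n_workers for k in range(n_workers + 1)]
    partials = [basis_sum(params[lo:hi], weights[lo:hi], x)
                for lo, hi in zip(bounds, bounds[1:])]
    return sum(partials)                  # merge step: one add per worker


# Hypothetical network: N = 10 basis functions, D = 3 input dimensions.
params = [[0.1 * (i + j) for i in range(3)] for j in range(10)]
weights = [(-1) ** j * 0.5 for j in range(10)]
x = [1.0, -2.0, 0.5]

print(basis_sum(params, weights, x), split_evaluate(params, weights, x, 8))
```

The merge costs one addition per worker, which is consistent with the "minimal merge overhead" bullet.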

  36. Moral of the Story: It’s All About the Data! • The data problem is growing: multicore • Intelligent software prefetching • Use DMA engines • Don’t rely on HW prefetching • Efficient data management • Multibuffering: Hide the latency! • BW utilization: Make every byte count! • SIMDization: Make every vector count! • Problem/data partitioning: Make every core work! • Software multithreading: Keep every core busy!

  37. Backup

  38. Abstract: Technological obstacles have prevented the microprocessor industry from achieving increased performance through increased chip clock speeds. In reaction to these restrictions, the industry has chosen the multicore path. Multicore processors promise tremendous GFLOPS performance but raise the challenge of how one programs them. In this talk, I will discuss the motivation for multicore, the implications for programmers, and how the Cell/B.E. processor’s design addresses these challenges. As an example, I will review one or two applications that highlight the strengths of Cell.
