
Performance


Presentation Transcript


  1. Performance
    • What hardware accelerators are you using/evaluating?
    • Cells in a Roadrunner configuration
      • 8-way SPE threads w/ local memory, DMA & vector unit – programming issues but tremendous flexibility (a small SIMD sketch follows this slide)
      • Fast (25.6 GB/s) & large memory (4 GB or larger)
      • Augmented C language; also C++ & now Fortran; GNU & XL variants; OpenMP is new; OpenCL is being prototyped
      • Opterons can run the bulk of code not needing acceleration; Cell-only clusters possible
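The "vector unit" bullet above refers to the SPU's 128-bit SIMD registers programmed through the Cell SDK's C extensions. Below is a minimal, hedged sketch of that style: a saxpy over a buffer already resident in SPE local store. The names N, local_x, and local_y are illustrative placeholders, not from the slides; 16-byte alignment of the buffers is assumed.

```c
/* Sketch only: SPU SIMD saxpy on data already DMA'd into local store.
 * Assumes the Cell SDK SPU C extensions, where "vector float" holds 4 floats. */
#include <spu_intrinsics.h>

#define N 1024                                  /* elements per local buffer */

void saxpy_spe(float a, float *local_x, float *local_y)
{
    vector float va  = spu_splats(a);            /* broadcast scalar a        */
    vector float *vx = (vector float *)local_x;  /* assumes 16-byte alignment */
    vector float *vy = (vector float *)local_y;

    for (int i = 0; i < N / 4; i++)              /* 4 floats per vector       */
        vy[i] = spu_madd(va, vx[i], vy[i]);      /* y = a*x + y               */
}
```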

  2. Performance
    • What hardware accelerators are you using/evaluating? Several years ago…
    • GPUs (pre CUDA & Tesla)
      • Brook & Scout (LANL data-parallel language)
      • No 32-bit at the time; limited memory; everything is a data-parallel problem
      • No ECC memory; insufficient parity/ECC protection of data paths and logic
      • Others at LANL are still working in this area (including Tesla & CUDA)
    • Clearspeed (several years ago)
      • Earliest Clearspeeds, before the Advance families
      • Augmented C language; 96 SIMD PEs
      • Everything is done as long SIMD data parallel and in sync
      • Low power
    • FPGAs (HDL, several years ago)
      • Programming is hard -- very hard
      • Logic space limited the number of 64-bit ops
      • Fast SRAM but small; external DRAM of modest size but no faster than CPUs
      • One algorithm at a time, so significant impact to use for multi-physics
      • Low power

  3. Performance
    • Describe the applications that you are porting to accelerators.
      • MD (materials), laser-plasma PIC, IMC X-ray (particle) transport, GROMACS, n-body universe & galaxies, DNS turbulence & supernovae, HIV genealogy, nanowire long-time-scale MD
      • Ocean circulation, wildfires, discrete social simulations, clouds & rain, influenza spread, plasma turbulence, plasma sheaths, fluid instabilities
    • My personal observations:
      • Particle methods are generally easiest
      • Codes with good characteristics:
        • A few computationally intense “algorithms”
        • Pre-existing or obvious “fine-grain” parallel work units
        • C language versus Fortran or highly OO C++

  4. Performance
    • Describe the kinds of speed-ups you are seeing (provide the basis for the comparison).
      • 5x to 10x over a single Opteron core for memory-bandwidth-intensive code running at 5%-10% of peak
      • 10x to 25x on particle methods, searches, etc.
    • How does it compare to scaling out (i.e., just using more x86 processors)? What are the bottlenecks to further performance improvements?
      • Scale out via more sockets is better – BUT!
        • Scaling efficiencies are already a problem for several LANL applications running at 4,000 to 10,000 cores; scaling out LANL-sized machines means $$$ for HW, space, & power
      • Scaling out by multi-core is not a clear winner
        • Memory BW and cache architectures often limit performance, which Cells mostly get around
        • Memory BW per core is decreasing at an “inverse Moore’s law” rate!

  5. Economics
    • Describe the programming effort required to make use of the accelerator.
      • ½ to 1 man-year to “convert” a code, mostly dealing with data structures and threaded parallelism designs.
      • The lack of debugging & similar tools is like the earliest days of parallel computing (LANL was a leader then as well – remember the early PVM Ethernet workstation “carpet” clusters in the mid-80’s, before MPPs)
      • We like to see 1-2 programming experts (PhD-level or equiv.) assigned to forefront-science code projects, which have 1 to 4+ physics experts (PhD-level)
    • Amortization
      • Ready for the future – codes and skilled programmers. We expect our dual-level (MPI+threads) & SIMD-vectorization techniques used for Roadrunner to pay off on future multi-core and many-core chips as well.
      • It’s not just about running codes this year. Others will have to work through new forms of parallelism soon.
      • We can do science now that isn’t possible with most other machines

  6. Economics
    • Compare accelerator cost to scaling-out cost
      • Commodity-processor-only machines would have cost 2x what Roadrunner did in 2006-2007 (~$80M more)
      • They would have used 2x or more power (~$1M per MW)
      • Significantly larger node counts cause scaling & reliability issues
      • Accelerators or heterogeneous chips should be greener
    • Ease of use issues
      • Newer Cell programming techniques (ALF, OpenMP) could make this easier.
      • A Cell cluster would be easier, but the PPE is really, really slow for non-SPU-accelerated code segments.
      • Not for the faint of heart, but Top20 machines never are

  7. Futures
    • What is the future direction of hardware-based accelerators?
      • Domain-specific libraries can make them far more useful in those specific areas
      • Some may appear on Intel QPI or AMD HT.
      • Specialized cores will show up within commodity microprocessors – ignore them or use them
      • GPU-based systems will have to adopt ECC & parity protection
      • Convey appears to have the most viable FPGA approach (FPGA as a compiler-managed co-processor)
    • Software futures?
      • OpenCL looks promising but doesn’t address programming the specialized accelerator devices themselves
      • The uber-auto-wizard-compiler will never come
      • Heterogeneous compilers may come.
      • Debuggers & tools may come
    • What are your thoughts on what the vendors need to do to ensure wider acceptance of accelerators?
      • Create next-generation versions and sell them as mainstream products

  8. Steps in a Cell Conversion
    • Compile & run on the PowerPC PPE
    • Identify & isolate the algorithm & data to run in parallel on 8 “remote” SPEs
    • Compile a scalar version of the algorithm on the SPE
    • Add SPE thread process control
    • Add DMAs
      • Use “blocking” DMAs at this stage, just for functionality (see the sketch after this slide)
      • Worry about data alignments
      • First on a single SPE, then on 8 SPEs
    • Optimize SPE code
      • SIMD, branches & merges
      • Add asynchronous double/triple buffering of DMAs
    • For Roadrunner, connect to the rest of the code on the Opteron via DaCS and a “message relay”
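A minimal sketch of the early "blocking DMA" stage described above: pull one aligned block from main memory into SPE local store, compute on it, and push it back before optimizing. The MFC calls are the standard Cell SDK spu_mfcio.h ones; BLOCK_BYTES and process_block() are hypothetical placeholders, and a single 16 KB transfer (the MFC per-DMA maximum) is assumed.

```c
/* Sketch of the blocking-DMA stage on one SPE (functionality first,
 * double/triple buffering later).  Assumes argp carries the effective
 * address of the data block prepared by the PPE. */
#include <spu_mfcio.h>

#define BLOCK_BYTES 16384                         /* one max-size DMA transfer */
static char buf[BLOCK_BYTES] __attribute__((aligned(128)));

extern void process_block(char *p, unsigned int bytes);   /* placeholder */

int main(unsigned long long speid,
         unsigned long long argp,                 /* EA of the data block */
         unsigned long long envp)
{
    const unsigned int tag = 1;

    mfc_get(buf, argp, BLOCK_BYTES, tag, 0, 0);   /* main memory -> local store */
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();                    /* block until the DMA is done */

    process_block(buf, BLOCK_BYTES);              /* scalar first, SIMD later   */

    mfc_put(buf, argp, BLOCK_BYTES, tag, 0, 0);   /* local store -> main memory */
    mfc_read_tag_status_all();
    return 0;
}
```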

  9. Roadrunner & LANL: addressing the shock moving through high-performance computing • Roadrunner is more than a petascale supercomputer for today’s use • provides a balanced platform to explore new algorithm design, programming models, and to refresh developer skills • LANL has been an early adopter of transformational technology*: • 1970s: HPC is scalarLANL adopts vector (Cray 1 w/ no OS) • 1980s: HPC is vector LANL adopts data parallel (big CM-2) • 2000s: HPC is multi-core clusters LANL adopts hybrid (Roadrunner) *Credit to Scott Pakin, CCS-1, for this list idea

  10. Perspective: Fun or Nightmare? (Opteron host ↔ Cell PPC ↔ Cell SPEs, x8 parallel)
    • (1) The Opteron node (running non-accelerated MPI code) launches the Cell code; host data is pushed/pulled to Cell memory over the PCIe link via DaCS
    • (2) The Cell’s serial PPC processor spawns parallel threads on the SPEs
    • (3) Each SPE DMA multi-buffers Cell data into its local memory
    • (4) Each SPE computes within its local memory buffers and simultaneously DMA multi-buffers data back to Cell memory, 8-way parallel, until done
    • (5a/5b) During this time the node may need to push/pull more data to/from the Cell & to/from the cluster (MPI), or could be available for concurrent work
    • (6) Once the parallel threads have completed, updated data is pushed/pulled back to the host via DaCS; the Cell code completes and the host resumes non-accelerated MPI code
    • How much can be automated in compilers or languages?
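A hedged sketch of step (2) above as it looks on the Cell PPE: spawning 8-way parallel SPE threads with libspe2, one pthread per SPE because spe_context_run() blocks. The embedded SPE image name worker_spu and the launch_spe_threads() wrapper are hypothetical; the DaCS/MPI relay to the Opteron host is not shown.

```c
/* Sketch only: PPE-side launch of 8 SPE worker threads (libspe2 + pthreads). */
#include <libspe2.h>
#include <pthread.h>

#define NUM_SPES 8

extern spe_program_handle_t worker_spu;      /* embedded SPE image (assumed) */

static void *run_spe(void *argp)
{
    spe_context_ptr_t ctx = spe_context_create(0, NULL);
    unsigned int entry = SPE_DEFAULT_ENTRY;

    spe_program_load(ctx, &worker_spu);
    spe_context_run(ctx, &entry, 0, argp, NULL, NULL);  /* blocks until done */
    spe_context_destroy(ctx);
    return NULL;
}

void launch_spe_threads(void *work[NUM_SPES])           /* one EA per SPE */
{
    pthread_t tid[NUM_SPES];

    for (int i = 0; i < NUM_SPES; i++)
        pthread_create(&tid[i], NULL, run_spe, work[i]);
    for (int i = 0; i < NUM_SPES; i++)
        pthread_join(tid[i], NULL);          /* parallel threads completed */
}
```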
