1 / 24

Arun Rodrigues, Scott Hemmert , Dave Resnick : Sandia National Lab (ABQ) Keren Bergman: Columbia University Bruce

Arun Rodrigues, Scott Hemmert , Dave Resnick : Sandia National Lab (ABQ) Keren Bergman: Columbia University Bruce Jacob: U. Maryland John Shalf, Paul Hargrove : Lawrence Berkeley National Laboratory Gilbert Hendry: Sandia National Laboratory

devi
Download Presentation

Arun Rodrigues, Scott Hemmert , Dave Resnick : Sandia National Lab (ABQ) Keren Bergman: Columbia University Bruce

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Arun Rodrigues, Scott Hemmert, Dave Resnick: Sandia National Lab (ABQ) Keren Bergman: Columbia University Bruce Jacob: U. Maryland John Shalf, Paul Hargrove: Lawrence Berkeley National Laboratory Gilbert Hendry: Sandia National Laboratory Dan Quinlan, Chunhua Liao: Lawrence Livermore National Lab SudhakarYalamanchili: Georgia Tech Data Movement Dominates (DMD) and CoDEx: CoDesign for Exascale

  2. CodesignTools RecapArchitectural Simulation to Accelerate CoDesign ROSE Compiler: Enables deep analysis of application requirements, semi-automatic generation of skeleton applications, and code generation for ACE and SST. ACE Node Emulation: Rapid design synthesis and FPGA-accelerated emulation for rapid prototyping cycle accurate models of manycore node designs. SST Macro System Simulation: Enables system-scale simulation through capture of application communication traces and simulation of large-scale interconnects. SST Micro Software Simulators: Software simulation for node-level simulation

  3. CodesignTools RecapArchitectural Simulation to Accelerate CoDesign ROSE Compiler: Enables deep analysis of application requirements, semi-automatic generation of skeleton applications, and code generation for ACE and SST. ACE Node Emulation: Rapid design synthesis and FPGA-accelerated emulation for rapid prototyping cycle accurate models of manycore node designs. SST Macro System Simulation: Enables system-scale simulation through capture of application communication traces and simulation of large-scale interconnects. SST Micro Software Simulators: Software simulation for node-level simulation CoDEx: CoDesign For Exascale ASCR-funded Simulation Infrastructure Project SST: Structure Simulation Toolkit NNSA-funded Simulation Tools (ASC Program)

  4. CodesignTools RecapArchitectural Simulation to Accelerate CoDesign ROSE Compiler: Enables deep analysis of application requirements, semi-automatic generation of skeleton applications, and code generation for ACE and SST. ACE Node Emulation: Rapid design synthesis and FPGA-accelerated emulation for rapid prototyping cycle accurate models of manycore node designs. SST Macro System Simulation: Enables system-scale simulation through capture of application communication traces and simulation of large-scale interconnects. SST Micro Software Simulators: Software simulation for node-level simulation CoDEx: CoDesign For Exascale ASCR-funded Simulation Infrastructure Project CAL: (Sandia/LBL) Computer Architecture Laboratory SST: Structure Simulation Toolkit NNSA-funded Simulation Tools (ASC Program)

  5. Fidelity vs. Scope for Architectural Simulation Methods

  6. ROSE CompilerFull Program Understanding through Deep Source-Code Analysis

  7. ExaSAT: Exascale Static Analysis ToolCompiler-Automated Performance Model Extraction • Can automatically predict performance for many input codes and software optimizations • Predict performance under different architectural scenarios • Much faster than hardware simulation and manual modeling Performance Prediction Spreadsheet Machine Parameters Performance Model Combustion Codes Compiler Analysis <XML> Dependency Graph Optimization User Parameters

  8. SST/macro: Coarse-Grained Simulation An application code with minor modifications SST/Macro Impl. of interfaces (MPI), which simulate execution and communication

  9. SST/micro: Cycle-Accurate Framework • Has a general simulation framework for integrating models • Simulation backend is parallel • Plenty of people involved

  10. Some Models Currently Integrated Gem5 is a well-known architectural simulator with models for processors, caches, busses, and network components. MacSimprovides a model of GPU/CPU cores or geterogenous computing nodes, which can be driven from x86 or PTX (CUDA) traces. IRISprovides a pipelined, cycle- accurate router model capable of modeling a variety of Network-on-Chip (NoC) and inter-node interconnection architectures. PhoenixSimmodels photonic networks.

  11. Leveraging Embedded Design AutomationFor Design Space Exploration This stuff is essential!

  12. Embedded Design Automation(Using FPGA emulation to do rapid prototyping) RAMP FPGA-accelerated Emulation of ASIC Or “tape out” To FPGA

  13. Data Movement Dominates (Sandia, Micron, Columbia, LBL)Understand the Potential of Intelligent, Stacked DRAM Technology • Data movement are projected to account for over 75% of power budget for an exascale platform • Work to reduce that via • Optical interconnect(s) • 3D stacking (logic + memory + optics) • New memory protocols Research Questions • What is the performance potential of stacked memory (power & speed) • How much intelligence to put into logic layer • Atomics, gather/scatter, checksums, full-processor-in-memory • What is the memory consistency model for intelligent DRAM • How to program it if we put embed more intelligence into DRAM

  14. The Cost of Moving Data

  15. Locality Management is KeyWhat are the best combination of software and hardware mechanisms to maximize data movement efficiency Vertical Locality Management Horizontal Locality Management Sun Microsystems Temporal Topological

  16. Why Study Chip Stacking (TSVs)?Energy = (V 2 ∗ C) ∗ Overhead + Ecomm DRAM Cells Efficient TSVs Reduce Costs TSVs orders of magnitude less energy –250 fJ/bit for reading DRAM –5 fJ/bit for TSV –250 fJ/bit for mem. controller –~0.5 pJ/bit (compared to 30pJ for conventional DIMM) –Don’t have to access more data than needed • Enables....–Lower Capacitance: Narrower –Lower Overhead: Smarter –In-Memory computation • Requires –...changes to how we view the machine & the memory • DRAM cells require < 1 pJ to access • Current DRAM architectures are not power efficient • Long distances ➔ high power • We pay for more than we get at every level • Cache: throw away 75-80% • DRAM Row: Charge 1024B for each 64B access • DIMM: Charge 8-9 chips/access • ~800 pJ/byte total • DRAM design driven by packaging constraints • ~50% of DRAM chip cost is packaging, mainly in pins • DIMMs use multiple chips with a few data pins to achieve high BW

  17. Why Photonics? Photonics changes the rules for Bandwidth-per-Watt. ELECTRONICS: • Buffer, receive, and re-transmitat every router. • Space Parallelism:Each bus lane routed independently (P  NLANES). • Off-chip BW requires much more power than on-chip BW. Photonics: • Modulate/receive data stream once per communication event. • Wavelength Parallelism:Broadband switch routes entire multi-wavelength stream. • Off-chip BW ≈ on-chip BW for nearly same power. RX RX RX RX RX TX RX TX TX TX TX TX

  18. Why Optically-Connected Memory? Traditional Memory Optically-Connected Memory HBDRAM HBDRAM HBDRAM HBDRAM CPU CPU HBDRAM HBDRAM HBDRAM HBDRAM Electronic Bus Optical Link • Large Pin-out • Complex wiring • Low bandwidth density • Distance constrained by electrical limitations • High power dissipation • All-optical link, no electronic bus to drive • Bit-rate transparent link • High bandwidth density, less pins • Distance immunity at computer scale • Low power dissipation Will not scale to meet power and bandwidth requirements of future high-performance computing systems Enables scaling of high-performance computing through increased memory capacity and bandwidth

  19. Mixed Model Simulationcycle accurate and energy-accurate models MPI Traces (DUMPI) kernels skeleton app (C, C++, Fortran) Processor Model (SST/micro & Tensilica) SST/macro Workload Translation Checkpoint/restart SystemC (C++) NoC Model (PhoenixSim) Fault Injection Address Translation Memory Model (DRAMSim2, FLASHsim, NVRAM)

  20. Simulator Infrastructure: Interconnectscycle accurate and energy-accurate models Developed by Sandia Collaborators CoDEx project

  21. Simulator Infrastructure: Memorycycle accurate and energy-accurate models Validated against Micron DRAM HMC model coming this summer

  22. Simulator Infrastructurecycle accurate and energy-accurate models Rewrote Columbia PhoenixSim summer 2011 Orion-2 energy model Validated against Cornell test parts

  23. Simulator Infrastructurecycle accurate and energy-accurate models Full Gate-level RTL model of processor Well characterized energy model

More Related