Arun Rodrigues, Scott Hemmert , Dave Resnick : Sandia National Lab (ABQ) Keren Bergman: Columbia University Bruce

Arun Rodrigues, Scott Hemmert, Dave Resnick: Sandia National Lab (ABQ) Keren Bergman: Columbia University Bruce Jacob: U. Maryland John Shalf, Paul Hargrove: Lawrence Berkeley National Laboratory Gilbert Hendry: Sandia National Laboratory Dan Quinlan, Chunhua Liao: Lawrence Livermore National Lab SudhakarYalamanchili: Georgia Tech Data Movement Dominates (DMD) and CoDEx: CoDesign for Exascale

CodesignTools RecapArchitectural Simulation to Accelerate CoDesign ROSE Compiler: Enables deep analysis of application requirements, semi-automatic generation of skeleton applications, and code generation for ACE and SST. ACE Node Emulation: Rapid design synthesis and FPGA-accelerated emulation for rapid prototyping cycle accurate models of manycore node designs. SST Macro System Simulation: Enables system-scale simulation through capture of application communication traces and simulation of large-scale interconnects. SST Micro Software Simulators: Software simulation for node-level simulation

CodesignTools RecapArchitectural Simulation to Accelerate CoDesign ROSE Compiler: Enables deep analysis of application requirements, semi-automatic generation of skeleton applications, and code generation for ACE and SST. ACE Node Emulation: Rapid design synthesis and FPGA-accelerated emulation for rapid prototyping cycle accurate models of manycore node designs. SST Macro System Simulation: Enables system-scale simulation through capture of application communication traces and simulation of large-scale interconnects. SST Micro Software Simulators: Software simulation for node-level simulation CoDEx: CoDesign For Exascale ASCR-funded Simulation Infrastructure Project SST: Structure Simulation Toolkit NNSA-funded Simulation Tools (ASC Program)

CodesignTools RecapArchitectural Simulation to Accelerate CoDesign ROSE Compiler: Enables deep analysis of application requirements, semi-automatic generation of skeleton applications, and code generation for ACE and SST. ACE Node Emulation: Rapid design synthesis and FPGA-accelerated emulation for rapid prototyping cycle accurate models of manycore node designs. SST Macro System Simulation: Enables system-scale simulation through capture of application communication traces and simulation of large-scale interconnects. SST Micro Software Simulators: Software simulation for node-level simulation CoDEx: CoDesign For Exascale ASCR-funded Simulation Infrastructure Project CAL: (Sandia/LBL) Computer Architecture Laboratory SST: Structure Simulation Toolkit NNSA-funded Simulation Tools (ASC Program)

Fidelity vs. Scope for Architectural Simulation Methods

ROSE CompilerFull Program Understanding through Deep Source-Code Analysis

ExaSAT: Exascale Static Analysis ToolCompiler-Automated Performance Model Extraction • Can automatically predict performance for many input codes and software optimizations • Predict performance under different architectural scenarios • Much faster than hardware simulation and manual modeling Performance Prediction Spreadsheet Machine Parameters Performance Model Combustion Codes Compiler Analysis <XML> Dependency Graph Optimization User Parameters

SST/macro: Coarse-Grained Simulation An application code with minor modifications SST/Macro Impl. of interfaces (MPI), which simulate execution and communication

SST/micro: Cycle-Accurate Framework • Has a general simulation framework for integrating models • Simulation backend is parallel • Plenty of people involved

Some Models Currently Integrated Gem5 is a well-known architectural simulator with models for processors, caches, busses, and network components. MacSimprovides a model of GPU/CPU cores or geterogenous computing nodes, which can be driven from x86 or PTX (CUDA) traces. IRISprovides a pipelined, cycle- accurate router model capable of modeling a variety of Network-on-Chip (NoC) and inter-node interconnection architectures. PhoenixSimmodels photonic networks.

Leveraging Embedded Design AutomationFor Design Space Exploration This stuff is essential!

Embedded Design Automation(Using FPGA emulation to do rapid prototyping) RAMP FPGA-accelerated Emulation of ASIC Or “tape out” To FPGA

Data Movement Dominates (Sandia, Micron, Columbia, LBL)Understand the Potential of Intelligent, Stacked DRAM Technology • Data movement are projected to account for over 75% of power budget for an exascale platform • Work to reduce that via • Optical interconnect(s) • 3D stacking (logic + memory + optics) • New memory protocols Research Questions • What is the performance potential of stacked memory (power & speed) • How much intelligence to put into logic layer • Atomics, gather/scatter, checksums, full-processor-in-memory • What is the memory consistency model for intelligent DRAM • How to program it if we put embed more intelligence into DRAM

The Cost of Moving Data

Locality Management is KeyWhat are the best combination of software and hardware mechanisms to maximize data movement efficiency Vertical Locality Management Horizontal Locality Management Sun Microsystems Temporal Topological

Why Study Chip Stacking (TSVs)?Energy = (V 2 ∗ C) ∗ Overhead + Ecomm DRAM Cells Efficient TSVs Reduce Costs TSVs orders of magnitude less energy –250 fJ/bit for reading DRAM –5 fJ/bit for TSV –250 fJ/bit for mem. controller –~0.5 pJ/bit (compared to 30pJ for conventional DIMM) –Don’t have to access more data than needed • Enables....–Lower Capacitance: Narrower –Lower Overhead: Smarter –In-Memory computation • Requires –...changes to how we view the machine & the memory • DRAM cells require < 1 pJ to access • Current DRAM architectures are not power efficient • Long distances ➔ high power • We pay for more than we get at every level • Cache: throw away 75-80% • DRAM Row: Charge 1024B for each 64B access • DIMM: Charge 8-9 chips/access • ~800 pJ/byte total • DRAM design driven by packaging constraints • ~50% of DRAM chip cost is packaging, mainly in pins • DIMMs use multiple chips with a few data pins to achieve high BW

Why Photonics? Photonics changes the rules for Bandwidth-per-Watt. ELECTRONICS: • Buffer, receive, and re-transmitat every router. • Space Parallelism:Each bus lane routed independently (P  NLANES). • Off-chip BW requires much more power than on-chip BW. Photonics: • Modulate/receive data stream once per communication event. • Wavelength Parallelism:Broadband switch routes entire multi-wavelength stream. • Off-chip BW ≈ on-chip BW for nearly same power. RX RX RX RX RX TX RX TX TX TX TX TX

Why Optically-Connected Memory? Traditional Memory Optically-Connected Memory HBDRAM HBDRAM HBDRAM HBDRAM CPU CPU HBDRAM HBDRAM HBDRAM HBDRAM Electronic Bus Optical Link • Large Pin-out • Complex wiring • Low bandwidth density • Distance constrained by electrical limitations • High power dissipation • All-optical link, no electronic bus to drive • Bit-rate transparent link • High bandwidth density, less pins • Distance immunity at computer scale • Low power dissipation Will not scale to meet power and bandwidth requirements of future high-performance computing systems Enables scaling of high-performance computing through increased memory capacity and bandwidth

Mixed Model Simulationcycle accurate and energy-accurate models MPI Traces (DUMPI) kernels skeleton app (C, C++, Fortran) Processor Model (SST/micro & Tensilica) SST/macro Workload Translation Checkpoint/restart SystemC (C++) NoC Model (PhoenixSim) Fault Injection Address Translation Memory Model (DRAMSim2, FLASHsim, NVRAM)

Simulator Infrastructure: Interconnectscycle accurate and energy-accurate models Developed by Sandia Collaborators CoDEx project

Simulator Infrastructure: Memorycycle accurate and energy-accurate models Validated against Micron DRAM HMC model coming this summer

Simulator Infrastructurecycle accurate and energy-accurate models Rewrote Columbia PhoenixSim summer 2011 Orion-2 energy model Validated against Cornell test parts

Simulator Infrastructurecycle accurate and energy-accurate models Full Gate-level RTL model of processor Well characterized energy model

Arun Rodrigues, Scott Hemmert , Dave Resnick : Sandia National Lab (ABQ) Keren Bergman: Columbia University Bruce

Arun Rodrigues, Scott Hemmert , Dave Resnick : Sandia National Lab (ABQ) Keren Bergman: Columbia University Bruce

Presentation Transcript

Jacob Kounin: Instructional Management

Headspace BOD (HBOD): A Rapid and Practical Alternative to the Conventional BOD Test

Jacob Kounin: Instructional Management

How to measure discount rates?

The Nanoantenna Prospect for Harvesting Energy

The Jacob Narrative

Internet Addiction

A tal e of two networks – network neutrality and other topics

University of Maryland Extension

Nicole M. Fortin Department of Economics and CIFAR University of British Columbia March 2009

Polymers/Organic Materials Aging Overview

Halliday/Resnick/Walker Fundamentals of Physics 8 th edition

P. Saddayappan 2 Jarek Nieplocha 1 , Bruce Palmer 1 , Manojkumar Krishnan 1 , Vinod Tipparaju 1

Process Equipment Inspection and Testing

Halliday/Resnick/Walker Fundamentals of Physics 8 th edition

Richard P. Barth School of Social Work University of Maryland

presented by Wayne Einfeld and Gary Brown Sandia National Laboratories

Bruce Damer, bruce@damer

Economic Returns from the Biosphere

Child Passenger Safety: Update on Maryland CPS Programs