
Working Group 1 Enabling Technologies Chair: Sheila Vaidya Vice Chair: Stu Feldman





  1. Working Group 1: Enabling Technologies • Chair: Sheila Vaidya • Vice Chair: Stu Feldman

  2. WG 1 – Enabling Technologies: Charter • Charter • Establish the basic technologies that may provide the foundation for important advances in HEC capability, and determine the critical tasks required before the end of this decade to realize their potential. Such technologies include hardware devices or components and the basic software approaches and components needed to realize advanced HEC capabilities. • Chair • Sheila Vaidya, Lawrence Livermore National Laboratory • Vice-Chair • Stuart Feldman, IBM

  3. WG 1 – Enabling Technologies: Guidelines and Questions • As input to HECRTF charge (1a), please provide information about key technologies that must be advanced to strengthen the foundation for developing new generations of HEC systems. Include discussion of promising novel hardware and software technologies with potential pay-off for HEC. • Provide brief technology maturity roadmaps and investments, with discussion of costs to develop these technologies. • Discuss technology dependencies and risks (for example, does the roadmap depend on technologies yet to be developed?) • Example topics: • semiconductors, memory (e.g., MRAM), networks (e.g., optical), packaging/cooling, novel logic devices (e.g., RSFQ), alternative computing models

  4. Working Group Participants • Kamal Abdali, NSF • Fernand Bedard, NSA • Herbert Bennett, NIST • Ivo Bolsens, XILINX • Jon Boyens, DOC • Bob Brodersen, UC Berkeley • Yolanda Comedy, IBM • Loring Craymer, JPL • Bronis R. de Supinski, LLNL • Martin Deneroff, SGI • Stuart Feldman, IBM (Vice-Chair) • Sue Fratkin, CASC • David Fuller, JNIC/Raytheon • Gary Hughes, NSA • Tyce McLarty, LLNL • Kevin Martin, Georgia Tech • Virginia Moore, NCO/ITRD • Ahmed Sameh, Purdue • John Spargo, Northrop Grumman • William Thigpen, NASA • Sheila Vaidya, LLNL (Chair) • Uzi Vishkin, U Maryland • Steven Wallach, Chiaro

  5. Timescales • 0-5 years • Suitable for deployment in high-end systems within next 5 years • Implies that the technology has been tried and tested in a systems context • Requires additional investment beyond commercial industry • 5-10 years • Suitable for deployment in high-end systems in 10 years • Implies that the component has been studied and feasibility shown • Requires system embodiment and growing investment • 10+ years • New research, not yet reduced to practice • Usefulness in systems not yet demonstrated

  6. Interconnects • Passive • 0-5: Optical networking; Serial optical interface • 5-10: High-density optical networking; Optical packet switching • 10+: Scalability (node density, bandwidth) • Active • 0-5: Electronic cross-bar switch; Network processing on board • 5-10: Data Vortex; Superconducting cross-bar switch

  7. Power/Thermal Management, Packaging • 0-5 • Optimization for power efficiency • 2.5-D packaging • Liquid cooling (e.g., spray) • 5-10 • 3-D packaging and cooling (microchannel) • Active temperature response • 10+ • Higher scalability concepts (improving OPS/W)

  8. Single Chip Architecture • 0-5 • Power-efficient designs • System on Chip; Processor-in-Memory • Reconfigurable circuits • Fine-grained irregular parallel computing • 5-10 • Adaptive architecture • Optical clock distribution • Asynchronous designs • 10+

  9. Memory • Main Memory • 0-5: Optimized memory hierarchy; Smart memory controllers • 5-10: 3-D memory (e.g., MRAM) • 10+: Nanoelectronics; Molecular electronics • Storage & I/O • 0-5: Object-based storage; Remote DMA; I/O controllers (MPI, etc.) • 5-10: Software for “cluster” storage; access to MRAM, holographic, MEMS, STM, E-beam • 10+: Spectral hole burning; Molecular electronics

  10. Device Technologies • 0-5 • Silicon on Insulator, SiGe, mixed III-V devices • Integrated electro-optic and high-speed electronics • 5-10 • Low-temperature CMOS • Superconducting - RSFQ • 10+ • Nanotechnologies • Spintronics

  11. Algorithms, SW-HW Tools • 0-5 • Compiler innovations for new architectures • Tools for robustness (e.g., delay, fault tolerance) • Low-overhead coordination mechanisms • Performance monitors • Sparse matrix innovations • 5-10 • Very High Level Language hardware support • Real-time performance monitoring and feedback • PRAM (Parallel Random Access Machine model) • 10+ • Ideas too numerous to select
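The "sparse matrix innovations" item above concerns kernels such as sparse matrix-vector multiply, whose irregular, indirect memory accesses are exactly what stresses the memory systems discussed throughout this report. A minimal CSR (compressed sparse row) sketch in illustrative Python, not a production HEC kernel:

```python
def csr_spmv(values, col_idx, row_ptr, x):
    """y = A @ x for a matrix stored in CSR form.

    values:  nonzero entries, row by row
    col_idx: column index of each nonzero
    row_ptr: row i's nonzeros live in values[row_ptr[i]:row_ptr[i+1]]
    """
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        s = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # The indirect load x[col_idx[k]] is the irregular access
            # pattern that makes SpMV memory-bandwidth bound on real hardware.
            s += values[k] * x[col_idx[k]]
        y[i] = s
    return y

# 3x3 example: [[4, 0, 1], [0, 2, 0], [3, 0, 5]]
A_vals = [4.0, 1.0, 2.0, 3.0, 5.0]
A_cols = [0, 2, 1, 0, 2]
A_rows = [0, 2, 3, 5]
y = csr_spmv(A_vals, A_cols, A_rows, [1.0, 1.0, 1.0])  # [5.0, 2.0, 8.0]
```

Smart memory controllers and low-overhead coordination mechanisms, as listed above, aim at exactly this class of gather-dominated loop.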

  12. Generic Needs • Sharing • NNIN-like consortia • National Nanotechnology Infrastructure Network • Custom hardware production • Intellectual Property policies (open?) • Tools for • Design for Testability • Physical design • Testing and Verification • Simulation • Programmability

  13. High-Impact Themes • 0-5 • Show value of HEC solutions to the commercial sector • Facilitate sharing and collaboration across HEC community • Technology • Power/thermal management • Optical networking • 5-10 • Long-term consistent investment in HEC • Technology • 3-D Packaging • New devices (MRAM, MEMS, RSFQ) • Power/thermal management & Optical – Ongoing • 10+ years • Continued research for HEC

  14. Working Group 2: COTS-Based Architecture • Chair: Walt Brooks • Vice Chair: Steve Reinhardt

  15. WG2 – Architecture: COTS-based Charter • Charter • Determine the capability roadmap of anticipated COTS-based HEC system architectures through the end of the decade. Identify those critical hardware and software technology and architecture developments required to both sustain continued growth and enhance user support. • Chair • Walt Brooks, NASA Ames Research Center • Vice-Chair • Steve Reinhardt, SGI

  16. WG2 – Architecture: COTS-based Guidelines and Questions • Identify opportunities and challenges for anticipated COTS-based HEC systems architectures through the decade and determine their capability roadmap. • Include alternative execution models, support mechanisms, local element and system structures, and system engineering factors to accelerate rate of sustained performance gain (time to solution), performance to cost, programmability, and robustness. • Identify those critical hardware and software technology and architecture developments required to both sustain continued growth and enhance user support. • Example topics: • microprocessors, memory, wire and optical networks, packaging, cooling, power distribution, reliability, maintenance, cost, size

  17. Working Group Participants • Walt Brooks (chair) • Rob Schreiber (L) • Yuefan Deng • Steven Gottlieb • Charles Lefurgy • John Ziebarth • Stephen Wheat • Guang R. Gao • Burton Smith • Steve Reinhardt (co-chair) • Bill Kramer (L) • Don Dossa • Dick Hildebrandt • Greg Lindahl • Tom McWilliams • Curt Janssen • Erik DeBenedictis

  18. Assumptions/Definitions • Definition of “COTS-based” • Using systems originally intended for enterprise or individual use • Building blocks: commodity processors, commodity memory, and commodity disks • Somebody else builds the hardware, and you have limited influence over it • Examples • IN: Red Storm, Blue Planet, Altix • OUT: X1, Origins, SX-6 • Givens • Massive disk storage (object stores) • Fast wires (SERDES-driven) • Heterogeneous systems (processors)

  19. Primary Technical Findings • Improve memory bandwidth • We have to be patient in the short term; for the next 2-3 years the die has been cast • Sustained memory bandwidth is not increasing fast enough • Judicious investment in the COTS vendors could affect 2008 systems • Improve the interconnects: “connecting to the interconnect” • Easier to influence than memory bandwidth • Connecting through I/O is too slow; we need to connect to the CPU at memory-equivalent speeds • One example is HyperTransport, which represents a memory-grade interconnect in terms of bandwidth and is a well-defined I/F; others are under development • Provide the ability for heterogeneous COTS-based systems • e.g., FPGA, ASIC, … in the fabric • FPGA allows tightly coupled research on emerging execution models and architectural ideas without going to a foundry • Must have the software to support programming ease for FPGA
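The case for a memory-grade attach over an I/O attach can be made concrete with the usual linear transfer-cost model, t(n) = latency + n / peak_bandwidth: higher latency and lower peak crush effective bandwidth for small transfers. A rough sketch; the link parameters below are invented for illustration, not measured figures for HyperTransport or any product:

```python
def effective_bandwidth(n_bytes, latency_s, peak_bw_bytes_per_s):
    """Effective bandwidth of one n-byte transfer under t = latency + n/peak."""
    t = latency_s + n_bytes / peak_bw_bytes_per_s
    return n_bytes / t

# Hypothetical links: a memory-grade attach vs. an I/O-bus attach.
mem_grade = dict(latency_s=100e-9, peak_bw_bytes_per_s=6.4e9)
io_attach = dict(latency_s=2e-6, peak_bw_bytes_per_s=1e9)

for n in (64, 4096, 1 << 20):
    bw_mem = effective_bandwidth(n, **mem_grade)
    bw_io = effective_bandwidth(n, **io_attach)
    # The gap is widest for small, latency-dominated transfers,
    # which dominate fine-grained parallel codes.
```

In this model a 64-byte transfer sees roughly 20x the effective bandwidth on the low-latency link, even though its peak is only 6.4x higher.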

  20. Technology Influence • [Chart relating technology layers (CPU/Chips, Board/Components, Interconnect, I/O, Nodes/Frames) to direct and indirect design lead times (1 year, 4-6 years, 10-15 years) and direct and indirect design costs ($0.2M, $5M, $10M, $50M, $5-100M, $300-1,000M)]

  21. Programmatic Approaches • Develop a Government-wide coordinated method for direct influence with the vendors to make “design” changes • Less influence with COTS mfrs, more with COTS-based vendors • Recognize that the commercial market is the primary driver for COTS • “Go” in early • Develop joint Government research objectives; must go to vendors with a short, focused list of HEC priorities • Where possible, find common interests with the industries that drive the commodity market • “Software”: we may have more influence • Fund long-term research • Academic research must have access to systems at scale in order to do relevant research • Strategy for moving university research into the market • Government must be an early adopter • Risk sharing with emerging systems

  22. Software Issues • Not clear that these are part of our charter, but we would like to be sure they are handled • Scaling “Linux” to 1000s of processors • Administered at full scale for capability computing • Scalable file systems • Need compiler work to keep pace • Managing open source • Coordinating release implementation • Open-source multi-vendor approach: O/S, languages, libraries, debuggers, … • Overhead of MPI is going to swamp the interconnect and hamper scaling • Need a lower-overhead approach to message passing

  23. Parallel Computing • Parallel computing is (now) the path to speed • People think the problem is solved, but it’s not • Need new benchmarks that expose the true performance of COTS • If the government is willing to invest early, even at the chip level, there is the potential to influence design in a way that makes scaling “commodity” systems easier • Parallel computers need to be much more general purpose than they are today • More useful, easier to use, and better balanced • Continued growth of computing may depend on it • To get significantly more performance, we must treat parallel computing as first class • COTS processors especially will be influenced only by a generally applicable approach

  24. Themes From White Papers • Broad Themes • Exploit commodity • One system doesn’t fit all applications; for a specific family of codes, commodity can be a good solution • Unique topology and algorithmic approaches allow exploitation of current technology • Novel uses of current technology (overlap with Panel 3) • RCM technology: FPGA faster, lower power with multiple units; hybrid FPGA-core is the traditional processor on chip with logic units; need H/W architects for RCM; apps suitable for RCM; RCMs are about ease of programming • Streaming technology utilizing commercial chips • Fine-grained multithreading • Supporting technology (overlap with Panel 1) • Self-managing, self-aware systems • MRAM, EUVL, micro-channel • Power-aware computing • High-end interconnect and scalable file systems • High-performance interconnect technology, optical and others, that can scale to large systems • Systems software that scales up gracefully to enormous processor counts with reliability, efficiency, and ease of use • There is a natural layering of technologies involved in a high-performance machine: the basic silicon; the cell boards and shared-memory nodes, the cluster interconnect, the racks, the cooling, the OS kernel; the added OS services, the runtime libraries, the compilers and languages, the application libraries.

  25. Relevant White Papers 18 of the 64/80 papers have some relevance to our topic • 6 • 10 • 12 • 16 • 17 • 31 • 33 • 39 • 45 • 46 • 47 • 50 • 65 • 68 • 72 • 75 • 80

  26. Working Group 3: Custom-Based Architectures • Chair: Peter Kogge • Vice Chair: Thomas Sterling

  27. WG3 – Architecture: Custom-based Charter • Charter • Identify opportunities and challenges for innovative HEC system architectures, including alternative execution models, support mechanisms, local element and system structures, and system engineering factors to accelerate rate of sustained performance gain (time to solution), performance to cost, programmability, and robustness. Establish a roadmap of advanced-concept alternative architectures likely to deliver dramatic improvements to user applications through the end of the decade. Specify those critical developments achievable through custom design necessary to realize their potential. • Chair • Peter Kogge, Notre Dame • Vice-Chair • Thomas Sterling, California Institute of Technology & Jet Propulsion Laboratory

  28. WG3 – Architecture: Custom-based Guidelines and Questions • Present driver requirements and opportunities for innovative architectures demanding custom design • Identify key research opportunities in advanced concepts for HEC architecture • Determine research and development challenges to promising HEC architecture strategies. Project a brief roadmap of potential developments and impact through the end of the decade. • Specify impact and requirements of future architectures on system software and programming environments. • Example topics: • System-on-a-chip (SOC), Processor-in-memory (PIM), streaming, vectors, multithreading, smart networks, execution models, efficiency factors, resource management, memory consistency, synchronization

  29. Working Group Participants • Duncan Buell, U. So. Carolina • George Cotter, NSA • William Dally, Stanford Un. • James Davenport, BNL • Jack Dennis, MIT • Mootaz Elnozahy, IBM • Bill Feiereisen, LANL • Michael Henesey, SRC Computers • David Fuller, JNIC • David Kahaner, ATIP • Peter Kogge, U. Notre Dame • Norm Kreisman, DOE • Grant Miller, NCO • Jose Munoz, NNSA • Steve Scott, Cray • Vason Srini, UC Berkeley • Thomas Sterling, Caltech/JPL • Gus Uht, U. RI • Keith Underwood, SNL • John Wawrzynek, UC Berkeley

  30. Charter (from Charge) • Identify opportunities & challenges for innovative HEC system architectures, including • alternative execution models, • support mechanisms, • local element and system structures, • and system engineering factors to accelerate • rate of sustained performance gain (time to solution), • performance to cost, • programmability, • and robustness. • Establish roadmap of advanced-concept alternative architectures likely to deliver dramatic improvements to user applications through the end of the decade. • Specify those critical developments achievable through custom design necessary to realize their potential.

  31. Original Guidelines and Questions • Present driver requirements and opportunities for innovative architectures demanding custom design • Identify key research opportunities in advanced concepts for HEC architecture • Determine research and development challenges to promising HEC architecture strategies. • Project brief roadmap of potential developments and impact through the end of the decade. • Specify impact and requirements of future architectures on system software and programming environments. • (new) What role should/do universities play in developments in this area

  32. Outline • What is Custom Architecture (CA) • Endgame Objectives, Benefits, & Challenges • Fundamental Opportunities Delivered by CA • Road Map • Summary Findings • Difficult fundamental challenges • Roles of Universities

  33. What Is Custom Architecture? • Major components designed explicitly and system balanced for support of scalable, highly parallel HEC systems • Exploits performance opportunities afforded by device technologies through innovative structures • Addresses sources of performance degradation (inefficiencies) through specialty hardware and software mechanisms • Enables higher HEC programming productivity through enhanced execution models • Should incorporate COTS components where useful without sacrifice of performance

  34. Endgame Objectives • Enable solution of • Problems we can’t solve now • And larger versions of ones we can solve now • Base economic model: provides 10 – 100X ops/Lifecycle $ AT SCALE • Vs inefficiencies of COTS • Significant reduction in real cost of programming • Focus on sustained performance, not peak

  35. Strategic Benefits • Promotes architecture diversity • Performance: ops & bandwidth over COTS • Peak: 10X – 100X through FPU proliferation • Memory bandwidth 10X-100X through network and signaling technology • Focus on sustainable • High Efficiency • Dynamic latency hiding • High system bandwidth and low latency • Low overhead • Enhanced Programmability • Reduced barriers to performance tuning • Enables use of programming models that simplify programming and eliminate sources of errors • Scalability • Exploits parallelism at all levels • Cost, size, and power • High compute density

  36. Challenges To Custom • Small market and limited opportunity to exploit economy of scale • Development lead time • Incompatibility with standard ISAs • Difficulty of porting legacy codes • Training of users in new execution models • Unproven in the field • Need to develop new software infrastructure • Less frequent technology refresh • Lack of vendor interest in leading edge small volumes

  37. Fundamental Technical Opportunities Enabled by CA • Enhanced Locality – Increasing Computation/Communication Demand • Exceptional global bandwidth • Architectures that enable utilization of global bandwidth • Execution models that enable compiler/programmer to use the above

  38. Enhanced Locality – Increasing Computation/Communication Demand • Mechanisms • Spatial computation via reconfigurable logic • Streams that capture physical locality by observing temporal locality • Vectors – scalability and locality microarchitecture enhancements • PIM – capture spatial locality via high-bandwidth local memory (low latency) • Deep and explicit register & memory hierarchies • With software management of hierarchies • Technologies • Chip stacking to increase local B/W
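Software-managed deep memory hierarchies, as listed above, are typically exploited by blocking (tiling): restructuring a loop nest so each operand is reused many times per trip to slow memory. A back-of-the-envelope traffic model for n x n matrix multiply with b x b tiles held in fast memory; this is a standard textbook result sketched for illustration, not an analysis from the slides:

```python
def matmul_traffic(n, b=None):
    """Approximate slow-memory words moved by an n x n matrix multiply.

    Unblocked: each of the ~n**3 multiply-adds re-fetches an operand
    from slow memory, so traffic is on the order of n**3.
    Blocked with b x b tiles resident in fast memory: each fetched
    tile word is reused about b times, cutting traffic to ~n**3 / b.
    """
    if b is None:
        return n ** 3        # naive traversal
    return n ** 3 // b       # tiled traversal

naive = matmul_traffic(1024)
tiled = matmul_traffic(1024, b=64)
# Tiling with 64x64 blocks cuts modeled memory traffic by 64x,
# trading bandwidth pressure for explicit hierarchy management.
```

The same compute, a factor of b less bandwidth demand: this is the quantitative payoff behind explicit hierarchies and chip stacking for local bandwidth.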

  39. Providing Exceptional Global Bandwidth • Mechanisms: • High-radix networks • Non-blocking, bufferless topologies • Hardware congestion control • Compiler-scheduled routing • Technologies: • High-speed signaling (system-oriented) • Optical, electrical, heterogeneous (e.g., VCSEL) • Optical switching & routing • High-bandwidth, high-density memory devices • Notes: • Routing & flow control are nearing optimal
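The high-radix mechanism trades wider switches for fewer hops: for N endpoints and switch radix r, an indirect network needs on the order of log base r of N stages, so raising the radix shrinks the hop count (and with it, latency) logarithmically. A sketch of that arithmetic for an idealized, fully provisioned indirect network; illustrative only:

```python
def stages(n_endpoints, radix):
    """Smallest s with radix**s >= n_endpoints: the number of switch
    stages a packet crosses in an idealized indirect network."""
    s = 0
    reach = 1
    while reach < n_endpoints:
        reach *= radix   # each stage multiplies the reachable endpoints
        s += 1
    return s

low = stages(4096, 4)     # radix-4 switches: 6 stages
high = stages(4096, 64)   # radix-64 switches: 2 stages
```

Higher per-pin signaling rates (the "high-speed signaling" technology above) are what make high radix affordable, since each switch port can be narrower for the same bandwidth.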

  40. Architectures that Enable Use of Global Bandwidth Note: This addresses providing the traffic stream to utilize the enhanced network • Stream and Vectors • Multi-threading (SMT) • Global shared memory (a communication overhead reducer) • Low overhead message passing • Augmenting microprocessors to enhance additional requests (T3E, Impulse) • Prefetch mechanisms

  41. Execution Models • Note: A good model should: • Expose parallelism to compiler & system s/w • Provide an explicit performance cost model for key operations • Not constrain ability to achieve high performance • Ease programming • Spatial direct-mapped hardware • Resource flow • Streams • Flat vs. distributed memory (UMA/NUMA vs. M.P.) • New memory semantics • CAF and UPC, a good first step • Low-overhead synchronization mechanisms • PIM-enabled: traveling threads, message-driven, active pages, ...
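The CAF and UPC bullet refers to partitioned global address space (PGAS) models: one global array whose elements have affinity to particular processors, accessed by one-sided get/put rather than matched sends and receives. A toy Python sketch of the addressing idea only; the class and method names are invented for illustration and are not the CAF or UPC API:

```python
class ToyPGASArray:
    """Block-distributed global array: global index -> (owner place, local slot)."""

    def __init__(self, n, n_places):
        self.block = (n + n_places - 1) // n_places
        # Each "place" (processor) owns one contiguous block of indices,
        # giving the explicit locality cost model PGAS languages expose.
        self.places = [[0] * self.block for _ in range(n_places)]

    def owner(self, i):
        return i // self.block            # which place holds global index i

    def put(self, i, value):              # one-sided remote write
        self.places[self.owner(i)][i % self.block] = value

    def get(self, i):                     # one-sided remote read
        return self.places[self.owner(i)][i % self.block]

a = ToyPGASArray(n=16, n_places=4)
a.put(13, 42)                             # index 13 lives on place 3
```

Because the owner of every index is a pure function of the index, the compiler and programmer can both see which accesses are local and which are remote, which is the "explicit performance cost model" the slide asks for.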

  42. Roadmap: When to Expect CA Deployment • 5 Years or less • Must have relatively mature support s/w (and/or “friendly users”) • 5-10 years • Still open research issues in tools & system s/w • Approaching 10 years if requires mind set change in applications programmers • 10-15 years: • After 2015 all that’s left in silicon is architecture

  43. Roadmap - 5 Year Period • Significant research prototype examples • Berkeley Emulation Engine: $0.4M/TF by 2004 on Immersed Boundary method codes • QCDOC: $1M/TF by 2004 • Merrimac Streaming: $40K/TF by 2006 • Note: several companies are developing custom architecture roadmaps

  44. Roadmap - 5 Years or Less: Technologies Ready for Insertion • High-bandwidth network technology can be inserted • No software changes • SMT: will be ubiquitous within 5 years • But will vendors emphasize single-thread performance in lieu of supporting increased parallelism? • Spatial direct-mapped approach

  45. Roadmap - 5 to 10 Years • All prior prototypes could be expanded to reach PF sustained at competitive recurring $ • Industry is targeting sustained Petaflops • If properly funded • Need to encourage transfer of research results • Virtually all of prior technology opportunities will be deployable • Drastic changes to programming will limit adoption

  46. Roadmap: 10-15 Years • Silicon scaling at sunset • Circuit, packaging, architecture, and software opportunities remain • Need to start looking now at architectures that mesh with end of silicon roadmap and non-silicon technologies • Continue exponential scaling of performance • Radically different timing/RAS considerations • Spin out: how to use faulty silicon

  47. Findings • Significant CA-driven opportunities for enhanced Performance/Programmability • 10-100X potential above COTS at the same time • Multiple, CA-driven innovations identified for near & medium term • Near term: multiple proof of concept • Medium term: deployment @ petaflops scale • Above potential will not materialize in current funding culture

  48. Findings (2) • No one side of the community can realize opportunities of future Custom Architecture: • Strong peer-peer partnering needed between industry, national labs, & academia • Restart pipeline of HEC & parallel-oriented grad students & faculty • Creativity in system S/W & programming environments must support, track, & reflect creativity in HEC architecture

  49. Findings (3) • Need to start now preparing for end of Moore’s Law and transition into new technologies • If done right, potential for significant trickle back to silicon

  50. Fundamentally Difficult Challenges: Technical • Newer applications for HEC • OS geared specifically to highly scaled systems • How to design HEC for upgradability • High latency, low bandwidth ratios of memory chips and systems • File systems • Reliability with unreliable components at large scale • Fundamentally parallel ISAs
