
The Future of Many Core Computing: A tale of two processors



Presentation Transcript


  1. The Future of Many Core Computing: A tale of two processors IT WAS THE BEST of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to Heaven, we were all going direct the other way—in short, the period was so far like the present period, that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only. Tim Mattson Intel Labs

  2. Disclosure The views expressed in this talk are those of the speaker and not his employer. I am in a research group and know nothing about Intel products. So anything I say about them is highly suspect. This was a team effort, but if I say anything really stupid, it’s all my fault … don’t blame my collaborators.

  3. A common view of many-core chips An Intel Exec’s slide from IDF’2006

  4. Challenging the sacred cows: is that the right choice? Many expert programmers do not fully understand the relaxed-consistency memory models required to make cache-coherent architectures work. The programming models proven to scale non-trivial applications to hundreds or thousands of cores are all based on distributed memory. And coherence incurs additional architectural overhead … it is fundamentally unscalable. (Slide diagram: streamlined IA cores optimized for multithreading, each with a local cache plus a shared cache; the design assumes a cache-coherent shared address space.)

  5. Isn't shared-memory programming easier? Not necessarily. (Slide shows two effort-vs-time curves.) Message passing: extra work up front, but easier optimization and debugging mean less overall time to solution. Multi-threading: the initial parallelization can be quite easy, but difficult debugging and optimization mean the overall project takes longer. Proving that a shared-address-space program using semaphores is race free is an NP-complete problem*. *P. N. Klein, H. Lu, and R. H. B. Netzer, "Detecting Race Conditions in Parallel Programs that Use Semaphores," Algorithmica, vol. 35, pp. 321–345, 2003.

  6. The many-core design challenge. Scalable architecture: how should we connect the cores so we can scale as far as we need (O(100s to 1000) should be enough)? Software: can "general purpose programmers" write software that takes advantage of the cores? Will ISVs actually write scalable software? Manufacturability: validation costs grow steeply as the number of transistors grows; can we use tiled architectures to address this problem? Validating a tile and the connections between tiles drops validation costs from K·O(N) to K·O(√N) (warning: K can be very large). Intel's "TeraScale" processor research program is addressing these questions with a series of test chips … two so far: the 80-core research processor and the 48-core SCC processor.

  7. Agenda. The 80-core Research Processor: max FLOPS/watt in a tiled architecture. The 48-core SCC processor: scalable IA cores for software/platform research. Software in a many-core world.

  8. Agenda. The 80-core Research Processor: max FLOPS/watt in a tiled architecture. The 48-core SCC processor: scalable IA cores for software/platform research. Software in a many-core world.

  9. Intel's 80-core terascale processor: die photo and chip details. Basic statistics: 65 nm CMOS process; 100 million transistors in 275 mm²; 8x10 tiles, 3 mm² per tile; mesosynchronous clock; 1.6 SP TFLOPS @ 5 GHz and 1.2 V; 320 GB/s bisection bandwidth; variable voltage and multiple sleep states for explicit power management.

  10. The 80-core processor tile. (Tile diagram: 2 KB data memory, 3 KB instruction memory, a compute core with two floating-point engines, and a 5-port router.) All memory is "on tile" … 256 instructions, 512 floats, 32 registers. One-sided, anonymous message passing into instruction or data memory. Two FP units … 4 flops/cycle/tile, 2 loads per cycle. No divide, no integer operations, 1D array indices only, no nested loops. This is an architecture concept that may or may not be reflected in future products from Intel Corp.

  11. Programming results. (Chart: actual vs. theoretical single-precision TFLOPS at 4.27 GHz for the Stencil, SGEMM, Spreadsheet, and 2D FFT kernels, against the 1.37 TFLOPS peak.) The theoretical numbers come from operation/communication counts and from rate-limiting bandwidths. Measurements at 1.07 V, 4.27 GHz, 80 °C.
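As a rough sanity check (my arithmetic, using only figures already on these slides): the peak follows directly from the tile description above, 80 tiles × 4 flops/cycle/tile × 4.27 GHz ≈ 1.37 single-precision TFLOPS, and the same product at the 5 GHz operating point gives the 1.6 TFLOPS quoted earlier.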

  12. Why this is so exciting! First TeraScale* computer: 1997. First TeraScale* chip: 2007, ten years later. Intel's ASCI Option Red supercomputer: 9000 CPUs, one megawatt of electricity, 1600 square feet of floor space, double-precision TFLOPS running MP-Linpack. Intel's 80-core TeraScale chip: 1 CPU, 97 watts, 275 mm², single-precision TFLOPS running stencil. Source: Intel.

  13. Lessons. On-die memory is great: 2-cycle latency compared to ~100 ns for DRAM. Minimize message-passing overhead: routers wrote directly into memory without interrupting computation, i.e. any core could write directly into the memory of any other core, which led to extremely small communication latency, on the order of 2 cycles. Programmers can assist in keeping power low if sleep/wake instructions are exposed and if the switching latency is low (~a couple of cycles). Application programmers should help design chips: this chip was presented to us as a completed package, and small changes to the instruction set could have had a large impact on its programmability. A simple computed-jump instruction would have allowed us to add nested loops, and a second offset parameter would have allowed us to program general 2D array computations.

  14. Agenda. The 80-core Research Processor: max FLOPS/watt in a tiled architecture. The 48-core SCC processor: scalable IA cores for software/platform research. Software in a many-core world.

  15. SCC full chip: 24 tiles in a 6x4 mesh with 2 cores per tile (48 cores total). (Die plot: 26.5 mm × 21.4 mm die with the tile array, four DDR3 memory controllers (MC), PLL + JTAG, the VRC, and the system interface + I/O.)

  16. Hardware view of SCC. 48 cores in a 6x4 mesh with 2 cores per tile; 45 nm, 1.3 B transistors, 25 to 125 W; 16 to 64 GB total main memory using 4 DDR3 MCs. (Tile diagram: two P54C cores, each with 16 KB of L1 and a 256 KB L2, cache controllers, a mesh interface unit, a message buffer, and a traffic generator; the P54C FSB connects through the cache controllers and mesh interface to the router, and a bus to PCI provides system I/O.) Tile area: ~17 mm²; SCC die area: ~567 mm². R = router, MC = memory controller, P54C = second-generation Pentium core, CC = cache controller.

  17. Programmer's view of SCC. 48 x86 cores with the familiar x86 memory model for each core's private DRAM, and three memory spaces with fast message passing between the cores: private off-chip DRAM per core, shared off-chip DRAM (variable size), and the shared on-chip Message Passing Buffer (8 KB per core). Shared test-and-set registers are available for synchronization. (Diagram: CPU_0 … CPU_47, each with its L1$, L2$, private DRAM, and test-and-set (t&s) register, all able to reach the shared MPB and shared DRAM.)
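To make the role of the test-and-set registers concrete, here is a minimal sketch of the kind of spinlock they enable. It is not the SCC register interface: the names scc_lock/scc_unlock are invented, and a C11 atomic_flag stands in for the hardware register so the sketch compiles anywhere.

```c
#include <stdatomic.h>

/* Hypothetical illustration: the kind of lock a test-and-set register
 * enables.  On SCC the "register" is a hardware location; here an
 * atomic_flag stands in for it so the sketch is runnable anywhere.     */
static atomic_flag tns_register = ATOMIC_FLAG_INIT;

static void scc_lock(void) {
    /* test-and-set: spin until we atomically flip the flag from 0 to 1 */
    while (atomic_flag_test_and_set_explicit(&tns_register, memory_order_acquire))
        ;  /* busy-wait */
}

static void scc_unlock(void) {
    atomic_flag_clear_explicit(&tns_register, memory_order_release);
}

int main(void) {
    scc_lock();
    /* critical section: on SCC this would guard a shared update */
    scc_unlock();
    return 0;
}
```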

  18. RCCE: a message-passing library for SCC. Treat the Message Passing Buffer (MPB) as 48 smaller buffers, one per core. Symmetric name space: memory is allocated as a collective operation, so each core gets a variable with the given name at a fixed offset from the beginning of its own MPB, and flags are allocated and used to coordinate the memory operations. For example, A = (double *) RCCE_malloc(size) is called on all cores, so afterwards any core can put/get A at any Core_ID without error-prone explicit offsets.
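The symmetric name space is easiest to see in code. The toy below is not RCCE (every name in it is invented for illustration); it simply models the MPB as one small buffer per core to show why a collective allocation lets any core address "A on core k" with no per-core offset bookkeeping.

```c
#include <stdio.h>
#include <stddef.h>

#define NCORES    48
#define MPB_BYTES 8192              /* 8 KB of MPB per core, as on SCC */

static char   mpb[NCORES][MPB_BYTES];   /* toy stand-in for the on-chip MPB  */
static size_t next_offset = 0;          /* identical on all cores, because the
                                           allocation is a collective call    */

/* toy_collective_malloc: every core calls this with the same size, in the
 * same order, so the returned offset is the same everywhere.  A core can
 * then reach "A on core k" as &mpb[k][offset] with no extra bookkeeping.  */
static size_t toy_collective_malloc(size_t bytes) {
    size_t off = next_offset;
    next_offset += bytes;           /* the real library also aligns allocations */
    return off;
}

int main(void) {
    size_t A = toy_collective_malloc(32 * sizeof(double));
    printf("variable A lives at offset %zu in every core's MPB\n", A);
    printf("core 5's copy of A starts at %p\n", (void *)&mpb[5][A]);
    return 0;
}
```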

  19. How does RCCE work? The foundation of RCCE is a one-sided put/get interface. Symmetric name space: allocate memory as a collective and place a variable with a given name into each core's MPB; then Put(A, core) and Get(A, core) move data one-sidedly, and flags are used to make the puts and gets "safe." (Diagram: one core puts A into core 0's MPB and another core gets it from there; each core has its own L1$, L2$, private DRAM, and t&s register.)
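Continuing the toy model above (again with invented names, not RCCE's actual API, and re-declared so it stands alone): a put becomes a copy into the target core's buffer slot plus a flag write, and the matching get spins on the flag before copying the data out, which is what "using flags to make the puts and gets safe" means in practice.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <string.h>
#include <stdio.h>

#define NCORES    48
#define MPB_BYTES 8192

static char        mpb[NCORES][MPB_BYTES];
static atomic_bool ready[NCORES];          /* toy stand-in for RCCE's flags */

/* One-sided put: copy 'bytes' from local memory into core 'dest's MPB at
 * 'offset', then raise the flag so the consumer knows the data is there.  */
static void toy_put(int dest, size_t offset, const void *src, size_t bytes) {
    memcpy(&mpb[dest][offset], src, bytes);
    atomic_store_explicit(&ready[dest], true, memory_order_release);
}

/* Matching get: wait for the flag, copy the data out, lower the flag.     */
static void toy_get(int src_core, size_t offset, void *dst, size_t bytes) {
    while (!atomic_load_explicit(&ready[src_core], memory_order_acquire))
        ;                                   /* spin: "make the put safe"   */
    memcpy(dst, &mpb[src_core][offset], bytes);
    atomic_store_explicit(&ready[src_core], false, memory_order_relaxed);
}

int main(void) {
    double a = 3.14, b = 0.0;
    toy_put(0, 0, &a, sizeof a);    /* on SCC this runs on the producer core */
    toy_get(0, 0, &b, sizeof b);    /* ... and this on the consumer core     */
    printf("received %f\n", b);
    return 0;
}
```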

  20. NAS Parallel Benchmarks on SCC. 1. BT, multipartition decomposition: each core owns multiple blocks (3 in this case); update all blocks in a plane of 3x3 blocks, send data to the neighbor blocks in the next plane, then update the next plane of 3x3 blocks. 2. LU, pencil decomposition: define a 2D pipeline process, repeated for the x-, y-, and z-sweeps: await data (from the bottom and left), compute the new tile, send data (to the top and right), as sketched below. (Diagram: the pipeline wavefront sweeping diagonally across the grid of UEs.) Third party names are the property of their owners.
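Here is a schematic of the LU 2D pipeline just described. It is an illustration only, not the NPB source: the helper names, the grid constants, and the stub bodies are all invented, and on SCC the stubs would be message-passing receives, local relaxations, and sends.

```c
#include <stdio.h>

#define NROWS 4
#define NCOLS 4

/* Stubs standing in for boundary exchange and local work on one tile.   */
static void recv_boundary_from(int r, int c, int k) { (void)r; (void)c; (void)k; }
static void send_boundary_to  (int r, int c, int k) { (void)r; (void)c; (void)k; }
static void compute_tile      (int k)               { printf("tile %d\n", k); }

/* One sweep: a tile can only be computed after the bottom and left
 * neighbours have delivered their boundary data for that tile, so the
 * work ripples through the UE grid as a diagonal wavefront.             */
static void pipeline_sweep(int my_row, int my_col, int ntiles) {
    for (int k = 0; k < ntiles; k++) {
        if (my_row > 0) recv_boundary_from(my_row - 1, my_col, k); /* bottom */
        if (my_col > 0) recv_boundary_from(my_row, my_col - 1, k); /* left   */

        compute_tile(k);

        if (my_row < NROWS - 1) send_boundary_to(my_row + 1, my_col, k); /* top   */
        if (my_col < NCOLS - 1) send_boundary_to(my_row, my_col + 1, k); /* right */
    }
}

int main(void) { pipeline_sweep(1, 2, 3); return 0; }
```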

  21. LU/BT NAS Parallel Benchmarks on SCC. Problem size: Class A, 64 x 64 x 64 grid*. Using latency-optimized, whole-cache-line flags. *These are not official NAS Parallel Benchmark results. SCC processor: 500 MHz cores, 1 GHz routers, 25 MHz system interface, and DDR3 memory at 800 MHz. Third party names are the property of their owners. Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, reference <http://www.intel.com/performance> or call (U.S.) 1-800-628-8686 or 1-916-356-3104.

  22. Power and memory-controller domains. Power ~ F·V². Power control domains (RPC): 7 voltage domains, i.e. six 4-tile blocks plus one for the on-die network, and one clock-divider register per tile (i.e. 24 frequency domains). There is a single RPC register, so only one voltage request can be processed at a time; other requestors block. (Diagram: the 6x4 tile mesh with its routers and memory controllers, overlaid with the voltage, frequency, and memory-controller domains.)
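For intuition (a back-of-the-envelope scaling estimate that ignores leakage, not a measured SCC number): since dynamic power scales roughly as F·V², running a 4-tile voltage domain at half the frequency and half the voltage cuts its dynamic power to about (1/2)·(1/2)² = 1/8 of the original, which is why exposing voltage and frequency control to software is worth the trouble despite the coarse granularity.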

  23. Power breakdown (chart).

  24. Conclusions. RCCE software: it works, and RCCE's restrictions (the symmetric MPB memory model and blocking communications) have not been a fundamental obstacle. SCC architecture: the on-chip MPB was effective for scalable message-passing applications, and software-controlled power management works, but it is challenging to use because of (1) the 8-core granularity and (2) the high latencies for voltage changes. Future work: the interesting work is yet to come … we will make ~100 of these systems available to industry and academic partners for research on a scalable many-core OS and on user-friendly programming models that don't depend on coherent shared memory.

  25. Agenda. The 80-core Research Processor: max FLOPS/watt in a tiled architecture. The 48-core SCC processor: scalable IA cores for software/platform research. Software in a many-core world. Third party names are the property of their owners.

  26. Software and hardware. Hardware: "It was the best of times, it was the age of wisdom, it was the epoch of belief, it was the season of Light, it was the spring of hope, we had everything before us, we were all going direct to Heaven." Software: "it was the worst of times, it was the age of foolishness, it was the epoch of incredulity, it was the season of Darkness, it was the winter of despair, we had nothing before us, we were all going direct the other way."

  27. The many-core challenge. We have arrived at many-core solutions not because of the success of our parallel software but because of our failure to keep increasing CPU frequency. The result is a fundamental and dangerous mismatch: parallel hardware is ubiquitous, but parallel software is rare. Our challenge: make parallel software as routine as our parallel hardware.

  28. And remember … it's the platform we care about, not just "the chip." Programmers need to make the best use of all the available resources from within a single program: one program that runs well (i.e. reasonably close to "hand-tuned" performance) on a heterogeneous mixture of processors. A modern platform has CPU(s), GPU(s), DSP processors, … other? (Diagram: two CPUs, a GMCH with a GPU, DRAM, and an ICH. GMCH = graphics memory control hub, ICH = input/output control hub.)

  29. Solution: Find A Good parallel programming model, right? ABCPL ACE ACT++ Active messages Adl Adsmith ADDAP AFAPI ALWAN AM AMDC AppLeS Amoeba ARTS Athapascan-0b Aurora Automap bb_threads Blaze BSP BlockComm C*. "C* in C C** CarlOS Cashmere C4 CC++ Chu Charlotte Charm Charm++ Cid Cilk CM-Fortran Converse Code COOL CORRELATE CPS CRL CSP Cthreads CUMULVS DAGGER DAPPLE Data Parallel C DC++ DCE++ DDD DICE. DIPC DOLIB DOME DOSMOS. DRL DSM-Threads Ease . ECO Eiffel Eilean Emerald EPL Excalibur Express Falcon Filaments FM FLASH The FORCE Fork Fortran-M FX GA GAMMA Glenda GLU GUARD HAsL. Haskell HPC++ JAVAR. HORUS HPC IMPACT ISIS. JAVAR JADE Java RMI javaPG JavaSpace JIDL Joyce Khoros Karma KOAN/Fortran-S LAM Lilac Linda JADA WWWinda ISETL-Linda ParLin Eilean P4-Linda POSYBL Objective-Linda LiPS Locust Lparx Lucid Maisie Manifold Mentat Legion Meta Chaos Midway Millipede CparPar Mirage MpC MOSIX Modula-P Modula-2* Multipol MPI MPC++ Munin Nano-Threads NESL NetClasses++ Nexus Nimrod NOW Objective Linda Occam Omega OpenMP Orca OOF90 P++ P3L Pablo PADE PADRE Panda Papers AFAPI. Para++ Paradigm Parafrase2 Paralation Parallel-C++ Parallaxis ParC ParLib++ ParLin Parmacs Parti pC PCN PCP: PH PEACE PCU PET PENNY Phosphorus POET. Polaris POOMA POOL-T PRESTO P-RIO Prospero Proteus QPC++ PVM PSI PSDM Quake Quark Quick Threads Sage++ SCANDAL SAM pC++ SCHEDULE SciTL SDDA. SHMEM SIMPLE Sina SISAL. distributed smalltalk SMI. SONiC Split-C. SR Sthreads Strand. SUIF. Synergy Telegrphos SuperPascal TCGMSG. Threads.h++. TreadMarks TRAPPER uC++ UNITY UC V ViC* Visifold V-NUS VPE Win32 threads WinPar XENOOPS XPC Zounds ZPL We learned more about creating programming models than how to use them. Please save us from ourselves … demand standards (or open source)! Models from the golden age of parallel programming (~95) Third party names are the property of their owners.

  30. How to program the heterogeneous platform? Let history be our guide … consider the origins of OpenMP. The hardware vendors DEC, HP, SGI, IBM, Cray, and Intel (some of whom had merged and needed commonality across products) wrote a rough-draft straw-man SMP API, and other vendors were invited to join. KAI, an ISV, needed a larger market. ASCI was tired of recoding for SMPs and forced the vendors to standardize. The result arrived in 1997. Third party names are the property of their owners.

  31. OpenCL: can history repeat itself? The Khronos Compute group formed with Ericsson, Nokia, AMD and ATI (merged, needed commonality across products), IBM, Sony, Blizzard, Freescale, TI, and many more. Nvidia, the GPU vendor, wants to steal market share from the CPU; Intel, the CPU vendor, wants to steal market share from the GPU. Together they wrote a rough-draft straw-man API. Apple, tired of recoding for many-core CPUs and GPUs, pushed the vendors to standardize: as ASCI did for OpenMP, Apple is doing for GPU/CPU with OpenCL. December 2008. Third party names are the property of their owners.

  32. Conclusion. HW/SW co-design is the key to a successful transition to a many-core future. HW is in good shape … SW is in a tough spot. If you (the users) do not DEMAND good standards, our many-core future will be uncertain. (Photos: "our many-core future" and "a noble SW professional": Tim Mattson getting clobbered in Ilwaco, Dec 2007.)

  33. The software team: Tim Mattson and Rob van der Wijngaart (Intel); Michael Frumkin (then at Intel, now at Google). 80-core Research Processor teams: implementation by the Circuit Research Lab Advanced Prototyping team (Hillsboro, OR and Bangalore, India); PLL design by Logic Technology Development (Hillsboro, OR); package design by Assembly Technology Development (Chandler, AZ). A special thanks to our "optimizing compiler" … Yatin Hoskote, Jason Howard, and Saurabh Dighe of Intel's Microprocessor Technology Laboratory.

  34. SCC SW Teams • SCC Application software: • SCC System software: • And the HW-team that worked closely with the SW group: Jason Howard, Yatin Hoskote, Sriram Vangal, Nitin Borkar, Greg Ruhl
