MS108 Computer System I

Lecture 15: Multicore • Prof. Xiaoyao Liang • 2014/5/16

Multi-Core Technology [Figure: timeline of chip organization. 2004: single core + cache; 2005: dual core (2 or more cores + cache); 2007: multi-core (4 or more cores + cache); 2010: many core (2X more cores + cache)]

Multiprocessing within a chip: Many-Core • Intel predicts 100’s of cores on a chip in 2015 [Figure: increasing number of hardware threads (scale 1, 10, 100) over the years 2003 to 2013, moving from HT through the multi-core era (scalar and parallel applications) to the many-core era (massively parallel applications)]

Instruction-level parallelism • Parallelism at the machine-instruction level • The processor can re-order, pipeline instructions, split them into microinstructions, do aggressive branch prediction, etc. • Instruction-level parallelism enabled rapid increases in processor speeds over the last 15 years

Why multi-core? • Difficult to make single-core clock frequencies even higher • Deeply pipelined circuits bring heat problems, clock problems, and efficiency (stall) problems • Doubling issue rates above today’s 3-6 instructions per clock, say to 6 to 12 instructions, is extremely difficult: the processor would have to issue 3 or 4 data memory accesses per cycle, rename and access more than 20 registers per cycle, and fetch 12 to 24 instructions per cycle • Many new applications are multithreaded • A general trend in computer architecture is to shift towards more parallelism through more processors or processor cores

What applications benefit from multi-core? • Database servers • Web servers (Web commerce) • Multimedia applications • Scientific applications, CAD/CAM • In general, applications with thread-level parallelism (as opposed to instruction-level parallelism) • Each thread can run on its own core

Thread-level parallelism (TLP) • This is parallelism on a coarser scale • Server can serve each client in a separate thread (Web server, database server) • A computer game can do AI, graphics, and sound in three separate threads • Single-core superscalar processors cannot fully exploit TLP • Multi-core architectures are the next step in processor evolution: explicitly exploiting TLP
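As a minimal sketch of the game example above, the three subsystems can each be handed to their own thread. The function names (`run_ai`, `render_graphics`, `mix_sound`) and their return strings are hypothetical placeholders, not part of the lecture. In standard CPython builds the global interpreter lock limits how far CPU-bound threads actually run in parallel, so this shows the structure of TLP rather than a measured speedup.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for three independent subsystems of a game.
def run_ai():
    return "ai done"

def render_graphics():
    return "graphics done"

def mix_sound():
    return "sound done"

def run_frame():
    """Run the three subsystems in three separate threads and collect results."""
    tasks = [run_ai, render_graphics, mix_sound]
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        futures = [pool.submit(task) for task in tasks]
        # result() blocks until the corresponding thread has finished.
        return [future.result() for future in futures]

results = run_frame()
print(results)
```

Each task has its own flow of control, which is what lets a multi-core processor place the three threads on different cores.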

Thread-Level Parallelism

Introduction • Thread-Level parallelism • Have multiple program counters • Uses MIMD model • Targeted for tightly-coupled shared-memory multiprocessors • For n processors, need n threads • Amount of computation assigned to each thread = grain size • Threads can be used for data-level parallelism, but the overheads may outweigh the benefit

How to exploit TLP? • Execute instructions from multiple threads on a single processor • Coarse-grain, fine-grain, SMT (Simultaneous Multi-Threading) • Execute multiple threads on multiple processors • “Anything that can be threaded today will map efficiently to multi-core”

SMT – Simultaneous Multi-Threading • A variation on multithreading that uses the resources of a multiple-issue, dynamically scheduled processor (superscalar) to exploit both program ILP and thread-level parallelism (TLP) • With register renaming and dynamic scheduling, multiple instructions from independent threads can be issued in one cycle without regard to dependencies among them • Need separate rename tables (ROBs) for each thread • Need the capability to commit from multiple threads (i.e., from multiple ROBs) in one cycle

TLP [Figure: issue-slot usage over time on a 4-issue superscalar processor under coarse-grain MT, fine-grain MT, and SMT, with threads A, B, C, and D]

Core 2 Duo Microarchitecture

Without SMT, only a single thread can run at any given time [Figure: single-core block diagram (Bus, BTB and I-TLB, Decoder, Trace Cache, uCode ROM, BTB, Rename/Alloc, Uop queues, Schedulers, Integer and Floating Point units, L1 D-Cache and D-TLB, L2 Cache and Control); Thread 1 (floating point) occupies the core]

Without SMT, only a single thread can run at any given time [Figure: single-core block diagram; Thread 2 (integer operation) occupies the core]

SMT processor: both threads can run concurrently [Figure: single-core block diagram; Thread 1 (floating point) and Thread 2 (integer operation) use the core at the same time]

But: Can’t simultaneously use the same functional unit • This scenario is impossible with SMT on a single core (assuming a single integer unit) [Figure: single-core block diagram; Thread 1 and Thread 2 both target the same unit, marked IMPOSSIBLE]

Multi-core: threads can run on separate cores [Figure: two copies of the core block diagram, each with its own L1 D-Cache, D-TLB, and L2 Cache and Control; Thread 1 and Thread 2 run on different cores]

Multi-core: threads can run on separate cores [Figure: two copies of the core block diagram; Thread 3 and Thread 4 run on different cores]

Combining Multi-core and SMT • Cores can be SMT-enabled (or not) • The different combinations: • Single-core, non-SMT: standard uniprocessor • Single-core, with SMT • Multi-core, non-SMT • Multi-core, with SMT • The number of SMT threads: 2, 4, or sometimes 8 simultaneous threads • Intel calls them “hyper-threads”
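The four combinations above differ only in how many threads the hardware can run at once, which is the number of cores times the number of SMT threads per core. The helper below is a small illustrative sketch; the name `hardware_threads` and the specific configurations are assumptions chosen for the example.

```python
def hardware_threads(cores, smt_ways=1):
    """Threads that can execute simultaneously: cores x SMT threads per core."""
    if cores < 1 or smt_ways < 1:
        raise ValueError("cores and smt_ways must be positive integers")
    return cores * smt_ways

# The four combinations from the slide, using 2 cores and 2-way SMT as examples.
configs = {
    "single-core, non-SMT": hardware_threads(1),
    "single-core, 2-way SMT": hardware_threads(1, 2),
    "dual-core, non-SMT": hardware_threads(2),
    "dual-core, 2-way SMT": hardware_threads(2, 2),
}

for name, count in configs.items():
    print(f"{name}: {count} simultaneous thread(s)")
```

The same product explains the thread counts quoted in the later case studies, for example 8 processors with 8-way multithreading giving 64 threads.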

SMT Dual-core: all four threads can run concurrently [Figure: two SMT-enabled cores side by side; Threads 1 to 4 run at the same time, two threads on each core]

High-Performance Computing

Processor Performance • We have looked at various ways of increasing single-processor performance (excluding VLSI techniques): • Pipelining • ILP • Superscalars • Out-of-order execution (Scoreboarding, Tomasulo) • VLIW • Cache (L1, L2, L3) • Interleaved memories • Compilers (Loop unrolling, branch prediction, etc.) • RAID • Etc. • However, quite often even the best microprocessors are not fast enough for certain applications!

Example: How far will ILP go? • Infinite resources and fetch bandwidth, perfect branch prediction and renaming

When Do We Need High Performance Computing? • Case 1: To do a time-consuming operation in less time • I am an aircraft engineer • I need to run a simulation to test the stability of the wings at high speed • I’d rather have the result in 5 minutes than in 5 days so that I can complete the aircraft final design sooner

When Do We Need High Performance Computing? • Case 2: To do a high number of operations per second • I am an engineer at Amazon.com • My Web server gets 10,000 hits per second • I’d like my Web server and my databases to handle 10,000 transactions per second so that customers do not experience bad delays • Amazon does “process” several GBytes of data per second

The need for High-Performance Computers: some examples • Automotive design: • Major automotive companies use large systems (500+ CPUs) for: • CAD-CAM, crash testing, structural integrity and aerodynamics. • Savings: approx. $1 billion per company per year. • Semiconductor industry: • Semiconductor firms use large systems (500+ CPUs) for • device electronics simulation and logic validation • Savings: approx. $1 billion per company per year. • Airlines: • System-wide logistics optimization systems on parallel systems. • Savings: approx. $100 million per airline per year.

Grand Challenges [Figure: storage requirements (10 MB to 1 TB) versus computational performance requirements (100 MFLOPS to 1 TFLOPS) for applications including 2D airfoil, oil reservoir modelling, 3D plasma modelling, chemical dynamics, 48-hour weather, 72-hour weather, pharmaceutical design, vehicle dynamics, and structural biology]

Weather Forecasting • Suppose the whole global atmosphere is divided into cells of size $1\,\text{km} \times 1\,\text{km} \times 1\,\text{km}$ to a height of 10 km (10 cells high): about $5 \times 10^{8}$ cells. • Suppose each cell calculation requires 200 floating point operations. In one time step, $10^{11}$ floating point operations are necessary. • To forecast the weather over 7 days using 1-minute intervals, a computer operating at 1 Gflops ($10^{9}$ floating point operations/s), similar to the Pentium 4, takes $10^{6}$ seconds, or over 10 days. • To perform the calculation in 5 minutes requires a computer operating at 3.4 Tflops ($3.4 \times 10^{12}$ floating point operations/s).
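The arithmetic behind this estimate can be checked in a few lines. This is a sketch using only the slide's own figures (cell count, operations per cell, 7 days at 1-minute steps); the variable names are chosen for the example.

```python
cells = 5e8            # grid cells covering the atmosphere (slide's estimate)
flops_per_cell = 200   # floating point operations per cell per time step
steps = 7 * 24 * 60    # 7 days at 1-minute intervals = 10,080 time steps

flops_per_step = cells * flops_per_cell   # work in one time step
total_flops = flops_per_step * steps      # work for the whole forecast

seconds_at_1gflops = total_flops / 1e9    # run time on a 1 Gflops machine
days_at_1gflops = seconds_at_1gflops / 86400
required_rate = total_flops / (5 * 60)    # rate needed to finish in 5 minutes

print(f"per time step: {flops_per_step:.2e} flop")
print(f"whole forecast: {total_flops:.2e} flop")
print(f"at 1 Gflops: {seconds_at_1gflops:.2e} s = {days_at_1gflops:.1f} days")
print(f"for a 5-minute run: {required_rate / 1e12:.2f} Tflops")
```

The last line gives roughly 3.4 Tflops, matching the slide once the 10,080 time steps are rounded to about $10^{4}$.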

Multiprocessing • Multiprocessing (Parallel Processing): Concurrent execution of tasks (programs) using multiple computing, memory and interconnection resources • Use multiple resources to solve problems faster • Provides an alternative to a faster clock for performance • Assuming a doubling of effective processor performance every 2 years, a 1024-processor system (assuming linear performance gain) delivers today the performance that a single-processor system would take 20 years to reach • Using multiple processors to solve a single problem • Divide the problem into many small pieces • Distribute these small problems to be solved by multiple processors simultaneously
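The 20-year figure follows from counting doublings: 1024 is $2^{10}$, so a single processor needs 10 doublings at 2 years each. A minimal sketch of that reasoning, under the slide's two assumptions (linear parallel speedup and a fixed doubling period), with a hypothetical helper name:

```python
import math

def years_to_match(processors, doubling_period_years=2.0):
    """Years until one processor matches `processors` processors working together.

    Assumes linear speedup from the processors and a steady doubling of
    single-processor performance every `doubling_period_years` years.
    """
    doublings_needed = math.log2(processors)
    return doublings_needed * doubling_period_years

print(years_to_match(1024))
```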

Multiprocessing • For the last 30+ years multiprocessing has been seen as the best way to produce orders of magnitude performance gains. • Double the number of processors, get (theoretically) double the performance (for less than 2 times the cost). • It turns out that the difficulty of developing and delivering software for multiprocessing systems is an impediment to wide adoption.

Amdahl’s Law • A parallel program has a sequential part (e.g., I/O) and a parallel part; let $\alpha$ be the sequential fraction • $T_1 = \alpha T_1 + (1-\alpha) T_1$ • $T_p = \alpha T_1 + (1-\alpha) T_1 / p$ • Therefore: $\mathrm{Speedup}(p) = \dfrac{1}{\alpha + (1-\alpha)/p} = \dfrac{p}{\alpha p + 1 - \alpha} \le \dfrac{1}{\alpha}$ • Example: if a code is 10% sequential (i.e., $\alpha = 0.10$), the speedup will always be lower than $1 + 90/10 = 10$, no matter how many processors are used!
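To see the bound in the example numerically, the speedup formula can be evaluated for a growing number of processors. This is a small sketch; `alpha` stands for the sequential fraction and `p` for the processor count, and the function name is chosen for the example.

```python
def amdahl_speedup(alpha, p):
    """Amdahl speedup on p processors when a fraction alpha of the work is sequential."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if p < 1:
        raise ValueError("p must be at least 1")
    return 1.0 / (alpha + (1.0 - alpha) / p)

alpha = 0.10  # 10% sequential, as in the slide's example
for p in (1, 10, 100, 1000, 10**6):
    print(f"p = {p:>7}: speedup = {amdahl_speedup(alpha, p):.3f}")

# The speedup approaches, but never reaches, 1 / alpha.
print(f"upper bound: {1.0 / alpha:.1f}")
```

Even with a million processors the speedup stays just under 10, because the sequential 10% is never shortened.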

Performance Potential Using Multiple Processors • Amdahl's Law is pessimistic (in this case) • Let s be the serial part • Let p be the part that can be parallelized n ways (as fractions, s + p = 1) • Serial: SSPPPPPP • 6 processors: SSP | P | P | P | P | P (one column per processor) • Speedup = 8/3 = 2.67 • $T(n) = \dfrac{1}{s + p/n}$ • As $n \to \infty$, $T(n) \to \dfrac{1}{s}$ • Pessimistic
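The SSPPPPPP example can be replayed by counting time units: the serial units run one after another, and the parallel units are spread over the available processors. The sketch below assumes unit-sized pieces of work that divide evenly across processors (rounded up otherwise); the helper name is hypothetical.

```python
import math

def run_time(serial_units, parallel_units, n):
    """Time units needed when the parallel work is spread evenly over n processors."""
    return serial_units + math.ceil(parallel_units / n)

serial_units, parallel_units = 2, 6   # SSPPPPPP

t1 = run_time(serial_units, parallel_units, 1)   # one processor
t6 = run_time(serial_units, parallel_units, 6)   # six processors
speedup = t1 / t6

# Same result from the formula 1 / (s + p/n) with fractional s and p.
s = serial_units / (serial_units + parallel_units)
p = parallel_units / (serial_units + parallel_units)
formula_speedup = 1.0 / (s + p / 6)

print(t1, t6, round(speedup, 2), round(formula_speedup, 2))
```

With more processors than parallel units the run time stays at 3, since the two serial units and one parallel step remain.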

Amdahl’s Law [Figure: speedup (0 to 25) versus % serial (10% to 99%) for 4 CPUs, 16 CPUs, and 1000 CPUs]

Example

Performance Potential: Another view • Gustafson view (more widely adopted for multiprocessors) • Parallel portion increases as the problem size increases • Serial time fixed (at s) • Parallel time proportional to problem size (true most of the time) • Gustafson’s Law: $\mathrm{Speedup}(N) = N - \alpha (N-1)$ • N: number of processors, $\alpha$: weight of non-parallelizable part • Old Serial: SSPPPPPP • 6 processors: SSPPPPPP | PPPPPP | PPPPPP | PPPPPP | PPPPPP | PPPPPP (one column per processor) • Hypothetical Serial: SSPPPPPP PPPPPP PPPPPP PPPPPP PPPPPP PPPPPP • Speedup(6) = (8+5*6)/8 = 4.75 • $\alpha$ = ? in this calculation • $\mathrm{Speedup}(N) = N(1-\alpha) + \alpha$, which grows without bound as $N \to \infty$ (for $\alpha < 1$)
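The worked example can be cross-checked against the formula. The sketch below assumes one reading of the slide's question: the weight of the non-parallelizable part is the serial share of the parallel run, i.e. 2 serial units out of 8 time units, giving 0.25. The function and variable names are chosen for the example.

```python
def gustafson_speedup(alpha, n):
    """Gustafson speedup on n processors; alpha is the serial share of the parallel run."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return n - alpha * (n - 1)

n = 6
parallel_run_units = 8                        # SSPPPPPP on the first processor
serial_units = 2                              # the SS part
alpha = serial_units / parallel_run_units     # assumed reading: 0.25

# Hypothetical serial run: 8 units plus 6 units for each of the other 5 processors.
hypothetical_serial_units = parallel_run_units + (n - 1) * 6
counted_speedup = hypothetical_serial_units / parallel_run_units

print(gustafson_speedup(alpha, n), counted_speedup)
```

Both routes give 4.75, and the equivalent form N(1 - alpha) + alpha produces the same value.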

Amdahl vs. Gustafson-Barsis [Figure: speedup (0 to 100) versus % serial (10% to 99%), comparing the Gustafson-Barsis and Amdahl curves]

TOP 5 • The most powerful computers in the world must be multiprocessors • Lists shown: Nov. 2012 and 2008 • Source: http://www.top500.org/

A Few Types • Symmetric multiprocessors (SMP) • Small number of cores • Share single memory with uniform memory latency • Distributed shared memory (DSM) • Memory distributed among processors • Non-uniform memory access/latency (NUMA) • Processors connected via direct (switched) and non-direct (multi-hop) interconnection networks

Case Study • Intel i7-860 Nehalem • 4 processors, each with a 32KB L1 data cache, a 32KB L1 instruction cache, and a 256KB unified L2 cache • Bus (interconnect) to a shared 8MB L3 cache • Up to 16 GB main memory (DDR3 interface) • Support for SSE 4.2 SIMD instruction set • 8-way hyperthreading (executes two threads per core) • Superscalar execution (4-way issue per thread)

Case Study • Sun UltraSparc T2 Niagara • 8 processors, each with its own FPU • Full cross-bar (interconnect) to 8 L2 cache banks of 512KB each • 4 memory controllers • Support for VIS 2.0 SIMD instruction set • 64-way multithreading (8-way per processor, 8 processors)

Case Study Nvidia Fermi GPU

Case Study • Apple A5X SoC • 2 ARM cores • 4 GPU cores • 2 GPU Primitive Engines • Other SoC components: WIFI, Video, Audio, DDR controller, etc.

Case Study [Figure: chart plotted against Year; image not preserved]

Case Study • Jaguar • Oak Ridge National Laboratory • Cray XT5 supercomputer • 2.33 Petaflops peak • 1.76 Petaflops sustained • 224,256 AMD Opteron cores • 3-dimensional toroidal mesh

Case Study • Tianhe (TH-1A) • Chinese National University of Defense Technology • 4.7 Petaflops peak • 2.5 Petaflops sustained • 14,336 Intel Xeon X5670 (6-core) CPUs • 7,168 Nvidia M2050 Fermi GPUs

Case Study • IBM Sequoia • IBM + Lawrence Livermore National Laboratory • 20.1 Petaflops peak • 16.32 Petaflops sustained • 1,572,864 IBM PowerPC cores • 6 MW of power! • Huge amount of heat!

Case Study • Google Data Center • Power: build the data center close to a power source (solar, wind, etc.) • Cooling: sea-water-cooled data center

Flynn’s Taxonomy [Figure: taxonomy chart; surviving annotations: “classic von Neumann” and “not covered”]
