
EE (CE) 6304 Computer Architecture Lecture #3 (8/29/17)



Presentation Transcript


  1. EE (CE) 6304 Computer Architecture Lecture #3 (8/29/17) Yiorgos Makris, Professor, Department of Electrical Engineering, University of Texas at Dallas. Course Web-site: http://www.utdallas.edu/~gxm112130/EE6304FA17

  2. Have we reached the end of ILP? • Multiple processors easily fit on a chip • Every major microprocessor vendor has gone to multithreaded cores • Thread: locus of control, execution context • Fetch instructions from multiple threads at once, throw them all into the execution unit • Intel: hyperthreading • Concept has existed in high-performance computing for 20 years (or is it 40? CDC 6600) • Vector processing • Each instruction processes many distinct data elements • Ex: MMX • Raise the level of architecture – many processors per chip (e.g., Tensilica configurable processors)

  3. Limiting Forces: Clock Speed and ILP • Chip density continues to increase ~2x every 2 years • Clock speed does not • # processors/chip (cores) may double instead • There is little or no more Instruction Level Parallelism (ILP) to be found • We can no longer allow the programmer to think in terms of a serial programming model • Conclusion: Parallelism must be exposed to software! Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond)

  4. Examples of MIMD Machines • Symmetric Multiprocessor • Multiple processors in a box with shared-memory communication • Current multicore chips are like this • Every processor runs a copy of the OS • Non-uniform shared-memory with separate I/O through a host • Multiple processors • Each with local memory • General, scalable network • Extremely light "OS" on each node provides simple services • Scheduling/synchronization • Network-accessible host for I/O • Cluster • Many independent machines connected with a general network • Communication through messages

  5. Categories of Thread Execution • (Figure: issue slots over time, one processor cycle per row, for Superscalar, Fine-Grained multithreading, Coarse-Grained multithreading, Multiprocessing, and Simultaneous Multithreading; colors distinguish Threads 1-5 and idle slots)

  6. Processor-DRAM Memory Gap (latency) • (Figure: relative performance vs. year, 1980-2000, log scale) • µProc: 60%/yr. (2X/1.5 yr) • DRAM: 9%/yr. (2X/10 yrs) • Processor-Memory performance gap grows 50%/year

  7. The Memory Abstraction • Association of <name, value> pairs • Typically named as byte addresses • Often values aligned on multiples of their size • Sequence of Reads and Writes • Write binds a value to an address • Read of an address returns the most recently written value bound to that address • (Figure: memory interface with command (R/W), address (name), data (W), data (R), and done signals)

  8. Memory Hierarchy • Take advantage of the principle of locality to: • Present as much memory as in the cheapest technology • Provide access at the speed offered by the fastest technology • (Figure: hierarchy from registers and on-chip cache through second-level cache (SRAM), main memory (DRAM/FLASH/PCM), and secondary storage (disk/FLASH/PCM) to tertiary storage (tape/cloud storage); sizes grow from 100s of bytes through Ks-Ms, Gs, and Ts, while access times grow from ~1 ns through 10s-100s ns to 10s of ms and 10s of seconds)

  9. The Principle of Locality • The Principle of Locality: • Programs access a relatively small portion of the address space at any instant of time • Two Different Types of Locality: • Temporal Locality (Locality in Time): If an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse) • Spatial Locality (Locality in Space): If an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access) • For the last 30 years, HW has relied on locality for speed

  10. Example of modern core: Nehalem • On-chip cache resources: • For each core: L1: 32K instruction and 32K data cache, L2: 1MB • L3: 8MB shared among all 4 cores • Integrated, on-chip memory controller (DDR3)

  11. Memory Abstraction and Parallelism • (Figure: two organizations of processors P1..Pn with caches: shared memory reached over an interconnection network vs. memory distributed with each processor) • Maintaining the illusion of sequential access to memory across a distributed system • What happens when multiple processors access the same memory at once? • Do they see a consistent picture? • Processing and processors embedded in the memory?

  12. Is it all about communication? • (Figure: Pentium IV chipset block diagram: processor, caches, busses, adapters, memory, controllers, and I/O devices such as disks, displays, keyboards, and networks)

  13. Breaking the HW/Software Boundary • Moore’s law (more and more transistors) is all about volume and regularity • What if you could pour nano-acres of unspecified digital logic “stuff” onto silicon? • Do anything with it. Very regular, large volume • Field Programmable Gate Arrays • Chip is covered with logic blocks w/ FFs, RAM blocks, and interconnect • All three are “programmable” by setting configuration bits • These are huge? • Can each program have its own instruction set? • Do we compile the program entirely into hardware?

  14. “Bell’s Law” – new class per decade • (Figure: log(people per computer) vs. year; computer classes evolve from number crunching and data storage through productivity and interactive use to streaming information to/from the physical world) • Enabled by technological opportunities • Smaller, more numerous, and more intimately connected • Brings in a new kind of application • Used in many ways not previously imagined

  15. It’s not just about bigger and faster! • Complete computing systems can be tiny and cheap • System on a chip • Resource efficiency • Real-estate, power, pins, …

  16. Understanding & Quantifying Cost, Performance, Power, Dependability & Reliability

  17. Integrated Circuit Cost • Integrated circuit cost depends on die cost and yield (formulas reconstructed below) • Bose-Einstein yield formula • Defects per unit area = 0.016-0.057 defects per square cm (2010) • N = process-complexity factor = 11.5-15.5 (40 nm, 2010)
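The cost and yield equations on this slide were images; the standard Hennessy & Patterson formulas they correspond to (with the Bose-Einstein yield model) are, as a reconstruction:

\[
\text{Cost of die} = \frac{\text{Cost of wafer}}{\text{Dies per wafer} \times \text{Die yield}}
\]
\[
\text{Dies per wafer} \approx \frac{\pi \times (\text{Wafer diameter}/2)^{2}}{\text{Die area}} - \frac{\pi \times \text{Wafer diameter}}{\sqrt{2 \times \text{Die area}}}
\]
\[
\text{Die yield} = \text{Wafer yield} \times \frac{1}{\left(1 + \text{Defects per unit area} \times \text{Die area}\right)^{N}}
\]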

  18. Which is faster?
      Plane               DC to Paris   Speed      Passengers   Throughput (pmph = passengers × mph)
      Boeing 747          6.5 hours     610 mph    470          286,700
      BAC/Sud Concorde    3 hours       1350 mph   132          178,200
      • Time to run the task (ExTime) • Execution time, response time, latency • Tasks per day, hour, week, sec, ns … (Performance) • Throughput, bandwidth

  19. Definitions • Performance is in units of things per second • Bigger is better • If we are primarily concerned with response time: performance(X) = 1 / execution_time(X) • "X is n times faster than Y" means n = Performance(X) / Performance(Y) = Execution_time(Y) / Execution_time(X)

  20. Processor performance equation • CPU time = Seconds/Program = (Instructions/Program) × (Cycles/Instruction) × (Seconds/Cycle) = Instruction Count × CPI × Cycle time • What each design level affects:
                     Inst Count   CPI    Clock Rate
      Program            X
      Compiler           X         (X)
      Inst. Set.         X          X
      Organization                  X        X
      Technology                             X

  21. Cycles Per Instruction (Throughput) • "Average Cycles per Instruction": CPI = (CPU Time × Clock Rate) / Instruction Count = Cycles / Instruction Count • "Instruction Frequency": the fraction of the instruction count contributed by each instruction class (per-class formula below)
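The per-class summation that usually accompanies this slide did not survive the transcript; the standard form, as a reconstruction, is:

\[
\text{CPU time} = \text{Cycle Time} \times \sum_{i=1}^{n} \text{CPI}_i \times I_i
\qquad\qquad
\text{CPI} = \sum_{i=1}^{n} \text{CPI}_i \times F_i \quad\text{where}\quad F_i = \frac{I_i}{\text{Instruction Count}}
\]

The classes with the largest CPI_i × F_i products dominate the overall CPI, so that is where to invest design resources.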

  22. Example: Calculating CPI bottom up • Run benchmark and collect workload characterization (simulate, machine counters, or sampling) • Base Machine (Reg / Reg), typical mix of instruction types in a program:
      Op       Freq   Cycles   CPI(i)   (% Time)
      ALU      50%    1        .5       (33%)
      Load     20%    2        .4       (27%)
      Store    10%    2        .2       (13%)
      Branch   20%    2        .4       (27%)
      Total                    1.5
      • Design guideline: make the common case fast • MIPS 1% rule: only consider adding an instruction if it is shown to add a 1% performance improvement on reasonable benchmarks.
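As an illustration (not from the slides), here is a minimal C sketch that computes the weighted CPI from an instruction mix like the one above; the struct and field names are invented for this example:

    #include <stdio.h>

    /* Hypothetical example: bottom-up CPI from an instruction-mix profile.
       Frequencies and per-class cycle counts mirror the table on this slide. */
    struct op_class {
        const char *name;
        double freq;    /* fraction of dynamic instruction count */
        double cycles;  /* average cycles for this instruction class */
    };

    int main(void) {
        struct op_class mix[] = {
            { "ALU",    0.50, 1.0 },
            { "Load",   0.20, 2.0 },
            { "Store",  0.10, 2.0 },
            { "Branch", 0.20, 2.0 },
        };
        int n = sizeof(mix) / sizeof(mix[0]);
        double cpi = 0.0;

        for (int i = 0; i < n; i++)
            cpi += mix[i].freq * mix[i].cycles;   /* CPI = sum over i of freq_i * cycles_i */

        for (int i = 0; i < n; i++)
            printf("%-6s freq=%3.0f%%  CPI(i)=%.2f  time=%3.0f%%\n",
                   mix[i].name, 100.0 * mix[i].freq,
                   mix[i].freq * mix[i].cycles,
                   100.0 * mix[i].freq * mix[i].cycles / cpi);

        printf("Average CPI = %.2f\n", cpi);       /* 1.50 for this mix */
        return 0;
    }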

  23. Example: Branch Stall Impact • Assume CPI = 1.0 ignoring branches (ideal) • Assume branches stall for 3 cycles • If 30% of instructions are branches, stall 3 cycles on that 30%:
      Op       Freq   Cycles   CPI(i)   (% Time)
      Other    70%    1        .7       (37%)
      Branch   30%    4        1.2      (63%)
      • New CPI = .7 + 1.2 = 1.9 • New machine is 1/1.9 = 0.52 times as fast (i.e., slower!)

  24. Speedup Equation for Pipelining • For the simple RISC pipeline, CPI = 1 (formula reconstructed below):
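The speedup equation itself was an image on this slide; the standard textbook form it presents is, as a reconstruction:

\[
\text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{Ideal CPI} + \text{Pipeline stall CPI}} \times \frac{\text{Cycle Time}_{\text{unpipelined}}}{\text{Cycle Time}_{\text{pipelined}}}
\]

which, for the simple RISC pipeline with Ideal CPI = 1, becomes

\[
\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Cycle Time}_{\text{unpipelined}}}{\text{Cycle Time}_{\text{pipelined}}}
\]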

  25. Making the common case fast • Often an architect spends tremendous effort and time optimizing some aspect of a system • and only later realizes that the overall speedup is unrewarding • So it is better to measure how heavily that aspect of the system is used before attempting to optimize it • In making a design trade-off • Favor the frequent case over the infrequent case • In allocating additional resources • Allocate them to improve the frequent event, rather than a rare event • So, what principle quantifies this scenario?

  26. Amdahl’s Law • Overall speedup when an enhancement applies to only a fraction of the execution time, and the best you could ever hope to do (equations reconstructed below):
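The two equations on this slide were images; the standard statement of Amdahl’s Law they show is:

\[
\text{Speedup}_{\text{overall}} = \frac{\text{ExTime}_{\text{old}}}{\text{ExTime}_{\text{new}}}
= \frac{1}{\left(1 - \text{Fraction}_{\text{enhanced}}\right) + \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}}
\]

and the best you could ever hope to do (letting Speedup_enhanced go to infinity):

\[
\text{Speedup}_{\text{maximum}} = \frac{1}{1 - \text{Fraction}_{\text{enhanced}}}
\]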

  27. Amdahl’s Law example • New CPU is 10X faster • I/O-bound server, so 60% of time is spent waiting for I/O (worked out below) • Apparently, it’s human nature to be attracted by 10X faster, vs. keeping in perspective that it’s just 1.6X faster
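The worked calculation on the slide was an image; applying Amdahl’s Law with Fraction_enhanced = 0.4 and Speedup_enhanced = 10 gives:

\[
\text{Speedup}_{\text{overall}} = \frac{1}{(1 - 0.4) + \dfrac{0.4}{10}} = \frac{1}{0.64} \approx 1.56
\]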

  28. Define and quantify power (1/2) • For CMOS chips, the traditionally dominant energy consumption has been in switching transistors, called dynamic power • For mobile devices, energy is the better metric • For a fixed task, slowing the clock rate (frequency switched) reduces power, but not energy • Capacitive load is a function of the number of transistors connected to an output and of the technology, which determines the capacitance of wires and transistors • Dropping voltage helps both, so supplies went from 5V to 1V • To save energy & dynamic power, most CPUs now turn off the clock of inactive modules (e.g., the Fl. Pt. Unit)
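For reference, the standard CMOS relations behind these bullets (proportionalities, since the constants depend on technology) are:

\[
\text{Energy}_{\text{dynamic}} \propto \text{Capacitive load} \times \text{Voltage}^{2}
\qquad\qquad
\text{Power}_{\text{dynamic}} \propto \tfrac{1}{2} \times \text{Capacitive load} \times \text{Voltage}^{2} \times \text{Frequency switched}
\]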

  29. Example of quantifying power • Suppose a 15% reduction in voltage results in a 15% reduction in frequency. What is the impact on dynamic power? (worked out below)
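The slide’s worked answer was an image; since dynamic power scales with Voltage² × Frequency, the calculation is:

\[
\frac{\text{Power}_{\text{new}}}{\text{Power}_{\text{old}}}
= \frac{(0.85 \times \text{Voltage})^{2} \times (0.85 \times \text{Frequency})}{\text{Voltage}^{2} \times \text{Frequency}}
= 0.85^{3} \approx 0.61
\]

so dynamic power falls to about 61% of its original value, roughly a 40% reduction.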

  30. Define and quantify power (2/2) • Because leakage current flows even when a transistor is off, static power is now important too • Leakage current increases in processors with smaller transistor sizes • Increasing the number of transistors increases power even if they are turned off • In 2006, the goal for leakage was 25% of total power consumption; high-performance designs were at 40% • Very low-power systems even gate the voltage to inactive modules to control loss due to leakage
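For reference, the corresponding relation for static power is:

\[
\text{Power}_{\text{static}} \propto \text{Current}_{\text{static}} \times \text{Voltage}
\]

so leakage grows with both the total leakage current (more, and smaller, transistors) and the supply voltage.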

  31. Power and Energy • Energy to complete operation (Joules) • Corresponds approximately to battery life • (Battery energy capacity actually depends on rate of discharge) • Peak power dissipation (Watts = Joules/second) • Affects packaging (power and ground pins, thermal design) • di/dt, peak change in supply current (Amps/second) • Affects power supply noise (power and ground pins, decoupling capacitors)

  32. Peak Power versus Lower Energy • (Figure: power vs. time curves for two systems; integrate the power curve to get energy) • System A has higher peak power, but lower total energy • System B has lower peak power, but higher total energy

  33. Define and quantify dependability (1/3) • How do we decide when a system is operating properly? • Infrastructure providers now offer Service Level Agreements (SLAs) to guarantee that their networking or power service will be dependable • Systems alternate between 2 states of service with respect to an SLA: • Service accomplishment, where the service is delivered as specified in the SLA • Service interruption, where the delivered service is different from the SLA • Failure = transition from state 1 to state 2 • Restoration = transition from state 2 to state 1

  34. Define and quantify dependability (2/3) • Module reliability = measure of continuous service accomplishment (or time to failure). 2 metrics: • Mean Time To Failure (MTTF) measures Reliability • Failures In Time (FIT) = the failure rate 1/MTTF • Traditionally reported as failures per billion hours of operation • Mean Time To Repair (MTTR) measures Service Interruption • Mean Time Between Failures (MTBF) = MTTF + MTTR • Module availability measures service as it alternates between the 2 states of accomplishment and interruption (a number between 0 and 1, e.g. 0.9) • Module availability = MTTF / (MTTF + MTTR)

  35. Example calculating reliability • If modules have exponentially distributed lifetimes (the age of a module does not affect its probability of failure), the overall failure rate is the sum of the failure rates of the modules • Calculate FIT and MTTF for 10 disks (1M-hour MTTF per disk), 1 disk controller (0.5M-hour MTTF), and 1 power supply (0.2M-hour MTTF) (worked out below):
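The arithmetic on the slide was an image; assuming independent, exponentially distributed lifetimes:

\[
\text{FailureRate}_{\text{system}} = 10 \times \frac{1}{1{,}000{,}000} + \frac{1}{500{,}000} + \frac{1}{200{,}000}
= \frac{10 + 2 + 5}{1{,}000{,}000} = \frac{17}{1{,}000{,}000}\ \text{failures per hour}
\]

or 17,000 FIT (failures per billion hours), so

\[
\text{MTTF}_{\text{system}} = \frac{1{,}000{,}000{,}000}{17{,}000} \approx 59{,}000\ \text{hours} \approx 6.7\ \text{years}
\]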
