
ADA: Advanced Architectures



  1. ADA: Advanced Architectures

  2. A few pieces of information • A. Seznec, microarchitecture, 4 x 2h • S. Collange, GPU/accelerator architecture, 2 x 2h • S. Derrien, SoC + High Level Synthesis, 4 x 2h • Evaluation: • 2 h exam • Written synthesis of a research paper + oral presentation

  3. ADA: Advanced Architectures • Processor Microarchitecture • Slides at https://team.inria.fr/alf/ada2017/

  4. Microarchitecture • Hardware organization of processor chips • At a macroscopic level • Neither at the transistor level nor at the gate level • But understanding the processor organization at the functional unit level

  5. General Purpose Multicores

  6. Moore’s « Law » • The number of transistors on a microprocessor chip doubles every 18 months • 1972: 2,000 transistors (Intel 4004) • 1979: 30,000 transistors (Intel 8086) • 1989: 1 M transistors (Intel 80486) • 1999: 130 M transistors (HP PA-8500) • 2005: 1.7 billion transistors (Intel Itanium Montecito) • Processor performance doubles every 18 months • 1989: Intel 80486, 16 MHz (< 1 inst/cycle) • 1993: Intel Pentium, 66 MHz x 2 inst/cycle • 1995: Intel PentiumPro, 150 MHz x 3 inst/cycle • 06/2000: Intel Pentium III, 1 GHz x 3 inst/cycle • 09/2002: Intel Pentium 4, 2.8 GHz x 3 inst/cycle • 09/2005: Intel Pentium 4, dual core, 3.2 GHz x 3 inst/cycle x 2 threads • 09/2008: Intel Nehalem, quad core, 3.2 GHz x 3 (5) inst/cycle x 2 threads • 06/2013: Intel Haswell, up to 18 cores, 3.5 (3.9) GHz x 3 (8) inst/cycle x 2 threads

  7. This lecture • How to use the increasing number of transistors and the frequency gain to achieve HIGH PERFORMANCE • on a sequential program

  8. What is performance ? • Depends on the user and the application • Servers: • increase the throughput • optimize the response time • Decrease the execution time on a fixed workload • Increase the workload processed in a fixed time interval • Execute some workload within a determined interval: real time • No need to execute faster, but guarantee the response time • Performance is not easy to reason about directly: we need some metrics

  9. Possible performance metrics • Clock frequency: technology and microarchitecture dependent • What is done in a clock cycle: average number of instructions per cycle • What is done in an instruction + at which frequency ?
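Illustration (not from the slides): a minimal C sketch of the execution-time relation these metrics imply, i.e. time = instruction count x (1 / IPC) x cycle time. All numbers are made up for the example.

/* Sketch: the "iron law" of performance behind these metrics.
 * Execution time = instruction count x (1/IPC) x cycle time.
 * The numbers below are illustrative, not measurements. */
#include <stdio.h>

int main(void) {
    double insts   = 1e9;     /* dynamic instruction count      */
    double ipc     = 2.0;     /* average instructions per cycle */
    double freq_hz = 3.0e9;   /* clock frequency: 3 GHz         */

    double cycles = insts / ipc;
    double time_s = cycles / freq_hz;

    printf("cycles = %.3g, time = %.3g s\n", cycles, time_s);
    return 0;
}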

  10. GIPS / gigaflops • GIPS = billions of instructions per second • For a given application, on a given data set, on a given ISA and using a given compiler: allows comparing microarchitectures (but ignores the compiler !!) • gigaflops = billions of floating point operations per second • Used in scientific computing: inherent to the algorithm

  11. How transistors are used: evolution over the last 40 years • In the 70’s: enriching the ISA • Increasing functionality to decrease the instruction count • In the 80’s: caches and registers • Decreasing external accesses • ISAs from 8 to 16 to 32 bits • In the 90’s: instruction parallelism • More instructions in flight, lots of control, lots of speculation • More caches • In the 2000’s: more and more of everything • Thread parallelism, core parallelism

  12. MICROARCHITECTURE OF THE SEQUENTIAL PROCESSOR

  13. The (traditional) sequential application hardware/software interface • [Layered stack: the application programmer → high level language (software) → compiler / code generation → Instruction Set Architecture (ISA) → micro-architecture → hardware → transistor]

  14. Instruction Set Architecture (ISA) • Hardware/software interface: • The compiler translates programs into instructions • The hardware executes instructions • Examples • Intel x86 (1979): still your PC ISA • MIPS, SPARC (mid 80’s) • Alpha, PowerPC (90’s) • Arm: in your phone !! And in most embedded systems • ISAs evolve by successive add-ons: • 16 bits to 32 bits then 64 bits, new multimedia instructions, etc. • Introduction of a new ISA requires good reasons: • New application domains, new constraints • No legacy code

  15. Binary compatibility • Around a billion PCs are executing legacy code: • Introducing a new ISA is a risky business !! • A moving world ? • Embedded processors: • The toaster end user does not care about compatibility • But the toaster manufacturer might • ARMs everywhere: • Smartphones, TVs, set-top boxes: 25 billion ARMs around !!

  16. 32 vs. 64 bits architecture • 32-bit architecture: • virtual addresses are computed on 32 bits • PowerPC, x86, Arm V7 • 64-bit architecture: • virtual addresses are computed on 64 bits • From the 90’s: MIPS III, Alpha, SPARC V9, HP-PA 2.x, IA-64 • PowerPC, x86: 64 bits is the standard now • Laptops feature 8 GB in supermarkets ! • Arm V8: • Smartphones, tablets with 8 GB of memory • Applications are memory hungry: • First databases and scientific applications • Then games, multimedia
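Illustration (not from the slides): a tiny C sketch of why the 32/64-bit split matters once machines ship with more than 4 GB. The 48-bit figure is an assumption about a typical implemented subset of a 64-bit virtual address space, not something stated on the slide.

/* Sketch: addressable space with 32-bit vs (a 48-bit subset of) 64-bit VAs. */
#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint64_t space32 = (uint64_t)1 << 32;   /* bytes addressable with 32 bits          */
    uint64_t space48 = (uint64_t)1 << 48;   /* assumed implemented subset of 64-bit VA */

    printf("32-bit VA: %llu GiB\n", (unsigned long long)(space32 >> 30));
    printf("48-bit VA: %llu TiB\n", (unsigned long long)(space48 >> 40));
    return 0;
}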

  17. What is this lecture about ? • Memory access time is 100 ns • Program semantics are sequential • But modern processors execute 4 instructions every 0.25 ns • How can we achieve that ?

  18. A few technological facts • Frequency: up to 4-5 GHz • An ALU operation: 1 cycle • A floating point operation: 3 cycles • Read/write of a register: 2-3 cycles • Often a critical path ... • Read/write of the L1 cache: 1-3 cycles • Depends on many implementation choices

  19. A few technological parameters • Integration technology: 28 nm – 22 nm – 14 nm • More than one billion transistors for a large processor • 0-150 watts • > 75 watts: high-end servers • < 75 watts: desktop/server cooling at reasonable cost • < 20 watts: laptop power consumption (battery) • < 1 watt: smartphone (battery and the palm of your hand)

  20. Economics (2015) • x86 processors for PCs: • Low end: $42 (Celeron) • High end: $7,000 (Haswell, 18 cores) • DDR4 DRAM, 2133 MHz: • $2.50 for a 4 Gb chip

  21. The architect’s challenge • Designing 2 technology generations ahead • What will you use for performance ? • Memory hierarchy • Pipelining • Instruction Level Parallelism • Speculative execution • Thread parallelism (more cores)

  22. The memory hierarchy

  23. Many memory access instructions • Often 1 out of 4 instructions is a memory access instruction • 1 write for 3 reads on average • Instructions themselves are read from memory

  24. Memory components • Most transistors in a computer system are memory transistors: • Main memory: • Usually DRAM • 4 Gbytes is standard in PCs (2011) • Long access time • On-chip single-ported memory: • Caches, predictors, .. • On-chip multi-ported memory: • Register files, L1 cache, ..

  25. Main memory latency • Memory latency is the delay to get the data once the address has been generated: • Much longer than just the response time of the DRAM component • Memory latency includes: • Virtual to physical address translation • Crossing the whole memory hierarchy • Crossing the processor pins • External bus and memory controller delay (arbitration) • Multiplexing when several memory components share the bus • The DRAM access itself, and the path back • Main memory latency: 100-200 ns • At 3-4 GHz: 300-600 CPU cycles, or around 1000 instructions !! Problem !!
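Back-of-the-envelope check of the slide’s figures (a sketch: the 3 GHz clock and 3 inst/cycle rate are just one point inside the ranges given above).

/* Sketch: converting memory latency in ns into cycles and lost instruction slots. */
#include <stdio.h>

int main(void) {
    double latency_ns[2] = {100.0, 200.0};
    double freq_ghz = 3.0;    /* slide says 3-4 GHz; take 3 GHz here */
    double ipc      = 3.0;    /* ~3 instructions per cycle           */

    for (int i = 0; i < 2; i++) {
        double cycles = latency_ns[i] * freq_ghz;   /* ns x cycles per ns */
        printf("%.0f ns -> %.0f cycles -> ~%.0f instruction slots\n",
               latency_ns[i], cycles, cycles * ipc);
    }
    return 0;
}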

  26. Memory hierarchy • Memory is: • either huge, but slow • or small, but fast • The smaller, the faster • Memory hierarchy goal: • Provide the illusion that the whole memory is fast • Principle: exploit the temporal and spatial locality properties of most applications

  27. Locality property • On most applications, the following properties apply: • Temporal locality: a data/instruction word that has just been accessed is likely to be re-accessed in the near future • Spatial locality: the data/instruction words located close (in the address space) to a data/instruction word that has just been accessed are likely to be accessed in the near future

  28. A few examples of locality • Temporal locality: • Loop indices, loop invariants, .. • Instructions: loops, .. • 90%/10% rule of thumb: a program spends 90 % of its execution time on 10 % of the static code (often much more on much less) • Spatial locality: • Arrays of data, data structures • Instructions: the next instruction after a non-branch instruction is always executed
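Illustration (not from the slides): both locality properties in one ordinary C loop; the array size and the code itself are made up for the example.

/* Sketch: 'sum' and 'i' are reused every iteration (temporal locality);
 * a[i] walks through consecutive addresses (spatial locality);
 * the loop body itself is re-fetched every iteration (temporal locality
 * on instructions). */
#include <stdio.h>

#define N 1024

int main(void) {
    static int a[N];   /* zero-initialized array              */
    long sum = 0;

    for (int i = 0; i < N; i++)   /* sequential accesses: spatial locality */
        sum += a[i];              /* sum, i kept close: temporal locality  */

    printf("%ld\n", sum);
    return 0;
}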

  29. Cache memory • A cache is a small memory whose content is an image of a subset of main memory • A reference to memory is: • 1) presented to the cache • 2) on a miss, presented to the next level of the memory hierarchy (2nd level cache or main memory)

  30. Cache [Diagram: a Load &A request is presented to the cache; the tag identifies the memory block. If the address of block A sits in the tag array, then the block (cache line) is present in the cache.]

  31. Memory hierarchy behavior may dictate performance • Example: • 4 instructions/cycle • 1 memory access per cycle • 10 cycle penalty for accessing the 2nd level cache • 300 cycles round-trip to memory • 2% miss rate on instructions, 4% miss rate on data, 1 reference out of 4 missing in the L2 • To execute 400 instructions: 1320 cycles !!
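The sketch below shows one simple, non-overlapped way of accumulating these penalties. The exact total depends on the accounting assumptions (how many data references there are, whether misses overlap), so this only lands in the same ballpark as the slide’s 1320-cycle figure; it is not the slide’s own derivation.

/* Sketch: naive, non-overlapped miss-penalty accounting for the example above. */
#include <stdio.h>

int main(void) {
    double insts       = 400.0;
    double base_cycles = insts / 4.0;   /* 4 instructions per cycle */
    double data_refs   = base_cycles;   /* 1 data access per cycle  */

    double l1_misses = 0.02 * insts + 0.04 * data_refs;   /* I-side + D-side      */
    double l2_misses = l1_misses / 4.0;                    /* 1 out of 4 misses L2 */

    double total = base_cycles + l1_misses * 10.0 + l2_misses * 300.0;
    printf("total cycles ~ %.0f (vs. %.0f without any miss)\n", total, base_cycles);
    return 0;
}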

  32. Memory to cache transfers • General formula for accessing a K-word block, a word being the width of the bus: • T = a + (K-1) b • The access time a for the first word is long: • round-trip to memory • The structure of DRAM memory favors burst mode • One can resume execution as soon as the missing word is available: • The missing word is loaded first: the visible penalty is about a
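A tiny numerical illustration of the formula; the timings a and b below are invented for the example, not taken from a real DRAM part.

/* Sketch: burst-transfer formula T = a + (K-1)*b.
 * 'a' is the long first-word access, 'b' the per-word burst rate. */
#include <stdio.h>

int main(void) {
    double a = 100.0;   /* first word: full round-trip to DRAM (cycles) */
    double b = 4.0;     /* each following word of the burst (cycles)    */
    int    K = 8;       /* words in a cache block                        */

    double T = a + (K - 1) * b;
    printf("block transfer: %.0f cycles; visible penalty with the missing word first: ~%.0f\n",
           T, a);
    return 0;
}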

  33. Data placement in the cache • A block has a single possible cache location: • DIRECT MAPPED cache • A block can be placed at any location in the cache: • FULLY ASSOCIATIVE cache • A block can be placed at a few places (N) in the cache: • SET ASSOCIATIVE cache

  34. Direct mapped cache [Diagram: the address is split into tag / index / offset; the index selects one entry, whose stored tag is compared with the address tag to decide hit or miss and drive data out]

  35. Fully associative cache [Diagram: the address is split into tag / offset; the tag is compared against all entries in parallel, and the matching entry drives data out]

  36. Set associative cache [Diagram: the address is split into tag / index / offset; the index selects a set, the tags of all ways in the set are compared in parallel, and the matching way drives data out]
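To make the three diagrams concrete, here is a sketch of how an address is typically split for a set-associative cache. The 32 KB / 4-way / 64-byte-block geometry is an assumption chosen for the example, not a specific processor; the field the diagrams label as the index selects the set.

/* Sketch: tag / set index / offset decomposition for an assumed
 * 32 KB, 4-way set-associative cache with 64-byte blocks. */
#include <stdio.h>

#define BLOCK_SIZE 64u
#define WAYS       4u
#define CACHE_SIZE (32u * 1024u)
#define NB_SETS    (CACHE_SIZE / (BLOCK_SIZE * WAYS))   /* 128 sets */

int main(void) {
    unsigned addr   = 0x12345678u;
    unsigned offset = addr % BLOCK_SIZE;              /* byte within the block       */
    unsigned set    = (addr / BLOCK_SIZE) % NB_SETS;  /* which set is looked up      */
    unsigned tag    = addr / (BLOCK_SIZE * NB_SETS);  /* compared against stored tags */

    printf("addr 0x%08x -> tag 0x%x, set %u, offset %u\n", addr, tag, set, offset);
    return 0;
}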

  37. Direct mapped cache • Single place for a block • Short hit time • Poor hit ratio: • Ping-pong phenomenon

  38. Fully associative cache • Long access time • High power consumption • Limited size

  39. Set associative cache • N places for a block • Fewer conflicts • Good tradeoff

  40. Replacing a block • When a set is full, one has to choose a block to replace • RANDOM policy: pick a block at random and replace it • LRU: Least Recently Used • Random: • Easy to implement, but not very efficient in practice • LRU: • good on average • but pathological artefacts
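A minimal sketch of LRU victim selection within one set, using per-way "last used" timestamps. Real hardware usually implements a cheaper approximation (pseudo-LRU), but the idea is the same: evict the way that has gone unused the longest.

/* Sketch: LRU replacement within a 4-way set, illustrative only. */
#include <stdio.h>
#include <stdint.h>

#define WAYS 4

struct way { uint32_t tag; int valid; uint64_t last_used; };

/* Pick the victim way: an invalid way if any, otherwise the LRU one. */
static int victim(const struct way set[WAYS]) {
    int lru = 0;
    for (int w = 0; w < WAYS; w++) {
        if (!set[w].valid) return w;
        if (set[w].last_used < set[lru].last_used) lru = w;
    }
    return lru;
}

int main(void) {
    struct way set[WAYS] = {
        {0x10, 1, 5}, {0x20, 1, 2}, {0x30, 1, 9}, {0x40, 1, 7}
    };
    printf("evict way %d\n", victim(set));   /* way 1: unused for the longest time */
    return 0;
}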

  41. Writing on a cache • Stores must be propagated to main memory: • immediately: WRITE THROUGH • or later: WRITE BACK (when the block is evicted from the cache) • WRITE THROUGH: • heavy traffic towards memory • WRITE BACK: • Less traffic with memory (or the L+1 cache) • Coherency issues between memory and cache

  42. Writing on a cache (2) • On a miss, the (dirty) evicted data block must be written back to memory: • The missing block is needed immediately, so the write can be delayed • Use of a write buffer: • the dirty block is stored temporarily in a buffer, waiting for a free slot on the memory interface, before being effectively written to memory
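A sketch of how a write buffer absorbs the dirty victim so the miss can proceed. The buffer size and the commented-out memory_write call are hypothetical placeholders, not a real interface.

/* Sketch: write-back eviction through a write buffer, illustrative only. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_BYTES 64
#define WB_ENTRIES  4

struct wb_entry { uint64_t addr; uint8_t data[BLOCK_BYTES]; int full; };
static struct wb_entry write_buffer[WB_ENTRIES];

/* Called when a dirty block is evicted: park it instead of writing it now. */
static int park_dirty_block(uint64_t addr, const uint8_t *data) {
    for (int i = 0; i < WB_ENTRIES; i++) {
        if (!write_buffer[i].full) {
            write_buffer[i].addr = addr;
            memcpy(write_buffer[i].data, data, BLOCK_BYTES);
            write_buffer[i].full = 1;
            return 1;          /* eviction absorbed, the miss can proceed */
        }
    }
    return 0;                  /* buffer full: the miss must wait         */
}

/* Called when the memory interface is idle: actually write one block back. */
static void drain_one(void) {
    for (int i = 0; i < WB_ENTRIES; i++) {
        if (write_buffer[i].full) {
            /* memory_write(write_buffer[i].addr, write_buffer[i].data);  (hypothetical) */
            write_buffer[i].full = 0;
            return;
        }
    }
}

int main(void) {
    uint8_t block[BLOCK_BYTES] = {0};
    printf("parked: %d\n", park_dirty_block(0x1000, block));
    drain_one();
    return 0;
}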

  43. Block size • Long blocks: • Exploit spatial locality • But load useless words when spatial locality is poor • Short blocks: • Misses on contiguous blocks • Experimentally: • 16-64 bytes for small caches (8-32 Kbytes) • 64-128 bytes for large caches (256 Kbytes - 4 Mbytes)

  44. Cache hierarchy • A cache hierarchy has become the standard: • L1: small (<= 64 Kbytes), short access time (1-3 cycles) • Separate instruction and data caches • L2: longer access time (7-15 cycles), 512 Kbytes - 2 Mbytes • Unified • L3: 4-32 Mbytes (20-30 cycles) • Unified, shared on a multiprocessor

  45. The coherency issue • Several copies of the same block may be present in the memory hierarchy: • Main memory • L3 cache and/or memory write buffer • L2 cache and/or memory write buffer • L1 cache and/or memory write buffer • How to recognize the correct copy after a write ? • On a single processor ? With an external peripheral ? • On a multiprocessor ?

  46. Inclusion or not ? • Inclusion property: • any block in the L1 cache is also in the L2 cache • Advantage: • External transactions can be monitored at the L2 cache level: • No need to check the L1 cache content • Drawback: • Some waste of storage

  47. Cache misses do not stop a processor (completely) • On an L1 cache miss: • The request is sent to the L2 cache, but sequencing and execution continue • On an L2 hit, the latency is only a few cycles • On an L2 miss and L3 miss, the latency is hundreds of cycles: • Execution stops after a while • Out-of-order execution allows several L2 cache misses to be initiated (and serviced in a pipelined mode) at the same time: • The latency is partially hidden (later in the course)

  48. Prefetching • To avoid waiting 200-300 cycles .. • one can try to anticipate misses and load the (future) missing blocks into the cache in advance • Many techniques: • Sequential prefetching: prefetch the next sequential blocks • When the application is streaming through an array • Stride prefetching: recognize a stride pattern and prefetch the blocks along that pattern • Hardware and software methods are available: • Many complex issues: latency, pollution, .. • Do not waste bandwidth
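A sketch of the stride idea with a single-entry predictor; real stride prefetchers track many streams (typically per load PC) and add confidence counters, so this only shows the core mechanism.

/* Sketch: one-entry stride prefetcher. It remembers the last miss address,
 * and once two consecutive misses exhibit the same stride it requests the
 * next address along that stride. Illustrative only. */
#include <stdio.h>
#include <stdint.h>

static uint64_t last_addr;
static int64_t  last_stride;
static int      confident;

/* Called on each miss; returns a prefetch address or 0 for none. */
static uint64_t on_miss(uint64_t addr) {
    int64_t stride = (int64_t)(addr - last_addr);
    uint64_t prefetch = 0;

    if (stride != 0 && stride == last_stride) confident = 1;
    if (confident) prefetch = addr + stride;    /* anticipate the next miss */

    last_stride = stride;
    last_addr   = addr;
    return prefetch;
}

int main(void) {
    uint64_t misses[] = {0x1000, 0x1040, 0x1080, 0x10c0};
    for (int i = 0; i < 4; i++)
        printf("miss 0x%llx -> prefetch 0x%llx\n",
               (unsigned long long)misses[i],
               (unsigned long long)on_miss(misses[i]));
    return 0;
}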

  49. Virtual memory • The virtual address VA is computed by the program. VA is then translated through the page table into a physical address PA • VA allows different independent processes to execute • VA allows exploiting an address space larger than the physical memory • Translating VA into PA: • Combines hardware and software techniques

  50. Virtual memory (2) • Typical page sizes: 4 KB - 8 KB - 64 KB • Sometimes support for both large and small pages • The page table is indexed with the virtual page number + the process identity: • Provides a physical page number • The page table is stored in a memory zone where no translation is needed • Translation is needed on every memory access: • Translation time would become predominant • In practice, use of a TLB (translation look-aside buffer) • TLB = special purpose cache for address translation • Maps a subset of the page table • Typically 64/128 entries, fully associative
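A sketch of a TLB sitting in front of the page table. The page size and entry count follow the slide, but the page_table_walk helper, the missing ASID/permission checks, and the naive replacement are hypothetical simplifications.

/* Sketch: fully-associative TLB lookup before a (stand-in) page table walk. */
#include <stdio.h>
#include <stdint.h>

#define PAGE_SHIFT 12            /* 4 KB pages */
#define TLB_SIZE   64

struct tlb_entry { uint64_t vpn; uint64_t pfn; int valid; };
static struct tlb_entry tlb[TLB_SIZE];

/* Stand-in for the slow page table walk (hypothetical mapping). */
static uint64_t page_table_walk(uint64_t vpn) { return vpn ^ 0x5a5a; }

static uint64_t translate(uint64_t va) {
    uint64_t vpn = va >> PAGE_SHIFT;
    uint64_t off = va & ((1u << PAGE_SHIFT) - 1);

    for (int i = 0; i < TLB_SIZE; i++)             /* TLB hit: fast path */
        if (tlb[i].valid && tlb[i].vpn == vpn)
            return (tlb[i].pfn << PAGE_SHIFT) | off;

    uint64_t pfn = page_table_walk(vpn);           /* TLB miss: slow path */
    tlb[0] = (struct tlb_entry){vpn, pfn, 1};      /* naive replacement   */
    return (pfn << PAGE_SHIFT) | off;
}

int main(void) {
    printf("PA = 0x%llx\n", (unsigned long long)translate(0x7fff1234));
    return 0;
}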
