
ADA: Advanced Architectures



  1. ADA: Advanced Architectures

  2. A few pieces of information • A. Seznec, microarchitecture, 4 x 2h • S. Collange, GPU/accelerator architecture, 2 x 2h • S. Derrien, SoC + High Level Synthesis, 4 x 2h • Evaluation: • 2 h exam • Written synthesis of a research paper + oral presentation

  3. ADA: Advanced Architectures • Processor Microarchitecture • Slides at https://team.inria.fr/alf/ada2017/

  4. Microarchitecture • Hardware organization of processor chips • At a macroscopic level • Neither at the transistor level nor at the gate level • But understanding the processor organization at the functional unit level

  5. General Purpose Multicores

  6. Moore’s « Law » • The number of transistors on a microprocessor chip doubles every 18 months • 1972: 2,000 transistors (Intel 4004) • 1979: 30,000 transistors (Intel 8086) • 1989: 1 M transistors (Intel 80486) • 1999: 130 M transistors (HP PA-8500) • 2005: 1.7 billion transistors (Intel Itanium Montecito) • Processor performance doubles every 18 months • 1989: Intel 80486, 16 MHz (< 1 inst/cycle) • 1993: Intel Pentium, 66 MHz x 2 inst/cycle • 1995: Intel PentiumPro, 150 MHz x 3 inst/cycle • 06/2000: Intel Pentium III, 1 GHz x 3 inst/cycle • 09/2002: Intel Pentium 4, 2.8 GHz x 3 inst/cycle • 09/2005: Intel Pentium 4, dual core, 3.2 GHz x 3 inst/cycle x 2 threads • 09/2008: Intel Nehalem, quad core, 3.2 GHz x 3 (5) inst/cycle x 2 threads • 06/2013: Intel Haswell, up to 18 cores, 3.5 (3.9) GHz x 3 (8) inst/cycle x 2 threads

  7. This lecture • How to use the increasing number of transistors and the frequency gain to achieve HIGH PERFORMANCE • on a sequential program

  8. What is performance ? • Depends on the user and the application • Servers: • increase the throughput • optimize the response time • Decrease the execution time on a fixed workload • Increase the workload processed in a fixed time interval • Execute some workload within a determined interval: real time • No need to execute faster, but guarantee the response time • Performance is not easy to reason about directly: we need some metrics

  9. Possible performance metrics • Clock frequency: technology and microarchitecture dependent • What is done in a clock cycle: average number of instructions per cycle • What is done in an instruction + at which frequency ?
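Illustration (not from the slides): a minimal C sketch of the execution-time relation these metrics imply, i.e. time = instruction count x (1 / IPC) x cycle time. All numbers are made up for the example.

/* Sketch: the "iron law" of performance behind these metrics.
 * Execution time = instruction count x (1/IPC) x cycle time.
 * The numbers below are illustrative, not measurements. */
#include <stdio.h>

int main(void) {
    double insts   = 1e9;     /* dynamic instruction count      */
    double ipc     = 2.0;     /* average instructions per cycle */
    double freq_hz = 3.0e9;   /* clock frequency: 3 GHz         */

    double cycles = insts / ipc;
    double time_s = cycles / freq_hz;

    printf("cycles = %.3g, time = %.3g s\n", cycles, time_s);
    return 0;
}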

  10. GIPS / gigaflops • GIPS = billions of instructions per second • For a given application, on a given data set, on a given ISA and using a given compiler: allows comparing microarchitectures (but ignores the compiler !!) • gigaflops = billions of floating point operations per second • Used in scientific computing: inherent to the algorithm

  11. How transistors are used: evolution over the last 40 years • In the 70’s: enriching the ISA • Increasing functionality to decrease the instruction count • In the 80’s: caches and registers • Decreasing external accesses • ISAs from 8 to 16 to 32 bits • In the 90’s: instruction parallelism • More instructions in flight, lots of control, lots of speculation • More caches • In the 2000’s: more and more of everything • Thread parallelism, core parallelism

  12. MICROARCHITECTURE OF THE SEQUENTIAL PROCESSOR

  13. The (traditional) sequential application hardware/software interface • [Layered stack: the application programmer → high level language (software) → compiler / code generation → Instruction Set Architecture (ISA) → micro-architecture → hardware → transistor]

  14. Instruction Set Architecture (ISA) • Hardware/software interface: • The compiler translates programs into instructions • The hardware executes instructions • Examples • Intel x86 (1979): still your PC ISA • MIPS, SPARC (mid 80’s) • Alpha, PowerPC (90’s) • Arm: in your phone !! And in most embedded systems • ISAs evolve by successive add-ons: • 16 bits to 32 bits then 64 bits, new multimedia instructions, etc. • Introduction of a new ISA requires good reasons: • New application domains, new constraints • No legacy code

  15. Binary compatibility • Around a billion PCs are executing legacy code: • Introducing a new ISA is a risky business !! • A moving world ? • Embedded processors: • The toaster end user does not care about compatibility • But the toaster manufacturer might • ARMs everywhere: • Smartphones, TVs, set-top boxes: 25 billion ARMs around !!

  16. 32 vs. 64 bits architecture • 32-bit architecture: • virtual addresses are computed on 32 bits • PowerPC, x86, Arm V7 • 64-bit architecture: • virtual addresses are computed on 64 bits • From the 90’s: MIPS III, Alpha, SPARC V9, HP-PA 2.x, IA-64 • PowerPC, x86: 64 bits is the standard now • Laptops feature 8 GB in supermarkets ! • Arm V8: • Smartphones, tablets with 8 GB of memory • Applications are memory hungry: • First databases and scientific applications • Then games, multimedia
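Illustration (not from the slides): a tiny C sketch of why the 32/64-bit split matters once machines ship with more than 4 GB. The 48-bit figure is an assumption about a typical implemented subset of a 64-bit virtual address space, not something stated on the slide.

/* Sketch: addressable space with 32-bit vs (a 48-bit subset of) 64-bit VAs. */
#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint64_t space32 = (uint64_t)1 << 32;   /* bytes addressable with 32 bits          */
    uint64_t space48 = (uint64_t)1 << 48;   /* assumed implemented subset of 64-bit VA */

    printf("32-bit VA: %llu GiB\n", (unsigned long long)(space32 >> 30));
    printf("48-bit VA: %llu TiB\n", (unsigned long long)(space48 >> 40));
    return 0;
}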

  17. What is this lecture about ? • Memory access time is 100 ns • Program semantics are sequential • But modern processors execute 4 instructions every 0.25 ns • How can we achieve that ?

  18. A few technological facts • Frequency: up to 4-5 GHz • An ALU operation: 1 cycle • A floating point operation: 3 cycles • Read/write of a register: 2-3 cycles • Often a critical path ... • Read/write of the L1 cache: 1-3 cycles • Depends on many implementation choices

  19. A few technological parameters • Integration technology: 28 nm – 22 nm – 14 nm • More than one billion transistors for a large processor • 0-150 watts • > 75 watts: high-end servers • < 75 watts: desktop/server cooling at reasonable cost • < 20 watts: laptop power consumption (battery) • < 1 watt: smartphone (battery and the palm of your hand)

  20. Economics (2015) • x86 processors for PCs: • Low end: $42 (Celeron) • High end: $7,000 (Haswell, 18 cores) • DDR4 DRAM, 2133 MHz: • $2.50 for a 4 Gb chip

  21. The architect’s challenge • Designing 2 technology generations ahead • What will you use for performance ? • Memory hierarchy • Pipelining • Instruction Level Parallelism • Speculative execution • Thread parallelism (more cores)

  22. The memory hierarchy

  23. Many memory access instructions • Often 1 out of 4 instructions is a memory access instruction • 1 write for 3 reads on average • Instructions themselves are read from memory

  24. Memory components • Most transistors in a computer system are memory transistors: • Main memory: • Usually DRAM • 4 Gbytes is standard in PCs (2011) • Long access time • On-chip single-ported memory: • Caches, predictors, .. • On-chip multi-ported memory: • Register files, L1 cache, ..

  25. Main memory latency • Memory latency is the delay to get the data once the address has been generated: • Much longer than just the response time of the DRAM component • Memory latency includes: • Virtual to physical address translation • Crossing the whole memory hierarchy • Crossing the processor pins • External bus and memory controller delay (arbitration) • Multiplexing when several memory components share the bus • The DRAM access itself, and the path back • Main memory latency: 100-200 ns • At 3-4 GHz: 300-600 CPU cycles, or around 1000 instructions !! Problem !!
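Back-of-the-envelope check of the slide’s figures (a sketch: the 3 GHz clock and 3 inst/cycle rate are just one point inside the ranges given above).

/* Sketch: converting memory latency in ns into cycles and lost instruction slots. */
#include <stdio.h>

int main(void) {
    double latency_ns[2] = {100.0, 200.0};
    double freq_ghz = 3.0;    /* slide says 3-4 GHz; take 3 GHz here */
    double ipc      = 3.0;    /* ~3 instructions per cycle           */

    for (int i = 0; i < 2; i++) {
        double cycles = latency_ns[i] * freq_ghz;   /* ns x cycles per ns */
        printf("%.0f ns -> %.0f cycles -> ~%.0f instruction slots\n",
               latency_ns[i], cycles, cycles * ipc);
    }
    return 0;
}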

  26. Memory hierarchy • Memory is: • either huge, but slow • or small, but fast • The smaller, the faster • Memory hierarchy goal: • Provide the illusion that the whole memory is fast • Principle: exploit the temporal and spatial locality properties of most applications

  27. Locality property • On most applications, the following properties apply: • Temporal locality: a data/instruction word that has just been accessed is likely to be re-accessed in the near future • Spatial locality: the data/instruction words located close (in the address space) to a data/instruction word that has just been accessed are likely to be accessed in the near future

  28. A few examples of locality • Temporal locality: • Loop indices, loop invariants, .. • Instructions: loops, .. • 90%/10% rule of thumb: a program spends 90 % of its execution time on 10 % of the static code (often much more on much less) • Spatial locality: • Arrays of data, data structures • Instructions: the next instruction after a non-branch instruction is always executed
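Illustration (not from the slides): both locality properties in one ordinary C loop; the array size and the code itself are made up for the example.

/* Sketch: 'sum' and 'i' are reused every iteration (temporal locality);
 * a[i] walks through consecutive addresses (spatial locality);
 * the loop body itself is re-fetched every iteration (temporal locality
 * on instructions). */
#include <stdio.h>

#define N 1024

int main(void) {
    static int a[N];   /* zero-initialized array              */
    long sum = 0;

    for (int i = 0; i < N; i++)   /* sequential accesses: spatial locality */
        sum += a[i];              /* sum, i kept close: temporal locality  */

    printf("%ld\n", sum);
    return 0;
}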

  29. Cache memory • A cache is a small memory whose content is an image of a subset of main memory • A reference to memory is: • 1) presented to the cache • 2) on a miss, presented to the next level of the memory hierarchy (2nd level cache or main memory)

  30. Cache [Diagram: a Load &A request is presented to the cache; the tag identifies the memory block. If the address of block A sits in the tag array, then the block (cache line) is present in the cache.]

  31. Memory hierarchy behavior may dictate performance • Example: • 4 instructions/cycle • 1 memory access per cycle • 10 cycle penalty for accessing the 2nd level cache • 300 cycles round-trip to memory • 2% miss rate on instructions, 4% miss rate on data, 1 reference out of 4 missing in the L2 • To execute 400 instructions: 1320 cycles !!
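The sketch below shows one simple, non-overlapped way of accumulating these penalties. The exact total depends on the accounting assumptions (how many data references there are, whether misses overlap), so this only lands in the same ballpark as the slide’s 1320-cycle figure; it is not the slide’s own derivation.

/* Sketch: naive, non-overlapped miss-penalty accounting for the example above. */
#include <stdio.h>

int main(void) {
    double insts       = 400.0;
    double base_cycles = insts / 4.0;   /* 4 instructions per cycle */
    double data_refs   = base_cycles;   /* 1 data access per cycle  */

    double l1_misses = 0.02 * insts + 0.04 * data_refs;   /* I-side + D-side      */
    double l2_misses = l1_misses / 4.0;                    /* 1 out of 4 misses L2 */

    double total = base_cycles + l1_misses * 10.0 + l2_misses * 300.0;
    printf("total cycles ~ %.0f (vs. %.0f without any miss)\n", total, base_cycles);
    return 0;
}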

  32. Memory to cache transfers • General formula for accessing a K-word block, a word being the width of the bus: • T = a + (K-1) b • The access time a for the first word is long: • round-trip to memory • The structure of DRAM memory favors burst mode • One can resume execution as soon as the missing word is available: • The missing word is loaded first: the visible penalty is about a
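A tiny numerical illustration of the formula; the timings a and b below are invented for the example, not taken from a real DRAM part.

/* Sketch: burst-transfer formula T = a + (K-1)*b.
 * 'a' is the long first-word access, 'b' the per-word burst rate. */
#include <stdio.h>

int main(void) {
    double a = 100.0;   /* first word: full round-trip to DRAM (cycles) */
    double b = 4.0;     /* each following word of the burst (cycles)    */
    int    K = 8;       /* words in a cache block                        */

    double T = a + (K - 1) * b;
    printf("block transfer: %.0f cycles; visible penalty with the missing word first: ~%.0f\n",
           T, a);
    return 0;
}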

  33. Data placement in the cache • A block has a single possible cache location: • DIRECT MAPPED cache • A block can be placed at any location in the cache: • FULLY ASSOCIATIVE cache • A block can be placed at a few places (N) in the cache: • SET ASSOCIATIVE cache

  34. Direct mapped cache [Diagram: the address is split into tag / index / offset; the index selects one entry, whose stored tag is compared with the address tag to decide hit or miss and drive data out]

  35. Fully associative cache [Diagram: the address is split into tag / offset; the tag is compared against all entries in parallel, and the matching entry drives data out]

  36. Set associative cache [Diagram: the address is split into tag / index / offset; the index selects a set, the tags of all ways in the set are compared in parallel, and the matching way drives data out]
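To make the three diagrams concrete, here is a sketch of how an address is typically split for a set-associative cache. The 32 KB / 4-way / 64-byte-block geometry is an assumption chosen for the example, not a specific processor; the field the diagrams label as the index selects the set.

/* Sketch: tag / set index / offset decomposition for an assumed
 * 32 KB, 4-way set-associative cache with 64-byte blocks. */
#include <stdio.h>

#define BLOCK_SIZE 64u
#define WAYS       4u
#define CACHE_SIZE (32u * 1024u)
#define NB_SETS    (CACHE_SIZE / (BLOCK_SIZE * WAYS))   /* 128 sets */

int main(void) {
    unsigned addr   = 0x12345678u;
    unsigned offset = addr % BLOCK_SIZE;              /* byte within the block       */
    unsigned set    = (addr / BLOCK_SIZE) % NB_SETS;  /* which set is looked up      */
    unsigned tag    = addr / (BLOCK_SIZE * NB_SETS);  /* compared against stored tags */

    printf("addr 0x%08x -> tag 0x%x, set %u, offset %u\n", addr, tag, set, offset);
    return 0;
}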

  37. Direct mapped cache • Single place for a block • Short hit time • Poor hit ratio: • Ping-pong phenomenon

  38. Fully associative cache • Long access time • High power consumption • Limited size

  39. Set associative cache • N places for a block • Fewer conflicts • Good tradeoff

  40. Replacing a block • When a set is full, one has to choose a block to replace • RANDOM policy: pick a block at random and replace it • LRU: Least Recently Used • Random: • Easy to implement, but not very efficient in practice • LRU: • good on average • but pathological artefacts
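A minimal sketch of LRU victim selection within one set, using per-way "last used" timestamps. Real hardware usually implements a cheaper approximation (pseudo-LRU), but the idea is the same: evict the way that has gone unused the longest.

/* Sketch: LRU replacement within a 4-way set, illustrative only. */
#include <stdio.h>
#include <stdint.h>

#define WAYS 4

struct way { uint32_t tag; int valid; uint64_t last_used; };

/* Pick the victim way: an invalid way if any, otherwise the LRU one. */
static int victim(const struct way set[WAYS]) {
    int lru = 0;
    for (int w = 0; w < WAYS; w++) {
        if (!set[w].valid) return w;
        if (set[w].last_used < set[lru].last_used) lru = w;
    }
    return lru;
}

int main(void) {
    struct way set[WAYS] = {
        {0x10, 1, 5}, {0x20, 1, 2}, {0x30, 1, 9}, {0x40, 1, 7}
    };
    printf("evict way %d\n", victim(set));   /* way 1: unused for the longest time */
    return 0;
}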

  41. Writing on a cache • Stores must be propagated to main memory: • immediately: WRITE THROUGH • or later: WRITE BACK (when the block is evicted from the cache) • WRITE THROUGH: • heavy traffic towards memory • WRITE BACK: • Less traffic with memory (or the L+1 cache) • Coherency issues between memory and cache

  42. Writing on a cache (2) • On a miss, the (dirty) evicted data block must be written back to memory: • The missing block is needed immediately, so the write can be delayed • Use of a write buffer: • the dirty block is stored temporarily in a buffer, waiting for a free slot on the memory interface, before being effectively written to memory
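A sketch of how a write buffer absorbs the dirty victim so the miss can proceed. The buffer size and the commented-out memory_write call are hypothetical placeholders, not a real interface.

/* Sketch: write-back eviction through a write buffer, illustrative only. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_BYTES 64
#define WB_ENTRIES  4

struct wb_entry { uint64_t addr; uint8_t data[BLOCK_BYTES]; int full; };
static struct wb_entry write_buffer[WB_ENTRIES];

/* Called when a dirty block is evicted: park it instead of writing it now. */
static int park_dirty_block(uint64_t addr, const uint8_t *data) {
    for (int i = 0; i < WB_ENTRIES; i++) {
        if (!write_buffer[i].full) {
            write_buffer[i].addr = addr;
            memcpy(write_buffer[i].data, data, BLOCK_BYTES);
            write_buffer[i].full = 1;
            return 1;          /* eviction absorbed, the miss can proceed */
        }
    }
    return 0;                  /* buffer full: the miss must wait         */
}

/* Called when the memory interface is idle: actually write one block back. */
static void drain_one(void) {
    for (int i = 0; i < WB_ENTRIES; i++) {
        if (write_buffer[i].full) {
            /* memory_write(write_buffer[i].addr, write_buffer[i].data);  (hypothetical) */
            write_buffer[i].full = 0;
            return;
        }
    }
}

int main(void) {
    uint8_t block[BLOCK_BYTES] = {0};
    printf("parked: %d\n", park_dirty_block(0x1000, block));
    drain_one();
    return 0;
}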

  43. Block size • Long blocks: • Exploit spatial locality • But load useless words when spatial locality is poor • Short blocks: • Misses on contiguous blocks • Experimentally: • 16-64 bytes for small caches (8-32 Kbytes) • 64-128 bytes for large caches (256 Kbytes - 4 Mbytes)

  44. Cache hierarchy • A cache hierarchy has become the standard: • L1: small (<= 64 Kbytes), short access time (1-3 cycles) • Separate instruction and data caches • L2: longer access time (7-15 cycles), 512 Kbytes - 2 Mbytes • Unified • L3: 4-32 Mbytes (20-30 cycles) • Unified, shared on a multiprocessor

  45. The coherency issue • Several copies of the same block may be present in the memory hierarchy: • Main memory • L3 cache and/or memory write buffer • L2 cache and/or memory write buffer • L1 cache and/or memory write buffer • How to recognize the correct copy after a write ? • On a single processor ? With an external peripheral ? • On a multiprocessor ?

  46. Inclusion or not ? • Inclusion property: • any block in the L1 cache is also in the L2 cache • Advantage: • External transactions can be monitored at the L2 cache level: • No need to check the L1 cache content • Drawback: • Some waste of storage

  47. Cache misses do not stop a processor (completely) • On an L1 cache miss: • The request is sent to the L2 cache, but sequencing and execution continue • On an L2 hit, the latency is only a few cycles • On an L2 miss and L3 miss, the latency is hundreds of cycles: • Execution stops after a while • Out-of-order execution allows several L2 cache misses to be initiated (and serviced in a pipelined mode) at the same time: • The latency is partially hidden (later in the course)

  48. Prefetching • To avoid waiting 200-300 cycles .. • one can try to anticipate misses and load the (future) missing blocks into the cache in advance • Many techniques: • Sequential prefetching: prefetch the next sequential blocks • When the application is streaming through an array • Stride prefetching: recognize a stride pattern and prefetch the blocks along that pattern • Hardware and software methods are available: • Many complex issues: latency, pollution, .. • Do not waste bandwidth
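A sketch of the stride idea with a single-entry predictor; real stride prefetchers track many streams (typically per load PC) and add confidence counters, so this only shows the core mechanism.

/* Sketch: one-entry stride prefetcher. It remembers the last miss address,
 * and once two consecutive misses exhibit the same stride it requests the
 * next address along that stride. Illustrative only. */
#include <stdio.h>
#include <stdint.h>

static uint64_t last_addr;
static int64_t  last_stride;
static int      confident;

/* Called on each miss; returns a prefetch address or 0 for none. */
static uint64_t on_miss(uint64_t addr) {
    int64_t stride = (int64_t)(addr - last_addr);
    uint64_t prefetch = 0;

    if (stride != 0 && stride == last_stride) confident = 1;
    if (confident) prefetch = addr + stride;    /* anticipate the next miss */

    last_stride = stride;
    last_addr   = addr;
    return prefetch;
}

int main(void) {
    uint64_t misses[] = {0x1000, 0x1040, 0x1080, 0x10c0};
    for (int i = 0; i < 4; i++)
        printf("miss 0x%llx -> prefetch 0x%llx\n",
               (unsigned long long)misses[i],
               (unsigned long long)on_miss(misses[i]));
    return 0;
}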

  49. Virtual memory • The virtual address VA is computed by the program. VA is then translated through the page table into a physical address PA • VA allows different independent processes to execute • VA allows exploiting an address space larger than the physical memory • Translating VA into PA: • Combines hardware and software techniques

  50. Virtual memory (2) • Typical page sizes: 4 KB - 8 KB - 64 KB • Sometimes support for both large and small pages • The page table is indexed with the virtual page number + the process identity: • Provides a physical page number • The page table is stored in a memory zone where no translation is needed • Translation is needed on every memory access: • Translation time would become predominant • In practice, use of a TLB (translation look-aside buffer) • TLB = special purpose cache for address translation • Maps a subset of the page table • Typically 64/128 entries, fully associative
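A sketch of a TLB sitting in front of the page table. The page size and entry count follow the slide, but the page_table_walk helper, the missing ASID/permission checks, and the naive replacement are hypothetical simplifications.

/* Sketch: fully-associative TLB lookup before a (stand-in) page table walk. */
#include <stdio.h>
#include <stdint.h>

#define PAGE_SHIFT 12            /* 4 KB pages */
#define TLB_SIZE   64

struct tlb_entry { uint64_t vpn; uint64_t pfn; int valid; };
static struct tlb_entry tlb[TLB_SIZE];

/* Stand-in for the slow page table walk (hypothetical mapping). */
static uint64_t page_table_walk(uint64_t vpn) { return vpn ^ 0x5a5a; }

static uint64_t translate(uint64_t va) {
    uint64_t vpn = va >> PAGE_SHIFT;
    uint64_t off = va & ((1u << PAGE_SHIFT) - 1);

    for (int i = 0; i < TLB_SIZE; i++)             /* TLB hit: fast path */
        if (tlb[i].valid && tlb[i].vpn == vpn)
            return (tlb[i].pfn << PAGE_SHIFT) | off;

    uint64_t pfn = page_table_walk(vpn);           /* TLB miss: slow path */
    tlb[0] = (struct tlb_entry){vpn, pfn, 1};      /* naive replacement   */
    return (pfn << PAGE_SHIFT) | off;
}

int main(void) {
    printf("PA = 0x%llx\n", (unsigned long long)translate(0x7fff1234));
    return 0;
}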
