

    1. Chapter 5 Memory Hierarchy Design

    3. Principle of Locality Programs access a relatively small area of the address space at any time (90% of the time in 10% of the code). Temporal locality: if a memory location is accessed, it is likely to be accessed again soon. Spatial locality: if a memory location is accessed, nearby locations are likely to be accessed soon.
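
    A minimal C sketch of both kinds of locality (the function and array are illustrative, not from the slides): the running sum is reused on every iteration (temporal locality), and the array elements are visited at consecutive addresses (spatial locality).

        #include <stddef.h>

        /* Summing an array exhibits both kinds of locality:
         *  - temporal: sum and i are touched on every iteration
         *  - spatial:  a[i] and a[i+1] sit at neighboring addresses,
         *              so they usually share a cache block            */
        double sum_array(const double *a, size_t n)
        {
            double sum = 0.0;
            for (size_t i = 0; i < n; i++)
                sum += a[i];
            return sum;
        }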

    4. Memory Hierarchy

    5. Memory Hierarchy Implementation

    6. Memory and CPU Performance Gap

    7. Cache Basics Cache: a term that applies generally whenever buffering is used to reuse commonly accessed items; here, the first level of the memory hierarchy encountered by the CPU. Cache hit: the CPU finds the requested item in the cache. Cache miss: the CPU does not find the item in the cache.

    8. Cache Miss On a cache miss, the item must be looked up in memory (or in another level of cache). If the item is not in memory, a page fault occurs and the page containing the item must be retrieved from disk.

    9. Cache Performance Memory stall cycles cycles spent waiting for memory access.

    10. Cache Performance Cache Misses

    11. Memory Stall Cycles Memory reads and writes may have different penalties. Usually reads and writes are combined, using an average of their frequencies and penalties.
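
    In symbols, the relationships the last two slides describe take the standard form below (IC is the instruction count; the second line is the usual simplification with a single averaged miss rate and miss penalty):

        \text{Memory stall cycles} = \text{Reads} \times \text{Read miss rate} \times \text{Read miss penalty}
                                   + \text{Writes} \times \text{Write miss rate} \times \text{Write miss penalty}

        \text{Memory stall cycles} = IC \times \frac{\text{Memory accesses}}{\text{Instruction}} \times \text{Miss rate} \times \text{Miss penalty}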

    12. Memory Hierarchy Operation Where can a block be placed in the upper level? (block placement) How is a block found if it is in the upper level? (block identification) Which block should be replaced on a miss? (block replacement) What happens on a write? (write strategy)

    13. Block Placement Where can a block be placed in the upper level? Direct mapped cache: each block has only one place it can appear in the cache. Fully associative cache: a block can be placed anywhere in the cache. Set associative cache: a block can be placed in a restricted set of places in the cache.

    14. Block Placement

    15. Direct Mapped Cache Example: 32-bit Memory Address
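
    The figure for this slide is not in the transcript; as an illustration, here is how a 32-bit address could be split for a direct-mapped cache with 64-byte blocks and 512 blocks (these parameters are assumptions, not taken from the slide):

        #include <stdint.h>

        /* Assumed geometry: 64-byte blocks -> 6 offset bits,
         * 512 blocks -> 9 index bits, remaining 17 bits are the tag. */
        #define OFFSET_BITS 6
        #define INDEX_BITS  9

        static inline uint32_t block_offset(uint32_t addr) {
            return addr & ((1u << OFFSET_BITS) - 1);                  /* byte within block */
        }
        static inline uint32_t cache_index(uint32_t addr) {
            return (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);  /* which block frame */
        }
        static inline uint32_t cache_tag(uint32_t addr) {
            return addr >> (OFFSET_BITS + INDEX_BITS);                /* compared against stored tag */
        }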

    16. Set Associative Cache Set is a group of blocks in the cache Block is first mapped to set, then the block can be placed anywhere within the set. For n blocks per set, cache is said to be n-way set associative.

    17. Set Associative Cache Example: 2-way set associative

    18. Fully Associative If the cache has N blocks, then there are N possible places to put each memory block. The tags of every block must be compared against the memory block address.

    19. Block Identification How is a block found if it is in the cache?

    20. Block Replacement Which block should be replaced on a cache miss? Direct mapped cache: no choice, there is only one spot for the block. Set or fully associative: many blocks to choose from. Random: spreads allocation uniformly. Least recently used (LRU): reduces the chance of evicting blocks that might be needed soon. First-in, first-out (FIFO): replaces the OLDEST block rather than the least recently used, to simplify selection. A replacement sketch follows below.
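
    A minimal sketch of victim selection within one set, assuming each block carries a timestamp of its last access (one simple way to implement LRU; the 4-way associativity is illustrative):

        #include <stdint.h>

        #define WAYS 4                    /* illustrative associativity */

        struct block {
            int      valid;
            uint64_t last_used;           /* time of most recent access */
        };

        /* Return the way to evict: an invalid block if one exists,
         * otherwise the least recently used block in the set.       */
        int choose_victim(const struct block set[WAYS])
        {
            int victim = 0;
            for (int w = 0; w < WAYS; w++) {
                if (!set[w].valid)
                    return w;             /* free slot, nothing to evict */
                if (set[w].last_used < set[victim].last_used)
                    victim = w;           /* older timestamp = less recently used */
            }
            return victim;
        }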

    21. Performance for Replacement Strategies

    22. Write Strategy What happens on a write? First, how often do writes occur? Common mix: 10% stores, 37% loads (data accesses) All instruction fetches (100%) are reads Writes: 10% / (100% + 10% + 37%) = 7% of all memory accesses 10% / (10% + 37%) = 21% of all data accesses

    23. Writes are Inherently Slower Reads can occur concurrently with tag comparison (result is ignored if it is a miss) Entire block is always read For Writes: tag must be checked for hit before modification of block occurs Size of write must be specified, and only that part of block can be changed.

    24. Write Policy Options Write through information is written to both block in cache AND block in Memory Write back information is written only to the block in the cache. Modified cache block is written to Memory when it is replaced.

    25. Write Back Dirty bit: a status bit indicating that the cache block has been modified and differs from memory. Writes are faster, since a write to memory is needed only when a block is replaced. Fewer memory writes means less power consumed, which is important for embedded systems.

    26. Write Through Easier to implement No writes to Memory necessary on read misses Cache is always consistent with Memory Important for multiprocessors Important for I/O

    27. Optimizations for Writes Write buffer holds data to be written so that CPU execution and Memory write can occur concurrently Reduces stall time for write through Writes can still result in some stall time if buffer is full

    28. Write Miss Strategies Options for write misses: Write allocate block allocated on a write miss often used with write-back No-write allocate block not allocated on a write miss often used with write-through

    29. Example: Alpha 21264 Data Cache 64KB data cache 64-byte blocks 2-way set associative 9-bit index selects among 512 sets 2 blocks per set x 512 sets x 64 bytes/block = 64KB Write-back Write allocate on write miss

    30. Example: Alpha 21264 Data Cache

    31. Cache Miss The cache signals the CPU that the data is not available. A 64-byte block is read from the next-level cache. The cache is 2-way set associative, so there are 2 choices of places to put the new block; round-robin (FIFO) selection is used. Block replacement updates the block of data, the address tag, the valid bit, and the round-robin bit. Write back is used, so the replaced block must be written to the next level: it goes into the victim buffer (same idea as a write buffer), and unless the victim buffer is full, the write to the next level proceeds while the CPU continues execution. Write allocate is used, so a block is also allocated on a write miss. A read miss is handled very similarly.

    32. Alpha Instruction Cache Separate data and instruction cache Avoids structural hazards when both instruction fetch and load/store need cache Allows for separate cache optimizations 64KB Instruction Cache Instruction caches have lower miss rates than data caches

    33. Cache Performance CPU Execution Time = (CPU cycles + Memory stall cycles) x cycle time

    34. Overview Review memory hierarchy concepts Review cache definitions Cache performance Methods for improving cache performance Reducing miss penalty Reducing miss rate Reducing hit time Main memory performance Memory technology Virtual memory Memory protection

    35. Cache Performance Effects of cache performance on overall execution time: CPU time with cache = IC x (1 + (1.5 x 2% x 100)) x clock cycle; CPU time without cache = IC x (1 + (1.5 x 100)) x clock cycle.
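
    Evaluating the slide's expressions (base CPI of 1, 1.5 memory accesses per instruction, a 2% miss rate, and a 100-cycle miss penalty):

        \text{CPU time}_{\text{with cache}} = IC \times (1 + 1.5 \times 0.02 \times 100) \times \text{clock cycle} = 4 \, IC \times \text{clock cycle}

        \text{CPU time}_{\text{without cache}} = IC \times (1 + 1.5 \times 100) \times \text{clock cycle} = 151 \, IC \times \text{clock cycle}

    so, for these assumptions, the cache makes the machine roughly 38 times faster.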

    36. Miss Penalties in Out-of-Order Execution Processors For out-of-order execution, the miss penalty can overlap with execution, so memory stalls are redefined as the miss penalty of the non-overlapped latency only.

    37. Summary of Cache Performance Figure 5.9 provides all previously presented equations good reference CPU execution time Memory stall cycles Average memory access time Relationship between index size, cache size, block size, set associativity

    38. Improving Cache Performance Four major categories of approaches Reducing the miss penalty Reducing the miss rate Reducing the miss rate or penalty via parallelism Reducing the time to hit in the cache

    39. Reducing the Miss Penalty Multilevel caches

    40. Multi-level Caches Local miss rate: the miss rate of an individual cache, considering only the requests it sees (L1 cache: Miss rate(L1); L2 cache: Miss rate(L2)). Global miss rate: the overall miss rate, considering all requests (L1 cache: Miss rate(L1); L2 cache: Miss rate(L1) x Miss rate(L2)). Average memory stalls per instruction = Misses per instruction(L1) x Hit time(L2) + Misses per instruction(L2) x Miss penalty(L2).
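
    Written out with explicit subscripts (the last line, the two-level average memory access time, is the standard companion formula and is not stated on the slide):

        \text{Global miss rate}_{L2} = \text{Miss rate}_{L1} \times \text{Miss rate}_{L2}

        \text{Avg. memory stalls per instruction} = \text{Misses per instruction}_{L1} \times \text{Hit time}_{L2}
                                                  + \text{Misses per instruction}_{L2} \times \text{Miss penalty}_{L2}

        \text{AMAT} = \text{Hit time}_{L1} + \text{Miss rate}_{L1} \times (\text{Hit time}_{L2} + \text{Miss rate}_{L2} \times \text{Miss penalty}_{L2})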

    41. Multi-level Caches Note that the local miss rate is very different from the global miss rate: the local miss rate is very high for L2 caches smaller than the L1 cache, while the global miss rate is very close to the single-cache miss rate.

    42. Design Issues for Second Level Caches Set associative? Usually this improves performance Block size? Usually kept the same as L1 If L2 can only be slightly bigger than L1: Multi-level exclusion option: L1 data is never found in L2. Cache miss in L1 results in swap of blocks between L1 and L2

    43. Improving Cache Performance Four major categories of approaches Reducing the miss penalty Reducing the miss rate Reducing the miss rate or penalty via parallelism Reducing the time to hit in the cache

    44. Critical Word First Based on the observation that the CPU normally needs just one word of the cache block at a time. Critical word first: on a cache miss, fetch the missed word from memory first and send it to the CPU as soon as it arrives; fill in the rest of the block while the CPU continues executing. Also called wrapped fetch or requested word first.

    45. Early Restart Similar to previous approach, but fetch words in order. As soon as requested word arrives, send to CPU. Both techniques are most beneficial when block size is large.

    46. Giving Priority to Read Misses over Writes Serve reads before pending writes have completed. Requires a write buffer. Must avoid errors from RAW hazards through memory.

    47. Giving Priority to Read Misses over Writes Either wait until the write buffer is empty (slow), or check the write buffer contents on a miss (the most common solution). A similar technique applies to write-back caches when a read miss will replace a dirty block: copy the dirty block to a buffer (to be written later), then read memory, then write memory.

    48. Improving Cache Performance Four major categories of approaches Reducing the miss penalty Reducing the miss rate Reducing the miss rate or penalty via parallelism Reducing the time to hit in the cache

    49. Merging Write Buffer

    50. Victim Caches

    51. Summary of Miss Penalty Techniques Add: More cache levels Impatience: Critical word first, early restart Preference: Reads before writes Efficiency: Merging words in write buffer Recycling: Victim cache

    52. Improving Cache Performance Four major categories of approaches Reducing the miss penalty Reducing the miss rate Reducing the miss rate or penalty via parallelism Reducing the time to hit in the cache

    53. Miss Categories Compulsory: the first access to a block (cold-start misses or first-reference misses). Capacity: the cache is full of other blocks (these misses occur even in fully associative caches). Conflict: in set associative or direct mapped caches, a miss because too many blocks map to the same set or entry.

    54. Miss Rates for Categories

    55. Larger Block Size

    56. Larger Caches An obvious way to reduce capacity misses Tradeoffs: Longer hit times Higher cost Most used in off-chip caches (L2, L3)

    57. Higher Associativity Some observations: 8-way has similar performance to fully associative 2:1 rule of thumb: Direct-mapped cache of size N has about the same miss rate as 2-way set associative cache of N/2. Increased associativity increases hit time

    58. Way Prediction Way prediction: extra bits are kept in the cache to predict the way Multiplexor is set early to select the desired block Only a single tag comparison is performed Miss results in checking other blocks Reduces conflict misses over direct mapping Maintains hit speed of direct mapping

    59. Way Prediction in Alpha 21264 2-way set associative cache Block predictor bit added to each block of instruction cache Selects which of the 2 blocks to try on next cache access Correct instruction latency 1 cycle Incorrect try other block, change way predictor, instruction latency of 3 cycles

    60. Compiler Optimizations The techniques to reduce misses so far (larger blocks, larger caches, higher associativity, way prediction) are changes to hardware. Some compiler techniques: use profiling information to reorder instructions to avoid likely conflicts; place program entry points at the beginning of blocks; transform data storage so that the program operates on data within blocks.

    61. Compiler Optimizations Example: Exchange nesting of loops to make code access data in the order the data is stored.
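
    A sketch of the loop-interchange idea for a row-major C array (the array size and the doubling operation are illustrative):

        #define N 1024

        /* Before: the inner loop strides through memory N elements at a
         * time, touching a different cache block on nearly every access. */
        void scale_column_order(double x[N][N])
        {
            for (int j = 0; j < N; j++)
                for (int i = 0; i < N; i++)
                    x[i][j] = 2.0 * x[i][j];
        }

        /* After interchanging the loops: the inner loop walks consecutive
         * elements of a row, so each cache block is fully used before the
         * code moves on to the next one.                                  */
        void scale_row_order(double x[N][N])
        {
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++)
                    x[i][j] = 2.0 * x[i][j];
        }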

    62. Summary of Reducing Miss Rate Hardware approaches larger blocks, larger caches, higher associativity, way prediction Compiler approaches Yet another connection between architecture implementation and the compiler

    63. Improving Cache Performance Four major categories of approaches Reducing the miss penalty Reducing the miss rate Reducing the miss rate or penalty via parallelism Reducing the time to hit in the cache

    64. Nonblocking Caches (lockup-free caches) Allow the data cache to continue to supply cache hits during a miss. Hit under miss: reduces the effective miss penalty by responding to CPU hits during the time spent waiting for a miss. Hit under multiple misses: overlaps hits with multiple outstanding misses. Both require significant increases in cache controller complexity.

    65. Hardware Prefetching Can be done for data or instructions, directly into the caches or into an external buffer. Instruction prefetch is done with a separate prefetch controller outside the cache: on a miss, fetch 2 blocks, the missed block and the one after it.

    66. Hardware Prefetching

    67. Hardware Prefetching

    68. External Buffer Checked on cache miss before going to main memory If a hit, moved to cache Another prefetch replaces block in external buffer

    69. Compiler Controlled Prefetching Compiler inserts prefetch instructions Prefetch data before it is needed Example
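
    The slide's example is not in the transcript; as a stand-in, here is a hedged sketch using GCC/Clang's __builtin_prefetch, which issues a nonfaulting cache-prefetch hint (the loop and the prefetch distance of 16 elements are assumptions for illustration):

        /* Prefetch a[i+16] while working on a[i]; by the time the loop
         * reaches that element it should already be in the cache.
         * __builtin_prefetch is a GCC/Clang extension and only a hint. */
        void scale(double *a, int n)
        {
            for (int i = 0; i < n; i++) {
                if (i + 16 < n)
                    __builtin_prefetch(&a[i + 16]);
                a[i] = 2.0 * a[i];
            }
        }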

    70. Compiler Controlled Prefetching Two approaches: register prefetch (load data into CPU registers) and cache prefetch (load data into the cache only). Prefetches can be faulting or nonfaulting: a nonfaulting prefetch turns into a no-op if the memory address is protected or would cause a page fault (also called a non-binding prefetch). Most modern CPUs have nonfaulting cache prefetching.

    71. Improving Cache Performance Four major categories of approaches Reducing the miss penalty Reducing the miss rate Reducing the miss rate or penalty via parallelism Reducing the time to hit in the cache

    72. Small and Simple Caches Comparing the tag portion of the address against the cache tag is the most time-consuming part of a hit. Direct-mapped caches are faster: tag checking can overlap with data transmission. The pressure of matching cache access time to the CPU clock cycle has caused L1 cache sizes to level off in current computers.

    73. Associativity Impact

    74. Pipelined Cache Access Allows the effective latency of a hit to be multiple clock cycles. Advantage: a short clock cycle. Disadvantages: slow hits, an increased number of pipeline stages, greater branch penalties, and more cycles between issue of a load and availability of its data.

    75. Trace Caches Instruction cache Cache blocks consist of traces of instructions Traces determined dynamically by CPU Traces include taken branches Used with branch prediction Complicated address mapping mechanisms Can store the same instruction multiple times

    76. Cache Optimization Summary Figure 5.26 Most optimizations only help one factor (miss rate, miss penalty, hit time) Optimizations usually come with a hardware complexity cost

    77. Overview Review memory hierarchy concepts Review cache definitions Cache performance Methods for improving cache performance Reducing miss penalty Reducing miss rate Reducing hit time Main memory performance Memory technology Virtual memory Memory protection

    78. Main Memory Optimizations Main memory latency = cache miss penalty (in time or cycles) Main memory bandwidth (bytes/time or cycles) Most important for I/O Important for multiprocessors Also important for L2 caches with large block sizes Bandwidth is easier to optimize than latency

    79. Example: Base Memory 4 clock cycles to send the address, 56 clock cycles access time per word, 4 clock cycles to send a word of data: 64 clock cycles total per word. Cache structure: 4-word blocks, 8 bytes per word. Miss penalty = 4 x (4 + 56 + 4) = 256 clock cycles. Bandwidth = 32/256 (1/8) bytes per clock cycle.
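
    Writing the slide's arithmetic out in full (4 words per block, 8 bytes per word):

        \text{Miss penalty} = 4 \times (4 + 56 + 4) = 256 \text{ clock cycles}

        \text{Bandwidth} = \frac{4 \times 8 \text{ bytes}}{256 \text{ cycles}} = \frac{1}{8} \text{ byte per clock cycle}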

    80. Memory Optimization Options

    81. Wider Memory Doubling the width doubles the throughput. Cache miss penalty: 128 clock cycles (2 x 64). Throughput: 32/128 = 0.25 bytes per cycle. The multiplexor adds delay and complicates error correction within memory.

    82. Simple Interleaved Memory Memory organized into separate banks Address sent to all, reads occur in parallel Logically a wide memory, but accesses staged over time to share memory bus
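
    A minimal sketch of the usual low-order (word) interleaving, in which consecutive words map to consecutive banks so that a sequential block fill can overlap accesses across banks; the 8-byte word and 4 banks are illustrative assumptions.

        #include <stdint.h>

        #define WORD_BYTES 8u             /* illustrative word size   */
        #define NUM_BANKS  4u             /* illustrative bank count  */

        /* Low-order interleaving: word k lives in bank k mod NUM_BANKS. */
        static inline uint64_t bank_of(uint64_t addr) {
            return (addr / WORD_BYTES) % NUM_BANKS;
        }
        static inline uint64_t offset_in_bank(uint64_t addr) {
            return (addr / WORD_BYTES) / NUM_BANKS;
        }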

    83. Effect on Throughput Throughput for our example: 4 clock cycles to send address (in parallel) 56 clock cycles for access time (in parallel) 4x4 clock cycles to send a block of data Cache miss penalty =75 clock cycles Throughput = 32/75 = 0.4 bytes/cycle

    84. Writes with Interleaved Memory Back to back writes can be pipelined If the accesses are to separate banks Avoids stalling while first wait to complete

    85. Address Mapping for Interleaved Memory

    86. Independent Memory Banks Allow multiple independent accesses Multiple memory controllers Banks operate somewhat independently More expensive, but more flexible Useful for non-blocking caches Useful for multi-processors

    87. Overview Review memory hierarchy concepts Review cache definitions Cache performance Methods for improving cache performance Reducing miss penalty Reducing miss rate Reducing hit time Main memory performance Memory technology Virtual memory Examples

    88. Memory Technology Memory latency access time time between read request and data arrival cycle time time between sequential requests not equivalent due to time required for setup of address lines Memory types DRAM SRAM

    89. DRAM Dynamic RAM must be refreshed to hold contents contents are not static Information stored on capacitors Storage cells very small Large capacity Low cost per bit Slower than SRAM Usually packaged as a set of chips

    90. DRAM Internal Organization

    91. SRAM Static RAM contents are statically held by feedback circuit 6 transistors per storage cell Good for low power devices (no refresh) Faster, but smaller, than DRAM Address lines not multiplexed No difference between access time and cycle time Used to implement caches

    92. DRAM vs SRAM For similar technologies: Speed SRAM is 8-16 times faster Capacity DRAM is 4-8 times larger Cost SRAM is 8-16 times more expensive per bit

    93. Embedded System Memory Issues Non-volatile storage (Desktops use disks) ROM read-only memory Various types some can be user programmed PROM, EPROM, EEPROM FLASH memory similar to EEPROM Very slow to write (10-100 times slower than DRAM)

    94. DRAM Performance Improvements Internal organizations Repeated accesses to a single row (fast page mode) Synchronous memory controller (SDRAM) Transfer data on both edges of DRAM clock signal Double Data Rate (DDR DRAM) All require some internal hardware overhead

    95. DRAM Performance Improvements Interface bus designs RAMBUS (DRDRAM) Single chip that includes interleaved DDR SDRAM and high-speed interface Packaged in RIMM same package size as DIMM but incompatible Improved bandwidth, but not latency Expensive (2x DRAM DIMMs of same size)

    96. Virtual Memory Analogous to CACHE/Main Memory Interface cache block similar to VM page (or segment) Management of main Memory/Secondary storage interface

    97. Virtual Memory Manages sharing of protected memory space Relocation mechanism for loading programs for execution from any location CPU produces virtual address Memory mapping (address translation) Translation of virtual address to physical address

    98. Cache and Virtual Memory Comparison Virtual memory is larger, with larger page sizes and much longer access times, but also much smaller miss rates.

    99. Memory Hierarchy Questions Where can a page be placed in memory? How is a page found if it is in main memory? Which page should be replaced in a virtual memory miss? What happens on a write?

    100. Placement in Memory Very high miss penalty when reading from disk Must keep miss rate very low Pages can be placed anywhere in main memory (fully associative)

    101. Finding Pages in Main Memory Indexed by page or segment number Page table maps virtual page to physical address

    102. Address Translation Page table must be large enough to hold number of pages in virtual address space Translation Lookaside Buffer (TLB or TB) cache often used for page table TLB contains portions of the virtual address and the physical page frame number Includes dirty bits, use bits, protection field, and valid bits

    103. Alpha 21264 TLB

    104. Replacement Mechanisms To minimize page faults use LRU method Use bit (or reference bit) used to keep track Set whenever a page is accessed Recorded and cleared periodically by operating system

    105. What Happens on Write? Write back is always used disk access time is too long to consider write-through

    106. Big Picture Memory Hierarchy

    107. Cache and Superscalar CPUs Multi-issue CPUs require high bandwidth from the cache: multiple cache ports for multiple instruction fetches per clock cycle, and a nonblocking cache (to allow hits under misses).

    108. Speculative Execution and Memory Memory system must identify speculatively executed instructions Exceptions for speculative instructions suppressed Nonblocking cache so that stalls do not occur on miss for speculative instruction

    109. Embedded System Memory Real-time constraints require very little performance variability. Caches increase variability (but improve average performance). Instruction caches are predictable and widely used. Caches save power (on-chip access vs off-chip access). Way prediction can lower power further by using only one of the comparators at a time.

    110. I/O and Memory Problem of keeping cache consistent with Memory when I/O can write to Memory Connect I/O to cache (makes cache slow) Use write-through cache (not common for L2) Make I/O parts of memory non-cachable OS flushes cache blocks before accepting I/O Hardware controller to handle cache consistency with I/O

    111. Example Memory Organizations Alpha 21264 out of order execution fetches up to 4 instructions per clock cycle Emotion Engine of Sony Playstation embedded processor high demands for audio and graphics Sun Fire 6800 Server commercial computing database applications special features for availability and maintainability

    113. Alpha 21264 41-bit physical address (43-bit virtual address), giving a physical address space of 2^41 bytes. The instruction cache uses way prediction. L1 miss penalty: 16 cycles. L2 miss penalty: 130 cycles.

    114. Emotion Engine Sony Playstation

    115. Sony Playstation Lots of I/O to interface with memory 10-channel DMA (Direct Memory Access) I/O interface I/O processor with several interfaces Memory embedded with all processors Dedicated buses 1-cycle latency for all embedded memories

    116. Interesting Features LARGE chip sizes: Emotion Engine: 225 mm2 Graphics Synthesizer: 279 mm2 (Alpha 21264 is about 160 mm2) 9 distinct independent memory modules Programmer (compiler) must keep memories consistent

    117. Sun Fire 6800 Server Midrange multiprocessor server for databases with large code sizes: many capacity and conflict misses, and lots of context switching between processes. Multiprocessors have a cache coherency problem.

    118. Sun Fire 6800 Server

    119. Interesting Features HUGE number of I/O pins (1368) Large number of wide memory paths per processor Peak bandwidth 11 GB/sec To improve latency, tags for L2 cache are on-chip. To improve reliability error correction bits 8-bit back door diagnostic bus redundant path between processors dual redundant system controllers

    120. Review Chapters 1-4 (and Appendix A)

    121. Computer Markets Three basic markets that computer architects design for. Desktop: price range of under $1000 to $10,000; optimize price-performance. Servers: optimize availability and throughput (i.e. transactions per minute); scalability is important. Embedded computers: computers that are just one component of a larger system (cell phones, printers, network switches, etc); widest range of processing power and cost; often have real-time performance requirements; power and memory often must be minimized.

    122. Design Functional Requirements At the top of the design process is determining the functional requirements for an architecture: application area (desktop, servers, embedded computers); level of software compatibility (programming-language level or binary compatible); operating system requirements (address space size, memory management); standards (floating point, I/O, networks, etc).

    123. Design - Technology Trends At the implementation end of the process, it is important to understand technology trends: integrated circuit technology, semiconductor DRAM, magnetic disk technology, and network technology.

    124. IC Manufacturing Process

    125. Computer Performance How do we measure it? Application run time, throughput (number of jobs per second), or response time. The importance of each depends on the application: application run time for the normal PC user, throughput for server applications, response time for real-time applications and transaction processing.

    126. Measuring Performance Execution time: the actual time between the beginning and end of a program, including I/O, memory access, everything. Performance is the reciprocal of execution time. We will focus on execution time.

    127. Benchmarks Some typical benchmarks: Whetstone Dhrystone Benchmark suites collections of benchmarks with different characteristics SPEC Standard Performance Evaluation Corporation (www.spec.org) Many types (desktop, server, transaction processing, embedded computer)

    128. Design Guidelines Make the common case fast when making design tradeoffs. Amdahl's Law (given below):
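
    The formula the slide refers to, in its standard form, where f is the fraction of execution time affected by an enhancement and s is the speedup of that fraction:

        \text{Speedup}_{\text{overall}} = \frac{1}{(1 - f) + \dfrac{f}{s}}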

    129. CPU Performance Equations For a particular program: CPU time = CPU clock cycles x clock cycle time, where clock cycle time = 1 / clock rate. Considering the instruction count: cycles per instruction (CPI) = CPU clock cycles / instruction count.

    130. CPU Performance Equation CPU time = Instruction count x clock cycle time x CPI, or equivalently CPU time = (Instruction count x CPI) / clock rate. A 2x improvement in any of the three terms is a 2x improvement in CPU time.

    131. MIPS as Performance Measure MIPS: Millions of Instructions Per Second. Used in conjunction with benchmarks (Dhrystone MIPS). Can be computed as shown below:
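
    The standard formula (the slide's own expression is not in the transcript):

        \text{MIPS} = \frac{\text{Instruction count}}{\text{Execution time} \times 10^{6}} = \frac{\text{Clock rate}}{\text{CPI} \times 10^{6}}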

    132. Locality Program property that is often exploited by architects to achieve better performance. Programs tend to reuse data and instructions. (For many programs, 90% of execution time is spent running 10% of the code) Temporal locality recently accessed items are likely to be accessed in the near future. Spatial locality items with addresses near one another tend to be referenced close together in time.

    133. Parallelism One of the most important methods for improving performance. System level multiple CPUs, multiple disks CPU level pipelining, multiple functional modules Digital design level carry-lookahead adders

    134. Common Mistakes Using only clock rate to compare performance Even if the processor has the same instruction set, and same memory configuration, performance may not scale with clock speed.

    135. Common Mistakes Comparing hand-coded assembly and compiler-generated, high-level language performance. Huge performance gains can be obtained by hand-coding critical loops of benchmark programs. Important to understand how benchmark code is generated when comparing performance.

    136. Chapter 1 Review History of processor development. Different processor markets. Issues in CPU design. Economics of CPU design and implementation. Computer performance measures, formulas, and results. Design guidelines: Amdahl's Law, locality, parallelism.

    137. Chapter 2 Overview Taxonomy of ISAs (Instruction Set Architectures) The Role of Compilers Example: The MIPS architecture Example: The Trimedia TM32 CPU

    138. ISA Classification Based on operand location Memory addressing Addressing modes Type and size of operands Operations in the instruction set Instruction flow control Instruction set encoding

    139. Four Basic Types

    140. Trends Almost all modern processors use a load/store architecture Registers are faster to access than memory Registers are more convenient for compilers to use than other forms (like stacks) General purpose registers are more convenient for compilers than special purpose registers (like accumulators)

    141. Conclusions about Memory Addressing Most important non-register addressing modes: displacement, immediate, register indirect Size of displacement field should be at least 12-16 bits. Size of immediate field should be at least 8-16 bits.

    142. Flow Control Terminology Transfer instructions (old term) Branch will be used for conditional flow control Jump will be used for unconditional program flow control Procedure calls Procedure returns

    143. Addressing Modes for Flow Control Flow control instructions must specify destination address PC-relative - Specify displacement from program counter (PC) Requires fewer bits than other modes Practical since branch target is often nearby Allows code to be independent of its location in memory (position-independence) Other modes must be used for returns and indirect jumps

    144. The Anatomy of Compilers

    145. Compilers and ISAs First goal is correctness Second goal is speed of resulting code Other goals: Speed of compilation Debugging support Interoperability with other languages First goal (correctness) is complex, and limits the complexity of optimizations.

    146. Types of Optimizations High-level Optimization on source code, fed to lower level optimizations Local optimizations optimize code only within a straight-line code fragment Global optimizations extend local optimizations across branches and apply transformations to optimize loops Register allocation associate registers with operands Processor-dependent optimizations take advantage of specific architecture features

    147. Example: MIPS ISA 64-bit load-store architecture Full instruction set explained in Appendix C (on text web page)

    148. MIPS Instructions Loads and Stores ALU operations Branches and Jumps Floating point operations

    149. Floating Point Operations Single- and double-precision operations are indicated by .S and .D. MOV.S and MOV.D copy registers of the same type. Special instructions move data between FPRs and GPRs. Conversion instructions convert integers to floating point and vice versa.

    150. Media Processor: Trimedia TM32 Dedicated to multimedia processing data communication audio coding video coding video processing graphics Operate on narrower data than PCs Operate on data streams Typically found in set-top boxes

    151. Unique Features of TM32 Lots of registers: 128 (32-bit) Registers can be either integer or floating-point SIMD instructions available Both 2s complement and saturating arithmetic available Programmer can specify five independent instructions to be issued at the same time! nops placed in slots if 5 are not available VLIW (very long instruction word) coding technique

    152. Instruction Set Design: Pitfalls Designing a high-level instruction set feature specifically oriented to supporting a high-level language structure Often makes instructions too specialized to be useful Innovating at the ISA to reduce code size without accounting for the compiler Compilers can make much more impact Use optimized code when considering changes

    153. Instruction Set Design: Fallacies There is such a thing as a typical program programs vary widely in how they use instruction sets An architecture with flaws cannot be successful 80x86 case in point A flawless architecture can be designed All designs contain tradeoffs Technologies change, making previous good decisions bad

    154. ISA Conclusions: Trends in the 1990s Address size grew from 32-bit to 64-bit. Addition of conditional execution instructions. Optimization of cache performance via prefetch. Support for multimedia. Faster floating-point operations.

    155. ISA Conclusions: Trends in the 2000s Long instruction words Increased conditional execution Blending of DSP and general purpose architectures 80x86 emulation

    156. Appendix A Overview Introduction Pipeline concepts Basics of RISC instruction set Classic 5-stage pipeline Pipeline Hazards Stalls, structural hazards, data hazards Branch hazards Pipeline Implementation Simple MIPS pipeline Implementation Difficulties for Pipelines Exceptions, instruction set complications Extending MIPS Pipeline to Multicycle operations Example: MIPS R4000 Pipeline

    157. CPU Pipelining

    158. Performance and Pipelining Assume N stages in the pipeline, an unpipelined execution time of T for 1 instruction, and pipeline stages that are equal and perfectly balanced. Then the pipelined version completes an instruction every T / N, and the throughput increase is N.

    159. Pipeline Execution

    160. Instruction Timing Throughput is increased by approximately 5. The execution time of an individual instruction INCREASES due to pipelining overhead: pipeline register delay and clock skew (T = T_CL + T_su + T_reg + T_skew). It is important to balance the pipeline stages, since the clock is matched to the slowest stage (T_CL).

    161. Pipeline Hazards Structural Hazards resource conflicts when more than one instruction needs a resource Data Hazards an instruction depends on a result from a previous instruction that is not yet available Control Hazards conflicts from branches and jumps that change the PC

    162. Forwarding Solution for hazards Also called bypassing or short-circuiting Create potential datapath from where result is calculated to where it is needed by another instruction Detect hazard to route the result Example:

    163. Branch Hazards

    164. Pipeline Implementation Details of pipeline implementation So that other issues can be explored Look at non-pipelined implementation first Focus on integer subset of MIPS Load-store word Branch equal zero Integer ALU operations Basic principles can be extended to all instructions

    165. Multicycle Datapath

    166. Pipeline Control

    167. Branches in Pipeline Consider only BEQZ and BNEZ (branch if equal to zero or not equal to zero) For these it is possible to move the test to the ID stage To take advantage of early decision, target address must also be computed early Must add another adder for computing target address in ID Result is 1-cycle stall on branches. Branches on result of register from previous ALU operation will result in a data hazard stall.

    168. Exceptions and Pipelines Exceptions can come from several sources and can be classified several ways Sources I/O Device Interrupt Invoking OS from user program Tracing program execution Breakpoint Integer arithmetic overflow or underflow, FP trap Page fault Misaligned memory accesses Memory protection violation Undefined instruction Hardware malfunction Power failure

    169. FP Pipeline

    170. Review of Hazards Caused by different lengths of execution unit pipelines. Structural hazards multiple instructions need the same function unit at the same time RAW data hazards Instruction needs to read a value that has not been written yet WAW data hazards Writes occur out of order

    171. MIPS R4000 Pipeline Implements MIPS-64 Deeper pipeline (8 stages) Superpipeline Higher clock rate (smaller logic in each stage) Additional stages from decomposing memory accesses

    172. Appendix A Summary For an ideal N-stage pipeline, the throughput increase is N over a non-pipelined architecture; an ideal pipelined CPU has CPI = 1. Pipelining offers significant speedup with moderate hardware cost and is invisible to the programmer. Pipeline challenges include structural hazards, data hazards, control hazards, exceptions, and floating-point operations.

    173. Appendix A Summary (continued) Solutions include stalls, forwarding, buffering state (for exceptions), branch delay slots, branch prediction, and several multi-cycle execution units for FP.

    174. Chapter 3 Overview Instruction Level Parallelism Data Dependence and Hazards Dynamic Scheduling Dynamic Hardware Prediction High-Performance Instruction Delivery Multiple Issue Hardware-Based Speculation

    175. Instruction Level Parallelism Definition: Potential to overlap the execution of instructions Pipelining is one example Limitations of ILP are from data and control hazards Two approaches to overcoming limitations dynamic approaches with hardware (Chapter 3) static approaches that use software (Chapter 4)

    176. CPI for Pipelines CPI = Ideal pipeline CPI + structural stalls + data hazard stalls + control stalls Pipeline performance is sometimes measured in IPC (Instructions Per Clock cycle) = 1/CPI Must fully understand dependencies and hazards to see how much parallelism exists and understand how it can be exploited.

    177. Data Dependencies Three kinds: data dependencies (true data dependencies), name dependencies, and control dependencies. An instruction j is data dependent on instruction i if instruction i produces a result that may be used by instruction j, or if j is data dependent on an instruction k that is in turn data dependent on i.

    178. Dynamic Scheduling Hardware rearranges instruction execution to reduce stalls due to dependencies Can handle some cases where dependencies are not known at compile time Simplifies the compiler Allows code compiled for one pipeline to be run on another (invisible to the compiler) Results in significant hardware complexity

    179. Tomasulo's Algorithm/Approach The approach requires tracking instruction dependencies to avoid RAW hazards, and register renaming to avoid WAR and WAW hazards.

    180. MIPS FP Unit using Tomasulo's Approach

    181. Dynamic Hardware Prediction Using hardware to dynamically predict the outcome of a branch Prediction depends on the behavior of the branch at run time Effectiveness depends on two things: accuracy in predicting branch cost for correct and incorrect predictions

    182. Multiple-Issue Processors Superscalar processors issue varying numbers of instructions per clock and are statically scheduled or dynamically scheduled. VLIW (Very Long Instruction Word) processors issue a fixed number of instructions, formatted as one large instruction or a fixed instruction packet; also called EPIC (Explicitly Parallel Instruction Computers). VLIW/EPIC processors are inherently statically scheduled.

    183. Example Multiple-Issue Processors

    184. Statically Scheduled Superscalar Processors Instructions are issued in order, with all hazards checked dynamically at issue time, and a variable number of instructions issued per clock cycle. They require the compiler techniques in Chapter 4 to be efficient.

    185. Dynamically Scheduled Superscalar Processors Dynamic scheduling does not restrict the types of instructions that can be issued on a single clock cycle. Think of it as Tomasulo's approach extended to support multiple issue: N instructions can be issued whenever reservation stations are available. Branch prediction is used for fetch and issue (but not execute).

    186. Speculative Superscalar Pipelines

    187. Multiple Issue with Speculation For an architecture based on Tomasulo's approach, the obstacles are the instruction issue block and the single CDB; remedies include adding a dispatch buffer and using a multiported reorder buffer.

    188. Chapter 4 Overview Basic compiler techniques: pipeline scheduling, loop unrolling. Static branch prediction. Static multiple issue: VLIW. Advanced compiler support for exposing ILP: detecting loop-level parallelism, software pipelining (symbolic loop unrolling), global code scheduling. Hardware support for exposing more parallelism: conditional or predicated instructions, compiler speculation with hardware support. Hardware vs software speculation mechanisms. Intel IA-64 ISA.

    189. Loop Unrolling Eliminate some of the loop overhead by unrolling the loop fully or partially (see the sketch below). The loop termination code must be adjusted. Unrolling puts more parallel instructions in a row and allows more flexibility in reordering, but it usually requires register renaming.
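
    A sketch of unrolling by a factor of four; the loop body is illustrative, and the version shown assumes the trip count is a multiple of four (otherwise cleanup code is needed, which is the termination-code adjustment the slide mentions).

        /* Original loop. */
        void saxpy(float *y, const float *x, float a, int n)
        {
            for (int i = 0; i < n; i++)
                y[i] = a * x[i] + y[i];
        }

        /* Unrolled by 4: one quarter of the increment/compare/branch
         * overhead, and four independent statements in a row that a
         * scheduler can reorder. Assumes n is a multiple of 4.       */
        void saxpy_unrolled(float *y, const float *x, float a, int n)
        {
            for (int i = 0; i < n; i += 4) {
                y[i]     = a * x[i]     + y[i];
                y[i + 1] = a * x[i + 1] + y[i + 1];
                y[i + 2] = a * x[i + 2] + y[i + 2];
                y[i + 3] = a * x[i + 3] + y[i + 3];
            }
        }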

    190. Limits to Loop Unrolling Eventually the gain from removing loop overhead diminishes, since the remaining overhead is already amortized. Code size limitations matter for embedded applications, and larger code can increase cache misses. Compiler limitations: a shortfall in registers when the number of live values grows past the number of registers.

    191. Detecting Parallelism Loop-level parallelism Analyzed at the source level requires recognition of array references, loops, indices. Loop-carried dependence a dependence of one loop iteration on a previous iteration. for (k=1; k<=100; k=k+1) { A[k+1] = A[k] + B[k]; }

    192. Finding Dependencies Important for Efficient scheduling Determining which loops to unroll Eliminating name dependencies Makes finding dependencies difficult: Arrays and pointers in C or C++ Pass by reference parameter passing in FORTRAN

    193. Dependencies in Arrays An array index is affine if it can be written as a*i + b, for loop index i (for a one-dimensional array); an index into a multi-dimensional array is affine if the index in each dimension is affine. A common example of a non-affine index: x[y[i]] (indirect array addressing). For two affine indices a*i + b and c*i + d, a dependence is possible only if GCD(c, a) divides (d - b) evenly; a sketch follows below.
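
    A small sketch of the GCD test from the slide, for a write to A[a*i + b] and a read from A[c*i + d] inside the same loop (the function names are ours, not from the text):

        #include <stdlib.h>

        static int gcd(int x, int y)
        {
            while (y != 0) { int t = x % y; x = y; y = t; }
            return x;
        }

        /* Returns 1 if a loop-carried dependence between the affine
         * accesses A[a*i + b] and A[c*i + d] is possible. The test is
         * conservative: it may report a dependence that the actual
         * loop bounds never realize.                                 */
        int may_depend(int a, int b, int c, int d)
        {
            int g = gcd(abs(a), abs(c));
            if (g == 0)
                return b == d;            /* both indices are constants */
            return (d - b) % g == 0;      /* GCD(c,a) must divide d - b */
        }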

    194. Software Pipelining Interleaves instructions from different iterations of a loop without unrolling: each iteration of the pipelined loop is made from instructions taken from different iterations of the original loop. It is the software counterpart to Tomasulo's algorithm, and start-up and finish-up code is required.

    195. Software Pipelining

    196. Global Code Scheduling Loop unrolling and software pipelining improve ILP when loop bodies are straight-line code (no branches). Control flow (branches) within loops makes both more complex and requires moving instructions across branches. Global code scheduling: moving instructions across branches.

    197. Trace Scheduling Advantages eliminates some hard decisions in global code scheduling good for code such as scientific programs with intensive loops and predictable behavior Disadvantages significant overhead in compensation code when trace must be exited

    198. Review Loop unrolling Software pipelining Trace scheduling Global code scheduling Problems Unpredictable branches Dependencies between memory references

    199. Compiler Speculation To speculate ambitiously, must have The ability to find instructions that can be speculatively moved and not affect program data flow. The ability to ignore exceptions in speculated instructions, until it is certain they should occur. The ability to speculatively interchange loads and stores which may have address conflicts. The last two require hardware support.

    200. Hardware vs Software Speculation Disambiguation of memory references: in software it is hard to do at compile time if the program uses pointers; in hardware, dynamic disambiguation is possible, supporting reordering of loads and stores in Tomasulo's approach. Support for speculative memory references can help the compiler, but the overhead of recovery is high.

    201. Intel IA-64 Architecture Itanium Implementation IA-64 Instruction set architecture Instruction format Examples of explicit parallelism support Predication and speculation support Itanium Implementation Functional units and instruction issue Performance

    202. Conclusions Multi-issue processors only achieve high performance with much investment in silicon area and hardware complexity No clear winner in hardware or software approaches to ILP in general Software helps for conditional instructions and speculative load support Hardware helps for scoreboard type scheduling, dynamic branch prediction, local checking for speculated load correctness
