1. Chapter 5 Memory Hierarchy Design
3. Principle of Locality Programs access a relatively small area of address space at any time. (90% of the time in 10% of the code)
Temporal locality: if a memory location is accessed, it is likely to be accessed again soon.
Spatial locality: if a memory location is accessed, nearby locations are likely to be accessed soon.
4. Memory Hierarchy
5. Memory Hierarchy Implementation
6. Memory and CPU Performance Gap
7. Cache Basics Cache: a term that generally applies whenever buffering is used to reuse commonly used items.
Cache: the first level of the memory hierarchy encountered once an address leaves the CPU.
Cache hit: when the CPU finds an item in the cache.
Cache miss: when the CPU does not find the item in the cache.
8. Cache Miss On a cache miss, the item must be looked for in memory (or in another level of cache)
If the item is not in memory, a page fault occurs, and the page containing the item must be retrieved from disk.
9. Cache Performance Memory stall cycles: cycles spent waiting for memory accesses.
10. Cache Performance Cache Misses
11. Memory Stall Cycles Memory reads and writes will generally have different penalties
Usually we use a single miss rate and miss penalty averaged over reads and writes
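As a rough illustration (the parameter values below are assumed, not taken from these slides), the usual stall-cycle formula can be evaluated directly in C:

#include <stdio.h>

int main(void) {
    /* Assumed illustrative parameters */
    double instruction_count  = 1e9;   /* instructions executed           */
    double accesses_per_instr = 1.5;   /* memory accesses per instruction */
    double miss_rate          = 0.02;  /* fraction of accesses that miss  */
    double miss_penalty       = 100.0; /* clock cycles per miss           */

    /* Memory stall cycles = IC x accesses/instr x miss rate x miss penalty */
    double stall_cycles = instruction_count * accesses_per_instr
                        * miss_rate * miss_penalty;

    printf("Memory stall cycles: %.0f\n", stall_cycles);
    return 0;
}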
12. Memory Hierarchy Operation Where can a block be placed in the upper level? (block placement)
How is a block found if it is in the upper level? (block identification)
Which block should be replaced on a miss? (block replacement)
What happens on a write? (write strategy)
13. Block Placement Where can a block be placed in the upper level?
Each block has only one place it can appear in the cache - Direct mapped cache
A block can be placed anywhere in the cache - Fully associative cache
A block can be placed in a restricted set of places in the cache - Set associative cache
14. Block Placement
15. Direct Mapped Cache Example: 32-bit Memory Address
16. Set Associative Cache Set is a group of blocks in the cache
Block is first mapped to set, then the block can be placed anywhere within the set.
For n blocks per set, cache is said to be n-way set associative.
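A hedged sketch of how a block address maps to a set; the cache geometry (BLOCK_SIZE, NUM_SETS) is hypothetical and both are assumed to be powers of two. Direct mapped is the n = 1 case, fully associative the single-set case.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical cache geometry (not a specific processor) */
#define BLOCK_SIZE 64u   /* bytes per block   */
#define NUM_SETS   512u  /* sets in the cache */

int main(void) {
    uint32_t addr = 0x12345678u;

    uint32_t block_addr = addr / BLOCK_SIZE;      /* strip the block offset    */
    uint32_t set_index  = block_addr % NUM_SETS;  /* block maps to this set    */
    uint32_t tag        = block_addr / NUM_SETS;  /* remaining high-order bits */

    /* Within the set, an n-way set-associative cache may place the
       block in any of its n ways. */
    printf("offset=%u set=%u tag=0x%x\n",
           (unsigned)(addr % BLOCK_SIZE), (unsigned)set_index, (unsigned)tag);
    return 0;
}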
17. Set Associative Cache Example: 2-way set associative
18. Fully Associative If the cache has N blocks, then there are N possible places to put each memory block.
Must compare tags of every block to memory block address.
19. Block Identification How is a block found if it is in the cache?
20. Block Replacement Which block should be replaced on a cache miss?
Direct mapped cache: no choice, only one spot for the block
Set associative or fully associative: many blocks to choose from
Random: spreads allocation uniformly
Least recently used (LRU): reduces the chance of evicting blocks that might be needed soon
First-in, first-out (FIFO): replaces the OLDEST block rather than the least recently used, to simplify selection
21. Performance for Replacement Strategies
22. Write Strategy What happens on a write?
First, how often do writes occur?
Common mix:
10% stores, 37% loads (data accesses)
All instruction fetches (100%) are reads
Writes:
10% / (100% + 10% + 37%) = 7% of all memory accesses
10% / (10% + 37%) = 21% of all data accesses
23. Writes are Inherently Slower Reads can occur concurrently with tag comparison (result is ignored if it is a miss)
Entire block is always read
For Writes: tag must be checked for hit before modification of block occurs
Size of write must be specified, and only that part of block can be changed.
24. Write Policy Options Write through: information is written to both the block in the cache AND the block in Memory
Write back: information is written only to the block in the cache; the modified cache block is written to Memory when it is replaced.
25. Write Back Dirty bit: status bit indicating that the cache block has been modified and differs from Memory
Writes are faster, since a write to Memory is only needed when the block is replaced
Fewer Memory writes, so less power is consumed
Important for embedded applications
26. Write Through Easier to implement
No writes to Memory necessary on read misses
Cache is always consistent with Memory
Important for multiprocessors
Important for I/O
27. Optimizations for Writes Write buffer: holds data to be written so that CPU execution and the Memory write can occur concurrently
Reduces stall time for write through
Writes can still result in some stall time if buffer is full
28. Write Miss Strategies Options for write misses:
Write allocate: the block is allocated on a write miss; often used with write-back
No-write allocate: the block is not allocated on a write miss; often used with write-through
29. Example: Alpha 21264 Data Cache 64KB data cache
64-byte blocks
2-way set associative
9-bit index selects among 512 sets
2 blocks per set x 512 sets x 64 bytes/block = 64KB
Write-back
Write allocate on write miss
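A small check of the stated geometry, computing the number of blocks, sets, and index bits from the cache parameters given above:

#include <stdio.h>

int main(void) {
    unsigned cache_bytes = 64 * 1024; /* 64KB data cache       */
    unsigned block_bytes = 64;        /* 64-byte blocks        */
    unsigned ways        = 2;         /* 2-way set associative */

    unsigned blocks = cache_bytes / block_bytes;   /* 1024 blocks */
    unsigned sets   = blocks / ways;               /* 512 sets    */

    unsigned index_bits = 0;
    for (unsigned s = sets; s > 1; s >>= 1) index_bits++;  /* log2(512) = 9 */

    printf("blocks=%u sets=%u index bits=%u\n", blocks, sets, index_bits);
    return 0;
}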
30. Example: Alpha 21264 Data Cache
31. Cache Miss Cache sends a "data not available" signal to the CPU
64-byte block is read from next level cache
2-way set associative cache, so 2 choices of places to put new block
Round-robin selection (FIFO) is used
Block replacement updates
Block of data
Address tag
Valid bit
Round-robin bit
Write back is used, so replaced block must be written to next level.
Written to victim buffer (same as write buffer)
Unless victim buffer is full, write to next level from victim buffer takes place concurrently with CPU continuing execution.
Write allocate is used, so block is allocated on write miss
Read miss is very similar
32. Alpha Instruction Cache Separate data and instruction cache
Avoids structural hazards when both instruction fetch and load/store need cache
Allows for separate cache optimizations
64KB Instruction Cache
Instruction caches have lower miss rates than data caches
33. Cache Performance CPU Execution Time = (CPU cycles + Memory stall cycles) x cycle time
34. Overview Review memory hierarchy concepts
Review cache definitions
Cache performance
Methods for improving cache performance
Reducing miss penalty
Reducing miss rate
Reducing hit time
Main memory performance
Memory technology
Virtual memory
Memory protection
35. Cache Performance Effects of cache performance on overall execution time:
CPU time with cache = IC x (1 + (1.5 x 2% x 100)) x clock cycle
CPU time without cache = IC x (1 + (1.5 x 100)) x clock cycle
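As a sketch of the calculation, assuming the parameters implied by the formulas above (base CPI of 1, 1.5 memory accesses per instruction, 2% miss rate, 100-cycle memory access) and normalizing IC and clock cycle time to 1:

#include <stdio.h>

int main(void) {
    double base_cpi   = 1.0;
    double accesses   = 1.5;   /* memory accesses per instruction */
    double miss_rate  = 0.02;
    double mem_cycles = 100.0; /* cycles to reach main memory     */

    double with_cache    = base_cpi + accesses * miss_rate * mem_cycles; /* 1 + 3 = 4     */
    double without_cache = base_cpi + accesses * mem_cycles;             /* 1 + 150 = 151 */

    printf("CPI with cache:     %.1f\n", with_cache);
    printf("CPI without cache:  %.1f\n", without_cache);
    printf("Speedup from cache: %.1fx\n", without_cache / with_cache);
    return 0;
}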
36. Miss Penalties for Out-of-Order Execution Processors For out-of-order execution:
Miss penalty can overlap execution
Redefine memory stalls to be miss penalty of non-overlapped latency
37. Summary of Cache Performance Figure 5.9 provides all previously presented equations (a good reference):
CPU execution time
Memory stall cycles
Average memory access time
Relationship between index size, cache size, block size, set associativity
38. Improving Cache Performance Four major categories of approaches
Reducing the miss penalty
Reducing the miss rate
Reducing the miss rate or penalty via parallelism
Reducing the time to hit in the cache
39. Reducing the Miss Penalty Multilevel caches
40. Multi-level Caches Local miss rate: miss rate of an individual cache, considering only the requests it sees:
L1 cache: Miss rate(L1)
L2 cache: Miss rate(L2)
Global miss rate: overall miss rate of the caches, considering all requests:
L1 cache: Miss rate(L1)
L2 cache: Miss rate(L1) x Miss rate(L2)
Average memory stalls per instruction = Misses per instruction(L1) x Hit time(L2)
+ Misses per instruction(L2) x Miss penalty(L2)
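A minimal sketch of these definitions; the miss rates, hit time, and miss penalty below are assumed illustrative values, not figures from the slides:

#include <stdio.h>

int main(void) {
    /* Assumed illustrative values */
    double l1_miss_rate    = 0.04;  /* local = global for L1             */
    double l2_local_miss   = 0.50;  /* of the requests that reach L2     */
    double accesses        = 1.5;   /* memory accesses per instruction   */
    double l2_hit_time     = 10.0;  /* cycles                            */
    double l2_miss_penalty = 100.0; /* cycles to main memory             */

    double l2_global_miss = l1_miss_rate * l2_local_miss;

    double misses_per_instr_l1 = accesses * l1_miss_rate;
    double misses_per_instr_l2 = accesses * l2_global_miss;

    /* Average memory stalls per instruction =
       misses/instr(L1) x hit time(L2) + misses/instr(L2) x miss penalty(L2) */
    double stalls = misses_per_instr_l1 * l2_hit_time
                  + misses_per_instr_l2 * l2_miss_penalty;

    printf("L2 global miss rate: %.3f\n", l2_global_miss);
    printf("Avg memory stalls per instruction: %.2f\n", stalls);
    return 0;
}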
41. Multi-level Caches Note that local miss rate is very different than global miss rate
Local miss rate is very high for L2 caches smaller than L1 cache
Global miss rate is very close to single cache miss rate
42. Design Issues for Second Level Caches Set associative? Usually this improves performance
Block size? Usually kept the same as L1
If L2 can only be slightly bigger than L1:
Multi-level exclusion option:
L1 data is never found in L2.
Cache miss in L1 results in swap of blocks between L1 and L2
43. Improving Cache Performance Four major categories of approaches
Reducing the miss penalty
Reducing the miss rate
Reducing the miss rate or penalty via parallelism
Reducing the time to hit in the cache
44. Critical Word First Based on observation that CPU normally needs just one word of cache block
Critical Word First: on a cache miss, read the missed word from memory and send it to the CPU as soon as it arrives:
fill rest of the words while CPU is executing
Also called wrapped fetch and requested word first
45. Early Restart Similar to previous approach, but fetch words in order.
As soon as requested word arrives, send to CPU.
Both techniques are most beneficial when block size is large.
46. Giving Priority to Read Misses over Writes Serve reads before writes have completed
Requires a write buffer
Must avoid errors from memory RAW hazards
47. Giving Priority to Read Misses over Writes Wait until the write buffer is empty (slow)
Check write buffer contents on a miss (most common solution)
Similar techniques for write-back cache
When a read miss will replace a dirty block:
Copy dirty block to a buffer (to be written later)
Then read memory
Then write memory
48. Improving Cache Performance Four major categories of approaches
Reducing the miss penalty
Reducing the miss rate
Reducing the miss rate or penalty via parallelism
Reducing the time to hit in the cache
49. Merging Write Buffer
50. Victim Caches
51. Summary of Miss Penalty Techniques Add: More cache levels
Impatience: Critical word first, early restart
Preference: Reads before writes
Efficiency: Merging words in write buffer
Recycling: Victim cache
52. Improving Cache Performance Four major categories of approaches
Reducing the miss penalty
Reducing the miss rate
Reducing the miss rate or penalty via parallelism
Reducing the time to hit in the cache
53. Miss Categories Compulsory: the first access to a block (cold-start misses or first-reference misses)
Capacity: the cache cannot hold all the blocks needed (these misses would occur even in a fully associative cache)
Conflict: for set-associative or direct-mapped caches, a miss that occurs because too many blocks map to the same set or entry.
54. Miss Rates for Categories
55. Larger Block Size
56. Larger Caches An obvious way to reduce capacity misses
Tradeoffs:
Longer hit times
Higher cost
Most used in off-chip caches (L2, L3)
57. Higher Associativity Some observations:
8-way has similar performance to fully associative
2:1 rule of thumb: Direct-mapped cache of size N has about the same miss rate as 2-way set associative cache of N/2.
Increased associativity increases hit time
58. Way Prediction Way prediction: extra bits are kept in the cache to predict the way
Multiplexor is set early to select the desired block
Only a single tag comparison is performed
Miss results in checking other blocks
Reduces conflict misses over direct mapping
Maintains hit speed of direct mapping
59. Way Prediction in Alpha 21264 2-way set associative cache
Block predictor bit added to each block of instruction cache
Selects which of the 2 blocks to try on next cache access
Correct prediction: instruction latency of 1 cycle
Incorrect prediction: try the other block, change the way predictor; instruction latency of 3 cycles
60. Compiler Optimizations So far to reduce misses:
larger blocks, larger caches, higher associativity, way prediction (all changes to hardware)
Some compiler techniques used:
use profiling information to reorder instructions to avoid likely conflicts
put program entry points at the beginning of cache blocks
Transform data storage so that program operates on data within blocks.
61. Compiler Optimizations Example: Exchange nesting of loops to make code access data in the order the data is stored.
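A minimal sketch of this loop interchange in C, assuming row-major storage and an illustrative array size:

#define N 1024
double x[N][N];

/* Before: inner loop strides down a column -> poor spatial locality */
void scale_cols_first(void) {
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            x[i][j] = 2.0 * x[i][j];
}

/* After interchange: accesses follow row-major storage order,
   so consecutive references fall in the same cache block */
void scale_rows_first(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            x[i][j] = 2.0 * x[i][j];
}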
62. Summary of Reducing Miss Rate Hardware approaches
larger blocks,
larger caches,
higher associativity,
way prediction
Compiler approaches
Yet another connection between architecture implementation and the compiler
63. Improving Cache Performance Four major categories of approaches
Reducing the miss penalty
Reducing the miss rate
Reducing the miss rate or penalty via parallelism
Reducing the time to hit in the cache
64. Nonblocking Caches(lockup-free cache) Allows data cache to continue to supply cache hits during a miss
hit under miss
reduces effective miss penalty by responding to CPU hits during time waiting for miss
hit under multiple miss
overlap of hit with multiple misses
Both require significant increases in cache controller complexity
65. Hardware Prefetching Can be done with data or instructions
Directly into caches or into an external buffer
Instruction prefetch
Done with separate prefetch controller outside of cache
On a miss, prefetch 2 blocks: the missed block and the one after it.
66. Hardware Prefetching
67. Hardware Prefetching
68. External Buffer Checked on cache miss before going to main memory
If a hit, moved to cache
Another prefetch replaces block in external buffer
69. Compiler Controlled Prefetching Compiler inserts prefetch instructions
Prefetch data before it is needed
Example
70. Compiler Controlled Prefetching Two approaches:
Register prefetch: load data into CPU registers
Cache prefetch: load data into the cache only
Two types:
Nonfaulting prefetch: turns into a no-op if the memory address is protected or would cause a page fault
Also called non-binding prefetch
Most modern CPUs have nonfaulting cache prefetching
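As an illustrative sketch (not the slide's original example), the GCC/Clang __builtin_prefetch intrinsic is one way a nonfaulting, non-binding cache prefetch can be expressed in source code; the prefetch distance of 16 elements is an arbitrary assumption:

/* Illustrative only: __builtin_prefetch(addr, rw, locality) is a
   GCC/Clang intrinsic that issues a nonfaulting cache prefetch. */
void scale(double *a, long n) {
    for (long i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], 0, 1); /* read, low temporal locality */
        a[i] = a[i] * 3.0;
    }
}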
71. Improving Cache Performance Four major categories of approaches
Reducing the miss penalty
Reducing the miss rate
Reducing the miss rate or penalty via parallelism
Reducing the time to hit in the cache
72. Small and Simple Caches Comparing the tag portion of the address with the cache tag is the most time-consuming step
Direct-mapped caches are faster: tag checking can be overlapped with data transmission
Pressure to match cache access time to the CPU clock cycle has caused L1 cache sizes to level off in current computers
73. Associativity Impact
74. Pipelined Cache Access Allows effective latency to be multiple clock cycles
Advantages
Short clock cycle
Disadvantages
Slow hits
Increases number of pipeline stages
Greater branch penalties
More cycles between issue of load and data availability
75. Trace Caches Instruction cache
Cache blocks consist of traces of instructions
Traces determined dynamically by CPU
Traces include taken branches
Used with branch prediction
Complicated address mapping mechanisms
Can store the same instruction multiple times
76. Cache Optimization Summary Figure 5.26
Most optimizations only help one factor (miss rate, miss penalty, hit time)
Optimizations usually come with a hardware complexity cost
77. Overview Review memory hierarchy concepts
Review cache definitions
Cache performance
Methods for improving cache performance
Reducing miss penalty
Reducing miss rate
Reducing hit time
Main memory performance
Memory technology
Virtual memory
Memory protection
78. Main Memory Optimizations Main memory latency determines the cache miss penalty (in time or cycles)
Main memory bandwidth (bytes/time or cycles)
Most important for I/O
Important for multiprocessors
Also important for L2 caches with large block sizes
Bandwidth is easier to optimize than latency
79. Example Base Memory 4 clock cycles to send address
56 clock cycles for access time per word
4 clock cycles to send a word of data
64 clock cycles total
Cache structure:
4 word blocks
Word = 8-bytes
Miss penalty
4x(4+56+4) = 256 clock cycles
Bandwidth = 32/256 = 1/8 byte per clock cycle
80. Memory Optimization Options
81. Wider Memory Double width -> double the throughput
Cache miss -> 128 clock cycles (2 x 64)
Throughput -> 32/128 = 0.25 bytes per cycle
Multiplexor adds delay
Complicates error correction within memory
82. Simple Interleaved Memory Memory organized into separate banks
Address sent to all, reads occur in parallel
Logically a wide memory, but accesses staged over time to share memory bus
83. Effect on Throughput Throughput for our example:
4 clock cycles to send address (in parallel)
56 clock cycles for access time (in parallel)
4x4 clock cycles to send a block of data
Cache miss penalty = 4 + 56 + 16 = 76 clock cycles
Throughput = 32/76 = 0.42 bytes/cycle
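A short sketch that reproduces the numbers for the three organizations (1-word-wide, 2-word-wide, and 4-bank interleaved), using only the cycle counts stated in the example:

#include <stdio.h>

int main(void) {
    int send_addr = 4, access = 56, send_word = 4; /* cycles, from the example */
    int words_per_block = 4, bytes_per_word = 8;
    int block_bytes = words_per_block * bytes_per_word;  /* 32 bytes */

    int base        = words_per_block * (send_addr + access + send_word);       /* 256 */
    int wide2       = (words_per_block / 2) * (send_addr + access + send_word); /* 128 */
    int interleaved = send_addr + access + words_per_block * send_word;         /* 76  */

    printf("1-word wide:     %3d cycles, %.2f bytes/cycle\n", base,        (double)block_bytes / base);
    printf("2-word wide:     %3d cycles, %.2f bytes/cycle\n", wide2,       (double)block_bytes / wide2);
    printf("4-bank interlvd: %3d cycles, %.2f bytes/cycle\n", interleaved, (double)block_bytes / interleaved);
    return 0;
}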
84. Writes with Interleaved Memory Back to back writes can be pipelined
If the accesses are to separate banks
Avoids stalling while waiting for the first write to complete
85. Address Mapping for Interleaved Memory
86. Independent Memory Banks Allow multiple independent accesses
Multiple memory controllers
Banks operate somewhat independently
More expensive, but more flexible
Useful for non-blocking caches
Useful for multi-processors
87. Overview Review memory hierarchy concepts
Review cache definitions
Cache performance
Methods for improving cache performance
Reducing miss penalty
Reducing miss rate
Reducing hit time
Main memory performance
Memory technology
Virtual memory
Examples
88. Memory Technology Memory latency
access time: time between a read request and data arrival
cycle time: time between successive requests
not equivalent, due to the time required for setup of the address lines
Memory types
DRAM
SRAM
89. DRAM Dynamic RAM: must be refreshed to hold its contents (the contents are not static)
Information stored on capacitors
Storage cells very small
Large capacity
Low cost per bit
Slower than SRAM
Usually packaged as a set of chips
90. DRAM Internal Organization
91. SRAM Static RAM: contents are statically held by a feedback circuit
6 transistors per storage cell
Good for low power devices (no refresh)
Faster, but smaller, than DRAM
Address lines not multiplexed
No difference between access time and cycle time
Used to implement caches
92. DRAM vs SRAM For similar technologies:
Speed
SRAM is 8-16 times faster
Capacity
DRAM is 4-8 times larger
Cost
SRAM is 8-16 times more expensive per bit
93. Embedded System Memory Issues Non-volatile storage (Desktops use disks)
ROM: read-only memory
Various types; some can be user-programmed:
PROM, EPROM, EEPROM
Flash memory: similar to EEPROM
Very slow to write (10-100 times slower than DRAM)
94. DRAM Performance Improvements Internal organizations
Repeated accesses to a single row (fast page mode)
Synchronous memory controller (SDRAM)
Transfer data on both edges of DRAM clock signal
Double Data Rate (DDR DRAM)
All require some internal hardware overhead
95. DRAM Performance Improvements Interface bus designs
RAMBUS (DRDRAM)
Single chip that includes interleaved DDR SDRAM and high-speed interface
Packaged in a RIMM: same package size as a DIMM, but incompatible
Improved bandwidth, but not latency
Expensive (2x DRAM DIMMs of same size)
96. Virtual Memory Analogous to the cache/main Memory interface:
a cache block is similar to a VM page (or segment)
Manages the main Memory/secondary storage interface
97. Virtual Memory Manages sharing of protected memory space
Relocation mechanism for loading programs for execution from any location
CPU produces virtual address
Memory mapping (address translation)
Translation of virtual address to physical address
98. Cache and Virtual Memory Comparison Virtual memory is larger, with larger page sizes, much longer access times, but also much smaller miss rates
99. Memory Hierarchy Questions Where can a page be placed in memory?
How is a page found if it is in main memory?
Which page should be replaced in a virtual memory miss?
What happens on a write?
100. Placement in Memory Very high miss penalty when reading from disk
Must keep miss rate very low
Pages can be placed anywhere in main memory (fully associative)
101. Finding Pages in Main Memory Indexed by page or segment number
Page table maps virtual page to physical address
102. Address Translation Page table must be large enough to hold number of pages in virtual address space
Translation Lookaside Buffer (TLB or TB): a cache often used for page table entries
TLB contains portions of the virtual address and the physical page frame number
Includes dirty bits, use bits, protection field, and valid bits
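A hedged sketch of the translation step using a one-level page table; the 4 KB page size and tiny table size are assumptions for illustration, and a real TLB would simply cache recent (virtual page, physical frame) pairs together with the status bits listed above:

#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE 4096u                /* assumed 4 KB pages        */
#define NUM_PAGES 1024u                /* tiny illustrative table   */

static uint32_t page_table[NUM_PAGES]; /* virtual page -> physical frame */

/* Translate a virtual address using a one-level page table.
   A miss (unmapped page) would be a page fault in a real system. */
uint32_t translate(uint32_t vaddr) {
    uint32_t vpage  = vaddr / PAGE_SIZE;
    uint32_t offset = vaddr % PAGE_SIZE;
    uint32_t frame  = page_table[vpage];
    return frame * PAGE_SIZE + offset;
}

int main(void) {
    page_table[3] = 7;                 /* map virtual page 3 to frame 7 */
    uint32_t va = 3u * PAGE_SIZE + 0x10u;
    printf("0x%x -> 0x%x\n", (unsigned)va, (unsigned)translate(va));
    return 0;
}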
103. Alpha 21264 TLB
104. Replacement Mechanisms To minimize page faults, the LRU method is used
A use bit (or reference bit) is used to keep track:
Set whenever a page is accessed
Recorded and cleared periodically by the operating system
105. What Happens on Write? Write back is always used; disk access time is too long to consider write-through
106. Big Picture Memory Hierarchy
107. Cache and Superscalar CPU Multi-issue CPUs require high bandwidth from the cache
Multiple cache ports for multiple instruction fetches per clock cycle
Cache must be nonblocking (to allow hits under misses)
108. Speculative Execution and Memory Memory system must identify speculatively executed instructions
Exceptions for speculative instructions suppressed
Nonblocking cache so that stalls do not occur on miss for speculative instruction
109. Embedded System Memory Real time constraints require very little performance variability
Caches increase variability (but improve average performance)
Instruction caches are predictable and widely used
Caches save power (on-chip access vs off-chip access)
Way prediction can lower power by activating only a subset of the comparators at a time.
110. I/O and Memory Problem of keeping cache consistent with Memory when I/O can write to Memory
Connect I/O to cache (makes cache slow)
Use write-through cache (not common for L2)
Make I/O portions of memory noncacheable
OS flushes cache blocks before accepting I/O
Hardware controller to handle cache consistency with I/O
111. Example Memory Organizations Alpha 21264
out of order execution
fetches up to 4 instructions per clock cycle
Emotion Engine of Sony Playstation
embedded processor
high demands for audio and graphics
Sun Fire 6800 Server
commercial computing: database applications
special features for availability and maintainability
113. Alpha 21264 41-bit physical address (43-bit virtual address)
Physical address space:
Instruction cache uses way prediction
L1 miss penalty: 16 cycles
L2 miss penalty: 130 cycles
114. Emotion Engine Sony Playstation
115. Sony Playstation Lots of I/O to interface with memory
10-channel DMA (Direct Memory Access)
I/O interface
I/O processor with several interfaces
Memory embedded with all processors
Dedicated buses
1-cycle latency for all embedded memories
116. Interesting Features LARGE chip sizes:
Emotion Engine: 225 mm²
Graphics Synthesizer: 279 mm²
(The Alpha 21264 is about 160 mm²)
9 distinct independent memory modules
Programmer (compiler) must keep memories consistent
117. Sun Fire 6800 Server Midrange multi-processor server
Databases with large code sizes
Many capacity and conflict misses
Lots of context switching between processes
Multiprocessor have cache coherency problem
118. Sun Fire 6800 Server
119. Interesting Features HUGE number of I/O pins (1368)
Large number of wide memory paths per processor
Peak bandwidth 11 GB/sec
To improve latency, tags for L2 cache are on-chip.
To improve reliability
error correction bits
8-bit back door diagnostic bus
redundant path between processors
dual redundant system controllers
120. Review Chapters 1-4 (and Appendix A)
121. Computer Markets Desktop
price range from under $1,000 up to $10,000
optimize price-performance
Servers
optimize availability and throughput (i.e. transactions per minute)
scalability is important
Embedded computers
computers that are just one component of a larger system (cell phones, printers, network switches, etc)
widest range of processing power and cost
real-time performance requirements
often power and memory must be minimized
Three basic markets that computer architects design for.
122. Design Functional Requirements Application area (desktop, servers, embedded computers)
Level of software compatibility (programming level, binary compatible)
Operating system requirements (address space size, memory management)
Standards (floating point, I/O, networks, etc.) At the top of the design process is determining the functional requirements for an architecture.
123. Design - Technology Trends Integrated Circuit Technology
Semiconductor DRAM
Magnetic Disk Technology
Network technology. At the implementation end of the process, it is important to understand technology trends.
124. IC Manufacturing Process
125. Computer Performance How do we measure it?
Application run time
Throughput: number of jobs per second
Response time
Importance of each term depends on the application:
Application run time: normal PC user
Throughput: server applications
Response time: real-time applications, transaction processing
126. Measuring Performance Execution time: the actual time between the beginning and end of a program. Includes I/O, memory access, everything.
Performance: the reciprocal of execution time
We will focus on execution time
127. Benchmarks Some typical benchmarks:
Whetstone
Dhrystone
Benchmark suites collections of benchmarks with different characteristics
SPEC: Standard Performance Evaluation Corporation (www.spec.org)
Many types (desktop, server, transaction processing, embedded computer)
128. Design Guidelines Make the common case fast when making design tradeoffs
Amdahl's Law:
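In the usual notation:
Speedup_overall = 1 / ((1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)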
129. CPU Performance Equations For a particular program:
CPU time = CPU clock cycles x clock cycle time.
clock cycle time = 1 / clock rate
Considering instruction count:
cycles per instruction (CPI) = CPU clock cycles / instruction count
130. CPU Performance Equation CPU time = Instruction count x clock cycle time x CPI
or
CPU time = (Instruction count x CPI) / clock rate. A 2x improvement in any of these factors is a 2x improvement in CPU time.
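A small worked example of the equation (instruction count, CPI, and clock rate below are assumed values):

#include <stdio.h>

int main(void) {
    double ic         = 1e9;   /* instruction count (assumed) */
    double cpi        = 2.0;   /* cycles per instruction      */
    double clock_rate = 1e9;   /* 1 GHz -> cycle time = 1 ns  */

    double cpu_time = ic * cpi * (1.0 / clock_rate); /* IC x CPI x cycle time */
    printf("CPU time: %.2f s\n", cpu_time);          /* 2.00 s */

    /* Halving CPI (or IC, or cycle time) halves CPU time */
    printf("With CPI/2: %.2f s\n", ic * (cpi / 2.0) / clock_rate);
    return 0;
}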
131. MIPS as Performance Measure MIPS: Millions of Instructions Per Second
Used in conjunction with benchmarks
Dhrystone MIPS
Can be computed as follows: MIPS = instruction count / (execution time x 10^6) = clock rate / (CPI x 10^6)
132. Locality Program property that is often exploited by architects to achieve better performance.
Programs tend to reuse data and instructions. (For many programs, 90% of execution time is spent running 10% of the code)
Temporal locality: recently accessed items are likely to be accessed in the near future.
Spatial locality: items with addresses near one another tend to be referenced close together in time.
133. Parallelism One of the most important methods for improving performance.
System level: multiple CPUs, multiple disks
CPU level: pipelining, multiple functional modules
Digital design level: carry-lookahead adders
134. Common Mistakes Using only clock rate to compare performance
Even if the processor has the same instruction set, and same memory configuration, performance may not scale with clock speed.
135. Common Mistakes Comparing hand-coded assembly and compiler-generated, high-level language performance.
Huge performance gains can be obtained by hand-coding critical loops of benchmark programs.
Important to understand how benchmark code is generated when comparing performance.
136. Chapter 1 Review History of processor development
Different processor markets
Issues in CPU design
Economics of CPU design and implementation
Computer performance measures, formulas, results
Design guidelines: Amdahl's Law, locality, parallelism
137. Chapter 2 Overview Taxonomy of ISAs (Instruction Set Architectures)
The Role of Compilers
Example: The MIPS architecture
Example: The Trimedia TM32 CPU
138. ISA Classification Based on operand location
Memory addressing
Addressing modes
Type and size of operands
Operations in the instruction set
Instruction flow control
Instruction set encoding
139. Four Basic Types
140. Trends Almost all modern processors use a load/store architecture
Registers are faster to access than memory
Registers are more convenient for compilers to use than other forms (like stacks)
General purpose registers are more convenient for compilers than special purpose registers (like accumulators)
141. Conclusions about Memory Addressing Most important non-register addressing modes:
displacement, immediate, register indirect
Size of displacement field should be at least 12-16 bits.
Size of immediate field should be at least 8-16 bits.
142. Flow Control Terminology
Transfer instructions (old term)
Branch will be used for conditional flow control
Jump will be used for unconditional program flow control
Procedure calls
Procedure returns
143. Addressing Modes for Flow Control Flow control instructions must specify destination address
PC-relative - Specify displacement from program counter (PC)
Requires fewer bits than other modes
Practical since branch target is often nearby
Allows code to be independent of its location in memory (position-independence)
Other modes must be used for returns and indirect jumps
144. The Anatomy of Compilers
145. Compilers and ISAs First goal is correctness
Second goal is speed of resulting code
Other goals:
Speed of compilation
Debugging support
Interoperability with other languages
First goal (correctness) is complex, and limits the complexity of optimizations.
146. Types of Optimizations High-level: optimizations on source code, fed to lower-level optimizations
Local optimizations: optimize code only within a straight-line code fragment
Global optimizations: extend local optimizations across branches and apply transformations to optimize loops
Register allocation: associate registers with operands
Processor-dependent optimizations: take advantage of specific architecture features
147. Example: MIPS ISA
64-bit load-store architecture
Full instruction set explained in Appendix C (on text web page)
148. MIPS Instructions Loads and Stores
ALU operations
Branches and Jumps
Floating point operations
149. Floating Point Operations Single and double-precision operations indicated by .S and .D
MOV.S and MOV.D copy registers of the same type
Special instructions for moving data between FPRs and GPRs
Conversion instructions convert integers to floating point and vice versa
150. Media Processor: Trimedia TM32 Dedicated to multimedia processing
data communication
audio coding
video coding
video processing
graphics
Operate on narrower data than PCs
Operate on data streams
Typically found in set-top boxes
151. Unique Features of TM32 Lots of registers: 128 (32-bit)
Registers can be either integer or floating-point
SIMD instructions available
Both 2's complement and saturating arithmetic available
Programmer can specify five independent instructions to be issued at the same time!
NOPs are placed in slots if 5 are not available
VLIW (very long instruction word) coding technique
152. Instruction Set Design: Pitfalls Designing a high-level instruction set feature specifically oriented to supporting a high-level language structure
Often makes instructions too specialized to be useful
Innovating at the ISA to reduce code size without accounting for the compiler
Compilers can make much more impact
Use optimized code when considering changes
153. Instruction Set Design: Fallacies There is such a thing as a typical program
programs vary widely in how they use instruction sets
An architecture with flaws cannot be successful
The 80x86 is a case in point
A flawless architecture can be designed
All designs contain tradeoffs
Technologies change, making previous good decisions bad
154. ISA Conclusions: Trends in the 1990s Address size: 32-bit -> 64-bit
Addition of conditional execution instructions
Optimization of cache performance via prefetch
Support for multimedia
Faster floating-point operations
155. ISA Conclusions: Trends in the 2000s Long instruction words
Increased conditional execution
Blending of DSP and general purpose architectures
80x86 emulation
156. Appendix A Overview Introduction
Pipeline concepts
Basics of RISC instruction set
Classic 5-stage pipeline
Pipeline Hazards
Stalls, structural hazards, data hazards
Branch hazards
Pipeline Implementation
Simple MIPS pipeline
Implementation Difficulties for Pipelines
Exceptions, instruction set complications
Extending MIPS Pipeline to Multicycle operations
Example: MIPS R4000 Pipeline
157. CPU Pipelining
158. Performance and Pipelining For the following assumptions:
N stages in the pipeline
Unpipelined execution time for 1 instruction is T
Pipeline stages are equal and perfectly balanced
Then
Execution time for pipelined version = T / N per instruction
Throughput increase is N
159. Pipeline Execution
160. Instruction Timing Throughput is increased by approximately a factor of 5
Execution time of an individual instruction INCREASES due to pipelining overhead:
Pipeline register delay
Clock skew (T = T_CL + T_su + T_reg + T_skew)
Important to balance pipeline stages, since the clock is matched to the slowest stage (T_CL)
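A brief sketch of how the per-stage overhead erodes the ideal speedup; the latencies below are assumed values:

#include <stdio.h>

int main(void) {
    /* Assumed illustrative latencies in ns */
    double t_unpipelined = 10.0;  /* total logic delay of one instruction        */
    int    n_stages      = 5;
    double t_overhead    = 0.5;   /* register setup/delay + clock skew per stage */

    double t_cl    = t_unpipelined / n_stages;  /* slowest-stage logic (balanced stages) */
    double t_clock = t_cl + t_overhead;         /* T = T_CL + T_su + T_reg + T_skew      */

    printf("Ideal speedup:  %d\n", n_stages);
    printf("Actual speedup: %.2f\n", t_unpipelined / t_clock); /* < N due to overhead */
    return 0;
}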
161. Pipeline Hazards Structural hazards: resource conflicts when more than one instruction needs a resource
Data hazards: an instruction depends on a result from a previous instruction that is not yet available
Control hazards: conflicts from branches and jumps that change the PC
162. Forwarding A solution for data hazards
Also called bypassing or short-circuiting
Create potential datapath from where result is calculated to where it is needed by another instruction
Detect hazard to route the result
Example:
163. Branch Hazards
164. Pipeline Implementation Details of pipeline implementation
So that other issues can be explored
Look at non-pipelined implementation first
Focus on integer subset of MIPS
Load-store word
Branch equal zero
Integer ALU operations
Basic principles can be extended to all instructions
165. Multicycle Datapath
166. Pipeline Control
167. Branches in Pipeline Consider only BEQZ and BNEZ (branch if equal to zero or not equal to zero)
For these it is possible to move the test to the ID stage
To take advantage of early decision, target address must also be computed early
Must add another adder for computing target address in ID
Result is 1-cycle stall on branches.
Branches on result of register from previous ALU operation will result in a data hazard stall.
168. Exceptions and Pipelines Exceptions can come from several sources and can be classified several ways
Sources
I/O Device Interrupt
Invoking OS from user program
Tracing program execution
Breakpoint
Integer arithmetic overflow or underflow, FP trap
Page fault
Misaligned memory accesses
Memory protection violation
Undefined instruction
Hardware malfunction
Power failure
169. FP Pipeline
170. Review of Hazards Caused by different lengths of execution unit pipelines.
Structural hazards: multiple instructions need the same functional unit at the same time
RAW data hazards: an instruction needs to read a value that has not been written yet
WAW data hazards: writes occur out of order
171. MIPS R4000 Pipeline Implements MIPS-64
Deeper pipeline (8 stages): superpipelined
Higher clock rate (smaller logic in each stage)
Additional stages from decomposing memory accesses
172. Appendix A Summary For ideal N-stage pipeline, throughput increase is N over a non-pipelined architecture
Ideal pipelined cpu has CPI=1
Pipelining has advantages of
significant speedup with moderate hardware costs
invisible to programmer
Pipeline challenges include
Structural hazards
Data hazards
Control hazards
Exceptions
Floating point operations
173. Solutions include
Stalls
Forwarding
Buffering state (for exceptions)
Branch delay slots
Branch prediction
Several multi-cycle execution units for FP
Appendix A Summary
174. Chapter 3 Overview Instruction Level Parallelism
Data Dependence and Hazards
Dynamic Scheduling
Dynamic Hardware Prediction
High-Performance Instruction Delivery
Multiple Issue
Hardware-Based Speculation
175. Instruction Level Parallelism Definition: Potential to overlap the execution of instructions
Pipelining is one example
Limitations of ILP are from data and control hazards
Two approaches to overcoming limitations
dynamic approaches with hardware (Chapter 3)
static approaches that use software (Chapter 4)
176. CPI for Pipelines CPI = Ideal pipeline CPI + structural stalls + data hazard stalls + control stalls
Pipeline performance is sometimes measured in IPC (Instructions Per Clock cycle) = 1/CPI
Must fully understand dependencies and hazards to see how much parallelism exists and understand how it can be exploited.
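A tiny sketch of the CPI/IPC relationship above, with assumed stall components:

#include <stdio.h>

int main(void) {
    /* Assumed stall contributions, in cycles per instruction */
    double ideal_cpi = 1.0, structural = 0.05, data = 0.30, control = 0.15;

    double cpi = ideal_cpi + structural + data + control;
    printf("CPI = %.2f, IPC = %.2f\n", cpi, 1.0 / cpi);
    return 0;
}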
177. Data Dependencies Data dependencies (true data dependencies)
Name dependencies
Control dependencies
An instruction j is data dependent on instruction i if:
instruction i produces a result that may be used by instruction j, or
instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i.
178. Dynamic Scheduling Hardware rearranges instruction execution to reduce stalls due to dependencies
Can handle some cases where dependencies are not known at compile time
Simplifies the compiler
Allows code compiled for one pipeline to be run on another (invisible to the compiler)
Results in significant hardware complexity
179. Tomasulo's Algorithm/Approach Approach requires tracking instruction dependencies to avoid RAW hazards
Requires register renaming to avoid WAR and WAW hazards
180. MIPS FP Unit using Tomasulo's Approach
181. Dynamic Hardware Prediction Using hardware to dynamically predict the outcome of a branch
Prediction depends on the behavior of the branch at run time
Effectiveness depends on two things:
accuracy in predicting branch
cost for correct and incorrect predictions
182. Multiple-Issue Processors Superscalar processors issue varying numbers of instructions per clock
statically scheduled or
dynamically scheduled
VLIW (Very Long Instruction Word) processors issue a fixed number of instructions
formatted as one large instruction or
a fixed instruction packet - also called EPIC (Explicitly Parallel Instruction Computers)
VLIW/EPIC are inherently statically scheduled
183. Example Multiple-Issue Processors
184. Statically Scheduled Superscalar Processors Instructions are issued in order
All hazards checked dynamically at issue time
Variable number of instructions issued per clock cycle
Require the compiler techniques in Chapter 4 to be efficient.
185. Dynamically Scheduled Superscalar Processor Dynamic scheduling does not restrict the types of instructions that can be issued on a single clock cycle.
Think of it as Tomasulo's approach extended to support multiple issue
Allows N instructions to be issued whenever reservation stations are available.
Branch prediction is used for fetch and issue (but not execute)
186. Speculative Superscalar Pipelines
187. Multiple Issue with Speculation For an architecture based on Tomasulo's approach, the following are obstacles:
Instruction Issue block
Single CDB
Add dispatch buffer
Use multiport reorder buffer
188. Chapter 4 Overview Basic Compiler Techniques
Pipeline scheduling
loop unrolling
Static Branch Prediction
Static Multiple Issue: VLIW
Advanced Compiler Support for Exposing ILP
Detecting loop-level parallelism
Software pipelining: symbolic loop unrolling
Global code scheduling
Hardware support for exposing more parallelism
Conditional or predicated instructions
Compiler speculation with hardware support
Hardware vs Software speculation mechanisms
Intel IA-64 ISA
189. Loop Unrolling Eliminate some of the overhead by unrolling the loop (fully or partially).
Need to adjust the loop termination code
Allows more parallel instructions in a row
Allows more flexibility in reordering
Usually requires register renaming
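A hedged C sketch of unrolling by 4; the unroll factor and the clean-up loop (the adjusted termination code) are illustrative:

/* Original loop */
void scale(double *x, double s, int n) {
    for (int i = 0; i < n; i++)
        x[i] = x[i] * s;
}

/* Unrolled by 4: less loop overhead per element and more independent
   operations in a row for the scheduler to reorder. */
void scale_unrolled(double *x, double s, int n) {
    int i;
    for (i = 0; i + 3 < n; i += 4) {
        x[i]     = x[i]     * s;
        x[i + 1] = x[i + 1] * s;
        x[i + 2] = x[i + 2] * s;
        x[i + 3] = x[i + 3] * s;
    }
    for (; i < n; i++)           /* clean-up loop for leftover iterations */
        x[i] = x[i] * s;
}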
190. Limits to Loop Unrolling Eventually the gains from removing loop overhead diminish
Remaining loop overhead amortization
Code size limitations
Embedded applications
Increase in cache misses
Compiler limitations
Shortfall in registers:
Increase in the number of live values beyond the number of registers
191. Detecting Parallelism Loop-level parallelism
Analyzed at the source level
requires recognition of array references, loops, indices.
Loop-carried dependence: a dependence of one loop iteration on a previous iteration.
for (k=1; k<=100; k=k+1) {
A[k+1] = A[k] + B[k];
}
192. Finding Dependencies Important for
Efficient scheduling
Determining which loops to unroll
Eliminating name dependencies
Makes finding dependencies difficult:
Arrays and pointers in C or C++
Pass by reference parameter passing in FORTRAN
193. Dependencies in Arrays An array index i is affine if it has the form:
a x i + b (for a one-dimensional array)
An index of a multiple-dimension array is affine if the index in each dimension is affine
Common example of a non-affine index:
x[y[i]] (indirect array addressing)
For two affine indices, a x i + b and c x i + d, a dependence can exist only if:
GCD(c, a) divides (d - b) evenly
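A small sketch of this GCD test in C, for two references of the form A[a*i + b] and A[c*i + d] in the same loop; the example values mirror the loop on the earlier slide (A[k+1] = A[k] + B[k]):

#include <stdio.h>

static int gcd(int a, int b) {
    while (b != 0) { int t = a % b; a = b; b = t; }
    return a < 0 ? -a : a;
}

/* GCD test: a dependence between references A[a*i + b] (write) and
   A[c*i + d] (read) is possible only if gcd(a, c) divides (d - b). */
int may_depend(int a, int b, int c, int d) {
    return (d - b) % gcd(a, c) == 0;
}

int main(void) {
    /* A[k+1] = A[k] + B[k]  ->  write a=1,b=1; read c=1,d=0 */
    printf("dependence possible: %d\n", may_depend(1, 1, 1, 0)); /* prints 1 */
    return 0;
}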
194. Software Pipelining Interleaves instructions from different iterations of a loop without unrolling
each iteration is made from instructions from different iterations of the loop
software counterpart to Tomasulo's algorithm
start-up and finish-up code required
195. Software Pipelining
196. Global Code Scheduling Loop unrolling and software pipelining:
improve ILP when loop bodies are straight-line code (no branches)
Control flow (branches) within loops makes both more complex:
will require moving instructions across branches
Global code scheduling: moving instructions across branches
197. Trace Scheduling Advantages
eliminates some hard decisions in global code scheduling
good for code such as scientific programs with intensive loops and predictable behavior
Disadvantages
significant overhead in compensation code when trace must be exited
198. Review Loop unrolling
Software pipelining
Trace scheduling
Global code scheduling
Problems
Unpredictable branches
Dependencies between memory references
199. Compiler Speculation To speculate ambitiously, must have
The ability to find instructions that can be speculatively moved and not affect program data flow.
The ability to ignore exceptions in speculated instructions, until it is certain they should occur.
The ability to speculatively interchange loads and stores which may have address conflicts.
The last two require hardware support.
200. Hardware vs Software Speculation Disambiguation of memory references:
Software: hard to do at compile time if the program uses pointers
Hardware: dynamic disambiguation is possible, supporting reordering of loads and stores in Tomasulo's approach
Support for speculative memory references can help the compiler, but the overhead of recovery is high.
201. Intel IA-64 Architecture and Itanium Implementation IA-64:
Instruction set architecture
Instruction format
Examples of explicit parallelism support
Predication and speculation support
Itanium Implementation
Functional units and instruction issue
Performance
202. Conclusions Multi-issue processors only achieve high performance with much investment in silicon area and hardware complexity
No clear winner in hardware or software approaches to ILP in general
Software helps for conditional instructions and speculative load support
Hardware helps for scoreboard type scheduling, dynamic branch prediction, local checking for speculated load correctness