1. Chapter 5 Memory Hierarchy Design
3. Principle of Locality Programs access a relatively small area of address space at any time. (90% of the time in 10% of the code)
Temporal locality: if a memory location is accessed, it is likely to be accessed again soon.
Spatial locality: if a memory location is accessed, nearby locations are likely to be accessed soon.
4. Memory Hierarchy
5. Memory Hierarchy Implementation
6. Memory and CPU Performance Gap
7. Cache Basics Cache: a term that generally applies whenever buffering is used to reuse commonly used items.
Cache: the first level of the memory hierarchy encountered once an address leaves the CPU.
Cache hit: when the CPU finds an item in the cache.
Cache miss: when the CPU does not find the item in the cache.
8. Cache Miss On a cache miss, the item must be looked for in memory (or in another level of cache)
If the item is not in memory, a page fault occurs, and the page containing the item must be retrieved from disk.
9. Cache Performance Memory stall cycles: cycles spent waiting for memory accesses.
10. Cache Performance Cache Misses
11. Memory Stall Cycles Memory reads and writes will generally have different penalties
Usually we use a single miss rate and miss penalty averaged over reads and writes
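As a rough illustration (the parameter values below are assumed, not taken from these slides), the usual stall-cycle formula can be evaluated directly in C:

#include <stdio.h>

int main(void) {
    /* Assumed illustrative parameters */
    double instruction_count  = 1e9;   /* instructions executed           */
    double accesses_per_instr = 1.5;   /* memory accesses per instruction */
    double miss_rate          = 0.02;  /* fraction of accesses that miss  */
    double miss_penalty       = 100.0; /* clock cycles per miss           */

    /* Memory stall cycles = IC x accesses/instr x miss rate x miss penalty */
    double stall_cycles = instruction_count * accesses_per_instr
                        * miss_rate * miss_penalty;

    printf("Memory stall cycles: %.0f\n", stall_cycles);
    return 0;
}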
12. Memory Hierarchy Operation Where can a block be placed in the upper level? (block placement)
How is a block found if it is in the upper level? (block identification)
Which block should be replaced on a miss? (block replacement)
What happens on a write? (write strategy)
13. Block Placement Where can a block be placed in the upper level?
Each block has only one place it can appear in the cache - Direct mapped cache
A block can be placed anywhere in the cache - Fully associative cache
A block can be placed in a restricted set of places in the cache - Set associative cache
14. Block Placement
15. Direct Mapped Cache Example: 32-bit Memory Address
16. Set Associative Cache Set is a group of blocks in the cache
Block is first mapped to set, then the block can be placed anywhere within the set.
For n blocks per set, cache is said to be n-way set associative.
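A hedged sketch of how a block address maps to a set; the cache geometry (BLOCK_SIZE, NUM_SETS) is hypothetical and both are assumed to be powers of two. Direct mapped is the n = 1 case, fully associative the single-set case.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical cache geometry (not a specific processor) */
#define BLOCK_SIZE 64u   /* bytes per block   */
#define NUM_SETS   512u  /* sets in the cache */

int main(void) {
    uint32_t addr = 0x12345678u;

    uint32_t block_addr = addr / BLOCK_SIZE;      /* strip the block offset    */
    uint32_t set_index  = block_addr % NUM_SETS;  /* block maps to this set    */
    uint32_t tag        = block_addr / NUM_SETS;  /* remaining high-order bits */

    /* Within the set, an n-way set-associative cache may place the
       block in any of its n ways. */
    printf("offset=%u set=%u tag=0x%x\n",
           (unsigned)(addr % BLOCK_SIZE), (unsigned)set_index, (unsigned)tag);
    return 0;
}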
17. Set Associative Cache Example: 2-way set associative
18. Fully Associative If the cache has N blocks, then there are N possible places to put each memory block.
Must compare tags of every block to memory block address.
19. Block Identification How is a block found if it is in the cache?
20. Block Replacement Which block should be replaced on a cache miss?
Direct mapped cache: no choice, only one spot for the block
Set associative or fully associative: many blocks to choose from
Random: spreads allocation uniformly
Least recently used (LRU): reduces the chance of evicting blocks that might be needed soon
First-in, first-out (FIFO): replaces the OLDEST block rather than the least recently used, to simplify selection
21. Performance for Replacement Strategies
22. Write Strategy What happens on a write?
First, how often do writes occur?
Common mix:
10% stores, 37% loads (data accesses)
All instruction fetches (100%) are reads
Writes:
10% / (100% + 10% + 37%) = 7% of all memory accesses
10% / (10% + 37%) = 21% of all data accesses
23. Writes are Inherently Slower Reads can occur concurrently with tag comparison (result is ignored if it is a miss)
Entire block is always read
For Writes: tag must be checked for hit before modification of block occurs
Size of write must be specified, and only that part of block can be changed.
24. Write Policy Options Write through: information is written to both the block in the cache AND the block in Memory
Write back: information is written only to the block in the cache; the modified cache block is written to Memory when it is replaced.
25. Write Back Dirty bit: status bit indicating that the cache block has been modified and differs from Memory
Writes are faster, since a write to Memory is only needed when the block is replaced
Fewer Memory writes, so less power is consumed
Important for embedded applications
26. Write Through Easier to implement
No writes to Memory necessary on read misses
Cache is always consistent with Memory
Important for multiprocessors
Important for I/O
27. Optimizations for Writes Write buffer: holds data to be written so that CPU execution and the Memory write can occur concurrently
Reduces stall time for write through
Writes can still result in some stall time if buffer is full
28. Write Miss Strategies Options for write misses:
Write allocate: the block is allocated on a write miss; often used with write-back
No-write allocate: the block is not allocated on a write miss; often used with write-through
29. Example: Alpha 21264 Data Cache 64KB data cache
64-byte blocks
2-way set associative
9-bit index selects among 512 sets
2 blocks per set x 512 sets x 64 bytes/block = 64KB
Write-back
Write allocate on write miss
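A small check of the stated geometry, computing the number of blocks, sets, and index bits from the cache parameters given above:

#include <stdio.h>

int main(void) {
    unsigned cache_bytes = 64 * 1024; /* 64KB data cache       */
    unsigned block_bytes = 64;        /* 64-byte blocks        */
    unsigned ways        = 2;         /* 2-way set associative */

    unsigned blocks = cache_bytes / block_bytes;   /* 1024 blocks */
    unsigned sets   = blocks / ways;               /* 512 sets    */

    unsigned index_bits = 0;
    for (unsigned s = sets; s > 1; s >>= 1) index_bits++;  /* log2(512) = 9 */

    printf("blocks=%u sets=%u index bits=%u\n", blocks, sets, index_bits);
    return 0;
}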
30. Example: Alpha 21264 Data Cache
31. Cache Miss Cache sends a "data not available" signal to the CPU
64-byte block is read from next level cache
2-way set associative cache, so 2 choices of places to put new block
Round-robin selection (FIFO) is used
Block replacement updates
Block of data
Address tag
Valid bit
Round-robin bit
Write back is used, so replaced block must be written to next level.
Written to victim buffer (same as write buffer)
Unless victim buffer is full, write to next level from victim buffer takes place concurrently with CPU continuing execution.
Write allocate is used, so block is allocated on write miss
Read miss is very similar
32. Alpha Instruction Cache Separate data and instruction cache
Avoids structural hazards when both instruction fetch and load/store need cache
Allows for separate cache optimizations
64KB Instruction Cache
Instruction caches have lower miss rates than data caches
33. Cache Performance CPU Execution Time = (CPU cycles + Memory stall cycles) x cycle time
34. Overview Review memory hierarchy concepts
Review cache definitions
Cache performance
Methods for improving cache performance
Reducing miss penalty
Reducing miss rate
Reducing hit time
Main memory performance
Memory technology
Virtual memory
Memory protection
35. Cache Performance Effects of cache performance on overall execution time:
CPU time with cache = IC x (1 + (1.5 x 2% x 100)) x clock cycle
CPU time without cache = IC x (1 + (1.5 x 100)) x clock cycle
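As a sketch of the calculation, assuming the parameters implied by the formulas above (base CPI of 1, 1.5 memory accesses per instruction, 2% miss rate, 100-cycle memory access) and normalizing IC and clock cycle time to 1:

#include <stdio.h>

int main(void) {
    double base_cpi   = 1.0;
    double accesses   = 1.5;   /* memory accesses per instruction */
    double miss_rate  = 0.02;
    double mem_cycles = 100.0; /* cycles to reach main memory     */

    double with_cache    = base_cpi + accesses * miss_rate * mem_cycles; /* 1 + 3 = 4     */
    double without_cache = base_cpi + accesses * mem_cycles;             /* 1 + 150 = 151 */

    printf("CPI with cache:     %.1f\n", with_cache);
    printf("CPI without cache:  %.1f\n", without_cache);
    printf("Speedup from cache: %.1fx\n", without_cache / with_cache);
    return 0;
}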
36. Miss Penalties for Out-of-Order Execution Processors For out-of-order execution:
Miss penalty can overlap execution
Redefine memory stalls to be miss penalty of non-overlapped latency
37. Summary of Cache Performance Figure 5.9 provides all previously presented equations (a good reference):
CPU execution time
Memory stall cycles
Average memory access time
Relationship between index size, cache size, block size, set associativity
38. Improving Cache Performance Four major categories of approaches
Reducing the miss penalty
Reducing the miss rate
Reducing the miss rate or penalty via parallelism
Reducing the time to hit in the cache
39. Reducing the Miss Penalty Multilevel caches
40. Multi-level Caches Local miss rate: miss rate of an individual cache, considering only the requests it sees:
L1 cache: Miss rate(L1)
L2 cache: Miss rate(L2)
Global miss rate: overall miss rate of the caches, considering all requests:
L1 cache: Miss rate(L1)
L2 cache: Miss rate(L1) x Miss rate(L2)
Average memory stalls per instruction = Misses per instruction(L1) x Hit time(L2)
+ Misses per instruction(L2) x Miss penalty(L2)
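A minimal sketch of these definitions; the miss rates, hit time, and miss penalty below are assumed illustrative values, not figures from the slides:

#include <stdio.h>

int main(void) {
    /* Assumed illustrative values */
    double l1_miss_rate    = 0.04;  /* local = global for L1             */
    double l2_local_miss   = 0.50;  /* of the requests that reach L2     */
    double accesses        = 1.5;   /* memory accesses per instruction   */
    double l2_hit_time     = 10.0;  /* cycles                            */
    double l2_miss_penalty = 100.0; /* cycles to main memory             */

    double l2_global_miss = l1_miss_rate * l2_local_miss;

    double misses_per_instr_l1 = accesses * l1_miss_rate;
    double misses_per_instr_l2 = accesses * l2_global_miss;

    /* Average memory stalls per instruction =
       misses/instr(L1) x hit time(L2) + misses/instr(L2) x miss penalty(L2) */
    double stalls = misses_per_instr_l1 * l2_hit_time
                  + misses_per_instr_l2 * l2_miss_penalty;

    printf("L2 global miss rate: %.3f\n", l2_global_miss);
    printf("Avg memory stalls per instruction: %.2f\n", stalls);
    return 0;
}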
41. Multi-level Caches Note that local miss rate is very different than global miss rate
Local miss rate is very high for L2 caches smaller than L1 cache
Global miss rate is very close to single cache miss rate
42. Design Issues for Second Level Caches Set associative? Usually this improves performance
Block size? Usually kept the same as L1
If L2 can only be slightly bigger than L1:
Multi-level exclusion option:
L1 data is never found in L2.
Cache miss in L1 results in swap of blocks between L1 and L2
43. Improving Cache Performance Four major categories of approaches
Reducing the miss penalty
Reducing the miss rate
Reducing the miss rate or penalty via parallelism
Reducing the time to hit in the cache
44. Critical Word First Based on observation that CPU normally needs just one word of cache block
Critical Word First: on a cache miss, read the missed word from memory and send it to the CPU as soon as it arrives:
fill rest of the words while CPU is executing
Also called wrapped fetch and requested word first
45. Early Restart Similar to previous approach, but fetch words in order.
As soon as requested word arrives, send to CPU.
Both techniques are most beneficial when block size is large.
46. Giving Priority to Read Misses over Writes Serve reads before writes have completed
Requires a write buffer
Must avoid errors from memory RAW hazards
47. Giving Priority to Read Misses over Writes Wait until the write buffer is empty (slow)
Check write buffer contents on a miss (most common solution)
Similar techniques for write-back cache
When a read miss will replace a dirty block:
Copy dirty block to a buffer (to be written later)
Then read memory
Then write memory
48. Improving Cache Performance Four major categories of approaches
Reducing the miss penalty
Reducing the miss rate
Reducing the miss rate or penalty via parallelism
Reducing the time to hit in the cache
49. Merging Write Buffer
50. Victim Caches
51. Summary of Miss Penalty Techniques Add: More cache levels
Impatience: Critical word first, early restart
Preference: Reads before writes
Efficiency: Merging words in write buffer
Recycling: Victim cache
52. Improving Cache Performance Four major categories of approaches
Reducing the miss penalty
Reducing the miss rate
Reducing the miss rate or penalty via parallelism
Reducing the time to hit in the cache
53. Miss Categories Compulsory: the first access to a block (cold-start misses or first-reference misses)
Capacity: the cache cannot hold all the blocks needed (these misses would occur even in a fully associative cache)
Conflict: for set-associative or direct-mapped caches, a miss that occurs because too many blocks map to the same set or entry.
54. Miss Rates for Categories
55. Larger Block Size
56. Larger Caches An obvious way to reduce capacity misses
Tradeoffs:
Longer hit times
Higher cost
Most used in off-chip caches (L2, L3)
57. Higher Associativity Some observations:
8-way has similar performance to fully associative
2:1 rule of thumb: Direct-mapped cache of size N has about the same miss rate as 2-way set associative cache of N/2.
Increased associativity increases hit time
58. Way Prediction Way prediction: extra bits are kept in the cache to predict the way
Multiplexor is set early to select the desired block
Only a single tag comparison is performed
Miss results in checking other blocks
Reduces conflict misses over direct mapping
Maintains hit speed of direct mapping
59. Way Prediction in Alpha 21264 2-way set associative cache
Block predictor bit added to each block of instruction cache
Selects which of the 2 blocks to try on next cache access
Correct prediction: instruction latency of 1 cycle
Incorrect prediction: try the other block, change the way predictor; instruction latency of 3 cycles
60. Compiler Optimizations So far to reduce misses:
larger blocks, larger caches, higher associativity, way prediction (all changes to hardware)
Some compiler techniques used:
use profiling information to reorder instructions to avoid likely conflicts
put program entry points at the beginning of cache blocks
Transform data storage so that program operates on data within blocks.
61. Compiler Optimizations Example: Exchange nesting of loops to make code access data in the order the data is stored.
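A minimal sketch of this loop interchange in C, assuming row-major storage and an illustrative array size:

#define N 1024
double x[N][N];

/* Before: inner loop strides down a column -> poor spatial locality */
void scale_cols_first(void) {
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            x[i][j] = 2.0 * x[i][j];
}

/* After interchange: accesses follow row-major storage order,
   so consecutive references fall in the same cache block */
void scale_rows_first(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            x[i][j] = 2.0 * x[i][j];
}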
62. Summary of Reducing Miss Rate Hardware approaches
larger blocks,
larger caches,
higher associativity,
way prediction
Compiler approaches
Yet another connection between architecture implementation and the compiler
63. Improving Cache Performance Four major categories of approaches
Reducing the miss penalty
Reducing the miss rate
Reducing the miss rate or penalty via parallelism
Reducing the time to hit in the cache
64. Nonblocking Caches(lockup-free cache) Allows data cache to continue to supply cache hits during a miss
hit under miss
reduces effective miss penalty by responding to CPU hits during time waiting for miss
hit under multiple miss
overlap of hit with multiple misses
Both require significant increases in cache controller complexity
65. Hardware Prefetching Can be done with data or instructions
Directly into caches or into an external buffer
Instruction prefetch
Done with separate prefetch controller outside of cache
On a miss, prefetch 2 blocks: the missed block and the one after it.
66. Hardware Prefetching
67. Hardware Prefetching
68. External Buffer Checked on cache miss before going to main memory
If a hit, moved to cache
Another prefetch replaces block in external buffer
69. Compiler Controlled Prefetching Compiler inserts prefetch instructions
Prefetch data before it is needed
Example
70. Compiler Controlled Prefetching Two approaches:
Register prefetch: load data into CPU registers
Cache prefetch: load data into the cache only
Two types:
Nonfaulting prefetch: turns into a no-op if the memory address is protected or would cause a page fault
Also called non-binding prefetch
Most modern CPUs have nonfaulting cache prefetching
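As an illustrative sketch (not the slide's original example), the GCC/Clang __builtin_prefetch intrinsic is one way a nonfaulting, non-binding cache prefetch can be expressed in source code; the prefetch distance of 16 elements is an arbitrary assumption:

/* Illustrative only: __builtin_prefetch(addr, rw, locality) is a
   GCC/Clang intrinsic that issues a nonfaulting cache prefetch. */
void scale(double *a, long n) {
    for (long i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], 0, 1); /* read, low temporal locality */
        a[i] = a[i] * 3.0;
    }
}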
71. Improving Cache Performance Four major categories of approaches
Reducing the miss penalty
Reducing the miss rate
Reducing the miss rate or penalty via parallelism
Reducing the time to hit in the cache
72. Small and Simple Caches Comparing the tag portion of the address with the cache tag is the most time-consuming step
Direct-mapped caches are faster: tag checking can be overlapped with data transmission
Pressure to match cache access time to the CPU clock cycle has caused L1 cache sizes to level off in current computers
73. Associativity Impact
74. Pipelined Cache Access Allows effective latency to be multiple clock cycles
Advantages
Short clock cycle
Disadvantages
Slow hits
Increases number of pipeline stages
Greater branch penalties
More cycles between issue of load and data availability
75. Trace Caches Instruction cache
Cache blocks consist of traces of instructions
Traces determined dynamically by CPU
Traces include taken branches
Used with branch prediction
Complicated address mapping mechanisms
Can store the same instruction multiple times
76. Cache Optimization Summary Figure 5.26
Most optimizations only help one factor (miss rate, miss penalty, hit time)
Optimizations usually come with a hardware complexity cost
77. Overview Review memory hierarchy concepts
Review cache definitions
Cache performance
Methods for improving cache performance
Reducing miss penalty
Reducing miss rate
Reducing hit time
Main memory performance
Memory technology
Virtual memory
Memory protection
78. Main Memory Optimizations Main memory latency determines the cache miss penalty (in time or cycles)
Main memory bandwidth (bytes/time or cycles)
Most important for I/O
Important for multiprocessors
Also important for L2 caches with large block sizes
Bandwidth is easier to optimize than latency
79. Example Base Memory 4 clock cycles to send address
56 clock cycles for access time per word
4 clock cycles to send a word of data
64 clock cycles total
Cache structure:
4 word blocks
Word = 8-bytes
Miss penalty
4x(4+56+4) = 256 clock cycles
Bandwidth = 32/256 = 1/8 byte per clock cycle
80. Memory Optimization Options
81. Wider Memory Double width -> double the throughput
Cache miss -> 128 clock cycles (2 x 64)
Throughput -> 32/128 = 0.25 bytes per cycle
Multiplexor adds delay
Complicates error correction within memory
82. Simple Interleaved Memory Memory organized into separate banks
Address sent to all, reads occur in parallel
Logically a wide memory, but accesses staged over time to share memory bus
83. Effect on Throughput Throughput for our example:
4 clock cycles to send address (in parallel)
56 clock cycles for access time (in parallel)
4x4 clock cycles to send a block of data
Cache miss penalty = 4 + 56 + 16 = 76 clock cycles
Throughput = 32/76 = 0.42 bytes/cycle
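A short sketch that reproduces the numbers for the three organizations (1-word-wide, 2-word-wide, and 4-bank interleaved), using only the cycle counts stated in the example:

#include <stdio.h>

int main(void) {
    int send_addr = 4, access = 56, send_word = 4; /* cycles, from the example */
    int words_per_block = 4, bytes_per_word = 8;
    int block_bytes = words_per_block * bytes_per_word;  /* 32 bytes */

    int base        = words_per_block * (send_addr + access + send_word);       /* 256 */
    int wide2       = (words_per_block / 2) * (send_addr + access + send_word); /* 128 */
    int interleaved = send_addr + access + words_per_block * send_word;         /* 76  */

    printf("1-word wide:     %3d cycles, %.2f bytes/cycle\n", base,        (double)block_bytes / base);
    printf("2-word wide:     %3d cycles, %.2f bytes/cycle\n", wide2,       (double)block_bytes / wide2);
    printf("4-bank interlvd: %3d cycles, %.2f bytes/cycle\n", interleaved, (double)block_bytes / interleaved);
    return 0;
}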
84. Writes with Interleaved Memory Back to back writes can be pipelined
If the accesses are to separate banks
Avoids stalling while waiting for the first write to complete
85. Address Mapping for Interleaved Memory
86. Independent Memory Banks Allow multiple independent accesses
Multiple memory controllers
Banks operate somewhat independently
More expensive, but more flexible
Useful for non-blocking caches
Useful for multi-processors
87. Overview Review memory hierarchy concepts
Review cache definitions
Cache performance
Methods for improving cache performance
Reducing miss penalty
Reducing miss rate
Reducing hit time
Main memory performance
Memory technology
Virtual memory
Examples
88. Memory Technology Memory latency
access time: time between a read request and data arrival
cycle time: time between successive requests
not equivalent, due to the time required for setup of the address lines
Memory types
DRAM
SRAM
89. DRAM Dynamic RAM: must be refreshed to hold its contents (the contents are not static)
Information stored on capacitors
Storage cells very small
Large capacity
Low cost per bit
Slower than SRAM
Usually packaged as a set of chips
90. DRAM Internal Organization
91. SRAM Static RAM: contents are statically held by a feedback circuit
6 transistors per storage cell
Good for low power devices (no refresh)
Faster, but smaller, than DRAM
Address lines not multiplexed
No difference between access time and cycle time
Used to implement caches
92. DRAM vs SRAM For similar technologies:
Speed
SRAM is 8-16 times faster
Capacity
DRAM is 4-8 times larger
Cost
SRAM is 8-16 times more expensive per bit
93. Embedded System Memory Issues Non-volatile storage (Desktops use disks)
ROM: read-only memory
Various types; some can be user-programmed:
PROM, EPROM, EEPROM
Flash memory: similar to EEPROM
Very slow to write (10-100 times slower than DRAM)
94. DRAM Performance Improvements Internal organizations
Repeated accesses to a single row (fast page mode)
Synchronous memory controller (SDRAM)
Transfer data on both edges of DRAM clock signal
Double Data Rate (DDR DRAM)
All require some internal hardware overhead
95. DRAM Performance Improvements Interface bus designs
RAMBUS (DRDRAM)
Single chip that includes interleaved DDR SDRAM and high-speed interface
Packaged in a RIMM: same package size as a DIMM, but incompatible
Improved bandwidth, but not latency
Expensive (2x DRAM DIMMs of same size)
96. Virtual Memory Analogous to the cache/main Memory interface:
a cache block is similar to a VM page (or segment)
Manages the main Memory/secondary storage interface
97. Virtual Memory Manages sharing of protected memory space
Relocation mechanism for loading programs for execution from any location
CPU produces virtual address
Memory mapping (address translation)
Translation of virtual address to physical address
98. Cache and Virtual Memory Comparison Virtual memory is larger, with larger page sizes, much longer access times, but also much smaller miss rates
99. Memory Hierarchy Questions Where can a page be placed in memory?
How is a page found if it is in main memory?
Which page should be replaced in a virtual memory miss?
What happens on a write?
100. Placement in Memory Very high miss penalty when reading from disk
Must keep miss rate very low
Pages can be placed anywhere in main memory (fully associative)
101. Finding Pages in Main Memory Indexed by page or segment number
Page table maps virtual page to physical address
102. Address Translation Page table must be large enough to hold number of pages in virtual address space
Translation Lookaside Buffer (TLB or TB): a cache often used for page table entries
TLB contains portions of the virtual address and the physical page frame number
Includes dirty bits, use bits, protection field, and valid bits
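A hedged sketch of the translation step using a one-level page table; the 4 KB page size and tiny table size are assumptions for illustration, and a real TLB would simply cache recent (virtual page, physical frame) pairs together with the status bits listed above:

#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE 4096u                /* assumed 4 KB pages        */
#define NUM_PAGES 1024u                /* tiny illustrative table   */

static uint32_t page_table[NUM_PAGES]; /* virtual page -> physical frame */

/* Translate a virtual address using a one-level page table.
   A miss (unmapped page) would be a page fault in a real system. */
uint32_t translate(uint32_t vaddr) {
    uint32_t vpage  = vaddr / PAGE_SIZE;
    uint32_t offset = vaddr % PAGE_SIZE;
    uint32_t frame  = page_table[vpage];
    return frame * PAGE_SIZE + offset;
}

int main(void) {
    page_table[3] = 7;                 /* map virtual page 3 to frame 7 */
    uint32_t va = 3u * PAGE_SIZE + 0x10u;
    printf("0x%x -> 0x%x\n", (unsigned)va, (unsigned)translate(va));
    return 0;
}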
103. Alpha 21264 TLB
104. Replacement Mechanisms To minimize page faults, the LRU method is used
A use bit (or reference bit) is used to keep track:
Set whenever a page is accessed
Recorded and cleared periodically by the operating system
105. What Happens on Write? Write back is always used; disk access time is too long to consider write-through
106. Big Picture Memory Hierarchy
107. Cache and Superscalar CPU Multi-issue CPUs require high bandwidth from the cache
Multiple cache ports for multiple instruction fetches per clock cycle
Cache must be nonblocking (to allow hits under misses)
108. Speculative Execution and Memory Memory system must identify speculatively executed instructions
Exceptions for speculative instructions suppressed
Nonblocking cache so that stalls do not occur on miss for speculative instruction
109. Embedded System Memory Real time constraints require very little performance variability
Caches increase variability (but improve average performance)
Instruction caches are predictable and widely used
Caches save power (on-chip access vs off-chip access)
Way prediction can lower power by activating only a subset of the comparators at a time.
110. I/O and Memory Problem of keeping cache consistent with Memory when I/O can write to Memory
Connect I/O to cache (makes cache slow)
Use write-through cache (not common for L2)
Make I/O portions of memory noncacheable
OS flushes cache blocks before accepting I/O
Hardware controller to handle cache consistency with I/O
111. Example Memory Organizations Alpha 21264
out of order execution
fetches up to 4 instructions per clock cycle
Emotion Engine of Sony Playstation
embedded processor
high demands for audio and graphics
Sun Fire 6800 Server
commercial computing: database applications
special features for availability and maintainability
113. Alpha 21264 41-bit physical address (43-bit virtual address)
Physical address space:
Instruction cache uses way prediction
L1 miss penalty: 16 cycles
L2 miss penalty: 130 cycles
114. Emotion Engine Sony Playstation
115. Sony Playstation Lots of I/O to interface with memory
10-channel DMA (Direct Memory Access)
I/O interface
I/O processor with several interfaces
Memory embedded with all processors
Dedicated buses
1-cycle latency for all embedded memories
116. Interesting Features LARGE chip sizes:
Emotion Engine: 225 mm²
Graphics Synthesizer: 279 mm²
(The Alpha 21264 is about 160 mm²)
9 distinct independent memory modules
Programmer (compiler) must keep memories consistent
117. Sun Fire 6800 Server Midrange multi-processor server
Databases with large code sizes
Many capacity and conflict misses
Lots of context switching between processes
Multiprocessor have cache coherency problem
118. Sun Fire 6800 Server
119. Interesting Features HUGE number of I/O pins (1368)
Large number of wide memory paths per processor
Peak bandwidth 11 GB/sec
To improve latency, tags for L2 cache are on-chip.
To improve reliability
error correction bits
8-bit back door diagnostic bus
redundant path between processors
dual redundant system controllers
120. Review Chapters 1-4 (and Appendix A)
121. Computer Markets Desktop
price range from under $1,000 up to $10,000
optimize price-performance
Servers
optimize availability and throughput (i.e. transactions per minute)
scalability is important
Embedded computers
computers that are just one component of a larger system (cell phones, printers, network switches, etc)
widest range of processing power and cost
real-time performance requirements
often power and memory must be minimized
Three basic markets that computer architects design for.
122. Design Functional Requirements Application area (desktop, servers, embedded computers)
Level of software compatibility (programming level, binary compatible)
Operating system requirements (address space size, memory management)
Standards (floating point, I/O, networks, etc.) At the top of the design process is determining the functional requirements for an architecture.
123. Design - Technology Trends Integrated Circuit Technology
Semiconductor DRAM
Magnetic Disk Technology
Network technology. At the implementation end of the process, it is important to understand technology trends.
124. IC Manufacturing Process
125. Computer Performance How do we measure it?
Application run time
Throughput: number of jobs per second
Response time
Importance of each term depends on the application:
Application run time: normal PC user
Throughput: server applications
Response time: real-time applications, transaction processing
126. Measuring Performance Execution time: the actual time between the beginning and end of a program. Includes I/O, memory access, everything.
Performance: the reciprocal of execution time
We will focus on execution time
127. Benchmarks Some typical benchmarks:
Whetstone
Dhrystone
Benchmark suites collections of benchmarks with different characteristics
SPEC: Standard Performance Evaluation Corporation (www.spec.org)
Many types (desktop, server, transaction processing, embedded computer)
128. Design Guidelines Make the common case fast when making design tradeoffs
Amdahl's Law:
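In the usual notation:
Speedup_overall = 1 / ((1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)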
129. CPU Performance Equations For a particular program:
CPU time = CPU clock cycles x clock cycle time.
clock cycle time = 1 / clock rate
Considering instruction count:
cycles per instruction (CPI) = CPU clock cycles / instruction count
130. CPU Performance Equation CPU time = Instruction count x clock cycle time x CPI
or
CPU time = (Instruction count x CPI) / clock rate. A 2x improvement in any of these factors is a 2x improvement in CPU time.
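A small worked example of the equation (instruction count, CPI, and clock rate below are assumed values):

#include <stdio.h>

int main(void) {
    double ic         = 1e9;   /* instruction count (assumed) */
    double cpi        = 2.0;   /* cycles per instruction      */
    double clock_rate = 1e9;   /* 1 GHz -> cycle time = 1 ns  */

    double cpu_time = ic * cpi * (1.0 / clock_rate); /* IC x CPI x cycle time */
    printf("CPU time: %.2f s\n", cpu_time);          /* 2.00 s */

    /* Halving CPI (or IC, or cycle time) halves CPU time */
    printf("With CPI/2: %.2f s\n", ic * (cpi / 2.0) / clock_rate);
    return 0;
}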
131. MIPS as Performance Measure MIPS: Millions of Instructions Per Second
Used in conjunction with benchmarks
Dhrystone MIPS
Can be computed as follows: MIPS = instruction count / (execution time x 10^6) = clock rate / (CPI x 10^6)
132. Locality Program property that is often exploited by architects to achieve better performance.
Programs tend to reuse data and instructions. (For many programs, 90% of execution time is spent running 10% of the code)
Temporal locality: recently accessed items are likely to be accessed in the near future.
Spatial locality: items with addresses near one another tend to be referenced close together in time.
133. Parallelism One of the most important methods for improving performance.
System level: multiple CPUs, multiple disks
CPU level: pipelining, multiple functional modules
Digital design level: carry-lookahead adders
134. Common Mistakes Using only clock rate to compare performance
Even if the processor has the same instruction set, and same memory configuration, performance may not scale with clock speed.
135. Common Mistakes Comparing hand-coded assembly and compiler-generated, high-level language performance.
Huge performance gains can be obtained by hand-coding critical loops of benchmark programs.
Important to understand how benchmark code is generated when comparing performance.
136. Chapter 1 Review History of processor development
Different processor markets
Issues in CPU design
Economics of CPU design and implementation
Computer performance measures, formulas, results
Design guidelines: Amdahl's Law, locality, parallelism
137. Chapter 2 Overview Taxonomy of ISAs (Instruction Set Architectures)
The Role of Compilers
Example: The MIPS architecture
Example: The Trimedia TM32 CPU
138. ISA Classification Based on operand location
Memory addressing
Addressing modes
Type and size of operands
Operations in the instruction set
Instruction flow control
Instruction set encoding
139. Four Basic Types
140. Trends Almost all modern processors use a load/store architecture
Registers are faster to access than memory
Registers are more convenient for compilers to use than other forms (like stacks)
General purpose registers are more convenient for compilers than special purpose registers (like accumulators)
141. Conclusions about Memory Addressing Most important non-register addressing modes:
displacement, immediate, register indirect
Size of displacement field should be at least 12-16 bits.
Size of immediate field should be at least 8-16 bits.
142. Flow Control Terminology
Transfer instructions (old term)
Branch will be used for conditional flow control
Jump will be used for unconditional program flow control
Procedure calls
Procedure returns
143. Addressing Modes for Flow Control Flow control instructions must specify destination address
PC-relative - Specify displacement from program counter (PC)
Requires fewer bits than other modes
Practical since branch target is often nearby
Allows code to be independent of its location in memory (position-independence)
Other modes must be used for returns and indirect jumps
144. The Anatomy of Compilers
145. Compilers and ISAs First goal is correctness
Second goal is speed of resulting code
Other goals:
Speed of compilation
Debugging support
Interoperability with other languages
First goal (correctness) is complex, and limits the complexity of optimizations.
146. Types of Optimizations High-level: optimizations on source code, fed to lower-level optimizations
Local optimizations: optimize code only within a straight-line code fragment
Global optimizations: extend local optimizations across branches and apply transformations to optimize loops
Register allocation: associate registers with operands
Processor-dependent optimizations: take advantage of specific architecture features
147. Example: MIPS ISA
64-bit load-store architecture
Full instruction set explained in Appendix C (on text web page)
148. MIPS Instructions Loads and Stores
ALU operations
Branches and Jumps
Floating point operations
149. Floating Point Operations Single and double-precision operations indicated by .S and .D
MOV.S and MOV.D copy registers of the same type
Special instructions for moving data between FPRs and GPRs
Conversion instructions convert integers to floating point and vice versa
150. Media Processor: Trimedia TM32 Dedicated to multimedia processing
data communication
audio coding
video coding
video processing
graphics
Operate on narrower data than PCs
Operate on data streams
Typically found in set-top boxes
151. Unique Features of TM32 Lots of registers: 128 (32-bit)
Registers can be either integer or floating-point
SIMD instructions available
Both 2's complement and saturating arithmetic available
Programmer can specify five independent instructions to be issued at the same time!
NOPs are placed in slots if 5 are not available
VLIW (very long instruction word) coding technique
152. Instruction Set Design: Pitfalls Designing a high-level instruction set feature specifically oriented to supporting a high-level language structure
Often makes instructions too specialized to be useful
Innovating at the ISA to reduce code size without accounting for the compiler
Compilers can make much more impact
Use optimized code when considering changes
153. Instruction Set Design: Fallacies There is such a thing as a typical program
programs vary widely in how they use instruction sets
An architecture with flaws cannot be successful
The 80x86 is a case in point
A flawless architecture can be designed
All designs contain tradeoffs
Technologies change, making previous good decisions bad
154. ISA Conclusions: Trends in the 1990s Address size: 32-bit -> 64-bit
Addition of conditional execution instructions
Optimization of cache performance via prefetch
Support for multimedia
Faster floating-point operations
155. ISA Conclusions: Trends in the 2000s Long instruction words
Increased conditional execution
Blending of DSP and general purpose architectures
80x86 emulation
156. Appendix A Overview Introduction
Pipeline concepts
Basics of RISC instruction set
Classic 5-stage pipeline
Pipeline Hazards
Stalls, structural hazards, data hazards
Branch hazards
Pipeline Implementation
Simple MIPS pipeline
Implementation Difficulties for Pipelines
Exceptions, instruction set complications
Extending MIPS Pipeline to Multicycle operations
Example: MIPS R4000 Pipeline
157. CPU Pipelining
158. Performance and Pipelining For the following assumptions:
N stages in the pipeline
Unpipelined execution time for 1 instruction is T
Pipeline stages are equal and perfectly balanced
Then
Execution time for pipelined version = T / N per instruction
Throughput increase is N
159. Pipeline Execution
160. Instruction Timing Throughput is increased by approximately a factor of 5
Execution time of an individual instruction INCREASES due to pipelining overhead:
Pipeline register delay
Clock skew (T = T_CL + T_su + T_reg + T_skew)
Important to balance pipeline stages, since the clock is matched to the slowest stage (T_CL)
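A brief sketch of how the per-stage overhead erodes the ideal speedup; the latencies below are assumed values:

#include <stdio.h>

int main(void) {
    /* Assumed illustrative latencies in ns */
    double t_unpipelined = 10.0;  /* total logic delay of one instruction        */
    int    n_stages      = 5;
    double t_overhead    = 0.5;   /* register setup/delay + clock skew per stage */

    double t_cl    = t_unpipelined / n_stages;  /* slowest-stage logic (balanced stages) */
    double t_clock = t_cl + t_overhead;         /* T = T_CL + T_su + T_reg + T_skew      */

    printf("Ideal speedup:  %d\n", n_stages);
    printf("Actual speedup: %.2f\n", t_unpipelined / t_clock); /* < N due to overhead */
    return 0;
}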
161. Pipeline Hazards Structural hazards: resource conflicts when more than one instruction needs a resource
Data hazards: an instruction depends on a result from a previous instruction that is not yet available
Control hazards: conflicts from branches and jumps that change the PC
162. Forwarding A solution for data hazards
Also called bypassing or short-circuiting
Create potential datapath from where result is calculated to where it is needed by another instruction
Detect hazard to route the result
Example:
163. Branch Hazards
164. Pipeline Implementation Details of pipeline implementation
So that other issues can be explored
Look at non-pipelined implementation first
Focus on integer subset of MIPS
Load-store word
Branch equal zero
Integer ALU operations
Basic principles can be extended to all instructions
165. Multicycle Datapath
166. Pipeline Control
167. Branches in Pipeline Consider only BEQZ and BNEZ (branch if equal to zero or not equal to zero)
For these it is possible to move the test to the ID stage
To take advantage of early decision, target address must also be computed early
Must add another adder for computing target address in ID
Result is 1-cycle stall on branches.
Branches on result of register from previous ALU operation will result in a data hazard stall.
168. Exceptions and Pipelines Exceptions can come from several sources and can be classified several ways
Sources
I/O Device Interrupt
Invoking OS from user program
Tracing program execution
Breakpoint
Integer arithmetic overflow or underflow, FP trap
Page fault
Misaligned memory accesses
Memory protection violation
Undefined instruction
Hardware malfunction
Power failure
169. FP Pipeline
170. Review of Hazards Caused by different lengths of execution unit pipelines.
Structural hazards: multiple instructions need the same functional unit at the same time
RAW data hazards: an instruction needs to read a value that has not been written yet
WAW data hazards: writes occur out of order
171. MIPS R4000 Pipeline Implements MIPS-64
Deeper pipeline (8 stages): superpipelined
Higher clock rate (smaller logic in each stage)
Additional stages from decomposing memory accesses
172. Appendix A Summary For ideal N-stage pipeline, throughput increase is N over a non-pipelined architecture
Ideal pipelined cpu has CPI=1
Pipelining has advantages of
significant speedup with moderate hardware costs
invisible to programmer
Pipeline challenges include
Structural hazards
Data hazards
Control hazards
Exceptions
Floating point operations
173. Solutions include
Stalls
Forwarding
Buffering state (for exceptions)
Branch delay slots
Branch prediction
Several multi-cycle execution units for FP
Appendix A Summary
174. Chapter 3 Overview Instruction Level Parallelism
Data Dependence and Hazards
Dynamic Scheduling
Dynamic Hardware Prediction
High-Performance Instruction Delivery
Multiple Issue
Hardware-Based Speculation
175. Instruction Level Parallelism Definition: Potential to overlap the execution of instructions
Pipelining is one example
Limitations of ILP are from data and control hazards
Two approaches to overcoming limitations
dynamic approaches with hardware (Chapter 3)
static approaches that use software (Chapter 4)
176. CPI for Pipelines CPI = Ideal pipeline CPI + structural stalls + data hazard stalls + control stalls
Pipeline performance is sometimes measured in IPC (Instructions Per Clock cycle) = 1/CPI
Must fully understand dependencies and hazards to see how much parallelism exists and understand how it can be exploited.
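A tiny sketch of the CPI/IPC relationship above, with assumed stall components:

#include <stdio.h>

int main(void) {
    /* Assumed stall contributions, in cycles per instruction */
    double ideal_cpi = 1.0, structural = 0.05, data = 0.30, control = 0.15;

    double cpi = ideal_cpi + structural + data + control;
    printf("CPI = %.2f, IPC = %.2f\n", cpi, 1.0 / cpi);
    return 0;
}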
177. Data Dependencies Data dependencies (true data dependencies)
Name dependencies
Control dependencies
An instruction j is data dependent on instruction i if:
instruction i produces a result that may be used by instruction j, or
instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i.
178. Dynamic Scheduling Hardware rearranges instruction execution to reduce stalls due to dependencies
Can handle some cases where dependencies are not known at compile time
Simplifies the compiler
Allows code compiled for one pipeline to be run on another (invisible to the compiler)
Results in significant hardware complexity
179. Tomasulo's Algorithm/Approach Approach requires tracking instruction dependencies to avoid RAW hazards
Requires register renaming to avoid WAR and WAW hazards
180. MIPS FP Unit using Tomasulo's Approach
181. Dynamic Hardware Prediction Using hardware to dynamically predict the outcome of a branch
Prediction depends on the behavior of the branch at run time
Effectiveness depends on two things:
accuracy in predicting branch
cost for correct and incorrect predictions
182. Multiple-Issue Processors Superscalar processors issue varying numbers of instructions per clock
statically scheduled or
dynamically scheduled
VLIW (Very Long Instruction Word) processors issue a fixed number of instructions
formatted as one large instruction or
a fixed instruction packet - also called EPIC (Explicitly Parallel Instruction Computers)
VLIW/EPIC are inherently statically scheduled
183. Example Multiple-Issue Processors
184. Statically Scheduled Superscalar Processors Instructions are issued in order
All hazards checked dynamically at issue time
Variable number of instructions issued per clock cycle
Require the compiler techniques in Chapter 4 to be efficient.
185. Dynamically Scheduled Superscalar Processor Dynamic scheduling does not restrict the types of instructions that can be issued on a single clock cycle.
Think of it as Tomasulo's approach extended to support multiple issue
Allows N instructions to be issued whenever reservation stations are available.
Branch prediction is used for fetch and issue (but not execute)
186. Speculative Superscalar Pipelines
187. Multiple Issue with Speculation For an architecture based on Tomasulo's approach, the following are obstacles:
Instruction Issue block
Single CDB
Add dispatch buffer
Use multiport reorder buffer
188. Chapter 4 Overview Basic Compiler Techniques
Pipeline scheduling
loop unrolling
Static Branch Prediction
Static Multiple Issue: VLIW
Advanced Compiler Support for Exposing ILP
Detecting loop-level parallelism
Software pipelining: symbolic loop unrolling
Global code scheduling
Hardware support for exposing more parallelism
Conditional or predicated instructions
Compiler speculation with hardware support
Hardware vs Software speculation mechanisms
Intel IA-64 ISA
189. Loop Unrolling Eliminate some of the overhead by unrolling the loop (fully or partially).
Need to adjust the loop termination code
Allows more parallel instructions in a row
Allows more flexibility in reordering
Usually requires register renaming
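A hedged C sketch of unrolling by 4; the unroll factor and the clean-up loop (the adjusted termination code) are illustrative:

/* Original loop */
void scale(double *x, double s, int n) {
    for (int i = 0; i < n; i++)
        x[i] = x[i] * s;
}

/* Unrolled by 4: less loop overhead per element and more independent
   operations in a row for the scheduler to reorder. */
void scale_unrolled(double *x, double s, int n) {
    int i;
    for (i = 0; i + 3 < n; i += 4) {
        x[i]     = x[i]     * s;
        x[i + 1] = x[i + 1] * s;
        x[i + 2] = x[i + 2] * s;
        x[i + 3] = x[i + 3] * s;
    }
    for (; i < n; i++)           /* clean-up loop for leftover iterations */
        x[i] = x[i] * s;
}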
190. Limits to Loop Unrolling Eventually the gains from removing loop overhead diminish
Remaining loop overhead amortization
Code size limitations
Embedded applications
Increase in cache misses
Compiler limitations
Shortfall in registers:
Increase in the number of live values beyond the number of registers
191. Detecting Parallelism Loop-level parallelism
Analyzed at the source level
requires recognition of array references, loops, indices.
Loop-carried dependence: a dependence of one loop iteration on a previous iteration.
for (k=1; k<=100; k=k+1) {
A[k+1] = A[k] + B[k];
}
192. Finding Dependencies Important for
Efficient scheduling
Determining which loops to unroll
Eliminating name dependencies
Makes finding dependencies difficult:
Arrays and pointers in C or C++
Pass by reference parameter passing in FORTRAN
193. Dependencies in Arrays An array index i is affine if it has the form:
a x i + b (for a one-dimensional array)
An index of a multiple-dimension array is affine if the index in each dimension is affine
Common example of a non-affine index:
x[y[i]] (indirect array addressing)
For two affine indices, a x i + b and c x i + d, a dependence can exist only if:
GCD(c, a) divides (d - b) evenly
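A small sketch of this GCD test in C, for two references of the form A[a*i + b] and A[c*i + d] in the same loop; the example values mirror the loop on the earlier slide (A[k+1] = A[k] + B[k]):

#include <stdio.h>

static int gcd(int a, int b) {
    while (b != 0) { int t = a % b; a = b; b = t; }
    return a < 0 ? -a : a;
}

/* GCD test: a dependence between references A[a*i + b] (write) and
   A[c*i + d] (read) is possible only if gcd(a, c) divides (d - b). */
int may_depend(int a, int b, int c, int d) {
    return (d - b) % gcd(a, c) == 0;
}

int main(void) {
    /* A[k+1] = A[k] + B[k]  ->  write a=1,b=1; read c=1,d=0 */
    printf("dependence possible: %d\n", may_depend(1, 1, 1, 0)); /* prints 1 */
    return 0;
}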
194. Software Pipelining Interleaves instructions from different iterations of a loop without unrolling
each iteration is made from instructions from different iterations of the loop
software counterpart to Tomasulo's algorithm
start-up and finish-up code required
195. Software Pipelining
196. Global Code Scheduling Loop unrolling and software pipelining:
improve ILP when loop bodies are straight-line code (no branches)
Control flow (branches) within loops makes both more complex:
will require moving instructions across branches
Global code scheduling: moving instructions across branches
197. Trace Scheduling Advantages
eliminates some hard decisions in global code scheduling
good for code such as scientific programs with intensive loops and predictable behavior
Disadvantages
significant overhead in compensation code when trace must be exited
198. Review Loop unrolling
Software pipelining
Trace scheduling
Global code scheduling
Problems
Unpredictable branches
Dependencies between memory references
199. Compiler Speculation To speculate ambitiously, must have
The ability to find instructions that can be speculatively moved and not affect program data flow.
The ability to ignore exceptions in speculated instructions, until it is certain they should occur.
The ability to speculatively interchange loads and stores which may have address conflicts.
The last two require hardware support.
200. Hardware vs Software Speculation Disambiguation of memory references:
Software: hard to do at compile time if the program uses pointers
Hardware: dynamic disambiguation is possible, supporting reordering of loads and stores in Tomasulo's approach
Support for speculative memory references can help the compiler, but the overhead of recovery is high.
201. Intel IA-64 Architecture and Itanium Implementation IA-64:
Instruction set architecture
Instruction format
Examples of explicit parallelism support
Predication and speculation support
Itanium Implementation
Functional units and instruction issue
Performance
202. Conclusions Multi-issue processors only achieve high performance with much investment in silicon area and hardware complexity
No clear winner in hardware or software approaches to ILP in general
Software helps for conditional instructions and speculative load support
Hardware helps for scoreboard type scheduling, dynamic branch prediction, local checking for speculated load correctness