
Instruction Set Architectures Performance issues ALUs Single Cycle CPU



  1. The Story so far: • Instruction Set Architectures • Performance issues • ALUs • Single Cycle CPU • Multicycle CPU: datapath, control, exceptions • Pipelining • Memory systems • Static/Dynamic RAM technologies • Cache structures: direct mapped, associative • Virtual Memory Tarun Soni, Summer’03

  2. Memory • Memory systems

  3. Memory Systems [Block diagram: the five classic components of a computer: Control, Datapath, Memory, Input, Output]

  4. Technology Trends (from 1st lecture) • Capacity / speed (latency) growth: Logic: 2x in 3 years / 2x in 3 years; DRAM: 4x in 3 years / 2x in 10 years; Disk: 4x in 3 years / 2x in 10 years • DRAM generations (year: size, cycle time): 1980: 64 Kb, 250 ns; 1983: 256 Kb, 220 ns; 1986: 1 Mb, 190 ns; 1989: 4 Mb, 165 ns; 1992: 16 Mb, 145 ns; 1995: 64 Mb, 120 ns • That is a 1000:1 gain in capacity against only a 2:1 gain in cycle time!

  5. Who Cares About the Memory Hierarchy? [Chart: processor vs. DRAM performance, 1980 to 2000, log scale] • Processor-DRAM memory gap (latency): µProc speed grows 60%/yr (2X/1.5 yr, “Moore’s Law”); DRAM speed grows 9%/yr (2X/10 yrs); the processor-memory performance gap grows 50% per year

  6. Today’s Situation: Microprocessor • Processor speeds: Intel Pentium III ~1.4 GHz • Memory speeds: Mac/PC/workstation DRAM ~50 ns • Disks are even slower... • How can we span this “access time” gap? • 1 instruction fetch per instruction; roughly 15 of every 100 instructions also do a data read or write (load or store) • Rely on caches to bridge the gap • Microprocessor-DRAM performance gap, as the time of a full cache miss measured in instructions executed: 1st Alpha (7000): 340 ns/5.0 ns = 68 clks x 2-issue, or 136 instructions; 2nd Alpha (8400): 266 ns/3.3 ns = 80 clks x 4-issue, or 320 instructions; 3rd Alpha (t.b.d.): 180 ns/1.7 ns = 108 clks x 6-issue, or 648 instructions • 1/2X latency x 3X clock rate x 3X instr/clock => ~5X the gap (the sketch below reproduces this arithmetic)
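A minimal Python sketch (not from the original deck) reproducing the miss-cost arithmetic above; the latency, cycle-time, and issue-width figures are the slide's own, and the output matches the slide's numbers up to rounding:

    # Instruction slots "lost" to one full cache miss, per the slide's Alpha data.
    configs = [
        # (name, miss latency in ns, cycle time in ns, instructions issued/clock)
        ("1st Alpha (7000)", 340.0, 5.0, 2),
        ("2nd Alpha (8400)", 266.0, 3.3, 4),
        ("3rd Alpha (t.b.d.)", 180.0, 1.7, 6),
    ]
    for name, miss_ns, cycle_ns, width in configs:
        clks = miss_ns / cycle_ns          # clocks spent waiting for DRAM
        lost = clks * width                # instruction issue slots wasted
        print(f"{name}: {clks:.0f} clks x {width} = {lost:.0f} instructions")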

  7. Impact on Performance • Suppose a processor executes at clock rate = 200 MHz (5 ns per cycle), CPI = 1.1, with 50% arith/logic, 30% ld/st, and 20% control instructions • Suppose that 10% of memory ops get a 50-cycle miss penalty • CPI = ideal CPI + average stalls per instruction = 1.1 (cycles) + (0.30 (data mops/instr) x 0.10 (misses/data mop) x 50 (cycles/miss)) = 1.1 + 1.5 = 2.6 • 58% of the time the processor is stalled waiting for memory! • A 1% instruction miss rate would add an additional 0.5 cycles to the CPI (worked through in the sketch below)
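The stall arithmetic from this slide as a small Python sketch; every number below is the slide's own:

    ideal_cpi      = 1.1
    ld_st_fraction = 0.30   # fraction of instructions that are loads/stores
    miss_rate      = 0.10   # fraction of data memory ops that miss
    miss_penalty   = 50     # cycles per miss

    data_stalls = ld_st_fraction * miss_rate * miss_penalty  # 1.5 cycles/instr
    cpi = ideal_cpi + data_stalls                            # 2.6
    print(f"CPI = {cpi:.1f}, stalled {data_stalls / cpi:.0%} of the time")

    # A 1% instruction miss rate (1 fetch per instruction) adds 0.5 cycles:
    print(f"CPI with 1% I-miss = {cpi + 1.0 * 0.01 * miss_penalty:.1f}")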

  8. Memory system: hierarchical [Diagram: processor (control + datapath) backed by a chain of successively larger memories] • Moving away from the processor, speed runs from fastest to slowest, size from smallest to biggest, and cost per byte from highest to lowest

  9. Why hierarchy works [Plot: probability of reference vs. address over the address space 0 to 2^n - 1, showing a sharp peak] • The Principle of Locality: programs access a relatively small portion of the address space at any instant of time

  10. Locality [Sketch: over time t, references cluster in a “likely reference zone” of the address space] • Property of memory references in “typical programs”: a tendency to favor a portion of their address space at any given time • Temporal: a tendency to reference locations recently referenced • Spatial: a tendency to reference locations “near” those recently referenced

  11. Memory Hierarchy: How Does it Work? [Diagram: blocks X and Y moving between an upper-level and a lower-level memory, to and from the processor] • Temporal locality (locality in time): keep the most recently accessed data items closer to the processor • Spatial locality (locality in space): move blocks consisting of contiguous words to the upper levels

  12. Memory Hierarchy: How Does it Work? • Memory hierarchies exploit locality by caching (keeping close to the processor) data likely to be used again • This works because we can build large, slow memories and small, fast memories, but we can’t build large, fast memories • If it works, we get the illusion of SRAM access time with disk capacity • SRAM access times are 2-25 ns at a cost of $100 to $250 per MByte; DRAM access times are 60-120 ns at $5 to $10 per MByte; disk access times are 10 to 20 million ns at $0.10 to $0.20 per MByte

  13. Memory Hierarchy: Terminology • Hit: the data appears in some block in the upper level (example: block X) • Hit rate: the fraction of memory accesses found in the upper level • Hit time: time to access the upper level, consisting of the RAM access time + the time to determine hit/miss • Miss: the data must be retrieved from a block in the lower level (block Y) • Miss rate = 1 - hit rate • Miss penalty: time to replace a block in the upper level + time to deliver the block to the processor • Hit time << miss penalty (the sketch below puts numbers on these terms)
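To make the terminology concrete, a tiny sketch with assumed numbers; the hit time, miss rate, and miss penalty below are illustrative, not from the slide:

    hit_time     = 1      # cycles to access the upper level
    miss_rate    = 0.05   # = 1 - hit rate
    miss_penalty = 50     # cycles to bring the block in from the lower level

    # Every access pays the hit time; a miss additionally pays the penalty.
    amat = hit_time + miss_rate * miss_penalty
    print(f"average memory access time = {amat} cycles")  # 1 + 0.05*50 = 3.5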

  14. Memory Hierarchy of a Modern Computer System • By taking advantage of the principle of locality: present the user with as much memory as is available in the cheapest technology, and provide access at the speed offered by the fastest technology • The levels, processor outward: registers (speed ~1 ns, size 100s of bytes); on-chip and second-level cache, SRAM (~10s of ns, KBytes); main memory, DRAM (~100s of ns, MBytes); secondary storage, disk (~10,000,000s of ns = 10s of ms, GBytes); tertiary storage (~10,000,000,000s of ns = 10s of sec, TBytes)

  15. Main Memory Background • Performance of main memory: latency (= cache miss penalty), split into access time (time between the request and the word arriving) and cycle time (time between requests); bandwidth (matters for I/O and for large-block miss penalties, e.g. L2) • Main memory is DRAM: Dynamic Random Access Memory • Dynamic since it needs to be refreshed periodically (~every 8 ms) • Addresses are divided into 2 halves (memory as a 2D matrix): RAS (Row Access Strobe) and CAS (Column Access Strobe) • Cache uses SRAM: Static Random Access Memory • No refresh (6 transistors/bit vs. 1 transistor/bit), and the address is not divided • Size: DRAM/SRAM ~4-8x; cost and cycle time: SRAM/DRAM ~8-16x

  16. Static RAM Cell [Diagram: 6-transistor SRAM cell with a word (row select) line and complementary bit / bit-bar lines] • Write: 1. drive the bit lines (bit = 1, bit-bar = 0); 2. select the row • Read: 1. precharge bit and bit-bar to Vdd; 2. select the row; 3. the cell pulls one line low; 4. a sense amp on the column detects the difference between bit and bit-bar • In some cell variants the PMOS pull-ups are replaced with resistive pull-ups to save area

  17. Typical SRAM Organization: 16-word x 4-bit [Diagram: a 16 x 4 array of SRAM cells; an address decoder on A0-A3 drives word lines Word 0 through Word 15; each bit column has a write driver / precharger on Din (gated by WrEn and Precharge) and a sense amp producing Dout]

  18. Problems with SRAM [Diagram: SRAM cell with Select = 1, showing the on/off states of P1/P2 and N1/N2 when a zero is stored, with bit = 1 and bit-bar = 0 precharged] • Six transistors use up a lot of area • Consider a “zero” stored in the cell: transistor N1 will try to pull “bit” to 0, and transistor P2 will try to pull “bit bar” to 1 • But the bit lines are precharged high: are P1 and P2 really necessary?

  19. Logic Diagram of a Typical SRAM [Symbol: a 2^N-word x M-bit SRAM with N address lines (A), M data lines (D), WE_L, and OE_L] • Write Enable is usually active low (WE_L) • Din and Dout are combined into D to save pins, using a new output enable signal (OE_L): with WE_L asserted and OE_L deasserted, D is the data input pin; with WE_L deasserted and OE_L asserted, D is the data output pin [Timing diagrams: a write cycle with write setup and hold times around WE_L, and read cycles with the read access time measured from the address, D staying high-Z until OE_L is asserted]

  20. 1-Transistor Memory Cell (DRAM) [Diagram: a single transistor plus storage capacitor at the crossing of a row-select line and a bit line] • Write: 1. drive the bit line; 2. select the row • Read: 1. precharge the bit line to Vdd; 2. select the row; 3. the cell and bit line share charge, causing only a very small voltage change on the bit line; 4. sense (with a fancy sense amp that can detect changes of ~1 million electrons); 5. write the value back to restore it (the read is destructive) • Refresh: just do a dummy read to every cell

  21. Classical DRAM Organization (square) [Diagram: a square RAM cell array in which each intersection of a word (row) select line and a bit (data) line is a 1-T DRAM cell, with a row decoder on the row address and a column selector & I/O circuits on the column address] • The row and column address together select 1 bit at a time

  22. Logic Diagram of a Typical DRAM [Symbol: 256K x 8 DRAM with 9 address pins (A), 8 data pins (D), and RAS_L, CAS_L, WE_L, OE_L] • Control signals (RAS_L, CAS_L, WE_L, OE_L) are all active low • Din and Dout are combined (D): when WE_L is asserted (low) and OE_L deasserted (high), D serves as the data input pin; when WE_L is deasserted (high) and OE_L asserted (low), D is the data output pin • Row and column addresses share the same pins (A): when RAS_L goes low, pins A are latched in as the row address; when CAS_L goes low, pins A are latched in as the column address • RAS/CAS are edge-sensitive

  23. Key DRAM Timing Parameters • tRAC: minimum time from RAS falling to valid data output; quoted as the speed of a DRAM (a fast 4 Mb DRAM has tRAC = 60 ns) • tRC: minimum time from the start of one row access to the start of the next; tRC = 110 ns for a 4 Mbit DRAM with tRAC = 60 ns • tCAC: minimum time from CAS falling to valid data output; 15 ns for the same part • tPC: minimum time from the start of one column access to the start of the next; 35 ns for the same part (the sketch below turns these into access rates)
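A small sketch deriving sustained access rates from these parameters; the timing numbers are the slide's, the unit conversion is mine:

    tRAC, tRC, tCAC, tPC = 60, 110, 15, 35   # ns, per the slide's 4 Mb DRAM

    # Accesses to new rows are limited by the row cycle time, not tRAC:
    print(f"row accesses:    {1e3 / tRC:.1f} M/s")   # ~9.1 million/s
    # Within an open row, columns stream at the column cycle time:
    print(f"column accesses: {1e3 / tPC:.1f} M/s")   # ~28.6 million/s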

  24. DRAM Performance • A 60 ns (tRAC) DRAM can perform a row access only every 110 ns (tRC), and a column access (tCAC) in 15 ns, but the time between column accesses is at least 35 ns (tPC); in practice, external address delays and bus turnaround push that to 40 to 50 ns • These times do not include the time to drive the addresses off the microprocessor, nor the memory controller overhead: driving parallel DRAMs, the external memory controller, bus turnaround, the SIMM module, pins... • 180 ns to 250 ns latency from processor to memory is good for a “60 ns” (tRAC) DRAM

  25. DRAM Write Timing [Waveform: write cycles on the 256K x 8 DRAM; the row address and then the column address are latched on A by RAS_L and CAS_L, data is driven on D, and the WR access time and WR cycle time are marked] • Every DRAM access begins with the assertion of RAS_L • 2 ways to write, early or late relative to CAS: early write cycle, WE_L asserted before CAS_L; late write cycle, WE_L asserted after CAS_L

  26. DRAM Read Timing [Waveform: read cycles on the 256K x 8 DRAM; the row address and then the column address are latched on A, and D goes from high-Z to data out after the read access time / output enable delay] • Every DRAM access begins with the assertion of RAS_L • 2 ways to read, early or late relative to CAS: early read cycle, OE_L asserted before CAS_L; late read cycle, OE_L asserted after CAS_L

  27. Cycle Time versus Access Time • DRAM (read/write) cycle time >> DRAM (read/write) access time, roughly 2:1; why? • DRAM cycle time: how frequently you can initiate an access (analogy: a little kid can only ask his father for money on Saturday) • DRAM access time: how quickly you get what you want once you initiate an access (analogy: as soon as he asks, his father gives him the money) • DRAM bandwidth limitation analogy: what happens if he runs out of money on Wednesday?

  28. Increasing Bandwidth - Interleaving [Timing sketches] • Access pattern without interleaving: the CPU must wait until D1 is available before starting the access for D2 • Access pattern with 4-way interleaving: the CPU starts accesses to banks 0, 1, 2, 3 in successive cycles; by the time bank 3 has been started, bank 0 can be accessed again (compare the toy model below)
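A toy timing model of the two access patterns; the cycle counts below are assumptions chosen only to show the shape of the win, not figures from the slide:

    access_time = 8   # cycles a bank is busy serving one access (assumed)
    xfer_time   = 1   # cycles to return one word (assumed)
    n_accesses  = 4

    # Without interleaving, each access waits for the previous one to finish:
    print(n_accesses * (access_time + xfer_time))   # 36 cycles

    # With 4-way interleaving, the four banks' access times overlap, and
    # bank 0 is free again by the time a fifth access would arrive:
    print(access_time + n_accesses * xfer_time)     # 12 cycles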

  29. Fewer DRAMs/System over Time • Memory per DRAM grows at ~60%/year, while memory per system grows at only 25-30%/year • Minimum PC memory size vs. DRAM generation (’86 1 Mb, ’89 4 Mb, ’92 16 Mb, ’96 64 Mb, ’99 256 Mb, ’02 1 Gb), as chips per system: 4 MB: 32 x 1 Mb or 8 x 4 Mb; 8 MB: 16 x 4 Mb or 4 x 16 Mb; 16 MB: 8 x 16 Mb or 2 x 64 Mb; 32 MB: 4 x 64 Mb or 1 x 256 Mb; 64 MB: 8 x 64 Mb or 2 x 256 Mb; 128 MB: 4 x 256 Mb or 1 x 1 Gb; 256 MB: 8 x 256 Mb or 2 x 1 Gb • Increasing DRAM density => fewer chips => harder to have banks

  30. Page Mode DRAM: Motivation • Regular DRAM organization: N rows x N columns x M bits, reading and writing M bits at a time • Each M-bit access requires a full RAS/CAS cycle [Waveform: 1st and 2nd M-bit accesses, each latching a row address and then a column address] • Fast page mode DRAM adds an N x M “register” to save an entire row

  31. Fast Page Mode Operation • Fast page mode DRAM: an N x M “SRAM” saves a row • After a row is read into the register, only a CAS is needed to access other M-bit blocks on that row; RAS_L remains asserted while CAS_L is toggled [Waveform: one row address followed by four column addresses, giving the 1st through 4th M-bit accesses]

  32. DRAMs over Time • DRAM generation (1st-gen. sample): ’84 1 Mb, ’87 4 Mb, ’90 16 Mb, ’93 64 Mb, ’96 256 Mb, ’99 1 Gb • Die size (mm²): 55, 85, 130, 200, 300, 450 • Memory area (mm²): 30, 47, 72, 110, 165, 250 • Memory cell area (µm²): 28.84, 11.1, 4.26, 1.64, 0.61, 0.23

  33. Memory - Summary • Two different types of locality: temporal locality (locality in time), where an item that is referenced will tend to be referenced again soon; and spatial locality (locality in space), where items whose addresses are close to a referenced item tend to be referenced soon • By taking advantage of the principle of locality: present the user with as much memory as is available in the cheapest technology, and provide access at the speed offered by the fastest technology • DRAM is slow but cheap and dense: a good choice for presenting the user with a BIG memory system • SRAM is fast but expensive and not very dense: a good choice for providing the user FAST access time

  34. Cache Fundamentals [Diagram: CPU, lowest-level cache, next-level memory/cache] • cache hit -- an access where the data is found in the cache • cache miss -- an access which isn’t • hit time -- time to access the cache • miss penalty -- time to move data from the further level to the closer one, then to the CPU • hit ratio -- percentage of time the data is found in the cache • miss ratio -- (1 - hit ratio) • cache block size or cache line size -- the amount of data that gets transferred on a cache miss • instruction cache -- a cache that only holds instructions • data cache -- a cache that only caches data • unified cache -- a cache that holds both

  35. Caching Issues [Diagram: the CPU accesses the lowest-level cache; a miss goes on to the next-level memory/cache] • On a memory access: how do I know if this is a hit or a miss? • On a cache miss: where do I put the new data? what data do I throw out? how do I remember what data this is?

  36. A simple cache (similar to the branch-prediction table in pipelining?): an index is used to determine which cache line an address might be found in • Address string (byte addresses): 4 (00000100), 8 (00001000), 12 (00001100), 4, 8, 20 (00010100), 4, 8, 20, 24 (00011000), 12, 8, 4 • 4 entries, each block holds one word, and each word in memory maps to exactly one cache location • A cache that can put a line of data in exactly one place is called direct-mapped • Conflict misses are misses caused by different memory locations mapping to the same cache index • Solution 1: make the cache bigger • Solution 2: allow multiple entries for the same cache index • Conflict misses = 0, by definition, for fully associative caches (the sketch below replays this address string)
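A sketch replaying the slide's address string through this 4-entry direct-mapped cache (one word per block); only the hit/miss bookkeeping is modeled, not the data:

    addrs = [4, 8, 12, 4, 8, 20, 4, 8, 20, 24, 12, 8, 4]
    n_lines = 4
    lines = [None] * n_lines          # stored tag per line; None = invalid
    hits = 0
    for a in addrs:
        block = a // 4                # one-word (4-byte) blocks
        index = block % n_lines       # the one line this address can use
        tag = block // n_lines        # remaining high-order bits
        if lines[index] == tag:
            hits += 1
        else:
            lines[index] = tag        # miss: replace whatever was there
    print(f"{hits} hits, {len(addrs) - hits} misses")   # 4 hits, 9 misses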

  37. Fully associative cache: the tag identifies the address of the cached data • Same address string: 4, 8, 12, 4, 8, 20, 4, 8, 20, 24, 12, 8, 4 • 4 entries, each block holds one word, and any block can hold any word • A cache that can put a line of data anywhere is called fully associative • The most popular replacement strategy is LRU (least recently used), replayed in the sketch below • But how do you find the data?
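The same address string through a 4-entry fully associative cache with LRU replacement, as a sketch:

    addrs = [4, 8, 12, 4, 8, 20, 4, 8, 20, 24, 12, 8, 4]
    capacity = 4
    lru = []                      # block tags, least recently used first
    hits = 0
    for a in addrs:
        tag = a // 4              # any block can go anywhere: tag = block number
        if tag in lru:
            hits += 1
            lru.remove(tag)       # hit: will re-append as most recent
        elif len(lru) == capacity:
            lru.pop(0)            # miss in a full cache: evict the LRU block
        lru.append(tag)
    print(f"{hits} hits, {len(addrs) - hits} misses")   # 6 hits, 7 misses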

  38. Fully associative cache • Forget about the cache index: compare the cache tags of all cache entries in parallel • Example: with 32 B blocks, the cache tag is 27 bits long (address bits 31-5) and the byte select is bits 4-0 (e.g. 0x01), so we need N 27-bit comparators • By definition: conflict misses = 0 for a fully associative cache [Diagram: valid bit, 27-bit cache tag, and cache data (bytes 0-31, 32-63, ...) per entry, with every stored tag compared against the address in parallel]

  39. An n-way set associative cache • Same address string: 4, 8, 12, 4, 8, 20, 4, 8, 20, 24, 12, 8, 4 • 4 entries, each block holds one word, and each word in memory maps to one of a set of n cache lines • A cache that can put a line of data in exactly n places is called n-way set-associative • The cache lines that share the same index are a cache set (see the sketch below)
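And the same string through the 2-way set-associative version (4 entries = 2 sets x 2 ways, LRU within each set), as a sketch:

    addrs = [4, 8, 12, 4, 8, 20, 4, 8, 20, 24, 12, 8, 4]
    n_sets, n_ways = 2, 2
    sets = [[] for _ in range(n_sets)]   # per-set LRU list of tags
    hits = 0
    for a in addrs:
        block = a // 4
        s, tag = block % n_sets, block // n_sets
        ways = sets[s]
        if tag in ways:
            hits += 1
            ways.remove(tag)             # refresh this tag's LRU position
        elif len(ways) == n_ways:
            ways.pop(0)                  # evict this set's LRU way
        ways.append(tag)
    print(f"{hits} hits, {len(addrs) - hits} misses")   # 6 hits, 7 misses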

  40. An n-way set associative cache • N-way set associative: N entries for each cache index, i.e. N direct-mapped caches operating in parallel • Example: a two-way set associative cache: the cache index selects a “set” from the cache, the two tags in the set are compared in parallel, and data is selected based on the tag result [Diagram: two banks of valid/tag/data entries, two comparators against the address tag, an OR gate producing Hit, and a 2:1 mux (Sel1/Sel0) choosing the cache block]

  41. Direct vs. Set Associative Caches • N-way set associative cache versus direct mapped cache: N comparators vs. 1; extra MUX delay for the data; data arrives AFTER the hit/miss decision and set selection • In a direct mapped cache, the cache block is available BEFORE hit/miss: it is possible to assume a hit and continue, recovering later on a miss

  42. Longer cache blocks • Same address string: 4, 8, 12, 4, 8, 20, 4, 8, 20, 24, 12, 8, 4 • 4 entries, each block holds two words, and each word in memory maps to exactly one cache location (this cache is twice the total size of the prior caches) • Large cache blocks take advantage of spatial locality • Too large a block size can waste cache space • Longer cache blocks require less tag space

  43. Increasing block size • In general, a larger block size takes advantage of spatial locality, BUT: a larger block size means a larger miss penalty (it takes longer to fill the block), and if the block size is too big relative to the cache size, the miss rate will go up (too few cache blocks) • In general, average access time = hit time x (1 - miss rate) + miss penalty x miss rate (see the sketch below)
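The slide's formula in runnable form; the (block size, miss rate, miss penalty) triples are invented purely to show the shape of the trade-off, not measured values:

    def avg_access_time(hit_time, miss_rate, miss_penalty):
        # Hits cost hit_time; misses cost the full miss penalty.
        return hit_time * (1 - miss_rate) + miss_penalty * miss_rate

    # Bigger blocks cut the miss rate (spatial locality) but raise the
    # penalty (more words to fill), so the average first falls, then rises:
    for block, mr, mp in [(16, 0.10, 40), (32, 0.06, 60), (64, 0.05, 100)]:
        print(block, avg_access_time(1, mr, mp))   # 4.9, then 4.54, then 5.95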

  44. Sources of Cache Misses • Compulsory (cold start, process migration, first reference): the first access to a block • A “cold” fact of life: not a whole lot you can do about it • Note: if you are going to run billions of instructions, compulsory misses are insignificant • Conflict (collision): multiple memory locations mapped to the same cache location • Solution 1: increase cache size • Solution 2: increase associativity • Capacity: the cache cannot contain all the blocks accessed by the program • Solution: increase cache size • Invalidation: another process (e.g., I/O) updates memory

  45. Sources of Cache Misses, by cache type (direct mapped / N-way set associative / fully associative): • Cache size: big / medium / small • Compulsory misses: same / same / same • Conflict misses: high / medium / zero • Capacity misses: low / medium / high • Invalidation misses: same / same / same

  46. Accessing a cache [Pipeline sketch: IF ID EX MEM WB, with the instruction cache feeding IF and the data cache accessed in MEM] 1. Use the index and tag to access the cache and determine hit/miss. 2. If hit, return the requested data. 3. If miss, select a cache block to be replaced, and access memory or the next lower cache (possibly stalling the processor): load the entire missed cache line into the cache, then return the requested data to the CPU (or higher cache). 4. If the next lower memory is a cache, go to step 1 for that cache.

  47. Accessing a cache • 64 KB cache, direct-mapped, 32-byte cache blocks • Address breakdown: tag = bits 31-16 (16 bits), index = bits 15-5 (11 bits), word offset = bits 4-2 (3 bits), byte offset = bits 1-0 • 64 KB / 32 bytes = 2 K cache blocks/sets (entries 0 to 2047), each holding a valid bit, a 16-bit tag, and 256 bits (= 32 bytes) of data; the address tag is compared against the stored tag for hit/miss (field extraction is sketched below)
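A sketch of the tag/index/offset split for both of these example caches; the helper function and its test address are mine, while the cache geometries are the slides':

    def fields(addr, cache_bytes, block_bytes, ways):
        n_sets = cache_bytes // block_bytes // ways
        offset_bits = block_bytes.bit_length() - 1   # log2(block size)
        index_bits = n_sets.bit_length() - 1         # log2(number of sets)
        offset = addr & (block_bytes - 1)
        index = (addr >> offset_bits) & (n_sets - 1)
        tag = addr >> (offset_bits + index_bits)
        return tag, index, offset

    # 64 KB direct-mapped, 32 B blocks: 2 K sets -> 16-bit tag, 11-bit index
    print(fields(0x12345678, 64 * 1024, 32, 1))
    # 32 KB 2-way, 16 B blocks (next slide): 1 K sets -> 18-bit tag, 10-bit index
    print(fields(0x12345678, 32 * 1024, 16, 2))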

  48. Accessing a cache • 32 KB cache, 2-way set-associative, 16-byte blocks (cache lines) • Address breakdown: tag = bits 31-14 (18 bits), index = bits 13-4 (10 bits), word/byte offset = bits 3-0 • 32 KB / 16 bytes / 2 = 1 K cache sets (0 to 1023), each set holding two valid/tag/data ways whose tags are compared in parallel for hit/miss

  49. Cache Alignment • Memory address = tag | index | block offset • The data that gets moved into the cache on a miss is all the data whose addresses share the same tag and index (regardless of which datum is accessed first) • This results in: no overlap of cache lines; easy mapping of addresses to cache lines (no additions); and data at address X always being present in the same location within the cache block (at byte X mod blocksize) if it is there at all • Think of memory as organized into cache-line-sized pieces (because in reality, it is!) • Recall the DRAM page mode architecture!

  50. Basic Memory Hierarchy questions • Q1: Where can a block be placed in the upper level? (block placement) • Q2: How is a block found if it is in the upper level? (block identification) • Q3: Which block should be replaced on a miss? (block replacement) • Q4: What happens on a write? (write strategy) • Placement: the cache structure determines the “where”: fully associative, direct mapped, or 2-way set associative; set-associative mapping = block number modulo number of sets • Identification: a tag on each block; no need to check the index or block offset • Replacement: easy for direct mapped; for set associative or fully associative: random, or LRU (least recently used)
