
CPE555A: Real-Time Embedded Systems



Presentation Transcript


  1. CPE555A:Real-Time Embedded Systems Lecture 3 Ali Zaringhalam Stevens Institute of Technology

  2. Administrative • Assignment 1 will be posted this week CS555A – Real-Time Embedded Systems Stevens Institute of Technology

  3. Outline • What is forwarding? • Memory hierarchy • Memory models

  4. MIPS 5-Stage Integer Pipeline [Datapath diagram: the five stages — Instruction Fetch, Instruction Decode/Register Fetch, Execute/Address Calculation, Memory Access, Write Back — with the PC, instruction memory, register file, sign-extension unit, ALU, data memory, and the multiplexers connecting them.]

  5. MIPS Pipeline [Timing diagram: instructions flowing through the IF/ID, ID/EX, EX/MEM and MEM/WB pipeline registers over clock cycles CC1-CC7.] • Instruction memory (IM) and data memory (DM) are shown as separate units • All operations in a pipeline stage must complete in one clock cycle • Values passed from one stage to another must be stored in intermediate (pipeline) registers, which introduce delay in the datapath • Pipeline registers are labeled with the names of the stages they connect

  6. Data Hazard [Timing diagram: the five instructions below overlapping in the pipeline during CC1-CC7; each of the later instructions reads R1 before ADD writes it back.] ADD R1, R2, R3 SUB R4, R1, R5 AND R6, R1, R7 OR R8, R1, R9 XOR R10, R1, R11

  7. RAW Data Hazard ADD R1, R2, R3 SUB R4, R1, R5 AND R6, R1, R7 OR R8, R1, R9 XOR R10, R1, R11 • R1 is not written back to the register file until the WB cycle (CC5) of the ADD instruction • R1 is needed in the ID cycle of the succeeding instructions • CC3 for SUB • CC4 for AND • CC5 for OR • CC6 for XOR • Unless the hazard is handled, these instructions operate on the wrong operand value
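The slide's dependence analysis can be sketched as a small script. This is an illustrative model, not real hardware: instructions are encoded as hypothetical (dest, src1, src2) tuples, and the 3-instruction danger window reflects the 5-stage pipeline without forwarding or split-phase register access (the fourth successor, like XOR here, decodes in CC6, after WB in CC5).

```python
# Sketch: detect RAW hazards in a MIPS-like instruction trace.
# The (dest, src1, src2) encoding is an assumption for illustration.

def raw_hazards(program):
    """Return (producer, consumer, register) for each RAW hazard.
    Without forwarding, a result written back in WB is invisible to the
    3 instructions issued right after the producer."""
    hazards = []
    for i, (dest, _, _) in enumerate(program):
        for j in range(i + 1, min(i + 4, len(program))):
            _, s1, s2 = program[j]
            if dest in (s1, s2):
                hazards.append((i, j, dest))
    return hazards

prog = [
    ("R1", "R2", "R3"),    # ADD R1, R2, R3
    ("R4", "R1", "R5"),    # SUB R4, R1, R5
    ("R6", "R1", "R7"),    # AND R6, R1, R7
    ("R8", "R1", "R9"),    # OR  R8, R1, R9
    ("R10", "R1", "R11"),  # XOR R10, R1, R11
]

# ADD produces R1; SUB, AND and OR read it too early, XOR is safe.
assert [c for _, c, _ in raw_hazards(prog)] == [1, 2, 3]
```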

  8. Split-Phase Register Read/Write [Timing diagram: ADD's WB and OR's ID share CC5, with the register file written in the first half of the cycle and read in the second half.] • XOR operates correctly because its ID cycle is in CC6 • OR can be made to operate correctly by: • Writing the register file in the first half of the clock cycle • Reading the register file in the second half of the clock cycle

  9. Forwarding (aka Bypassing) [Timing diagram: ADD's ALU output, computed in CC3, fed back into SUB's ALU input in CC4 via the EX/MEM pipeline register.] • The result is not needed by the SUB instruction until CC4, but the ADD instruction has already computed it in the previous cycle, CC3 • Forward the result of ALU operations from the previous cycle • The ALU result is written into ALUout in the EX/MEM pipeline register • If the forwarding logic detects that one of the register operands has been "touched" by the previous ALU operation, control logic selects the input from EX/MEM instead of ID/EX
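The forwarding decision on this slide can be sketched as a selection function. The dict layout of the pipeline latches is an assumption for illustration; real hardware does this with comparators and a mux, as the slide describes.

```python
# Sketch: the forwarding (bypass) mux decision for one ALU source operand.
# The dicts model the EX/MEM and MEM/WB pipeline registers.

def select_alu_input(src_reg, reg_file_value, ex_mem, mem_wb):
    """Pick the ALU input for source register src_reg."""
    if ex_mem["dest"] == src_reg:      # result computed in the previous cycle
        return ex_mem["alu_out"]       # forward from EX/MEM
    if mem_wb["dest"] == src_reg:      # result computed two cycles ago
        return mem_wb["value"]         # forward from MEM/WB
    return reg_file_value              # no hazard: use the register file

# ADD R1, R2, R3 just finished EX; SUB R4, R1, R5 is entering EX.
ex_mem = {"dest": "R1", "alu_out": 42}
mem_wb = {"dest": None, "value": None}
stale_r1 = 7                           # old R1 read from the register file
assert select_alu_input("R1", stale_r1, ex_mem, mem_wb) == 42
```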

  10. Load-Use Hazard and Stall [Timing diagram: LW R1, 0(R2) followed by SUB R4, R1, R5, AND R6, R1, R7 and OR R8, R1, R9; the loaded value is available only after MEM, so forwarding to SUB is now through MEM/WB, and OR no longer requires forwarding.] • Stall in the pipeline • No instruction begins in CC3 • No instruction completes in CC6

  11. Tabular View of Pipelining

  12. Assumptions Made To-Date • All memory operations take the same amount of time to complete • Each memory operation must complete before the next one can begin • Monolithic memory system: no structure

  13. Example: Perfect Memory [Timing diagram: four instructions, each passing through IF, ID, ALU, MEM and WB in successive cycles.] • If every memory access takes 1 cycle, then, assuming our 5-stage pipeline, this program fragment takes 8 cycles
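The 8-cycle figure follows from the standard pipeline timing formula, sketched here: the first instruction occupies all stages, and each later instruction completes one cycle after its predecessor.

```python
# Sketch: cycle count for an ideal k-stage pipeline with 1-cycle memory.

def pipeline_cycles(n_instructions, n_stages=5):
    # First instruction takes n_stages cycles; each later one adds 1 cycle.
    return n_stages + (n_instructions - 1)

# The 4-instruction fragment on this slide:
assert pipeline_cycles(4) == 8
```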

  14. Storage-Device Hierarchy [Diagram: access time increases down the hierarchy — registers 0.25-0.5 ns, cache 0.5-20 ns, main memory 80-250 ns.] • The CPU can access registers in one CPU clock cycle • At 4 GHz, the CPU cycle time is T = 0.25 ns

  15. Example: Real Memory [Timing diagram: the same fragment with multi-cycle memory; bubbles appear between the IF and MEM stages.] • If memory references take more than one cycle, then there will be a lot of stalls • Every instruction requires an instruction memory reference • Every Load or Store requires a data memory reference

  16. Memory System Complexity • A typical system contains a mix of memory technologies • The faster the memory, the more expensive it is. In practice a memory hierarchy is used to get the right price/performance. • There is also a need for non-volatile memory that survives a reset. Example: the executable program • The memory address space must be partitioned between I/O devices and various software needs such as stack and heap memory.

  17. Single-Transistor DRAM [Cell schematic: one transistor, gated by the word line (row select), connects a storage capacitor to the bit line.] • Write: • Drive word line high (row select) • Drive bit line • Read: • Drive word line high (row select) • Capacitor is connected to the bit line • Output is directed to a multiplexer

  18. Memory Chip Organization [Diagram: a 2-D array of cells; the high half of the address bits feeds a decoder that drives the word lines, and the low half selects among the bit lines through a multiplexer to produce the output.]

  19. Six-Transistor Static RAM Cell [Cell schematic: a cross-coupled inverter pair, gated onto the bit and bit-bar lines by the word line (row select).] • Write: 1. Drive word line high 2. Drive bit lines • Read: 1. Drive word line high 2. Connect inverter outputs to bit lines 3. Result sent to input of multiplexer • Once a value is stored in the cell, the ring structure of the inverter pair ensures the value circulates indefinitely as long as power is applied to the cell. Hence Static RAM (SRAM).

  20. SRAM-DRAM Differences • Chip density • SRAM requires more transistors per cell compared to DRAM • higher cell density for DRAM: ~6-10x SRAM • Access time • SRAM uses active devices (inverters) to drive bit lines, whereas DRAM uses a capacitor • bit lines are driven faster by the stronger signal (higher current) in SRAM: ~10x faster than DRAM • Cost: SRAM is more expensive than DRAM • Both SRAM and DRAM are volatile • They lose their content when powered off

  21. Embedded System Memory • Most embedded systems include SRAM memory (~100 KB) • They could also have DRAM memory if more memory is needed and it is not cost-effective to supply it in SRAM • DRAM refresh • Over time, the charge on the capacitor leaks • The capacitor also loses its charge when the cell is read • A cell must be refreshed to maintain its stored value • Refresh: a dummy read and write to every cell • A DRAM controller is used to refresh memory regularly • If memory is accessed during the refresh cycle of a cell, the memory controller stalls the CPU • The stall introduces variability in program execution

  22. Flash Memory • Flash: semiconductor, non-volatile memory • Compared to a hard disk • Lower latency • Lower power • Lighter weight, smaller size, shock resistance • Rough comparisons for DRAM:Flash:Disk • Cost per bit: 100:10:1 • Access latency: 1:5,000:1,000,000

  23. Types of Flash • NOR flash • Fast read (~100 ns), slow writes (~200 usec), very slow erase (~1 sec) • 10K to 100K erase cycles • Used for instruction memory in mobile systems • NAND flash • Denser (bits/area; cell size ~40% of NOR), cheaper per GB • Slow read (~50 usec), slow writes (~200 usec), slow erase (~2 msec) • 100K to 1M erase cycles • Used for data storage (phones, USB keys, solid-state drives) • Both types have durability issues • Damaged after some number of write/erase cycles

  24. NAND Flash Chips • Page: minimum unit of read/write • 0.5-8 KB of data + a spare area for error coding • Block: minimum unit of erasing • 64-128 pages • Chip: 1-16 GB

  25. Flash Operations • Read the contents of a page: 20-50 us • Erase sets all bits in a block to 1: 0.5-3 ms • Pages must be erased before they can be written • Update-in-place is not possible • Write data to a page: 100-300 us • Only 1 -> 0 transitions are allowed • Writing within a block must be ordered by page
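The erase-before-write rule can be made concrete with a toy model. This is a sketch with illustrative sizes (real blocks hold 64-128 pages of 0.5-8 KB, per the previous slide): a write may only clear bits, and restoring any bit to 1 requires erasing the whole block.

```python
# Sketch of NAND flash write rules: writes only perform 1 -> 0 transitions;
# update-in-place needs a block erase first. Sizes are toy values.

class FlashBlock:
    PAGES = 4          # real blocks hold 64-128 pages
    PAGE_BITS = 8      # real pages hold 0.5-8 KB

    def __init__(self):
        self.erase()

    def erase(self):
        # Erase sets every bit in every page of the block to 1.
        self.pages = [(1 << self.PAGE_BITS) - 1 for _ in range(self.PAGES)]

    def write(self, page, data):
        # Reject any write that would flip a stored 0 back to 1.
        if data & ~self.pages[page] & ((1 << self.PAGE_BITS) - 1):
            raise ValueError("update-in-place needs an erase first")
        self.pages[page] &= data

blk = FlashBlock()
blk.write(0, 0b10101010)        # fine: only clears bits
try:
    blk.write(0, 0b11111111)    # would flip 0s back to 1s
except ValueError:
    pass                        # rejected, as on real NAND
```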

  26. Reliability • Wear out • Flash cells are physically damaged by read, write and erase operations • Write disturb • Programming pages can corrupt the values of other pages in the block • Read disturb • Reading data can corrupt the data in the block • It takes many reads to see this effect • This is why there is a spare area for error-correction coding

  27. The Principle of Locality • The Principle of Locality: • Programs access a relatively small portion of the address space at any instant of time. • Two different types of locality: • Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse) • Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)
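Spatial locality can be illustrated with array traversal order. This sketch (an illustration, not from the slides) computes the address strides of row-major versus column-major traversal of a matrix stored row-major: the former touches consecutive addresses, the latter jumps a whole row at a time.

```python
# Sketch: spatial locality in array traversal of a row-major matrix.

ROWS, COLS = 4, 4
# Flat "memory" addresses of a ROWS x COLS matrix stored row-major.
row_major = [r * COLS + c for r in range(ROWS) for c in range(COLS)]
col_major = [r * COLS + c for c in range(COLS) for r in range(ROWS)]

# Consecutive row-major accesses are 1 address apart (good spatial
# locality); column-major accesses mostly stride COLS addresses.
row_strides = [b - a for a, b in zip(row_major, row_major[1:])]
col_strides = [b - a for a, b in zip(col_major, col_major[1:])]
assert all(s == 1 for s in row_strides)
assert col_strides.count(COLS) == 12
```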

  28. Memory Hierarchy • Present the user with as much memory as is available in the cheapest technology • Provide access at the speed offered by the fastest technology at the cost of the cheapest technology (on average) [Hierarchy diagram: processor registers (~100 bytes, ~1s ns), on-chip cache and second-level cache (SRAM, ~KBytes, ~10s ns), main memory (DRAM, ~MBytes, ~100s ns), secondary storage (disk, ~GBytes, ~10s ms), tertiary storage (disk/tape, ~TBytes, ~sec).]

  29. Accessing Data in Main Memory • Ignore caches for the moment • Data access involves • Sending the address to memory • The address indexes into memory: mem[address] • Data from mem[address] is returned to the CPU • Memory is referenced just like an array: result <-- mem[index] [Diagram: the CPU sends an address (0x00-0x07) and memory returns the contents of that location.]

  30. Demand-Based System • A processor request is first looked up in the top level of the hierarchy • If the data cannot be found in the top level, the next level is searched • Memory is copied from a lower level to a higher level in blocks of sequential address locations • it is faster to read/write blocks than individual words • this takes advantage of locality of reference in programs

  31. Block Size In Memory Hierarchy • Block (aka line): the minimum unit of information that can be transferred between cache & main memory (more generally, between two adjacent layers in the memory hierarchy) • Register: ~100 bytes storage, 1-8 byte blocks • Cache: ~Kbytes storage, 8-128 byte blocks • Main memory: ~Mbytes storage, 512-4096 byte blocks • Disk: ~Gbytes storage, ~Mbyte blocks • Tape: unlimited storage

  32. A Simple Cache [Diagram: cache contents before and after the reference to Xn.] • Processor request: 1 word • Block size: 1 word • The processor requests Xn, which is not in the cache • The request results in a miss • The cache is full • Xn is brought from memory into the cache, replacing X6

  33. Cache Design • Block size • Block organization: direct-mapped, fully-associative, set-associative • Block replacement policy: FIFO, LRU, random • Write policy: write-back, write-through, write-allocate, write-no-allocate

  34. Cache Associativity [Diagram: memory block 45 placed in an 8-block cache under each organization.] • Fully associative: block 45 can go anywhere in the cache (similar to an array). All entries must be checked. • Direct-mapped: block 45 can go in only one location in the cache: 45 mod 8 = 5 (similar to a hash table). Only a single entry must be checked. • Set-associative: block 45 can go anywhere in one set in the cache: 45 mod 4 = 1 (similar to a hash bucket). All entries within a single set must be checked.
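The three placements of block 45 can be computed directly. A sketch of the slide's example, with the 2-way set layout (two consecutive lines per set) as an illustrative assumption:

```python
# Sketch: where memory block 45 may be placed in an 8-block cache
# under the three organizations on this slide.

N_BLOCKS = 8
block = 45

# Direct-mapped: exactly one candidate line, 45 mod 8.
direct = [block % N_BLOCKS]

# Fully associative: any line in the cache.
fully = list(range(N_BLOCKS))

# 2-way set-associative: 4 sets of 2 lines; any line in set 45 mod 4.
n_sets = N_BLOCKS // 2
s = block % n_sets
two_way = [2 * s, 2 * s + 1]

assert direct == [5]
assert two_way == [2, 3]    # set 1 holds lines 2 and 3
```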

  35. How To Find a Block? • The CPU address is split into a block address (Tag + Index) and a Block Offset • Index: selects the set. NULL for a fully-associative cache. • Tag: compared against the tag bits of the CPU address for hit/miss • Block Offset: selects the data within a block

  36. Example [Diagram: the three organizations of slide 34 revisited.] • Fully associative: 1 set, eight blocks per set. Index = 0. 8-way set associative. • Direct-mapped: 8 sets, one block per set. Index = 3. 1-way set associative. • Set-associative: 4 sets, two blocks per set. Index = 2. 2-way set associative.

  37. How To Find a Block? • CPU address = Tag | Index | Block Offset • Given the number of bits in the Index field, the number of sets is 2^Index • The number of blocks per set (the set associativity) is (total number of blocks) / (number of sets)

  38. Example • Block size = 64 bytes • Cache size = 64 Kbytes • 2-way set associative • How many sets? What is the size of the Index field? • Use the approximation 2^10 ≈ 1000: 64 Kbytes / 64 bytes ≈ 1024 blocks • Number of sets: 1024 / 2 = 2^9 = 512 • Index field: 9 bits
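The exact version of this calculation is a sketch away, using the formulas from the previous slide:

```python
# Sketch: exact version of the slide-38 calculation.

cache_size = 64 * 1024    # 64 Kbytes
block_size = 64           # bytes
ways = 2                  # 2-way set associative

n_blocks = cache_size // block_size     # 1024 blocks in the cache
n_sets = n_blocks // ways               # blocks per set = associativity
index_bits = n_sets.bit_length() - 1    # log2 of the number of sets

assert n_sets == 512
assert index_bits == 9
```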

  39. Fully Associative Cache • A 5-bit offset supports 32 bytes per block • A fully-associative cache does not need a Set Index field: Index = 0, corresponding to one set • The address is partitioned into • Block number • Block offset, which identifies the data within the block • A block can go anywhere in the cache • Must examine all blocks • Each cache block has a Tag • The Tag is compared to the block number • If one of the blocks has Tag = block number, we have a hit • Needs a comparator per cache block • Comparisons are performed in parallel

  40. Valid Bit • Initially (at power-up/cold-start) the cache is either empty or contains random data • A valid bit is needed to indicate whether a cache block contains valid data • A hit is only called against valid blocks

  41. Direct-Mapped Cache - 1 • Cache line = (Block Address) MOD n • A memory block can only be stored in one of the cache lines • if two memory blocks map to the same cache line, the old block must be evicted to make room for the new block • The address is again partitioned into • Block number • Block offset • The block number is partitioned into • Set Index, which identifies the cache line where the memory block may be stored (a set with a single entry) • Tag, which determines whether there is a hit or not (only one comparison is necessary)

  42. Direct-Mapped Cache - Example • Block size = 32 bytes • need 5 bits for the Block Offset • 512 lines in the cache • need 9 bits for the Set Index • Cache size = 32 x 512 = 16 Kbytes • The Set Index is used to index into the array • The Tag is read out and input into the comparator • The comparator compares the Tag in the address against the Tag stored in the cache's Tag array
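The address partition for this example can be sketched with bit masks (assuming, for illustration, a 32-bit byte address and the slide's parameters: 5 offset bits, 9 index bits, with the rest as tag):

```python
# Sketch: splitting an address for the slide's direct-mapped cache
# (32-byte blocks -> 5 offset bits, 512 lines -> 9 index bits).

OFFSET_BITS = 5
INDEX_BITS = 9

def split_address(addr):
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

# An arbitrary example address:
assert split_address(0x12345) == (4, 282, 5)
```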

  43. Direct-Mapped Cache - 3 • Conceptually, main memory can be viewed as a number of partitions, each equal in size to the cache • Addresses with the same set index map to the same cache block: (Block Address) MOD n

  44. Example [Exercise diagram: a direct-mapped cache with Valid and Tag fields.] • Consider the direct-mapped cache shown schematically on the right. The following trace shows the memory references sent to the cache in sequence. Assume that the cache starts with all entries set to INVALID. • Mark each reference in the table as a Hit or Miss. • After the last reference, show the final Valid and Tag fields in the cache diagram above.

  45. 2-Way Set-Associative Cache - 1 • Set = (Block Address) MOD (number of sets) • Each set holds two blocks (Way 0 & Way 1) • Each memory block • is mapped to a set • can be stored in either cache line of that set • Block size = 32 bytes • 5-bit Block Offset • 512 blocks divided into 256 sets with two lines per set • need 8 bits for the Set Index • Cache size: 256 x 2 x 32 = 16 Kbytes

  46. Hit Detection in 2-Way Cache
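The slide's diagram was not recoverable, but the mechanism follows from slides 35, 39 and 45: the set index selects one set, the tag is compared against both ways in parallel, and a match counts only if that way's valid bit is set. A sketch, with the dict layout as an illustrative assumption:

```python
# Sketch: hit detection in one set of a 2-way set-associative cache.

def lookup(cache_set, tag):
    """cache_set: list of ways, each {'valid', 'tag', 'data'}.
    Hardware compares both ways in parallel; a loop models the same logic."""
    for way in cache_set:
        if way["valid"] and way["tag"] == tag:
            return True, way["data"]      # hit: mux selects this way's data
    return False, None                    # miss in both ways

s = [{"valid": True, "tag": 0x4, "data": "A"},
     {"valid": False, "tag": 0x4, "data": "stale"}]
assert lookup(s, 0x4) == (True, "A")      # the valid bit gates the match
assert lookup(s, 0x7) == (False, None)
```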

  47. Cache Associativity • High associativity • reduces conflicts between blocks that map to the same location • reduces the "eviction rate" and miss rate • Low associativity • increases the miss rate • reduces cache hardware complexity

  48. Replacement Policy • The replacement policy defines the algorithm to "evict" a block when there is a cache miss and the cache is full • Direct-mapped cache • trivial choice: evict the resident block and replace it with the new block • Fully-associative & set-associative • random selection from among the candidate blocks for eviction • simple to support in hardware • spreads allocation uniformly • least-recently-used (LRU): • accesses to blocks are recorded • evict the block that has gone unused the longest • improves the chance of exploiting temporal locality • both methods are comparable for large cache sizes
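The LRU bookkeeping described above can be sketched for one set. Using an `OrderedDict` (an implementation convenience, not how hardware tracks recency) keeps the least-recently-used block at the front:

```python
# Sketch: LRU replacement for one cache set.

from collections import OrderedDict

class LRUSet:
    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()   # tag -> data, least recent first

    def access(self, tag, data=None):
        if tag in self.blocks:                 # hit: mark most recent
            self.blocks.move_to_end(tag)
            return True
        if len(self.blocks) == self.ways:      # set full: evict LRU block
            self.blocks.popitem(last=False)
        self.blocks[tag] = data                # miss: fill the set
        return False

s = LRUSet(2)
s.access("A")
s.access("B")
s.access("A")              # A is now the most recently used
s.access("C")              # evicts B, the least recently used
assert "B" not in s.blocks and "A" in s.blocks
```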

  49. Cache Write Policies • Write-back: data written by the processor is updated only in the cache, not in the lower level • data is updated in the lower level only when a block is evicted • a block that requires an update at eviction sets its "dirty" bit • multiple byte updates within a single block can be written to the lower level in one write operation • the entire block must be written; it is not known which bytes were updated • Write-through: update data in both the cache and the lower level • a read miss does not require updating the evicted block in the lower level, because the lower level is already up to date • only the updated bytes within the block must be written to the lower level • less complex, but consumes more memory bus bandwidth
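The write-back dirty bit can be sketched for a single cache line. The class layout and the dict standing in for the lower memory level are illustrative assumptions:

```python
# Sketch: the write-back policy's dirty bit. Writes mark the line; the
# lower level is updated only when a dirty line is evicted.

class WriteBackLine:
    def __init__(self, tag, data):
        self.tag, self.data, self.dirty = tag, data, False

    def write(self, data):
        self.data = data
        self.dirty = True           # defer the lower-level update

    def evict(self, memory):
        if self.dirty:              # one block write covers all updates
            memory[self.tag] = self.data
        self.dirty = False

memory = {}
line = WriteBackLine(tag=0x40, data=b"old")
line.write(b"new")
assert memory == {}                 # lower level not yet updated
line.evict(memory)
assert memory[0x40] == b"new"       # updated only at eviction
```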

  50. Write-Through Cache Issues [Diagram: processor -> cache, with a write buffer between the cache and main memory.] • The processor must wait for the write to the lower level to complete • this is referred to as a write stall • A write buffer (aka store buffer) is used to reduce write stalls • the processor writes data to the write buffer • the memory controller writes the content of the write buffer to the memory hierarchy • Effective only if (store frequency) << (1/DRAM write cycle) • If (store frequency) ~ (1/DRAM write cycle), the write buffer eventually overflows • On a cache miss, the write buffer must be looked up as well
