
ECE 4100/6100 Advanced Computer Architecture, Lecture 6: Instruction Fetch



  1. ECE 4100/6100 Advanced Computer Architecture, Lecture 6: Instruction Fetch. Prof. Hsien-Hsin Sean Lee, School of Electrical and Computer Engineering, Georgia Institute of Technology

  2. Instruction Supply Issues
  • Fetch throughput defines the maximum performance that can be achieved in later stages
  • Superscalar processors need to supply more than 1 instruction per cycle
  • Instruction supply is limited by:
    • Misalignment of multiple instructions in a fetch group
    • Change of flow (interrupting instruction supply)
    • Memory latency and bandwidth
  (Diagram: Instruction Fetch Unit → instruction buffer → execution core)

  3. Aligned Instruction Fetching (4 instructions)
  • Assume one fetch group = 16B (4 instructions); PC = ..xx000000
  • One 64B I-cache line holds A0-A15 as four rows, selected by the row decoder:
    ..00: A0 A1 A2 A3 | ..01: A4 A5 A6 A7 | ..10: A8 A9 A10 A11 | ..11: A12 A13 A14 A15
  • One row can be pulled out at a time: in cycle n, inst 1-4 = A0 A1 A2 A3
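The row selection above can be sketched in a few lines. This is a toy model, not hardware; the sizes and names follow the slide's example:

```python
# Sketch (not from the slides): deriving the I-cache row and slots for an
# aligned fetch. Assumes 4-byte instructions, a 16 B fetch group, and a
# 64 B line stored as four 16 B rows, as in the slide's example.

LINE_BYTES = 64
GROUP_BYTES = 16
INST_BYTES = 4

def aligned_fetch_group(pc):
    """Return the indices of the 4 instructions (A0..A15) read in one cycle."""
    assert pc % GROUP_BYTES == 0, "aligned fetch: PC sits on a group boundary"
    row = (pc % LINE_BYTES) // GROUP_BYTES        # row-decoder input (..00 to ..11)
    first = row * (GROUP_BYTES // INST_BYTES)     # first instruction in that row
    return [first + i for i in range(4)]

# PC = ..xx000000 selects row ..00, i.e. instructions A0 A1 A2 A3
print(aligned_fetch_group(0b000000))   # [0, 1, 2, 3]
print(aligned_fetch_group(0b110000))   # row ..11 -> [12, 13, 14, 15]
```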

  4. Misaligned Fetch
  • PC = ..xx001000: the fetch group starts mid-row, at A2 of the same 64B I-cache line
  • A rotating network restores program order, so cycle n still delivers inst 1-4 = A2 A3 A4 A5
  • Used in the IBM RS/6000
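A minimal sketch of what the rotating network accomplishes, assuming the whole fetch group lies within the line that was read out (names are illustrative, not RS/6000 RTL):

```python
# Sketch of the rotating (alignment) network: the array delivers a raw
# window of instructions, and a rotator restores program order so a fetch
# group can start at any instruction offset within the line.

def misaligned_fetch_group(line, pc, group=4, inst_bytes=4):
    """Fetch `group` instructions starting at PC's offset within the line."""
    start = (pc % (len(line) * inst_bytes)) // inst_bytes
    # Rotate the line so `start` comes first, then take the fetch group.
    rotated = line[start:] + line[:start]
    return rotated[:group]

line = [f"A{i}" for i in range(16)]            # one 64 B I-cache line
# PC = ..xx001000 (byte offset 8) starts mid-row at A2
print(misaligned_fetch_group(line, 0b001000))  # ['A2', 'A3', 'A4', 'A5']
```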

  5. Split Cache Line Access
  • PC = ..xx111000: the fetch group starts at A14 and spills into cache line B
  • The fetch must be broken down into 2 physical accesses: cycle n delivers inst 1-2 (A14 A15), cycle n+1 delivers inst 3-4 (B0 B1)
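The split decision itself is simple address arithmetic; a sketch under the slide's sizes (64B lines, 16B fetch groups):

```python
# Sketch: deciding whether a fetch group spans two cache lines and must be
# split into 2 physical accesses (cycle n and cycle n+1), as on the slide.

LINE_BYTES = 64
GROUP_BYTES = 16

def split_accesses(pc):
    """Return the (line_address, byte_count) pairs needed for one fetch group."""
    off = pc % LINE_BYTES
    line = pc - off
    if off + GROUP_BYTES <= LINE_BYTES:
        return [(line, GROUP_BYTES)]          # single access, no split
    first = LINE_BYTES - off                  # bytes left in line A
    return [(line, first), (line + LINE_BYTES, GROUP_BYTES - first)]

# PC = ..xx111000 (offset 56): 8 B from line A, then 8 B from line B
print(split_accesses(56))   # [(0, 8), (64, 8)]
print(split_accesses(0))    # [(0, 16)]
```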

  6. Split Cache Line Access Miss
  • Same split fetch (PC = ..xx111000), but cache line B misses (the diagram shows line C cached in its place)
  • Cycle n delivers inst 1-2; inst 3-4 arrive only in cycle n+X, after the miss is serviced

  7. High Bandwidth Instruction Fetching
  • Wider issue → more instruction feed needed
  • Major challenge: fetching more than one non-contiguous basic block per cycle
  • Enabling techniques:
    • Predication
    • Branch alignment based on profiling
    • Other hardware solutions (branch prediction is a given)
  (Control-flow graph: BB1-BB7)

  8. Predication Example
  • Converts control dependency into data dependency
  • Enlarges the basic block size: more room for scheduling, no fetch disruption
  Source code:
      if (a[i+1] > a[i]) a[i+1] = 0; else a[i] = 0;
  Typical assembly:
          lw r2, [r1+4]
          lw r3, [r1]
          blt r3, r2, L1
          sw r0, [r1]
          j L2
      L1: sw r0, [r1+4]
      L2: ...
  Assembly with predication:
          lw r2, [r1+4]
          lw r3, [r1]
          sgt p4, r2, r3
    (p4)  sw r0, [r1+4]
    (!p4) sw r0, [r1]
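The same if-conversion can be mimicked in Python terms. This is only an analogy for the transformation, not the hardware semantics: the predicate is computed once and both guarded stores appear in the straight-line code.

```python
# Sketch of if-conversion: the branch on the left becomes predicated
# (guarded) operations on the right, so control flow no longer interrupts
# the fetch stream.

def with_branch(a, i):
    if a[i + 1] > a[i]:        # blt/j: two possible fetch paths
        a[i + 1] = 0
    else:
        a[i] = 0

def with_predication(a, i):
    p4 = a[i + 1] > a[i]       # sgt p4, r2, r3
    if p4:     a[i + 1] = 0    # (p4)  sw r0, [r1+4]  both "arms" are fetched;
    if not p4: a[i] = 0        # (!p4) sw r0, [r1]    the predicate gates the effect

x, y = [3, 5, 7], [3, 5, 7]
with_branch(x, 0); with_predication(y, 0)
print(x, y)   # [3, 0, 7] [3, 0, 7]
```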

  9. Collapsing Buffer [ISCA '95]
  • Fetches multiple (often non-contiguous) instructions per cycle
  • Uses an interleaved BTB to enable multiple branch predictions
  • Aligns instructions in the predicted sequential order
  • Uses a banked I-cache for multiple-line access

  10. Collapsing Buffer
  (Datapath: fetch PC → interleaved BTB; I-cache bank 1 and bank 2 → interchange switch → collapsing circuit)

  11. Collapsing Buffer Mechanism
  (Diagram: the interleaved BTB marks valid-instruction bits for the two fetched lines, A B C D and E F G H; bank routing and the interchange switch put the lines into predicted order, and the collapsing circuit squeezes out the invalid slots to leave a compacted fetch group)
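The collapsing step itself can be sketched as a filter over the fetched slots. The slot letters and valid pattern below are illustrative, not taken from the slide's exact diagram:

```python
# Sketch of the collapsing circuit: given the instructions delivered by the
# two banks (after the interchange switch) and the valid bits from the
# interleaved BTB, squeeze out the invalid slots to form a compact group.

def collapse(slots, valid_bits):
    """Keep only the instructions whose valid bit is set, preserving order."""
    return [inst for inst, v in zip(slots, valid_bits) if v]

# Two fetched lines routed into predicted order, with some slots invalid
slots = ["A", "B", "C", "D", "E", "F", "G", "H"]
valid = [1, 0, 1, 1, 1, 1, 0, 1]       # hypothetical: B and G are off the path
print(collapse(slots, valid))           # ['A', 'C', 'D', 'E', 'F', 'H']
```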

  12. High Bandwidth Instruction Fetching
  • To fetch more, we need to cross multiple basic blocks (and/or multiple cache lines)
  • Requires multiple branch predictions per cycle
  (Control-flow graph: BB1-BB7)

  13. Multiple Branch Predictor [Yeh, Marr, Patt, ICS '93]
  • A Pattern History Table (PHT) design that supports multiple branch prediction (MBP)
  • Based on global history only: the Branch History Register (BHR, bits bk .. b1) indexes the PHT
  • The primary prediction p1 is read first; p1 (and then p2) extend the index to yield the secondary and tertiary predictions, and the BHR is updated with p1, p2
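A simplified sketch of the idea (this collapses the Yeh/Marr/Patt PHT organization into one table; the history length K and counter initialization are assumptions):

```python
# Sketch: the BHR of global outcomes indexes a PHT of 2-bit counters for the
# primary prediction; speculatively shifting each prediction into the history
# yields the secondary and tertiary predictions in the same cycle.

K = 6                                  # global history length (assumed)
pht = [2] * (1 << K)                   # 2-bit counters, initialized weakly taken

def predict3(bhr):
    preds = []
    h = bhr
    for _ in range(3):                 # primary, secondary, tertiary
        p = 1 if pht[h & ((1 << K) - 1)] >= 2 else 0
        preds.append(p)
        h = (h << 1) | p               # speculative history update
    return preds

print(predict3(0b010101))              # [1, 1, 1] with these fresh counters
```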

  14. Multiple Branch Prediction
  • The fetch address can be retrieved from the BTB (br0, primary prediction)
  • Predicted path: BB1 → BB2 → BB5 (2nd and 3rd predictions taken at br1 and br2)
  • How to fetch BB2 and BB5, from the BTB? Can't: the branch PCs of br1 and br2 are not available when the MBP is made
  • Use a BAC design instead

  15. Branch Address Cache
  • Use a Branch Address Cache (BAC): keep 6 possible fetch addresses for 2 more predictions
  • br: 2 bits for branch type (cond, uncond, return); V: a single valid bit (indicates a branch was hit in the sequence)
  • Entry format: tag (23 bits); three V+br pairs; taken and not-taken target addresses, plus the four second-level targets (T-T, T-N, N-T, N-N), 30 bits each
    → 23 + 3×(1+2) + 6×30 = 212 bits per fetch-address entry
  • To make one more level of prediction, 8 more addresses must be cached (total = 14 addresses):
    464 bits per entry = (23+3)×1 + (30+3)×(2+4) + 30×8
  • The fetch address comes from the BTB
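The slide's two bit counts can be checked directly (field widths as given on the slide):

```python
# Worked check of the slide's BAC entry sizes: tag = 23 bits, each branch
# slot = 1 valid bit + 2-bit branch type, each target address = 30 bits.

TAG, V, BR, ADDR = 23, 1, 2, 30

# Two levels of extra prediction: 3 branch slots, 6 target addresses
two_level = TAG + 3 * (V + BR) + 6 * ADDR
print(two_level)           # 212 bits per fetch-address entry

# One more level: per the slide, (23+3)*1 + (30+3)*(2+4) + 30*8
three_level = (TAG + V + BR) * 1 + (ADDR + V + BR) * (2 + 4) + ADDR * 8
print(three_level)         # 464 bits per entry
```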

  16. Caching Non-Consecutive Basic Blocks
  • Goal: high fetch bandwidth + low latency
  • In a conventional instruction cache, the dynamic sequence BB1 BB3 BB5 is scattered across lines; stored in linear memory locations, the same sequence can be fetched as one contiguous unit

  17. Trace Cache
  • Caches dynamic, non-contiguous instructions (traces), crossing multiple basic blocks
  • Needs to predict multiple branches per cycle (MBP)
  • Example: for the dynamic sequence A B C D E F G H I J, an I-cache fetch takes 5 cycles, a collapsing-buffer fetch 3 cycles, but a trace cache (T$) fetch only 1 cycle

  18. Trace Cache [Rotenberg, Bennett, Smith, MICRO '96]
  • A trace line caches at most (in the original paper):
    • M branches (M = 3 in all follow-up TC studies, due to MBP), OR
    • N instructions (N = 16 in all follow-up TC studies)
  • Line format: tag, branch flags, branch mask, fall-through address, taken address
    • Branch flags example: 10 = 1st branch taken, 2nd branch not taken
    • Branch mask example: 11,1 = 3 branches, and the trace ends with a branch
  • The fall-through address is used if the last branch is predicted not taken
  • On a T.C. miss, the line-fill buffer constructs a new trace; on a T.C. hit, N instructions are supplied at once, steered by the MBP

  19. Trace Hit Logic
  • Fetch address A indexes an entry: tag = A, branch flags = 10, mask = 11,1, fall-through = X, target = Y
  • The multiple-branch predictor supplies N T N
  • Trace hit = (tag matches the 1st block) AND (the stored branch flags match the predictions for the remaining blocks)
  • The last prediction selects the next fetch address: fall-through X if not taken, target Y if taken
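A sketch of this hit check; the field names, the dictionary shape, and the example prediction vector are my own, loosely following the slide:

```python
# Sketch of the trace-hit logic: a trace hits when its tag matches the fetch
# address AND the stored branch flags match the multi-branch predictions for
# the remaining blocks; the last prediction picks the next fetch address.

def trace_hit(entry, fetch_pc, preds):
    n_br, ends_in_br = entry["mask"]              # e.g. (3, True) for "11,1"
    if entry["tag"] != fetch_pc:                  # match 1st block
        return False, None
    n_check = n_br - 1 if ends_in_br else n_br    # interior branches to compare
    if entry["flags"][:n_check] != preds[:n_check]:
        return False, None                        # remaining blocks mismatch
    # The final branch's prediction selects the next fetch address
    nxt = entry["target"] if preds[n_check] else entry["fall_thru"]
    return True, nxt

entry = {"tag": "A", "flags": [1, 0], "mask": (3, True),
         "fall_thru": "X", "target": "Y"}
print(trace_hit(entry, "A", [1, 0, 0]))    # (True, 'X'): last branch not taken
```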

  20. Trace Cache Example
  • CFG: A (5 insts) branches to B (6 insts) or C (12 insts); both rejoin at D (4 insts), which loops back to A or exits
  • BB traversal path: ABDABDACDABDACDABDAC
  • A trace line ends on: Cond 1: 3 branches; Cond 2: a filled line (16 instructions); Cond 3: exit
  • The trace cache (5 lines) begins filling with the line (A1-A5 B1-B6 D1-D4)
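The fill process can be simulated. This sketch makes some assumptions not stated on the slide: every basic block ends in a branch, a hit delivers the whole cached trace, and a fragment cut short by the program's exit is not cached.

```python
# Sketch: simulating trace-line construction under the slide's stop
# conditions (3 branches, or a full 16-instruction line, or exit).

SIZES = {"A": 5, "B": 6, "C": 12, "D": 4}
MAX_INSTS, MAX_BR = 16, 3
PATH = "ABDABDACDABDACDABDAC"

def simulate(path):
    stream, ends = [], set()
    for bb in path:                          # flatten the dynamic path
        stream += [f"{bb}{i}" for i in range(1, SIZES[bb] + 1)]
        ends.add(len(stream) - 1)            # each block ends with a branch
    cache, hits, i = [], 0, 0
    while i < len(stream):
        j, br = i, 0                         # grow a trace from position i
        while j < len(stream) and j - i < MAX_INSTS and br < MAX_BR:
            br += j in ends
            j += 1
        trace = tuple(stream[i:j])
        if trace in cache:
            hits += 1                        # whole trace supplied in 1 cycle
        elif j < len(stream) or br == MAX_BR or j - i == MAX_INSTS:
            cache.append(trace)              # completed line -> fill the TC
        i = j                                # exit fragment is left uncached
    return cache, hits

cache, hits = simulate(PATH)
print([len(t) for t in cache], hits)         # [15, 16, 10, 15, 16] 3
```

Under these assumptions the five lines match the slide's picture: (A B D), (A C1-C11), (C12 D A), (B D A), (C1-C12 D).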

  21. Trace Cache Example (cont.)
  • The same traversal keeps filling the trace cache line by line: (A1-A5 C1-C11), then (C12 D1-D4 A1-A5), then (B1-B6 D1-D4 A1-A5)

  22. Trace Cache Example (cont.)
  • After the fifth line, (C1-C12 D1-D4), is allocated, the trace cache is full

  23. Trace Cache Example (cont.)
  • BB traversal path: ABDABDACDABDACDABDAC
  • Questions: over this path, how many trace cache hits occur, and what is the utilization of the cached slots?

  24. Trace Cache Redundancy
  • Duplication:
    • Note that instructions appear only once in the I-cache
    • The same instruction can appear many times in the TC
  • Fragmentation:
    • If 3 BBs hold fewer than 16 instructions, slots stay empty
    • If a multiple-target branch (e.g. return, indirect jump, or trap) is encountered, “trace construction” stops
    • Empty slots → wasted resources
  • Example: a loop over four BBs (A = 6, B = 4, C = 6, D = 3 insts) is broken up into traces (ABC) = 16 insts, (BCD) = 13 insts, (CDA) = 15 insts, (DAB) = 13 insts, duplicating each instruction 3 times
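The duplication arithmetic in the example checks out with the block sizes shown on the slide:

```python
# Worked check of the slide's duplication example: block sizes that make the
# four traces (ABC), (BCD), (CDA), (DAB) hold 16, 13, 15, 13 instructions.

sizes = {"A": 6, "B": 4, "C": 6, "D": 3}
traces = ["ABC", "BCD", "CDA", "DAB"]
lens = [sum(sizes[bb] for bb in t) for t in traces]
print(lens)                    # [16, 13, 15, 13] -- matches the slide

# Each block appears in 3 of the 4 traces: every instruction is stored
# 3 times in the trace cache but only once in the I-cache.
copies = {bb: sum(bb in t for t in traces) for bb in sizes}
print(copies)                  # {'A': 3, 'B': 3, 'C': 3, 'D': 3}
```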

  25. Trace Cache Indexability
  • Suppose the TC has saved traces (EAC) and (BCD), and the needed path is (EAC) followed by (D)
  • The TC cannot index an interior block: (D) inside (BCD) is unreachable, which can cause duplication
  • Partial matching is also needed: (BCD) is cached, but only (BC) may be needed

  26. Pentium 4 (NetBurst) Trace Cache
  • Stores decoded instructions; there is no I$ !!
  • Trace-based prediction: predict the next trace, not the next PC (trace-cache BTB)
  • Front end: front-end BTB + iTLB and prefetcher → L2 cache → decoder → trace cache → rename, execute, etc.
