
ECE 4100/6100 Advanced Computer Architecture, Lecture 6: Instruction Fetch



  1. ECE 4100/6100 Advanced Computer Architecture, Lecture 6: Instruction Fetch. Prof. Hsien-Hsin Sean Lee, School of Electrical and Computer Engineering, Georgia Institute of Technology

  2. Instruction Supply Issues
  • Fetch throughput defines the maximum performance that can be achieved in later stages
  • Superscalar processors need to supply more than 1 instruction per cycle
  • Instruction supply is limited by:
    • Misalignment of multiple instructions in a fetch group
    • Change of flow (interrupting instruction supply)
    • Memory latency and bandwidth
  (Diagram: Instruction Fetch Unit → instruction buffer → execution core)

  3. Aligned Instruction Fetching (4 instructions)
  • Assume one fetch group = 16B (4 instructions); PC = ..xx000000
  • One 64B I-cache line holds A0-A15 as four rows, selected by the row decoder:
    ..00: A0 A1 A2 A3 | ..01: A4 A5 A6 A7 | ..10: A8 A9 A10 A11 | ..11: A12 A13 A14 A15
  • One row can be pulled out at a time: in cycle n, inst 1-4 = A0 A1 A2 A3
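The row selection above can be sketched in a few lines. This is a toy model, not hardware; the sizes and names follow the slide's example:

```python
# Sketch (not from the slides): deriving the I-cache row and slots for an
# aligned fetch. Assumes 4-byte instructions, a 16 B fetch group, and a
# 64 B line stored as four 16 B rows, as in the slide's example.

LINE_BYTES = 64
GROUP_BYTES = 16
INST_BYTES = 4

def aligned_fetch_group(pc):
    """Return the indices of the 4 instructions (A0..A15) read in one cycle."""
    assert pc % GROUP_BYTES == 0, "aligned fetch: PC sits on a group boundary"
    row = (pc % LINE_BYTES) // GROUP_BYTES        # row-decoder input (..00 to ..11)
    first = row * (GROUP_BYTES // INST_BYTES)     # first instruction in that row
    return [first + i for i in range(4)]

# PC = ..xx000000 selects row ..00, i.e. instructions A0 A1 A2 A3
print(aligned_fetch_group(0b000000))   # [0, 1, 2, 3]
print(aligned_fetch_group(0b110000))   # row ..11 -> [12, 13, 14, 15]
```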

  4. Misaligned Fetch
  • PC = ..xx001000: the fetch group starts mid-row, at A2 of the same 64B I-cache line
  • A rotating network restores program order, so cycle n still delivers inst 1-4 = A2 A3 A4 A5
  • Used in the IBM RS/6000
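A minimal sketch of what the rotating network accomplishes, assuming the whole fetch group lies within the line that was read out (names are illustrative, not RS/6000 RTL):

```python
# Sketch of the rotating (alignment) network: the array delivers a raw
# window of instructions, and a rotator restores program order so a fetch
# group can start at any instruction offset within the line.

def misaligned_fetch_group(line, pc, group=4, inst_bytes=4):
    """Fetch `group` instructions starting at PC's offset within the line."""
    start = (pc % (len(line) * inst_bytes)) // inst_bytes
    # Rotate the line so `start` comes first, then take the fetch group.
    rotated = line[start:] + line[:start]
    return rotated[:group]

line = [f"A{i}" for i in range(16)]            # one 64 B I-cache line
# PC = ..xx001000 (byte offset 8) starts mid-row at A2
print(misaligned_fetch_group(line, 0b001000))  # ['A2', 'A3', 'A4', 'A5']
```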

  5. Split Cache Line Access
  • PC = ..xx111000: the fetch group starts at A14 and spills into cache line B
  • The fetch must be broken down into 2 physical accesses: cycle n delivers inst 1-2 (A14 A15), cycle n+1 delivers inst 3-4 (B0 B1)
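The split decision itself is simple address arithmetic; a sketch under the slide's sizes (64B lines, 16B fetch groups):

```python
# Sketch: deciding whether a fetch group spans two cache lines and must be
# split into 2 physical accesses (cycle n and cycle n+1), as on the slide.

LINE_BYTES = 64
GROUP_BYTES = 16

def split_accesses(pc):
    """Return the (line_address, byte_count) pairs needed for one fetch group."""
    off = pc % LINE_BYTES
    line = pc - off
    if off + GROUP_BYTES <= LINE_BYTES:
        return [(line, GROUP_BYTES)]          # single access, no split
    first = LINE_BYTES - off                  # bytes left in line A
    return [(line, first), (line + LINE_BYTES, GROUP_BYTES - first)]

# PC = ..xx111000 (offset 56): 8 B from line A, then 8 B from line B
print(split_accesses(56))   # [(0, 8), (64, 8)]
print(split_accesses(0))    # [(0, 16)]
```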

  6. Split Cache Line Access Miss
  • Same split fetch (PC = ..xx111000), but cache line B misses (the diagram shows line C cached in its place)
  • Cycle n delivers inst 1-2; inst 3-4 arrive only in cycle n+X, after the miss is serviced

  7. High Bandwidth Instruction Fetching
  • Wider issue → more instruction feed needed
  • Major challenge: fetching more than one non-contiguous basic block per cycle
  • Enabling techniques:
    • Predication
    • Branch alignment based on profiling
    • Other hardware solutions (branch prediction is a given)
  (Control-flow graph: BB1-BB7)

  8. Predication Example
  • Converts control dependency into data dependency
  • Enlarges the basic block size: more room for scheduling, no fetch disruption
  Source code:
      if (a[i+1] > a[i]) a[i+1] = 0; else a[i] = 0;
  Typical assembly:
          lw r2, [r1+4]
          lw r3, [r1]
          blt r3, r2, L1
          sw r0, [r1]
          j L2
      L1: sw r0, [r1+4]
      L2: ...
  Assembly with predication:
          lw r2, [r1+4]
          lw r3, [r1]
          sgt p4, r2, r3
    (p4)  sw r0, [r1+4]
    (!p4) sw r0, [r1]
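The same if-conversion can be mimicked in Python terms. This is only an analogy for the transformation, not the hardware semantics: the predicate is computed once and both guarded stores appear in the straight-line code.

```python
# Sketch of if-conversion: the branch on the left becomes predicated
# (guarded) operations on the right, so control flow no longer interrupts
# the fetch stream.

def with_branch(a, i):
    if a[i + 1] > a[i]:        # blt/j: two possible fetch paths
        a[i + 1] = 0
    else:
        a[i] = 0

def with_predication(a, i):
    p4 = a[i + 1] > a[i]       # sgt p4, r2, r3
    if p4:     a[i + 1] = 0    # (p4)  sw r0, [r1+4]  both "arms" are fetched;
    if not p4: a[i] = 0        # (!p4) sw r0, [r1]    the predicate gates the effect

x, y = [3, 5, 7], [3, 5, 7]
with_branch(x, 0); with_predication(y, 0)
print(x, y)   # [3, 0, 7] [3, 0, 7]
```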

  9. Collapsing Buffer [ISCA '95]
  • Fetches multiple (often non-contiguous) instructions per cycle
  • Uses an interleaved BTB to enable multiple branch predictions
  • Aligns instructions in the predicted sequential order
  • Uses a banked I-cache for multiple-line access

  10. Collapsing Buffer
  (Datapath: fetch PC → interleaved BTB; I-cache bank 1 and bank 2 → interchange switch → collapsing circuit)

  11. Collapsing Buffer Mechanism
  (Diagram: the interleaved BTB marks valid-instruction bits for the two fetched lines, A B C D and E F G H; bank routing and the interchange switch put the lines into predicted order, and the collapsing circuit squeezes out the invalid slots to leave a compacted fetch group)
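The collapsing step itself can be sketched as a filter over the fetched slots. The slot letters and valid pattern below are illustrative, not taken from the slide's exact diagram:

```python
# Sketch of the collapsing circuit: given the instructions delivered by the
# two banks (after the interchange switch) and the valid bits from the
# interleaved BTB, squeeze out the invalid slots to form a compact group.

def collapse(slots, valid_bits):
    """Keep only the instructions whose valid bit is set, preserving order."""
    return [inst for inst, v in zip(slots, valid_bits) if v]

# Two fetched lines routed into predicted order, with some slots invalid
slots = ["A", "B", "C", "D", "E", "F", "G", "H"]
valid = [1, 0, 1, 1, 1, 1, 0, 1]       # hypothetical: B and G are off the path
print(collapse(slots, valid))           # ['A', 'C', 'D', 'E', 'F', 'H']
```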

  12. High Bandwidth Instruction Fetching
  • To fetch more, we need to cross multiple basic blocks (and/or multiple cache lines)
  • Requires multiple branch predictions per cycle
  (Control-flow graph: BB1-BB7)

  13. Multiple Branch Predictor [Yeh, Marr, Patt, ICS '93]
  • A Pattern History Table (PHT) design that supports multiple branch prediction (MBP)
  • Based on global history only: the Branch History Register (BHR, bits bk .. b1) indexes the PHT
  • The primary prediction p1 is read first; p1 (and then p2) extend the index to yield the secondary and tertiary predictions, and the BHR is updated with p1, p2
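A simplified sketch of the idea (this collapses the Yeh/Marr/Patt PHT organization into one table; the history length K and counter initialization are assumptions):

```python
# Sketch: the BHR of global outcomes indexes a PHT of 2-bit counters for the
# primary prediction; speculatively shifting each prediction into the history
# yields the secondary and tertiary predictions in the same cycle.

K = 6                                  # global history length (assumed)
pht = [2] * (1 << K)                   # 2-bit counters, initialized weakly taken

def predict3(bhr):
    preds = []
    h = bhr
    for _ in range(3):                 # primary, secondary, tertiary
        p = 1 if pht[h & ((1 << K) - 1)] >= 2 else 0
        preds.append(p)
        h = (h << 1) | p               # speculative history update
    return preds

print(predict3(0b010101))              # [1, 1, 1] with these fresh counters
```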

  14. Multiple Branch Prediction
  • The fetch address can be retrieved from the BTB (br0, primary prediction)
  • Predicted path: BB1 → BB2 → BB5 (2nd and 3rd predictions taken at br1 and br2)
  • How to fetch BB2 and BB5, from the BTB? Can't: the branch PCs of br1 and br2 are not available when the MBP is made
  • Use a BAC design instead

  15. Branch Address Cache
  • Use a Branch Address Cache (BAC): keep 6 possible fetch addresses for 2 more predictions
  • br: 2 bits for branch type (cond, uncond, return); V: a single valid bit (indicates a branch was hit in the sequence)
  • Entry format: tag (23 bits); three V+br pairs; taken and not-taken target addresses, plus the four second-level targets (T-T, T-N, N-T, N-N), 30 bits each
    → 23 + 3×(1+2) + 6×30 = 212 bits per fetch-address entry
  • To make one more level of prediction, 8 more addresses must be cached (total = 14 addresses):
    464 bits per entry = (23+3)×1 + (30+3)×(2+4) + 30×8
  • The fetch address comes from the BTB
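The slide's two bit counts can be checked directly (field widths as given on the slide):

```python
# Worked check of the slide's BAC entry sizes: tag = 23 bits, each branch
# slot = 1 valid bit + 2-bit branch type, each target address = 30 bits.

TAG, V, BR, ADDR = 23, 1, 2, 30

# Two levels of extra prediction: 3 branch slots, 6 target addresses
two_level = TAG + 3 * (V + BR) + 6 * ADDR
print(two_level)           # 212 bits per fetch-address entry

# One more level: per the slide, (23+3)*1 + (30+3)*(2+4) + 30*8
three_level = (TAG + V + BR) * 1 + (ADDR + V + BR) * (2 + 4) + ADDR * 8
print(three_level)         # 464 bits per entry
```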

  16. Caching Non-Consecutive Basic Blocks
  • Goal: high fetch bandwidth + low latency
  • In a conventional instruction cache, the dynamic sequence BB1 BB3 BB5 is scattered across lines; stored in linear memory locations, the same sequence can be fetched as one contiguous unit

  17. Trace Cache
  • Caches dynamic, non-contiguous instructions (traces), crossing multiple basic blocks
  • Needs to predict multiple branches per cycle (MBP)
  • Example: for the dynamic sequence A B C D E F G H I J, an I-cache fetch takes 5 cycles, a collapsing-buffer fetch 3 cycles, but a trace cache (T$) fetch only 1 cycle

  18. Trace Cache [Rotenberg, Bennett, Smith, MICRO '96]
  • A trace line caches at most (in the original paper):
    • M branches (M = 3 in all follow-up TC studies, due to MBP), OR
    • N instructions (N = 16 in all follow-up TC studies)
  • Line format: tag, branch flags, branch mask, fall-through address, taken address
    • Branch flags example: 10 = 1st branch taken, 2nd branch not taken
    • Branch mask example: 11,1 = 3 branches, and the trace ends with a branch
  • The fall-through address is used if the last branch is predicted not taken
  • On a T.C. miss, the line-fill buffer constructs a new trace; on a T.C. hit, N instructions are supplied at once, steered by the MBP

  19. Trace Hit Logic
  • Fetch address A indexes an entry: tag = A, branch flags = 10, mask = 11,1, fall-through = X, target = Y
  • The multiple-branch predictor supplies N T N
  • Trace hit = (tag matches the 1st block) AND (the stored branch flags match the predictions for the remaining blocks)
  • The last prediction selects the next fetch address: fall-through X if not taken, target Y if taken
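A sketch of this hit check; the field names, the dictionary shape, and the example prediction vector are my own, loosely following the slide:

```python
# Sketch of the trace-hit logic: a trace hits when its tag matches the fetch
# address AND the stored branch flags match the multi-branch predictions for
# the remaining blocks; the last prediction picks the next fetch address.

def trace_hit(entry, fetch_pc, preds):
    n_br, ends_in_br = entry["mask"]              # e.g. (3, True) for "11,1"
    if entry["tag"] != fetch_pc:                  # match 1st block
        return False, None
    n_check = n_br - 1 if ends_in_br else n_br    # interior branches to compare
    if entry["flags"][:n_check] != preds[:n_check]:
        return False, None                        # remaining blocks mismatch
    # The final branch's prediction selects the next fetch address
    nxt = entry["target"] if preds[n_check] else entry["fall_thru"]
    return True, nxt

entry = {"tag": "A", "flags": [1, 0], "mask": (3, True),
         "fall_thru": "X", "target": "Y"}
print(trace_hit(entry, "A", [1, 0, 0]))    # (True, 'X'): last branch not taken
```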

  20. Trace Cache Example
  • CFG: A (5 insts) branches to B (6 insts) or C (12 insts); both rejoin at D (4 insts), which loops back to A or exits
  • BB traversal path: ABDABDACDABDACDABDAC
  • A trace line ends on: Cond 1: 3 branches; Cond 2: a filled line (16 instructions); Cond 3: exit
  • The trace cache (5 lines) begins filling with the line (A1-A5 B1-B6 D1-D4)
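The fill process can be simulated. This sketch makes some assumptions not stated on the slide: every basic block ends in a branch, a hit delivers the whole cached trace, and a fragment cut short by the program's exit is not cached.

```python
# Sketch: simulating trace-line construction under the slide's stop
# conditions (3 branches, or a full 16-instruction line, or exit).

SIZES = {"A": 5, "B": 6, "C": 12, "D": 4}
MAX_INSTS, MAX_BR = 16, 3
PATH = "ABDABDACDABDACDABDAC"

def simulate(path):
    stream, ends = [], set()
    for bb in path:                          # flatten the dynamic path
        stream += [f"{bb}{i}" for i in range(1, SIZES[bb] + 1)]
        ends.add(len(stream) - 1)            # each block ends with a branch
    cache, hits, i = [], 0, 0
    while i < len(stream):
        j, br = i, 0                         # grow a trace from position i
        while j < len(stream) and j - i < MAX_INSTS and br < MAX_BR:
            br += j in ends
            j += 1
        trace = tuple(stream[i:j])
        if trace in cache:
            hits += 1                        # whole trace supplied in 1 cycle
        elif j < len(stream) or br == MAX_BR or j - i == MAX_INSTS:
            cache.append(trace)              # completed line -> fill the TC
        i = j                                # exit fragment is left uncached
    return cache, hits

cache, hits = simulate(PATH)
print([len(t) for t in cache], hits)         # [15, 16, 10, 15, 16] 3
```

Under these assumptions the five lines match the slide's picture: (A B D), (A C1-C11), (C12 D A), (B D A), (C1-C12 D).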

  21. Trace Cache Example (cont.)
  • The same traversal keeps filling the trace cache line by line: (A1-A5 C1-C11), then (C12 D1-D4 A1-A5), then (B1-B6 D1-D4 A1-A5)

  22. Trace Cache Example (cont.)
  • After the fifth line, (C1-C12 D1-D4), is allocated, the trace cache is full

  23. Trace Cache Example (cont.)
  • BB traversal path: ABDABDACDABDACDABDAC
  • Questions: over this path, how many trace cache hits occur, and what is the utilization of the cached slots?

  24. Trace Cache Redundancy
  • Duplication:
    • Note that instructions appear only once in the I-cache
    • The same instruction can appear many times in the TC
  • Fragmentation:
    • If 3 BBs hold fewer than 16 instructions, slots stay empty
    • If a multiple-target branch (e.g. return, indirect jump, or trap) is encountered, “trace construction” stops
    • Empty slots → wasted resources
  • Example: a loop over four BBs (A = 6, B = 4, C = 6, D = 3 insts) is broken up into traces (ABC) = 16 insts, (BCD) = 13 insts, (CDA) = 15 insts, (DAB) = 13 insts, duplicating each instruction 3 times
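The duplication arithmetic in the example checks out with the block sizes shown on the slide:

```python
# Worked check of the slide's duplication example: block sizes that make the
# four traces (ABC), (BCD), (CDA), (DAB) hold 16, 13, 15, 13 instructions.

sizes = {"A": 6, "B": 4, "C": 6, "D": 3}
traces = ["ABC", "BCD", "CDA", "DAB"]
lens = [sum(sizes[bb] for bb in t) for t in traces]
print(lens)                    # [16, 13, 15, 13] -- matches the slide

# Each block appears in 3 of the 4 traces: every instruction is stored
# 3 times in the trace cache but only once in the I-cache.
copies = {bb: sum(bb in t for t in traces) for bb in sizes}
print(copies)                  # {'A': 3, 'B': 3, 'C': 3, 'D': 3}
```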

  25. Trace Cache Indexability
  • Suppose the TC has saved traces (EAC) and (BCD), and the needed path is (EAC) followed by (D)
  • The TC cannot index an interior block: (D) inside (BCD) is unreachable, which can cause duplication
  • Partial matching is also needed: (BCD) is cached, but only (BC) may be needed

  26. Pentium 4 (NetBurst) Trace Cache
  • Stores decoded instructions; there is no I$ !!
  • Trace-based prediction: predict the next trace, not the next PC (trace-cache BTB)
  • Front end: front-end BTB + iTLB and prefetcher → L2 cache → decoder → trace cache → rename, execute, etc.
