ECE 512, Microarchitecture
Department of Electrical and Computer Engineering
University of Illinois at Urbana-Champaign
M. Frank

EXAM
April 26, 2006, 7:00pm to April 27, 2006, 8:00pm
Open book/note/web/calculator/computer.

By turning in this exam, you attest that you have neither received nor given inappropriate aid on this exam from any person except the instructor.

Note: The exam has a total of 11 questions and 125 points. The questions do not have equal weights.

NAME: ____________________________________

1. (5 pts) For some reason there are lots of academic papers proposing methods for pipelining the scheduling logic and few about pipelining the rename logic. Yet, in both cases (scheduling and renaming) there is a critical loop, where each instruction depends on the previous. Are academics stupid or is there a reason for this discrepancy (or both)?

2. (5 pts) Give two reasons why it was important for the Alpha 21264 to keep instructions in the scheduler sorted by the order in which they were fetched.

3. (5 pts) In class several weeks ago I tried (and failed) to construct a simple tradeoff analysis that would show us when predication is preferred to speculation and conversely. Fix the model. Assume we have an if-then-else construct. The branch has been profiled and has been found to have a bias of x%. The then clause has t instructions and the else clause has e instructions. We are running on a single-issue in-order processor with a static branch predictor (the compiler puts a bias bit into the branch instruction) and an m cycle branch mispredict penalty. Every instruction has single cycle latency. Without loss of generality, assume the branch is biased toward the "then" clause (assume that 0 ≤ x < 0.5). Construct a simple algebraic expression to describe the circumstances under which it is preferable to predicate and remove the branch.
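A minimal sketch of how the algebra in question 3 might come out, under the (assumed) reading that x is the fraction of executions in which the branch goes against its bias, so a static predictor mispredicts with probability x and the branch itself costs one cycle:

\[ \underbrace{t + e}_{\text{predicated}} \;\le\; \underbrace{1 + (1 - x)\,t + x\,(e + m)}_{\text{branch kept}} \quad\Longleftrightarrow\quad x\,t + (1 - x)\,e \;\le\; 1 + x\,m . \]

In words: predicating pays off when the work wasted executing the clause that would not have run is cheaper than the branch instruction plus its expected misprediction penalty.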

4. (5 pts) Suppose I have a loop with 6 instructions that I want to modulo schedule on a machine with only 4 issue slots. How many times should I unroll the loop? What if my loop has 7 instructions? Give the unroll factor I must use to achieve 100% issue utilization on an M-wide machine for a loop of S instructions.

5. (5 pts) You are the lead architect at Illinco, the market-leading manufacturer of high-performance processors. Your competitors at Wisco have just released their newest product, and it seems to be better than your latest design. Your secret contacts inside Wisco have told you that they are using a gshare predictor, a loop predictor and a return address stack, but haven't told you how many bits are in the history register. Construct a simple piece of code that can be timed to estimate the history register size.

6. (5 pts) The Itanium provides architectural support to speculatively move loads above stores, but no architectural support to speculatively move stores above branches. Construct a loop (in C) that will perform much worse on the Itanium than it would on a machine with a similar set of resources and an additional store buffer.
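As a sanity check on the arithmetic in question 4 (a sketch, assuming the only constraint is that every issue slot be filled in steady state): unroll until the body length is a multiple of the machine width,

\[ U \;=\; \frac{\operatorname{lcm}(S, M)}{S} \;=\; \frac{M}{\gcd(S, M)} , \]

so the 6-instruction loop on the 4-wide machine needs an unroll factor of 2 (12 instructions filling 3 cycles) and the 7-instruction loop needs an unroll factor of 4 (28 instructions filling 7 cycles).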
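For question 5, one plausible shape for such a probe (a sketch only; identifiers like probe, ITERS and pattern[] are made up for illustration): time a branch whose outcome follows a fixed but irregular pattern of length k. While positions in the pattern can still be told apart by the global history register, gshare learns the pattern and the code runs fast; once k grows past what the history can resolve, mispredictions, and therefore the measured time per branch, jump. The surrounding loop branches also pollute the history, so the knee in the timing curve is an estimate of the history length rather than an exact count.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ITERS 100000

/* Time one run over a fixed pseudo-random outcome pattern of length k. */
static double probe(int k)
{
    unsigned char pattern[4096];
    volatile long x = 0;
    int i, iter;

    srand(12345);                     /* fixed seed: identical pattern on every call */
    for (i = 0; i < k; i++)
        pattern[i] = rand() & 1;

    clock_t t0 = clock();
    for (iter = 0; iter < ITERS; iter++)
        for (i = 0; i < k; i++)
            if (pattern[i])           /* the branch being probed */
                x++;
            else
                x--;
    return (double)(clock() - t0) / CLOCKS_PER_SEC / ((double)ITERS * k);
}

int main(void)
{
    /* Sweep k; the k at which time per branch jumps estimates how many
       history bits the gshare predictor can exploit. */
    for (int k = 2; k <= 4096; k *= 2)
        printf("k = %4d  time/branch = %.2e s\n", k, probe(k));
    return 0;
}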

7. (20 pts) We are given a pipelined out-of-order superscalar processor that is 4-wide and sends each instruction through 7 stages: fetch, decode, rename, schedule, register-read, execute and (eventually) retire. The machine has a 4-way-associative branch target buffer with 16 total entries, and branch prediction is performed by a two-bit saturating counter associated with each entry. Assume that the machine contains 4 identical fully-bypassed functional units, that all instructions except mul take a single cycle to execute, and that the mul instruction requires two cycles to execute.

Your lab partner has written two code snippets which perform the same function. Snippet A (first listing below) has two inner loops, the first of which is used to calculate some value in register $s0 and the second of which is used to calculate some value in register $s1. Snippet B (second listing below) calculates the same values in registers $s0 and $s1, but the two inner loops have been fused, based on the observation that they both iterate the same number of times (based on the constant Cmax). You have observed that Snippet B executes faster than Snippet A, and the question is why?

A) Calculate the branch misprediction rate for both snippets as a function of Cmax, assuming that the branch target buffer is warmed up (already contains the branches in the snippets from a recent previous execution of this code).
B) Assuming that fetch starts at instruction 0, show the wind-up, wind-down and steady-state execution schedules for the top loop in Snippet A (instructions 1 through 4).
C) Calculate the number of cycles between executions of instruction 0 for both Snippets A and B as a function of the constant Cmax.
D) Explain the bottleneck(s) in Snippet A, and how Snippet B fixes them.
E) How many times would you need to unroll if you wanted to produce a similar schedule for Snippet B on a statically scheduled machine with neither register renaming nor rotating register files?

Snippet A
 0: start: $c <- 0
 1: top0:  $t0 <- mul $r0, $c
 2:        $c <- $c + 1
 3:        $s0 <- $s0 + $t0
 4:        if $c < Cmax goto top0
 5:        $c <- 0
 6: top1:  $t1 <- mul $r1, $c
 7:        $c <- $c + 1
 8:        $s1 <- $s1 + $t1
 9:        if $c < Cmax goto top1
10:        goto start

Snippet B
 0: start: $c <- 0
 1: top:   $t0 <- mul $r0, $c
 2:        $t1 <- mul $r1, $c
 3:        $c <- $c + 1
 4:        $s0 <- $s0 + $t0
 5:        $s1 <- $s1 + $t1
 6:        if $c < Cmax goto top
 7:        goto start

8. (20 pts) You are designing a new single-issue in-order pipeline and would like to implement precise interrupts. Your ISA has 32 registers and you have already decided that your instruction latencies will be 1, 2, 5 and 8 cycles. You have narrowed the organization down to one of two choices. The first is a Smith and Pleszkun style reorder buffer mechanism (i.e., the familiar in-order pipeline from your undergraduate computer organization class: 8 execution stages after the register file, with register writeback from the final stage and a forwarding unit in the first execution stage). The other choice is a Smith and Pleszkun style future file. Evaluate the circuit complexity tradeoffs between these two approaches (what kinds of register files and comparators do you need, how big are they, etc.). Be specific; it might be helpful to draw informal schematics.

9. (20 pts) Palacharla, Jouppi and Smith produced a simplified delay model for dynamic wakeup logic that assumed that tags in the wakeup logic are encoded (i.e., use a number of bits logarithmic in the number of possible tags). In particular, they found that the length of the tag drive lines is proportional to the product of the issue width (maximum number of instructions that can signal "completion" per cycle) and the window size (maximum number of instructions that can be waiting to issue). Likewise, they found that the length of the tag-match lines is roughly proportional to the issue width.

A) Assume we have a wakeup window organized as in the Pentium-4. That is, assume that there is a renaming-like structure that allocates wakeup-window slots to instructions and maps source register identifiers to the wakeup-window slot of the producer instruction (if the producer instruction has not yet completed). Describe the structure of this renaming-like structure in terms of the number of architectural and physical registers, the fetch and issue width, the number of wakeup-window slots and the number of slots in the retirement buffer.

B) The renaming structure described in part A allows the Pentium-4 to get away with using only as many wakeup tags as there are slots in the wakeup window. As a result, the tags in the wakeup window can be stored decoded (also called "one-hot encoded"). Perform a Palacharla, Jouppi and Smith style analysis of the length of the tag drive and tag match lines. (I'm looking for statements of the form, "the length is proportional to the (product/sum) of the ____ and the ____.")

10. (15 pts) The Pentium-4 uses a trace cache mechanism mainly to avoid the power and latency costs of decoding CISC instructions every time they are fetched. Another benefit of trace cache mechanisms is that they reduce the number of "partial fetches." (Recall that a "partial fetch" is a circumstance where we cannot fetch the maximum width of the machine because the next fetch unit ends in a branch or crosses a cache line boundary.) The downside of a trace cache mechanism is that it might store multiple copies of the same instruction at different places in the cache, so trace caches must be larger than instruction caches to achieve similar hit rates.

You are designing a wide-issue RISC processor and your lab partner suggests that there is an "easy" way to use indirection to get the best of both worlds. Rather than replacing the instruction cache with a trace cache, you will use a "trace BTB." Each trace-BTB entry will store the start and end addresses of up to 4 basic blocks. Then you will bank the instruction cache 8 ways, so that with high likelihood you will be able to access all 4 basic blocks every cycle. If there are bank conflicts then the scheme won't perform as well, but your lab partner argues that with enough banks the probability of conflict will be low. Discuss the tradeoffs in using the suggested technique.

11. (20 pts) The new Illin™ 512 processor is a statically scheduled in-order machine with a RISC instruction set. The 512 has two new instructions: "checkpoint" and "assert." Checkpoint saves a "snapshot" of the machine's register state along with a program counter to jump to if an assert fails. Assert tests a condition (like a branch would) and, if the condition fails, rolls the machine state back to the state saved at the most recent checkpoint instruction and jumps to the recovery code. After profiling you have identified an important loop in your program and, using the checkpoint and assert instructions, you have unrolled it as shown below.

A) Identify all of the redundant instructions in the unrolled code. (An instruction is redundant if you can completely eliminate it without changing the final machine state.) Provide a brief justification for each instruction that you identify as redundant.

B) Observe that on each iteration of the unrolled loop (instructions 1 through 19) register Ri is incremented by exactly 3 and Rc is incremented by 3*Rb. How can this observation be used to further reduce each iteration's cost?

 1  top:           Checkpoint (recover at original_loop)
 2                 Ra <- Rx + Ry
 3                 Rb <- Ra + Rz
 4                 Rc <- Rc + Rb
 5                 Rd <- Rc + Rw
 6                 Ri <- Ri + 1
 7                 Assert Ri < Rn
 8                 Ra <- Rx + Ry
 9                 Rb <- Ra + Rz
10                 Rc <- Rc + Rb
11                 Rd <- Rc + Rw
12                 Ri <- Ri + 1
13                 Assert Ri < Rn
14                 Ra <- Rx + Ry
15                 Rb <- Ra + Rz
16                 Rc <- Rc + Rb
17                 Rd <- Rc + Rw
18                 Ri <- Ri + 1
19                 if (Ri < Rn) goto top
20                 else goto exit
21  original_loop:
22                 Ra <- Rx + Ry
23                 Rb <- Ra + Rz
24                 Rc <- Rc + Rb
25                 Rd <- Rc + Rw
26                 Ri <- Ri + 1
27                 if (Ri < Rn) goto original_loop
