CSC 4250 Computer Architectures

CSC 4250 Computer Architectures. October 27, 2006. Chapter 3. Instruction-Level Parallelism & Its Dynamic Exploitation.

Presentation Transcript


  1. CSC 4250 Computer Architectures. October 27, 2006. Chapter 3. Instruction-Level Parallelism & Its Dynamic Exploitation

  2. Nested Loops
            DADDIU R1,R0,#80
     Loop1: L.D    F2,1600(R1)
            DADDIU R2,R0,#40
     Loop2: L.D    F0,1000(R2)
            ADD.D  F0,F0,F2
            S.D    F0,1000(R2)
            DADDIU R2,R2,#-8
            BNEZ   R2,Loop2
            DADDIU R1,R1,#-8
            BNEZ   R1,Loop1
     How many times do Loop1 and Loop2 iterate?
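The iteration counts follow from the register arithmetic: each loop subtracts 8 from its counter register and branches back while the register is nonzero. A minimal Python sketch of that count (the helper name `count_iterations` is hypothetical, not from the slides):

```python
def count_iterations(start, step=8):
    """Count passes through a loop that subtracts `step` from a register
    and branches back while the register is nonzero (DADDIU + BNEZ).
    Assumes `start` is a positive multiple of `step`."""
    reg, passes = start, 0
    while True:
        passes += 1          # loop body executes once
        reg -= step          # DADDIU Rx,Rx,#-8
        if reg == 0:         # BNEZ falls through when the register hits zero
            return passes

outer = count_iterations(80)   # Loop1: R1 starts at 80
inner = count_iterations(40)   # Loop2: R2 restarts at 40 on every Loop1 pass
print(outer, inner, outer * inner)
```

Under these assumptions Loop1 runs 10 times and Loop2 runs 5 times per Loop1 pass, which matches the five-outcome groups in the branch history on the next slide.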

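  3. BNEZ R2,Loop2
     Branch history: TTTTN|TTTTN|TTTTN|TTTTN|… (T means branch taken, N means branch not taken; each group is one complete execution of Loop2).
     1-bit predictor: TTTTT|NTTTT|NTTTT|NTTTT|… → two errors per execution of Loop2 in the steady state (the final N and the first T of the next execution).
     2-bit predictor: TTTTT|TTTTT|TTTTT|TTTTT|… → one error per execution of Loop2 (the final N).
     The error behavior for Loop1 is similar. Would putting more bits in the counter improve the error behavior?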
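The two error counts can be checked with a short simulation. This is a sketch only: it assumes the 2-bit predictor is a plain saturating counter that predicts taken for values 2 and 3, and that both predictors start in a taken state. The function names are illustrative.

```python
def one_bit_misses(history, state=True):
    """1-bit predictor: predict whatever the branch did last time."""
    misses = 0
    for taken in history:
        if state != taken:
            misses += 1
        state = taken                     # remember the latest outcome
    return misses

def two_bit_misses(history, counter=3):
    """2-bit saturating counter (assumed encoding): predict taken when >= 2."""
    misses = 0
    for taken in history:
        if (counter >= 2) != taken:
            misses += 1
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return misses

passes = 100                              # executions of Loop2
history = [c == "T" for c in "TTTTN" * passes]
print(one_bit_misses(history) / passes)   # errors per execution, 1-bit
print(two_bit_misses(history) / passes)   # errors per execution, 2-bit
```

The 1-bit rate approaches two errors per execution of Loop2 and the 2-bit rate is one, because a single not-taken outcome does not flip the 2-bit prediction.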

  4. Global Branch History
     Global branch history: TTTTN|T|TTTTN|T|TTTTN|T|TTTTN|T|…
     Loop:                  22222|1|22222|1|22222|1|22222|1|…
     Can we use global branch history to get a better result? (On the previous slide, we looked at local branch history.)

  5. 5-Bit Global Branch History
     We keep a 5-bit global branch history and use the bit pattern to choose one of 2^5 = 32 1-bit predictors:
       History   Prediction
       TTTTT     N
       TTTTN     T
       TTTNT     T
       TTNTT     T
       TNTTT     T
       NTTTT     T
       …
       NNNNN     T
     We get 100% accuracy in the steady state. This strategy works only if at least 5 history bits are used.
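A small simulation illustrates why 5 history bits suffice and 4 do not. It is a sketch under stated assumptions: the history starts as all taken, table entries that have not been seen yet predict taken, and the outer loop is lengthened to 1000 passes so the steady state dominates. All names are illustrative.

```python
def nested_loop_branches(outer=10, inner=5):
    """Yield branch outcomes in program order: the Loop2 branch `inner`
    times (taken except on the last pass), then the Loop1 branch once."""
    for i in range(outer):
        for j in range(inner):
            yield j < inner - 1          # BNEZ R2,Loop2
        yield i < outer - 1              # BNEZ R1,Loop1

def global_history_misses(outcomes, bits=5):
    """Index a table of 1-bit predictors with the last `bits` outcomes."""
    table = {}                           # history tuple -> last outcome seen
    history = (True,) * bits             # assumed initial history: all taken
    misses = []
    for n, taken in enumerate(outcomes):
        if table.get(history, True) != taken:   # unseen entries predict taken
            misses.append(n)
        table[history] = taken           # 1-bit update
        history = history[1:] + (taken,) # shift in the newest outcome
    return misses

outcomes = list(nested_loop_branches(outer=1000, inner=5))
print(len(global_history_misses(outcomes, bits=5)))
print(len(global_history_misses(outcomes, bits=4)))
```

With 5 bits the only misses are the first exit from Loop2 and the final exit from Loop1. With 4 bits the pattern TTTT precedes both a taken and a not-taken outcome, so the predictor keeps missing in every pass.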

  6. Correlating Branch Predictors (p. 200)
     • A 2-bit predictor uses only the recent behavior of a single branch.
     • SPEC92 benchmark eqntott (the worst case in Figures 3.8 and 3.9, with an 18% error rate):
         if (aa==2) aa=0;
         if (bb==2) bb=0;
         if (aa!=bb) {

  7. MIPS Code
     Assume that aa and bb are assigned to R1 and R2:
             DSUBUI R3,R1,#2
             BNEZ   R3,L1      ;branch b1 (aa!=2)
             DADD   R1,R0,R0   ;aa=0
         L1: DSUBUI R3,R2,#2
             BNEZ   R3,L2      ;branch b2 (bb!=2)
             DADD   R2,R0,R0   ;bb=0
         L2: DSUBU  R3,R1,R2   ;R3=aa−bb
             BEQZ   R3,L3      ;branch b3 (aa==bb)
     Consider the branches. The behavior of branch b3 is correlated with the behavior of branches b1 and b2: if both b1 and b2 are not taken, then aa and bb are both set to 0, so they are equal and b3 will be taken.
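The correlation can be checked by enumerating a few input values. The sketch below mirrors the three branches in Python; `run_fragment` is a hypothetical helper, and True means the branch is taken.

```python
def run_fragment(aa, bb):
    """Return (b1, b2, b3) outcomes for the eqntott fragment."""
    b1 = aa != 2          # BNEZ R3,L1 is taken when aa != 2
    if not b1:
        aa = 0            # fall-through path: aa = 0
    b2 = bb != 2          # BNEZ R3,L2 is taken when bb != 2
    if not b2:
        bb = 0            # fall-through path: bb = 0
    b3 = aa == bb         # BEQZ R3,L3 is taken when aa == bb
    return b1, b2, b3

for aa in range(3):
    for bb in range(3):
        print(aa, bb, run_fragment(aa, bb))
```

Whenever the first two outcomes are both False (not taken), the third is True, which is exactly the information a predictor that looks only at b3's own history cannot use.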

  8. Simplified Example (p. 202)
     Suppose that d has values 0, 1, and 2:
         if (d==0) d=1;
         if (d==1)
     MIPS code, assuming that d is assigned to R1:
             BNEZ   R1,L1      ;branch b1 (d!=0)
             DADDIU R1,R0,#1   ;d==0, so d=1
         L1: DADDIU R3,R1,#-1
             BNEZ   R3,L2      ;branch b2 (d!=1)
             …
         L2:

  9. Figure 3.10. Possible execution sequence
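The possible execution sequences can be regenerated from the simplified code on the previous slide. This is a sketch with a hypothetical helper, `simplified_fragment`, that returns the outcome of b1, the value of d before b2, and the outcome of b2 (True means taken).

```python
def simplified_fragment(d):
    """Trace the two branches of the simplified example for one value of d."""
    b1 = d != 0            # BNEZ R1,L1 is taken when d != 0
    if not b1:
        d = 1              # fall-through path: d = 1
    b2 = d != 1            # BNEZ R3,L2 is taken when d != 1
    return b1, d, b2

for initial_d in (0, 1, 2):
    b1, d_before_b2, b2 = simplified_fragment(initial_d)
    print(initial_d, b1, d_before_b2, b2)
```

The rows show the dependence that matters for prediction: if b1 is not taken, then b2 is not taken either.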

  10. Figure 3.11. Behavior of 1-bit predictor initialized to NT. Suppose that d = 2, 0, 2, 0, … Misprediction Rate = 100%!
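The 100% figure can be reproduced with a few lines of simulation. The sketch assumes one independent 1-bit predictor per branch, both starting at not taken; the function names are illustrative.

```python
def branch_outcomes(d):
    """Outcomes (b1, b2) of the simplified example; True means taken."""
    b1 = d != 0
    if not b1:
        d = 1
    b2 = d != 1
    return b1, b2

def one_bit_misprediction_rate(d_values):
    pred = {"b1": False, "b2": False}     # both initialized to not taken
    misses = total = 0
    for d in d_values:
        for name, taken in zip(("b1", "b2"), branch_outcomes(d)):
            misses += pred[name] != taken # count a miss on disagreement
            pred[name] = taken            # 1-bit update: remember last outcome
            total += 1
    return misses / total

print(one_bit_misprediction_rate([2, 0] * 4))
```

Because each branch alternates between taken and not taken, a predictor that repeats the last outcome is wrong every time.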

  11. Figure 3.12. Meaning of Prediction Bits

  12. Figure 3.13. Action of 1-bit predictor with 1 bit of correlation. Initialized to NT/NT
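A sketch of the same experiment with 1 bit of correlation follows. Each branch keeps two 1-bit predictions, one used when the last executed branch was not taken and one used when it was taken. The sketch assumes d = 2, 0, 2, 0, … and that the global history bit starts as not taken; the names are illustrative.

```python
def branch_outcomes(d):
    """Outcomes (b1, b2) of the simplified example; True means taken."""
    b1 = d != 0
    if not b1:
        d = 1
    b2 = d != 1
    return b1, b2

def correlating_misses(d_values):
    # pred[name][g]: prediction for branch `name` when the last executed
    # branch had outcome g (False = not taken, True = taken). Starts NT/NT.
    pred = {"b1": {False: False, True: False},
            "b2": {False: False, True: False}}
    last = False                      # assumption: history starts as not taken
    misses = []
    for i, d in enumerate(d_values):
        for name, taken in zip(("b1", "b2"), branch_outcomes(d)):
            if pred[name][last] != taken:
                misses.append((i, name))
            pred[name][last] = taken  # update only the selected prediction bit
            last = taken              # this branch becomes the global history
    return misses

print(correlating_misses([2, 0] * 4))
```

Under these assumptions the only mispredictions occur in the first iteration, after which both branches are predicted correctly.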

  13. Figure 3.14. A (2,2) Branch Prediction Buffer
      • This buffer uses a 2-bit global history to choose from among 2^2 = 4 predictors for each branch address. Each predictor is in turn a 2-bit predictor for that branch.
      • Figure 3.12 shows a (1,1) branch prediction buffer.
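The general (m,n) scheme can be sketched as a table with one row per branch-address index and 2^m n-bit counters per row. This is an illustration, not the hardware design: the class name, the default of 1024 rows, the modulo indexing, and the saturating-counter encoding are all assumptions.

```python
class CorrelatingPredictor:
    """Sketch of an (m,n) predictor: m bits of global history select one
    of 2**m n-bit saturating counters in the row for a branch address."""

    def __init__(self, m=2, n=2, entries=1024):
        self.m, self.n, self.entries = m, n, entries
        self.max_count = (1 << n) - 1
        self.history = 0                             # m-bit global history
        self.table = [[0] * (1 << m) for _ in range(entries)]

    def predict(self, address):
        counter = self.table[address % self.entries][self.history]
        return counter > self.max_count // 2         # upper half predicts taken

    def update(self, address, taken):
        row = self.table[address % self.entries]
        c = row[self.history]
        row[self.history] = min(c + 1, self.max_count) if taken else max(c - 1, 0)
        mask = (1 << self.m) - 1
        self.history = ((self.history << 1) | int(taken)) & mask

    def total_bits(self):
        return (1 << self.m) * self.n * self.entries  # 2^m * n * rows

p = CorrelatingPredictor(m=2, n=2, entries=1024)
print(p.total_bits())
```

Setting m=0 gives an ordinary n-bit predictor with no correlation, which makes it easy to compare schemes with the same total number of bits.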

  14. Figure 3.15. Comparison of 2-bit Predictors

  15. Tournament Predictors (p. 206)
      • Adaptively combine local and global predictors.
      • The Alpha 21264 has a tournament predictor that uses 4K 2-bit counters, indexed by the local branch address, to choose between a global predictor and a local predictor.
      • The global predictor also has 4K entries and is indexed by the history of the last 12 branches; each entry is a standard 2-bit predictor.
      • The local predictor is a 2-level predictor. The top level is a local history table of 1024 10-bit entries. Each entry is used to index a table of 1K 3-bit saturating counters, which provides the local prediction.
      • Total = 29K bits. For the SPECfp95 benchmarks, there is less than 1 misprediction per 1000 completed instructions.
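The 29K total is the sum of the four tables described above. A short check, with variable names chosen here for illustration:

```python
K = 1024

chooser_bits = 4 * K * 2         # 4K 2-bit counters that pick a predictor
global_bits = 4 * K * 2          # 4K 2-bit predictors, indexed by 12 history bits
local_history_bits = 1024 * 10   # 1024 entries of 10-bit local history
local_counter_bits = 1 * K * 3   # 1K 3-bit saturating counters

total_bits = (chooser_bits + global_bits
              + local_history_bits + local_counter_bits)
print(total_bits // K, "K bits")
```

The table sizes are consistent with their indices: 12 bits of global history address 2^12 = 4K entries, and a 10-bit local history addresses 2^10 = 1K counters.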

  16. Figure 3.16. State Transition Diagram for Tournament Predictor
      • The counter is incremented whenever the “predicted” predictor is correct and the other predictor is incorrect, and it is decremented in the reverse situation.
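The update rule can be sketched as a 2-bit saturating counter. The encoding is an assumption made for illustration: values 2 and 3 select predictor 1, values 0 and 1 select predictor 2, and the counter moves toward whichever predictor was right when the two disagree.

```python
def update_chooser(counter, p1_correct, p2_correct):
    """Return the new 2-bit chooser value after one branch resolves."""
    if p1_correct and not p2_correct:
        return min(counter + 1, 3)    # move toward predictor 1
    if p2_correct and not p1_correct:
        return max(counter - 1, 0)    # move toward predictor 2
    return counter                    # both right or both wrong: no change

def chosen(counter):
    """Which predictor the chooser currently selects (assumed encoding)."""
    return 1 if counter >= 2 else 2

c = 3                                 # strongly prefers predictor 1
c = update_chooser(c, False, True)    # predictor 2 alone was right
print(c, chosen(c))
c = update_chooser(c, False, True)    # and again
print(c, chosen(c))
```

As with a 2-bit branch predictor, it takes two consecutive disagreements in favor of the other predictor to switch away from a strongly preferred one.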

  17. Figure 3.17. Fraction of predictions from local predictor for a tournament predictor using SPEC89

  18. Figure 3.18. Misprediction rates for three different predictors on SPEC89 as total # of bits is increased
