
Computer Architecture: A Constructive Approach

Branch Prediction - 1

Arvind

Computer Science & Artificial Intelligence Lab.

Massachusetts Institute of Technology

http://csg.csail.mit.edu/6.S078


Control Flow Penalty

[Pipeline diagram: PC → Fetch (I-cache) → Fetch Buffer → Decode → Issue Buffer → Execute (Func. Units) → Result Buffer → Commit (Arch. State). The next fetch is started at the PC stage, but the branch is executed only at the Execute stage.]

Modern processors may have > 10 pipeline stages between the next-PC calculation and branch resolution!

How much work is lost if the pipeline doesn't follow the correct instruction flow?

~ Loop length x pipeline width (for example, roughly 10 in-flight stages in a 4-wide pipeline means on the order of 40 instructions discarded per misprediction)




Average Run-Length between Branches

Average dynamic instruction mix from SPEC92:

            SPECint92   SPECfp92
ALU           39 %        13 %
FPU Add        -          20 %
FPU Mult       -          13 %
load          26 %        23 %
store          9 %         9 %
branch        16 %         8 %
other         10 %        12 %

SPECint92: compress, eqntott, espresso, gcc, li

SPECfp92: doduc, ear, hydro2d, mdljdp2, su2cor

What is the average run-length between branches? (Roughly the reciprocal of the branch frequency: about 1/0.16 ≈ 6 instructions for SPECint92 and about 1/0.08 ≈ 12 for SPECfp92.)




MIPS Branches and Jumps

Each instruction fetch depends on one or two pieces of information from the preceding instruction:

1. Is the preceding instruction a taken branch?

2. If so, what is the target address?

Instruction   Taken known?         Target known?
J             After Inst. Decode   After Inst. Decode
JR            After Inst. Decode   After Reg. Fetch
BEQZ/BNEZ     After Exec           After Inst. Decode




Currently our simple pipelined architecture does very simple branch prediction

What is it?

Branch is predicted not taken: pc, pc+4, pc+8, …

Can we do better?




Branch Prediction Bits

  • Assume 2 BP bits per instruction

  • Use a saturating counter (sketched below)
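A 2-bit saturating counter can be captured as a small pure function. This is a minimal sketch, not code from the lecture; the names BPBits, predictTaken, and updateCounter are assumptions, written in Bluespec like the rest of the course code:

typedef Bit#(2) BPBits;

// States 0 and 1 predict not-taken; states 2 and 3 predict taken.
function Bool predictTaken(BPBits c) = (c >= 2);

// Move one step toward the observed outcome, saturating at 0 and 3.
function BPBits updateCounter(BPBits c, Bool taken);
  if (taken) return (c == 3) ? 3 : c + 1;
  else       return (c == 0) ? 0 : c - 1;
endfunction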



Branch History Table (BHT)

[Diagram: the Fetch PC (with the low 2 bits dropped) supplies a k-bit BHT index into a 2^k-entry BHT with 2 bits per entry, accessed in parallel with the I-Cache. After decode, the opcode identifies whether the instruction is a branch, the offset is added to the PC to form the target PC, and the BHT bits supply the taken/not-taken prediction.]

4K-entry BHT, 2 bits/entry, ~80-90% correct predictions
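A minimal BHT sketch in Bluespec, assuming the same 4K-entry, 2-bits-per-entry organization. This is not code from the lecture; the DirectionPredictor interface, mkBHT, and Addr = Bit#(32) are assumptions:

import RegFile::*;

typedef Bit#(32) Addr;
typedef Bit#(12) BhtIndex;          // 2^12 = 4K entries

interface DirectionPredictor;
  method Bool   predTaken(Addr pc);
  method Action update(Addr pc, Bool taken);
endinterface

module mkBHT(DirectionPredictor);
  RegFile#(BhtIndex, Bit#(2)) bht <- mkRegFileFull;

  function BhtIndex idx(Addr pc) = truncate(pc >> 2);   // drop the byte offset

  method Bool predTaken(Addr pc);
    return bht.sub(idx(pc)) >= 2;                       // 2,3 => predict taken
  endmethod

  method Action update(Addr pc, Bool taken);
    let c = bht.sub(idx(pc));
    bht.upd(idx(pc), taken ? ((c == 3) ? 3 : c + 1)
                           : ((c == 0) ? 0 : c - 1));    // saturating 2-bit counter
  endmethod
endmodule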




Where does BHT fit in the processor pipeline?

  • BHT can only be used after instruction decode

  • What should we do at the fetch stage?

  • Need a mechanism to update the BHT

    • where does the update information come from? (one possibility is sketched below)
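The slides defer the full answer to the next lecture. One common arrangement, sketched here under assumed names (bhtTrain, bpc) and using the mkBHT sketch above, is to carry each resolved branch's (pc, taken) outcome from Execute back through a FIFO and train the BHT with it:

import FIFOF::*;

// Inside the processor module: Execute enqueues resolved branch outcomes,
// and a separate rule drains them into the direction predictor.
FIFOF#(Tuple2#(Addr, Bool)) bhtTrain <- mkFIFOF;
DirectionPredictor          bht      <- mkBHT;

rule trainBHT;
  match {.bpc, .taken} = bhtTrain.first;
  bhtTrain.deq;
  bht.update(bpc, taken);
endrule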



Overview of branch prediction

Best predictors reflect program behavior

[Diagram: the PC feeds a Next Addr Pred, followed by the Decode, RegRead, and Execute stages. The next PC is needed immediately, so the Next Addr Pred sits in a tight loop with the PC; the instruction type and PC-relative targets become available after Decode, simple conditions and register targets after RegRead, and complex conditions after Execute, each correcting the front end through a progressively looser loop (BP, JMP, Ret redirects).]



Next Address Predictor (NAP): first attempt

[Diagram: k bits of the PC index a 2^k-entry Branch Target Buffer, accessed in parallel with iMem; each entry holds a predicted target and BP bits (BPb), which come out alongside the fetched instruction.]

BP bits are stored with the predicted target address.

IF stage: nPC = if (BP = taken) then target else pc+4

Later: check the prediction; if it was wrong, kill the instruction and update both the BTB and the BPb, otherwise just update the BPb.



Address Collisions

Assume a 128-entry NAP.

Instruction memory:
  132   Jump 100
  1028  Add .....

NAP entry indexed by both 132 and 1028:  target = 236, BPb = take

What will be fetched after the instruction at 1028?

  NAP prediction = 236
  Correct target = 1032

  ⇒ kill PC=236 and fetch PC=1032

Is this a common occurrence?

Can we avoid these bubbles?



Use NAP for Control Instructions only

The NAP contains useful information for branch and jump instructions only, so do not update it for other instructions.

For all other instructions the next PC is (PC)+4!

How can we achieve this effect without decoding the instruction?



Branch Target Buffer (BTB): a special form of NAP

[Diagram: k bits of the PC index the BTB in parallel with the I-Cache access. Each entry holds a valid bit, the entry PC, and a predicted target PC; the fetch PC is compared (=) against the stored entry PC, and the predicted target is used only on a valid match.]

2^k-entry direct-mapped BTB

  • Keep the (pc, predicted pc) in the BTB

  • pc+4 is predicted if no pc match is found

  • BTB is updated only for branches and jumps

Permits nextPC to be determined before instruction is decoded



Consulting BTB Before Decoding

Instruction memory:
  132   Jump 100
  1028  Add .....

BTB entry:  entry PC = 132,  target = 236,  BPb = take

  • The match for pc = 1028 fails, so 1028+4 is fetched

    •  eliminates false predictions after ALU instructions

  • BTB contains entries only for control transfer instructions

    • more room to store branch targets

Even very small BTBs are very effective




Observations

There is a plethora of branch prediction schemes; their importance grows with the depth of the processor pipeline

Processors often use more than one prediction scheme

It is usually easy to understand the data structures required to implement a particular scheme

It takes considerably more effort to understand how a particular scheme with its lookup and updates is integrated in the pipeline and how various schemes interact with each other



Plan

  • We will first revisit the simple two-stage pipeline without branch prediction
  • We will then integrate a simple BTB scheme into this 2-stage pipeline
  • We will extend the design to a multistage pipeline and integrate at least one more predictor, say a BHT, into the pipeline (next lecture)



Decoupled Fetch and Execute

[Diagram: Fetch and Execute connected by two FIFOs. The ir FIFO carries <instruction, pc, epoch> from Fetch to Execute; the nextPC FIFO carries the <updated pc> from Execute back to Fetch.]

Fetch sends instructions to Execute along with pc and other control information

Execute sends information about the target pc to Fetch, which updates pc and other control registers whenever it looks at the nextPC FIFO




A solution using epoch

  • Add fEpoch and eEpoch registers to the processor state; initialize them to the same value

  • The epoch changes whenever Execute determines that the pc prediction is wrong. This change is reflected immediately in eEpoch and eventually in fEpoch via nextPC FIFO

  • Associate the fEpoch with every instruction when it is fetched

  • In the execute stage, reject, i.e., kill, the instruction if its epoch does not match eEpoch



Two-Stage Pipeline: A Robust Two-Rule Solution

[Diagram: the PC register and fEpoch sit on the Fetch side with the Inst Memory and the +4 increment; Fetch enqueues into the ir Pipeline FIFO, which Decode/Execute dequeues on the other side, where eEpoch, the Register File, and the Data Memory live. Execute sends redirected PCs back to Fetch through the nextPC Bypass FIFO.]

Either FIFO can be a normal (>1 element) FIFO



Two-stage pipeline: Decoupled

module mkProc(Proc);
  Reg#(Addr)   pc     <- mkRegU;
  RFile        rf     <- mkRFile;
  IMemory      iMem   <- mkIMemory;
  DMemory      dMem   <- mkDMemory;
  PipeReg#(TypeFetch2Decode) ir <- mkPipeReg;
  Reg#(Bool)   fEpoch <- mkReg(False);
  Reg#(Bool)   eEpoch <- mkReg(False);
  FIFOF#(Addr) nextPC <- mkBypassFIFOF;

  rule doFetch (ir.notFull);                 // explicit guard
    let inst = iMem(pc);
    ir.enq(TypeFetch2Decode {pc: pc, epoch: fEpoch, inst: inst});
    if (nextPC.notEmpty) begin
      pc <= nextPC.first; fEpoch <= !fEpoch; nextPC.deq;
    end
    else pc <= pc + 4;                       // simple branch prediction
  endrule



Two-stage pipeline: Decoupled (cont.)

  rule doExecute (ir.notEmpty);
    let irpc = ir.first.pc;  let inst = ir.first.inst;
    if (ir.first.epoch == eEpoch) begin
      let eInst   =  decodeExecute(irpc, inst, rf);
      let memData <- dMemAction(eInst, dMem);
      regUpdate(eInst, memData, rf);
      if (eInst.brTaken) begin
        nextPC.enq(eInst.addr);
        eEpoch <= !eEpoch;
      end
    end
    ir.deq;
  endrule
endmodule



Two-Stage pipeline with a Branch Predictor

[Diagram: as in the previous pipeline, but the Fetch side now consults a Branch Predictor to compute a predicted pc (ppc) instead of always adding 4. Fetch reads the Inst Memory and enqueues into the ir FIFO toward Decode/Execute, which uses the Register File and Data Memory; the nextPC FIFO (with fEpoch/eEpoch) carries redirects back to Fetch.]



Branch Predictor Interface

interface NextAddressPredictor;
  method Addr prediction(Addr pc);
  method Action update(Addr pc, Addr target);
endinterface
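Any module implementing this interface can be dropped into the fetch stage, so swapping predictors is a one-line change in mkProc. A small usage sketch (mkNeverTaken and mkBTB are the modules defined on the next slides):

  NextAddressPredictor bpred <- mkNeverTaken;   // always predicts pc+4
  // NextAddressPredictor bpred <- mkBTB;       // ...or swap in the branch target buffer
  let ppc = bpred.prediction(pc);               // consulted each cycle in doFetch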




Null Branch Prediction

  • Replaces PC+4 with …

    • Already implemented in the pipeline

  • Right most of the time

    • Why?

module mkNeverTaken(NextAddressPredictor);
  method Addr prediction(Addr pc);
    return pc+4;
  endmethod
  method Action update(Addr pc, Addr target);
    noAction;
  endmethod
endmodule



Branch Target Prediction (BTB)

module mkBTB(NextAddressPredictor);
  RegFile#(LineIdx, Addr) tagArr    <- mkRegFileFull;
  RegFile#(LineIdx, Addr) targetArr <- mkRegFileFull;

  method Addr prediction(Addr pc);
    LineIdx index = truncate(pc >> 2);
    let tag    = tagArr.sub(index);
    let target = targetArr.sub(index);
    if (tag == pc) return target; else return (pc+4);
  endmethod

  method Action update(Addr pc, Addr target);
    LineIdx index = truncate(pc >> 2);
    tagArr.upd(index, pc);  targetArr.upd(index, target);
  endmethod
endmodule



Two-stage pipeline + BP

module mkProc(Proc);
  Reg#(Addr)   pc     <- mkRegU;
  RFile        rf     <- mkRFile;
  IMemory      iMem   <- mkIMemory;
  DMemory      dMem   <- mkDMemory;
  PipeReg#(TypeFetch2Decode) ir <- mkPipeReg;
  Reg#(Bool)   fEpoch <- mkReg(False);
  Reg#(Bool)   eEpoch <- mkReg(False);
  FIFOF#(Tuple2#(Addr,Addr)) nextPC <- mkBypassFIFOF;
  NextAddressPredictor bpred <- mkNeverTaken;   // "some target predictor" could be substituted here

The definition of TypeFetch2Decode is changed to include the predicted pc:

typedef struct {
  Addr pc; Addr ppc; Bool epoch; Data inst;
} TypeFetch2Decode deriving (Bits, Eq);



Two-stage pipeline + BP: Fetch rule

  rule doFetch (ir.notFull);
    let ppc  = bpred.prediction(pc);
    let inst = iMem(pc);
    ir.enq(TypeFetch2Decode {pc: pc, ppc: ppc, epoch: fEpoch, inst: inst});
    if (nextPC.notEmpty) begin
      match {.ipc, .ippc} = nextPC.first;
      pc <= ippc;  fEpoch <= !fEpoch;  nextPC.deq;
      bpred.update(ipc, ippc);
    end
    else pc <= ppc;
  endrule



Two-stage pipeline + BP: Execute rule

  rule doExecute (ir.notEmpty);
    let irpc  = ir.first.pc;  let inst = ir.first.inst;
    let irppc = ir.first.ppc;
    if (ir.first.epoch == eEpoch) begin
      let eInst   =  decodeExecute(irpc, irppc, inst, rf);
      let memData <- dMemAction(eInst, dMem);
      regUpdate(eInst, memData, rf);
      if (eInst.missPrediction) begin
        nextPC.enq(tuple2(irpc, eInst.brTaken ? eInst.addr : irpc+4));
        eEpoch <= !eEpoch;
      end
    end
    ir.deq;
  endrule
endmodule



Execute Function

function ExecInst exec(DecodedInst dInst, Data rVal1,
                       Data rVal2, Addr pc, Addr ppc);
  ExecInst einst = ?;
  let aluVal2 = dInst.immValid ? dInst.imm : rVal2;
  let aluRes  = alu(rVal1, aluVal2, dInst.aluFunc);
  let brAddr  = brAddrCal(pc, rVal1, dInst.iType, dInst.imm);
  einst.itype          = dInst.iType;
  einst.addr           = memType(dInst.iType) ? aluRes : brAddr;
  einst.data           = dInst.iType == St ? rVal2 : aluRes;
  einst.brTaken        = aluBr(rVal1, aluVal2, dInst.brComp);
  einst.missPrediction = einst.brTaken ? brAddr != ppc : (pc+4) != ppc;
  einst.rDst           = dInst.rDst;
  return einst;
endfunction


