slide1 n.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
Chapter 4: The Processor (part A) PowerPoint Presentation
Download Presentation
Chapter 4: The Processor (part A)

Loading in 2 Seconds...

play fullscreen
1 / 58

Chapter 4: The Processor (part A) - PowerPoint PPT Presentation


  • 91 Views
  • Uploaded on

Note: about two thirds of slides in this file are adopted from Mary Jane Irwin’s work: ( www.cse.psu.edu/~mji ). Chapter 4: The Processor (part A). System Bus Model Revisited. System Components CPU, Memory, I/O devices CPU Components Datapath and control unit Datapath Components

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Chapter 4: The Processor (part A)' - cassady-beasley


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
slide1

Note: about two thirds of slides in this fileare adopted from Mary Jane Irwin’s work: ( www.cse.psu.edu/~mji )

Chapter 4: The Processor

(part A)

system bus model revisited
System Bus Model Revisited
  • System Components
    • CPU, Memory, I/O devices
  • CPU Components
    • Datapath and control unit
  • Datapath Components
    • ALU and registers
  • System Interconnections
    • Data Bus, Address Bus, and Control Bus
  • Program Execution
    • Instruction Fetch-Execute Cycle
instruction execution cycle
Instruction Execution Cycle
  • A Program
    • consists of a sequence of instructions.
    • For each instruction, an arithmetic or logic operation, or data movement is performed.
  • Instruction Execution Cycle (Fetch and execute)
    • Program Counter (PC) holds address of the next instruction to fetch.
    • Processor fetches the instruction from the memory location pointed to by PC.
    • Increment PC by 4, or told otherwise such as increment PC by 4 and then be added by the address offset, e.g., jump instruction.
    • Processor interprets instruction and performs required actions.
instruction fetch execution cycle for mips
Instruction Fetch-Execution Cycle for MIPS

1.IF 2.ID 3.EX 4.MEM 5.WB

  • Execution and completion of a MIPS instruction usually includes all or most of the following five steps:
    • IF: Instruction fetching
    • ID: Instruction decoding / register read
    • EX: Arithmetic/logical operation execution / address calculation
    • MEM: Memory access
    • WB: Write back
a simplified mips instruction set
A Simplified MIPS Instruction Set
  • A simplified instruction set of only 9 instructions:
    • lw,
    • sw,
    • add,
    • sub,
    • and,
    • or,
    • slt,
    • beq,
    • j
implementation of a mips processor i
Implementation of a MIPS Processor (I)
  • Now we're ready to implement the simplified MIPS processor
  • We start with an implementation of the datapath, and then followed by the control unit.
  • For datapath, we start with individual components, then consider the connections between them.
implementation of a mips processor i1
Implementation of a MIPS Processor (I)
  • Hardware components involved in each of steps:
  • IF: Instruction fetching
    • Instruction memory, PC, IR, adder
  • ID: Instruction decoding / register read
    • IR, registers, control unit
  • EX: Arithmetic/logical operation execution
    • ALU: arithmetic & logic unit, control unit
  • MEM: Memory access
    • Data memory, control unit
  • WB: Write back
    • Registers, multiplexers, control unit
implementation of a mips processor ii
Implementation of a MIPS Processor (II)
  • Individual component can be implemented using either combinational logic or timing logic (refer to your digital logical design course):
    • Combinational circuits design for
      • ALU
      • Adder
      • Multiplexer
      • Control unit
    • Sequential logic design for
      • PC (program counter), IR (instruction register)
      • Registers
      • Instruction memory
      • Data memory
implementation of a mips processor iii
Implementation of a MIPS Processor (III)
  • Some components can be readily designed and implemented:
    • Adder
    • Multiplexer
    • Registers
    • Instruction memory
    • Data memory
  • For other components, we need to know the instruction set before a truth table can be created for the component.
    • ALU, Control unit, etc.
review mips risc design principles
Review: MIPS (RISC) Design Principles
  • Simplicity favors regularity
    • fixed size instructions
    • small number of instruction formats
    • opcode always the first 6 bits
  • Smaller is faster
    • limited instruction set
    • limited number of registers in register file
    • limited number of addressing modes
  • Make the common case fast
    • arithmetic operands from the register file (load-store machine)
    • allow instructions to contain immediate operands
  • Good design demands good compromises
    • three instruction formats
the processor datapath control

Fetch

PC = PC+4

Exec

Decode

The Processor: Datapath & Control
  • Highly simplified MIPS instruction set
    • memory-reference instructions: lw, sw
    • arithmetic-logical instructions: add, sub, and, or, slt
    • control flow instructions: beq, j
  • Generic implementation
    • use the program counter (PC) to supply the instruction address and fetch the instruction from memory (and update the PC)
    • decode the instruction (and read registers)
    • execute the instruction
  • All instructions (except j) use the ALU after reading the registers

How? memory-reference? arithmetic? control flow?

aside clocking methodologies
Aside: Clocking Methodologies
  • The clocking methodology defines when data in a state element is valid and stable relative to the clock
    • State elements - a memory element such as a register
    • Edge-triggered – all state changes occur on a clock edge
  • Typical execution
    • read contents of state elements -> send values through combinational logic -> write results to one or more state elements

State

element

1

State

element

2

Combinational

logic

clock

one clock cycle

  • Assumes state elements are written on every clock cycle; if not, need explicit write control signal
    • write occurs only when both the write control is asserted and the clock edge occurs
implementation of a mips processor datapath i
Implementation of a MIPS Processor Datapath (I)
  • Now we look at constructing a datapath step by step for the five steps:
    • Instruction fetching
    • Instruction decoding / register read
    • Arithmetic/logical operation execution / address calculation
    • Memory access
    • Write back
fetching instructions

Add

4

Fetch

PC = PC+4

Instruction

Memory

Exec

Decode

Read

Address

clock

PC

Instruction

Fetching Instructions
  • Fetching instructions involves
    • reading the instruction from the Instruction Memory
    • updating the PC value to be the address of the next (sequential) instruction
  • PC is updated every clock cycle, so it does not need an explicit write control signal just a clock signal
  • Reading from the Instruction Memory is a combinational activity, so it doesn’t need an explicit read control signal
decoding instructions

Fetch

PC = PC+4

Exec

Decode

Read Addr 1

Read

Data 1

Register

File

Read Addr 2

Write Addr

Read

Data 2

Write Data

Decoding Instructions
  • Decoding instructions involves
    • sending the fetched instruction’s opcode and function field bits to the control unit

and

Control

Unit

Instruction

  • reading two values from the Register File
    • Register File addresses are contained in the instruction
executing r format operations

31

25

20

15

10

5

0

R-type:

op

rs

rt

rd

shamt

funct

RegWrite

ALU control

Fetch

PC = PC+4

Read Addr 1

Read

Data 1

Register

File

Read Addr 2

overflow

Instruction

zero

Exec

Decode

ALU

Write Addr

Read

Data 2

Write Data

Executing R Format Operations
  • R format operations (add, sub, slt, and, or)
    • perform operation (op and funct) on values in rs and rt
    • store the result back into the Register File (into location rd)
  • Note that Register File is not written every cycle (e.g. sw), so we need an explicit write control signal for the Register File
executing load and store operations

RegWrite

ALU control

MemWrite

overflow

zero

Read Addr 1

Read

Data 1

Address

Register

File

Read Addr 2

Instruction

Data

Memory

Read Data

ALU

Write Addr

Read

Data 2

Write Data

Write Data

MemRead

Sign

Extend

16

32

Executing Load and Store Operations
  • Load and store operations involves
    • compute memory address by adding the base register (read from the Register File during decode) to the 16-bit signed-extended offset field in the instruction
    • store value (read from the Register File during decode) written to the Data Memory
    • load value, read from the Data Memory, written to the Register File
executing branch operations
Executing Branch Operations
  • Branch operations involves
    • compare the operands read from the Register File during decode for equality (zero ALU output)
    • compute the branch target address by adding the updated PC to the 16-bit signed-extended offset field in the instr

Branch

target

address

Add

Add

4

Shift

left 2

ALU control

PC

zero

(to branch control logic)

Read Addr 1

Read

Data 1

Register

File

Read Addr 2

Instruction

ALU

Write Addr

Read

Data 2

Write Data

Sign

Extend

16

32

executing jump operations
Executing Jump Operations
  • Jump operation involves
    • replace the lower 28 bits of the PC with the lower 26 bits of the fetched instruction shifted left by 2 bits

Add

4

4

Jump

address

Instruction

Memory

Shift

left 2

28

Read

Address

PC

Instruction

26

creating a single datapath from the parts
Creating a Single Datapath from the Parts
  • Assemble the datapath segments and add control lines and multiplexors as needed
  • Single cycle design – fetch, decode and execute each instructions in one clock cycle
    • no datapath resource can be used more than once per instruction, so some must be duplicated (e.g., separate Instruction Memory and Data Memory, several adders)
    • multiplexors needed at the input of shared elements with control lines to do the selection
    • write signals to control writing to the Register File and Data Memory
  • Cycle time is determined by length of the longest path
fetch r and memory access portions

Add

RegWrite

ALUSrc

ALU control

MemWrite

MemtoReg

4

ovf

zero

Read Addr 1

Instruction

Memory

Read

Data 1

Address

Register

File

Read Addr 2

Data

Memory

Read

Address

PC

Instruction

Read Data

ALU

Write Addr

Read

Data 2

Write Data

Write Data

MemRead

Sign

Extend

16

32

Fetch, R, and Memory Access Portions
adding the control

31

25

0

J-type:

op

target address

Adding the Control
  • Selecting the operations to perform (ALU, Register File and Memory read/write)
  • Controlling the flow of data (multiplexor inputs)

31

25

20

15

10

5

0

R-type:

op

rs

rt

rd

shamt

funct

31

25

20

15

0

  • Observations
    • op field always in bits 31-26
    • addr of registers to be read are always specified by the rs field (bits 25-21) and rt field (bits 20-16); for lw and swrs is the base register
    • addr. of register to be written is in one of two places – in rt (bits 20-16) for lw; in rd (bits 15-11) for R-type instructions
    • offset for beq, lw, and swalways in bits 15-0

I-Type:

address offset

op

rs

rt

single cycle datapath with control unit
Single Cycle Datapath with Control Unit

0

Add

Add

1

4

Shift

left 2

PCSrc

ALUOp

Branch

MemRead

Instr[31-26]

Control

Unit

MemtoReg

MemWrite

ALUSrc

RegWrite

RegDst

ovf

Instr[25-21]

Read Addr 1

Instruction

Memory

Read

Data 1

Address

Register

File

Instr[20-16]

zero

Read Addr 2

Data

Memory

Read

Address

PC

Instr[31-0]

0

Read Data

1

ALU

Write Addr

Read

Data 2

0

1

Write Data

0

Instr[15 -11]

Write Data

1

Instr[15-0]

Sign

Extend

ALU

control

16

32

Instr[5-0]

r type instruction data control flow
R-type Instruction Data/Control Flow

0

Add

Add

1

4

Shift

left 2

PCSrc

ALUOp

Branch

MemRead

Instr[31-26]

Control

Unit

MemtoReg

MemWrite

ALUSrc

RegWrite

RegDst

ovf

Instr[25-21]

Read Addr 1

Instruction

Memory

Read

Data 1

Address

Register

File

Instr[20-16]

zero

Read Addr 2

Data

Memory

Read

Address

PC

Instr[31-0]

0

Read Data

1

ALU

Write Addr

Read

Data 2

0

1

Write Data

0

Instr[15 -11]

Write Data

1

Instr[15-0]

Sign

Extend

ALU

control

16

32

Instr[5-0]

load word instruction data control flow
Load Word Instruction Data/Control Flow

0

Add

Add

1

4

Shift

left 2

PCSrc

ALUOp

Branch

MemRead

Instr[31-26]

Control

Unit

MemtoReg

MemWrite

ALUSrc

RegWrite

RegDst

ovf

Instr[25-21]

Read Addr 1

Instruction

Memory

Read

Data 1

Address

Register

File

Instr[20-16]

zero

Read Addr 2

Data

Memory

Read

Address

PC

Instr[31-0]

0

Read Data

1

ALU

Write Addr

Read

Data 2

0

1

Write Data

0

Instr[15 -11]

Write Data

1

Instr[15-0]

Sign

Extend

ALU

control

16

32

Instr[5-0]

branch instruction data control flow
Branch Instruction Data/Control Flow

0

Add

Add

1

4

Shift

left 2

PCSrc

ALUOp

Branch

MemRead

Instr[31-26]

Control

Unit

MemtoReg

MemWrite

ALUSrc

RegWrite

RegDst

ovf

Instr[25-21]

Read Addr 1

Instruction

Memory

Read

Data 1

Address

Register

File

Instr[20-16]

zero

Read Addr 2

Data

Memory

Read

Address

PC

Instr[31-0]

0

Read Data

1

ALU

Write Addr

Read

Data 2

0

1

Write Data

0

Instr[15 -11]

Write Data

1

Instr[15-0]

Sign

Extend

ALU

control

16

32

Instr[5-0]

adding the jump operation
Adding the Jump Operation

Instr[25-0]

1

Shift

left 2

28

32

26

0

PC+4[31-28]

0

Add

Add

1

4

Shift

left 2

PCSrc

Jump

ALUOp

Branch

MemRead

Instr[31-26]

Control

Unit

MemtoReg

MemWrite

ALUSrc

RegWrite

RegDst

ovf

Instr[25-21]

Read Addr 1

Instruction

Memory

Read

Data 1

Address

Register

File

Instr[20-16]

zero

Read Addr 2

Data

Memory

Read

Address

PC

Instr[31-0]

0

Read Data

1

ALU

Write Addr

Read

Data 2

0

1

Write Data

0

Instr[15 -11]

Write Data

1

Instr[15-0]

Sign

Extend

ALU

control

16

32

Instr[5-0]

fig 4 24 p 329
Fig.4.24 (p.329)

Instr[25-0]

1

Shift

left 2

28

32

26

0

PC+4[31-28]

0

Add

Add

1

4

Shift

left 2

PCSrc

Jump

ALUOp

Branch

MemRead

Instr[31-26]

Control

Unit

MemtoReg

MemWrite

ALUSrc

RegWrite

RegDst

ovf

Instr[25-21]

Read Addr 1

Instruction

Memory

Read

Data 1

Address

Register

File

Instr[20-16]

zero

Read Addr 2

Data

Memory

Read

Address

PC

Instr[31-0]

0

Read Data

1

ALU

Write Addr

Read

Data 2

0

1

Write Data

0

Instr[15 -11]

Write Data

1

Instr[15-0]

Sign

Extend

ALU

control

16

32

Instr[5-0]

implementing the control units
Implementing the Control Units
  • The control unit includes mainly
    • ALU control unit
    • Main control unit
  • Using only combinational circuits (simple!)
  • Inputs are from each instruction’s
    • op-code field (6 bits [31:26]) for all instructions, and
    • funct field (6 bits [5:0]) for R-type instructions.
  • Outputs are the control lines to control
    • ALU,
    • Multiplexers,
    • registers, and
    • memory
implementing alu control unit i
Implementing ALU Control Unit (I)
  • Inputs: 6-bit function code, plus 2-bit ALUOp from main control unit
  • Outputs: 4-bit ALU control lines used to decide which operation ALU performs
implementing alu control unit ii
Implementing ALU Control Unit (II)
  • Unit Inputs: ALUOp code (2 bits) and Funct field (6 bits)
    • ALUOp code is generated at the main control unit
  • Unit Outputs: ALU control lines (4 bits)
  • The circuit for ALU control unit
    • Obtained through combinational digital logic design method
implementing main control unit i
Implementing Main Control Unit (I)
  • Inputs are 6-bit op-code from all instructions
    • op-code field (6 bits [31:26])
  • Outputs (10-bit) are the control lines to control
    • Memory (2 bits)
      • MemRead, MemWrite
    • Multiplexers (5 bits)
      • RegDst, (Jump), Branch, MemtoReg, ALUSrc,
    • Registers (1 bit)
      • RegWrite
    • and 2-bit ALUOp (2 bits)
      • ALUOp0, ALUOp1
implementing main control unit ii
Implementing Main Control Unit (II)

The Outputs of Main Control Unit

implementing main control unit iii
Implementing Main Control Unit (III)

Truth table for main control unit:

implementing main control unit iv
Implementing Main Control Unit (IV)

The circuit for main control unit:

handle the jump instruction
Handle the Jump Instruction
  • For jump instruction, the target address can be formed with the concatenation of
    • The upper 4 bits of [PC]+4
    • The 26-bit immediate field of the jump instruction
    • The bits 00
  • For main control unit, add an output control signal Jump, which is “1” when the 6-bit op-code matches that of instruction j.
instruction critical paths
Instruction Critical Paths
  • What is the clock cycle time assuming negligible delays for muxes, control unit, sign extend, PC access, shift left 2, wires, setup and hold times except:
    • Instruction and Data Memory (200 ps)
    • ALU and adders (200 ps)
    • Register File access (reads or writes) (100 ps)
single cycle disadvantages advantages

Cycle 1

Cycle 2

Clk

lw

sw

Waste

Single Cycle Disadvantages & Advantages
  • One instruction is completed in one single cycle
  • Cycle time has to be chosen as the max time delay
    • i.e., 800 ns
  • Uses the clock cycle inefficiently – the clock cycle must be timed to accommodate the slowest instruction
    • especially problematic for more complex instructions like floating point multiple cycle

but

  • Is simple and easy to understand
how can we make it faster
How Can We Make It Faster?
  • Start fetching and executing the next instruction before the current one has completed
    • Pipelining – (all?) modern processors are pipelined for performance
    • Remember the performance equation: CPU time = CPI * CC * IC
  • Under ideal conditions and with a large number of instructions, the speedup from pipelining is approximately equal to the number of pipe stages
    • A five stage pipeline is nearly five times faster because the CC is nearly five times faster
  • Fetch (and execute) more than one instruction at a time
    • Superscalar processing – stay tuned
the five stages of load instruction

IFetch

Dec

Exec

Mem

WB

The Five Stages of Load Instruction
  • IFetch: Instruction Fetch and Update PC
  • Dec: Registers Fetch and Instruction Decode
  • Exec: Execute R-type; calculate memory address
  • Mem: Read/write the data from/to the Data Memory
  • WB: Write the result data into the register file

Cycle 1

Cycle 2

Cycle 3

Cycle 4

Cycle 5

lw

a pipelined mips processor

IFetch

IFetch

IFetch

Exec

Exec

Exec

Mem

Mem

Mem

WB

WB

WB

A Pipelined MIPS Processor
  • Start the next instruction before the current one has completed
    • improves throughput - total amount of work done in a given time
    • instruction latency (execution time, delay time, response time - time from the start of an instruction to its completion) is not reduced

Cycle 1

Cycle 2

Cycle 3

Cycle 4

Cycle 5

Cycle 6

Cycle 7

Cycle 8

Dec

lw

Dec

sw

Dec

R-type

  • clock cycle (pipeline stage time) is limited by the sloweststage
    • for some stages don’t need the whole clock cycle (e.g., WB)
    • for some instructions, some stages are wastedcycles (i.e., nothing is done during that cycle for that instruction)
single cycle versus pipeline

Single Cycle Implementation (CC = 800 ps):

Cycle 1

Cycle 2

Clk

lw

sw

Waste

Pipeline Implementation (CC = 200 ps):

IFetch

Dec

Exec

Mem

WB

lw

IFetch

Dec

Exec

Mem

WB

sw

IFetch

Dec

Exec

Mem

WB

R-type

Single Cycle versus Pipeline

400 ps

  • To complete an entire instruction in the pipelined case takes 1000 ps (as compared to 800 ps for the single cycle case). Why ?
  • How long does each take to complete 1,000,000 adds ?
pipelining the mips isa
Pipelining the MIPS ISA
  • What makes it easy
    • all instructions are the same length (32 bits)
      • can fetch in the 1st stage and decode in the 2nd stage
    • few instruction formats (three) with symmetry across formats
      • can begin reading register file in 2nd stage
    • memory operations occur only in loads and stores
      • can use the execute stage to calculate memory addresses
    • each instruction writes at most one result (i.e., changes the machine state) and does it in the last few pipeline stages (MEM or WB)
    • operands must be aligned in memory so a single data transfer takes only one data memory access
mips pipeline datapath additions mods

IF/ID

ID/EX

EX/MEM

Add

Add

MEM/WB

4

Shift

left 2

Read Addr 1

Instruction

Memory

Data

Memory

Register

File

Read

Data 1

Read Addr 2

Read

Address

PC

Read

Data

Address

Write Addr

ALU

Read

Data 2

Write Data

Write Data

Sign

Extend

16

32

MIPS Pipeline DatapathAdditions/Mods
  • State registers between each pipeline stage to isolate them

IF:IFetch

ID:Dec

EX:Execute

MEM:

MemAccess

WB:

WriteBack

System Clock

mips pipeline control path modifications
MIPS Pipeline Control Path Modifications
  • All control signals can be determined during Decode
    • and held in the state registers between pipeline stages

PCSrc

ID/EX

EX/MEM

Control

IF/ID

Add

MEM/WB

Branch

Add

4

RegWrite

Shift

left 2

Read Addr 1

Instruction

Memory

Data

Memory

Register

File

Read

Data 1

Read Addr 2

MemtoReg

Read

Address

ALUSrc

PC

Read

Data

Address

Write Addr

ALU

Read

Data 2

Write Data

Write Data

ALU

cntrl

MemRead

Sign

Extend

16

32

ALUOp

RegDst

pipeline control
Pipeline Control
  • IF Stage: read Instr Memory (always asserted) and write PC (on System Clock)
  • ID Stage: no optional control signals to set
graphically representing mips pipeline

DM

Reg

Reg

IM

ALU

Graphically Representing MIPS Pipeline
  • Can help with answering questions like:
    • How many cycles does it take to execute this code?
    • What is the ALU doing during cycle 4?
    • Is there a hazard, why does it occur, and how can it be fixed?
why pipeline for performance

DM

DM

DM

DM

DM

Reg

Reg

Reg

Reg

Reg

Reg

Reg

Reg

Reg

Reg

IM

IM

IM

IM

IM

ALU

ALU

ALU

ALU

ALU

Time to fill the pipeline

Why Pipeline? For Performance!

Time (clock cycles)

Once the pipeline is full, one instruction is completed every cycle, so CPI = 1

Inst 0

I

n

s

t

r.

O

r

d

e

r

Inst 1

Inst 2

Inst 3

Inst 4

can pipelining get us into trouble
Can Pipelining Get Us Into Trouble?
  • Yes:Pipeline Hazards
    • structural hazards: attempt to use the same resource by two different instructions at the same time
    • data hazards: attempt to use data before it is ready
      • An instruction’s source operand(s) are produced by a prior instruction still in the pipeline
    • control hazards: attempt to make a decision about program control flow before the condition has been evaluated and the new PC target address calculated
      • branch and jump instructions, exceptions
  • Can usually resolve hazards by waiting
    • pipeline control must detect the hazard
    • and take action to resolve hazards
a single memory would be a structural hazard

Reading data from memory

Mem

Mem

Mem

Mem

Mem

Reg

Reg

Reg

Reg

Reg

Reg

Reg

Reg

Reg

Reg

Mem

Mem

Mem

Mem

Mem

ALU

ALU

ALU

ALU

ALU

Reading instruction from memory

A Single Memory Would Be a Structural Hazard

Time (clock cycles)

lw

I

n

s

t

r.

O

r

d

e

r

Inst 1

Inst 2

Inst 3

Inst 4

  • Fix with separate instr and data memories (I$ and D$)
how about register file access

DM

DM

DM

DM

Reg

Reg

Reg

Reg

Reg

Reg

Reg

Reg

IM

IM

IM

IM

ALU

ALU

ALU

ALU

clock edge that controls loading of pipeline state registers

clock edge that controls register writing

How About Register File Access?

Time (clock cycles)

Fix register file access hazard by doing reads in the second half of the cycle and writes in the first half

add $1,

I

n

s

t

r.

O

r

d

e

r

Inst 1

Inst 2

add $2,$1,

register usage can cause data hazards

DM

DM

DM

DM

DM

Reg

Reg

Reg

Reg

Reg

Reg

Reg

Reg

Reg

Reg

IM

IM

IM

IM

IM

ALU

ALU

ALU

ALU

ALU

Register Usage Can Cause Data Hazards
  • Dependencies backward in time cause hazards

add $1,

sub $4,$1,$5

and $6,$1,$7

or $8,$1,$9

xor $4,$1,$5

  • Read before writedata hazard
loads can cause data hazards

DM

DM

DM

DM

DM

Reg

Reg

Reg

Reg

Reg

Reg

Reg

Reg

Reg

Reg

IM

IM

IM

IM

IM

ALU

ALU

ALU

ALU

ALU

Loads Can Cause Data Hazards
  • Dependencies backward in time cause hazards

lw $1,4($2)

I

n

s

t

r.

O

r

d

e

r

sub $4,$1,$5

and $6,$1,$7

or $8,$1,$9

xor $4,$1,$5

  • Load-usedata hazard
branch instructions cause control hazards

DM

DM

DM

Reg

Reg

Reg

Reg

Reg

Reg

IM

IM

IM

IM

ALU

ALU

ALU

ALU

beq

DM

Reg

Reg

Branch Instructions Cause Control Hazards
  • Dependencies backward in time cause hazards

I

n

s

t

r.

O

r

d

e

r

lw

Inst 3

Inst 4

other pipeline structures are possible

MUL

DM

Reg

Reg

IM

IM

ALU

ALU

DM2

DM1

Reg

Reg

Other Pipeline Structures Are Possible
  • What about the (slow) multiply operation?
    • Make the clock twice as slow or …
    • let it take two cycles (since it doesn’t use the DM stage)
  • What if the data memory access is twice as slow as the instruction memory?
    • make the clock twice as slow or …
    • let data memory access take two cycles (and keep the same clock rate)
other sample pipeline alternatives

Reg

EX

IM

ALU

Other Sample Pipeline Alternatives
  • ARM7
  • XScale

PC update

IM access

decode

reg

access

ALU op

DM access

shift/rotate

commit result

(write back)

Reg

DM2

IM1

DM1

IM2

Reg

SHFT

PC update

BTB access

start IM access

decode

reg 1 access

DM write

reg write

ALU op

start DM access

exception

shift/rotate

reg 2 access

IM access

summary
Summary
  • All modern day processors use pipelining
  • Pipelining doesn’t help latency of single task, it helps throughput of entire workload
  • Potential speedup: CPI = 1
  • Pipeline rate limited by slowest pipeline stage
    • Unbalanced pipe stages makes for inefficiencies
    • The time to “fill” pipeline and time to “drain” it can impact speedup for deep pipelines and short code runs
  • Must detect and resolve hazards
    • Stalling negatively affects CPI (makes CPI larger than the ideal of 1)