csci 2500 computer organization n.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
CSCI-2500: Computer Organization PowerPoint Presentation
Download Presentation
CSCI-2500: Computer Organization

Loading in 2 Seconds...

play fullscreen
1 / 183

CSCI-2500: Computer Organization - PowerPoint PPT Presentation


  • 101 Views
  • Uploaded on

CSCI-2500: Computer Organization. Processor Design. Datapath. The datapath is the interconnection of the components that make up the processor. The datapath must provide connections for moving bits between memory, registers and the ALU. Control.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'CSCI-2500: Computer Organization' - mira-sparks


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
datapath
Datapath
  • The datapath is the interconnection of the components that make up the processor.
  • The datapath must provide connections for moving bits between memory, registers and the ALU.

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 2

control
Control
  • The control is a collection of signals that enable/disable the inputs/outputs of the various components.
  • You can think of the control as the brain, and the datapath as the body.
    • the datapath does only what the brain tells it to do.

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 3

processor design
Processor Design

The sequencing and execution of instructions

  • We already know about many of the individual components that are necessary:
    • ALU, Multiplexors, Decoders, Flip-Flops
  • We need to discuss how to use a clock
  • We need to think about registers and memory.

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 4

the clock
The Clock

The clock generates a never-ending sequence of alternating 1s and 0s.

All operations are synchronized to the clock.

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 5

clocking methodology
Clocking Methodology
  • Determines when (relative to the clock) a signal can be read and written.
    • Read: signal value is used by some component.
    • Written: a signal value is generated by some component.

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 6

simple example enabled and
Simple Example: Enabled AND
  • We want an AND gate that holds it’s output value constant until the clock switches from 0 (lo) to 1 (hi).
  • We can use a flip-flop to hold the inputs to the AND gate constant during the time we want the output constant.
  • We use a clocked flip-flop to make things happen when the clock changes.

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 7

d flip flop reminder

D

Q

Clock

Q

D Flip-Flop Reminder

The output (Q) changes to reflect D only when the Clock is a 1.

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 8

d flip flop timing
D Flip-Flop Timing

1

0

D

1

0

C

1

0

Q

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 9

clocked and gate
Clocked AND gate

A

D

flip-flop

D

Q

C

A•B (clocked)

B

D

flip-flop

D

Q

C

Clock

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 10

edge triggered clocking
Edge-triggered Clocking
  • Values stored are updated (can change) only on a clock edge.
    • When the clock switches from 0 to 1 everybody allows signals in.
      • everybody means state elements
      • combinational elements always do the same thing, they don’t care about the clock (that’s why we added the flip-flops to our AND gate).

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 11

state elements
State Elements
  • Any component that stores one or more values is a state element.
    • The entire processor can be viewed as a circuit that moves from one state (collection of all the state elements) to another state.
    • At time i a component uses values generated at time i-1.

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 12

register file
Register File

R

e

a

d

r

e

g

i

s

t

e

r

  • Contains multiple registers
    • each holds 32 bits
  • Two output ports (read ports)
  • One input port (write port)
  • To change the value of a register:
    • supply register number
    • supply data
    • clock (the Write control signal)

R

e

a

d

n

u

m

b

e

r

1

d

a

t

a

1

R

e

a

d

r

e

g

i

s

t

e

r

n

u

m

b

e

r

2

R

e

g

i

s

t

e

r

f

i

l

e

W

r

i

t

e

r

e

g

i

s

t

e

r

R

e

a

d

d

a

t

a

2

W

r

i

t

e

d

a

t

a

W

r

i

t

e

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 13

implementation of read ports

R

e

a

d

r

e

g

i

s

t

e

r

n

u

m

b

e

r

1

R

e

g

i

s

t

e

r

0

R

e

g

i

s

t

e

r

1

M

u

R

e

a

d

d

a

t

a

1

x

R

e

g

i

s

t

e

r

n

1

R

e

g

i

s

t

e

r

n

R

e

a

d

r

e

g

i

s

t

e

r

n

u

m

b

e

r

2

M

u

R

e

a

d

d

a

t

a

2

x

Implementation of Read Ports

Figure B.19

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 14

implementation of write

W

r

i

t

e

C

0

R

e

g

i

s

t

e

r

0

1

D

n

-

t

o

-

1

C

R

e

g

i

s

t

e

r

n

u

m

b

e

r

d

e

c

o

d

e

r

R

e

g

i

s

t

e

r

1

D

n

1

n

C

R

e

g

i

s

t

e

r

n

1

D

C

R

e

g

i

s

t

e

r

n

D

R

e

g

i

s

t

e

r

d

a

t

a

Implementation of Write

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 15

memory
Memory
  • Memory is similar to a very large register file:
    • single read port (output)
    • chip select input signal
    • output enable input signal
    • write enable input signal
    • address lines (determine which memory element)
    • data input lines (used to write a memory element)

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 16

4 x 2 memory sram

D

i

n

[

1

]

D

i

n

[

0

]

D

D

D

D

C

l

a

t

c

h

Q

C

l

a

t

c

h

Q

W

r

i

t

e

e

n

a

b

l

e

E

n

a

b

l

e

E

n

a

b

l

e

0

2

-

t

o

-

4

D

D

D

D

d

e

c

o

d

e

r

C

l

a

t

c

h

Q

C

l

a

t

c

h

Q

E

n

a

b

l

e

E

n

a

b

l

e

1

D

D

D

D

A

d

d

r

e

s

s

C

l

a

t

c

h

Q

C

l

a

t

c

h

Q

E

n

a

b

l

e

E

n

a

b

l

e

2

D

D

D

D

C

l

a

t

c

h

Q

C

l

a

t

c

h

Q

E

n

a

b

l

e

E

n

a

b

l

e

3

D

o

u

t

[

1

]

D

o

u

t

[

0

]

4 x 2 Memory (SRAM)

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 17

memory usage
Memory Usage
  • For now, we treat memory as a single component that supports 2 operations:
    • write (we change the value stored in a memory location)
    • read (we get the value currently stored in a memory location).
  • We can only do one operation at a time!

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 18

instruction data memory
Instruction & Data Memory
  • It is useful to treat the memory that holds instructions as a separate component.
    • instruction memory is read-only
  • Typically there is really one memory that holds both instructions and data.
    • as we will see when we talk more about memory, the processor often has two interfaces to the memory, one for instructions and one for data!

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 19

designing a datapath for mips
Designing a Datapath for MIPS
  • We start by looking at the datapaths needed to support a simple subset of MIPS instructions:
    • a few arithmetic and logical instructions
    • load and store word
    • beq and j instructions

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 20

functions for mips instructions
Functions for MIPS Instructions
  • We can generalize the functions we need to:
    • using the PC register as the address, read a value from the memory (read the instruction)
    • Read one or two register values (depends on the specific instruction).
    • ALU Operation , Memory read or write, …
    • Possibly change the value of a register.

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 21

fetching the next instruction
Fetching the next instruction
  • PC Register holds the address
  • Memory holds the instruction
    • we need to read from memory.
  • Need to update the PC
    • add 4 to current value

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 22

instruction fetch datapath
Instruction Fetch DataPath

A

d

d

4

R

e

a

d

P

C

a

d

d

r

e

s

s

I

n

s

t

r

u

c

t

i

o

n

I

n

s

t

r

u

c

t

i

o

n

m

e

m

o

r

y

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 23

supporting r format instructions

5 bits

5 bits

5 bits

5 bits

6 bits

6 bits

op

rs

rt

rd

shamt

funct

Supporting R-format instructions

Includes add, sub, slt, and & or instructions.

Generalization:

  • read 2 registers and send to ALU.
  • perform ALU operation
  • store result in a register

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 24

mips registers
MIPS Registers
  • MIPS has 32 general purpose registers.
  • Register File holds all 32 registers
    • need 5 bits to select a register
    • rs,rt & rd fields in R-format instructions.
  • MIPS Register File has 2 read ports.
    • can get at both source registers at the same time.

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 25

datapath for r format instructions
Datapath for R-format Instructions

A

L

U

o

p

e

r

a

t

i

o

n

3

R

e

a

d

r

e

g

i

s

t

e

r

1

R

e

a

d

d

a

t

a

1

R

e

a

d

Z

e

r

o

r

e

g

i

s

t

e

r

2

I

n

s

t

r

u

c

t

i

o

n

R

e

g

i

s

t

e

r

s

A

L

U

A

L

U

W

r

i

t

e

r

e

s

u

l

t

r

e

g

i

s

t

e

r

R

e

a

d

d

a

t

a

2

W

r

i

t

e

d

a

t

a

R

e

g

W

r

i

t

e

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 26

load and store instructions
Load and Store Instructions

Need to compute the address

  • offset (part of the instruction)
  • base (stored in a register).

For Load:

  • read from memory
  • store in a register

For Store:

  • read from register
  • write to memory

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 27

computing the address
Computing the address
  • 16 bit signed offset is part of the instruction.
  • We have a 32 bit ALU.
    • need to sign extend the offset (to 32 bits).
  • Feed the 32 bit offset and the contents of a register to the ALU
  • Tell the ALU to “add”.

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 28

load store datapath
Load/Store Datapath

A

L

U

o

p

e

r

a

t

i

o

n

3

R

e

a

d

r

e

g

i

s

t

e

r

1

M

e

m

W

r

i

t

e

R

e

a

d

d

a

t

a

1

R

e

a

d

Z

e

r

o

r

e

g

i

s

t

e

r

2

I

n

s

t

r

u

c

t

i

o

n

A

L

U

R

e

g

i

s

t

e

r

s

A

L

U

R

e

a

d

W

r

i

t

e

r

e

s

u

l

t

A

d

d

r

e

s

s

d

a

t

a

r

e

g

i

s

t

e

r

R

e

a

d

d

a

t

a

2

W

r

i

t

e

D

a

t

a

d

a

t

a

m

e

m

o

r

y

W

r

i

t

e

R

e

g

W

r

i

t

e

d

a

t

a

1

6

3

2

S

i

g

n

M

e

m

R

e

a

d

e

x

t

e

n

d

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 29

supporting beq
Supporting beq
  • 2 registers compared for equality
  • 16 bit offset used to compute target address.
    • signed offset is relative to the PC
    • offset is in words not in bytes!
  • Might branch, might not (need to decide).

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 30

computing target address
Computing target address
  • Recall that the offset is actually relative to the address of the next instruction.
    • we always add 4 to the PC, we must make sure we use this value as the base.
  • Word vs. Byte offset
    • we just need to shift the 16 bit offset 2 bits to the right (fill with 2 zeros).

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 31

branch datapath

P

C

+

4

f

r

o

m

i

n

s

t

r

u

c

t

i

o

n

d

a

t

a

p

a

t

h

A

d

d

S

u

m

B

r

a

n

c

h

t

a

r

g

e

t

S

h

i

f

t

l

e

f

t

2

A

L

U

o

p

e

r

a

t

i

o

n

3

R

e

a

d

r

e

g

i

s

t

e

r

1

I

n

s

t

r

u

c

t

i

o

n

R

e

a

d

d

a

t

a

1

R

e

a

d

r

e

g

i

s

t

e

r

2

T

o

b

r

a

n

c

h

R

e

g

i

s

t

e

r

s

A

L

U

Z

e

r

o

c

o

n

t

r

o

l

l

o

g

i

c

W

r

i

t

e

r

e

g

i

s

t

e

r

R

e

a

d

d

a

t

a

2

W

r

i

t

e

d

a

t

a

R

e

g

W

r

i

t

e

1

6

3

2

S

i

g

n

e

x

t

e

n

d

Branch Datapath

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 32

datapaths
Datapaths

We looked at individual datapaths that support:

  • Fetching Instructions
  • Arithmetic/Logical Instructions
  • Load & Store Instructions
  • Conditional branch

We need to combine these in to a single datapath.

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 33

issues
Issues
  • When designing one datapath that can be used for any operation:
    • the goal is to be able to handle one instruction per cycle.
    • must make sure no datapath resource needs to be used more than once at the same time.
    • if so – we need to provide more than one!

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 34

sharing resources
Sharing Resources
  • We can share datapath resources by adding a multiplexor (and a control line).
    • for example, the second input to the ALU could come from either:
      • a register (as in an arithmetic instruction)
      • from the instruction (as in a load/store – when computing the memory address).

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 35

sharing with a multiplexor example
Sharing with a Multiplexor Example

Operand 1

A

A+B (Control==0)

Operand 2

ADD

A+C (Control==1)

B

C

Control

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 36

combining datapaths for memory instructions and arithmetic instructions
Combining Datapaths for memory instructions and arithmetic instructions
  • Need to share the ALU
    • For memory instructions used to compute the address in memory.
    • For Arithmetic/Logical instructions used to perform arithmetic/logical operation.

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 37

slide38

R

e

a

d

r

e

g

i

s

t

e

r

1

R

e

a

d

d

a

t

a

1

R

e

a

d

r

e

g

i

s

t

e

r

2

I

n

s

t

r

u

c

t

i

o

n

R

e

g

i

s

t

e

r

s

R

e

a

d

R

e

a

d

W

r

i

t

e

d

a

t

a

2

d

a

t

a

r

e

g

i

s

t

e

r

M

M

u

u

W

r

i

t

e

x

D

a

t

a

x

d

a

t

a

m

e

m

o

r

y

W

r

i

t

e

d

a

t

a

1

6

3

2

S

i

g

n

e

A

L

U

o

p

e

r

a

t

i

o

n

3

M

e

m

W

r

i

t

e

M

e

m

t

o

R

e

g

A

L

U

S

r

c

Z

e

r

o

A

L

U

A

L

U

A

d

d

r

e

s

s

r

e

s

u

l

t

R

e

g

W

r

i

t

e

M

e

m

R

e

a

d

x

t

e

n

d

New Controls

Sharing Multiplexors

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 38

adding the instruction fetch
Adding the Instruction Fetch
  • One memory for instructions, separate memory for data.
    • otherwise we might need to use the memory twice in the same instruction.
  • Dedicated Adder for updating the PC
    • otherwise we might need to use the ALU twice in the same instruction.

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 39

slide40

A

d

d

4

R

e

g

i

s

t

e

r

s

R

e

a

d

A

L

U

o

p

e

r

a

t

i

o

n

3

M

e

m

W

r

i

t

e

r

e

g

i

s

t

e

r

1

R

e

a

d

P

C

R

e

a

d

M

e

m

t

o

R

e

g

R

e

a

d

a

d

d

r

e

s

s

d

a

t

a

1

r

e

g

i

s

t

e

r

2

A

L

U

S

r

c

Z

e

r

o

I

n

s

t

r

u

c

t

i

o

n

A

L

U

R

e

a

d

A

L

U

R

e

a

d

W

r

i

t

e

A

d

d

r

e

s

s

r

e

s

u

l

t

d

a

t

a

2

r

e

g

i

s

t

e

r

d

a

t

a

M

M

u

I

n

s

t

r

u

c

t

i

o

n

u

W

r

i

t

e

x

D

a

t

a

x

m

e

m

o

r

y

d

a

t

a

m

e

m

o

r

y

W

r

i

t

e

R

e

g

W

r

i

t

e

d

a

t

a

1

6

3

2

S

i

g

n

M

e

m

R

e

a

d

e

x

t

e

n

d

Dedicated Adder

Two Memory Units

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 40

need to add datapath for beq
Need to add datapath for beq
  • Register comparison (requires ALU).
  • Another adder to compute target address.
    • One input to adder is sign extended offset, shifted by 2 bits.
    • Other input to adder is PC+4

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 41

slide42

P

C

S

r

c

M

A

d

d

u

x

A

L

U

A

d

d

4

r

e

s

u

l

t

S

h

i

f

t

l

e

f

t

2

R

e

g

i

s

t

e

r

s

A

L

U

o

p

e

r

a

t

i

o

n

3

R

e

a

d

M

e

m

W

r

i

t

e

A

L

U

S

r

c

R

e

a

d

r

e

g

i

s

t

e

r

1

P

C

R

e

a

d

a

d

d

r

e

s

s

R

e

a

d

M

e

m

t

o

R

e

g

d

a

t

a

1

Z

e

r

o

r

e

g

i

s

t

e

r

2

I

n

s

t

r

u

c

t

i

o

n

A

L

U

A

L

U

R

e

a

d

W

r

i

t

e

R

e

a

d

A

d

d

r

e

s

s

r

e

s

u

l

t

M

d

a

t

a

r

e

g

i

s

t

e

r

d

a

t

a

2

M

u

I

n

s

t

r

u

c

t

i

o

n

u

x

W

r

i

t

e

m

e

m

o

r

y

D

a

t

a

x

d

a

t

a

m

e

m

o

r

y

W

r

i

t

e

R

e

g

W

r

i

t

e

d

a

t

a

3

2

1

6

S

i

g

n

M

e

m

R

e

a

d

e

x

t

e

n

d

New adder and mux

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 42

slide43
Whew!
  • Keep in mind that the datapath we now have supports just a few MIPS instructions!
  • Things get worse (more complex) as we support other instructions:

j jal jr addi

  • We won’t worry about them now…

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 43

control unit
Control Unit
  • We need something that can generate the controls in the datapath.
  • Depending on what kind of instruction we are executing, different controls should be turned on (asserted) and off (deasserted).
  • We need to treat each control individually (as a separate boolean function).

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 44

controls
Controls
  • Our datapath includes a bunch of controls:
    • ALU operation (3 bits)
    • RegWrite
    • ALUSrc
    • MemWrite
    • MemtoReg
    • MemRead
    • PCSrc

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 45

alu operation control
ALU Operation Control
  • A 3 bit control (assumes the ALU designed in chapter 3):

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 46

alu functions for other instructions
ALU Functions for other instructions

lw , sw (load/store): addition

beq: subtraction

add, sub, and, or, slt (arithmetic/logical): All R-format instructions

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 47

r format instructions

5 bits

5 bits

5 bits

5 bits

6 bits

6 bits

op

rs

rt

rd

shamt

funct

R-Format Instructions

Operation is specified by some bits in the funct field in the instruction.

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 48

mips instruction opcodes

6 bits

MIPS Instruction OPCODEs
  • The MS 6 bits are an OPCODE that identifies the instruction.
  • R-Format: always 000000
    • (funct identifies the operation)

lw sw beq

100011 101011 000100

op

varies depending on instruction

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 49

generating alu controls
Generating ALU Controls

We can view the 3 bit ALU control as 3 boolean functions. Inputs are:

  • the op field (OPCODE)
  • funct field (for R-format instructions only)

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 50

simplifying the opcode
Simplifying The Opcode

For building the ALU Operation Controls, we are interested in only 4 different opcodes.

We can simplify things by first reducing the 6 bit op field to a 2 bit value we will call ALUOp

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 51

build a truth table
Build a Truth Table
  • We can now build a truth table for the 3 bit ALU control.
  • Inputs are:
    • 2 bit ALUOp
    • 6 bit funct field
  • Abbreviated Truth Table: only show the rows we care about!

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 53

slide54

x means “don’t care”

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 54

adding the alu control
Adding the ALU Control
  • We can now add the ALU control to the datapath:
    • inputs to this control come from the instruction and from ALUOp
  • If we try to show all the details the picture becomes too complex:
    • just plop in an “ALU Control” box.

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 55

slide56

P

C

S

r

c

1

M

A

d

d

u

x

A

L

U

0

4

A

d

d

r

e

s

u

l

t

S

h

i

f

t

R

e

g

W

r

i

t

e

l

e

f

t

2

I

n

s

t

r

u

c

t

i

o

n

[

2

5

2

1

]

R

e

a

d

r

e

g

i

s

t

e

r

1

R

e

a

d

M

e

m

W

r

i

t

e

R

e

a

d

P

C

d

a

t

a

1

I

n

s

t

r

u

c

t

i

o

n

[

2

0

1

6

]

a

d

d

r

e

s

s

R

e

a

d

M

e

m

t

o

R

e

g

A

L

U

S

r

c

r

e

g

i

s

t

e

r

2

Z

e

r

o

I

n

s

t

r

u

c

t

i

o

n

R

e

a

d

1

A

L

U

A

L

U

[

3

1

0

]

1

R

e

a

d

W

r

i

t

e

d

a

t

a

2

1

A

d

d

r

e

s

s

r

e

s

u

l

t

M

r

e

g

i

s

t

e

r

M

d

a

t

a

u

M

I

n

s

t

r

u

c

t

i

o

n

u

I

n

s

t

r

u

c

t

i

o

n

[

1

5

1

1

]

x

W

r

i

t

e

u

x

m

e

m

o

r

y

R

e

g

i

s

t

e

r

s

x

0

d

a

t

a

0

D

a

t

a

0

W

r

i

t

e

m

e

m

o

r

y

R

e

g

D

s

t

d

a

t

a

1

6

3

2

S

i

g

n

I

n

s

t

r

u

c

t

i

o

n

[

1

5

0

]

e

x

t

e

n

d

A

L

U

M

e

m

R

e

a

d

c

o

n

t

r

o

l

I

n

s

t

r

u

c

t

i

o

n

[

5

0

]

A

L

U

O

p

Shows which bits from the instruction

are fed to register file inputs

ALU Control

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 56

implementing other controls
Implementing Other Controls
  • The other controls in out datapath must also be specified as functions.
  • We need to determine the inputs to all the functions.
    • primarily the inputs are part of the instructions, but there are exceptions.
  • Need to define precisely what conditions should turn on each control.

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 57

regdst control line

op

rs

rt

rd

shamt

funct

op

rs

rt

address

RegDst Control Line
  • Controls a multiplexor that selects on of the fields rt or rd from an R-format or I-format instruction.
  • I-Format is used for load and store.
  • sw needs to write to the register rt.

I-format

R-format

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 58

regdst usage
RegDst usage
  • RegDst should be
    • 0 to send rt to the write register # input.
    • 1 to send rd to the write register # input.
  • RegDst is a function of the opcode field:
    • If instruction is sw, RegDst should be 0
    • For all other instrs RegDst should be 1

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 59

regwrite control
RegWrite Control
  • a 1 tells the register file to write a register.
    • whatever register is specified by the write register # input is written with the data on the write register data inputs.
  • Should be a 1 for arithmetic/logical instructions and for a store.
  • Should be a 0 for load or beq.

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 60

alusrc control
ALUSrc Control
  • MUX that selects the source for the second ALU operand.
    • 1 means select the second register file output (read data 2).
    • 0 means select the sign-extended 16 bit offset (part of the instruction).
  • Should be a 1 for load and store.
  • Should be a 0 for everything else.

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 61

slide62

MemRead Control

  • A 1 tells the memory to put the contents of the memory location (specified by the address lines) on the Read data output.
  • Should be a 1 for load.
  • Should be a 0 for everything else.

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 62

slide63

MemWrite Control

  • 1 means that memory location (specified by memory address lines) should get the value specified on the memory Write Data input.
  • Should be a 1 for store.
  • Should be a 0 for everything else.

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 63

slide64

MemToReg Control

  • MUX that selects the value to be stored in a register (that goes to register write data input).
    • 1 means select the value coming from the memory data output.
    • 0 means select value coming from the ALU output.
  • Should be a 1 for load and any arithmetic/logical instructions.
  • Should be a 0 for everything else (sw, beq).

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 64

slide65

PCSrc Control

  • MUX that selects the source for the value written in to the PC register.
    • 1 means select the output of the Adder used to compute the relative address for a branch.
    • 0 means select the output of the PC+4 adder.
  • Should be a 1 for beq if registers are equal!
  • Should be a 0 for other instructions or if registers are different.

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 65

pcsrc depends on result of alu operation
PCSrc depends on result of ALU operation!
  • This control line can’t be simply a function of the instruction (all the others can).
  • PCSrc should be a 1 only when:
    • beq AND ALU zero output is a 1
  • We will generate a signal called “branch” that we can AND with the ALU zero output.

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 66

truth table for control
Truth Table for Control

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 67

slide68

0

M

u

x

A

L

U

A

d

d

1

r

e

s

u

l

t

A

d

d

S

h

i

f

t

P

C

S

r

c

l

e

f

t

2

R

e

g

D

s

t

4

B

r

a

n

c

h

M

e

m

R

e

a

d

M

e

m

t

o

R

e

g

I

n

s

t

r

u

c

t

i

o

n

[

3

1

2

6

]

C

o

n

t

r

o

l

A

L

U

O

p

M

e

m

W

r

i

t

e

A

L

U

S

r

c

R

e

g

W

r

i

t

e

I

n

s

t

r

u

c

t

i

o

n

[

2

5

2

1

]

R

e

a

d

R

e

a

d

r

e

g

i

s

t

e

r

1

P

C

R

e

a

d

a

d

d

r

e

s

s

d

a

t

a

1

I

n

s

t

r

u

c

t

i

o

n

[

2

0

1

6

]

a

d

R

e

Z

e

r

o

r

e

g

i

s

t

e

r

2

I

n

s

t

r

u

c

t

i

o

n

0

R

e

g

i

s

t

e

r

s

A

L

U

R

e

a

d

A

L

U

[

3

1

0

]

0

R

e

a

d

W

r

i

t

e

A

d

d

r

e

s

s

M

d

a

t

a

2

r

e

s

u

l

t

1

d

a

t

a

I

n

s

t

r

u

c

t

i

o

n

r

e

g

i

s

t

e

r

M

u

M

u

m

e

m

o

r

y

x

u

I

n

s

t

r

u

c

t

i

o

n

[

1

5

1

1

]

W

r

i

t

e

x

D

a

t

a

1

x

d

a

t

a

1

m

e

m

o

r

y

0

W

r

i

t

e

d

a

t

a

1

6

3

2

I

n

s

t

r

u

c

t

i

o

n

[

1

5

0

]

S

i

g

n

e

x

t

e

n

d

A

L

U

c

o

n

t

r

o

l

I

n

s

t

r

u

c

t

i

o

n

[

5

0

]

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 68

single cycle instructions
Single Cycle Instructions
  • View the entire datapath as a combinational circuit.
  • We can follow the flow of an instruction through the datapath.
    • single cycle instruction means that there are not really any steps – everything just happens and becomes finalized when the clock cycle is over.

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 69

add t1 t2 t3
add $t1,$t2,$t3
  • Control Lines:
    • ALU Controls specify an ALU add operation.
    • RegWrite will be a 1 so that when the clock cycle ends the value on the Register Write Input lines will be written to a register.
    • all other control lines are 0.

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 70

lw t1 offset t2
lw $t1,offset($t2)
  • Control Lines:
    • ALU Control set for an add operation.
    • ALUSrc is set to 1 to indicate the second operand is sign extended offset.
    • MemRead would be a 1.
    • RegDst would select the correct bits from the instruction to specify the dest. register.
    • RegWrite would be a 1.

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 71

disadvantage of single cycle operation
Disadvantage of single cycle operation

If we have instructions execute in a single cycle, then the cycle time must be long enough for the slowest instruction.

  • all instructions take the same time as the slowest.

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 72

multicycle implementation
Multicycle Implementation
  • Chop up the processing of instructions in to discrete stages.
  • Each stage takes one clock cycle.
    • we can implement each stage as a big combinational circuit (like we just did for the whole thing).
    • provide some way to sequence through the stages.

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 73

advantages of multicycle
Advantages of Multicycle
  • Only need those stages required by an instruction.
    • the control unit is more complex, but instructions only take as long as necessary.
  • We can share components
    • perhaps 2 different stages can use the same ALU.
    • We don’t need to duplicate resources.

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 74

additional resources for multicycle
Additional Resources for Multicycle
  • To implement a multicycle implementation we need some additional registers that can be used to hold intermediate values.
    • instruction
    • computed address
    • result of ALU operation

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 75

slide76

I

n

s

t

r

u

c

t

i

o

n

r

e

g

i

s

t

e

r

D

a

t

a

P

C

A

d

d

r

e

s

s

A

R

e

g

i

s

t

e

r

#

I

n

s

t

r

u

c

t

i

o

n

A

L

U

A

L

U

O

u

t

M

e

m

o

r

y

R

e

g

i

s

t

e

r

s

o

r

d

a

t

a

R

e

g

i

s

t

e

r

#

M

e

m

o

r

y

d

a

t

a

B

D

a

t

a

r

e

g

i

s

t

e

r

R

e

g

i

s

t

e

r

#

Multicycle Datapath

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 76

multicycle datapath for mips
Multicycle Datapath for MIPS

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 77

mc dp with control
MC DP with Control

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 78

instruction stages
Instruction Stages
  • Instruction Fetch
  • Instruction decode/register fetch
  • ALU operation/address computation
  • Memory Access
  • Register Write

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 79

complete multicyle datapath control
Complete Multicyle Datapath & Control

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 80

instruction fetch decode if id state machine
Instruction Fetch/Decode (IF/ID) State Machine

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 81

memory reference state machine
Memory Reference State Machine

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 82

r type instruction state machine
R-type Instruction State Machine

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 83

branch jump state machine
Branch/Jump State Machine

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 84

put it all together
Put it all together!

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 85

control for multicycle
Control for Multicycle
  • Need to define the controls
  • Need to come up with some way to sequence the controls
    • Two techniques
      • finite state machine
      • microprogramming

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 86

finite state machine
Finite State Machine

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 87

microprogramming sec 5 7
MicroProgramming (sec. 5.7)
  • The idea is to build a (very small) processor to generate the controls signals at the right time.
  • At each stage (cycle) one microinstruction is executed – the result changes the value of the control signals.
  • Somebody writes the microinstructions that make up each MIPS instruction.

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 88

example microinstructions
Example microinstructions

Fetch next instruction:

  • turn on instruction memory read
  • feed PC to memory address input
  • write memory data output in to a holding register.

Compute Address:

  • route contents of base register to ALU
  • route sign-extended offset to ALU
  • perform ALU add
  • write ALU output in to a holding register.

microinstruction

Control Signals

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 89

sequencing
Sequencing
  • In addition to setting some control signals, each microinstruction must specify the next microinstruction that should be executed.
  • 3 Options:
    • execute next microinstruction (default)
    • start next MIPS instruction (Fetch)
    • Dispatch (depends on control unit inputs).

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 90

microinstruction format
Microinstruction Format
  • A bunch of bits – one for each control line needed by the control unit.
    • bits specify the values of the control lines directly.
  • Some bits that are used to determine the next microinstruction executed.

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 91

dispatch sequencing
Dispatch Sequencing
  • Can be implemented as a table lookup.
    • bits in the microinstruction tell what row in the table.
    • inputs to the control unit tell what column.
    • value stored in table determines the microaddress of the next microinstruction.
  • This is a simplified description (called a microdescription)

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 92

exceptions interrupts
Exceptions & Interrupts
  • Hardest part of control is implementing exceptions and interrupts – i.e., events that change the normal flow of instruction execution.
  • MIPS convention
    • Exception refers to any unexpected change in control flow w/o knowing if the cause is internal or external.
    • Interrupts refer to only events who are externally caused.
    • Ex. Interrupts: I/O device request (ignore for now)
    • Ex. Exceptions: undefined instruction, arithmetic overflow

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 93

handling exceptions
Handling Exceptions
  • Let’s implemented exceptions for handling
    • Undefined instruction
    • Overflow
  • Basic actions
    • Save the offending instruction address in the Exception Program Counter (EPC).
    • Transfer control to the OS at some specified address
    • Once exception is handled by OS, then either terminate the program or continue on using the EPC to determine where to restart.
  • OS actions are determined based on what caused the exception.
    • So, OS needs a Cause register which determines which path w/i the exception
    • Alternative implementation – Vectored Interrupts – where each cause of an exception or interrupt is given a specific OS address to jump to.
    • We’ll use the first method.

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 94

extending the multicycle d c
Extending the Multicycle D&C
  • What datapath elements to add?
    • EPC: a 32-bit register used to hold the address of the affected instruction.
    • Cause: A 32-bit register used to record the cause of the exception. (undef instruction = 0 and overflow = 1).
  • What control lines to add?
    • EPCWrite and Cause write control signals to allow regs to be written.
    • IntCause (1-bit) control signal to set the low-order bit of the cause register to the appropriate value.

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 95

revised datapath control
Revised Datapath & Control

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 96

final fsm w exception handling
Final FSM w/ exception handling

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 97

multicycle instructions
Multicycle Instructions
  • Chop each instruction in to stages.
  • Each stage takes one cycle.
  • We need to provide some way to sequence through the stages:
    • microinstructions
  • Stages can share resources (ALU, Memory).

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 99

pipelining1
Pipelining
  • We can overlap the execution of multiple instructions.
  • At any time, there are multiple instructions being executed – each in a different stage.
  • So much for sharing resources ?!?

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 100

the laundry analogy
The Laundry Analogy

Non-pipelined approach:

  • run 1 load of clothes through washer
  • run load through dryer
  • fold the clothes (optional step for students)
  • put the clothes away (also optional).

Two loads? Start all over.

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 101

pipelined laundry
Pipelined Laundry
  • While the first load is drying, put the second load in the washing machine.
  • When the first load is being folded and the second load is in the dryer, put the third load in the washing machine.
  • Admittedly unrealistic scenario for CS students, as most only own 1 load of clothes…

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 102

slide103

1

6

P

M

7

8

9

1

0

1

1

1

2

2

A

M

T

i

m

e

T

a

s

k

o

r

d

e

r

A

B

C

D

1

6

P

M

7

8

9

1

0

1

1

1

2

2

A

M

T

i

m

e

T

a

s

k

o

r

d

e

r

A

B

C

D

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 103

laundry performance
Laundry Performance
  • For 4 loads:
    • non-pipelined approach takes 16 units of time.
    • pipelined approach takes 7 units of time.
  • For 816 loads:
    • non-pipelined approach takes 3264 units of time.
    • pipelined approach takes 819 units of time.

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 104

execution time vs throughput
Execution Time vs. Throughput
  • It still takes the same amount of time to get your favorite pair of socks clean, pipelining won’t help.
  • However, the total time spent away from CompOrg homework is reduced.

It's the classic “Socks vs. CompOrg” issue.

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 105

instruction pipelining
Instruction Pipelining

First we need to break instruction execution into discrete stages:

  • Instruction Fetch
  • Instruction Decode/ Register Fetch
  • ALU Operation
  • Data Memory access
  • Write result into register

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 106

operation timings
Operation Timings
  • Some estimated timings for each of the stages:

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 107

comparison

P

r

o

g

r

a

m

e

x

e

c

u

t

i

o

n

o

r

d

e

r

(

i

n

i

n

s

t

r

u

c

t

i

o

n

s

)

.

.

.

1

4

2

4

6

8

1

0

1

2

T

i

m

e

I

n

s

t

r

u

c

t

i

o

n

D

a

t

a

l

w

$

1

,

1

0

0

(

$

0

)

R

e

g

A

L

U

R

e

g

f

e

t

c

h

a

c

c

e

s

s

I

n

s

t

r

u

c

t

i

o

n

D

a

t

a

l

w

$

2

,

2

0

0

(

$

0

)

200

p

s

R

e

g

A

L

U

R

e

g

f

e

t

c

h

a

c

c

e

s

s

l

w

$

3

,

3

0

0

(

$

0

)

Comparison

2

4

6

8

1

0

1

2

1

4

1

6

1

8

T

i

m

e

I

n

s

t

r

u

c

t

i

o

n

D

a

t

a

l

w

$

1

,

1

0

0

(

$

0

)

R

e

g

A

L

U

R

e

g

f

e

t

c

h

a

c

c

e

s

s

I

n

s

t

r

u

c

t

i

o

n

D

a

t

a

l

w

$

2

,

2

0

0

(

$

0

)

R

e

g

A

L

U

R

e

g

800

p

s

f

e

t

c

h

a

c

c

e

s

s

I

n

s

t

r

u

c

t

i

o

n

l

w

$

3

,

3

0

0

(

$

0

)

800

p

s

f

e

t

c

h

800

p

s

P

r

o

g

r

a

m

e

x

e

c

u

t

i

o

n

o

r

d

e

r

(

i

n

i

n

s

t

r

u

c

t

i

o

n

s

)

I

n

s

t

r

u

c

t

i

o

n

D

a

t

a

200

p

s

R

e

g

A

L

U

R

e

g

f

e

t

c

h

a

c

c

e

s

s

200

p

s

200

p

s

200

p

s

200

p

s

200

p

s

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 108

risc and pipelining
RISC and Pipelining
  • One of the major advantages of RISC instruction sets is the relative simplicity of a pipeline implementation.
    • It’s much more complex in a CISC processor!!
  • RISC (MIPS) design features that make pipelining easy include:
    • single length instruction (always 1 word)
    • relatively few instruction formats
    • load/store instruction set
    • operands must be aligned in memory (a single data transfer instruction requires a single memory operation).

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 109

pipeline hazard
Pipeline Hazard
  • Something happens that means the next instruction cannot execute in the following clock cycle.
  • Three kinds of hazards:
    • structural hazard
    • control hazard
    • data hazard

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 110

structural hazards
Structural Hazards
  • Two stages require the same resource.
    • What if we only had enough electricity to run either the washer or the dryer at any given time?
    • What if MIPS datapath had only one memory unit instead of separate instruction and data memory?

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 111

avoiding structural hazards
Avoiding Structural Hazards
  • Design the pipeline carefully.
  • Might need to duplicate resources
    • an Adder to update PC, and ALU to perform other operations.
  • Detecting structural hazards at execution time (and delaying execution) is not something we want to do (structural hazards are minimized in the design phase).

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 112

control hazards
Control Hazards
  • When one instruction needs to make a decision based on the results of another instruction that has not yet finished.
  • Example: conditional branch
    • The instruction that is fed to the pipeline right after a beq depends on whether or not the branch is taken.

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 113

beq control hazard

slt

beq

???

beq Control Hazard

a = b+c;

if (x!=0)

y++;

...

slt $t0,$s0,$s1

beq $t0,$zero,skip

addi $s0,$s0,1

skip:

lw $s3,0($t3)

The instruction to follow the beq could be either the addi or the lw, it depends on the result of the beq instruction.

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 114

one possible solution stall
One possible solution - stall
  • We can include in the control unit the ability to stall (to keep new instructions from entering the pipeline until we know which one).
  • Unfortunately conditional branches are very common operations, and this would slow things down considerably.

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 115

a stall

P

r

o

g

r

a

m

e

x

e

c

u

t

i

o

n

1

4

1

6

2

4

6

8

1

0

1

2

T

i

m

e

o

r

d

e

r

(

i

n

i

n

s

t

r

u

c

t

i

o

n

s

)

I

n

s

t

r

u

c

t

i

o

n

D

a

t

a

R

e

g

A

L

U

R

e

g

a

d

d

$

4

,

$

5

,

$

6

f

e

t

c

h

a

c

c

e

s

s

I

n

s

t

r

u

c

t

i

o

n

D

a

t

a

b

e

q

$

1

,

$

2

,

4

0

R

e

g

A

L

U

R

e

g

f

e

t

c

h

a

c

c

e

s

s

2

n

s

I

n

s

t

r

u

c

t

i

o

n

D

a

t

a

l

w

$

3

,

3

0

0

(

$

0

)

R

e

g

A

L

U

R

e

g

f

e

t

c

h

a

c

c

e

s

s

4

n

s

2

n

s

A Stall

To achieve a 1 cycle stall (as shown above), we need to modify the implementation of the beq instruction so that the decision is made by the end of the second stage.

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 116

another strategy
Another strategy
  • Predict whether or not the branch will be taken.
  • Go ahead with the predicted instruction (feed it into the pipeline next).
  • If your prediction is right, you don't lose any time.
  • If your prediction is wrong, you need to undo some things and start the correct instruction

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 117

predicting branch not taken

P

r

o

g

r

a

m

2

4

6

8

1

0

1

2

1

4

e

x

e

c

u

t

i

o

n

T

i

m

e

o

r

d

e

r

(

i

n

i

n

s

t

r

u

c

t

i

o

n

s

)

I

n

s

t

r

u

c

t

i

o

n

D

a

t

a

a

d

d

$

4

,

$

5

,

$

6

R

e

g

A

L

U

R

e

g

f

e

t

c

h

a

c

c

e

s

s

I

n

s

t

r

u

c

t

i

o

n

D

a

t

a

b

e

q

$

1

,

$

2

,

4

0

R

e

g

A

L

U

R

e

g

f

e

t

c

h

a

c

c

e

s

s

2

n

s

I

n

s

t

r

u

c

t

i

o

n

D

a

t

a

l

w

$

3

,

3

0

0

(

$

0

)

R

e

g

A

L

U

R

e

g

f

e

t

c

h

a

c

c

e

s

s

2

n

s

P

r

o

g

r

a

m

1

4

2

4

6

8

1

0

1

2

e

x

e

c

u

t

i

o

n

T

i

m

e

o

r

d

e

r

(

i

n

i

n

s

t

r

u

c

t

i

o

n

s

)

I

n

s

t

r

u

c

t

i

o

n

D

a

t

a

a

d

d

$

4

,

$

5

,

$

6

R

e

g

A

L

U

R

e

g

f

e

t

c

h

a

c

c

e

s

s

I

n

s

t

r

u

c

t

i

o

n

D

a

t

a

b

e

q

$

1

,

$

2

,

4

0

R

e

g

A

L

U

R

e

g

f

e

t

c

h

a

c

c

e

s

s

2

n

s

b

u

b

b

l

e

b

u

b

b

l

e

b

u

b

b

l

e

b

u

b

b

l

e

b

u

b

b

l

e

I

n

s

t

r

u

c

t

i

o

n

D

a

t

a

R

e

g

A

L

U

R

e

g

o

r

$

7

,

$

8

,

$

9

f

e

t

c

h

a

c

c

e

s

s

4

n

s

Predicting branch not taken

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 118

dynamic branch prediction
Dynamic Branch Prediction
  • The idea is to build hardware that will come up with a prediction based on the past history of the specific branch instruction.
  • Predict the branch will be taken if it has been taken more often than not in the recent past.
    • This works great for loops! (90% + correct).
    • We’ll talk more about this …

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 119

yet another strategy delayed branch
Yet another strategy: delayed branch
  • The compiler rearranges instructions so that the branch actually occurs delayed by one instruction from where its execution starts
  • This gives the hardware time to compute the address of the next instruction.
  • The new instruction is hopefully useful whether or not the branch is taken (this is tricky - compilers must be careful!).

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 120

delayed branch

beq

add

Delayed Branch

a = b+c;

if (x!=0)

y++;

...

Order reversed!

add $s2,$s3,$s4

beq $t0,$zero,skip

addi $s0,$s0,1

skip:

lw $s3,0($t3)

The compiler must generate code that differs from what you would expect.

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 121

data hazard
Data Hazard
  • One of the values needed by an instruction is not yet available (the instruction that computes it isn't done yet).
  • This will cause a data hazard:

add $t0,$s1,$s2

addi $t0,$t0,17

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 122

slide123

adds $s1 and $s2

selects $s1 and $s2 for ALU op

stores sum in $t0

IF

Reg

ALU

Data

Access

Reg

add $t0,$s1,$s2

IF

Reg

ALU

Data

Access

Reg

addi $t0,$t0,17

time

selects $t0 for ALU op

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 123

handling data hazards
Handling Data Hazards
  • We can hope that the compiler can arrange instructions so that data hazards never appear.
    • this doesn't work, as programs generally need to use previously computed values for everything!
  • Some data hazards aren't real - the value needed is available, just not in the right place.

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 124

slide125

ALU has finished computing sum

IF

Reg

ALU

Data

Access

Reg

add $t0,$s1,$s2

IF

Reg

ALU

Data

Access

Reg

addi $t0,$t0,17

time

ALU needs sum from the previous ALU operation

The sum is available when needed!

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 125

forwarding
Forwarding
  • It's possible to forward the value directly from one resource to another (in time).
  • Hardware needs to detect (and handle) these situations automatically!
    • This is difficult, but necessary.

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 126

slide127

P

r

o

g

r

a

m

e

x

e

c

u

t

i

o

n

2

4

6

8

1

0

o

r

d

e

r

T

i

m

e

(

i

n

i

n

s

t

r

u

c

t

i

o

n

s

)

a

d

d

$

s

0

,

$

t

0

,

$

t

1

I

F

I

D

E

X

M

E

M

W

B

s

u

b

$

t

2

,

$

s

0

,

$

t

3

M

E

M

I

F

I

D

E

X

M

E

M

W

B

Picture of Forwarding

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 127

another example

2

4

6

8

1

0

1

2

1

4

T

i

m

e

P

r

o

g

r

a

m

e

x

e

c

u

t

i

o

n

o

r

d

e

r

(

i

n

i

n

s

t

r

u

c

t

i

o

n

s

)

I

F

I

D

l

w

$

s

0

,

2

0

(

$

t

1

)

E

X

M

E

M

W

B

b

u

b

b

l

e

b

u

b

b

l

e

b

u

b

b

l

e

b

u

b

b

l

e

b

u

b

b

l

e

s

u

b

$

t

2

,

$

s

0

,

$

t

3

I

F

I

D

W

B

E

X

M

E

M

Another Example

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 128

pipelining and cpi
Pipelining and CPI
  • If we keep the pipeline full, one instruction completes every cycle.
  • Another way of saying this: the average time per instruction is 1 cycle.
    • even though each instruction actually takes 5 cycles (5 stage pipeline).

CPI=1

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 129

correctness
Correctness

Pipeline and compiler designers must be careful to ensure that the various schemes to avoid stalling do not change what the program does!

  • only when and how it does it.
  • It's impossible to test all possible combinations of instructions (to make sure the hardware does what is expected).
  • It's impossible to test all combinations even without pipelining!

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 130

pipelined datapath
Pipelined Datapath

We need to use a multicycle datapath.

  • includes registers that store the result of each stage (to pass on to the next stage).
  • can't have a single resource used by more than one stage at time.

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 131

slide132

0

M

u

x

1

I

F

/

I

D

I

D

/

E

X

E

X

/

M

E

M

M

E

M

/

W

B

A

d

d

A

d

d

4

A

d

d

r

e

s

u

l

t

S

h

i

f

t

l

e

f

t

2

n

R

e

a

d

o

i

r

e

g

i

s

t

e

r

1

t

A

d

d

r

e

s

s

P

C

c

R

e

a

d

u

r

d

a

t

a

1

t

R

e

a

d

s

n

Z

e

r

o

I

r

e

g

i

s

t

e

r

2

I

n

s

t

r

u

c

t

i

o

n

R

e

g

i

s

t

e

r

s

A

L

U

R

e

a

d

A

L

U

m

e

m

o

r

y

0

R

e

a

d

W

r

i

t

e

A

d

d

r

e

s

s

d

a

t

a

2

1

r

e

s

u

l

t

d

a

t

a

r

e

g

i

s

t

e

r

M

M

u

D

a

t

a

u

W

r

i

t

e

x

m

e

m

o

r

y

x

d

a

t

a

1

0

W

r

i

t

e

d

a

t

a

1

6

3

2

S

i

g

n

e

x

t

e

n

d

Pipelined Datapath – 5 stages

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 132

lw and pipelined datapath
lw and pipelined datapath
  • We can trace the execution of a load word instruction through the datapath.
  • We need to keep in mind that other instructions are using the stages not in use by our lw instruction!

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 133

slide134

l

w

I

n

s

t

r

u

c

t

i

o

n

f

e

t

c

h

0

M

u

x

1

I

F

/

I

D

I

D

/

E

X

E

X

/

M

E

M

M

E

M

/

W

B

A

d

d

A

d

d

4

A

d

d

r

e

s

u

l

t

S

h

i

f

t

l

e

f

t

2

R

e

a

d

n

o

i

r

e

g

i

s

t

e

r

1

A

d

d

r

e

s

s

t

P

C

R

e

a

d

c

u

d

a

t

a

1

r

t

R

e

a

d

s

Z

e

r

o

n

r

e

g

i

s

t

e

r

2

I

I

n

s

t

r

u

c

t

i

o

n

R

e

g

i

s

t

e

r

s

A

L

U

R

e

a

d

A

L

U

m

e

m

o

r

y

0

R

e

a

d

W

r

i

t

e

A

d

d

r

e

s

s

d

a

t

a

2

r

e

s

u

l

t

1

d

a

t

a

r

e

g

i

s

t

e

r

M

M

u

D

a

t

a

u

W

r

i

t

e

x

m

e

m

o

r

y

x

d

a

t

a

1

0

W

r

i

t

e

d

a

t

a

1

6

3

2

S

i

g

n

e

x

t

e

n

d

Stage 1: Instruction Fetch (IF)

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 134

slide135

l

w

0

I

n

s

t

r

u

c

t

i

o

n

d

e

c

o

d

e

M

u

x

1

I

F

/

I

D

I

D

/

E

X

E

X

/

M

E

M

M

E

M

/

W

B

A

d

d

A

d

d

4

A

d

d

r

e

s

u

l

t

S

h

i

f

t

l

e

f

t

2

R

e

a

d

n

o

i

r

e

g

i

s

t

e

r

1

A

d

d

r

e

s

s

P

C

t

R

e

a

d

c

u

d

a

t

a

1

r

t

R

e

a

d

s

Z

e

r

o

n

r

e

g

i

s

t

e

r

2

I

I

n

s

t

r

u

c

t

i

o

n

R

e

g

i

s

t

e

r

s

A

L

U

R

e

a

d

A

L

U

m

e

m

o

r

y

0

R

e

a

d

W

r

i

t

e

A

d

d

r

e

s

s

d

a

t

a

2

r

e

s

u

l

t

1

d

a

t

a

r

e

g

i

s

t

e

r

M

M

u

D

a

t

a

u

W

r

i

t

e

x

m

e

m

o

r

y

x

d

a

t

a

1

0

W

r

i

t

e

d

a

t

a

1

6

3

2

S

i

g

n

e

x

t

e

n

d

Stage 2: Instruction Decode (ID)

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 135

slide136

l

w

0

I

n

s

t

r

u

c

t

i

o

n

d

e

c

o

d

e

M

u

x

1

I

F

/

I

D

I

D

/

E

X

E

X

/

M

E

M

M

E

M

/

W

B

A

d

d

A

d

d

4

A

d

d

r

e

s

u

l

t

S

h

i

f

t

l

e

f

t

2

R

e

a

d

n

o

i

r

e

g

i

s

t

e

r

1

A

d

d

r

e

s

s

P

C

t

R

e

a

d

c

u

d

a

t

a

1

r

t

R

e

a

d

s

Z

e

r

o

n

r

e

g

i

s

t

e

r

2

I

I

n

s

t

r

u

c

t

i

o

n

R

e

g

i

s

t

e

r

s

A

L

U

R

e

a

d

A

L

U

m

e

m

o

r

y

0

R

e

a

d

W

r

i

t

e

A

d

d

r

e

s

s

d

a

t

a

2

r

e

s

u

l

t

1

d

a

t

a

r

e

g

i

s

t

e

r

M

M

u

D

a

t

a

u

W

r

i

t

e

x

m

e

m

o

r

y

x

d

a

t

a

1

0

W

r

i

t

e

d

a

t

a

1

6

3

2

S

i

g

n

e

x

t

e

n

d

Stage 2: Instruction Decode (ID)

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 136

slide137

l

w

0

E

x

e

c

u

t

i

o

n

M

u

x

1

I

F

/

I

D

I

D

/

E

X

E

X

/

M

E

M

M

E

M

/

W

B

A

d

d

A

d

d

4

A

d

d

r

e

s

u

l

t

S

h

i

f

t

l

e

f

t

2

R

e

a

d

n

o

i

r

e

g

i

s

t

e

r

1

A

d

d

r

e

s

s

t

P

C

R

e

a

d

c

u

d

a

t

a

1

r

t

R

e

a

d

s

Z

e

r

o

n

r

e

g

i

s

t

e

r

2

I

I

n

s

t

r

u

c

t

i

o

n

R

e

g

i

s

t

e

r

s

A

L

U

R

e

a

d

A

L

U

m

e

m

o

r

y

0

R

e

a

d

W

r

i

t

e

A

d

d

r

e

s

s

d

a

t

a

2

r

e

s

u

l

t

1

d

a

t

a

r

e

g

i

s

t

e

r

M

M

u

D

a

t

a

u

W

r

i

t

e

x

m

e

m

o

r

y

x

d

a

t

a

1

0

W

r

i

t

e

d

a

t

a

1

6

3

2

S

i

g

n

e

x

t

e

n

d

Stage 3: Execute (EX)

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 137

slide138

l

w

0

M

e

m

o

r

y

M

u

x

1

I

F

/

I

D

I

D

/

E

X

E

X

/

M

E

M

M

E

M

/

W

B

A

d

d

A

d

d

4

A

d

d

r

e

s

u

l

t

S

h

i

f

t

l

e

f

t

2

R

e

a

d

n

o

i

r

e

g

i

s

t

e

r

1

A

d

d

r

e

s

s

t

P

C

R

e

a

d

c

u

d

a

t

a

1

r

t

R

e

a

d

s

Z

e

r

o

n

r

e

g

i

s

t

e

r

2

I

I

n

s

t

r

u

c

t

i

o

n

R

e

g

i

s

t

e

r

s

A

L

U

R

e

a

d

A

L

U

m

e

m

o

r

y

0

R

e

a

d

W

r

i

t

e

A

d

d

r

e

s

s

d

a

t

a

2

r

e

s

u

l

t

1

d

a

t

a

r

e

g

i

s

t

e

r

M

D

a

t

a

M

u

u

m

e

m

o

r

y

W

r

i

t

e

x

x

d

a

t

a

1

0

W

r

i

t

e

d

a

t

a

1

6

3

2

S

i

g

n

e

x

t

e

n

d

Stage 4: Memory Access (MEM)

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 138

slide139

l

w

0

M

u

W

r

i

t

e

b

a

c

k

x

1

I

F

/

I

D

I

D

/

E

X

E

X

/

M

E

M

M

E

M

/

W

B

A

d

d

A

d

d

4

A

d

d

r

e

s

u

l

t

S

h

i

f

t

l

e

f

t

2

R

e

a

d

n

o

i

r

e

g

i

s

t

e

r

1

A

d

d

r

e

s

s

t

P

C

R

e

a

d

c

u

d

a

t

a

1

r

t

R

e

a

d

s

Z

e

r

o

n

r

e

g

i

s

t

e

r

2

I

I

n

s

t

r

u

c

t

i

o

n

R

e

g

i

s

t

e

r

s

A

L

U

R

e

a

d

A

L

U

m

e

m

o

r

y

0

R

e

a

d

W

r

i

t

e

A

d

d

r

e

s

s

d

a

t

a

2

r

e

s

u

l

t

1

d

a

t

a

r

e

g

i

s

t

e

r

M

D

a

t

a

M

u

m

e

m

o

r

y

u

W

r

i

t

e

x

x

d

a

t

a

1

0

W

r

i

t

e

d

a

t

a

1

6

3

2

S

i

g

n

e

x

t

e

n

d

Stage 5: WriteBack (WB)

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 139

a bug
A Bug!
  • When the value read from memory is written back to the register file, the inputs to the register file (write register #) are from a different instruction!
  • To fix the bug we need to save the part of the lw instruction (5 bits of it specify which register should get the value from memory).

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 140

new datapath

0

M

u

x

1

I

F

/

I

D

I

D

/

E

X

E

X

/

M

E

M

M

E

M

/

W

B

A

d

d

A

d

d

4

A

d

d

r

e

s

u

l

t

S

h

i

f

t

l

e

f

t

2

n

R

e

a

d

o

i

t

r

e

g

i

s

t

e

r

1

A

d

d

r

e

s

s

P

C

c

R

e

a

d

u

r

d

a

t

a

1

t

s

R

e

a

d

n

Z

e

r

o

I

r

e

g

i

s

t

e

r

2

I

n

s

t

r

u

c

t

i

o

n

R

e

g

i

s

t

e

r

s

A

L

U

R

e

a

d

A

L

U

m

e

m

o

r

y

0

R

e

a

d

W

r

i

t

e

A

d

d

r

e

s

s

d

a

t

a

2

r

e

s

u

l

t

1

d

a

t

a

r

e

g

i

s

t

e

r

M

M

D

a

t

a

u

u

W

r

i

t

e

x

m

e

m

o

r

y

x

d

a

t

a

1

0

W

r

i

t

e

d

a

t

a

1

6

3

2

S

i

g

n

e

x

t

e

n

d

New Datapath

Figure 4.41

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 141

store word sw data path flow ex
Store Word (sw) Data Path Flow (EX)

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 142

sw data path cont
SW Data Path (cont.)

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 143

final corrected datapath
Final Corrected Datapath

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 144

ex with 5 instructions
Ex. With 5 instructions

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 145

ex alt view
Ex: Alt View

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 146

pipelined dp w signals
Pipelined DP w/ signals

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 148

control lines for pipeline stages
Control lines for pipeline stages

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 149

pipelined dp w control
Pipelined DP w/ Control

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 150

pipelined dependencies
Pipelined Dependencies

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 151

pipeline w forwarding values
Pipeline w/ Forwarding Values

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 152

alu regs b4 after fwding
ALU & Regs: B4, After Fwding

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 153

datapath w forwarding
Datapath w/ forwarding

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 154

forwarding control table
Forwarding Control Table

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 155

forwarding control table cont
Forwarding Control Table (cont.)

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 156

ex stage hazard detection resolution
EX Stage Hazard Detection & Resolution
  • if( EX/MEM.RegWrite && EX/MEM.RegisterRd != 0 && EX/MEM.RegisterRd == ID/EX.RegisterRs )

ForwardA = 10

  • if( EX/MEM.RegWrite && EX/MEM.RegisterRd != 0 && EX/MEM.RegisterRd == ID/EX.RegisterRt )

ForwardB = 10

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 157

mem stage hazard detection resolution
Mem Stage Hazard Detection & Resolution
  • if( MEM/WB.RegWrite && MEM/WB.RegisterRd != 0 && EX/MEM.RegisterRd != ID/EX.RegisterRs && MEM/WB.RegisterRd = ID/EX.RegisterRs)
    • ForwardA = 01
  • if( MEM/WB.RegWrite && MEM/WB.RegisterRd != 0 && EX/MEM.RegisterRd != ID/EX.RegisterRt && MEM/WB.RegisterRd = ID/EX.RegisterRt)
    • ForwardB = 01
  • Does not consider hazard between WB stage results, MEM stage result & EX stage ALU src

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 158

data hazards stalls
Data Hazards & Stalls
  • Need Hazard detection unit in addition to forwarding unit.
  • Check for Load Instructions based on…
    • if( ID/EX.MemRead && (ID/EX.RegisterRt==IF/ID.RegisterRs || ID/EX.RegisterRt==IF/ID.RegisterRt)) StallThePipeline

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 159

where forwarding fails must stall
Where Forwarding Fails…must stall

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 160

how stalls are inserted
How Stalls Are Inserted

AND 

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 161

pipelined control w fwding hazard detection
Pipelined control w/ fwding & hazard detection

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 162

what about those crazy branches
What about those crazy branches?

Problem: if the branch is taken, PC goes to addr 72, but don’t know until after 3 other instructions are processed

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 163

branch hazards assume branch not taken
Branch Hazards: Assume Branch Not Taken
  • Recall stalling until branch is complete is too ssssssllllooooowwww!!
  • So, assume the branch is not taken…
  • If taken, instructions fetched/decoded must be discarded or “squashed”
  • discard instructions, just change the original control values to 0’s (similar to load-use hazard),
  • BIG DIFFERENCE: must flush the pipeline in the IF, ID and EX stages
  • How can we reduce the “flush” costs when a branch is taken?

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 164

reducing the delay of branches
Reducing the Delay of Branches
  • Let’s move the branch execution earlier in the pipeline.
  • EFFECT: fewer instructions need to be flushed.
  • NEED two actions:
    • Compute branch target address (EASY – can do on IF/ID stage).
    • Eval of branch decision (HARD)

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 165

faster branch decision
Faster Branch Decision
  • Recall, for BEQ instruction, we would compare two regs during the ID stage and test for equality.
  • Equality can be tested by XORing the two regs. (a.k.a. equality unit)
  • Need additional ID stage forwarding and hazard detection hardware
  • This has 2 complicating factors…

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 166

faster branch decison complex factors
Faster Branch Decison: Complex Factors
  • In ID stage, now we need to decide whether a “bypass” path to the “equality” unit is needed.
    • ALU forwarding logic is not sufficient, and so we need new forwarding logic for the equality unit.
  • Can stall due to a data hazard.
    • if an r-type instruction comes before the branch who operands are used in the comparision in the branch, a stall is needed

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 167

example pipelined branch
Example Pipelined Branch

36 sub $10, $4, $8

40 beq $1, $3, 7 # jmp2 40+4+(7*4) = 72

44 and $12, $2, $5

48 or $13, $2, $6

52 and $14, $4, $2

56 slt $15, $6, $7

……..

72 lw $4, 50($7)

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 168

branch processing example
Branch Processing Example

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 169

dynamic branch prediction1
Dynamic Branch Prediction
  • From the phase “There is no such thing as a typical program”, this implies that programs will branch is different ways and so there is no “one size fits all” branch algorithm.
  • Alt approach: keep a history (1 bit) on each branch instruction and see if it was last taken or not.
  • Implementation: branch prediction buffer or branch history table.
    • Index based on lower part of branch address
    • Single bit indicates if branch at address was last taken or not. (1 or 0)

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 170

problem with 1 bit branch predictors
Problem with 1-bit Branch Predictors
  • Consider a loop branch
    • Suppose it occurs 9 times in a row, then is not taken.
    • What’s the branch prediction accuracy?
    • ANSWER: 1-bit predictor will mispredict the entry and exit points of the loop.
    • Yields only an 80% accuracy when there is potential for 90% (i.e., you have to guess wrong on the exit of the loop).

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 171

solution 2 bit branch predictor
Solution: 2-bit Branch Predictor

Must be wrong twice before changing predictionLearns if the branch is more biased towards “taken” or “not taken”

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 172

performance single vs multicycle vs pl
Performance: Single vs Multicycle vs. PL
  • Assume: 200 ps for memory access, 100 ps for ALU ops, 50 ps for register access
  • Single-cycle clock cycle:
    • 600 ps: 200 + 50 + 100 + 200 + 50
  • Futher assume instruction mix
    • 25% loads, 10% stores, 11% branches, 2% jumps, 52% ALU instructions
    • Assume CPI for multi-cycle is 3.50
    • Multicycle clock cycle: must be longest unit which is 200 ps
    • Total time for an “avg” instruction is 3.5 * 200 ps = 700ps

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 173

pipeline performance cont
Pipeline performance (cont)
  • For pipelined design…
    • Loads take 1 cycle when no load-use dependence and 2 cycles when there is yielding an average of 1.5 cycles per load.
    • Stores and ALU instructions take 1 cycle.
    • Branches take 1 cycle when predicted correctly and 2 cycles when not. Assume 75% accuracy, average branch cycles is 1.25.
    • Jumps are 2 cycles.
    • Avg CPI then is:

1.5 x 25% + 1 x 10% + 1 x 52% + 1.25 x 11% + 2 x 2% = 1.17

    • Longest stage is 200 ps, so 200 x 1.17 = 234 ps

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 174

even more performance
Even more performance…
  • Ultimately we want greater and greater Instruction Level Parallelism (ILP)
  • How?
  • Multiple instruction issue.
    • Results in CPI’s less than one.
    • Here, instructions are grouped into “issue slots”.
    • So, we usually talk about IPC (instructions per cycle)
    • Static: uses the compiler to assist with grouping instructions and hazard resolution. Compiler MUST remove ALL hazards.
    • Dynamic: (i.e., superscalar) hardware creates the instruction schedule based on dynamically detected hazards

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 175

example static 2 issue datapath
Example Static 2-issue Datapath
  • Additions include:
  • 32 bits from intr. Mem
  • Two read, 1 write ports on reg file
  • 1 more ALU (top handles address calc)

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 176

ex 2 issue code schedule
Ex. 2-Issue Code Schedule

Loop: lw $t0, 0($s1) #t0=array element

addiu $t0, $t0, $s2 #add scalar in $s2

sw $t0, 0($s1) #store result

addi $s1, $s1, -4 # dec pointer

bne $s1, $zero, Loop # branch $s1!=0

It take 4 clock cycles for 5 instructions or IPC of 1.25

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 177

more performance loop unrolling
More Performance: Loop Unrolling
  • Technique where multiple copies of the loop body are made.
  • Make more ILP available by removing dependencies.
  • How? Complier introduces additional registers via “register renaming”.
  • This removes “name” or “anti” dependence
    • where an instruction order is purely a consequence of the reuse of a register and not a real data dependence.
    • Ex. lw $t0, 0($s1), addu $t0, $t0, $s2 and sw $t0, 4($s1)
    • No data values flow between one pair and the next pair
    • Let’s assume we unroll a block of 4 interations of the loop..

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 178

loop unrolling schedule
Loop Unrolling Schedule

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 179

performance of instruction schedule
Performance of Instruction Schedule
  • 12 of 14 instructions execute in a pair.
  • Takes 8 clock cycles for 4 loop iterations
  • Yields 2 clock cycles per iteration
  • CPI = 8/14  0.57
  • 2.19x speedup over regular loop
  • Cost of improvement: 4 temp regs + lots of additional code

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 180

dynamic scheduled pipeline
Dynamic Scheduled Pipeline

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 181

intel p4 dynamic pipeline
Intel P4 Dynamic Pipeline

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 182

summary of pipeline technology
Summary of Pipeline Technology

We’ve exhausted this!!

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 183