
# Lecture 5. Dynamic Scheduling II - PowerPoint PPT Presentation

COM515 Advanced Computer Architecture. Lecture 5. Dynamic Scheduling II. Prof. Taeweon Suh Computer Science Education Korea University. Modern Processors. Branch Prediction results in speculative execution


Lecture 5. Dynamic Scheduling II

Prof. Taeweon Suh

Computer Science Education

Korea University

• Branch Prediction results in speculative execution

• Speculative instructions (if wrongly speculated) must not alter the architectural state

• Architecture Registers

• Memory

• Requirement of precise exception/interrupts

Prof. Sean Lee’s Slide

RAT

RS

ARF

Modern Out-of-Order Core

Reservation Station issues instructions to functional units

Allocate instructions

Reorder Buffer maintains state information (physical registers) for precise interrupts and speculative execution

ROB

Architectural register file

LSQ

Register Alias Table renames architecture registers

Load Store Queue maintains memory access ordering

Prof. Sean Lee’s Slide

Original code:

R2 = R1 + R3
R4 = R2 - R6
R2 = R7 / R5
BEQ R2, #1
R2 = R4 * R1

False dependencies in the original code: WAW, WAR

Renamed code:

T1 = R1 + R3
R4 = T1 - R6
T20 = R7 / R5
BEQ T20, #1
T7 = R4 * R1

Physical registers: T0, T1, ..., Tn-1

Register Renaming

Architectural

Registers

R0

R1

R2

R3

R4

R5

R6

R7

No False

Dependencies!

Sandy Bridge:

160 PRs for INT

144 PRs for FP

Adapted from Prof. G. Loh’s Slides

Physical

Registers

Register Renaming

Dest = Src1 op Src2

Mapping mechanism:

Src1 → TagS1
Src2 → TagS2
Dest → TagD

TagD = TagS1 op TagS2

Repeat for each instruction

Adapted from Prof. G. Loh’s Slides

RAT

EAX

EBX

ECX

EDX

ESI

EDI

ESP

EBP

Data

Status

RRF (Retirement Register File)

P6 Style Register Renaming

(So does HP-PA8000, PPC604)

Register Alias Table (RAT)

• Use a lookup table for renaming

• One entry per architectural register

• Each entry maps to the most recent version of the architectural register, could be in

• Physical register file

• Architectural register file

Prof. Sean Lee’s Slide

RAT Example

Initially every RAT entry (R0 to R7) is empty ("-") and the free physical registers are T13, T14, T15, T16.

R1 = R2 + R3 → T13 = R2 + R3 (RAT: R1 → T13; free: T14, T15, T16)
R5 = R4 – R1 → T14 = R4 – T13 (RAT: R5 → T14; free: T15, T16)
R1 = R1 * R5 → T15 = T13 * T14 (RAT: R1 → T15; free: T16)
R2 = R5 / R1 → T16 = T14 / T15 (RAT: R2 → T16; free: none)

Adapted from Prof. G. Loh’s Slides
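The RAT example can be replayed with a few lines of Python. This is only a sketch under simplifying assumptions: the function name `rename` and the tuple layout `(dest, op, src1, src2)` are made up for illustration, a missing RAT entry is taken to mean "the value still lives in the architectural register", and physical registers are never returned to the free pool.

```python
from collections import deque

def rename(instrs, free_regs):
    """Rename (dest, op, src1, src2) tuples using a RAT and a free list."""
    rat = {}                      # arch reg -> physical tag; absent = read the ARF
    free = deque(free_regs)       # free physical registers, allocated in order
    renamed = []
    for dest, op, src1, src2 in instrs:
        # Sources are looked up BEFORE the destination mapping is updated.
        s1 = rat.get(src1, src1)
        s2 = rat.get(src2, src2)
        tag = free.popleft()      # allocate a new physical register
        rat[dest] = tag           # RAT now points at the newest version of dest
        renamed.append((tag, op, s1, s2))
    return renamed, rat, list(free)

program = [
    ("R1", "+", "R2", "R3"),
    ("R5", "-", "R4", "R1"),
    ("R1", "*", "R1", "R5"),
    ("R2", "/", "R5", "R1"),
]
renamed, rat, free = rename(program, ["T13", "T14", "T15", "T16"])
for tag, op, s1, s2 in renamed:
    print(f"{tag} = {s1} {op} {s2}")
```

Reading the sources before writing the destination is what makes `R1 = R1 * R5` pick up the old R1 (T13) while producing the new one (T15).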

T31

T19

T6

From free

register pool

Superscalar Rename

T16

T39

T14

T5

R1 = R2 + R3

R4 = R5 – R7

R3 = R0 / R2

R5 = Ld 12[R6]

RAT

T23

T7

T16

X

Don’t rename

immediates

For N-wide

superscalar:

N RAT write-ports

Prof. Sean Lee’s Slide

version of R2

Should be using

this version of R2

Intra-Group Dependencies

T16

T39

T14

T5

R2 = R2 + R3

R4 = R5 – R7

R3 = R0 / R2

R5 = Ld 12[R6]

RAT

T23

T7

T16

X

T10

T31

T19

T6

From free

register pool

Prof. Sean Lee’s Slide

T10 T16

T31 T10

T31 T19

T10

T31

T19

T6

Result of

sequential

renaming

From free

register pool

Intra-Group Dependencies

R1 = R2 + R1

R2 = R1 – R2

R1 = R2 / R1

R1 = R2 >> R1

T16 T34

T34 T16

T16 T34

T16 T34

RAT

Correct final renamed registers

Modified from Prof. Sean Lee’s Slide

Inst 0

Intra-Group

Dependency

Checker

Inst 1

Inst 2

Inst 3

RAT

T0L

T0R

Src L

Src R

Dest

T1L

T1R

T2L

T2R

From free

register pool

T3L

T3R

Pdst0

Pdst1

Pdst2

Adapted from Prof. G. Loh’s Slides

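Intra-Group Dependency Checking

[Figure: twelve comparators (=) check each source of Inst 1 to Inst 3 (src1L, src1R, src2L, src2R, src3L, src3R) against the destinations of the older instructions in the group (dst0, dst1, dst2). On a match, a multiplexer forwards the matching Pdst0, Pdst1 or Pdst2 in place of the tag read from the RAT (T1L, T1R, T2L, T2R, T3L, T3R).]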

Adapted from Prof. G. Loh’s Slides

dst0

dst1

dst2

dst3

!=

!=

use pdst0

!=

!=

use pdst1

!=

!=

use pdst2

1

use pdst3

Mapping Selection

R1 = R2 + R1

R2 = R1 – R2

R1 = R2 / R1

R1 = R2 >> R1

Only this mapping

for R1 should be

written into the RAT

Condition: use mapping

if instruction is last

writer to the register

Adapted from Prof. G. Loh’s Slides
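A minimal Python sketch of renaming one group per cycle, combining the intra-group dependency check with the last-writer mapping selection. The initial mappings (R1 → T7, R2 → T9), the free tags T40 to T43, and the name `rename_group` are hypothetical choices for illustration, not values from the slides.

```python
def rename_group(group, rat, free_tags):
    """Rename a group of (dest, src_l, src_r) instructions in one 'cycle'.

    All RAT reads see the pre-group RAT, as parallel read ports would;
    the comparison loop then overrides stale tags.
    """
    pdst = free_tags[:len(group)]          # one new physical tag per instruction
    renamed = []
    for i, (dest, src_l, src_r) in enumerate(group):
        tags = []
        for src in (src_l, src_r):
            tag = rat[src]                 # tag read from the RAT
            for j in range(i):             # compare against older dests in the group
                if group[j][0] == src:
                    tag = pdst[j]          # the youngest older writer wins
            tags.append(tag)
        renamed.append((pdst[i], tags[0], tags[1]))

    # Mapping selection: only the last writer of each register updates the RAT.
    new_rat = dict(rat)
    for i, (dest, _, _) in enumerate(group):
        if all(group[k][0] != dest for k in range(i + 1, len(group))):
            new_rat[dest] = pdst[i]
    return renamed, new_rat

group = [
    ("R1", "R2", "R1"),   # R1 = R2 + R1
    ("R2", "R1", "R2"),   # R2 = R1 - R2
    ("R1", "R2", "R1"),   # R1 = R2 / R1
    ("R1", "R2", "R1"),   # R1 = R2 >> R1
]
renamed, new_rat = rename_group(group, {"R1": "T7", "R2": "T9"},
                                ["T40", "T41", "T42", "T43"])
```

The result matches what sequential renaming would produce, and of the three writes to R1 only the last (T43) reaches the RAT.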

• add instructions take one cycle

• E.g.,

• Load (left side) induces a “data page fault”;

• If out-of-order completion is allowed

• r10 and r12 will be modified

• Wrong values will be used by the re-issued load

• Interrupt classes

• Program interrupts (exceptions or traps)

• External interrupts (asynchronous)

lw r5, 8(r10)

Modified from Prof. Sean Lee’s Slide

• To reflect a sequential architecture model → serially correct (think of a single-issue, non-pipelined processor)

• Keep “Precise State” of an execution

• All instructions before the interrupted instruction must be completed

• The state should appear as if no instruction after the interrupted instruction had issued

• The interrupted PC should be presented to the interrupt handler (restartable)

• Similar to branch misprediction handling

• Out-of-order execution makes the ordering hard

• Undo what comes after an interrupt

Prof. Sean Lee’s Slide

• Need to maintain a precise state (for recovery)

• Software debugging

• I/O or timer interrupts

• Virtual memory (page fault)

• Instruction emulation

• Virtual machines

Prof. Sean Lee’s Slide

• Buffer results

• Can reconstruct the scenario (state) as sequential execution

• Restart from saved PC with saved PC state

Prof. Sean Lee’s Slide

Reorder Buffer (ROB) [Smith & Pleszkun ’85, ’88]

• Architecture Register File keeps “In-order state”

• Reorder Buffer (ROB)

• A circular buffer

• Contains all in-flight instructions

• In-order allocation/deallocation with head/tail pointers

• When an exception occurs

• Halt instruction issue

• Revert to in-order state using RF and discard ROB results

• Also used for branch misprediction recovery

• Pentium Pro/II/III integrates physical register file within ROB

• Pentium 4 decouples ROB and physical register file

Modified from Prof. Sean Lee’s Slide

ROB entry fields: Exp event, Spec?, Done?, PC, V, RegDst, Data (physical register)

Head (oldest instruction)

Tail (next inst to be allocated)

Prof. Sean Lee’s Slide

Sandy Bridge : 168-entry ROB

event

Spec?

Done?

PC

V

RegDst

Data (physical register)

xA000

0000

R1

Tail

Handling Precise Interrupts

0

11

R1=R1+10

1

0

0

1

xA004

1

0

0

0000

R2

R2=R2*2

xA008

1

0

0

0000

FR1

FR1=FR2/0.0

ARF

R1

11

1

R2

2

1

R3

3

1

R4

4

1

1

R31

Prof. Sean Lee’s Slide

event

Spec?

Done?

PC

V

RegDst

Data (physical register)

Tail

Handling Precise Interrupts

0

xA004

1

0

0

0000

R2

R2=R2*2

xA008

1

0

0

0000

FR1

FR1=FR2/0.0

xA00C

R3=R3+1

1

0

0

0000

R3

ARF

R1

1

11

R2

2

1

R3

3

1

R4

4

1

1

R31

Prof. Sean Lee’s Slide

event

Spec?

Done?

PC

V

RegDst

Data (physical register)

Tail

Handling Precise Interrupts

0

xA004

1

0

0

0000

R2

R2=R2*2

xA008

1

0

0

0000

FR1

FR1=FR2/0.0

xA00C

R3=R3+1

1

0

1

0000

R3

4

xA010

1

0

0

0000

R4

R4=R4*2

ARF

R1

1

11

R2

2

1

R3

3

1

R4

4

1

1

R31

Prof. Sean Lee’s Slide

event

Spec?

Done?

PC

V

RegDst

Data (physical register)

Tail

Handling Precise Interrupts

0

xA004

1

0

0

0000

R2

R2=R2*2

1

4

xA008

1

0

0

0010

FR1

FR1=FR2/0.0

xA00C

R3=R3+1

1

0

1

0000

R3

4

xA010

1

0

1

0000

R4

8

R4=R4*2

xA014

1

0

0

0000

FR4

FR4=FR4*2.0

ARF

R1

1

11

R2

4

2

1

R3

3

1

R4

4

1

1

R31

Prof. Sean Lee’s Slide

event

Spec?

Done?

PC

V

RegDst

Data (physical register)

xA004

1

0

1

0000

R2

R2=R2*2

4

Tail

Handling Precise Interrupts

0

0

xA008

1

0

0

0010

FR1

FR1=FR2/0.0

xA00C

R3=R3+1

1

0

1

0000

R3

4

xA010

1

0

1

0000

R4

8

R4=R4*2

xA014

1

0

0

0000

FR4

FR4=FR4*2.0

ARF

R1

11

1

R2

4

1

R3

3

1

R4

4

1

1

R31

Prof. Sean Lee’s Slide

event

Spec?

Done?

PC

V

RegDst

Data (physical register)

Tail

Exception detected.

Handling Precise Interrupts

These values were not committed into RF

0

0

xA008

1

0

0

0010

FR1

FR1=FR2/0.0

xA00C

R3=R3+1

1

0

1

0000

R3

4

xA010

1

0

1

0000

R4

8

R4=R4*2

xA014

1

0

0

0000

FR4

FR4=FR4*2.0

ARF

R1

1

11

R2

4

1

R3

3

1

R4

4

Back up “PC”

and current RF

1

1

R31

Depending on the exception, the process either aborts or execution resumes from the excepting instruction

Prof. Sean Lee’s Slide
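The precise-interrupt walkthrough can be mimicked with a small Python model. It is a sketch, not the slides' hardware: the class and field names are illustrative assumptions, a `deque` stands in for the circular buffer with head and tail pointers, and results are written straight into ROB entries.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class RobEntry:
    pc: int
    dest: str
    done: bool = False
    value: object = None
    exception: bool = False

class ReorderBuffer:
    def __init__(self, arf):
        self.arf = arf             # architectural register file (in-order state)
        self.entries = deque()     # left end = head (oldest), right end = tail

    def allocate(self, pc, dest):
        entry = RobEntry(pc, dest)
        self.entries.append(entry)     # in-order allocation at the tail
        return entry

    def commit(self):
        """Retire finished instructions from the head; return a faulting PC or None."""
        while self.entries and self.entries[0].done:
            head = self.entries[0]
            if head.exception:
                self.entries.clear()   # discard the faulting and all younger results
                return head.pc
            self.arf[head.dest] = head.value   # only now does the ARF change
            self.entries.popleft()
        return None

arf = {"R1": 1, "R2": 2, "R3": 3, "R4": 4}
rob = ReorderBuffer(arf)
i1 = rob.allocate(0xA000, "R1")    # R1 = R1 + 10
i2 = rob.allocate(0xA004, "R2")    # R2 = R2 * 2
i3 = rob.allocate(0xA008, "FR1")   # FR1 = FR2 / 0.0 (will raise an exception)
i4 = rob.allocate(0xA00C, "R3")    # R3 = R3 + 1
i5 = rob.allocate(0xA010, "R4")    # R4 = R4 * 2

i1.done, i1.value = True, 11
rob.commit()                       # R1 retires
i4.done, i4.value = True, 4        # younger instructions finish out of order
i5.done, i5.value = True, 8
rob.commit()                       # nothing retires: the head (R2) is not done
i2.done, i2.value = True, 4
rob.commit()                       # R2 retires; FR1 now blocks the head
i3.done, i3.exception = True, True
fault_pc = rob.commit()            # exception reaches the head
```

The completed values 4 and 8 for R3 and R4 never reach the ARF, which is the "these values were not committed into RF" point of the walkthrough.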

event

Spec?

Done?

PC

V

RegDst

Data (physical register)

xB000

0000

R1

Tail

Handling Speculative Execution

R1=R1+10

1

0

0

xB004

1

0

0

0000

BEQ R1,R0,L1

ARF

R1

1

R2

2

1

R3

3

1

R4

4

1

1

R31

Prof. Sean Lee’s Slide

event

Spec?

Done?

PC

V

RegDst

Data (physical register)

xB000

0000

R1

Tail

Handling Speculative Execution

R1=R1+10

1

0

0

xB004

1

0

0

0000

BEQ R1,R0,L1

xC100

1

1

1

0000

12

R2=R3<<2

R2

xC104

1

1

0

0000

R1=R2*R3

R1

xC108

1

1

0

0000

BEQ R3,R0,L1

xD2B0

1

1

1

0000

R1

R1=R7+1

8

ARF

R1

1

R2

2

1

R3

3

1

R4

4

1

1

R31

BEQ R1, R0, L1 is predicted TAKEN

Modified from Prof. Sean Lee’s Slide

event

Spec?

Done?

PC

V

RegDst

Data (physical register)

Tail

Handling Speculative Execution

BEQ

Misprediction

xB004

1

0

0

0000

BEQ R1,R0,L1

xC100

1

1

1

0000

12

R2=R3<<2

R2

xC104

1

1

0

0000

R1=R2*R3

R1

xD2AC

1

1

0

0000

BEQ R3,R0,L1

xD2B0

1

1

1

0000

R1

R1=R7+1

8

ARF

R1

11

R2

2

1

R3

3

1

R4

4

1

1

R31

BEQ R1, R0, L1 is resolved, actually NOT TAKEN !!

Prof. Sean Lee’s Slide

event

Spec?

Done?

PC

V

RegDst

Data (physical register)

xB004

Tail

1

0

0

0000

BEQ R1,R0,L1

Handling Speculative Execution

ARF

R1

11

R2

2

1

R3

3

1

R4

4

1

1

R31

Retire branch, Clear all entries after the mis-speculated branch

Prof. Sean Lee’s Slide

event

Spec?

Done?

PC

V

RegDst

Data (physical register)

Tail

Handling Speculative Execution

xB008

1

0

0

0000

R2=R5<<4

R2

ARF

R1

11

R2

2

1

R3

3

1

R4

4

1

1

R31

Continue execution from the correct path (Fall through in this case)

Prof. Sean Lee’s Slide

ARF state corresponds to state prior

to oldest non-committed instruction

ARF

As instructions are processed, the RAT corresponds to the register mapping after

the most recently renamed instruction

br

RAT

?!?

On a branch misprediction, wrong-path

instructions are flushed from the machine

The RAT is left with an invalid set of

mappings corresponding to the wrong-

path instruction state

Adapted from Prof. G. Loh’s Slide

Solution: Stall and Drain

Allow all instructions to execute and

commit; ARF corresponds to last

committed instruction

ARF

ARF now corresponds to the state

right before the next instruction to

be renamed (foo)

br

RAT

X

Reset RAT so that all mappings

refer to the ARF

?!?

• Pros: Very simple

to implement

• Cons: Performance loss

due to stalls

Correct path instructions from fetch;

can’t rename because RAT is wrong

Resume renaming the new correct-

path instructions from fetch

Prof. Sean Lee’s Slide

Another Solution: Checkpointing

At each branch, make a copy of the RAT

(register mapping at the time of the branch)

ARF

br

br

RAT

RAT

Checkpoint Free Pool

RAT

RAT

br

RAT

br

On a misprediction:

1. flush wrong-path instructions

2. deallocate RAT checkpoints

3. recover RAT from checkpoint

4. resume renaming

Prof. Sean Lee’s Slide
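The checkpoint-and-recover steps can be sketched in Python as follows. The class and method names are hypothetical, only the RAT is checkpointed (the free pool, which the slide also checkpoints, is not modeled), and flushing the wrong-path instructions is left to the caller.

```python
class CheckpointedRat:
    def __init__(self, mapping):
        self.rat = dict(mapping)
        self.checkpoints = []      # (branch_id, RAT copy), oldest branch first

    def rename_dest(self, arch_reg, tag):
        self.rat[arch_reg] = tag

    def on_branch(self, branch_id):
        # Copy the register mapping as it is at the time of the branch.
        self.checkpoints.append((branch_id, dict(self.rat)))

    def on_correct(self, branch_id):
        # Correctly predicted branch: its checkpoint is no longer needed.
        self.checkpoints = [(b, m) for b, m in self.checkpoints if b != branch_id]

    def on_mispredict(self, branch_id):
        idx = next(i for i, (b, _) in enumerate(self.checkpoints) if b == branch_id)
        _, saved = self.checkpoints[idx]
        del self.checkpoints[idx:]     # deallocate this and all younger checkpoints
        self.rat = dict(saved)         # recover the RAT; renaming can resume

r = CheckpointedRat({"R1": "T1", "R2": "T2"})
r.on_branch("br0")
r.rename_dest("R1", "T10")     # wrong-path rename if br0 was mispredicted
r.on_branch("br1")
r.rename_dest("R2", "T11")
r.on_mispredict("br0")
```

Unlike stall-and-drain, recovery here does not wait for older instructions to commit; the cost is one RAT copy per in-flight branch.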

• At dispatch, instructions read all available operands from the register files and store a copy in the scheduler (Tomasulo’s algorithm)

• Unavailable operands will be “captured” from the functional unit outputs (CDB broadcast)

• When ready, instructions can issue directly from the scheduler without reading additional operands from any other register files (Wakeup and select)

Fetch &

Dispatch

Fetch &

Dispatch

Fetch &

Dispatch

ARF

ARF

ARF

PRF/ROB

PRF/ROB

Physical register update

Instruction

Scheduler

Bypass

Functional

Units

Adapted from Prof. G. Loh’s Slide

• Wakeup Logic

• Notify waiting instructions when the data dependencies of their input operands are resolved

• Wake up instructions that have no outstanding input dependencies

• Select Logic

• Choose and fire ready instructions

• Deal with structural hazards

• Wakeup-select is likely on the critical path

• Associative match

Prof. Sean Lee’s Slide
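A toy Python model of the wakeup and select steps listed above. The names are illustrative assumptions, the set operation stands in for the associative tag match, and select simply takes the first ready entries in list order.

```python
class RsEntry:
    def __init__(self, name, src_tags, ready_tags):
        self.name = name
        # Tags this instruction is still waiting for.
        self.waiting = {t for t in src_tags if t not in ready_tags}

    @property
    def ready(self):
        return not self.waiting

def wakeup(entries, broadcast_tag):
    """Broadcast a result tag; every entry compares it against its source tags."""
    for e in entries:
        e.waiting.discard(broadcast_tag)

def select(entries, issue_width=1):
    """Pick up to issue_width ready entries (first in list order) and remove them."""
    chosen = [e for e in entries if e.ready][:issue_width]
    for e in chosen:
        entries.remove(e)
    return [e.name for e in chosen]

ready_tags = {"T14", "T16", "T8"}
rs = [
    RsEntry("add", ["T14", "T39"], ready_tags),
    RsEntry("sub", ["T16", "T39"], ready_tags),
    RsEntry("mul", ["T8", "T6"], ready_tags),
]
before = select(rs)        # nothing is ready yet
wakeup(rs, "T39")          # the producer of T39 broadcasts its tag
first = select(rs)         # issue width 1: only one of the two ready entries fires
second = select(rs)
```

Both `add` and `sub` wake up on the same broadcast, but with an issue width of 1 the select step resolves the structural hazard by issuing them in consecutive cycles.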

Scalar Scheduler (Issue Width = 1)

Select Logic

T14

=

T39

To Execute Logic

T16

=

T39

T8

=

T6

=

=

T42

T17

=

T39

=

T15

T17

=

T39

From Prof. G. Loh’s Slide

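Superscalar Scheduler (Issue Width = 4)

[Figure: snapshot of the RS (only 4 entries shown). Each source tag held in an entry (e.g., T16, T39, T8, T6, T42, T17, T15) feeds four comparators (=), one per result tag that can be broadcast in a cycle. Ready entries raise requests to the Select Logic, which issues the chosen instructions to the Execute Logic.]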

Adapted from Prof. G. Loh’s Slide

• Select ready instructions to be issued

• Goal: reduce the height of the data-flow graph (DFG)

• Methods

• Location-based (e.g., leftmost ready first)

• Allow simple, faster hardware

• Can use location-based (in-order issue) with “compaction”

• Compact the issue window to the left every time instructions are issued, and insert new instructions at the right end

• Can be slow and complex

Prof. Sean Lee’s Slide

[Figure: tree-like arbitrated selection logic. Four-input arbiter cells, each with inputs Req0 to Req3 and Enable and outputs Grant0 to Grant3 and AnyReq, are arranged in a tree over the reservation station entries.]

Simple Select Logic Implementation

Reservation Station

• The Enable signal to the root cell is high whenever the functional unit is ready to execute an instruction

• The AnyReq signal is raised if any of the input Req signals is high

1

Modified from Prof. Sean Lee’s Slide

[Palacharla Dissertation]
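The Enable and AnyReq signals can be illustrated with a small Python sketch of an arbiter tree. It assumes fixed priority inside each cell (lowest index first) and a request count that is a power of the cell fan-in; the function names are made up for illustration.

```python
def arbiter_cell(reqs, enable):
    """One arbiter cell: return (AnyReq, grants), lowest index has priority."""
    any_req = any(reqs)                 # raised if any input Req is high
    grants, higher = [], False
    for r in reqs:
        # Grant only if enabled, requesting, and no higher-priority request exists.
        grants.append(bool(enable and r and not higher))
        higher = higher or bool(r)
    return any_req, grants

def select_tree(reqs, enable, fanin=4):
    """Tree of arbiter cells: AnyReq flows up to the root, grants flow back down."""
    if len(reqs) <= fanin:
        return arbiter_cell(reqs, enable)
    chunk = len(reqs) // fanin
    groups = [reqs[i:i + chunk] for i in range(0, len(reqs), chunk)]
    # The parent cell sees one AnyReq per subtree; its grants enable the subtrees.
    any_req, group_enables = arbiter_cell([any(g) for g in groups], enable)
    grants = []
    for g, en in zip(groups, group_enables):
        grants.extend(select_tree(g, en, fanin)[1])
    return any_req, grants

reqs = [False] * 16
for i in (5, 9, 14):                    # three ready reservation station entries
    reqs[i] = True
any_req, grants = select_tree(reqs, enable=True)        # functional unit is free
busy_any, busy_grants = select_tree(reqs, enable=False) # functional unit is busy
```

With the root Enable high exactly one entry is granted; with it low the requests are still visible at the root through AnyReq but nothing issues.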

[Figure: arbiter tree over the reservation station entries. Each cell contains a priority decoder with inputs Req0 to Req3 and outputs Grt0 to Grt3, together with the AnyReq and Enable signals.]

Simple Select Logic Implementation

Reservation Station

1

Prof. Sean Lee’s Slide

[Palacharla Dissertation]

Simple Select Logic Implementation

Reservation Station

[Figure: tree of four-input arbiter cells (Req0 to Req3, Grant0 to Grant3, AnyReq, Enable) over the reservation station entries.]

1

Prof. Sean Lee’s Slide

[Palacharla Dissertation]

Simple Select Logic Implementation

Reservation Station

Selective Issue for One FU

[Figure: tree of four-input arbiter cells (Req0 to Req3, Grant0 to Grant3, AnyReq, Enable) over the reservation station entries.]

1

Prof. Sean Lee’s Slide

[Palacharla Dissertation]

Reservation Station

Issue to Distinct Functional Units

Distributed Instruction Windows (e.g., MIPS R10000 or Alpha 21264)

Integer Unit

FPU

Faster to have separate instruction schedulers for different instruction types

Prof. Sean Lee’s Slide

Req1

Req2

Req3

Grant0

Grant1

Grant2

Grant3

Dual Issue to Multiple Units (e.g., 2 Adders)

Req0

Req1

Req2

Req3

Grant0

Grant1

Grant2

Grant3

Prof. Sean Lee’s Slide

[Palacharla Dissertation]

• Can we “undo” stores?

• Stores cannot be committed to memory until they are marked ready to retire

• Completed stores are queued and waiting in a store queue or store buffer

• Disambiguate (and resolve) memory dependency dynamically

Prof. Sean Lee’s Slide

• A younger load to address X bypassing an older load to the same address X violates certain memory consistency models (e.g., sequential consistency)

Source: Alpha 21264 HRM

Prof. Sean Lee’s Slide

RS

• Memory instructions are allocated into LSQ in program order

• LSQ manages memory reference ordering

• Unified LSQ vs. Split LSQ

• Sandy Bridge: 64 Load buffers, 36 Store buffers

Age-ordered

ROB

Store Queue

Split LSQ

Prof. Sean Lee’s Slide

0

0

2

1

2

D

C

A

0

2

???

Issued to

Memory

for execution

• Each load checks against older stores

• Associative search

• Scalability is a performance issue

Issued?

Issued?

age

age

data

1

1

A

00000001

1

1

B

12340000

0

1

C

FFFF1111

FFFFFF00

Store Queue

Prof. Sean Lee’s Slide

1

1

2

2

1

C

D

A

0

2

???

forwarding

• Implementation dependent: comprehensive size matching can be prohibitively expensive

• Simple method: forward when a larger store (word) precedes a smaller load (half)

Issued?

Issued?

age

age

data

1

1

A

00000001

1

1

B

12340000

0

1

C

FFFF1111

FFFFFF00

Store Queue

Prof. Sean Lee’s Slide
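A hedged Python sketch of the load-side store queue search with the simple forwarding rule (same address, store at least as wide as the load). Everything here is an illustrative assumption: the `Store` fields, the three result labels, little-endian byte order for extracting the low bytes, and the conservative choice to wait whenever an older store address is unknown or only partly overlaps.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Store:
    age: int
    addr: Optional[int]    # None until the store address has been computed
    size: int              # bytes
    data: int

def load_lookup(store_queue, load_age, addr, size):
    """Return ('forward', value), ('memory', None) or ('wait', None)."""
    older = sorted((s for s in store_queue if s.age < load_age),
                   key=lambda s: s.age, reverse=True)
    for s in older:                      # youngest older store first
        if s.addr is None:
            return ("wait", None)        # unknown address: assume it may conflict
        overlaps = s.addr < addr + size and addr < s.addr + s.size
        if not overlaps:
            continue
        if s.addr == addr and s.size >= size:
            mask = (1 << (8 * size)) - 1
            return ("forward", s.data & mask)   # larger store feeds smaller load
        return ("wait", None)            # partial overlap: no size matching here
    return ("memory", None)              # no older store to this address

sq = [
    Store(age=1, addr=0xA0, size=4, data=0x00000001),
    Store(age=2, addr=0xB0, size=4, data=0x1234ABCD),
    Store(age=3, addr=None, size=4, data=0),
]
half_hit = load_lookup(sq, load_age=3, addr=0xB0, size=2)   # word store, half load
miss = load_lookup(sq, load_age=3, addr=0xC0, size=4)
partial = load_lookup(sq, load_age=3, addr=0xA2, size=4)
blocked = load_lookup(sq, load_age=4, addr=0xA0, size=4)    # store 3 is unresolved
```

The `blocked` case is where a speculative design would issue the load anyway and rely on replay if the unresolved store later turns out to conflict.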

1

1

1

3

2

2

1

C

D

A

K

• Can speculatively issue loads for shortening latency (Alpha 21264, Pentium 4 (Prescott))

• “Replay” needed if speculation turns out to be incorrect (e.g. Alpha’s store-load replay)

Issued?

Issued?

age

age

data

1

1

A

00000001

1

1

B

12340000

Speculatively issue for execution

0

1

C

FFFF1111

FFFFFF00

0

2

???

Store Queue

Modified from Prof. Sean Lee’s Slide

1

1

1

1

2

1

2

4

3

M

C

D

A

P

• Associative Search

• “Replay” needed if speculation turns out to be incorrect (e.g. Alpha’s store-load replay)

Issued?

Issued?

age

age

data

1

1

A

00000001

1

1

B

12340000

1

1

C

FFFF1111

FFFFFF00

0

2

K

1

3

K

Conflict detected!

Store Queue

Prof. Sean Lee’s Slide
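The conflict check a store performs once its address is known can be sketched as a search of the load queue. This is an assumption-laden illustration: the dictionary fields and the function name are invented, addresses are compared for exact equality, and symbolic addresses such as "K" mirror the figure labels.

```python
def loads_to_replay(load_queue, store_age, store_addr):
    """Ages of already-issued loads younger than the store that read its address."""
    return [ld["age"] for ld in load_queue
            if ld["issued"] and ld["age"] > store_age and ld["addr"] == store_addr]

load_queue = [
    {"age": 1, "addr": "A", "issued": True},
    {"age": 3, "addr": "K", "issued": True},    # issued speculatively past store age 2
    {"age": 4, "addr": "K", "issued": False},   # not issued yet, nothing to replay
]
conflicts = loads_to_replay(load_queue, store_age=2, store_addr="K")
no_conflicts = loads_to_replay(load_queue, store_age=5, store_addr="K")
```

Each load returned must be replayed, along with the instructions that consumed its value.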

memory

0

0

0

1

5

4

6

5

C

D

A

K

Issuing a Store for Execution

• The figure above shows only the basic concept

• Implementation dependent

• Do not allow a store to bypass a load, since this restriction has little impact on performance

• Perform associative search

Issued?

Issued?

age

age

data

1

4

A

11000000

0

6

A

0F0F0F0F

0

6

C

00000002

Store Queue

Prof. Sean Lee’s Slide

0

1

6

5

4

D

A

K

Issuing a Store for Execution

Issued?

Issued?

age

age

data

1

4

A

11000000

0

6

A

0F0F0F0F

0

6

C

00000002

0

5

C

cannot issue

for execution

Store Queue

Prof. Sean Lee’s Slide

1

1

0

0

4

5

7

6

6

M

A

D

N

K

• Needed for

• Multiprocessor support

• Maintaining memory consistency model

• Trap on the later, conflicted instructions

• Replay

Issued?

age

1

5

C

1

6

A


Prof. Sean Lee’s Slide

• add instructions take one cycle

• E.g.,

• Load (left side) induces a “data page fault”;

• Add (right side) induces an “instruction page fault”

• If out-of-order completion is allowed

• r10, r12, (or r2, r4) … will be modified

• Wrong values will be used by the re-issued load

• Interrupt classes

• Program interrupts (exceptions or traps)

• External interrupts (asynchronous)

lw r5, 8(r10)

L1:

End of

Non-Resident

Page X

Instruction

Page Fault

Start of

Resident

Page X+1

Prof. Sean Lee’s Slide