Presentation Transcript
slide1

Stanford EE380 5/29/2013

Drinking from the Firehose

Decode in the Mill™ CPU Architecture

slide2

The Mill Architecture

Instructions -

Format and decoding

New to the Mill:

Dual code streams

No-parse instruction shifting

Double-ended decode

Zero-length no-ops

In-line constants to 128 bits

addsx(b2, b5)

What chip architecture is this?

cores: 4 cores

issuing: 4 operations

clock rate: 3300 MHz

power: 130 Watts

performance: 52.8 Gips

price: $885

?

general-purpose out-of-order superscalar

(Intel XEON E5-2643)

What chip architecture is this?

cores: 1 core

issuing: 8 operations

clock rate: 456 MHz

power: 1.1 Watts

performance: 3.6 Gips

price: $17

?

in-order VLIW signal processor

(Texas Instruments TMS320C6748)

Which is better?

cores: 4 cores

issuing: 4 operations

clock rate: 3300 MHz

power: 130 Watts

performance: 52.8 Gips

price: $885

out-of-order superscalar

performance per Watt per dollar

cores: 1 core

issuing: 8 operations

clock rate: 456 MHz

power: 1.1 Watts

performance: 3.6 Gips

price: $17

in-order VLIW DSP

Which is better?

cores: 4 cores

issuing: 4 operations

clock rate: 3300 MHz

power: 130 Watts

performance: 52.8 Gips

price: $885

out-of-order superscalar

0.46 mips/W/$

cores: 1 core

issuing: 8 operations

clock rate: 456 MHz

power: 1.1 Watts

performance: 3.6 Gips

price: $17

in-order VLIW DSP

195 mips/W/$

Which is better?

Why 400X difference?

  • 32 vs. 64 bit
  • 3,600 mips vs. 52,800 mips
  • incompatible workloads

signal processing ≠ general-purpose

goal – and technical challenge:

DSP numbers – on general-purpose workloads

slide8

Our result:

cores: 2 cores

issuing: 33 operations

clock rate: 1200 MHz

power: 28 Watts

performance: 79.3 Gips

price: $85

OOTBC Mill Gold.x2

33 mips/W/$

(superscalar: 0.46, DSP: 195)
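
A quick cross-check of these figures (not part of the talk): peak mips taken as clock in MHz × operations per cycle × cores, with the power and price quoted above.

    #include <stdio.h>

    /* performance and performance/W/$ recomputed from the slide figures */
    static void show(const char *name, double mhz, int ops, int cores,
                     double watts, double dollars) {
        double mips = mhz * ops * cores;   /* peak mips = MHz * ops/cycle * cores */
        printf("%-14s %8.0f mips   %7.2f mips/W/$\n",
               name, mips, mips / watts / dollars);
    }

    int main(void) {
        show("Xeon E5-2643", 3300.0,  4, 4, 130.0, 885.0);   /* 52800 mips, ~0.46 */
        show("TMS320C6748",   456.0,  8, 1,   1.1,  17.0);   /*  3648 mips, ~195  */
        show("Mill Gold.x2", 1200.0, 33, 2,  28.0,  85.0);   /* 79200 mips, ~33   */
        return 0;
    }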

Which is better?

33 operations per cycle peak ??? Why?

80% of code is in loops

Pipelined loops have unbounded ILP

  • DSP loops are software-pipelined

But –

  • few general-purpose loops can be piped
  • (at least on conventional architectures)

Solution:

  • pipeline (almost) all loops
  • throw function hardware at pipe

Result: loops now < 15% of cycles

Which is better?

33 operations per cycle peak ??? How?

Biggest problem is decode

Fixed-length instructions:

Easy to parse

Instruction size:

32 bits X 33 ops = 132 bytes.

Ouch!

Instruction cache pressure.

32k iCache = only 248 instructions

Ouch!!

slide11

Which is better?

33 operations per cycle peak ??? How?

Variable-length instructions:

Hard to parse – x86 heroics get 4 ops

Instruction size:

Mill ~15 bits X 33 ops = 61 bytes.

Ouch!

Instruction cache pressure.

32k iCache = only 537 instructions

Ouch!!

Biggest problem is decode

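The cache-pressure arithmetic on the last two slides can be reproduced directly; a small sketch in C, assuming the ~15-bit average operation size quoted above:

    #include <stdio.h>

    /* size of one full-width instruction and how many fit in a 32 KiB iCache */
    int main(void) {
        const int ops = 33, icache_bytes = 32 * 1024;

        int fixed_bytes = 32 * ops / 8;   /* fixed 32-bit operations -> 132 bytes */
        int mill_bytes  = 15 * ops / 8;   /* Mill ~15-bit operations ->  61 bytes */

        printf("fixed-length: %d bytes/instruction, %d instructions cached\n",
               fixed_bytes, icache_bytes / fixed_bytes);   /* 132, 248 */
        printf("Mill-style:   %d bytes/instruction, %d instructions cached\n",
               mill_bytes, icache_bytes / mill_bytes);     /* 61, 537 */
        return 0;
    }
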
slide12

A stream of instructions

[Diagram – Logical model: the program counter walks a stream of instructions through one decode and one execute stage. Physical model: the program counter selects a bundle of instructions, which one decoder feeds to several execute units in parallel.]

slide13

Fixed-length instructions

[Diagram: a bundle of fixed-length instructions – the program counter selects the bundle, and each slot goes straight to its own decoder and execute unit.]

Are easy!

(and BIG)

slide14

Variable-length instructions

[Diagram: a bundle of variable-length instructions – the program counter locates the first, but the start of each later instruction is unknown until the earlier ones have been parsed.]

Where does the next one start?

Polynomial cost!

slide15

Polynomial cost

[Diagram: one bundle holding many variable-length instructions.]

OK if N=3, not if N=30

BUT…

Two bundles of length N are much easier than one bundle of length 2N

So split each bundle in half, and have two streams of half-bundles

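A minimal sketch of the problem; the one-byte length code below is invented purely for illustration, but it shows the serial dependence that a wide decoder has to break:

    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>

    /* toy encoding (invented): the low two bits of an op's first byte
       give its length in bytes, 1..4                                   */
    static size_t op_length(const uint8_t *op) { return (op[0] & 3u) + 1; }

    /* Serial parse: the start of op i+1 is unknown until op i's length has
       been decoded.  Hardware that wants all N start positions in one cycle
       must speculate about where ops might begin, and that cost grows
       roughly as N squared; two half-bundles of N/2 ops cost about half as
       much in total and can be parsed side by side.                        */
    static size_t parse(const uint8_t *bytes, size_t n_ops, size_t *starts) {
        size_t pc = 0;
        for (size_t i = 0; i < n_ops; i++) {
            starts[i] = pc;
            pc += op_length(bytes + pc);
        }
        return pc;                        /* total bytes consumed */
    }

    int main(void) {
        uint8_t code[] = { 0x02, 0x40, 0x00, 0x41, 0x03, 0x10, 0x20, 0x30 };
        size_t starts[4];
        size_t len = parse(code, 4, starts);
        for (int i = 0; i < 4; i++)
            printf("op %d starts at byte %zu\n", i, starts[i]);
        printf("bundle is %zu bytes\n", len);
        return 0;
    }
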
slide16

Two streams of half-bundles

[Diagram: two physical streams of half-bundles, each with its own program counter and decoder, feeding one execute complex – one logical stream.]

But – how do you branch two streams?

slide17

Extended Basic Blocks (EBBs)

  • Group each stream into Extended Basic Blocks, single-entry multiple-exit sequences of bundles.
  • Branches can only target EBB entry points; it is not possible to jump into the middle of an EBB.

[Diagram: each stream is a chain of EBBs; the program counter follows the chain, and a branch moves it to the entry point of another EBB.]
slide18

Take two half-EBBs

[Diagram: two half-EBBs laid out from lower to higher memory, each a sequence of bundles beginning at its EBB head, with execution order running toward higher addresses.]

slide19

Take two half-EBBs

Reverse one in memory

[Diagram: one half-EBB is reversed in memory, so its execution order now runs from higher toward lower addresses; the two halves of each instruction are shown in the same color.]

slide20

And join them head-to-head

[Diagram: with one half-EBB reversed in memory, the two half-EBBs are joined head-to-head, their EBB heads adjoining.]

slide21

And join them head-to-head

[Diagram: the joined half-EBBs share a single entry point where the two EBB heads meet.]

slide22

And join them head-to-head

Take a branch…

[Diagram: elsewhere in the code, a bundle ending "add …, load …, jump loop" branches to the joined half-EBBs; the jump's effective address is their shared entry point.]

slide23

Take a branch…

[Diagram: the branch loads both program counters with the entry-point address.]

slide24

Take a branch…

[Diagram: from the shared entry point, one program counter advances toward higher addresses and the other toward lower addresses; each feeds its own decoder, and both decoders feed the execute complex.]

slide25

Take a branch…

[Diagram: the two program counters continue to move apart, decoding successive bundles on each side of the entry point.]

slide26

Take a branch…

[Diagram: decode proceeds in both directions as the remaining bundles drain toward the decoders and the execute complex.]

After a branch

Transfers of control set both XPC and FPC to the entry point

[Diagram: memory at cycle 0 and at cycle n – the Flowcode lies at lower addresses and the Exucode at higher addresses, with the EBB entry point between them; both FPC and XPC start at the entry point, and by cycle n FPC has moved backwards through the Flowcode while XPC has moved forwards through the Exucode.]

Program counters: XPC = Exucode, FPC = Flowcode

XPC moves forward; FPC moves backwards

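A sketch, not the real hardware, of how a transfer of control drives the two streams: both counters are loaded with the same entry-point address, then each cycle XPC advances by its half-bundle's byte count while FPC retreats by its half-bundle's byte count.

    #include <stdint.h>
    #include <stdio.h>

    typedef struct {
        uint64_t xpc;   /* exucode program counter: moves toward higher addresses */
        uint64_t fpc;   /* flowcode program counter: moves toward lower addresses */
    } thread_pcs;

    /* a branch needs only one target address - the shared entry point */
    static void branch_to(thread_pcs *t, uint64_t entry_point) {
        t->xpc = entry_point;
        t->fpc = entry_point;
    }

    /* once per issue cycle, with byte counts taken from the two headers */
    static void advance(thread_pcs *t, unsigned exu_bytes, unsigned flow_bytes) {
        t->xpc += exu_bytes;
        t->fpc -= flow_bytes;
    }

    int main(void) {
        thread_pcs t;
        branch_to(&t, 0x4000);     /* jump to an EBB entry point           */
        advance(&t, 12, 8);        /* decode one instruction (both halves) */
        printf("XPC=0x%llx FPC=0x%llx\n",
               (unsigned long long)t.xpc, (unsigned long long)t.fpc);
        return 0;
    }
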
Physical layout

[Diagram: Conventional – a single iCache feeding a single decoder beside the exec units, with long critical distances from iCache to decode and from decode to exec. Mill – an iCache and decoder on each side of the exec units, splitting the critical distances between the two sides.]

slide29

Generic Mill bundle format

  • The Mill issues one instruction (two half-bundles) per cycle.
  • That one instruction can call for many independent operations, all of which issue together and execute in parallel.

[Diagram: one bundle between byte boundaries – a fixed-length header, variable-length blocks 1, 2, …, n-1, n, and an alignment hole.]

  • Each instruction bundle begins with a fixed-length header, followed by blocks of operations; all operations in a block use the same format. The header contains the byte count of the whole bundle and an operation count for each block. Parsing reduces to isolating blocks.
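
A sketch of that layout; the field widths, block count, and per-block operation sizes below are invented for the example, but they show why isolating the blocks takes only the header's counts, not a serial parse:

    #include <stdint.h>
    #include <stdio.h>

    typedef struct {
        uint8_t total_bytes;          /* whole half-bundle, including the hole */
        uint8_t op_count[3];          /* operations in blocks 1..3             */
    } header;

    static const int op_bytes[3] = { 2, 3, 4 };   /* op size per block format  */

    int main(void) {
        header h = { 28, { 4, 2, 2 } };
        int offset = (int)sizeof h;               /* block 1 follows the header */
        for (int b = 0; b < 3; b++) {
            printf("block %d: %d ops at byte offset %d\n",
                   b + 1, h.op_count[b], offset);
            offset += h.op_count[b] * op_bytes[b];
        }
        printf("alignment hole: %d bytes\n", h.total_bytes - offset);
        return 0;
    }
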
slide30

Generic instruction decode

[Diagram: cycle 0 – the instruction buffer holds the header, block 1, and block 2; the header's byte count goes to the instruction shifter, its block 1 count goes to the block 2 shifter, and block 1 goes to block 1 decode. Cycle 1 – the bundle buffer holds the header, the hole, block 1, and block 3, which goes to block 3 decode; block 2, now in its own buffer, goes to block 2 decode.]

  • The bundle is parsed from both ends toward the middle. Two blocks are isolated per cycle.

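A sketch of the double-ended order; the block sizes are again invented, but the point is that one block is isolated from the front and one from the back in each cycle, meeting at the alignment hole:

    #include <stdio.h>

    int main(void) {
        int size[5] = { 6, 4, 3, 5, 2 };  /* bytes in blocks 1..5, from header  */
        int total   = 26;                 /* header + blocks + alignment hole   */
        int front   = 4;                  /* first front block follows header   */
        int back    = total;              /* back blocks end at the bundle end  */

        for (int cycle = 0, lo = 0, hi = 4; lo <= hi; cycle++) {
            /* one block isolated from the front ... */
            printf("cycle %d: block %d at offset %d\n", cycle, lo + 1, front);
            front += size[lo++];
            /* ... and, in the same cycle, one from the back */
            if (lo <= hi) {
                back -= size[hi];
                printf("cycle %d: block %d at offset %d\n", cycle, hi + 1, back);
                hi--;
            }
        }
        printf("alignment hole: %d bytes\n", back - front);
        return 0;
    }
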
slide31

Elided No-ops

  • Sometimes a cycle has work only for Exu, only for Flow, or neither. The number of cycles to skip is encoded in the alignment hole of the other code stream.

[Diagram: matching Exucode and Flowcode streams – each half-bundle head carries a small count (0, 1, 2, …) in its alignment hole; where one stream has no work for a cycle, the count in the other stream's head stands in for the explicit no-ops that would otherwise be needed.]

  • Rarely, explicit no-ops must still be used when there are not enough hole bits. Otherwise, no-ops cost nothing.
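
A sketch of the effect; the representation is invented (each half-bundle reduced to its op count plus the skip count it stores for the other stream), but it shows how one stream's hole bits let the other stream sit out cycles without fetching explicit no-ops:

    #include <stdio.h>

    typedef struct { int ops; int other_skip; } half_bundle;

    int main(void) {
        /* exu headers carry skip counts for the flow side, and vice versa */
        half_bundle exu[]  = { {3, 0}, {2, 1}, {4, 0} };
        half_bundle flow[] = { {2, 0}, {1, 2} };
        int nx = 3, nf = 2, xi = 0, fi = 0, xwait = 0, fwait = 0;

        for (int cycle = 0; cycle < 5; cycle++) {
            int xops = 0, fops = 0;
            if (xwait)          xwait--;          /* exu no-op elided this cycle  */
            else if (xi < nx) { xops = exu[xi].ops;  fwait += exu[xi].other_skip; xi++; }
            if (fwait)          fwait--;          /* flow no-op elided this cycle */
            else if (fi < nf) { fops = flow[fi].ops; xwait += flow[fi].other_skip; fi++; }
            printf("cycle %d: %d exu ops, %d flow ops\n", cycle, xops, fops);
        }
        return 0;
    }
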
Mill pipeline

phase / cycles:

  • prefetch (<varies>): mem/L2 → L1 I$, moving cache lines
  • fetch (F0-F2): L1 I$ → L0 I$
  • shifter (D0): L0 I$ → decode, delivering bundles
  • decode (D0-D2)
  • issue (<none>): operations
  • execute (X0-X4+)
  • retire (<none>): results, reuse

  • 4 cycle mispredict penalty from top cache
Split-stream, double-ended encoding

One Mill thread has:

Two program counters

Following two instruction half-bundle streams

Drawn from two instruction caches

Feeding two decoders

One of which runs backwards

And each half-bundle is parsed from both ends

For each side:

Instruction size:

Mill ~15 bits X 17 ops = 32 bytes

Instruction cache pressure.

32k iCache = 1024 instructions

Decode rate:

30+ operations per cycle

slide34

Want more?

USENIX Vail, June 23-26

Belt machines – performance with no registers

IEEE Computer Society SVC, September 10

Sequentially consistent, stall-free, in-order memory access

Sign up for technical announcements, white papers, etc.:

ootbcomp.com