
Procesadores Superescalares

Prof. Mateo Valero

Las Palmas de Gran Canaria

November 26, 1999


Initial developments

  • Mechanical machines

  • 1854: Boolean algebra by G. Boole

  • 1904: Diode vacuum tube by J.A. Fleming

  • 1945: Stored-program concept by J. von Neumann

  • 1946: ENIAC by J.P. Eckert and J. Mauchly

  • 1949: EDSAC by M. Wilkes

  • 1952: UNIVAC I and IBM 701





Superscalar Processor

[Pipeline diagram: Fetch, Decode, Rename, Instruction Window with Wakeup + Select, Register File, Bypass, Data Cache]

Fetch of multiple instructions every cycle.

Rename of registers to eliminate false dependencies.

Instructions wait for source operands and for functional units.

Out-of-order execution, but in-order graduation.

Scalable Pipes
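The rename stage described above can be sketched in a few lines. This is a toy model of my own, not the slides' hardware: each destination gets a fresh physical register, so false (WAR/WAW) dependencies disappear and only true (RAW) dependencies remain.

```python
# Toy register renamer: logical -> physical mapping with a fresh
# physical register per destination write.

def rename(instructions, num_logical=32):
    """instructions: list of (dest, src1, src2) logical register numbers."""
    free = iter(range(10_000))          # unbounded free list for the sketch
    table = {r: next(free) for r in range(num_logical)}  # logical -> physical
    renamed = []
    for dest, s1, s2 in instructions:
        p1, p2 = table[s1], table[s2]   # read sources under current mapping
        table[dest] = next(free)        # fresh physical reg kills WAW/WAR
        renamed.append((table[dest], p1, p2))
    return renamed

# The slides' FP example: every instruction writes f2, yet after renaming
# each write targets a distinct physical register.
code = [(2, 4, 4), (2, 2, 10), (2, 2, 12), (2, 2, 2)]
out = rename(code)
assert len({d for d, _, _ in out}) == 4   # four distinct destinations
```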


Technology Trends and Impact

[Figure: delay in ps vs. technology generation, for issue width 4 (ROB size 32) and issue width 8 (ROB size 64)]

S. Palacharla et al. "Complexity Effective…". ISCA 1997, Denver.


Physical Scalability

[Figure: percentage of the die reachable in one cycle vs. processor generation (0.25, 0.18, 0.13, 0.10, 0.08, 0.06 microns)]

Doug Matzke. "Will Physical Scalability…". IEEE Computer, Sept. 1997, pp. 37-39.


Register Influence on ILP

  • 8-way fetch/issue

  • Window of 256 entries

  • Up to 1 taken branch per cycle

  • Gshare predictor, 64K entries

  • One-cycle latency

  • Spec95


Register File Latency

  • 66% and 20% performance improvement when moving from a 2-cycle to a 1-cycle register file


Outline

  • Virtual-physical registers

  • A register file cache

  • VLIW architectures


Virtual-Physical Registers

  • Motivation

    • Conventional renaming scheme

    • Virtual-Physical Registers

[Diagram: instruction lifetime from Icache through Decode & Rename to Commit, contrasting the interval where the physical register holds a useful value ("register used") with the interval where it is allocated but unused]


Example

Latencies: cache miss 20, fdiv 20, fmul 10, fadd 5

load f2, 0(r4)
fdiv f2, f2, f10
fmul f2, f2, f12
fadd f2, f2, 1

    (rename)

load p1, 0(r4)
fdiv p2, p1, p10
fmul p3, p2, p12
fadd p4, p3, 1

  • Register pressure: average registers per cycle

    • Conventional: 3.6

    • Virtual-Physical: 0.7


Virtual-Physical Registers

  • Physical registers play two different roles

    • Keep track of dependences (decode)

    • Provide a storage location for results (write-back)

  • Proposal: Three types of registers

    • Logical: Architected registers

    • Virtual-Physical (VP): Keep track of dependences

    • Physical: Store values

  • Approach

    • Decode: rename from logical to VP

    • Write-back (or issue): rename from VP to physical
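The two-step approach can be sketched as below (assumptions mine, not the slides' hardware): destinations receive only a virtual-physical (VP) tag at decode, and a real physical register at write-back, when the value actually exists.

```python
# Two-step renaming sketch: decode allocates a VP tag (tracks
# dependences), write-back binds it to a physical register (storage).

import itertools

vp_tags = itertools.count()       # unbounded supply of VP tags
free_physical = list(range(8))    # small physical file to show late binding

vp_of_logical = {}                # logical reg -> VP tag   (decode-time map)
physical_of_vp = {}               # VP tag -> physical reg  (write-back map)

def decode(dest_logical):
    """Decode: allocate only a VP tag, no storage yet."""
    tag = next(vp_tags)
    vp_of_logical[dest_logical] = tag
    return tag

def write_back(tag):
    """Write-back: bind the VP tag to an actual physical register."""
    physical_of_vp[tag] = free_physical.pop(0)
    return physical_of_vp[tag]

t = decode(2)                     # instruction writing logical r2 is decoded
assert t not in physical_of_vp    # no physical register consumed yet
p = write_back(t)                 # storage committed only when the value arrives
assert physical_of_vp[t] == p
```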


Virtual-Physical Registers

  • Hardware support

[Diagram: General Map Table (Lreg to VPreg), Physical Map Table (VPreg to V, Preg), instruction queue entries (Src1, Src2, D), and ROB entries (Lreg, VPreg, C), across the Fetch, Decode, Issue, Execute, Write-back, and Commit stages]


Virtual-Physical Registers

  • No free physical register

    • Re-execute but… if it is the oldest instruction…

    • Avoiding deadlock

      • A number (NRR) of registers are reserved for the oldest instructions

      • 21% speedup for Spec95 on an 8-way issue processor [HPCA-4]

    • Conclusions

      • Optimal NRR is different for each program

      • For a given program, best NRR may be different for different sections of code
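The deadlock-avoidance rule can be sketched as a simple eligibility check (a toy model of my own, not the slides' mechanism): NRR registers are reserved so that only the oldest in-flight instructions may claim them, guaranteeing the ROB head always makes progress.

```python
# NRR deadlock avoidance: younger instructions must leave the reserved
# pool of physical registers untouched.

NRR = 2                # registers reserved for the oldest instructions

def may_allocate(free_regs, rob_position):
    """rob_position 0 = oldest in-flight instruction."""
    if rob_position < NRR:
        return free_regs > 0              # reserved pool: oldest always eligible
    return free_regs > NRR                # others must leave the reserve intact

assert may_allocate(free_regs=2, rob_position=0)      # oldest proceeds
assert not may_allocate(free_regs=2, rob_position=5)  # younger must wait
```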


Virtual-Physical Registers

  • Performance evaluation

    • SimpleScalar OoO simulator with modified renaming

    • 8-way issue; RUU: 128 entries

    • Functional units (latency): 8 simple int (1), 4 int mult (7), 6 simple FP (4), 4 FP mult (4), 4 FP div (16), 4 memory ports

    • L1 Dcache: 32 KB, 2-way, 32 B/line, 1 cycle

    • L1 Icache: 32 KB, 2-way, 64 B/line, 1 cycle

    • L2 cache: 1 MB, 2-way, 64 B/line, 12 cycles

    • Main memory: 50 cycles

    • Branch prediction: 18-bit Gshare, up to 2 taken branches

    • Benchmarks: SPEC95, Compaq/DEC compilers -O5


Virtual-Physical Registers

  • Performance evaluation

[Results figure]

Virtual-Physical Registers

  • What is the optimal allocation policy?

    • Approximation

      • Registers should be allocated to the instructions that can use them earlier (avoid unused registers)

      • If some instruction must be stalled because of the lack of registers, choose the latest instructions (delaying the earliest would also delay the commit of the latest)

    • Implementation

      • Each instruction allocates a physical register at write-back. If none is available, it steals the register of the latest instruction younger than itself
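The stealing policy can be sketched as below (details assumed, not taken from the slides): at write-back, take a free register if one exists, otherwise reclaim the register held by the latest instruction behind us in program order, since delaying it is cheapest.

```python
# Register stealing at write-back: prefer a free register, otherwise
# take one from the youngest (latest) holder after us.

def allocate(free_regs, holders, my_seq):
    """holders: {seq_number: physical_reg} for instructions holding a register.
    Returns (reg, victim_seq or None)."""
    if free_regs:
        return free_regs.pop(), None
    younger = [seq for seq in holders if seq > my_seq]
    victim = max(younger)                 # the latest instruction after us
    return holders.pop(victim), victim    # victim will re-request later

holders = {3: 7, 9: 4, 12: 5}            # seq -> physical register held
reg, victim = allocate([], holders, my_seq=5)
assert (reg, victim) == (5, 12)          # steals from the youngest, seq 12
```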


DSY Performance

[Figures: SpecInt95 and SpecFP95 results]


Outline

  • Virtual-physical registers

  • A register file cache

  • VLIW architectures



Register File Latency

  • 66% and 20% performance improvement when moving from a 2-cycle to a 1-cycle register file




Register File Cache

  • Organization

    • Bank 1 (Register File)

      • All registers (128)

      • 2-cycle latency

    • Bank 2 (Reg. File Cache)

      • A subset of registers (16)

      • 1-cycle latency

[Diagram: RF and RFC banks]
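The two-level organization can be sketched as a simple read path (the sizes and latencies come from the slide; the structure is my own simplification): a 16-entry cache answers in 1 cycle, the full 128-entry file in 2 cycles.

```python
# Two-level register file read: try the small fast bank first.

RF = {r: f"value{r}" for r in range(128)}   # full register file, 2-cycle
RFC = {r: RF[r] for r in range(16)}         # cached subset, 1-cycle

def read(reg):
    """Returns (value, latency_in_cycles)."""
    if reg in RFC:
        return RFC[reg], 1
    return RF[reg], 2

assert read(3) == ("value3", 1)        # hit in the register file cache
assert read(100) == ("value100", 2)    # falls back to the main file
```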


Experimental Framework

  • OoO simulator, 8-way issue/commit

  • Functional units (latency): 2 simple integer (1); 3 complex integer: mult. (2), div. (14); 4 simple FP (2); 2 FP div. (14); 3 branch (1); 4 load/store

  • 128-entry ROB; 16-bit Gshare

  • Icache and Dcache: 64 KB, 2-way set-associative, 1/8-cycle hit/miss

  • Dcache: lock-up free, 16 outstanding misses

  • Benchmarks: Spec95, DEC compiler -O4 (int.) and -O5 (FP), 100 million instructions after initialization

  • Access time and area models: extension to the Wilton & Jouppi models


Caching Policy (1 of 3)

  • First policy

    • Many values (85% int and 84% FP) are used at most once

    • Thus, only non-bypassed values are cached

    • FIFO replacement

[Diagram: RF and RFC]


Performance

  • 20% and 4% improvement over 2-cycle

  • 29% and 13% degradation over 1-cycle


Caching Policy (2 of 3)

  • Second policy

    • Cache values that are sources of any non-issued instruction with all its operands ready

      • Not issued because of lack of functional units

      • or, the other operand is in the main register file

[Diagram: RF and RFC]


Performance

  • 24% and 5% improvement over 2-cycle

  • 25% and 12% degradation over 1-cycle


Caching Policy (3 of 3)

  • Third policy

    • Cache values that are sources of any non-issued instruction with all its operands ready

    • Prefetching

      • A table that, for each physical register, indicates the other operand of the first instruction that uses it

    • Replacement: give priority to those values already read at least once
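The prefetch table can be sketched as follows (structure and names assumed, not from the slides): when a value enters the RFC, look up the other source operand of its first consumer and pull that operand in as well, so the consumer finds both operands in the fast bank.

```python
# Prefetching into the register file cache via a partner-operand table.

partner = {5: 9, 9: 5, 7: 2}    # physical reg -> other operand of first consumer
RF = {r: r * 10 for r in range(16)}  # backing register file (dummy values)
RFC = {}

def cache_with_prefetch(reg):
    RFC[reg] = RF[reg]
    other = partner.get(reg)
    if other is not None and other not in RFC:
        RFC[other] = RF[other]   # prefetch the partner operand too

cache_with_prefetch(5)
assert 5 in RFC and 9 in RFC    # both operands of the consumer are cached
```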


Performance

  • 27% and 7% improvement over 2-cycle

  • 24% and 11% degradation over 1-cycle


Speed for Different RFC Architectures

Access time is taken into account.

[Figure: SpecInt95 results]


Conclusions

  • Register file access time is critical

  • Virtual-physical registers significantly reduce the register pressure

    • 24% improvement for SpecFP95

  • A register file cache can reduce the average access time

    • 27% and 7% improvement for a two-level, locality-based partitioning architecture



High performance instruction fetch through a software/hardware cooperation

Alex Ramirez

Josep Ll. Larriba-Pey

Mateo Valero

UPC-Barcelona


Superscalar Processor

[Pipeline diagram: Fetch, Decode, Rename, Instruction Window with Wakeup + Select, Register File, Bypass, Data Cache]

Fetch of multiple instructions every cycle.

Rename of registers to eliminate false dependencies.

Instructions wait for source operands and for functional units.

Out-of-order execution, but in-order graduation.

J.E. Smith and S. Vajapeyam. "Trace Processors…". IEEE Computer, Sept. 1997, pp. 68-74.


Motivation

[Diagram: Instruction Fetch & Decode feeding the Instruction Queue(s), which feed Instruction Execution, with branch/jump outcomes fed back to fetch]

  • Instruction fetch rate is important not only in steady state

    • Program start-up

    • Miss-speculation points

    • Program segments with little ILP


Motivation

  • Instruction fetch effectively limits the performance of superscalar processors

    • Even more relevant at program startup points

  • More aggressive processors need higher fetch bandwidth

    • Multiple basic block fetching becomes necessary

  • Current solutions need extensive additional hardware

    • Branch address cache

    • Collapsing buffer: multi-ported cache

    • Trace cache: special purpose cache


PostgreSQL

[Figure: PostgreSQL behaviour; 64KB I1, 64KB D1, 256KB L2]


Program Behaviour

[Figure: program behaviour; 64KB I1, 64KB D1, 256KB L2]

The Fetch Unit (1 of 3)

  • Scalar Fetch Unit

    • Few instructions per cycle

    • 1 branch

  • Limitations

    • Prediction accuracy

    • I-cache miss rate

  • Prev. work, code reordering

    • Fisher (IEEE Tr. on Comp. '81)

    • Hwu and Chang (ISCA'89)

    • Pettis and Hansen (SIGPLAN'90)

    • Torrellas et al. (HPCA'95)

    • Kalamatianos et al. (HPCA'98)

[Diagram: scalar fetch unit. The branch prediction mechanism and next-address logic drive the instruction cache (i-cache); shift & mask delivers instructions to decode. Software's role: reduce cache misses]


The Fetch Unit (2 of 3)

  • Aggressive Fetch Unit

    • Many instructions per cycle

    • Several branches

  • Limitations

    • Prediction accuracy

    • Sequentiality

    • I-cache miss rate

  • Prev. work, trace building

    • Yeh et al. (ICS'93)

    • Conte et al. (ISCA'95)

    • Rotenberg et al. (MICRO'96)

    • Friendly et al. (MICRO'97)

[Diagram: aggressive core fetch unit. The branch target buffer, return stack, and multiple-branch predictor feed the next-address logic; the instruction cache (i-cache) plus shift & mask deliver instructions to decode. Hardware's role: form traces at run time]


Trace Cache

A trace is a sequence of logically contiguous instructions.

A trace cache line stores a segment of the dynamic instruction stream, across multiple (potentially taken) branches: (b1-b2-b4, b1-b3-b7, …)

It is indexed by fetch address and branch outcomes.

History-based fetch mechanism.

[Diagram: control flow graph with basic blocks b0-b8]

The Fetch Unit (3 of 3)

[Diagram: the aggressive core fetch unit (BTB, return stack, multiple-branch predictor, i-cache, next-address logic, shift & mask) backed by a trace cache (t-cache) and a fill buffer fed from fetch or commit. The trace cache aims at forming traces at run time]


Our Contribution

  • Mixed software-hardware approach

    • Optimize performance at compile-time

      • Use profiling information

      • Make optimum use of the available hardware

    • Avoid redundant work at run-time

      • Do not repeat what was done at compile-time

      • Adapt hardware to the new software

  • Software Trace Cache

    • Profile-directed code reordering & mapping

  • Selective Trace Storage

    • Fill Unit modification


Our Work

  • Workload analysis

    • Temporal locality

    • Sequentiality

  • Software Trace Cache

    • Seed selection

    • Trace building

    • Trace mapping

    • Results

  • Selective Trace Storage

    • Counting blue traces

    • Implementation

    • Results


Workload Analysis (Reference Locality)

  • Considerable amount of reference locality


Workload Analysis (Sequentiality)

Predictable:

  • Fall-through

  • Unconditional branches

  • Conditional branches with fixed behaviour

  • Subroutine calls

  • Loop branches

Un-predictable:

  • Indirect jumps

  • Subroutine returns

  • Unpredictable conditional branches


Software Trace Cache

  • Profile directed code reordering

    • Obtain a weighted control flow graph

    • Select seeds or starting basic blocks

    • Build basic block traces

      • Map dynamically consecutive basic blocks to physically contiguous storage

      • Move unused basic blocks out of the execution path

    • Carefully map these traces in memory

      • Avoid conflict misses in the most popular traces

      • Minimize conflicts among the rest

  • Increased role of the instruction cache

    • Able to provide longer instruction traces


STC: Seed Selection

  • All procedure entry points

    • Ordered by popularity

    • Starts building traces on the most popular procedures

  • Knowledge based selection

    • Based on source code knowledge

    • Leads to longer sequences

      • Inlining of the main path of found procedures

    • Loses temporal locality

      • Less popular basic blocks surround the most popular ones


STC: Trace Building

[Diagram: weighted control flow graph with basic blocks (A1-A8, B1, C1-C5), execution counts, and transition probabilities; branches below the transition threshold and blocks below the execution threshold are marked "valid, visit later"]

  • Greedy algorithm

    • Follow the most likely path out of a basic block

    • Add secondary seeds for all other targets

  • Two threshold values

    • Execution threshold

      • Do not include unpopular basic blocks

    • Transition threshold

      • Do not follow unlikely transitions

  • Iterate process with less restrictive thresholds
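The greedy algorithm above can be sketched as follows (the CFG and threshold values here are hypothetical, not the slides' example): follow the most likely successor of each block, queue the other targets as secondary seeds, and stop at unpopular blocks or unlikely transitions.

```python
# Greedy trace building with execution and transition thresholds.

EXEC_THRESHOLD = 5        # minimum execution count to include a block
TRANS_THRESHOLD = 0.5     # minimum probability to follow a transition

cfg = {   # block -> (exec_count, [(successor, probability), ...])
    "A1": (100, [("A2", 0.9), ("B1", 0.1)]),
    "A2": (90,  [("A3", 0.6), ("C1", 0.4)]),
    "A3": (54,  [("A4", 0.55), ("A5", 0.45)]),
    "A4": (3,   []),      # too cold: trace ends here
}

def build_trace(seed, seeds):
    trace, block = [], seed
    while block in cfg and cfg[block][0] >= EXEC_THRESHOLD:
        trace.append(block)
        succs = sorted(cfg[block][1], key=lambda s: -s[1])
        if not succs or succs[0][1] < TRANS_THRESHOLD:
            break
        seeds.extend(s for s, _ in succs[1:])   # secondary seeds for later
        block = succs[0][0]                     # follow the most likely path
    return trace

seeds = []
assert build_trace("A1", seeds) == ["A1", "A2", "A3"]
assert "B1" in seeds and "C1" in seeds
```

Iterating with less restrictive thresholds, as the slide says, would then revisit the queued seeds and the blocks left out in the first pass.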


STC: Trace Mapping

[Diagram: an i-cache-sized memory layout. The most popular traces are mapped to the CFA, the least popular traces are placed after them, and a region is left with no code]

I-cache Miss Rate

[Figure: instruction cache miss rate with the core fetch unit (i-cache, BTB, RAS, BP, next-address logic, exchange/shift & mask)]


Fetch Bandwidth

[Figure: fetch bandwidth with the core fetch unit (i-cache, BTB, RAS, BP, next-address logic, exchange/shift & mask)]


STC: Results

[Results figure]


STC: Conclusions

  • STC increases the role of the core fetch unit

    • Build traces at compile-time

      • Increases code sequentiality

    • Map them carefully in memory

      • Reduces instruction cache miss rate

  • Increased core fetch unit performance

    • Trace cache-like performance with no additional hardware cost

      • Compile-time solution

        or ...

    • Optimum results with a small supporting trace cache

      • Better fail-safe mechanism on a trace cache miss


Selective Trace Storage

  • The STC constructed traces at compile time

    • Blue traces

      • Built at compile-time

      • Traces containing only consecutive instructions

      • May be provided by the instruction cache in a single cycle

    • Red traces

      • Built at run-time

      • Traces containing taken branches

      • Can be provided by the trace cache in a single cycle

  • Blue traces need not be stored in the trace cache

    • Better usage of the storage space

      • Better performance with same cost

      • Equivalent performance at lower cost
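The filter can be sketched as a small fill-unit model (my own, not the slides' hardware): a trace is stored only if it contains a taken branch (red); purely sequential blue traces are dropped, since the i-cache already supplies them in one cycle.

```python
# Selective Trace Storage: the fill unit filters out blue traces.

def fill_unit(trace_cache, trace, branch_outcomes):
    """branch_outcomes: True for each taken branch inside the trace."""
    if any(branch_outcomes):              # red trace: has a taken branch
        trace_cache[trace[0]] = trace
    # blue trace: consecutive instructions only, filtered out

tc = {}
fill_unit(tc, ("b1", "b3", "b7"), (True, True))    # red: stored
fill_unit(tc, ("b1", "b2", "b4"), (False, False))  # blue: filtered
assert tc["b1"] == ("b1", "b3", "b7")
```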


STS: Counting Blue Traces

Reordering reduces the number of breaks.

There is a high degree of redundancy, even in the original code.

[Figure: fraction of blue (redundant) traces]


STS: Implementation

[Diagram: the fetch address drives the branch target buffer, return address stack, and multiple-branch predictor through the next-address logic; the fill unit filters out blue (redundant) traces so that only red-trace components are written to the trace cache; on a hit, instructions pass through exchange, shift & mask to decode, and the next fetch address is produced]


STS: FIPA - Realistic Branch Predictor

[Results figure]


STS: FIPC - Realistic BP - 64KB i-cache

[Results figure]


STS: FIPA - Perfect Branch Predictor

[Results figure]


STS: Conclusions

  • Minor hardware modification

    • Filter out blue traces in the fill unit

      • Avoid redundant run-time work

  • Better usage of the storage space

    • Higher performance with the same cost

    • Equivalent performance at much lower cost

  • Benefits of STS increase when used with STC

    • The more work done at compile-time, the less work left to do at run-time


Conclusions

  • Instruction fetch is better approached using both software and hardware techniques

    • Compile-time code reorganization

      • Increase code sequentiality

      • Minimize instruction cache misses

    • Avoid run-time redundant work

      • Do not store the same traces twice

  • High fetch unit performance with little additional hardware

    • Small 2KB complementary trace cache & smart fill unit


Future Work

  • Further increasing fetch performance

    • Increase i-cache performance

      • Reduce miss ratio

      • Reduce miss penalty

    • Increase quality of provided instructions

      • Better branch prediction accuracy

    • Faster recovery after mispredictions

  • Take the path of least resistance

    • Simplicity of design

    • Software approach whenever possible


The End