Master/Slave Speculative Parallelization



Master/Slave Speculative Parallelization

Craig Zilles (U. Illinois)

Guri Sohi (U. Wisconsin)



The Basics

  • #1: a well-known problem:

    On-chip Communication

  • #2: a well-known opportunity:

    Program Predictability

  • #3: our novel approach to #1 using #2



Problem: Communication

  • Cores becoming “communication limited”

    • Rather than “capacity limited”

  • Many, many transistors on a chip, but…

  • Can’t bring them all to bear on one thread

    • Control/data dependences mean frequent communication



Best core << chip size

  • Sweet spot for core size

    • Further size increases hurt either MHz or IPC

      How can we maximize the core’s efficiency?

[Figure: the best core occupies only a small fraction of the chip]


Opportunity: Predictability

[Figure: a perfectly biased branch (bne): 100% one way, 0% the other]

  • Many program behaviors are predictable

    • Control flow, dependences, values, stalls, etc.

  • Widely exploited by processors/compilers

    • But, not to help increase effective core size

    • Core resources used to make, validate pred’s

  • Example: perfectly-biased branch


Speculative Execution

  • Execute code before/after branch in parallel

  • Branch is fetched, predicted, executed, retired

    • All of this occurs in the core

    • Uses space in the I-cache and branch predictor

    • Uses execution resources: not just the branch, but its backwards slice


Trace/Superblock Formation

[Figure: a perfectly biased branch (bne, 0% / 100%) and the code around it]

  • Optimize code assuming the predicted path

    • Reduces cost of the branch and surrounding code

    • Prediction implicitly encoded in the executable

  • Code still verifies the prediction

    • Branch & slice still fetched, executed, committed, etc.

      All of this occurs on the core
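As a rough sketch of the idea (an invented example, not code from the talk), the superblock version folds arithmetic along the predicted path while still checking the branch:

```python
# Invented example: optimize assuming the branch is not taken, but keep the
# check so the prediction is still verified, as in trace/superblock formation.

def original(x, rare_flag):
    y = x + 1
    if rare_flag:        # profiled as almost never taken
        y = y * 10
    return y - 1

def superblock(x, rare_flag):
    if rare_flag:        # prediction still verified; rare path kept off-trace
        return (x + 1) * 10 - 1
    return x             # predicted path: (x + 1) - 1 folded away

# Same results either way; the win is cheaper code on the hot path.
assert superblock(5, False) == original(5, False)
assert superblock(5, True) == original(5, True)
```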



Why waste core resources?

  • The branch is perfectly predictable!

    The core should only execute instructions that are not statically predictable!



If not in the core, where?

  • Anywhere else on chip!

  • Because it is predictable:

    • Doesn’t prevent forward progress

    • We can tolerate latency to verify prediction

[Figure: instruction storage elsewhere on the chip supplies a prediction; the prediction is verified off the core’s critical path]


A concrete example: Master/Slave Speculative Parallelization

  • Execute “distilled program” on one processor

    • A version of program with predictable inst’s removed

    • Faster than original, but not guaranteed to be correct

  • Verify predictions by executing original program

    • Parallelize verification by splitting it into “tasks”

[Figure: Master core executes the distilled program; slave cores execute the original program in parallel]



Talk Outline

  • Removing predictability from programs

    • “Approximation”

  • Externally verifying distilled programs

    • Master/Slave Speculative Parallelization (MSSP)

  • Results Summary

  • Summary


Approximation Transformations

[Figure: control-flow graph of blocks A, B, C with a branch (bne) biased 1% / 99%]

Approximation Transformations

  • Pretend you’ve proven the common case

    • Preserve correctness in the common case

    • Break correctness in uncommon case

      • Use profile to know the common case

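A minimal sketch of such a transformation (invented example; the function names are not from the talk): the distilled version assumes the 99% path and drops the branch and its uncommon side entirely:

```python
# Invented example of an approximation transformation: the branch into the
# 1% block B is assumed not taken, so the distilled code keeps only A -> C.

def original_program(x):
    a = x + 10           # block A
    if a < 0:            # biased branch: taken ~1% of the time
        a = -a           # block B (uncommon)
    return a * 2         # block C

def distilled_program(x):
    # Approximation: correctness preserved in the common case (a >= 0),
    # broken in the uncommon case, exactly as the slide describes.
    return (x + 10) * 2

assert distilled_program(5) == original_program(5)      # common case agrees
assert distilled_program(-20) != original_program(-20)  # uncommon case breaks
```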


Not just for branches

  • Memory dependences: A and B may alias

    st r12, 0(A)
    ld r11, 0(B)

    • If they rarely alias in practice? Approximate the store and load as independent.

    • If they almost always alias? Approximate the dependence as certain and forward the value directly: mv r12, r11

  • Values: a load that is highly invariant (usually gets 7)

    ld r13, 0(X)

    can be replaced with the constant: addi $zero, 7, r13
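The value case can be sketched the same way (invented example): the distilled code substitutes the profiled constant for the load, and the original program’s real load serves as the check:

```python
# Invented example of value approximation: a highly invariant load (usually
# reads 7) is replaced by the constant, like the slide's `addi $zero, 7, r13`.

memory = {"X": 7}                 # profiled-invariant location

def original_load():
    return memory["X"]            # ld r13, 0(X): the real load

def distilled_load():
    return 7                      # the load is gone from the distilled code

# The slaves re-execute the original program, so a rare change to X shows
# up as a mismatch and the speculative work is squashed.
assert distilled_load() == original_load()
memory["X"] = 8                   # the uncommon case
assert distilled_load() != original_load()
```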


Enables Traditional Optimizations

[Figure: control-flow graph from bzip2, with calls to printf, fprintf, fwrite, and exit]

  • Many static paths, but two dominant paths

  • Approximate away unimportant paths

    • Result: very straightforward structure, easy for the compiler to optimize

  • From bzip2


Effect of Approximation

[Figure: original code vs. distilled code, side by side]

  • Equivalent 99.999% of the time, with better execution characteristics

    • Fewer dynamic instructions: ~1/3 of original code

    • Smaller static size: ~2/5 of original code

    • Fewer taken branches: ~1/4 of original code

    • Smaller fraction of loads/stores

  • Shorter than best non-speculative code

    • Removing checks: code incorrect .001% of the time



Talk Outline

  • Removing predictability from programs

    • “Approximation”

  • Externally verifying distilled programs

    • Master/Slave Speculative Parallelization (MSSP)

  • Results Summary

  • Summary



Goal

  • Achieve performance of distilled program

  • Retain correctness of original program

  • Approach:

    • Use distilled code to speed original program


Checkpoint Parallelization

  • Cut original program into “tasks”

    • Assign tasks to processors

  • Provide each a checkpoint of registers & memory

    • Completely decouples task execution

    • Tasks retrieve all live-ins from checkpoint

  • Checkpoints taken from distilled program

    • Captured in hardware

    • Stored as a “diff” from architected state
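A toy simulation of this scheme (all names invented; real MSSP builds the checkpoints in hardware as diffs from architected state): each task runs independently from its predicted checkpoint, and a task commits only if that checkpoint matches the state the previous task actually produced:

```python
# Toy model of checkpoint parallelization (names invented; state is a dict
# standing in for the register/memory checkpoint of a task's live-ins).

def run_task(task, checkpoint):
    """Run one task of the original program from a checkpoint; return the
    state it produces."""
    state = dict(checkpoint)
    task(state)
    return state

def commit_tasks(tasks, predicted_checkpoints, architected):
    """Slaves execute all tasks independently from predicted checkpoints;
    commit proceeds in order until a checkpoint fails verification."""
    outputs = [run_task(t, c) for t, c in zip(tasks, predicted_checkpoints)]
    state, committed = architected, 0
    for ckpt, out in zip(predicted_checkpoints, outputs):
        if ckpt != state:      # bad checkpoint: squash this task onward
            break
        state, committed = out, committed + 1
    return committed, state

def inc(s):                    # stand-in "task": one step of the program
    s["r1"] += 1

# Correct checkpoints: both tasks commit. A wrong second checkpoint (the
# distilled program mispredicted r1) squashes task 2; task 1 still commits.
assert commit_tasks([inc, inc], [{"r1": 0}, {"r1": 1}], {"r1": 0}) == (2, {"r1": 2})
assert commit_tasks([inc, inc], [{"r1": 0}, {"r1": 5}], {"r1": 0}) == (1, {"r1": 1})
```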


Master/Slave Speculative Parallelization

[Figure: Master core executes the distilled program; slave cores execute the original program in parallel]


Example Execution

[Figure: timeline across Master, Slave1, Slave2, and Slave3; distilled tasks A’, B’, C’ on the Master feed original tasks A, B, C on the slaves]

  • Start Master and Slave from architected state

  • Take checkpoint, use it to start the next task

  • Verify B’s inputs with A’s outputs; commit state

  • Bad checkpoint @ C: detected at end of B

    • Squash, restart from architected state


MSSP Critical Path

[Figure: the same Master/Slave timeline, highlighting the critical path]

  • If checkpoints are correct:

    • critical path runs through the distilled program

    • no communication latency

    • verification happens in the background

  • If a checkpoint is bad:

    • critical path runs through the original program

    • plus interprocessor communication

  • If bad checkpoints are rare:

    • performance of the distilled program

    • tolerant of communication latency
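These cases can be folded into a back-of-the-envelope latency model (my own sketch, not from the talk): the distilled program sets the pace except when a bad checkpoint forces a restart through the original program plus communication:

```python
# Rough expected-time model for one task under MSSP (assumed form): correct
# checkpoints run at distilled-program speed; a bad one pays the original
# program's time plus interprocessor communication.

def expected_task_time(t_distilled, t_original, t_comm, p_bad):
    return (1 - p_bad) * t_distilled + p_bad * (t_original + t_comm)

# With misspeculations as rare as the results report (~1 per 10,000
# instructions), the expected pace stays essentially the distilled one.
t = expected_task_time(t_distilled=1.0, t_original=3.0, t_comm=0.5,
                       p_bad=1e-4)
assert abs(t - 1.00025) < 1e-9
```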



Talk Outline

  • Removing predictability from programs

    • “Approximation”

  • Externally verifying distilled programs

    • Master/Slave Speculative Parallelization (MSSP)

  • Results Summary

  • Summary



Methodology

  • First-cut distiller

    • Static binary-to-binary translator

    • Simple control flow approximations

    • DCE, inlining, register re-allocation, save/restore elimination, code layout…

  • HW model: 8-way CMP of Alpha 21264 cores

    • 10-cycle interconnect latency to shared L2

  • SPEC2000 integer benchmarks on Alpha



Results Summary

  • Distilled Programs can be accurate

    • 1 task misspeculation per 10,000 instructions

  • Speedup depends on distillation

    • Harmonic mean of 1.25; ranges from 1.0 to 1.7 (gcc, vortex)

    • (relative to uniprocessor execution)

  • Modest storage requirements

    • Tens of kB at L2 for speculation buffering

  • Decent latency tolerance

    • Raising interconnect latency from 5 to 20 cycles: only ~10% slowdown


Distilled Program Accuracy

[Chart: per-benchmark average distance between task misspeculations, log scale from 1,000 to 100,000 instructions]

  • Average distance between task misspeculations: > 10,000 original-program instructions


Distillation Effectiveness

[Chart: instructions retired by the Master (distilled program) as a percentage of instructions retired by the Slaves (original program, not counting nops), from 0% to 100%]

  • Up to a two-thirds reduction


Performance

[Chart: per-benchmark speedup from 1.0 to 1.6, shown alongside accuracy (1,000–100,000) and distillation (0%–100%)]

  • Performance scales with distillation effectiveness



Related Work

  • Slipstream

  • Speculative Multithreading

  • Pre-execution

  • Feedback-directed Optimization

  • Dynamic Optimizers



Summary

  • Don’t waste core on predictable things

    • “Distill” out predictability from programs

  • Verify predictions with original program

    • Split into tasks: parallel validation

    • Achieve the throughput to keep up

  • Has some nice attributes (ask offline)

    • Can support legacy binaries, latency tolerant, low verification cost, complements explicit parallelism

