
M. Aater Suleman*

Moinuddin K. Qureshi

Khubaib*

Yale Patt*

Feedback-Driven Pipelining

*HPS Research Group, The University of Texas at Austin

IBM T.J. Watson Research Center




Background

To leverage CMPs, programs must be parallelized

Pipeline parallelism:

Split each loop iteration into multiple stages

Each stage can be assigned more than one core, or multiple stages can share a core (see the sketch after this list)

Pipeline parallelism is applicable to a variety of workloads:

Streaming [Gordon+ ASPLOS'06]

Recognition, Synthesis and Mining [Bienia+ PACT'08]

Compression/Decompression [Intel TBB 2009]
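As a concrete illustration (not from the slides), consider a loop whose body is split into two stages that run on different cores and communicate through a queue. A minimal sketch with POSIX threads; the work items, queue size, and stage bodies are all illustrative assumptions:

#include <pthread.h>
#include <stdio.h>

#define QSIZE 64
#define ITERS 1000

/* A deliberately simple bounded queue connecting two pipeline stages. */
typedef struct {
    int buf[QSIZE];
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_full, not_empty;
} queue_t;

static queue_t q = { .lock = PTHREAD_MUTEX_INITIALIZER,
                     .not_full = PTHREAD_COND_INITIALIZER,
                     .not_empty = PTHREAD_COND_INITIALIZER };

static void push(queue_t *qu, int v) {
    pthread_mutex_lock(&qu->lock);
    while (qu->count == QSIZE) pthread_cond_wait(&qu->not_full, &qu->lock);
    qu->buf[qu->tail] = v; qu->tail = (qu->tail + 1) % QSIZE; qu->count++;
    pthread_cond_signal(&qu->not_empty);
    pthread_mutex_unlock(&qu->lock);
}

static int pop(queue_t *qu) {
    pthread_mutex_lock(&qu->lock);
    while (qu->count == 0) pthread_cond_wait(&qu->not_empty, &qu->lock);
    int v = qu->buf[qu->head]; qu->head = (qu->head + 1) % QSIZE; qu->count--;
    pthread_cond_signal(&qu->not_full);
    pthread_mutex_unlock(&qu->lock);
    return v;
}

/* Stage 1: the first half of each loop iteration. */
static void *stage1(void *arg) {
    (void)arg;
    for (int i = 0; i < ITERS; i++) push(&q, i * i);
    return NULL;
}

/* Stage 2: the second half, overlapped with stage 1 on another core. */
static void *stage2(void *arg) {
    (void)arg;
    long sum = 0;
    for (int i = 0; i < ITERS; i++) sum += pop(&q);
    printf("sum = %ld\n", sum);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, stage1, NULL);
    pthread_create(&t2, NULL, stage2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}

The queue decouples the stages, so different iterations are in flight on different cores at once; giving a stage more cores simply means more worker threads popping from its in-queue.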




Pipeline Parallelism Example

Find the N most similar strings to a given search string

[Figure: three-stage pipeline. Candidates and the search string → S1: Read → QUEUE1 → S2: Compare → QUEUE2 → S3: Insert into an N-entry heap sorted on similarity score]

First, it reads a candidate string.

Next, it compares the candidate string with the search string to compute a similarity score.

Last, it inserts the candidate string into a heap sorted on similarity. If, after the insertion, the heap has more than N elements, it removes the smallest element from the heap.

Once the kernel has iterated through all input strings, the heap contains the closest N strings. This kernel can be implemented as a 3-stage pipeline with stages S1, S2, and S3.

Note that stage S2 is scalable because multiple strings can be compared concurrently. However, S3 is non-scalable since only one thread can be allowed to update the shared heap.

For simplicity, let's assume that the three stages respectively execute for 5, 20, and 10 time units when run as a single thread.

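A sequential C sketch of this kernel (illustrative: the similarity metric, the value of N, and the sorted-array stand-in for the heap are assumptions, not the paper's code):

#include <stdio.h>
#include <string.h>

#define N 10            /* keep the N most similar strings */
#define MAXLEN 256

static char best[N][MAXLEN];   /* best candidates seen so far */
static int  score[N];          /* their scores, sorted descending */
static int  count = 0;

/* S2: Compare. A stand-in similarity metric (count of matching
   positions); the kernel's real metric is not specified here. */
static int similarity(const char *a, const char *b) {
    int s = 0;
    for (int i = 0; a[i] && b[i]; i++)
        if (a[i] == b[i]) s++;
    return s;
}

/* S3: Insert. Keep the table sorted by score and drop the smallest
   entry once it holds N strings. Only one thread may update this
   shared structure, which is why S3 does not scale. */
static void insert(const char *cand, int s) {
    if (count == N && s <= score[N - 1]) return;   /* worse than the worst kept */
    int i = count < N ? count : N - 1;
    while (i > 0 && score[i - 1] < s) {            /* shift smaller scores down */
        score[i] = score[i - 1];
        strcpy(best[i], best[i - 1]);
        i--;
    }
    score[i] = s;
    strcpy(best[i], cand);
    if (count < N) count++;
}

int main(void) {
    const char *search = "kernel";   /* hypothetical search string */
    char cand[MAXLEN];
    /* S1: Read. One candidate string per line on stdin. */
    while (fgets(cand, sizeof cand, stdin)) {
        cand[strcspn(cand, "\n")] = '\0';
        insert(cand, similarity(cand, search));    /* S2, then S3 */
    }
    for (int i = 0; i < count; i++)
        printf("%2d  %s\n", score[i], best[i]);
    return 0;
}

In the pipelined version, many threads can run similarity() on different candidates at once (S2 scales), but the sorted table must be updated by one thread at a time (S3 does not).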



Key Problem: Core to Stage Allocation

S1: Read (1 time unit)

S2: Compare (4 time units)

S3: Insert (1 time unit)

[Figure: steady-state execution timelines (0 to 45 time units) for four allocations: NumCores = 1 (sequential); NumCores = 3, 1 core/stage; NumCores = 6, 2 cores/stage; and the best allocation with NumCores = 6]

  • Allocation impacts both power and performance:

    • Assigning too few cores to a stage reduces performance

    • Assigning more cores than needed wastes power

    • Core-to-stage allocation must be chosen carefully

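To make the trade-off concrete: a stage that needs t time units per iteration and is given c cores (and scales) retires an iteration every t/c units, and the pipeline runs at the pace of its slowest stage. A small check of the allocations above (hypothetical helper, not part of FDP):

#include <stdio.h>

/* Steady-state time per pipeline iteration: each stage with time t[i]
   and c[i] cores retires an iteration every t[i]/c[i] units (assuming
   linear scaling); the slowest stage, the LIMITER, sets the pace. */
static double time_per_iter(const double *t, const int *c, int n) {
    double worst = 0.0;
    for (int i = 0; i < n; i++)
        if (t[i] / c[i] > worst) worst = t[i] / c[i];
    return worst;
}

int main(void) {
    double t[3] = {1, 4, 1};        /* S1: Read, S2: Compare, S3: Insert */
    int one_each[3] = {1, 1, 1};    /* NumCores = 3 */
    int two_each[3] = {2, 2, 2};    /* NumCores = 6 */
    int best[3]     = {1, 4, 1};    /* NumCores = 6, best allocation */
    printf("1 core/stage : %.1f units/iter\n", time_per_iter(t, one_each, 3)); /* 4.0 */
    printf("2 cores/stage: %.1f units/iter\n", time_per_iter(t, two_each, 3)); /* 2.0 */
    printf("best (1,4,1) : %.1f units/iter\n", time_per_iter(t, best, 3));     /* 1.0 */
    return 0;
}

The best allocation (1, 4, 1) therefore doubles throughput over 2 cores/stage while using the same six cores.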



Best Core-to-Stage Allocation

  • Best allocation depends on the relative throughput and scalability of each stage

  • Scalability and throughput vary with the input set and machine

     → Profile-based and compile-time solutions are sub-optimal

  • Millions of possible allocations, even for shallow pipelines

    e.g., 8 stages can be allocated to 32 cores in 2.6M ways (integer allocation)

     → Brute-force search for the best allocation is impractical

Goal: Automatically find the best core-to-stage allocation at run-time, taking into account the input set, machine configuration, and scalability of stages
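As a sanity check on the 2.6M figure above: with every stage receiving at least one core, the count is the number of ways to split 32 cores over 8 stages, i.e. the stars-and-bars count C(31,7):

#include <stdio.h>

/* Number of ways to split n cores among k stages, each stage getting
   at least one core: the stars-and-bars count C(n-1, k-1). */
static unsigned long long binom(unsigned n, unsigned k) {
    unsigned long long r = 1;
    for (unsigned i = 1; i <= k; i++)
        r = r * (n - k + i) / i;   /* intermediate product is exactly divisible by i */
    return r;
}

int main(void) {
    /* 8 stages on 32 cores -> C(31,7) = 2,629,575, i.e. ~2.6M allocations */
    printf("%llu\n", binom(31, 7));
    return 0;
}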



Outline

Motivation

Feedback-Driven Pipelining

Case Study

Results

Conclusions




Key Insights

  • Pipeline performance is limited by the slowest stage: the LIMITER

  • The LIMITER stage can be identified by measuring the execution time of each stage using existing cycle counters

  • Scalability of a stage can be estimated by hill-climbing, i.e., continue to give it cores until performance stops increasing

  • Non-limiter stages can share cores as long as allocating them to the same core does not make them slower than the LIMITER

    • Saved cores can be assigned to the LIMITER or switched off to save power


Feedback-Driven Pipelining (FDP)

[Flowchart:]

Assign one core per stage, then repeat:

  • Cores available? Yes: add a core to the current LIMITER. If performance improves, continue; if it degrades, take the core back from the LIMITER and save power.

  • Cores available? No: combine the fastest stages on one core. If performance stays the same, keep the change (a core is freed); if it degrades, undo it.
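Read as code, the flowchart is a hill-climbing loop. A minimal sketch, where measure_throughput(), limiter_stage(), and combine_fastest_stages() are hypothetical stand-ins for FDP's internal bookkeeping rather than its actual interface:

/* Hypothetical helpers: timing a window of iterations, finding the
   slowest stage, and mapping the two fastest stages onto one core. */
double measure_throughput(const int *alloc, int n);
int    limiter_stage(const int *alloc, int n);
int    combine_fastest_stages(int *alloc, int n);   /* 1 if a core was freed */

void fdp_tune(int *alloc, int n, int num_cores) {
    int used = n;
    for (int s = 0; s < n; s++) alloc[s] = 1;       /* one core per stage */
    double best = measure_throughput(alloc, n);

    for (;;) {
        if (used < num_cores) {
            /* Cores available: hill-climb by feeding the LIMITER. */
            int lim = limiter_stage(alloc, n);
            alloc[lim]++; used++;
            double t = measure_throughput(alloc, n);
            if (t > best) { best = t; continue; }   /* improved: keep climbing */
            alloc[lim]--; used--;                   /* degraded: undo, leave the core idle */
        }
        /* No useful free cores: try to free one by combining the
           fastest stages on a single core. */
        if (!combine_fastest_stages(alloc, n)) break;
        used--;
        double t = measure_throughput(alloc, n);
        if (t < best) break;   /* degraded: a full version would undo the combine */
        best = t;
    }
}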



Required Support

  • FDP uses instructions to read the time-stamp counter (rdtsc)

  • Software: modify each worker thread to call FDP library functions

FDP_Init()
while (!DONE):
    stage_id = FDP_InitStage()
    pop a work quantum from the stage's in-queue
    FDP_BeginStage(stage_id)
    run the stage
    FDP_EndStage(stage_id)
    push the iteration to the in-queue of the next stage



Performance Considerations

  • All required data structures are maintained in software and use only virtual memory

  • Training data is collected by reading the cycle counter at the start and end of each stage's execution

    • We reduce overhead by sampling only 1 in 128 iterations

    • Training can continue seamlessly at all times

  • The FDP algorithm runs infrequently: once every 2000 iterations

  • Each allocation is tried only once to ensure convergence, so there is almost zero overhead once converged
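One plausible shape for the instrumented calls, assuming x86 (__rdtsc) and a single worker per stage for brevity; FDP's actual implementation may differ:

#include <x86intrin.h>   /* __rdtsc() */
#include <stdint.h>

#define NUM_STAGES 8

/* Per-stage training data, kept in ordinary (virtual) memory. */
static uint64_t cycles[NUM_STAGES];     /* accumulated sampled cycles */
static uint64_t samples[NUM_STAGES];    /* sampled iteration count */
static uint64_t iters[NUM_STAGES];      /* total iteration count */
static uint64_t start_tsc[NUM_STAGES];
static int      sampling[NUM_STAGES];

void FDP_BeginStage(int id) {
    /* Sample only 1 in 128 iterations to keep the overhead low. */
    sampling[id] = ((iters[id]++ & 127) == 0);
    if (sampling[id]) start_tsc[id] = __rdtsc();
}

void FDP_EndStage(int id) {
    if (sampling[id]) {
        cycles[id] += __rdtsc() - start_tsc[id];
        samples[id]++;
    }
}

/* The LIMITER is the stage with the largest mean cycles per iteration. */
int fdp_limiter(void) {
    int lim = 0;
    uint64_t worst = 0;
    for (int s = 0; s < NUM_STAGES; s++) {
        if (samples[s] == 0) continue;
        uint64_t mean = cycles[s] / samples[s];
        if (mean > worst) { worst = mean; lim = s; }
    }
    return lim;
}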



Outline

Motivation

Feedback-Driven Pipelining

Case Study

Results

Conclusions




Experimental Methodology

  • Measurements taken on an Intel-based 8-core SMP (2x Core2Quad chips)

  • Nine pipeline workloads from various domains

  • Evaluated configurations:

    • FDP

    • Profile-based

    • Proportional Allocation

  • Total execution times measured using the Linux time utility (experiments repeated to reduce randomness due to I/O and the OS)



Case Study I: compress

[Figure: execution timeline of compress under FDP. FDP identifies the LIMITER, gives more cores to S3, then even more cores to S3, and finally combines stages to free up cores, reaching optimized execution]



Outline

Motivation

Feedback-Driven Pipelining

Case Study

Results

Conclusions




Performance

[Figure: speedup w.r.t. 1 core for each workload]

On average, Profile-Based provides 2.86x speedup and FDP 4.3x



Robustness to Input Set

[Figure: speedup w.r.t. 1 core on an input set that is hard to compress]

Stage S3 now takes 80K-140K cycles instead of 2.4M cycles

S5 (writing output to files) also takes 80K cycles and is non-scalable



Savings in Active Cores

[Figure: number of active cores for each workload]

FDP not only improves performance but can save power too!



Scalability to Larger Systems

Larger machine: 16-core system (4x AMD Barcelona)

Evaluating Profile-Based is impractical (several thousand configurations)

[Figure: speedup w.r.t. 1 core for each workload]

FDP provides 6.13x speedup (vs. 4.3x with Proportional)

FDP also saves power (11.5 active cores vs. 16 with Proportional)



Outline

Motivation

Feedback-Driven Pipelining

Case Study

Results

Conclusions




Conclusions

Pipeline parallelism is applicable to a wide variety of workloads

Key problem: how many cores to assign to each stage?

Our insight: performance is limited by the slowest stage, the LIMITER

Our proposal, FDP, identifies the LIMITER stage at run-time using existing performance counters

FDP uses a hill-climbing algorithm to estimate stage scalability

FDP successfully finds the best core-to-stage allocation

Speedup of 4.3x vs. 2.8x with a practical profile-based approach

Robust to the input set and scalable to larger machines

Can be used to save power when the LIMITER does not scale




Questions



Related Work

Flextream: Hormati+ (PACT 2009)

  Does not take stage scalability into account

  Requires dynamic recompilation

Compile-time tuning of pipeline workloads:

  Navarro+ (PACT 2009, ICS 2009), Liao+ (JS 2005), Gonzalez+ (Parallel Computing 2003)

Profile-based allocation in domain-specific apps.




Feedback-Driven Pipelining (FDP)

[Flowchart, extended with convergence checks:]

Assign one core per stage, then repeat:

  • Cores available? Yes: add a core to the current limiter, unless this allocation has been seen before. If performance degrades, undo the change.

  • Cores available? No: combine the fastest stages on one core, unless this allocation has been seen before. If performance degrades, undo the change.




FDP for Work Sharing Model

FDP performs similar to work-sharing with the best number of threads!




Data Structures for FDP


