Automatically tuning task based programs for multi core processors
Download
1 / 62

Automatically Tuning Task-Based Programs for Multi-core Processors - PowerPoint PPT Presentation


  • 83 Views
  • Uploaded on

Automatically Tuning Task-Based Programs for Multi-core Processors. Jin Zhou Brian Demsky Department of Electrical Engineering and Computer Science University of California, Irvine. Motivation. Recent microprocessor trends Number of cores increased rapidly Architectures vary widely

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Automatically Tuning Task-Based Programs for Multi-core Processors' - kaelem


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
Automatically tuning task based programs for multi core processors

Automatically Tuning Task-Based Programs for Multi-core Processors

Jin Zhou

Brian Demsky

Department of Electrical Engineering and Computer Science

University of California, Irvine


Motivation
Motivation Processors

  • Recent microprocessor trends

    • Number of cores increased rapidly

    • Architectures vary widely

  • Challenges for software development

    • Parallelization is now key for performance

    • Current parallel programming model: threads + locks

      • Hard to develop correct and efficient parallel software

      • Hard to adapt software to changes in architectures


Goals
Goals Processors

  • Automatically generate parallel implementation

  • Automatically tune parallel implementation


Overview
Overview Processors

Profile Data

Program

Processor Specification

Bamboo Compiler

Implementation Generator

Candidate implementations

Simulation-based Evaluator

Leading implementations

Implementation Optimizer

Optimized implementation

Tuned implementations

Code Generator

Optimized multi-core binary

Multi-core Processor


Example
Example Processors

  • MonteCarlo Example

    • Partitions problem into several simulations

    • Executes the simulations in parallel

    • Aggregates results of all simulations


Bamboo language
Bamboo Language Processors

  • A hybrid language combines data-flow and Java

    • Programs are composed of tasks

    • Tasks compose with dataflow-like semantics

    • Tasks contain Java-like object-oriented code internally

    • Programs cannot explicitly invoke tasks

    • Runtime automatically invokes tasks

  • Supports standard object-oriented constructs including methods and classes


Bamboo language1
Bamboo Language Processors

  • Flags

    • Capture current role (type state) of object in computation

    • Each flag captures an aspect of the object’s state

    • Change as the object’s role evolves in program

    • Support orthogonal classifications of objects


class Simulator { Processors

flag run;

flag submit;

flag finished;

...

}

task startup(StartupObject s in initialstate) {

Aggregator aggr = new Aggregator(s.args[0]){merge:=true};

for(int i = 0; i < 4; i++)

Simulator sim = new Simulator(aggr){run:=true};

taskexit(s: initialstate:=false);

}

task simulate(Simulator sim in run) {

sim.runSimulate();

taskexit(sim: run:=false, submit:=true);

}

task aggregate(Aggregator aggr in merge,

Simulator sim in submit) {

boolean allprocessed = aggr.aggregateResult(sim);

if (allprocessed)

taskexit(aggr: merge:=false, finished:=true;

sim: submit:=false, finished:=true);

taskexit(sim: submit:=false, finished:=true);

}

class Aggregator {

flag merge;

flag finished;

}


Bamboo program execution
Bamboo Program Execution Processors

Global Flagged Object Space

Runtime initialization

new

StartupObject

StartupObject

initialstate state

finished state


Bamboo program execution1
Bamboo Program Execution Processors

Global Flagged Object Space

execute on

startup task

StartupObject

StartupObject

finished state

initialstate state


Bamboo program execution2
Bamboo Program Execution Processors

set

Global Flagged Object Space

startup task

StartupObject

new

Aggregator

Simulator

Simulator

Simulator

Simulator

StartupObject

initialstate state

finished state

finished state

merge state

Aggregator

Simulator

submit state

finished state

run state


Bamboo program execution3
Bamboo Program Execution Processors

Global Flagged Object Space

StartupObject

Aggregator

execute on

execute on

simulate task

simulate task

simulate

Simulator

Simulator

Simulator

Simulator

simulate task

simulate task

execute on

execute on

StartupObject

initialstate state

finished state

merge state

finished state

Aggregator

Simulator

run state

submit state

finished state


Bamboo program execution4
Bamboo Program Execution Processors

Global Flagged Object Space

StartupObject

Aggregator

set

set

simulate task

simulate task

Simulator

Simulator

Simulator

Simulator

simulate task

simulate task

set

set

StartupObject

initialstate state

finished state

finished state

Aggregator

merge state

Simulator

run state

submit state

finished state


Bamboo program execution5
Bamboo Program Execution Processors

Global Flagged Object Space

aggregate task

StartupObject

execute on

Aggregator

Simulator

Simulator

Simulator

Simulator

StartupObject

initialstate state

finished state

Aggregator

merge state

finished state

Simulator

run state

submit state

finished state


Bamboo program execution6
Bamboo Program Execution Processors

aggregate task

Global Flagged Object Space

StartupObject

Aggregator

set

Simulator

Simulator

Simulator

Simulator

StartupObject

finished state

initialstate state

finished state

merge state

Aggregator

Simulator

submit state

run state

finished state


Bamboo program execution7
Bamboo Program Execution Processors

Global Flagged Object Space

StartupObject

Aggregator

execute on

aggregate task

Simulator

Simulator

Simulator

Simulator

StartupObject

finished state

initialstate state

finished state

merge state

Aggregator

Simulator

submit state

run state

finished state


Bamboo program execution8
Bamboo Program Execution Processors

Global Flagged Object Space

StartupObject

Aggregator

set

aggregate task

Simulator

Simulator

Simulator

Simulator

StartupObject

finished state

initialstate state

finished state

merge state

Aggregator

Simulator

submit state

run state

finished state


Bamboo program execution9
Bamboo Program Execution Processors

Global Flagged Object Space

StartupObject

Aggregator

Simulator

Simulator

aggregate task

Simulator

Simulator

execute on

StartupObject

finished state

initialstate state

finished state

merge state

Aggregator

Simulator

submit state

run state

finished state


Bamboo program execution10
Bamboo Program Execution Processors

Global Flagged Object Space

StartupObject

Aggregator

Simulator

Simulator

aggregate task

set

Simulator

Simulator

StartupObject

finished state

initialstate state

finished state

merge state

Aggregator

Simulator

submit state

run state

finished state


Bamboo program execution11
Bamboo Program Execution Processors

Global Flagged Object Space

StartupObject

Aggregator

Simulator

Simulator

aggregate task

Simulator

Simulator

execute on

StartupObject

finished state

initialstate state

finished state

merge state

Aggregator

Simulator

submit state

run state

finished state


Bamboo program execution12
Bamboo Program Execution Processors

Global Flagged Object Space

StartupObject

Aggregator

Simulator

Simulator

aggregate task

Simulator

Simulator

set

StartupObject

finished state

initialstate state

finished state

merge state

Aggregator

Simulator

submit state

run state

finished state


Implementation generation
Implementation Generation Processors

Profile Data

Bamboo Program

Processor Specification

Bamboo Compiler

Implementation Generator

Candidate implementations

Simulation-based Evaluator

Leading implementations

Implementation Optimizer

Optimized implementation

Tuned implementations

Code Generator

Optimized multi-core binary

Multi-core Processor


Implementation generation1
Implementation Generation Processors

  • Dependence Analysis:analyzes data dependence between tasks

  • Parallelism Exploration: extracts potential parallelism

  • Mapping to Cores: maps the program to real processor


Flag state transition graph fstg
Flag State Transition Graph (FSTG) Processors

Simulator

run

simulate:32Mcyc; 100%

submit

aggregate:2Mcyc; 100%

finished


Combined flag state transition graph cfstg
Combined Flag State Transition Graph (CFSTG) Processors

StartupObject

initialstate

Number of new objects

startup:3Mcyc; 100%

finished

1

4

Simulator

Aggregator

run

merge

aggregate:2Mcyc; 75%

simulate:32Mcyc; 100%

aggregate:2Mcyc; 25%

submit

finished

aggregate:2Mcyc; 100%

finished


Initial mapping
Initial Mapping Processors

Core Group

StartupObject

initialstate

startup:3Mcyc; 100%

finished

1

4

Simulator

Aggregator

run

merge

simulate:32Mcyc; 100%

aggregate:2Mcyc; 75%

aggregate:2Mcyc; 25%

submit

finished

aggregate:2Mcyc; 100%

finished


Preprocessing phase
Preprocessing Phase Processors

  • Identifies strongly connected components (SCC) and merges them into a single core group

  • Converts CFSTG into a tree of core groups by replicating core groups as necessary


Data locality rule
Data Locality Rule Processors

StartupObject

  • Default rule

  • Maximize data locality to improve performance

    • Minimizes inter-core communications

    • Improves cache behavior

initialstate

startup:3Mcyc; 100%

finished

1

4

Aggregator

merge

Simulator

aggregate:2Mcyc; 75%

run

aggregate:2Mcyc; 25%

finished

StartupObject

4

1

Simulator

Aggregator


Data parallelization rule
Data Parallelization Rule Processors

StartupObject

  • To explore potential data parallelism

initialstate

startup:3Mcyc; 100%

finished

1

4

Aggregator

merge

Simulator

aggregate:2Mcyc; 75%

run

aggregate:2Mcyc; 25%

finished

1

StartupObject

Simulator

1

1

StartupObject

Aggregator

4

Simulator

1

Simulator

Aggregator

Simulator

Simulator

1

1


Rate matching rule
Rate Matching Rule Processors

Producer

  • If the producer executes multiple times in a cycle, how many consumers are required?

  • Match two rates to estimate the number of consumers

    • Peak new object creation rate

    • Object consumption rate

produce

init

Consumer

produce

run

Consumer

Producer

Consumer


Mapping to processor
Mapping to Processor Processors

  • Extended CFSTG

  • Constraint: limited cores

1

StartupObject

Simulator

1

1

Aggregator

Simulator

Simulator

Simulator

1

1

Core 1

Core 2

  • Map CFSTG core groups to physical cores


Mapping to cores
Mapping to Cores Processors

  • One possible mapping

Core 1

1

StartupObject

Simulator

1

1

Aggregator

Simulator

Core 2

Simulator

Simulator

1

1


Mapping to cores1
Mapping to Cores Processors

  • Isomorphic mappings: have same performance

Core 1

Core 1

1

1

StartupObject

StartupObject

Simulator

Simulator

1

1

1

1

Core 2

Aggregator

Aggregator

Simulator

Simulator

Core 2

Simulator

Simulator

Simulator

Simulator

1

1

1

1

  • Backtracking-based search: to generate non-isomorphic implementations


Implementation generation2
Implementation Generation Processors

Profile Data

Bamboo Program

Processor Specification

Bamboo Compiler

Implementation Generator

Candidate implementations

Simulation-based Evaluator

Leading implementations

Implementation Optimizer

Optimized implementation

Tuned implementations

Code Generator

Optimized multi-core binary

Multi-core Processor


Simulation based evaluation
Simulation-Based Evaluation Processors

  • To select the best candidate implementation

  • High-level simulation

    • Does NOT actually execute the program

    • Constructs abstract execution trace with similar statistics

    • Compare the execution time or throughput and core usage

Simulator

Core

Core

Task

Task

Task

Task


Simulation based evaluation1
Simulation-Based Evaluation Processors

  • Markov model

    • Built from profile data

    • For each task estimates:

      • The destination state

      • The execution time

      • A count of each type of new objects

StartupObject

Simulator

initialstate

1

startup:3Mcyc; 100%

fnished

Simulator

1

1

Aggregator

1

merge

aggregate:2Mcyc; 75%

Simulator

aggregate:2Mcyc; 25%

finished

1

Simulator

run

simulate:32Mcyc; 100%

submit

aggregate:2Mcyc; 100%

finished


Simulated execution trace
Simulated Execution Trace Processors

core 0

core 1

StartupObject(1)

0

startuptask

3

Aggregator(1), Simulator (4)

transfer a Simulator

Simulator(1)

4

simulatetask

simulatetask

Aggregator(1), Simulator(1), Simulator(2)

35

Simulator(1)

36

Aggregator(1), Simulator(2), Simulator(2)

transfer a Simulator

37

simulatetask

Aggregator(1), Simulator(3), Simulator(1)

67

simulatetask

99

Aggregator(1), Simulator(4)

Aggregator(1), Simulator(4)

aggregatetask

Aggregator(1), Simulator(3)

101

1 Aggregator in the initial state and 4 Simulators in the submit state

aggregatetask

Aggregator(1), Simulator(2)

103

aggregatetask

Aggregator(1), Simulator(1)

105

aggregatetask

107

empty


Problem of exhaustive searching
Problem of Exhaustive Searching Processors

  • The search space expands quickly

  • Exhaustive search is not feasible for complicated applications


Random search
Random Search? Processors

  • Very low chance to find the best implementation

Chance to find the best implementation


Developer optimization process
Developer Optimization Process Processors

  • Create an initial implementation

  • Evaluate it and identify performance bottlenecks

  • Heuristically develop new implementations to remove bottlenecks

  • Iteratively repeat evaluation and optimization


Directed simulated annealing dsa
Directed Simulated Annealing (DSA) Processors

Randomly generate candidate implementations

Directed Simulated Annealing

High-level Simulator

Leading candidate implementations

As-built Critical Path Analysis

Potential bottlenecks

New candidate implementations

Implementation Generator

Tuned candidate implementation


As built critical path abcp
As-Built Critical Path (ABCP) Processors

core 0

core 1

  • Provide post-mortem analysis of project management

StartupObject(1)

0

startuptask

3

Aggregator(1), Simulator (4)

transfer a Simulator

Simulator(1)

4

simulatetask

simulatetask

Aggregator(1), Simulator(1), Simulator(2)

35

Simulator(1)

36

Aggregator(1), Simulator(2), Simulator(2)

transfer a Simulator

37

simulatetask

Aggregator(1), Simulator(3), Simulator(1)

67

simulatetask

1

Simulator

StartupObject

1

99

Aggregator(1), Simulator(4)

Simulator

aggregatetask

1

1

Aggregator

Aggregator(1), Simulator(3)

101

Simulator

aggregatetask

1

Aggregator(1), Simulator(2)

103

aggregatetask

Simulator

Aggregator(1), Simulator(1)

105

aggregatetask

107

empty


As built critical path analysis
As-Built Critical Path Analysis Processors

core 0

core 1

StartupObject(1)

0

startuptask

  • Compute the time when a task invocation’s data dependences are resolved

0

3

Aggregator(1), Simulator (4)

transfer a Simulator

Simulator(1)

4

simulatetask

3

simulatetask

Aggregator(1), Simulator(1), Simulator(2)

35

Simulator(1)

36

transfer a Simulator

Aggregator(1), Simulator(2), Simulator(2)

37

simulatetask

3

Aggregator(1), Simulator(3), Simulator(1)

67

simulatetask

3

99

Aggregator(1), Simulator(4)

aggregatetask

35

Aggregator(1), Simulator(3)

101

aggregatetask

101

Aggregator(1), Simulator(2)

103

aggregatetask

103

Aggregator(1), Simulator(1)

105

aggregatetask

105

107

empty


Waiting task optimization
Waiting Task Optimization Processors

  • Waiting tasks:

    • Tasks whose real invocation time is later than the time when all its data dependences are resolved

    • Delayed because of resource conflicts

    • Bottlenecks, remove them from ABCP

  • Optimization

    • Migrate waiting tasks to spare cores

    • Shorten the ABCP to improve performance


Critical task optimization
Critical Task Optimization Processors

  • There may not exist spare cores to move waiting tasks to

  • Identify critical tasks: tasks that produce data that is consumed immediately

  • Attempt to execute critical tasks as early as possible

  • Migrate other tasks which blocked some critical task to other cores

core 0

core 1

simulatetask

Aggregator(1), Simulator(1), Simulator(2)

35

Simulator(1)

36

simulatetask

simulatetask

1

Aggregator(1), Simulator(3), Simulator(1)

67

Simulator(2)

simulatetask

3

2

99

Aggregator(1), Simulator(4)

aggregatetask

35

Aggregator(1), Simulator(3)

101

aggregatetask


Code generator
Code Generator Processors

Profile Data

Bamboo Program

Processor Specification

Bamboo Compiler

Implementation Generator

Candidate implementations

Simulation-based Evaluator

Leading implementations

Implementation Optimizer

Optimized implementation

Tuned implementations

Code Generator

Intermediate C code

Optimized multi-core binary

Multi-core Processor


Evaluation
Evaluation Processors

  • MIT RAW simulator

    • Cycle accurate simulator configured for 16 cores

    • RAW chip: tiled chip, shared memory, on-chip network

  • Benchmarks:

    • Series: Java Grande benchmark suite

    • MonteCarlo: Java Grande benchmark suite

    • FilterBank: StreamIt benchmark suite

    • Fractal


Speedups on 16 cores
Speedups on 16 cores Processors

  • Successfully generated implementations with good performance


Comparison to hand written c code
Comparison to Hand-Written C Code Processors

  • Overhead of Bamboo:

    • Small for Series and Fractal

    • Larger overhead for MonteCarlo and FilterBank:

      • GCC cannot reorder instructions to fill floating-point delay slots for Bamboo implementations due to imprecise alias results

      • Easy to add alias information to facilitate the reordering


Comparison of estimation and real execution
Comparison of Estimation and Real Execution Processors

  • The simulation estimations are close to the real execution time



Fractal
Fractal Processors


Montecarlo
MonteCarlo Processors


Filterbank
FilterBank Processors


Generality of synthesized implementation
Generality of Synthesized Implementation Processors

  • The speedups of both 16-core Bamboo versions are similar

  • Successfully generate a sophisticated implementation utilizing pipelining for MonteCarlo


Related work
Related Work Processors

  • Data-flow and streaming languages:

    • Bamboo relaxes typical restrictions in these models to permit:

      • Flexible mutation of data structures

      • Data structures of arbitrarily complex constructs

    • Bamboo supports applications that non-deterministically access data

  • Tuple-space language: compiler cannot automatically create multiple instantiations to utilize multiple cores

  • Self-tuning libraries: mostly address specific computations


Conclusion
Conclusion Processors

  • We developed a new approach to automatically tune task-based programs for multi-core processors

    • Automatically generate parallel implementations

    • Automatically tune according to specific architecture

  • The approach was evaluated on MIT RAW simulator

    • Successfully generated implementations with good performance

    • Successfully generated a sophisticated implementation utilizing pipelining

  • Can be extended to the broader context of traditional programming languages


Thank you! Processors


Future work
Future Work Processors

  • Apply our approach on non-simulated multi-core processors

  • Develop more sophisticated processor specification

  • Explore rich set of applications


Design rationale
Design Rationale Processors

  • Why not dynamic scheduling?

    • Bad scalability over increasing cores

    • Our basic approach makes it easier to adapt to future changes in architectures


Tree transform
Tree Transform Processors

4

Producer1

1

Consumer

1

Producer2

4

Producer1

Consumer

1

Producer2

1

Consumer


Tags Processors

  • Motivation: consider a video processor example

  • Tags group objects together:

    • Tags have types

    • Can create many instances of a tag type

    • Each instance defines a group

  • Can bind tag instances to objects

  • Tags can specify that task parameters must be in the same group


ad