Compiler Supported Coarse-Grained Pipelined Parallelism: Why and How

Gagan Agrawal, Wei Du, Tahsin Kurc, Umit Catalyurek, Joel Saltz

The Ohio State University

Overall Context
  • NGS grant titled "An Integrated Middleware and Language/Compiler Framework for Data-Intensive Applications", funded September 2002 – August 2005.
  • Project Components
    • Runtime Optimizations in the DataCutter System
    • Compiler Optimization of DataCutter filters
    • Automatic Generation of DataCutter filters
      • Focus of this talk
General Motivation
  • Language and compiler support for many forms of parallelism has been explored
    • Shared memory parallelism
    • Instruction-level parallelism
    • Distributed memory parallelism
    • Multithreaded execution
  • Application and technology trends are making another form of parallelism desirable and feasible
    • Coarse-Grained Pipelined Parallelism
Coarse-Grained Pipelined Parallelism (CGPP)

[Figure: a two-stage pipeline for the example below: Range_query, then Find the K-nearest neighbors]
  • Definition
    • Computations associated with an application are carried out in several stages, which are executed on a pipeline of computing units
  • Example — K-nearest Neighbor

Given a 3-D range R = <(x1, y1, z1), (x2, y2, z2)> and a point p = (a, b, c), we want to find the K nearest neighbors of p within R.
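The example above can be sketched in plain Java. This is our own illustrative reconstruction, not code from the talk (Point3, knnInRange, and the sample data are hypothetical names): a range query first filters the points, then the K nearest survivors are selected.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class KnnSketch {
    // Minimal 3-D point with squared distance and range membership tests
    record Point3(double x, double y, double z) {
        double dist2(Point3 o) {
            double dx = x - o.x, dy = y - o.y, dz = z - o.z;
            return dx * dx + dy * dy + dz * dz;
        }
        boolean inRange(Point3 lo, Point3 hi) {
            return lo.x <= x && x <= hi.x && lo.y <= y && y <= hi.y
                && lo.z <= z && z <= hi.z;
        }
    }

    static List<Point3> knnInRange(List<Point3> data, Point3 p,
                                   Point3 lo, Point3 hi, int k) {
        // stage 1: range query; stage 2: K nearest among the survivors
        List<Point3> inRange = new ArrayList<>();
        for (Point3 q : data)
            if (q.inRange(lo, hi)) inRange.add(q);
        inRange.sort(Comparator.comparingDouble((Point3 q) -> q.dist2(p)));
        return inRange.subList(0, Math.min(k, inRange.size()));
    }

    public static void main(String[] args) {
        List<Point3> data = List.of(new Point3(0, 0, 0), new Point3(1, 1, 1),
                                    new Point3(5, 5, 5), new Point3(9, 9, 9));
        List<Point3> nn = knnInRange(data, new Point3(0, 0, 0),
                                     new Point3(-1, -1, -1), new Point3(6, 6, 6), 2);
        System.out.println(nn.size()); // 2
    }
}
```

The two stages map directly onto the pipeline in the figure: the range query can run near the data, and the neighbor selection downstream.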

Coarse-Grained Pipelined Parallelism is Desirable & Feasible
  • Application scenarios

[Figure: multiple data repositories connected over the Internet, with data flowing toward the user]

Coarse-Grained Pipelined Parallelism is Desirable & Feasible
  • A new class of data-intensive applications
    • Scientific data analysis
    • data mining
    • data visualization
    • image analysis
  • Two direct ways to implement such applications
    • Downloading all the data to the user's machine – often not feasible
    • Computing at the data repository – usually too slow
Coarse-Grained Pipelined Parallelism is Desirable & Feasible
  • Our belief
    • A coarse-grained pipelined execution model is a good match

[Figure: a pipeline of computing units between the data repositories and the user, connected over the Internet]

Coarse-Grained Pipelined Parallelism needs Compiler Support
  • Computation needs to be decomposed into stages
  • Decomposition decisions are dependent on execution environment
    • How many computing sites available
    • How many available computing cycles on each site
    • What are the available communication links
    • What’s the bandwidth of each link
  • Code for each stage follows the same processing pattern, so it can be generated by the compiler
  • Shared or distributed memory parallelism needs to be exploited
  • High-level language and compiler support are necessary
Outline
  • Coarse-grained pipelined parallelism is desirable & feasible
  • Coarse-grained pipelined parallelism needs high-level language & compiler support
  • An entire picture of the system
  • DataCutter runtime system & language dialect
  • Overview of the challenges for the compiler
  • Compiler Techniques
  • Experimental results
  • Related work
  • Future work & Conclusions
An Entire Picture

[Figure: a Java dialect is fed to the compiler, which performs decomposition and code generation targeting the DataCutter runtime system]

DataCutter Runtime System

[Figure: filter1 → stream → filter2 → stream → filter3]
  • Ongoing project at OSU / Maryland (Kurc, Catalyurek, Beynon, Saltz et al.)
  • Targets a distributed, heterogeneous environment
  • Allows decomposition of application-specific data processing operations into a set of interacting processes
  • Provides a specific low-level interface
    • filter
    • stream
    • layout & placement
Language Dialect
  • Goal
    • to give the compiler information about independent collections of objects, parallel loops, reduction operations, and pipelined parallelism
  • Extensions of Java
    • Pipelined_loop
    • Domain & Rectdomain
    • Foreach loop
    • reduction variables
ISO-Surface Extraction Example Code

General shape of a pipelined loop, followed by a Merge step:

    RectDomain<1> PacketRange = [1:4];
    Pipelined_loop (b in PacketRange) {
        0. foreach ( … ) { … }
        1. foreach ( … ) { … }
        … …
        n-1. S;
    }
    // Merge

The example itself:

    public class isosurface {
      public static void main(String arg[]) {
        float iso_value;
        RectDomain<1> CubeRange = [min:max];
        CUBE[1d] InputData = new CUBE[CubeRange];
        Point<1> p, b;
        RectDomain<1> PacketRange = [1:runtime_def_num_packets];
        RectDomain<1> EachRange = [1:(max-min)/runtime_define_num_packets];
        Pipelined_loop (b in PacketRange) {
          Foreach (p in EachRange) {
            InputData[p].ISO_SurfaceTriangles(iso_value, …);
          }
          … …
        }
      }
    }

The sequential equivalent of the inner processing:

    for (int i = min; i < max-1; i++) {
        // operate on InputData[i]
    }

Overview of the Challenges for the Compiler
  • Filter Decomposition
    • To identify the candidate filter boundaries
    • Compute communication volume between two consecutive filters
    • Cost Model
    • Compute a mapping from computations in a loop to computing units in a pipeline
  • Filter Code Generation

Identify the Candidate Filter Boundaries

  • Three types of candidate boundaries
    • Start & end of a foreach loop
    • Conditional statement

    if ( point[p].inRange(high, low) ) {
        local_KNN(point[p]);
    }

    • Start & end of a function call within a foreach loop
  • Any non-foreach loop must be completely inside a single filter
Compute Required Communication

ReqComm(b) = the set of values that need to be communicated across boundary b

Cons(B) = the set of variables that are used in B but not defined in B

Gens(B) = the set of variables that are defined in B and still live at the end of B

For a code block B lying between boundaries b2 (before B) and b1 (after B):

    ReqComm(b2) = ReqComm(b1) – Gens(B) + Cons(B)
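This dataflow step is just set arithmetic. A minimal sketch, assuming variables are represented as strings (the class and method names here are ours, not the compiler's):

```java
import java.util.HashSet;
import java.util.Set;

// Propagate the required-communication set across one code block B:
// ReqComm(b2) = ReqComm(b1) - Gens(B) + Cons(B)
public class ReqCommSketch {
    static Set<String> across(Set<String> reqAtB1, Set<String> gens, Set<String> cons) {
        Set<String> result = new HashSet<>(reqAtB1);
        result.removeAll(gens); // values defined inside B need not be sent in
        result.addAll(cons);    // values B uses but does not define must be sent
        return result;
    }

    public static void main(String[] args) {
        Set<String> req = Set.of("x", "y"); // needed at boundary b1
        // B defines y and uses z: the set across b2 is {x, z}
        System.out.println(across(req, Set.of("y"), Set.of("z")));
    }
}
```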

Cost Model
  • A sequence of m computing units C1, …, Cm with computing powers P(C1), …, P(Cm)
  • A sequence of m-1 network links L1, …, Lm-1 with bandwidths B(L1), …, B(Lm-1)
  • A sequence of n candidate filter boundaries b1, …, bn
Cost Model (cont.)

[Figure: time diagram of a pipeline with stages C1, L1, C2, L2, C3 processing N packets]

If L2 is the bottleneck stage:

    T = T(C1) + T(L1) + T(C2) + N*T(L2) + T(C3)

If C2 is the bottleneck stage:

    T = T(C1) + T(L1) + N*T(C2) + T(L2) + T(C3)
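Both formulas are instances of one rule: every stage is paid once (pipeline fill), and the bottleneck stage is additionally paid for each of the remaining N-1 packets. A small sketch with made-up stage times (illustrative numbers, not measured data):

```java
// Bottleneck cost model: T = sum of all stage times + (N-1) * max stage time
public class PipelineCost {
    static double totalTime(double[] stageTimes, int n) {
        double sum = 0, max = 0;
        for (double t : stageTimes) {
            sum += t;
            max = Math.max(max, t);
        }
        // every stage contributes once; the bottleneck additionally
        // gates each of the remaining n-1 packets
        return sum + (n - 1) * max;
    }

    public static void main(String[] args) {
        // T(C1), T(L1), T(C2), T(L2), T(C3); C2 (2.0) is the bottleneck
        double[] stages = {1.0, 0.5, 2.0, 0.5, 1.0};
        System.out.println(totalTime(stages, 10)); // 5.0 + 9*2.0 = 23.0
    }
}
```

With these numbers the result matches the slide's formula T(C1)+T(L1)+N*T(C2)+T(L2)+T(C3) = 1 + 0.5 + 10*2 + 0.5 + 1 = 23.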

Filter Decomposition

Goal: find a mapping Li → bj (1 ≤ i ≤ m-1, 1 ≤ j ≤ n) that minimizes the predicted execution time. Intuitively, mapping Li to bj inserts the candidate filter boundary bj between computing units Ci and Ci+1.

[Figure: candidate boundaries b1, …, bn split the loop body into filters f1, …, fn+1, which are placed on computing units C1, …, Cm connected by links L1, …, Lm-1]

One option is exhaustive search over all mappings.

Filter Decomposition: A Greedy Algorithm
  • To minimize the predicted execution time

[Figure: the first link L1 is tried against each candidate boundary, giving estimated costs L1 to b1: T1, L1 to b2: T2, L1 to b3: T3, L1 to b4: T4; the minimum, Min{T1 … T4} = T2, selects b2, and the process repeats for the next link]
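The greedy step above can be sketched as follows. This is our own reconstruction: a made-up cost table stands in for the cost model, and boundaries are kept in program order so later links can still be placed.

```java
public class GreedyDecompose {
    // Hypothetical stand-in for the cost model: estimated cost of
    // cutting at boundary b for link `link`
    static double estimateCost(int link, int b, double[][] cost) {
        return cost[link][b];
    }

    // Assign each link, in pipeline order, the remaining candidate
    // boundary with the minimum estimated cost
    static int[] decompose(int numLinks, int numBoundaries, double[][] cost) {
        int[] mapping = new int[numLinks];
        int next = 0; // boundaries must stay in program order
        for (int link = 0; link < numLinks; link++) {
            int best = next;
            // leave enough boundaries for the links still to be placed
            for (int b = next; b <= numBoundaries - (numLinks - link); b++) {
                if (estimateCost(link, b, cost) < estimateCost(link, best, cost))
                    best = b;
            }
            mapping[link] = best;
            next = best + 1;
        }
        return mapping;
    }

    public static void main(String[] args) {
        // 2 links, 4 candidate boundaries; costs are made-up numbers
        double[][] cost = {
            {5, 2, 4, 9},
            {7, 7, 1, 3},
        };
        int[] m = decompose(2, 4, cost);
        System.out.println(m[0] + " " + m[1]); // b2 for L1, then b3 for L2 (0-based: 1 2)
    }
}
```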

Code Generation
  • Abstraction of the work each filter does
    • Read in a buffer of data from input stream
    • Iterate over the set of data
    • Write out the results to output stream
  • Code generation issues
    • How to get the Cons(b) from the input stream --- unpacking data
    • How to organize the output data for the successive filter --- packing data
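The three-step filter abstraction above (read a buffer, iterate, write results) can be sketched generically. The buffer/stream representation here is our own simplification, not DataCutter's actual interface:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class FilterSketch {
    static <I, O> List<O> runFilter(List<List<I>> inputStream, Function<I, O> work) {
        List<O> outputStream = new ArrayList<>();
        for (List<I> buffer : inputStream) {        // read in a buffer of data
            for (I item : buffer) {                 // iterate over the set of data
                outputStream.add(work.apply(item)); // write results downstream
            }
        }
        return outputStream;
    }

    public static void main(String[] args) {
        // two incoming buffers of "unpacked" items; the work is squaring
        List<List<Integer>> in = List.of(List.of(1, 2), List.of(3));
        System.out.println(runFilter(in, x -> x * x)); // [1, 4, 9]
    }
}
```

Packing and unpacking reduce to how the generated code serializes each buffer's items into the stream and back.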
Experimental Results
  • Goal
    • To show Compiler-generated code is efficient
  • Environment settings
    • 700 MHz Pentium machines
    • Connected through Myrinet LANai 7.0
  • Configurations (# data sites --- # computing sites --- user machine)
    • 1-1-1
    • 2-2-1
    • 4-4-1
Experimental Results
  • Versions
    • Default version
      • Site hosting the data only reads and transmits data, no processing at all
      • User’s desktop only views the results, no processing at all
      • All the work is done by the compute nodes
    • Compiler-generated version
      • Intelligent decomposition is done by the compiler
      • More computations are performed on the end nodes to reduce the communication volume
    • Manual version
      • Hand-written DataCutter filters with similar decomposition as the compiler-generated version

[Figure: in the default version, the computing nodes' workload is heavy and the communication volume high; in the compiler-generated version, the workload is balanced between nodes and the communication volume reduced]

Experimental Results: ISO-Surface Rendering (Z-Buffer Based)

[Charts: execution time vs. width of pipeline for a small dataset (150M) and a large dataset (600M); about 20% improvement over the default version. Speedups: 1.92 and 3.34 (small dataset), 1.99 and 3.82 (large dataset)]

Experimental Results: ISO-Surface Rendering (Active Pixel Based)

[Charts: execution time vs. width of pipeline for a small dataset (150M) and a large dataset (600M); more than 15% improvement over the default version, with speedup close to linear]

Experimental Results: KNN

[Charts: execution time vs. width of pipeline on a 108M dataset for K = 3 and K = 200; more than 150% improvement over the default version. Speedups: 1.89 and 3.38 (K = 3), 1.87 and 3.82 (K = 200)]

Experimental Results: Virtual Microscope

[Charts: execution time vs. width of pipeline for a small query (800M, 512*512) and a large query (800M, 2048*2048); about 40% improvement over the default version]

Experimental Results
  • Summary
    • The compiler-decomposed versions achieve an improvement between 10% and 150% over default versions
    • In most cases, increasing the width of the pipeline results in near-linear speedup
    • The compiler-decomposed versions generally come quite close to the manual version
Ongoing and Future Work
  • Buffer size optimization
  • Cost model refinement & implementation
  • More applications
  • More realistic environment settings: resources dynamically available
Conclusions
  • Coarse-Grained Pipelined Parallelism is desirable & feasible
  • Coarse-Grained Pipelined Parallelism needs language & compiler support
  • An algorithm for required communication analysis is given
  • A greedy algorithm for filter decomposition is developed
  • A cost model is designed
  • Results of detailed evaluation of our compiler are encouraging