
Commutativity Analysis: A New Analysis Framework for Parallelizing Compilers

Martin C. Rinard

University of California, Santa Barbara

Goal

Automatically Parallelize Irregular, Object-Based Computations That Manipulate Dynamic, Linked Data Structures

Structure of Talk
  • Model of Computation
  • Graph Traversal Example
  • Commutativity Testing
    • Basic Technique
    • Practical Extensions
    • Advanced Techniques
  • Synchronization Optimizations
  • Experimental Results
  • Future Research
Model of Computation

[Figure: objects hold state; an executing operation reads the initial object state, writes a new object state, and may invoke further operations on other objects.]

Example

Weighted In Degree Computation

[Figure: example weighted directed graph with edge weights 2, 3, 5, 6, and 7.]

  • Weighted Graph With Weights On Edges
  • Goal Is to Compute, For Each Node, Sum of Weights on All Incoming Edges
  • Serial Algorithm: Marked Depth-First Traversal
Serial Code For Example

class node {
    node *left, *right;
    int left_weight, right_weight;
    int sum;
    boolean marked;
};

void node::traverse(int weight) {
    sum += weight;
    if (!marked) {
        marked = true;
        if (left != NULL) left->traverse(left_weight);
        if (right != NULL) right->traverse(right_weight);
    }
}

Goal: Execute the left and right traverse Operations in Parallel

Parallel Traversal

[Figure: successive snapshots of the traversal executing the left and right subtraversals in parallel over the example graph.]

Traditional Approach
  • Data Dependence Analysis
    • Compiler Analyzes Reads and Writes
    • Finds Independent Pieces of Code
    • Independent Pieces of Code Execute in Parallel
  • Demonstrated Success for Array-Based Programs
    • Dense Matrices
    • Affine Access Functions
Data Dependence Analysis in Example
  • For Data Dependence Analysis to Succeed in Example
    • left and right Traverse Must Be Independent
    • left and right Subgraphs Must Be Disjoint
    • Graph Must Be a Tree
  • Depends on Global Topology of Data Structure
    • Analyze Code that Builds Data Structure
    • Extract and Propagate Topology Information
  • Fails for Graphs - Computations Are Not Independent!
Commuting Operations In Parallel Traversal

[Figure: the two parallel subtraversals reach the same node; the two traverse operations update its sum in either order, with the same final result.]

Commutativity Analysis
  • Compiler Computes Extent of the Computation
    • Representation of all Operations in Computation
    • Algorithm Traverses Call Graph
    • In Example: { node::traverse }
  • Do All Pairs of Operations in Extent Commute?
    • No - Generate Serial Code
    • Yes - Generate Parallel Code
    • In Example: All Pairs Commute
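The extent computation above can be sketched as plain reachability over the call graph; the CallGraph type and string-keyed operation names here are illustrative stand-ins, not the compiler's actual data structures.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical call-graph representation: each operation maps to the
// operations it may directly invoke.
using CallGraph = std::map<std::string, std::vector<std::string>>;

// Compute the extent: every operation transitively reachable from the
// root operation of the computation.
inline std::set<std::string> extent(const CallGraph& cg, const std::string& root) {
    std::set<std::string> seen;
    std::vector<std::string> work{root};
    while (!work.empty()) {
        std::string op = work.back();
        work.pop_back();
        if (!seen.insert(op).second) continue;   // already visited
        auto it = cg.find(op);
        if (it != cg.end())
            for (const auto& callee : it->second) work.push_back(callee);
    }
    return seen;
}
```

For the traversal example, node::traverse invokes only itself, so the extent is the singleton set { node::traverse }.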
Generated Code In Example

Class Declaration

class node {
    lock mutex;
    node *left, *right;
    int left_weight, right_weight;
    int sum;
    boolean marked;
};

Driver Version

void node::traverse(int weight) {
    parallel_traverse(weight);
    wait();
}

Generated Code In Example

void node::parallel_traverse(int weight) {
    mutex.acquire();            // Critical Region
    sum += weight;
    if (!marked) {
        marked = true;
        mutex.release();
        if (left != NULL)
            spawn(left->parallel_traverse(left_weight));
        if (right != NULL)
            spawn(right->parallel_traverse(right_weight));
    } else {
        mutex.release();
    }
}

Properties of Commutativity Analysis
  • Oblivious to Data Structure Topology
    • Local Analysis
    • Simple Analysis
  • Suitable for a Wide Range of Programs
    • Programs that Manipulate Lists, Trees and Graphs
    • Commuting Updates to Central Data Structure
    • General Reductions
    • Incomplete Programs
  • Introduces Synchronization
Separable Operations

Each Operation Consists of Two Sections
  • Object Section: Only Accesses the Receiver Object
  • Invocation Section: Only Invokes Operations
  • Both Sections May Access Parameters and Local Variables

Commutativity Testing Conditions
  • Do Two Operations A and B Commute?
  • Compiler Must Consider Two Potential Execution Orders
    • A executes before B
    • B executes before A
  • Compiler Must Check Two Conditions

    • Instance Variables: in both execution orders, the new values of the instance variables are the same after the execution of the two object sections
    • Invoked Operations: in both execution orders, the two invocation sections together directly invoke the same multiset of operations

Commutativity Testing Algorithm
  • Symbolic Execution
    • Compiler Executes Operations
    • Computes with Expressions Instead of Values
  • Compiler Symbolically Executes Operations in Both Execution Orders, Producing
    • Expressions for New Values of Instance Variables
    • Expressions for the Multiset of Invoked Operations
Checking Instance Variables Condition
  • Compiler Generates Two Symbolic Operations

n->traverse(w1) and n->traverse(w2)

  • In Order n->traverse(w1); n->traverse(w2)
    • New Value of sum = (sum+w1)+w2
    • New Value of marked = true
  • In Order n->traverse(w2); n->traverse(w1)
    • New Value of sum = (sum+w2)+w1
    • New Value of marked = true
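The instance-variables condition can be mirrored concretely. A minimal sketch (not the compiler's symbolic execution, which computes with expressions rather than values): apply the object section of node::traverse to a model of the receiver state in both orders and compare the resulting states.

```cpp
// Minimal model of node::traverse's object section: only the
// instance-variable updates, with the recursive invocations elided.
struct NodeState {
    int  sum    = 0;
    bool marked = false;
};

inline void object_section(NodeState& n, int weight) {
    n.sum += weight;
    if (!n.marked) n.marked = true;
}

// Execute the two operations in a given order and return the final state.
inline NodeState run(int w_first, int w_second) {
    NodeState n;
    object_section(n, w_first);
    object_section(n, w_second);
    return n;
}
```

Both orders leave sum = (sum + w1) + w2 = (sum + w2) + w1 and marked = true, matching the expressions above.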
Checking Invoked Operations Condition
  • In Order n->traverse(w1); n->traverse(w2), the Multiset of Invoked Operations Is

    if (!marked && left != NULL) left->traverse(left_weight),
    if (!marked && right != NULL) right->traverse(right_weight)

  • In Order n->traverse(w2); n->traverse(w1), the Multiset of Invoked Operations Is

    if (!marked && left != NULL) left->traverse(left_weight),
    if (!marked && right != NULL) right->traverse(right_weight)

Expression Simplification and Comparison
  • Compiler Applies Rewrite Rules to Simplify Expressions
    • b+(a+c) => (a+b+c)
    • a*(b+c) => (a*b)+(a*c)
    • a+if(b<c,d,e) => if(b<c,a+d,a+e)
  • Compiler Compares Corresponding Expressions
    • If All Equal - Operations Commute
    • If Not All Equal - Operations May Not Commute
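The comparison step can be illustrated with a toy normal form: flattening nested sums into one term list and sorting realizes the associativity/commutativity rule b+(a+c) => (a+b+c), so (sum+w1)+w2 and (sum+w2)+w1 compare equal. This is a hedged sketch, far simpler than the compiler's rewrite system (no distribution or conditional rules).

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Toy sum expression: either a leaf symbol or a '+' node over children.
struct Expr {
    std::string sym;              // non-empty => leaf
    std::vector<Expr> addends;    // children when sym is empty
};

// Gather all leaf terms of a (possibly nested) sum.
inline void collect(const Expr& e, std::vector<std::string>& out) {
    if (!e.sym.empty()) { out.push_back(e.sym); return; }
    for (const auto& c : e.addends) collect(c, out);
}

// Normal form: flattened, sorted term list. Two sums built from the same
// terms in any association/order get identical normal forms.
inline std::vector<std::string> normal_form(const Expr& e) {
    std::vector<std::string> terms;
    collect(e, terms);
    std::sort(terms.begin(), terms.end());
    return terms;
}
```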
Practical Extensions

Exploit Read-Only Data

  • Recognize When Computed Values Depend Only On
    • Unmodified Instance Variables or Global Variables
    • Parameters
  • Represent Computed Values Using Opaque Constants
  • Increases Set of Programs that Compiler Can Analyze
  • Operations Can Freely Access Read-Only Data

Coarsen Commutativity Testing Granularity

  • Integrate Operations into Callers for Analysis Purposes
  • Mechanism: Interprocedural Symbolic Execution
  • Increases Effectiveness of Commutativity Testing
Advanced Techniques
  • Relative Commutativity

Recognize Commuting Operations That Generate Equivalent But Not Identical Data Structures

  • Techniques for Operations that Contain Conditionals
    • Distribute Conditionals Out of Expressions
    • Test for Equivalence By Doing Case Analysis
  • Techniques for Operations that Access Arrays
    • Use Array Update Expressions to Represent New Values
    • Rewrite Rules for Array Update Expressions
  • Techniques for Operations that Execute Loops
Commutativity Testing for Operations With Loops
  • Prerequisite: Represent Values Computed In Loops
  • View Body of Loop as an Expression Transformer
    • Input Expressions: Values Before Iteration Executes
    • Output Expressions: Values After Iteration Executes
  • Represent Values Computed in the Loop Using Recursively Defined Symbolic Loop Modeling Functions

    int t = sum; for (i) t = t + a[i]; sum = t;

    s(e,0) = e
    s(e,i+1) = s(e,i) + a[i]

    New Value of sum = s(sum,n)

  • Use Nested Induction Proofs to Determine Equivalence of Expressions With Symbolic Loop Modeling Functions
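The recursively defined loop modeling function s can be written out directly. A sketch comparing it against the loop it models (the array contents are arbitrary illustrative values; the compiler manipulates s symbolically rather than evaluating it):

```cpp
#include <vector>

// Loop modeling function for: int t = sum; for (i) t = t + a[i]; sum = t;
//   s(e, 0)   = e
//   s(e, i+1) = s(e, i) + a[i]
inline int s(int e, int i, const std::vector<int>& a) {
    return i == 0 ? e : s(e, i - 1, a) + a[i - 1];
}

// The concrete loop, for comparison.
inline int run_loop(int sum, const std::vector<int>& a) {
    int t = sum;
    for (int x : a) t += x;
    return t;                     // new value of sum = s(sum, n)
}
```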
Important Special Case
  • Independent Operations Commute
  • Analysis in Current Compiler
    • Dependence Analysis
      • Operations on Objects of Different Classes
      • Independent Operations on Objects of Same Class
    • Symbolic Commutativity Testing
      • Dependent Operations on Objects of Same Class
  • Future
    • Integrate Shape Analysis
    • Integrate Array Data Dependence Analysis
Programming Model Extensions
  • Extensions for Read-Only Data
    • Allow Operations to Freely Access Read-Only Data
    • Enhances Ability of Compiler to Represent Expressions
    • Increases Set of Programs that Compiler Can Analyze
  • Analysis Granularity Extensions
    • Integrate Operations Into Callers for Analysis Purposes
    • Coarsens Commutativity Testing Granularity
      • Reduces Number of Pairs Tested for Commutativity
      • Enhances Effectiveness of Commutativity Testing
Optimizations
  • Parallel Loop Optimization
  • Suppress Exploitation of Excess Concurrency
  • Synchronization Optimizations
    • Eliminate Synchronization Constructs in Methods that Only Access Read-Only Data
    • Lock Coarsening
      • Replaces Multiple Mutual Exclusion Regions with a Single Larger Mutual Exclusion Region
Default Code Generation Strategy

Each Object Has Its Own Mutual Exclusion Lock

Each Operation Acquires and Releases the Lock

Simple Lock Optimization: Eliminate Lock Constructs in Operations That Only Access Read-Only Data

Data Lock Coarsening Transformation
  • Compiler Gives Multiple Objects the Same Lock
    • Current Policy: Nested Objects Use the Lock in Enclosing Object
  • Finds Sequences of Operations That
    • Access Different Objects
    • Acquire and Release the Same Lock
  • Original Code: Each Operation Acquires and Releases the Lock
  • Transformed Code
    • Acquires the Lock Once at the Beginning of the Sequence
    • Releases the Lock Once at the End of the Sequence
Data Lock Coarsening Example

Original Code

class vector {
    lock mutex;
    double val[NDIM];
};

void vector::add(double *v) {
    mutex.acquire();
    for (int i = 0; i < NDIM; i++)
        val[i] += v[i];
    mutex.release();
}

class body {
    lock mutex;
    double phi;
    vector acc;
};

void body::gravsub(body *b) {
    double p, v[NDIM];
    mutex.acquire();
    p = computeInter(b,v);
    phi -= p;
    mutex.release();
    acc.add(v);
}

Transformed Code

class vector {
    double val[NDIM];
};

void vector::add(double *v) {
    for (int i = 0; i < NDIM; i++)
        val[i] += v[i];
}

class body {
    lock mutex;
    double phi;
    vector acc;
};

void body::gravsub(body *b) {
    double p, v[NDIM];
    mutex.acquire();
    p = computeInter(b,v);
    phi -= p;
    acc.add(v);
    mutex.release();
}

Data Lock Coarsening Tradeoff
  • Advantage:
    • Reduces Number of Executed Acquires and Releases
    • Reduces Acquire and Release Overhead
  • Disadvantage: May Cause False Exclusion
    • Multiple Parallel Operations Access Different Objects
    • But Operations Attempt to Acquire Same Lock
    • Result: Operations Execute Serially
Computation Lock Coarsening Transformation
  • Compiler Finds Sequences of Operations
    • Acquire and Release Same Lock
  • Transformed Code
    • Acquires Lock Once at Beginning of Sequence
    • Releases Lock Once at End of Sequence
  • Result
    • Replaces Multiple Mutual Exclusion Regions with One Large Mutual Exclusion Region
  • Algorithm Based On Local Transformations
    • Move Lock Acquire and Release To Become Adjacent
    • Eliminate Adjacent Acquire and Release
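The local transformations can be sketched on a straight-line trace of lock operations: once acquires and releases of the same lock have been moved adjacent, a release immediately followed by an acquire cancels, merging the two critical regions. This toy version assumes a single lock and a flat operation list, unlike the compiler's real intermediate representation.

```cpp
#include <string>
#include <vector>

// Toy coarsening rewrite over an operation trace: cancel each adjacent
// "release; acquire" pair on the (single, assumed) lock, fusing the
// surrounding mutual exclusion regions into one.
inline std::vector<std::string> coarsen(std::vector<std::string> ops) {
    std::vector<std::string> out;
    for (const auto& op : ops) {
        if (op == "acquire" && !out.empty() && out.back() == "release")
            out.pop_back();                 // eliminate adjacent release;acquire
        else
            out.push_back(op);
    }
    return out;
}
```

Two back-to-back regions, acquire-update-release twice, coarsen into one region that holds the lock across both updates.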
Computation Lock Coarsening Example

Original Code

class body {
    lock mutex;
    double phi;
    vector acc;
};

void body::gravsub(body *b) {
    double p, v[NDIM];
    mutex.acquire();
    p = computeInter(b,v);
    phi -= p;
    acc.add(v);
    mutex.release();
}

void body::loopsub(body *b) {
    int i;
    for (i = 0; i < N; i++) {
        this->gravsub(b+i);
    }
}

Optimized Code

class body {
    lock mutex;
    double phi;
    vector acc;
};

void body::gravsub(body *b) {
    double p, v[NDIM];
    p = computeInter(b,v);
    phi -= p;
    acc.add(v);
}

void body::loopsub(body *b) {
    int i;
    mutex.acquire();
    for (i = 0; i < N; i++) {
        this->gravsub(b+i);
    }
    mutex.release();
}

Computation Lock Coarsening Tradeoff
  • Advantage:
    • Reduces Number of Executed Acquires and Releases
    • Reduces Acquire and Release Overhead
  • Disadvantage: May Introduce False Contention
    • Multiple Processors Attempt to Acquire Same Lock
    • Processor Holding the Lock is Executing Code that was Originally in No Mutual Exclusion Region
Managing Tradeoff: Lock Coarsening Policies
  • To Manage Tradeoff, Compiler Must Successfully
    • Reduce Lock Overhead by Increasing Lock Granularity
    • Avoid Excessive False Exclusion and False Contention
  • Original Policy
    • Use Original Lock Algorithm
  • Bounded Policy
    • Apply Transformation Unless Transformed Code
      • Holds Lock During a Recursive Call, or
      • Holds Lock During a Loop that Invokes Operations
  • Aggressive Policy
    • Always Apply Transformation
Choosing Best Policy
  • Best Policy May Depend On
    • Topology of Data Structures
    • Dynamic Schedule Of Computation
  • Information Required to Choose Best Policy Unavailable At Compile Time
  • Complications
    • Different Phases May Have Different Best Policy
    • In Same Phase, Best Policy May Change Over Time
Use Dynamic Feedback to Choose Best Policy
  • Sampling Phase: Measures Overhead of Different Policies
  • Production Phase: Uses Best Policy From Sampling Phase
  • Periodically Resample to Discover Changes in Best Policy
  • Guaranteed Performance Bounds
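A minimal sketch of the sampling/production cycle, with each policy abstracted to a callable that reports its measured overhead for one sampling interval. The Policy type and pick_best name are illustrative; a real implementation times actual executions, runs the winner for a production interval, and then resamples.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical policy model: invoking the callable runs one sampling
// interval under that policy and returns its measured overhead.
using Policy = std::function<double()>;

// Sampling phase: try every policy once and record the cheapest.
// The production phase would then keep running policies[best] until
// the next sampling phase.
inline std::size_t pick_best(const std::vector<Policy>& policies) {
    std::size_t best = 0;
    double best_cost = policies[0]();
    for (std::size_t i = 1; i < policies.size(); ++i) {
        double c = policies[i]();
        if (c < best_cost) { best_cost = c; best = i; }
    }
    return best;
}
```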

[Figure: overhead over time; a sampling phase measures the Original, Bounded, and Aggressive policies, a production phase then runs the best one, and the cycle repeats at the next sampling phase.]

Methodology
  • Built Prototype Compiler for Subset of C++
  • Built Run Time System for Shared Memory Machines
    • Concurrency Generation and Task Management
    • Dynamic Load Balancing and Synchronization
  • Acquired Three Complete Applications
    • Barnes-Hut
    • Water
    • String
  • Automatically Parallelized Applications
  • Ran Applications on Stanford DASH Machine
  • Compare with Highly Tuned, Explicitly Parallel Versions
Major Assumptions and Restrictions
  • Assumption: No Violation of Type Declarations
  • Conceptually Significant Restrictions
    • No Virtual Functions
    • No Function Pointers
    • No Exceptions
    • Operations Access Only
      • Parameters
      • Read-Only Data
      • Data Members Declared in Class of the Receiver
  • Implementation Convenience Restrictions
    • No Multiple Inheritance
    • No Templates
    • No union, struct or enum Types
    • No typedef Declarations
    • Global Variables Must Be of Class Types
    • No Static Members
    • No Default Arguments or Variable Numbers of Arguments
    • No Numeric Casts
Applications
  • Barnes-Hut
    • O(NlgN) N-Body Solver
    • Space Subdivision Tree
    • 1500 Lines of C++ Code
  • Water
    • Simulates Liquid Water
    • O(N^2) Algorithm
    • 1850 Lines of C++ Code
  • String
    • Computes Model of Geology Between Two Oil Wells
    • 2050 Lines of C++ Code
Obtaining Serial C++ Version of Barnes-Hut
  • Started with Explicitly Parallel Version (SPLASH-2)
  • Removed Parallel Constructs to get Serial C
  • Converted to Clean Object-Based C++
  • Major Structural Changes
    • Eliminated Scheduling Code and Data Structures
    • Split a Loop in Force Computation Phase
    • Introduced New Field into Particle Data Structure
Obtaining Serial C++ Version of Water
  • Started with Serial C Translated from Fortran
  • Converted to Clean Object-Based C++
  • Major Structural Change
    • Auxiliary Objects for O(N^2) phases
Obtaining Serial C++ Version of String
  • Started With Serial C Translated From Fortran
  • Converted to Clean C++
  • No Major Structural Changes
Performance Results for Barnes-Hut and Water

[Figure: speedup vs. number of processors (up to 16); curves for Ideal, Explicitly Parallel, and Commutativity Analysis versions. Left: Barnes-Hut on DASH, 16K particles. Right: Water on DASH, 512 molecules.]

Performance Results for String

[Figure: speedup vs. number of processors (up to 16); curves for Ideal, Explicitly Parallel, and Commutativity Analysis versions. String on DASH, Big Well model.]

Synchronization Optimizations
  • Generated A Version of Each Application for Each Lock Coarsening Policy
    • Original
    • Bounded
    • Aggressive
    • Dynamic Feedback
  • Ran Applications on Stanford DASH Machine
Lock Overhead

Percentage of Time that the Single-Processor Execution Spends Acquiring and Releasing Mutual Exclusion Locks

[Figure: percentage lock overhead under the Original, Bounded, and Aggressive policies for Barnes-Hut on DASH (16K particles), Water on DASH (512 molecules), and String on DASH (Big Well model); the Aggressive policy has the lowest overhead in each case.]

Contention Overhead for Barnes-Hut

Percentage of Time that Processors Spend Waiting to Acquire Locks Held by Other Processors

[Figure: contention percentage vs. number of processors (up to 16) under the Original, Bounded, and Aggressive policies; Barnes-Hut on DASH, 16K particles.]

Contention Overhead for Water

Percentage of Time that Processors Spend Waiting to Acquire Locks Held by Other Processors

[Figure: contention percentage vs. number of processors (up to 16) under the Original, Bounded, and Aggressive policies; Water on DASH, 512 molecules.]

Contention Overhead for String

Percentage of Time that Processors Spend Waiting to Acquire Locks Held by Other Processors

[Figure: contention percentage vs. number of processors (up to 16) under the Original and Aggressive policies; String on DASH, Big Well model.]

Performance Results for Barnes-Hut and Water

[Figure: speedup vs. number of processors (up to 16) under the Original, Bounded, Aggressive, and Dynamic Feedback policies, with the ideal line. Left: Barnes-Hut on DASH, 16K particles. Right: Water on DASH, 512 molecules.]

Performance Results for String

[Figure: speedup vs. number of processors (up to 16) under the Original, Aggressive, and Dynamic Feedback policies, with the ideal line; String on DASH, Big Well model.]

New Directions For Parallelizing Compilers
  • Presented Results
    • Complete Computations
    • Shared Memory Machines
  • Larger Class of Hardware Platforms
    • Clusters of Shared Memory Machines
    • Fine-Grain Parallel Machines
  • Larger Class of Computations
    • Migratory Computations
    • Persistent Data
    • Multithreaded Servers
Migratory Computations

Goal: Parallelize Computation

Key Issues: Incomplete Programs, Dynamic Parallelization, Persistent Data

Multithreaded Servers

Goals: Better Throughput, Better Response Time

Key Issues: Semantics of Communication and Resource Allocation Primitives, Consistency of Shared State

[Figure: server threads repeatedly accept requests and send responses.]

Future
  • Key Trends
    • Increasing Need For Parallel Computing
    • Increasing Availability of Cheap Parallel Hardware
  • Key Challenge
    • Effectively Support Development of Robust, Efficient Parallel Software for Wide Range of Applications
  • Successful Approach Will
    • Leverage Structure in Modern Programming Paradigms
    • Use Advanced Compiler Technology
    • Provide A Simpler, More Effective Programming Model
Conclusion
  • New Analysis Framework for Parallelizing Compilers
    • Commutativity Analysis
  • New Class of Applications for Parallelizing Compilers
    • Irregular Computations
    • Dynamic, Linked Data Structures
  • Current Status
    • Implemented Prototype
    • Good Results
  • Future
    • New Hardware Platforms
    • New Application Domains