
Commutativity Analysis: A New Analysis Framework for Parallelizing Compilers


Presentation Transcript


  1. Commutativity Analysis: A New Analysis Framework for Parallelizing Compilers Martin C. Rinard University of California, Santa Barbara

  2. Goal Automatically Parallelize Irregular, Object-Based Computations That Manipulate Dynamic, Linked Data Structures

  3. Structure of Talk • Model of Computation • Graph Traversal Example • Commutativity Testing • Basic Technique • Practical Extensions • Advanced Techniques • Synchronization Optimizations • Experimental Results • Future Research

  4. Model of Computation [Figure: objects and operations; an executing operation takes the initial object state to a new object state and produces invoked operations]

  5. Example: Weighted In-Degree Computation [Figure: example weighted graph] • Weighted Graph With Weights On Edges • Goal Is to Compute, For Each Node, the Sum of Weights on All Incoming Edges • Serial Algorithm: Marked Depth-First Traversal

  6. Serial Code For Example [Figure: example graph]

  class node {
    node *left, *right;
    int left_weight, right_weight;
    int sum;
    boolean marked;
  };
  void node::traverse(int weight) {
    sum += weight;
    if (!marked) {
      marked = true;
      if (left != NULL) left->traverse(left_weight);
      if (right != NULL) right->traverse(right_weight);
    }
  }

  Goal: Execute left and right traverse Operations In Parallel

  7. Parallel Traversal [Figure: successive snapshots of the parallel traversal over the example graph]

  8. Parallel Traversal [Figure: further snapshots of the parallel traversal over the example graph]

  9. Traditional Approach • Data Dependence Analysis • Compiler Analyzes Reads and Writes • Finds Independent Pieces of Code • Independent Pieces of Code Execute in Parallel • Demonstrated Success for Array-Based Programs • Dense Matrices • Affine Access Functions

  10. Data Dependence Analysis in Example • For Data Dependence Analysis to Succeed in Example • left and right Traverse Must Be Independent • left and right Subgraphs Must Be Disjoint • Graph Must Be a Tree • Depends on Global Topology of Data Structure • Analyze Code that Builds Data Structure • Extract and Propagate Topology Information • Fails for Graphs - Computations Are Not Independent!

  11. Commuting Operations In Parallel Traversal [Figure: traversal snapshots showing commuting updates to the example graph]

  12. Commutativity Analysis • Compiler Computes Extent of the Computation • Representation of all Operations in Computation • Algorithm Traverses Call Graph • In Example: { node::traverse } • Do All Pairs of Operations in Extent Commute? • No - Generate Serial Code • Yes - Generate Parallel Code • In Example: All Pairs Commute

  13. Generated Code In Example

  Class Declaration:
  class node {
    lock mutex;
    node *left, *right;
    int left_weight, right_weight;
    int sum;
    boolean marked;
  };

  Driver Version:
  void node::traverse(int weight) {
    parallel_traverse(weight);
    wait();
  }

  14. Generated Code In Example

  void node::parallel_traverse(int weight) {
    mutex.acquire();   // critical region
    sum += weight;
    if (!marked) {
      marked = true;
      mutex.release();
      if (left != NULL) spawn(left->parallel_traverse(left_weight));
      if (right != NULL) spawn(right->parallel_traverse(right_weight));
    } else {
      mutex.release();
    }
  }

  15. Properties of Commutativity Analysis • Oblivious to Data Structure Topology • Local Analysis • Simple Analysis • Suitable for a Wide Range of Programs • Programs that Manipulate Lists, Trees and Graphs • Commuting Updates to Central Data Structure • General Reductions • Incomplete Programs • Introduces Synchronization

  16. Commutativity Testing

  17. Separable Operations • Each Operation Consists of Two Sections • Object Section: Only Accesses Receiver Object • Invocation Section: Only Invokes Operations • Both Sections May Access Parameters and Local Variables

  18. Commutativity Testing Conditions • Do Two Operations A and B Commute? • Compiler Must Consider Two Potential Execution Orders • A executes before B • B executes before A • Compiler Must Check Two Conditions • Instance Variables: In both execution orders, the new values of the instance variables are the same after the execution of the two object sections • Invoked Operations: In both execution orders, the two invocation sections together directly invoke the same multiset of operations

  19. Commutativity Testing Algorithm • Symbolic Execution • Compiler Executes Operations • Computes with Expressions Instead of Values • Compiler Symbolically Executes Operations In Both Execution Orders • Expressions for New Values of Instance Variables • Expressions for Multiset of Invoked Operations

  20. Checking Instance Variables Condition • Compiler Generates Two Symbolic Operations n->traverse(w1) and n->traverse(w2) • In Order n->traverse(w1); n->traverse(w2) • New Value of sum = (sum+w1)+w2 • New Value of marked = true • In Order n->traverse(w2); n->traverse(w1) • New Value of sum = (sum+w2)+w1 • New Value of marked = true

  21. Checking Invoked Operations Condition • In Order n->traverse(w1); n->traverse(w2) Multiset of Invoked Operations Is if (!marked&&left!=NULL) left->traverse(left_weight), if (!marked&&right!=NULL) right->traverse(right_weight) • In Order n->traverse(w2); n->traverse(w1) Multiset of Invoked Operations Is if (!marked&&left!=NULL) left->traverse(left_weight), if (!marked&&right!=NULL) right->traverse(right_weight)

  22. Expression Simplification and Comparison • Compiler Applies Rewrite Rules to Simplify Expressions • b+(a+c) => (a+b+c) • a*(b+c) => (a*b)+(a*c) • a+if(b<c,d,e) => if(b<c,a+d,a+e) • Compiler Compares Corresponding Expressions • If All Equal - Operations Commute • If Not All Equal - Operations May Not Commute

  23. Practical Extensions Exploit Read-Only Data • Recognize When Computed Values Depend Only On • Unmodified Instance Variables or Global Variables • Parameters • Represent Computed Values Using Opaque Constants • Increases Set of Programs that Compiler Can Analyze • Operations Can Freely Access Read-Only Data Coarsen Commutativity Testing Granularity • Integrate Operations into Callers for Analysis Purposes • Mechanism: Interprocedural Symbolic Execution • Increases Effectiveness of Commutativity Testing

  24. Advanced Techniques • Relative Commutativity Recognize Commuting Operations That Generate Equivalent But Not Identical Data Structures • Techniques for Operations that Contain Conditionals • Distribute Conditionals Out of Expressions • Test for Equivalence By Doing Case Analysis • Techniques for Operations that Access Arrays • Use Array Update Expressions to Represent New Values • Rewrite Rules for Array Update Expressions • Techniques for Operations that Execute Loops

  25. Commutativity Testing for Operations With Loops • Prerequisite: Represent Values Computed In Loops • View Body of Loop as an Expression Transformer • Input Expressions: Values Before Iteration Executes • Output Expressions: Values After Iteration Executes • Represent Values Computed In Loop Using Recursively Defined Symbolic Loop Modeling Functions • Example: int t = sum; for (i) t = t + a[i]; sum = t; is modeled by s(e,0) = e, s(e,i+1) = s(e,i) + a[i], so the New Value of sum = s(sum,n) • Use Nested Induction Proofs to Determine Equivalence of Expressions With Symbolic Loop Modeling Functions

  26. Important Special Case • Independent Operations Commute • Analysis in Current Compiler • Dependence Analysis • Operations on Objects of Different Classes • Independent Operations on Objects of Same Class • Symbolic Commutativity Testing • Dependent Operations on Objects of Same Class • Future • Integrate Shape Analysis • Integrate Array Data Dependence Analysis

  27. Steps to Practicality

  28. Programming Model Extensions • Extensions for Read-Only Data • Allow Operations to Freely Access Read-Only Data • Enhances Ability of Compiler to Represent Expressions • Increases Set of Programs that Compiler Can Analyze • Analysis Granularity Extensions • Integrate Operations Into Callers for Analysis Purposes • Coarsens Commutativity Testing Granularity • Reduces Number of Pairs Tested for Commutativity • Enhances Effectiveness of Commutativity Testing

  29. Optimizations • Parallel Loop Optimization • Suppress Exploitation of Excess Concurrency • Synchronization Optimizations • Eliminate Synchronization Constructs in Methods that Only Access Read-Only Data • Lock Coarsening • Replaces Multiple Mutual Exclusion Regions with • Single Larger Mutual Exclusion Region

  30. Synchronization Optimizations

  31. Default Code Generation Strategy • Each Object Has its Own Mutual Exclusion Lock • Each Operation Acquires and Releases the Lock • Simple Lock Optimization: Eliminate Lock Constructs In Operations That Only Access Read-Only Data

  32. Data Lock Coarsening Transformation • Compiler Gives Multiple Objects the Same Lock • Current Policy: Nested Objects Use the Lock in the Enclosing Object • Compiler Finds Sequences of Operations That • Access Different Objects • Acquire and Release the Same Lock • Original Code: Each Operation Acquires and Releases the Lock • Transformed Code: Acquires the Lock Once At the Beginning of the Sequence, Releases It Once At the End of the Sequence

  33. Data Lock Coarsening Example

  Original Code:
  class vector {
    lock mutex;
    double val[NDIM];
  };
  void vector::add(double *v) {
    mutex.acquire();
    for (int i = 0; i < NDIM; i++) val[i] += v[i];
    mutex.release();
  }
  class body {
    lock mutex;
    double phi;
    vector acc;
  };
  void body::gravsub(body *b) {
    double p, v[NDIM];
    mutex.acquire();
    p = computeInter(b, v);
    phi -= p;
    mutex.release();
    acc.add(v);
  }

  Transformed Code:
  class vector {
    double val[NDIM];
  };
  void vector::add(double *v) {
    for (int i = 0; i < NDIM; i++) val[i] += v[i];
  }
  class body {
    lock mutex;
    double phi;
    vector acc;
  };
  void body::gravsub(body *b) {
    double p, v[NDIM];
    mutex.acquire();
    p = computeInter(b, v);
    phi -= p;
    acc.add(v);
    mutex.release();
  }

  34. Data Lock Coarsening Tradeoff • Advantage: • Reduces Number of Executed Acquires and Releases • Reduces Acquire and Release Overhead • Disadvantage: May Cause False Exclusion • Multiple Parallel Operations Access Different Objects • But Operations Attempt to Acquire Same Lock • Result: Operations Execute Serially

  35. Computation Lock Coarsening Transformation • Compiler Finds Sequences of Operations • Acquire and Release Same Lock • Transformed Code • Acquires Lock Once at Beginning of Sequence • Releases Lock Once at End of Sequence • Result • Replaces Multiple Mutual Exclusion Regions With • One Large Mutual Exclusion Region • Algorithm Based On Local Transformations • Move Lock Acquire and Release To Become Adjacent • Eliminate Adjacent Acquire and Release

  36. Computation Lock Coarsening Example

  Original Code:
  class body {
    lock mutex;
    double phi;
    vector acc;
  };
  void body::gravsub(body *b) {
    double p, v[NDIM];
    mutex.acquire();
    p = computeInter(b, v);
    phi -= p;
    acc.add(v);
    mutex.release();
  }
  void body::loopsub(body *b) {
    int i;
    for (i = 0; i < N; i++) {
      this->gravsub(b+i);
    }
  }

  Optimized Code:
  class body {
    lock mutex;
    double phi;
    vector acc;
  };
  void body::gravsub(body *b) {
    double p, v[NDIM];
    p = computeInter(b, v);
    phi -= p;
    acc.add(v);
  }
  void body::loopsub(body *b) {
    int i;
    mutex.acquire();
    for (i = 0; i < N; i++) {
      this->gravsub(b+i);
    }
    mutex.release();
  }

  37. Computation Lock Coarsening Tradeoff • Advantage: • Reduces Number of Executed Acquires and Releases • Reduces Acquire and Release Overhead • Disadvantage: May Introduce False Contention • Multiple Processors Attempt to Acquire Same Lock • Processor Holding the Lock is Executing Code that was Originally in No Mutual Exclusion Region

  38. Managing Tradeoff: Lock Coarsening Policies • To Manage Tradeoff, Compiler Must Successfully • Reduce Lock Overhead by Increasing Lock Granularity • Avoid Excessive False Exclusion and False Contention • Original Policy • Use Original Lock Algorithm • Bounded Policy • Apply Transformation Unless Transformed Code • Holds Lock During a Recursive Call, or • Holds Lock During a Loop that Invokes Operations • Aggressive Policy • Always Apply Transformation

  39. Choosing Best Policy • Best Policy May Depend On • Topology of Data Structures • Dynamic Schedule Of Computation • Information Required to Choose Best Policy Unavailable At Compile Time • Complications • Different Phases May Have Different Best Policy • In Same Phase, Best Policy May Change Over Time

  40. Use Dynamic Feedback to Choose Best Policy • Sampling Phase: Measures Overhead of Different Policies • Production Phase: Uses Best Policy From Sampling Phase • Periodically Resample to Discover Changes in Best Policy • Guaranteed Performance Bounds [Figure: overhead over time, alternating sampling phases (Original, Bounded, Aggressive) and production phases]

  41. Experimental Results

  42. Methodology • Built Prototype Compiler for Subset of C++ • Built Run Time System for Shared Memory Machines • Concurrency Generation and Task Management • Dynamic Load Balancing and Synchronization • Acquired Three Complete Applications • Barnes-Hut • Water • String • Automatically Parallelized Applications • Ran Applications on Stanford DASH Machine • Compare with Highly Tuned, Explicitly Parallel Versions

  43. Major Assumptions and Restrictions • Assumption: No Violation of Type Declarations • Restrictions: • Conceptually Significant • No Virtual Functions • No Function Pointers • No Exceptions • Operations Access Only • Parameters • Read-Only Data • Data Members Declared in Class of the Receiver • Implementation Convenience • No Multiple Inheritance • No Templates • No union, struct or enum Types • No typedef Declarations • Global Variables Must Be of Class Types • No Static Members • No Default Arguments or Variable Numbers of Arguments • No Numeric Casts

  44. Applications • Barnes-Hut • O(N lg N) N-Body Solver • Space Subdivision Tree • 1500 Lines of C++ Code • Water • Simulates Liquid Water • O(N^2) Algorithm • 1850 Lines of C++ Code • String • Computes Model of Geology Between Two Oil Wells • 2050 Lines of C++ Code

  45. Obtaining Serial C++ Version of Barnes-Hut • Started with Explicitly Parallel Version (SPLASH-2) • Removed Parallel Constructs to get Serial C • Converted to Clean Object-Based C++ • Major Structural Changes • Eliminated Scheduling Code and Data Structures • Split a Loop in Force Computation Phase • Introduced New Field into Particle Data Structure

  46. Obtaining Serial C++ Version of Water • Started with Serial C Translated from Fortran • Converted to Clean Object-Based C++ • Major Structural Change • Auxiliary Objects for O(N^2) phases

  47. Obtaining Serial C++ Version of String • Started With Serial C Translated From Fortran • Converted to Clean C++ • No Major Structural Changes

  48. Performance Results for Barnes-Hut and Water [Figure: Speedup vs. Number of Processors (0 to 16), with Ideal, Explicitly Parallel, and Commutativity Analysis curves; Barnes-Hut on DASH, 16K Particles; Water on DASH, 512 Molecules]

  49. Performance Results for String [Figure: Speedup vs. Number of Processors (0 to 16), with Ideal, Explicitly Parallel, and Commutativity Analysis curves; String on DASH, Big Well Model]

  50. Synchronization Optimizations • Generated A Version of Each Application for Each Lock Coarsening Policy • Original • Bounded • Aggressive • Dynamic Feedback • Ran Applications on Stanford DASH Machine
