
X10 Tutorial PSC Software Productivity Study May 23 – 27, 2005




Presentation Transcript


1. X10 Tutorial
PSC Software Productivity Study, May 23 – 27, 2005
Vivek Sarkar
IBM T.J. Watson Research Center
vsarkar@us.ibm.com
This work has been supported in part by the Defense Advanced Research Projects Agency (DARPA) under contract No. NBCH30390004.

2. Acknowledgments
• Additional contributors to PSC Productivity Study: David Bader, Bill Clark, Nick Nystrom, John Urbanic
• X10 core team: Philippe Charles, Chris Donawa, Kemal Ebcioglu, Christian Grothoff, Allan Kielstra, Christoph von Praun, Vivek Sarkar, Vijay Saraswat
• X10 productivity team: Catalina Danis, Christine Halverson

3. Outline
• What is X10? --- background, status
• Basic X10 (single place) --- async, finish, atomic; future, force
• Basic X10 (arrays & loops) --- points, rectangular regions, arrays; for, foreach
• Scalable X10 (multiple places) --- places, distributions, distributed arrays, ateach, BadPlaceException
• Clocks --- creation, registration, next, resume, drop, ClockUseException
• Basic serial constructs that differ from Java --- const, nullable, extern
• Advanced topics --- value types, conditional atomic sections (when), general regions & distributions
Refer to the language spec for details.

4. What is X10?
• X10 is a new experimental language developed in the IBM PERCS project as part of the DARPA program on High Productivity Computing Systems (HPCS)
• X10's goal is to provide a new parallel programming model and its embodiment in a high-level language that:
• is more productive than current models,
• can support higher levels of abstraction better than current models, and
• can exploit the multiple levels of parallelism and non-uniform data access that are critical for obtaining scalable performance in current and future HPC systems.

5. X10 status and schedule
• 6/2003: PERCS programming model concept (end of PERCS Phase 1)
• 2/2004: Kickoff of X10 as concrete embodiment of the PERCS programming model as a new language
• 7/2004: Start of PERCS Phase 2; first draft of the X10 language specification
• 2/2005: First X10 implementation --- unoptimized single-VM prototype that emulates distributed parallelism in a single process; this is what you will use to run X10 programs this week
• 5/2005: X10 productivity study at Pittsburgh Supercomputing Center
• 7/2005: Results from X10 application & productivity studies
• 2H2005: Revise language based on application & productivity feedback; start participation in High Productivity Language "consortium"?
• 1/2006: Second X10 implementation --- optimized multi-VM prototype
• 6/2006: Open source release of the X10 reference implementation; design completed for production X10 implementation in Phase 3 (end of Phase 2)

6. Current X10 Environment: Unoptimized Single-VM Implementation
• Foo.x10 --- X10 source program; must contain a class named Foo with a "public static void main(String[] args)" method
• x10c Foo.x10 --- X10 compiler; translates Foo.x10 to Foo.java, then uses javac to generate Foo.class from Foo.java ("// #line" pseudocomments in Foo.java record the source line mapping back to Foo.x10)
• x10 Foo.x10 --- runs the program on the X10 Virtual Machine (JVM + J2SE libraries + X10 libraries + X10 multithreaded runtime); external DLLs are reached through the X10 extern interface
• Outputs: X10 program output, plus X10 abstract performance metrics (event counts, distribution efficiency)
Caveat: this is a prototype implementation with many limitations. Please be patient!

7. Examples of X10 Compiler Error Messages
Case 1: Error message identifies source file and line number; carets indicate the column range
x10c TutError1.x10
TutError1.x10:8: Could not find field or local variable "evenSum".
for (int i = 2 ; i <= n ; i += 2 ) evenSum += i;
^----^
Case 2: Error message identifies source file, line number, and column range
x10c TutError2.x10
x10c: TutError2.x10:4:27:4:27: unexpected token(s) ignored
Case 3: Error message reported by the Java compiler --- look for the #line comment in the .java file to identify the X10 source location
x10c TutError3.x10
x10c: C:\vivek\eclipse\workspace\x10\examples\Tutorial\TutError3.java:49: local variable n is accessed from within inner class; needs to be declared final

8. Future X10 Environment
• Very High Level Languages (VHLLs) and Domain Specific Languages (DSLs) --- implicit parallelism, implicit data distributions
• X10 High Level Language + X10 Libraries --- X10 places provide an abstraction of explicit control & data distribution
• X10 Deployment --- mapping of places to nodes in the HPC parallel environment
• HPC Runtime Environment (Parallel Environment, MPI, LAPI, ...) --- primitive constructs for parallelism, communication, and synchronization
• HPC Parallel System --- target system for the parallel application

9. Future X10 Environment: Targeting Scalable HPC Parallel Systems
[Figure: a scalable HPC system with front-end nodes, file servers, and a Gigabit Ethernet console attached via interconnect to Psets 0..1023; each Pset pairs an I/O node running a "thick" X10 VM with compute nodes C-Node 0..63, each running a "thin" X10 VM; a "full" X10 VM runs on the functional front end.]

10. Future X10 Environment: Targeting Scalable HPC Parallel Systems (contd.)
[Figure: the same system as the previous slide, with each compute node expanded into processor clusters --- PEs with L1 caches sharing an L2 cache, backed by an L3 cache and memory.]

11. X10 vs. Java
• X10 is an extended subset of Java
• Base language = Java 1.4
• Java 5 features (generics, metadata, etc.) are currently not supported in X10
• Notable features removed from Java
• Concurrency --- threads, synchronized, etc.
• Java arrays --- replaced by X10 arrays
• Notable features added to Java
• Concurrency --- async, finish, atomic, future, force, foreach, ateach, clocks
• Distribution --- points, distributions
• X10 arrays --- multidimensional distributed arrays, array reductions, array initializers
• Serial constructs --- nullable, const, extern, value types
• X10 supports both OO and non-OO programming paradigms

12. x10.lang standard library
• Java package with "built-in" classes that provide support for selected X10 constructs
• Standard types: boolean, byte, char, double, float, int, long, short, String
• x10.lang.Object --- root class for all instances of X10 objects
• x10.lang.clock --- clock instances & clock operations
• x10.lang.dist --- distribution instances & distribution operations
• x10.lang.place --- place instances & place operations
• x10.lang.point --- point instances & point operations
• x10.lang.region --- region instances & region operations
• All X10 programs implicitly import the x10.lang.* package, so the x10.lang prefix can be omitted when referring to members of x10.lang.* classes
• e.g., place.MAX_PLACES, dist.factory.block([0:100,0:100]), ...
• Similarly, all X10 programs also implicitly import the java.lang.* package
• e.g., X10 programs can use Math.min() and Math.max() from java.lang

13. Calling foreign functions from X10 programs
• Java methods
• Can be called directly from X10 programs
• The Java class will be loaded automatically as part of X10 program execution
• Basic rule: don't call any method that can perform wait/notify or related thread operations
• Calling synchronized methods is okay
• C functions
• Need to use an extern declaration in X10, and perform a System.loadLibrary() call

14. Resources available in the current X10 installation
• Readme.txt --- basic information on X10 installation and usage
• Limitations.txt --- list of known limitations in the current X10 implementation
• etc/standard.cfg --- default configuration information
• examples/ --- root directory for a number of working X10 example programs
• examples/Constructs shows usage of different X10 constructs
• examples/Tutorial contains examples used in this tutorial

15. Outline
• What is X10? --- background, status
• Basic X10 (single place) --- async, finish, atomic; future, force
• Basic X10 (arrays & loops) --- points, rectangular regions, arrays; for, foreach
• Scalable X10 (multiple places) --- places, distributions, distributed arrays, ateach, BadPlaceException
• Clocks --- creation, registration, next, resume, drop, ClockUseException
• Basic serial constructs that differ from Java --- const, nullable, extern
• Advanced topics --- value types, conditional atomic sections (when), general regions & distributions
Refer to the language spec for details.

16. X10 Programming Model (Single Place)
• Activity = lightweight thread; the main program starts as a single activity in Place 0
• Storage classes: Immutable Data (I) --- final variables, value type instances; Shared Heap (H); Activity Stacks (S)
• Permitted object references (pointers): I→H, H→I, I→I, H→H, S→H, S→I
• Prohibited references: H→S, I→S, S→S
• No data sharing is permitted between a parent activity's stack and a child activity's stack
• Single-place memory model
• No coherence constraints needed for the I and S storage classes
• Guaranteed coherence for the H storage class --- all writes to the same shared location are observed in the same order by all activities (locally synchronous: coherent access to the intra-place shared heap)
• Largest deployment granularity for a single place is a single SMP

17. Basic X10 (Single Place)
Core constructs used for intra-place (shared memory) parallel programming:
• async = construct used to execute a statement in parallel as a new activity
• finish = construct used to check for global termination of a statement and all the activities that it has created
• atomic = construct used to coordinate accesses to the shared heap by multiple activities
• future = construct used to evaluate an expression in parallel as a new activity
• force = construct used to check for termination of a future

18. async statement
• async <stmt>
• The parent activity creates a new child activity to execute <stmt> in the same place as the parent activity
• An async statement returns immediately --- parent execution proceeds immediately to the next statement
• Any access to the parent's local data must be through final variables
• Similar to the data access rules for inner classes in Java
• Example (variable n must be declared final --- its value is passed from the parent to the child activity):
public class TutAsync {
  const boxedInt oddSum = new boxedInt();
  const boxedInt evenSum = new boxedInt();
  public static void main(String[] args) {
    final int n = 100;
    async for (int i=1 ; i<=n ; i+=2 ) oddSum.val += i;
    for (int j=2 ; j<=n ; j+=2 ) evenSum.val += j;

19. finish statement
• finish <stmt>
• Execute <stmt> as usual, but wait until all activities spawned (transitively) by <stmt> have terminated before completing the execution of finish <stmt>
• finish traps all exceptions thrown by activities spawned by <stmt>, and throws a wrapping exception after <stmt> has terminated
• Example (see TutAsync.x10):
. . .
    finish {
      async for (int i=1 ; i<=n ; i+=2 ) oddSum.val += i;
      for (int j=2 ; j<=n ; j+=2 ) evenSum.val += j;
    }
    // Both oddSum and evenSum have been computed now
    System.out.println("oddSum = " + oddSum.val + " ; evenSum = " + evenSum.val);
  } // main()
} // TutAsync
Console output: oddSum = 2500 ; evenSum = 2550
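Since the current X10 prototype runs on a JVM, the async/finish idiom above can be approximated in plain Java. The sketch below is ours, not the tutorial's (the class name TutAsyncJava and helper structure are illustrative): a Thread plays the role of the async, and Thread.join() plays the role of the enclosing finish.

```java
public class TutAsyncJava {
    // Shared "heap" cells, standing in for the const boxedInt fields above.
    static long oddSum = 0, evenSum = 0;

    // Returns {oddSum, evenSum}. The odd sum runs on a child thread (the
    // "async"); join() waits for it, like the "finish" on the slide above.
    static long[] computeSums(final int n) {
        oddSum = 0; evenSum = 0;
        Thread child = new Thread(() -> {
            for (int i = 1; i <= n; i += 2) oddSum += i;   // the "async" body
        });
        child.start();
        for (int j = 2; j <= n; j += 2) evenSum += j;      // parent continues immediately
        try { child.join(); }                              // "finish": wait for the child
        catch (InterruptedException e) { throw new RuntimeException(e); }
        return new long[] { oddSum, evenSum };
    }

    public static void main(String[] args) {
        long[] sums = computeSums(100);
        System.out.println("oddSum = " + sums[0] + " ; evenSum = " + sums[1]);
    }
}
```

As on the slide, join() gives the parent a happens-before edge to the child's writes, so reading oddSum afterwards is safe without further synchronization.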

20. Atomic statements & methods
• atomic <stmt>, atomic <method-decl>
• An atomic statement/method is conceptually executed in a single step, while other activities are suspended
• Note: the programmer does not manage any locks explicitly
• An atomic section may not include
• Blocking operations
• Creation of activities
• Example (see TutAtomic1.x10):
finish {
  async for (int i=1 ; i<=n ; i+=2 ) {
    double r = 1.0d / i ;
    atomic rSum += r;
  }
  for (int j=2 ; j<=n ; j+=2 ) {
    double r = 1.0d / j ;
    atomic rSum += r;
  }
}
System.out.println("rSum = " + rSum);
Console output: rSum = 5.187377517639618
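A hedged Java sketch of the same accumulation (class and method names are ours): X10's "atomic rSum += r" is modeled with a synchronized block guarding the shared sum, which is the kind of lock management the atomic construct hides from the programmer.

```java
public class TutAtomicJava {
    static double rSum = 0.0;
    static final Object lock = new Object();

    // Accumulates 1/i over odds on a child thread and 1/j over evens on the
    // parent, guarding each update like the "atomic" sections above.
    static double computeRSum(final int n) {
        rSum = 0.0;
        Thread child = new Thread(() -> {
            for (int i = 1; i <= n; i += 2) {
                double r = 1.0d / i;
                synchronized (lock) { rSum += r; }   // stands in for: atomic rSum += r
            }
        });
        child.start();
        for (int j = 2; j <= n; j += 2) {
            double r = 1.0d / j;
            synchronized (lock) { rSum += r; }
        }
        try { child.join(); }                        // "finish"
        catch (InterruptedException e) { throw new RuntimeException(e); }
        return rSum;
    }

    public static void main(String[] args) {
        System.out.println("rSum = " + computeRSum(100));
    }
}
```

The result matches the slide's harmonic sum up to floating-point reassociation, since the interleaving of the two threads can vary.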

21. Another Example (TutAtomic2.x10)
public class TutAtomic2 {
  const boxedInt a = new boxedInt(100);
  const boxedInt b = new boxedInt(100);
  public static atomic void incr_a() { a.val++ ; b.val-- ; }
  public static atomic void decr_a() { a.val-- ; b.val++ ; }
  public static void main(String args[]) {
    int sum;
    finish {
      async for (int i=1 ; i<=10 ; i++ ) incr_a();
      for (int i=1 ; i<=10 ; i++ ) decr_a();
    }
    atomic sum = a.val + b.val;
    System.out.println("a+b = " + sum);
  } // main()
} // TutAtomic2
Console output: a+b = 200

22. Future & Force
• future<type> F = future { <expr> }
• The parent activity creates a new asynchronous child activity (in the same place) to evaluate <expr>
• <type> value = F.force()
• The caller blocks until the return value is obtained from the future (and all activities spawned transitively by <expr> have terminated)
• Example (see TutFuture2.x10):
// Note that future<int> and int are different types
future<int> Fi = future { fib(10) } ;
int i = Fi.force();
// Nested future types can also be created (if need be)
future<future<int>> FFj = future { future{fib(100)} };
future<int> Fj = FFj.force();
int j = Fj.force();

23. Example (TutFuture1.x10)
Example of recursive divide-and-conquer parallelism --- the calls to fib(n-1) and fib(n-2) execute in parallel
public class TutFuture1 {
  static int fib(final int n) {
    if ( n <= 0 ) return 0;
    else if ( n == 1 ) return 1;
    else {
      future<int> fn_1 = future { fib(n-1) };
      future<int> fn_2 = future { fib(n-2) };
      return fn_1.force() + fn_2.force();
    }
  } // fib()
  public static void main(String[] args) {
    System.out.println("fib(10) = " + fib(10));
  } // main()
} // TutFuture1
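The same divide-and-conquer shape can be expressed with Java's fork/join framework; this is our analogue, not the tutorial's code. Here fork() plays the role of "future { fib(n-1) }" and join() the role of force(); one branch is evaluated directly in the current task, a common fork/join idiom.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Fork/join sketch of the future-based fib above (names are ours).
public class FibTask extends RecursiveTask<Integer> {
    final int n;
    FibTask(int n) { this.n = n; }

    @Override protected Integer compute() {
        if (n <= 0) return 0;
        if (n == 1) return 1;
        FibTask fn1 = new FibTask(n - 1);
        fn1.fork();                       // ~ future { fib(n-1) }
        int r2 = new FibTask(n - 2).compute();  // evaluate the other branch directly
        return fn1.join() + r2;           // ~ force()
    }

    static int fib(int n) {
        return new ForkJoinPool().invoke(new FibTask(n));
    }

    public static void main(String[] args) {
        System.out.println("fib(10) = " + fib(10));
    }
}
```

Unlike unrestricted futures, fork/join tasks form a tree in which each task joins only its own children, which is essentially the deadlock-freedom condition discussed on the next slides.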

24. Parallel Programming Pitfalls: Deadlock
• Deadlock occurs when parallel threads/activities acquire locks or perform other blocking operations in a sequence that creates a dependence cycle
• Java example:
• Thread 0: synchronized (Foo.a) { synchronized(Foo.b) { ... } }
• Thread 1: synchronized (Foo.b) { synchronized(Foo.a) { ... } }
• MPI example:
• Process 0: MPI_Recv(recvbuf, count, MPI_REAL, 1, tag, ...)
• Process 1: MPI_Recv(recvbuf, count, MPI_REAL, 0, tag, ...)

25. Parallel Programming Pitfalls: Deadlock (contd.)
• X10 guarantee
• Any program written with the async, finish, atomic, foreach, ateach, and clock parallel constructs will never deadlock
• Unrestricted use of future and force may lead to deadlock (see examples/Constructs/Future/FutureDeadlock_MustFailTimeout.x10):
f1 = future { a1() };
f2 = future { a2() };
int a1() { ... f2.force(); ... }
int a2() { ... f1.force(); ... }
• Restricted use of future and force in X10 can preserve guaranteed freedom from deadlock
• Sufficient condition #1: ensure that the activity that creates the future also performs the force() operation
• Sufficient condition #2: . . .

26. Parallel Programming Pitfalls: Data Races
• A data race occurs when two (or more) threads/activities can access the same shared location in parallel such that one of the accesses is a write operation
• Java example:
• Thread 0: a++ ; b-- ;
• Thread 1: a++ ; b-- ;
• The data race can violate the invariant that (a+b) is constant
• The data race may also prevent multiple increments from being combined correctly
• X10 guidelines for avoiding data races
• Use atomic methods and blocks without worrying about deadlock
• Declare data to be read-only (i.e., final or a value type instance) whenever possible
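A Java sketch of the "use atomic" guideline (class and method names are ours): the unprotected increments from the slide are contrasted with AtomicInteger, Java's closest analogue of wrapping the update in an X10 atomic block. The racey version may lose updates; the atomic version never does.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class RaceFreeCounter {
    // Two threads increment an unprotected counter: the classic lost-update
    // race. The result may be anything from perThread+1 to 2*perThread.
    static int raceyIncrements(final int perThread) {
        final int[] a = { 0 };
        Thread t = new Thread(() -> { for (int k = 0; k < perThread; k++) a[0]++; });
        t.start();
        for (int k = 0; k < perThread; k++) a[0]++;     // racing update
        try { t.join(); } catch (InterruptedException e) { throw new RuntimeException(e); }
        return a[0];
    }

    // Same workload, but each increment is atomic -- analogous to
    // "atomic a.val++" in X10. Always returns exactly 2*perThread.
    static int atomicIncrements(final int perThread) {
        final AtomicInteger a = new AtomicInteger(0);
        Thread t = new Thread(() -> { for (int k = 0; k < perThread; k++) a.incrementAndGet(); });
        t.start();
        for (int k = 0; k < perThread; k++) a.incrementAndGet();
        try { t.join(); } catch (InterruptedException e) { throw new RuntimeException(e); }
        return a.get();
    }

    public static void main(String[] args) {
        System.out.println("racey  = " + raceyIncrements(100000));
        System.out.println("atomic = " + atomicIncrements(100000));
    }
}
```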

27. Outline
• What is X10? --- background, status
• Basic X10 (single place) --- async, finish, atomic; future, force
• Basic X10 (arrays & loops) --- points, rectangular regions, arrays; for, foreach
• Scalable X10 (multiple places) --- places, distributions, distributed arrays, ateach, BadPlaceException
• Clocks --- creation, registration, next, resume, drop, ClockUseException
• Basic serial constructs that differ from Java --- const, nullable, extern
• Advanced topics --- value types, conditional atomic sections (when), general regions & distributions
Refer to the language spec for details.

28. Points
• A point is an element of an n-dimensional Cartesian space (n >= 1) with integer-valued coordinates e.g., [5], [1, 2], ...
• Dimensions are numbered from 0 to n-1
• n is also referred to as the rank of the point
• A point variable can hold values of different ranks e.g.,
• point p; p = [1]; ... p = [2,3]; ...
• The following operations are defined on a point-valued expression p1
• p1.rank --- returns the rank of point p1
• p1.get(i) --- returns element i of point p1
• Returns element (i mod p1.rank) if i < 0 or i >= p1.rank
• p1.lt(p2), p1.le(p2), p1.gt(p2), p1.ge(p2)
• Returns true iff p1 is lexicographically <, <=, >, or >= p2
• Only defined when p1.rank and p2.rank are equal

29. Example (see TutPoint.x10)
public class TutPoint {
  public static void main(String[] args) {
    point p1 = [1,2,3,4,5];
    point p2 = [1,2];
    point p3 = [2,1];
    System.out.println("p1 = " + p1 + " ; p1.rank = " + p1.rank + " ; p1.get(2) = " + p1.get(2));
    System.out.println("p2 = " + p2 + " ; p3 = " + p3 + " ; p2.lt(p3) = " + p2.lt(p3));
  } // main()
} // TutPoint
Console output:
p1 = [1,2,3,4,5] ; p1.rank = 5 ; p1.get(2) = 3
p2 = [1,2] ; p3 = [2,1] ; p2.lt(p3) = true

30. Rectangular Regions
• A rectangular region is the set of points contained in a rectangular subspace
• A region variable can hold values of different ranks e.g.,
• region R; R = [0:10]; ... R = [-100:100, -100:100]; ... R = [0:-1]; ...
• The following operations are defined on a region-valued expression R
• R.rank = # dimensions in region; R.size() = # points in region
• R.contains(P) = true if region R contains point P
• R.contains(S) = true if region R contains region S
• R.equal(S) = true if region R equals region S
• R.rank(i) = projection of region R on dimension i (a one-dimensional region)
• R.rank(i).low() = lower bound of the ith dimension of region R
• R.rank(i).high() = upper bound of the ith dimension of region R
• R.ordinal(P) = ordinal value of point P in region R
• R.coord(N) = point in region R with ordinal value = N
• R1 && R2 = region intersection (will be rectangular if R1 and R2 are rectangular)
• R1 || R2 = union of regions R1 and R2 (may not be rectangular)
• R1 - R2 = region difference (may not be rectangular)

31. Example (see TutRegion.x10)
public class TutRegion {
  public static void main(String[] args) {
    region R1 = [1:10, -100:100];
    System.out.println("R1 = " + R1 + " ; R1.rank = " + R1.rank + " ; R1.size() = " + R1.size()
        + " ; R1.ordinal([10,100]) = " + R1.ordinal([10,100]));
    region R2 = [1:10, 90:100];
    System.out.println("R2 = " + R2 + " ; R1.contains(R2) = " + R1.contains(R2)
        + " ; R2.rank(1).low() = " + R2.rank(1).low() + " ; R2.coord(0) = " + R2.coord(0));
  } // main()
} // TutRegion
Console output:
R1 = {1:10,-100:100} ; R1.rank = 2 ; R1.size() = 2010 ; R1.ordinal([10,100]) = 2009
R2 = {1:10,90:100} ; R1.contains(R2) = true ; R2.rank(1).low() = 90 ; R2.coord(0) = [1,90]
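The ordinal values printed above follow from canonical lexicographic (row-major) order. This Java model of a 2-D rectangular region (class name and representation are ours, not X10's) reproduces size(), ordinal(), and coord() for the regions in the example:

```java
// Row-major model of R.ordinal(P) and R.coord(N) for a 2-D region [lo0:hi0, lo1:hi1].
public class Region2D {
    final int lo0, hi0, lo1, hi1;
    Region2D(int lo0, int hi0, int lo1, int hi1) {
        this.lo0 = lo0; this.hi0 = hi0; this.lo1 = lo1; this.hi1 = hi1;
    }

    int size() { return (hi0 - lo0 + 1) * (hi1 - lo1 + 1); }

    // Ordinal of point (p0, p1) in canonical lexicographic order.
    int ordinal(int p0, int p1) {
        int width1 = hi1 - lo1 + 1;                  // extent of dimension 1
        return (p0 - lo0) * width1 + (p1 - lo1);
    }

    // Inverse of ordinal: the point with ordinal value n, as {p0, p1}.
    int[] coord(int n) {
        int width1 = hi1 - lo1 + 1;
        return new int[] { lo0 + n / width1, lo1 + n % width1 };
    }

    public static void main(String[] args) {
        Region2D r1 = new Region2D(1, 10, -100, 100);   // [1:10,-100:100]
        System.out.println("size = " + r1.size()
            + " ; ordinal([10,100]) = " + r1.ordinal(10, 100));
    }
}
```

For R1 = [1:10,-100:100] this yields size 2010 and ordinal([10,100]) = 9*201 + 200 = 2009, matching the console output above.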

32. X10 Arrays
• Java arrays are one-dimensional and local e.g., the array args in main(String[] args)
• Multi-dimensional arrays are represented as "arrays of arrays" in Java
• X10 has true multi-dimensional arrays (as in C, Fortran) that can be distributed (as in UPC, Co-Array Fortran, ZPL, Chapel, etc.)
• Array declaration
• "T [.] A" declares an X10 array with element type T
• An array variable can hold values of different ranks
• The [.] syntax is used to avoid confusion with Java arrays
• Array creation
• "new T [ R ]" creates a local rectangular X10 array with rectangular region R as the index domain and T as the element (range) type
• e.g., int[.] A = new int[ [0:N+1, 0:N+1] ];
• Array initializers can also be specified in conjunction with creation (see TutArray1.x10)
• e.g., int[.] A = new int[ [1:10,1:10] ] (point [i,j]) { return i+j; } ;

33. X10 Array Operations
The following operations are defined on an array-valued expression A
• A.rank = # dimensions in array
• A.region = index region (domain) of array
• A[P] = element at point P, where P belongs to A.region
• A | R = restriction of array onto region R
• Useful for extracting subarrays
• A.sum(), A.max() = sum/max of elements in array
• A1 op A2 returns the result of applying a pointwise op on array elements, when A1.region = A2.region
• op can include +, -, *, and /
• A1 || A2 = disjoint union of arrays A1 and A2 (A1.region and A2.region must be disjoint)
• A1.overlay(A2)
• Returns an array with region A1.region || A2.region, with element value A2[P] for all points P in A2.region and A1[P] otherwise
• A.distribution = distribution of array A
• Will be discussed later when we introduce X10 places

34. Example (see TutArray1.x10)
public class TutArray1 {
  public static void main(String[] args) {
    int[.] A = new int[ [1:10,1:10] ] (point [i,j]) { return i+j; } ;
    System.out.println("A.rank = " + A.rank + " ; A.region = " + A.region);
    int[.] B = A | [1:5,1:5];
    System.out.println("B.max() = " + B.max());
  } // main()
} // TutArray1
Console output:
A.rank = 2 ; A.region = {1:10,1:10}
B.max() = 10

35. Pointwise for loop
• X10 extends Java's for loop to support sequential iteration over the points in a region R in canonical lexicographic order
• for ( point p : R ) . . .
• Standard point operations can be used to extract individual index values from point p
• for ( point p : R ) { int i = p.get(0); int j = p.get(1); . . . }
• Or an "exploded" syntax can be used instead of explicitly declaring a point variable
• for ( point [i,j] : R ) { . . . }
• The exploded syntax declares the constituent variables (i, j, ...) as local int variables in the scope of the for loop body

36. Example (see TutFor.x10)
public class TutFor {
  public static void main(String[] args) {
    region R = [0:1,0:2];
    System.out.print("Points in region " + R + " =");
    for ( point p : R ) System.out.print(" " + p);
    System.out.println();
    // Use exploded syntax instead
    System.out.print("(i,j) pairs in region " + R + " =");
    for ( point[i,j] : R ) System.out.print("(" + i + "," + j + ")");
    System.out.println();
  } // main()
} // TutFor
Console output:
Points in region {0:1,0:2} = [0,0] [0,1] [0,2] [1,0] [1,1] [1,2]
(i,j) pairs in region {0:1,0:2} =(0,0)(0,1)(0,2)(1,0)(1,1)(1,2)

37. foreach loop (parallel iteration)
• The X10 foreach loop is similar to the pointwise for loop, except that each iteration executes in parallel as a new asynchronous activity i.e.,
• "foreach ( point p : R ) S" is equivalent to "for ( point p : R ) async S"
• As before, finish can be used to wait for termination of all foreach iterations
• finish foreach ( point[i,j] : [0:M-1,0:N-1] ) . . .
• Special case: use foreach to create a single-dimensional parallel loop
• foreach ( point[i] : [0 : N-1] ) S
• Allowing a single foreach construct to span multiple dimensions makes it convenient to write parallel matrix code that is independent of the underlying rank and region e.g.,
• foreach ( point p : A.region ) A[p] = f(B[p], C[p], D[p]) ;
• Multiple foreach instances may access shared data in the same place --- use finish, atomic, force to avoid data races
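One Java way to approximate "finish foreach" over a 2-D region is a parallel stream over the region's ordinals: each iteration updates an independent element, and the terminal forEach only returns once all iterations are done, giving the implicit finish. The sketch below is ours (names and the flattening scheme are not X10's).

```java
import java.util.stream.IntStream;

public class ForeachSketch {
    // Models: finish foreach (point[i,j] : [1:n,1:n]) A[i,j] = A[i,j] + 1;
    // for A initialized by the array initializer A[i,j] = i+j.
    static int[][] incrementAll(final int n) {
        final int[][] a = new int[n + 1][n + 1];        // row/col 0 unused
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= n; j++) a[i][j] = i + j;   // initializer
        IntStream.range(0, n * n).parallel().forEach(ord -> {
            int i = 1 + ord / n, j = 1 + ord % n;       // decode the point
            a[i][j] = a[i][j] + 1;                      // independent update: no race
        });
        return a;                                       // forEach has "finished"
    }

    static int sum(int[][] a) {
        int s = 0;
        for (int[] row : a) for (int v : row) s += v;
        return s;
    }

    public static void main(String[] args) {
        System.out.println("sum = " + sum(incrementAll(5)));
    }
}
```

Because every iteration touches a distinct element, no atomic section is needed; the next slides show cases where loop-carried dependences force a different structure.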

38. Example (see TutForeach1.x10)
public class TutForeach1 {
  public static void main(String[] args) {
    final int N = 5;
    int[.] A = new int[[1:N,1:N]] (point[i,j]) { return i+j; };
    // For the A[i,j] = F(A[i,j]) case,
    // both loops can execute in parallel
    finish foreach ( point[i,j] : A.region ) A[i,j] = A[i,j] + 1;
    // For the A[i,j] = F(A[i,j-1]) case,
    // only the outer loop can execute in parallel
    finish foreach ( point[i] : A.region.rank(0) )
      for ( point[j] : [(A.region.rank(1).low()+1) : A.region.rank(1).high()] )
        A[i,j] = A[i,j-1] + 1;
NOTE: A.region.rank(0) is the same as [1:N]

39. Example contd. (see TutForeach1.x10)
    // For the A[i,j] = F(A[i-1,j]) case,
    // only the inner loop can execute in parallel
    for ( point[i] : [(A.region.rank(0).low()+1) : A.region.rank(0).high()] )
      finish foreach ( point[j] : A.region.rank(1) )
        A[i,j] = A[i-1,j] + 1;
    // For the A[i,j] = F(A[i-1,j],A[i,j-1]) case,
    // use loop skewing to execute the inner loop in parallel
    for ( point[t] : [4:2*N] ) {
      finish foreach ( point[j] : [Math.max(2,t-N) : Math.min(N,t-2)] ) {
        int i = t - j;
        System.out.print("(" + i + "," + j + ")");
        A[i,j] = A[i-1,j] + A[i,j-1] + 1;
      }
      System.out.println();
    }
Console output:
(2,2)
(3,2)(2,3)
(4,2)(3,3)(2,4)
(5,2)(3,4)(4,3)(2,5)
(5,3)(4,4)(3,5)
(5,4)(4,5)
(5,5)
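The loop-skewing idea can be checked in Java: for the recurrence A[i,j] = A[i-1,j] + A[i,j-1] + 1, walking the anti-diagonals t = i+j sequentially while computing each diagonal in parallel must give the same array as the fully sequential loop nest, because every element's two inputs lie on earlier diagonals. This sketch (names, bounds, and the zero boundary are ours) verifies that:

```java
import java.util.stream.IntStream;

public class SkewSketch {
    // Fully sequential reference: row/col 0 serve as a zero boundary.
    static int[][] sequential(int n) {
        int[][] a = new int[n + 1][n + 1];
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= n; j++)
                a[i][j] = a[i - 1][j] + a[i][j - 1] + 1;
        return a;
    }

    // Skewed version: sequential over wavefronts t = i + j, parallel within each.
    static int[][] skewed(int n) {
        final int[][] a = new int[n + 1][n + 1];
        for (int t = 2; t <= 2 * n; t++) {
            final int tt = t;
            IntStream.rangeClosed(Math.max(1, tt - n), Math.min(n, tt - 1))
                     .parallel()                       // whole diagonal at once
                     .forEach(j -> {
                         int i = tt - j;
                         a[i][j] = a[i - 1][j] + a[i][j - 1] + 1;
                     });
        }
        return a;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.deepEquals(sequential(5), skewed(5)));
    }
}
```

Each terminal forEach completes before the next diagonal starts, which is exactly the role the per-wavefront finish plays in the X10 code above.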

40. Outline
• What is X10? --- background, status
• Basic X10 (single place) --- async, finish, atomic; future, force
• Basic X10 (arrays & loops) --- points, rectangular regions, arrays; for, foreach
• Scalable X10 (multiple places) --- places, distributions, distributed arrays, ateach, BadPlaceException
• Clocks --- creation, registration, next, resume, drop, ClockUseException
• Basic serial constructs that differ from Java --- const, nullable, extern
• Advanced topics --- value types, conditional atomic sections (when), general regions & distributions
Refer to the language spec for details.

41. Limitations of using a Single Place
• The largest deployment granularity for a single place is a single SMP
• The smallest granularity can be a single CPU or even a single hardware thread
• A single SMP is inadequate for solving problems with large memory and compute requirements
• X10 solution: incorporate multiple places as a core foundation of the X10 programming model --- enabling deployment on large-scale clustered machines, with integrated support for intra-place parallelism
[Figure: the single-place model of slide 16 --- activities with stacks (S), a shared heap (H), and immutable data (I) in Place 0, with locally synchronous (coherent) access to the intra-place shared heap.]

42. Scalable X10: using multiple places
• Place = collection of activities & objects
• Activities and data objects do not move after being created
• Scalar object O --- maps to a single place, specified by O.location
• Array object A --- may be local to a place or distributed across multiple places, as specified by A.distribution
• Storage classes: Immutable Data (I) --- final variables, value type instances; Partitioned Global Address Space (PGAS); Local Heap (LH); Remote Heap (RH); Activity Stacks (S)
• Execution is globally asynchronous, locally synchronous (coherent access to the intra-place shared heap)
[Figure: places 0 through MAX_PLACES-1, each with its own activities, activity stacks (S), and local heap (LH), exchanging inbound/outbound activities and activity replies through the PGAS.]

43. Locality Rule
• Any access to a mutable (shared heap) datum must be performed by an activity located at the same place as the datum
• The prohibited references are similar to before:
• LH/RH→S, I→S, S→S
• Local-to-remote (LH→RH) and remote-to-local (RH→LH) heap references are freely permitted
• However, direct access via a remote heap reference is not permitted!
• Inter-place data accesses can only be performed by creating remote activities (with weaker ordering guarantees than intra-place data accesses)
• The locality rule is currently not checked by default. Instead, the user can perform the check explicitly by inserting a place cast operator as follows:
• "(@ P) E" checks if expression E can be evaluated at place P
• If so, expression E is evaluated as usual
• If not, a BadPlaceException is thrown

44. Activity Execution within a Place
• A place-local activity can only access its stack (S), the place-local heap (LH), or immutable data (I)
• Atomic sections do not have blocking semantics
[Figure: within a place, activities move between Ready, Executing, Blocked, and Completed states; activities may block on a clock or a future; inbound/outbound activities and replies connect the place to the rest of the system.]

45. Places
• place.MAX_PLACES = total number of places
• Default value is 4
• Can be changed by using the -NUMBER_OF_LOCAL_PLACES option of the x10 command
• place.places = set of all places in an X10 program (see java.util.Set)
• place.factory.place(i) = place corresponding to index i
• here = place in which the current activity is executing
• <place-expr>.toString() returns a string of the form "place(id=99)"
• <place-expr>.id returns the id of the place
• The X10 language defines the mapping from X10 data structures to X10 places, and abstract performance metrics on places
• A future X10 deployment system will define the mapping from X10 places to system nodes; this is not supported in the current implementation

46. Extension of async and future to places
• async (P) S
• Creates a new activity to execute statement S at place P
• "async S" is equivalent to "async (here) S"
• future (P) { E }
• Creates a new activity to evaluate expression E at place P
• "future { E }" is equivalent to "future (here) { E }"
• Note that "here" in a child activity for an async/future computation will refer to the place P at which the child activity is executing, not the place where the parent activity is executing
• The goal is to specify the destination place for async/future activities so as to obey the Locality Rule e.g.,
• async (O.location) O.x = 1;
• future<int> F = future (A.distribution[i]) { A[i] } ;

47. Distribution = mapping from a region to places
• Creating distributions (x10.lang.dist):
• dist D1 = R -> here; // local distribution --- maps region R to here
• dist D2 = dist.factory.block(R); // blocked distribution
• dist D3 = dist.factory.cyclic(R); // cyclic distribution
• dist D4 = dist.factory.unique(); // identity map on [0:MAX_PLACES-1]
• Using distributions
• D[P] = place to which point P is mapped by distribution D (assuming that P is in D.region)
• Allocate a distributed array e.g., T[.] A = new T[ D ];
• Allocates an array with index set = D.region, such that element A[P] is located at place D[P] for each point P in D.region
• NOTE: "new T[R]" for region R is equivalent to "new T[R->here]"
• Iterating over a distribution --- generalization of "foreach" to "ateach"
• ateach is discussed in more detail later

48. Operations defined on distributions
• D.region = source region of distribution
• D.rank = rank of D.region
• D | R = region restriction for distribution D and region R (returns a restricted distribution)
• D | P = place restriction for distribution D and place P (returns the region mapped by D to place P)
• D1 || D2 = union of distributions D1 and D2 (assumes that D1.region and D2.region are disjoint)
• D1.overlay(D2) = overlay of D2 over D1 --- asymmetric union
• D.contains(p) = true iff D.region contains point p
• D = R -> P = constant distribution, which maps the entire region R to place P
• D1 - D2 = distribution difference = D1 | (D1.region - D2.region)
• D.distributionEfficiency() = load balance efficiency of distribution D
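To make the block and cyclic factories concrete, here is a Java model of the point-to-place maps for a 1-D region [0:n-1] over p places. The exact block split is our assumption (the first n mod p places get one extra point); X10's dist.factory.block may chunk differently, so treat this as an illustration of the idea, not the library's contract.

```java
public class DistSketch {
    // Cyclic distribution: point i goes to place i mod p.
    static int cyclicPlace(int i, int p) { return i % p; }

    // Block distribution: contiguous chunks of size ceil(n/p) or floor(n/p);
    // the first (n mod p) places receive the larger chunks.
    static int blockPlace(int i, int n, int p) {
        int base = n / p, extra = n % p;
        int boundary = (base + 1) * extra;   // points covered by the larger blocks
        return (i < boundary) ? i / (base + 1)
                              : extra + (i - boundary) / base;
    }

    public static void main(String[] args) {
        int n = 10, p = 4;   // 10 points over 4 places -> block sizes 3,3,2,2
        for (int i = 0; i < n; i++)
            System.out.print(blockPlace(i, n, p) + " ");
        System.out.println();
    }
}
```

For n = 10 and p = 4 the block map assigns place sizes {3, 3, 2, 2}; the cyclic map instead deals points round-robin, which matters for the load-balance discussion two slides below.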

49. Inter-place communication using async and future
• Question: how to assign A[i] = B[j], when A[i] and B[j] may be in different places?
• Answer #1 --- use nested async's!
finish async ( B.distribution[j] ) {
  final int bb = B[j];
  async ( A.distribution[i] ) A[i] = bb;
}
• Answer #2 --- use future-force and an async!
final int b = future (B.distribution[j]) { B[j] }.force();
finish async ( A.distribution[i] ) A[i] = b;

50. Load Balance Efficiency
• Consider a parallel application that is executed on P places
• Let T(i) = computation load mapped to place i
• For distribution D, T(i) = (D | place.factory.place(i)).size()
• Let Tmax = max { T(i) | 1 <= i <= P }
• Let E = SUM { T(i) | 1 <= i <= P } / (Tmax * P)
• E is the load balance efficiency, 1/P <= E <= 1
• E = 1 is the best case --- the computation load is perfectly balanced
• E = 1/P is the worst case --- the entire computation load is placed on a single processor/place
• Load balance efficiency is one of the key factors that limit speedup on a parallel machine
• There are several other factors e.g., communication & synchronization overhead
• Ignoring other factors, we expect speedup to be <= E * P
• NOTE: also try "x10 -DUMP_STATS_ON_EXIT=true ..." to see activity and atomic counts
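The load-balance efficiency E defined above is a one-liner over the per-place loads T(i); this small Java sketch (names are ours) computes it:

```java
public class LoadBalance {
    // E = (sum of T(i)) / (Tmax * P), with 1/P <= E <= 1.
    static double efficiency(int[] loads) {
        int p = loads.length;
        long sum = 0;
        int tmax = 0;
        for (int t : loads) { sum += t; tmax = Math.max(tmax, t); }
        return (double) sum / ((double) tmax * p);
    }

    public static void main(String[] args) {
        // e.g., a block distribution of 10 points over 4 places: loads 3,3,2,2
        System.out.println("E = " + efficiency(new int[] { 3, 3, 2, 2 }));
    }
}
```

For loads {3, 3, 2, 2}, E = 10 / (3 * 4) ≈ 0.833, so even ignoring communication overhead we would expect speedup of at most E * P ≈ 3.3 on 4 places.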
