x10 tutorial psc software productivity study may 23 27 2005
Download
Skip this Video
Download Presentation
X10 Tutorial PSC Software Productivity Study May 23 – 27, 2005

Loading in 2 Seconds...

play fullscreen
1 / 63

X10 Tutorial PSC Software Productivity Study May 23 27, 2005 - PowerPoint PPT Presentation


  • 90 Views
  • Uploaded on

X10 Tutorial PSC Software Productivity Study May 23 – 27, 2005. Vivek Sarkar IBM T.J. Watson Research Center [email protected] This work has been supported in part by the Defense Advanced Research Projects Agency (DARPA) under contract No. NBCH30390004. Acknowledgments.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'X10 Tutorial PSC Software Productivity Study May 23 27, 2005' - melody


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
x10 tutorial psc software productivity study may 23 27 2005

X10 TutorialPSC Software Productivity StudyMay 23 – 27, 2005

Vivek Sarkar

IBM T.J. Watson Research Center

[email protected]

This work has been supported in part by the Defense Advanced Research Projects Agency (DARPA) under contract No. NBCH30390004.

acknowledgments
Acknowledgments
  • Additional contributors to PSC Productivity Study
    • David Bader
    • Bill Clark
    • Nick Nystrom
    • John Urbanic
  • X10 core team
    • Philippe Charles
    • Chris Donawa
    • Kemal Ebcioglu
    • Christian Grothoff
    • Allan Kielstra
    • Christoph von Praun
    • Vivek Sarkar
    • Vijay Saraswat
  • X10 productivity team
    • Catalina Danis
    • Christine Halverson

X10 Tutorial 2

outline
What is X10?

background, status

Basic X10 (single place)

async, finish, atomic

future, force

Basic X10 (arrays & loops)

points, rectangular regions, arrays

for, foreach

Scalable X10 (multiple places)

places, distributions, distributed arrays, ateach, BadPlaceException

Clocks

creation, registration, next, resume, drop, ClockUseException

Basic serial constructs that differ from Java

const, nullable, extern

Advanced topics

Value types, conditional atomic sections (when), general regions & distributions

Refer to language spec for details

Outline

X10 Tutorial 3

what is x10
What is X10?
  • X10 is a new experimental language developed in the IBM PERCS project as part of the DARPA program on High Productivity Computing Systems (HPCS)
  • X10’s goal is to provide a new parallel programming model and its embodiment in a high level language that:
    • is more productive than current models,
    • can support higher levels of abstraction better than current models, and
    • can exploit the multiple levels of parallelism and nonuniform data access that are critical for obtaining scalable performance in current and future HPC systems,

X10 Tutorial 4

x10 status and schedule
6/2003 PERCS programming model concept (end of PERCS Phase 1)

7/2004 Start of PERCS Phase 2

2/2004 Kickoff of X10 as concrete embodiment of PERCS programming model as a new language

7/2004 First draft of X10 language specification

2/2005 First X10 implementation -- unoptimized single-VM prototype

Emulates distributed parallelism in a single process

This is what you will use to run X10 programs this week

5/2005 X10 productivity study at Pittsburgh Supercomputing Center

7/2005 Results from X10 application & productivity studies

2H2005 Revise language based on application & productivity feedback

2H2005 Start participation in High Productivity Language “consortium”?

1/2006 Second X10 implementation – optimized multi-VM prototype

6/2006 Open source release of X10 reference implementation

6/2006 Design completed for production X10 implementation in Phase 3 (end of Phase 2)

X10 status and schedule

X10 Tutorial 5

current x10 environment unoptimized single vm implementation
Current X10 Environment:Unoptimized Single-VM Implementation

X10 source program --- must contain a class named Foo with a “public static void main(String[] args) method

Foo.x10

x10c Foo.x10

x10c

X10 compiler --- translates Foo.x10 to Foo.java,

uses javac to generate Foo.class from Foo.java

X10 program translated into Java ---

// #line pseudocomment in Foo.java specifies source line mapping in Foo.x10

Foo.class

Foo.java

x10 Foo.x10

X10 extern

interface

X10 Virtual Machine

(JVM + J2SE libraries +

X10 libraries +

X10 Multithreaded Runtime)

External DLL’s

X10 Abstract Performance Metrics

(event counts, distribution efficiency)

X10 Program Output

Caveat: this is a prototype implementation with many limitations. Please be patient!

X10 Tutorial 6

examples of x10 compiler error messages
Examples of X10 Compiler Error Messages

Case 1: Error message identifies source file and line number

1) x10c TutError1.x10

TutError1.x10:8: Could not find field or local variable "evenSum".

for (int i = 2 ; i <= n ; i += 2 ) evenSum += i;

^----^

2) x10c TutError2.x10

x10c: TutError2.x10:4:27:4:27: unexpected token(s) ignored

3) x10c TutError3.x10

x10c: C:\vivek\eclipse\workspace\x10\examples\Tutorial\TutError3.java:49:

local variable n is accessed from within inner class; needs to be declared

final

Case 1: Carats indicate column range

Case 2: Error message identifies source file, line number, and column range

Case 3: Error message reported by Java compiler – look for #line comment in .java file to identify X10 source location

X10 Tutorial 7

future x10 environment
Future X10 Environment

Implicit parallelism,

Implicit data distributions

Very High Level Languages (VHLL’s),

Domain Specific Languages (DSL’s)

X10 places --- abstraction of explicit control & data distribution

X10 Libraries

X10 High Level Language

Mapping of places to nodes in HPC Parallel Environment

X10 Deployment

HPC Runtime Environment

(Parallel Environment, MPI, LAPI, …)

Primitive constructs for parallelism, communication, and synchronization

HPC Parallel System

Target system for parallel application

X10 Tutorial 8

future x10 environment targeting scalable hpc parallel systems

C-Node 63

C-Node 63

C-Node 0

C-Node 0

“Thin”

X10 VM

“Thin”

X10 VM

“Thin”

X10 VM

“Thin”

X10 VM

Future X10 Environment: Targeting Scalable HPC Parallel Systems

Front-endNodes

FileServers

interconnect

. . .

Pset 0

I/O Node 0

“Thick”

X10 VM

“Full” X10 VM

Functional Gigabit Ethernet

Console

. . .

interconnect

I/O Node 1023

“Thick”

X10 VM

. . .

Pset 1023

X10 Tutorial 9

future x10 environment targeting scalable hpc parallel systems1

Proc Cluster

Proc Cluster

. . .

. . .

PEs,

L1 $

PEs,

L1 $

PEs,

L1 $

PEs,

L1 $

L2 Cache

L2 Cache

. . .

C-Node 0

C-Node 63

C-Node 0

C-Node 63

“Thin”

X10 VM

“Thin”

X10 VM

“Thin”

X10 VM

“Thin”

X10 VM

. . .

L3 Cache

Memory

Future X10 Environment: Targeting Scalable HPC Parallel Systems

Front-endNodes

FileServers

interconnect

. . .

Pset 0

I/O Node 0

“Thick”

X10 VM

“Full X10 VM”

Functional Gigabit Ethernet

Console

. . .

interconnect

I/O Node 1023

“Thick”

X10 VM

. . .

Pset 1023

X10 Tutorial 10

x10 vs java
X10 vs. Java
  • X10 is an extended subset of Java
    • Base language = Java 1.4
      • Java 5 features (generics, metadata, etc.) are currently not supported in X10
    • Notable features removed from Java
      • Concurrency --- threads, synchronized, etc.
      • Java arrays – replaced by X10 arrays
    • Notable features added to Java
      • Concurrency – async, finish, atomic, future, force, foreach, ateach, clocks
      • Distribution --- points, distributions
      • X10 arrays --- multidimensional distributed arrays, array reductions, array initializers,
      • Serial constructs --- nullable, const, extern, value types
  • X10 supports both OO and non-OO programming paradigms

X10 Tutorial 11

x10 lang standard library
x10.lang standard library
  • Java package with “built in” classes that provide support for selected X10 constructs
    • Standard types
      • boolean, byte, char, double, float, int, long, short, String
    • x10.lang.Object -- root class for all instances of X10 objects
    • x10.lang.clock --- clock instances & clock operations
    • x10.lang.dist --- distribution instances & distribution operations
    • x10.lang.place --- place instances & place operations
    • x10.lang.point --- point instances & point operations
    • x10.lang.region --- region instances & region operations
  • All X10 programs implicitly import the x10.lang.* package, so the x10.lang prefix can be omitted when referring to members of x10.lang.* classes
    • e.g., place.MAX_PLACES, dist.factory.block([0:100,0:100]), …
  • Similarly, all X10 programs also implicitly import the java.lang.* package
    • e.g., X10 programs can use Math.min() and Math.max() from java.lang

X10 Tutorial 12

calling foreign functions from x10 programs
Calling foreign functions from X10 programs
  • Java methods
    • Can be called directly from X10 programs
    • Java class will be loaded automatically as part of X10 program execution
    • Basic rule: don’t call any method that can perform wait/notify or related thread operations
      • Calling synchronized methods is okay
  • C functions
    • Need to use extern declaration in X10, and perform a System.loadLibrary() call

X10 Tutorial 13

resources available in current x10 installation
Resources available in current X10 installation
  • Readme.txt --- basic information on X10 installation and usage
  • Limitations.txt --- list of known limitations in the current X10 implementation
  • etc/standard.cfg --- default configuration information
  • examples/ -- root directory for a number of working X10 example programs
    • examples/Constructs shows usage of different X10 constructs
    • examples/Tutorial contains examples used in this tutorial

X10 Tutorial 14

outline1
What is X10?

background, status

Basic X10 (single place)

async, finish, atomic

future, force

Basic X10 (arrays & loops)

points, rectangular regions, arrays

for, foreach

Scalable X10 (multiple places)

places, distributions, distributed arrays, ateach, BadPlaceException

Clocks

creation, registration, next, resume, drop, ClockUseException

Basic serial constructs that differ from Java

const, nullable, extern

Advanced topics

Value types, conditional atomic sections (when), general regions & distributions

Refer to language spec for details

Outline

X10 Tutorial 15

x10 programming model single place

Activities

. . .

X10 Programming Model (Single Place)

Immutable Data (I)

-- final variables, value type instances

  • Activity = lightweight thread
    • Main program starts as single activity in Place 0
  • Permitted object references (pointers);
    • I H, H  I, II, HH, SH, S->I,
  • Prohibited references:
    • H  S, I  S, S  S
    • No data sharing permitted between parent activity’s stack and child activity’s stack
  • Single Place Memory model
    • No coherence constraints needed for I and S storage classes
    • Guaranteed coherence for H storage class --- all writes to same shared location are observed in same order by all activities
    • Largest deployment granularity for a single place is a single SMP

Shared Heap (H)

Place 0

Locally

Synchronous

(coherent access to intra-place shared heap)

Activity Stacks (S)

Storage classes:

  • Immutable Data (I)
  • Shared Heap (H)
  • Activity Stacks (S)

X10 Tutorial 16

basic x10 single place
Basic X10 (Single Place)

Core constructs used for intra-place (shared memory) parallel programming:

  • Async = construct used to execute a statement in parallel as a new activity
  • Finish = construct used to check for global termination of statement and all the activities that it has created
  • Atomic = construct used to coordinate accesses to shared heap by multiple activities
  • Future = construct used to evaluate an expression in parallel as a new activity
  • Force = construct used to check for termination of future

X10 Tutorial 17

async statement
async statement
  • async <stmt>
    • Parent activity creates a new child activity to execute <stmt> in the same place as the parent activity
    • An async statement returns immediately – parent execution proceeds immediately to next statement
    • Any access to parent’s local data must be through final variables
      • Similar to data access rules for inner classes in Java
  • Example

public class TutAsync {

const boxedInt oddSum=new boxedInt();

const boxedInt evenSum=new boxedInt();

public static void main(String[] args) {

final int n = 100;

async for (int i=1 ; i<=n ; i+=2 ) oddSum.val += i;

for (int j=2 ; j<=n ; j+=2 ) evenSum.val += j;

Variable n must be declared as final --- its value is passed from parent to child activity

X10 Tutorial 18

finish statement
finish statement
  • finish <stmt>
    • Execute <stmt> as usual, but wait until all activities spawned (transitively) by <stmt> have terminated before completing the execution of finish S
    • finish traps all exceptions thrown by activities spawned by S, and throws a wrapping exception after S has terminated.
  • Example (see TutAsync.x10):

. . .

finish {

async for (int i=1 ; i<=n ; i+=2 ) oddSum.val += i;

for (int j=2 ; j<=n ; j+=2 ) evenSum.val += j;

}

// Both oddSum and evenSum have been computed now

System.out.println("oddSum = " + oddSum.val +

" ; evenSum = " + evenSum.val);

} // main()

} // TutAsync

Console output:

oddSum = 2500 ; evenSum = 2550

X10 Tutorial 19

atomic statements methods
atomic <stmt>, atomic <method-decl>

An atomic statement/method is conceptually executed in a single step, while other activities are suspended

Note: programmer does not manage any locks explicitly

An atomic section may not include

Blocking operations

Creation of activities

Example (see TutAtomic1.x10):

finish {

async for (int i=1 ; i<=n ; i+=2 ) {

double r = 1.0d / i ; atomic rSum += r;

}

for (int j=2 ; j<=n ; j+=2 ) {

double r = 1.0d / j ; atomic rSum += r;

}

}

System.out.println("rSum = " + rSum);

Atomic statements & methods

Console output:

rSum = 5.187377517639618

X10 Tutorial 20

another example tutatomic2 x10
Another Example (TutAtomic2.x10)

public class TutAtomic2 {

const int a = new boxedInt(100);

const int b = new boxedInt(100);

public static atomic void incr_a() { a.val++ ; b.val-- ; }

public static atomic void decr_a() { a.val-- ; b.val++ ; }

public static void main(String args[]) {

int sum;

finish {

async for (int i=1 ; i<=10 ; i++ ) incr_a();

for (int i=1 ; i<=10 ; i++ ) decr_a();

}

atomic sum = a.val + b.val;

System.out.println("a+b = " + sum);

} // main()

} // TutAtomic2

Console output:

a+b = 200

X10 Tutorial 21

future force
Future & Force
  • future<type> F = future { <expr> }
    • Parent activity creates a new asynchronous child activity at <place> to evaluate <expr>
  • <type> value = F.force()
    • Caller blocks until return value is obtained from future (and all activities spawned transitively by <expr> have terminated )
  • Example (see TutFuture2.x10):

// Note that future<int> and int are different types

future<int> Fi = future { fib(10) } ;

int i = Fi.force();

// Nested future types can also be created (if need be)

future<future<int>> FFj= future { future{fib(100)} };

future<int> Fj = FFj.force();

int j = Fj.force();

X10 Tutorial 22

example tutfuture1 x10
Example (TutFuture1.x10)

Example of recursive divide-and-conquer parallelism --- calls to fib(n-1) and fib(n-2) execute in parallel

public class TutFuture1 {

static int fib(final int n) {

if ( n <= 0 ) return 0;

else if ( n == 1 ) return 1;

else {

future<int> fn_1 = future { fib(n-1) };

future<int> fn_2 = future { fib(n-2) };

return fn_1.force() + fn_2.force();

}

} // fib()

public static void main(String[] args) {

System.out.println("fib(10) = " + fib(10));

} // main()

} // TutFuture1

X10 Tutorial 23

parallel programming pitfalls deadlock
Parallel Programming Pitfalls: Deadlock
  • Deadlock occurs when parallel threads/activities acquire locks or perform other blocking operations in a sequence that creates a dependence cycle
  • Java example:
    • Thread 0
      • synchronized (Foo.a) { synchronized(Foo.b) { … } }
    • Thread 1
      • synchronized (Foo.b) { synchronized(Foo.a) { … } }
  • MPI example:
    • Process 0:
      • MPI_Recv(recvbuf, count, MPI_REAL, 1, tag, …)
    • Process 1:
      • MPI_Recv(recvbuf, count, MPI_REAL, 0, tag, …)

X10 Tutorial 24

parallel programming pitfalls deadlock contd
Parallel Programming Pitfalls: Deadlock (contd.)
  • X10 guarantee
    • Any program written with async, finish, atomic, foreach, ateach, and clock parallel constructs will never deadlock
  • Unrestricted use of future and force may lead to deadlock (see examples/Constructs/Future/FutureDeadlock_MustFailTimeout.x10):
    • f1 = future { a1() } ;
    • f2 = future { a2() };
    • int a1() { … f2.force(); … }
    • Int a2() { … f1.force(); … }
  • Restricted use of future and force in X10 can preserve guaranteed freedom from deadlocks
    • Sufficient condition #1: ensure that activity that creates the future also performs the force() operation
    • Sufficient condition #2: . . .

X10 Tutorial 25

parallel programming pitfalls data races
Parallel Programming Pitfalls: Data Races
  • A data race occurs when two (or more) threads/activities can access the same shared location in parallel such that one of the accesses is a write operation
  • Java example:
    • Thread 0: a++ ; b-- ;
    • Thread 1: a++ ; b--;
    • Data race can violate invariant that (a+b) is constant
    • Data race may also prevent multiple increments from being combined correctly
  • X10 guidelines for avoiding data races
    • Use atomic methods and blocks without worrying about deadlock
    • Declare data to be read-only (i.e., final or value type instance) whenever possible

X10 Tutorial 26

outline2
What is X10?

background, status

Basic X10 (single place)

async, finish, atomic

future, force

Basic X10 (arrays & loops)

points, rectangular regions, arrays

for, foreach

Scalable X10 (multiple places)

places, distributions, distributed arrays, ateach, BadPlaceException

Clocks

creation, registration, next, resume, drop, ClockUseException

Basic serial constructs that differ from Java

const, nullable, extern

Advanced topics

Value types, conditional atomic sections (when), general regions & distributions

Refer to language spec for details

Outline

X10 Tutorial 27

points
Points
  • A point is anelement of an n-dimensional Cartesian space (n>=1) with integer-valued coordinates e.g., [5], [1, 2], …
    • Dimensions are numbered from 0 to n-1
    • n is also referred to as the rank of the point
  • A point variable can hold values of different ranks e.g.,
    • point p; p = [1]; … p = [2,3]; …
  • The following operations are defined on a point-valued expression p1
    • p1.rank --- returns rank of point p1
    • p1.get(i) --- returns element i of point p1
      • Returns element (i mod p1.rank) if i < 0 or i >= p1.rank
    • p1.lt(p2), p1.le(p2), p1.gt(p2), p1.ge(p2)
      • Returns true iff p1 is lexicographically <, <=, >, or >= p2
      • Only defined when p1.rank and p1.rank are equal

X10 Tutorial 28

example see tutpoint x10
Example (see TutPoint.x10)

public class TutPoint {

public static void main(String[] args) {

point p1 = [1,2,3,4,5];

point p2 = [1,2];

point p3 = [2,1];

System.out.println("p1 = " + p1 + " ; p1.rank = " + p1.rank + " ; p1.get(2) = " + p1.get(2));

System.out.println("p2 = " + p2 + " ; p3 = " + p3 + " ; p2.lt(p3) = " + p2.lt(p3));

} // main()

} // TutPoint

Console output:

p1 = [1,2,3,4,5] ; p1.rank = 5 ; p1.get(2) = 3

p2 = [1,2] ; p3 = [2,1] ; p2.lt(p3) = true

X10 Tutorial 29

rectangular regions
Rectangular Regions
  • A rectangular region is the set of points contained in a rectangular subspace
  • A region variable can hold values of different ranks e.g.,
    • region R; R = [0:10]; … R = [-100:100, -100:100]; … R = [0:-1]; …
  • The following operations are defined on a region-valued expression R
    • R.rank = # dimensions in region; R.size() = # points in region
    • R.contains(P) = true if region R contains point P
    • R.contains(S) = true if region R contains region S
    • R.equal(S) = true if region R equals region S
    • R.rank(i) = projection of region R on dimension i (a one-dimensional region)
    • R.rank(i).low() = lower bound of ith dimension of region R
    • R.rank(i).high() = upper bound of ith dimension of region R
    • R.ordinal(P) = ordinal value of point P in region R
    • R.coord(N) = point in region R with ordinal value = N
    • R1 && R2 = region intersection (will be rectangular if R1 and R2 are rectangular)
    • R1 || R2 = union of regions R1 and R2 (may not be rectangular)
    • R1 – R2 = region difference (may not be rectangular)

X10 Tutorial 30

example see tutregion x10
Example (see TutRegion.x10)

public class TutRegion {

public static void main(String[] args) {

region R1 = [1:10, -100:100];

System.out.println("R1 = " + R1 + " ; R1.rank = " + R1.rank + " ; R1.size() = " + R1.size() + " ; R1.ordinal([10,100]) = " + R1.ordinal([10,100]));

region R2 = [1:10,90:100];

System.out.println("R2 = " + R2 + " ; R1.contains(R2) = " + R1.contains(R2) + " ; R2.rank(1).low() = " + R2.rank(1).low() + " ; R2.coord(0) = " + R2.coord(0));

} // main()

} // TutRegion

Console output:

R1 = {1:10,-100:100} ; R1.rank = 2 ; R1.size() = 2010 ; R1.ordinal([10,100]) = 2009

R2 = {1:10,90:100} ; R1.contains(R2) = true ; R2.rank(1).low() = 90 ; R2.coord(0) = [1,90]

X10 Tutorial 31

x10 arrays
Java arrays are one-dimensional and local

e.g., array args in main(String[] args)

Multi-dimensional arrays are represented as “arrays of arrays” in Java

X10 has true multi-dimensional arrays (as in C, Fortran) that can be distributed (as in UPC, Co-Array Fortran, ZPL, Chapel, etc.)

Array declaration

“T [.] A” declares an X10 array with element type T

An array variable can hold values of different rank)

The [.] syntax is used to avoid confusion with Java arrays

Array creation

“new T [ R ]” creates a local rectangular X10 array with rectangular region R as the index domain and T as the element (range) type

e.g., int[.] A = new int[ [0:N+1, 0:N+1] ];

Array initializers can also be specified in conjunction with creation (see TutArray1.x10)

E.g., int[.] A = new int[ [1:10,1:10] ] (point[i,j]) { return i+j; } ;

X10 Arrays

X10 Tutorial 32

x10 array operations
The following operations are defined on array-valued expression s

A.rank = # dimensions in array

A.region = index region (domain) of array

A[P] = element at point P, where P belongs to A.region

A | R = restriction of array onto region R

Useful for extracting subarrays

A.sum(), A.max() = sum/max of elements in array

A1 op A2 returns result of applying a pointwise op on array elements, when A1.region = A2. region

Op can include +, -, *, and /

A1 || A2 = disjoint union of arrays A1 and A2 (A1.region and A2.region must be disjoint)

A1.overlay(A2)

Returns an array with region, A1.region || A2.region, with element value A2[P] for all points P in A2.region and A1[P] otherwise.

A.distribution = distribution of array A

Will be discussed later when we introduce X10 places

X10 Array Operations

X10 Tutorial 33

example see tutarray1 x10
Example (see TutArray1.x10)

public class TutArray1 {

public static void main(String[] args) {

int[.] A = new int[ [1:10,1:10] ]

(point [i,j]) { return i+j;} ;

System.out.println("A.rank = " + A.rank +

" ; A.region = " + A.region);

int[.] B = A | [1:5,1:5];

System.out.println("B.max() = " + B.max());

} // main()

} // TutArray1

Console output:

A.rank = 2 ; A.region = {1:10,1:10}

B.max() = 10

X10 Tutorial 34

pointwise for loop
Pointwise for loop
  • X10 extends Java’s for loop to support sequential iteration over points in region R in canonical lexicographic order
    • for ( point p : R ) . . .
  • Standard point operations can be used to extract individual index values from point p
    • for ( point p : R ) { int i = p.get(0); int j = p.get(1); . . . }
  • Or an “exploded” syntax can be used instead of explicitly declaring a point variable
    • for ( point [i,j] : R ) { . . . }
  • The exploded syntax declares the constituent variables (i, j, …) as local int variables in the scope of the for loop body

X10 Tutorial 35

example see tutfor x10
Example (see TutFor.x10)

public class TutFor {

public static void main(String[] args) {

region R = [0:1,0:2];

System.out.print("Points in region " + R + " =");

for ( point p : R ) System.out.print(" " + p);

System.out.println();

// Use exploded syntax instead

System.out.print("(i,j) pairs in region " + R + " =");

for ( point[i,j] : R )

System.out.print("(" + i + "," + j + ")");

System.out.println();

} // main()

} // TutFor

Console output:

Points in region {0:1,0:2} = [0,0] [0,1] [0,2] [1,0] [1,1] [1,2]

(i,j) pairs in region {0:1,0:2} =(0,0)(0,1)(0,2)(1,0)(1,1)(1,2)

X10 Tutorial 36

foreach loop parallel iteration
foreach loop (Parallel iteration)
  • The X10 foreach loop is similar to the pointwise for loop, except that each iteration executes in parallel as a new asynchronous activity i.e.,
    • “foreach ( point p : R ) S” is equivalent to “for ( point p : R ) async S”
  • As before, finish can be used to wait for termination of all foreach iterations
    • finish foreach ( point[i,j] : [0:M-1,0:N-1] ) . . .
  • Special case: use foreach to create a single-dimensional parallel loop
    • foreach ( point[i] : [0 : N-1] ) S
  • Allowing a single foreach construct to span multiple dimensions makes it convenient to write parallel matrix code that is independent of the underlying rank and region e.g.
    • foreach ( point p : A.region ) A[p] = f(B[p], C[p], D[p]) ;
  • Multiple foreach instances may accesses shared data in the same place  use finish, atomic, force to avoid data races

X10 Tutorial 37

example see tutforeach1 x10
Example (see TutForeach1.x10)

public class TutForeach1 {

public static void main(String[] args) {

final int N = 5;

int[.] A = new int[[1:N,1:N]] (point[i,j]) {return i+j;};

// For the A[i,j] = F(A[i,j]) case,

// both loops can execute in parallel

finish foreach ( point[i,j] : A.region )

A[i,j] = A[i,j] + 1;

// For the A[i,j] = F(A[i,j-1]) case,

// only the outer loop can execute in parallel

finish foreach ( point[i] : A.region.rank(0) )

for (point[j]:

[(A.region.rank(1).low()+1):A.region.rank(1).high()])

A[i,j] = A[i,j-1] + 1;

NOTE: A.region.rank(0) is the same as [1:N]

X10 Tutorial 38

example contd see tutforeach1 x10
Example contd. (see TutForeach1.x10)

// For the A[i,j] = F(A[i-1,j]) case,

// only the inner loop can execute in parallel

for (point[i]:

[(A.region.rank(0).low()+1):A.region.rank(0).high()] )

finish foreach ( point[j] : A.region.rank(1) )

A[i,j] = A[i-1,j] + 1;

// For the A[i,j] = F(A[i-1,j],A[i,j-1]) case,

// use loop skewing to execute the inner loop in parallel

for ( point[t] : [4:2*N]) {

finish foreach ( point[j] : [Math.max(2,t-N):Math.min(N,t-2)]) {

int i = t - j;

System.out.print("(" + i + "," + j + ")");

A[i,j] = A[i-1,j] + A[i,j-1] + 1;

}

System.out.println();

Console output:

(2,2)

(3,2)(2,3)

(4,2)(3,3)(2,4)

(5,2)(3,4)(4,3)(2,5)

(5,3)(4,4)(3,5)

(5,4)(4,5)

(5,5)

X10 Tutorial 39

outline3
What is X10?

background, status

Basic X10 (single place)

async, finish, atomic

future, force

Basic X10 (arrays & loops)

points, rectangular regions, arrays

for, foreach

Scalable X10 (multiple places)

places, distributions, distributed arrays, ateach, BadPlaceException

Clocks

creation, registration, next, resume, drop, ClockUseException

Basic serial constructs that differ from Java

const, nullable, extern

Advanced topics

Value types, conditional atomic sections (when), general regions & distributions

Refer to language spec for details

Outline

X10 Tutorial 40

limitations of using a single place

Activities

. . .

Limitations of using a Single Place

Immutable Data (I)

-- final variables, value type instances

  • Largest deployment granularity for a single place is a single SMP
    • Smallest granularity can be a single CPU or even a single hardware thread
  • Single SMP is inadequate for solving problems with large memory and compute requirements
  • X10 solution: incorporate multiple places as a core foundation of the X10 programming model

 Enable deployment on large-scale clustered machines, with integrated support for intra-place parallelism

Shared Heap (H)

Place 0

Locally

Synchronous

(coherent access to intra-place shared heap)

Activity Stacks (S)

Storage classes:

  • Immutable Data (I)
  • Shared Heap (H)
  • Activity Stacks (S)

X10 Tutorial 41

scalable x10 using multiple places

Activities

Activities

. . .

. . .

Scalable X10: using multiple places

Immutable Data (I)

-- final variables, value type instances

Partitioned Global Address Space (PGAS)

Local Heap (LH)

Local Heap (LH)

Outbound activities

Inbound activities

Globally

Asynchronous

Locally

Synchronous

(coherent access to intra-place shared heap)

Activity Stacks (S)

Activity Stacks (S)

. . .

Inbound activity replies

Outbound activity

replies

Place 0

Place (MAX_PLACES -1)

Storage classes:

  • Immutable Data (I)
  • PGAS
    • Local Heap (LH)
    • Remote Heap (RH)
  • Activity Stacks (S)
  • Place = collection of activities & objects
    • Activities and data objects do not move after being created
  • Scalar object, O -- maps to a single place specified by O.location
  • Array object, A – may be local to a place or distributed across multiple places, as specified by A.distribution

X10 Tutorial 42

locality rule
Locality Rule
  • Any access to a mutable (shared heap) datum must be performed by an activity located at the place as the datum
  • The prohibited references are similar as before:
    • LH/RH  S, I  S, S  S
  • Local-to-remote (LH  RH) and remote-to-local (RH  LH) heap references are freely permitted
  • However, direct access via a remote heap reference is not permitted!
  • Inter-place data accesses can only be performed by creating remote activities (with weaker ordering guarantees than intra-place data accesses)
  • The locality rule is currently not checked by default. Instead, the user can perform the check explicitly by inserting a place cast operator as follows:
    • “(@ P) E” checks if expression E can be evaluated at place P
      • If so, expression E is evaluated as usual
      • If not, a BadPlaceException is thrown

X10 Tutorial 43

activity execution within a place
Activity Execution within a Place

Place

Atomic sections do not have blocking semantics

Ready

Activities

Executing

Activities

Inbound

activities

Outbound activities

Place-local activity can only its stack (S), place-local heap (LH), or immutable data (I)

Blocked

Activities

Completed

Activities

Clock

. . .

Outbound

replies

Inbound replies

Future

X10 Tutorial 44

places
Places
  • place.MAX_PLACES = total number of places
    • Default value is 4
    • Can be changed by using the -NUMBER_OF_LOCAL_PLACES option in x10 command
  • place.places = Set of all places in an X10 program(see java.lang.Set)
  • place.factory.place(i) = place corresponding to index i
  • here = place in which current activity is executing
  • <place-expr>.toString() returns a string of the form “place(id=99)”
  • <place-expr>.id returns the id of the place

X10 Data Structures

X10 language defines mapping from X10 objects to X10 places, and abstract performance metrics on places

X10 Places

Future X10 deployment system will define mapping from X10 places to system nodes; not supported in current implementation

System Nodes

X10 Tutorial 45

extension of async and future to places
Extension of async and future to places
  • async (P) S
    • Creates new activity to execute statement S at place P
    • “async S” is equivalent to “async (here) S”
  • future (P) { E }
    • Create new activity to evaluate expression E at place P
    • “future { E } ” is equivalent to “future (here) { E }”
  • Note that “here” in a child activity for an async/future computation will refer to the place P at which the child activity is executing, not the place where the parent activity is executing
  • The goal is to specify the destination place for async/future activities so as to obey the Locality Rule e.g.,
    • async (O.location) O.x = 1;
    • future<int> F = future (A.distribution[i]) { A[i] } ;

X10 Tutorial 46

distribution mapping from region to places
Distribution = mapping from region to places
  • Creating distributions (x10.lang.dist):
    • dist D1 = R-> here; // local distribution – maps region R to here
    • dist D2 = dist.factory.block(R); // blocked distribution
    • dist D3 = dist.factory.cyclic(R); // cyclic distribution
    • dist D4 = dist.factory.unique(); // identity map on [0:MAX_PLACES-1]
  • Using distributions
    • D[P] = place to which point P is mapped by distribution D (assuming that P is in D.region)
    • Allocate a distributed array e.g., T[.] A = new T[ D ];
      • Allocates an array with index set = D.region, such that element A[P] is located at place D[P] for each point P in D.region
      • NOTE: “new T[R]” for region R is equivalent to “new T[R->here]”
    • Iterating over a distribution – generalization of “foreach” to “ateach”
      • ateach is discussed in more detail later

X10 Tutorial 47

operations defined on distributions
Operations defined on distributions
  • D.region = source region of distribution
  • D.rank = rank of D.region
  • D | R = region restriction for distribution D and region R (returns a restricted distribution)
  • D | P = place restriction for distribution D and place P (returns region mapped by D to place P)
  • D1 || D2 = union of distributions D1 and D2 (assumes that D1.region and D2.region are disjoint)
  • D1.overlay(D2); // Overlay of D2 over D1 – asymmetric union
  • D.contains(p) = true iff D.region contains point p
  • D = R -> P, constant distribution which maps entire region R to place P
  • D1 – D2 = distribution difference = D1 | (D1.region – D2.region)
  • D.distributionEfficiency() = load balance efficiency of distribution D

X10 Tutorial 48

inter place communication using async and future
Inter-place communication using async and future
  • Question: how to assign A[i] = B[j], when A[i] and and B[j] may be in different places?
  • Answer #1 --- use nested async’s!

finish async ( B.distribution[j] ) {

final int bb = B[j];

async ( A.distribution[i] ) A[i] = bb;

}

  • Answer #2 --- use future-force and an async!

final int b = future (B.distribution[j]) { B[j] }.force();

finish async ( A.distribution[i] ) A[i] = b;

X10 Tutorial 49

load balance efficiency
Load Balance Efficiency
  • Consider a parallel application that is executed on P places
  • Let T(i) = computation load mapped to place i
    • For distribution D, T(i) = (D | place.factory.place(i)).size()
  • Let Tmax = max { T(i) | 1 <= i <= P }
  • Let E = SUM { T(i) | 1 <= i <= P } / (Tmax * P)
  • E is the load balance efficiency, 1/P <= E <= 1
  • E = 1 is the best case  computation load is perfectly balanced
  • E = 1/P is the worst case  computation load is placed on a single processor/place
  • Load balance efficiency is one of the key factors that limit speedup on a parallel machine
    • there are several other factors e.g., comm. & synchronization overhead
    • ignoring other factors, we expect speedup to be <= E * P
  • NOTE: also try “x10 –DUMP_STATS_ON_EXIT=true …” to see activity and atomic counts

X10 Tutorial 50

ateach loop distributed parallel iteration
ateach loop (distributed parallel iteration)
  • The X10 ateach loop is similar to the foreach loop, except that each iteration executes in parallel at a place specified by a distribution
    • “ateach ( point p : D ) S ” is equivalent to “for ( point p : D.region ) async (D[p]) S”
  • As before, finish can be used to wait for termination of all ateach iterations
    • “finish ateach( point[i] : dist.factory.unique() ) S” creates one activity per place, as in an SPMD computation
    • ateach is a convenient construct for writing parallel matrix code that is independent of the underlying distribution e.g.,
      • ateach ( point p : A.distribution ) A[p] = f(B[p], C[p], D[p]) ;

X10 Tutorial 51

example see tutateach1 x10
Example (see TutAteach1.x10)

public class TutAteach1 {

public static void main(String args[]) {

finish ateach( point[i] : dist.factory.unique() ) {

System.out.println("Hello from " + i);

}

} // main()

} // TutAteach1

dist.factory.unique() maps point i in the region, [0 : place.MAX_PLACES-1], to place place.factory.place(i)

Console output:

Hello from 1

Hello from 0

Hello from 3

Hello from 2

X10 Tutorial 52

example converting foreach to ateach see tutateach2 x10
Example: converting foreach to ateach (see TutAteach2.x10)

foreach version:

// For the A[i,j] = F(A[i,j]) case,

// both loops can execute in parallel

finish foreach ( point[i,j] : A.region )

A[i,j] = A[i,j] + 1;

ateach version #1:

finish ateach ( point[i,j] : A.distribution)

A[i,j] = A[i,j] + 1;

ateach version #2 (create only one activity per place):

finish ateach ( point p : dist.factory.unique() )

for ( point[i,j] : A.distribution | here )

A[i,j] = A[i,j] + 1;

X10 Tutorial 53

example converting foreach to ateach contd see tutateach2 x10
Example: converting foreach to ateach, contd. (see TutAteach2.x10)

foreach version:

// For the A[i,j] = F(A[i,j-1]) case,

// only the outer loop can execute in parallel

finish foreach ( point[i] : [1:N] )

for ( point[j]: [2:N] )

A[i,j] = A[i,j-1] + 1;

ateach version:

// Assume that N is a multiple of place.MAX_PLACES

finish ateach ( point[i] : dist.factory.block([1:N]) )

for ( point[j]: [2:N] )

A[i,j] = A[i,j-1] + 1;

X10 Tutorial 54

outline4
What is X10?

background, status

Basic X10 (single place)

async, finish, atomic

future, force

Basic X10 (arrays & loops)

points, rectangular regions, arrays

for, foreach

Scalable X10 (multiple places)

places, distributions, distributed arrays, ateach, BadPlaceException

Clocks

creation, registration, next, resume, drop, ClockUseException

Basic serial constructs that differ from Java

const, nullable, extern

Advanced topics

Value types, conditional atomic sections (when), general regions & distributions

Refer to language spec for details

Outline

X10 Tutorial 55

x10 clocks motivation
X10 clocks: Motivation
  • Activity coordination using finish and force() is accomplished by checking for activity termination
  • However, there are many cases in which a producer-consumer relationship exists among the activities, and a “barrier”-like coordination is needed without waiting for activity termination
    • The activities involved may be in the same place or in different places

Phase 0

Phase 1

. . .

. . .

Activity 0

Activity 1

Activity 2

X10 Tutorial 56

x10 clocks
clock c = clock.factory.clock();

Allocate a clock, register current activity with it. Phase 0 of c starts.

async(…) clocked (c1,c2,…) S

ateach(…) clocked (c1,c2,…) S

foreach(…) clocked (c1,c2,…) S

Create async activities registered on clocks c1, c2, …

c.resume();

Nonblocking operation that signals completion of work by current activity for this phase of clock c

next;

Barrier --- suspend until all clocks that the current activity is registered with can advance. c.resume() is first performed for each such clock, if needed.

Next can be viewed like a “finish” of all computations under way in the current phase of the clock

X10 Clocks

X10 Tutorial 57

x10 clocks contd
c.drop();

Unregister with c. A terminating activity will implicitly drop all clocks that it is registered on.

c.registered()

Return true iff current activity is registered on clock c

c.dropped() returns the opposite of c.registered()

ClockUseException

Thrown if an activity attempts to transmit or operate on a clock that it is not registered on

X10 Clocks (contd.)

X10 Tutorial 58

example see tutclock1 x10
Example (see TutClock1.x10)

Example of transmitting clock from parent to child

finish async {

final clock c = clock.factory.clock();

foreach (point[i]: [1:N]) clocked (c) {

while ( true ) {

int old_A_i = A[i]; int new_A_i = Math.min(A[i],B[i]);

if ( i > 1 ) new_A_i = Math.min(new_A_i,B[i-1]);

if ( i < N ) new_A_i = Math.min(new_A_i,B[i+1]);

A[i] = new_A_i;

next;

int old_B_i = B[i]; int new_B_i = Math.min(B[i],A[i]);

if ( i > 1 ) new_B_i = Math.min(new_B_i,A[i-1]);

if ( i < N ) new_B_i = Math.min(new_B_i,A[i+1]);

B[i] = new_B_i;

next;

if ( old_A_i == new_A_i && old_B_i == new_B_i ) break;

} // while

} // foreach

} // finish async

NOTE: exiting from while loop terminates activity for iteration i, and automatically deregisters activity from clock

X10 Tutorial 59

outline5
What is X10?

background, status

Basic X10 (single place)

async, finish, atomic

future, force

Basic X10 (arrays & loops)

points, rectangular regions, arrays

for, foreach

Scalable X10 (multiple places)

places, distributions, distributed arrays, ateach, BadPlaceException

Clocks

creation, registration, next, resume, drop, ClockUseException

Basic serial constructs that differ from Java

const, nullable, extern

Advanced topics

Value types, conditional atomic sections (when), general regions & distributions

Refer to language spec for details

Outline

X10 Tutorial 60

nullable
nullable
  • By default, object references in X10 are not allowed to take on the null value
  • However, the nullable type constructor can be used to enable certain object references to be set to null, or to compare them with null e.g.,

T1 a;

nullable T2 b;

a = null; // Not allowed

b = null; // Allowed

  • NOTE: “const” is simply a shorthand for “static final”

X10 Tutorial 61

extern
extern
  • X10 provides a simple mechanism for invoking external functions written in C
  • Currently, the C function is restricted to arguments with primitive types or references to “unsafe” X10 arrays
  • The X10 program must contain an external declaration of the C function as follows …

static extern char doit(int a, float b)

… and also a statement to ensure that the native DLL, <dll>.dll is loaded

static { System.loadLibrary(“<dll>");}

  • The X10 compiler then generates a file called <class>_x10stub.c
  • To generate the DLL, the C programmer must compile the C function by including the file jni.h in tehir C function, and must link with the object file obtained from <class>_x10stub.c

X10 Tutorial 62

outline6
What is X10?

background, status

Basic X10 (single place)

async, finish, atomic

future, force

Basic X10 (arrays & loops)

points, rectangular regions, arrays

for, foreach

Scalable X10 (multiple places)

places, distributions, distributed arrays, ateach, BadPlaceException

Clocks

creation, registration, next, resume, drop, ClockUseException

Basic serial constructs that differ from Java

const, nullable, extern

Advanced topics

Value types, conditional atomic sections (when), general regions & distributions

Refer to language spec for details

Outline

X10 Tutorial 63

ad