
The X10 Programming Language



Presentation Transcript


    1. 1 The X10 Programming Language Vijay Saraswat IBM TJ Watson Research Center August 2007

    2. 2 A new era of mainstream parallel processing Contacts: Vivek Sarkar/Watson/IBM, Vijay Saraswat/Watson/IBM. Parallelism scaling replaces frequency scaling as the foundation for increased performance. Parallelism scaling can be observed at three important levels of the hardware stack: multi-core parallelism; heterogeneous parallelism (as in the Cell processor); cluster parallelism, as in Blue Gene or in commodity scale-out clusters. The move towards parallelism as the primary driver for system performance will have a profound impact on software, because all software will need to be enabled to exploit parallelism. Some areas of commercial software (e.g. transaction systems) are already prepared for this trend from past investments in SMP-enablement. However, a lot of other software is predominantly single-threaded and has been riding the frequency scaling curve predicted by Moore’s Law for the last two decades. The goal of the X10 project in IBM Research is to respond to this future software crisis by establishing new foundations for programming models, languages, tools, compilers, runtimes, virtual machines, and libraries for parallel hardware.

    3. 3 MPI Library for message-passing. Standardized by the MPI Forum (academics, industry) in the mid-1990s. Widely available with vendor-supported implementations. By far the most widely used infrastructure in HPC for parallel computing. … But very low-level: explicit, static management of handshakes is cumbersome and error-prone; explicit management of distribution is cumbersome and error-prone (cf. MG, HPL). Not suitable for fine-grained concurrency, adaptive computation, uneven task lengths, or code that is not Bulk Synchronous Parallel. Performance challenges from network support for one-sided memory access, multicore.

    4. 4 Java Java 1.1 had support for multi-threading. A program can create multiple threads, up to a limit. Monitor-based concurrency control: wait, notify. Monolithic heap – no support for distribution. Cumbersome memory model. Lock-based concurrency control … no support for lock-free algorithms. No support for fine-grained concurrency. No support for closures, value types. Poor support for arrays.

    5. 5 The X10 Programming Model

    6. 6 async async (P) S Creates a new child activity at place P that executes statement S. Returns immediately. S may reference final variables in enclosing blocks. Activities cannot be named. An activity cannot be aborted or cancelled.
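X10's async is not runnable outside an X10 implementation, but its spawn-and-return-immediately behaviour can be approximated in plain Java with an executor. This is a hedged sketch, not X10 semantics: the class name, the array, and the value 42 are all invented for illustration, and a single-place Java pool stands in for X10's places.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class AsyncSketch {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newCachedThreadPool();
        final int[] data = new int[4];      // child may reference final variables, as in X10

        // Analogue of: async { data(0) = 42; } -- submit returns immediately,
        // the child activity runs concurrently with the parent.
        pool.submit(() -> { data[0] = 42; });

        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS); // crude stand-in for finish
        System.out.println(data[0]);                // 42
    }
}
```

The `awaitTermination` call is what makes the final read well-defined; without some such synchronization the parent could observe the array before the child writes it, which is exactly the ordering problem finish addresses on the next slide.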

    7. 7 finish finish S Execute S, but wait until all (transitively) spawned asyncs have terminated. Rooted exception model Trap all exceptions thrown by spawned activities. Throw an (aggregate) exception if any spawned async terminates abruptly. implicit finish at main activity finish is useful for expressing “synchronous” operations on (local or) remote data.
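The wait-for-all-spawned-activities behaviour of finish can be sketched in Java with `ExecutorService.invokeAll`, which blocks until every submitted task has terminated (and, like finish's rooted exception model, surfaces task failures to the caller via the returned futures). Task bodies and numbers here are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class FinishSketch {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        final int[] partial = new int[4];

        // Analogue of: finish { for (i) async { partial(i) = i*i; } }
        List<Callable<Void>> tasks = new ArrayList<>();
        for (int i = 0; i < 4; i++) {
            final int id = i;
            tasks.add(() -> { partial[id] = id * id; return null; });
        }
        pool.invokeAll(tasks); // blocks until all spawned tasks have terminated

        int sum = 0;
        for (int v : partial) sum += v;
        System.out.println(sum); // 0 + 1 + 4 + 9 = 14
        pool.shutdown();
    }
}
```

This is the "synchronous operation on local data" use of finish: the parent can safely read `partial` only because the barrier guarantees every child write happened before it.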

    8. 8 atomic, when Atomic blocks are executed in a single step, conceptually, while other activities are suspended. An atomic block may not block, access remote data, create activities, or contain a conditional block. Essentially, the body is a bounded, sequential, non-blocking activity, hence executing in a single place.
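A rough single-place Java analogue of an atomic block is a critical section guarded by one lock: the body runs as if in a single step with respect to other activities using the same lock. The bank-balance scenario and all names below are invented for the sketch; Java's `synchronized` is weaker than X10's atomic (it only excludes other holders of the same monitor), so treat this as an analogy only.

```java
public class AtomicSketch {
    static int balance = 0;

    // Analogue of X10's: atomic { balance += delta; }
    // synchronized makes the read-modify-write a single indivisible step.
    static synchronized void deposit(int delta) {
        balance += delta;
    }

    public static void main(String[] args) throws Exception {
        Thread[] ts = new Thread[8];
        for (int i = 0; i < ts.length; i++) {
            ts[i] = new Thread(() -> {
                for (int k = 0; k < 1000; k++) deposit(1);
            });
            ts[i].start();
        }
        for (Thread t : ts) t.join();
        System.out.println(balance); // 8000: no lost updates
    }
}
```

Without the `synchronized` keyword the increments race and the total is typically below 8000, which is precisely the class of bug atomic rules out by construction.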

    9. 9 X10 v1.01 Cheat sheet PPoPP: Vijay – upgrade to 1.01 syntax.

    10. 10 X10 v1.01 Cheat sheet: Array support Vijay – upgrade to 1.01 syntax.

    11. 11 An example of finish and async: dfs spanning tree
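The slide's X10 code is not reproduced in this transcript. As an illustration of the finish/async pattern it names, here is a hedged Java sketch of a parallel DFS spanning tree: each visit claims unvisited neighbours with a compare-and-set (so every node gets exactly one tree parent) and recursively spawns child visits, with `ForkJoinTask.invokeAll` playing the role of finish. The four-node graph and all identifiers are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;
import java.util.concurrent.atomic.AtomicIntegerArray;

public class DfsSpanningTree {
    static int[][] adj;               // adjacency lists
    static AtomicIntegerArray parent; // parent[v] == -1 means unvisited

    static class Visit extends RecursiveAction {
        final int v;
        Visit(int v) { this.v = v; }
        protected void compute() {
            List<Visit> kids = new ArrayList<>();
            for (int w : adj[v]) {
                // CAS claims w exactly once, so each node has a unique tree parent
                if (parent.compareAndSet(w, -1, v)) kids.add(new Visit(w));
            }
            invokeAll(kids); // "finish": wait for all spawned child visits
        }
    }

    public static void main(String[] args) {
        adj = new int[][] { {1, 2}, {0, 3}, {0, 3}, {1, 2} }; // a 4-cycle
        parent = new AtomicIntegerArray(new int[] { 0, -1, -1, -1 }); // root is 0
        new ForkJoinPool().invoke(new Visit(0));
        // every node reached: no parent entry is still -1
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 4; i++) sb.append(parent.get(i) >= 0 ? 'R' : 'U');
        System.out.println(sb); // RRRR
    }
}
```

Which parent a node ends up with depends on scheduling, but reachability (every entry claimed) is deterministic, so the sketch is race-free in the sense the later determinacy slides care about.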

    12. 12 An example of clocking: bfs spanning tree

    13. 13 Types

    14. 14 Dependent types A class or interface that is a function of values. The programmer specifies properties of a type – public final instance fields. The programmer may specify refinement types as predicates on properties: T(v1,…,vn : c) denotes all instances of T with property values fi == vi satisfying c, where c is a boolean expression over predefined predicates.

    15. 15 Place types Every X10 reference inherits the property (place loc) from X10RefClass. Constraints can be placed on this property, e.g. loc == here, or loc == x.loc. No constraints implies the data can be anywhere. Place types are checked by place-shifting operators (async, future).

    16. 16 Region and distribution types

    17. 17 Work–stealing for fine grained scheduling

    18. 18 CWS extensions being investigated Decouple call-stack from deque → strict series/parallel graphs (cf. dfs/bfs). Global Termination Detection: detect when all workers are stealing, none is executing work-items, and there are no messages in flight. Global Quiescence Detection: do this repeatedly (for a global clock) → need two deques/worker. Permit use of area-specific data-structures, e.g. for reduction.

    19. 19 Work-stealing with network traffic Use polling mode. On network call, worker may discover incoming asyncs and move them to its deque When idle, steal from other workers or poll.

    20. 20 Work-stealing via Dekker

    21. 21 Deadlock freedom

    22. 22 Deadlock freedom Where is this useful? Whenever the synchronization pattern of a program is independent of the data read by the program. True for a large majority of HPC codes. (Usually not true of reactive programs.) More generally: data-dependent type systems for deadlock-freedom? Central theorem of X10: arbitrary programs with async, atomic, finish, clocks are deadlock-free. Key intuition: atomic is deadlock-free; finish has a tree-like structure; clocks are made to satisfy conditions which ensure a tree-like structure. Hence no cycles in the wait-for graph.

    23. 23 Determinacy in X10

    24. 24 Imperative Programming Revisited Variables Variable = value in a box. Read: fetch the current value. Write: change the value. Stability condition: the value does not change unless a write is performed. Very powerful: permits repeated many-writer, many-reader communication through arbitrary reference graphs; mutability in the presence of sharing; permits different variables to change at different rates. Asynchrony introduces indeterminacy – a program may write out either 0 or 1. Bugs due to races are very difficult to debug.
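The "may write out either 0 or 1" indeterminacy is easy to reproduce in plain Java; the sketch below is an invented minimal example, and its output genuinely depends on how the scheduler interleaves the two threads.

```java
public class RaceSketch {
    static volatile int x = 0; // volatile rules out stale caching, not the race itself

    public static void main(String[] args) throws Exception {
        Thread writer = new Thread(() -> { x = 1; });
        Thread reader = new Thread(() -> System.out.println(x)); // prints 0 or 1
        writer.start();
        reader.start();
        writer.join();
        reader.join();
    }
}
```

Because neither run is "wrong" in isolation, such bugs survive testing and only bite under different timing, which is the debugging difficulty the slide refers to.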

    25. 25 Determinate Imperative Programming Key idea: the sequence of assignments to a det variable is viewed as a stream. Each activity carries an index for each det variable. The index is increased on every read and write. Ensure through the type-system that at each index exactly one writer can write → No races! Any (recursive) asynchronous Kahn network can be represented thus.
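Java has no det variables, but the stream view can be approximated with a single-writer blocking queue: the one producer writes indices 0, 1, 2, … in order and the consumer reads them in the same order, so the outcome is determinate regardless of scheduling. All names and values are invented for this sketch; X10's proposal enforces the single-writer-per-index rule statically, whereas here it holds only by convention.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class DetStream {
    public static void main(String[] args) throws Exception {
        // The "det" variable as a stream of indexed values with exactly one writer.
        BlockingQueue<Integer> det = new ArrayBlockingQueue<>(16);

        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < 5; i++) det.put(i * 10); // sole writer, in index order
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();

        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 5; i++) sb.append(det.take()).append(' '); // reads in index order
        producer.join();
        System.out.println(sb.toString().trim()); // 0 10 20 30 40
    }
}
```

This is the Kahn-network discipline in miniature: channels with blocking reads and a unique writer yield the same history on every run.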

    26. 26 Current Status

    27. 27 X10DT: Enhancing productivity Code editing Refactoring Code visualization Data visualization Debugging Static performance analysis

    28. 28 Operational X10 implementation (since 02/2005) X10 Compiler (06/2007)

    29. 29 X10Flash Distributed runtime In C/C++ On top of messaging library (GASNet, ARMCI, LAPI) Targeted for high-performance clusters of SMPs. X10lib Runtime also made available as a standalone library. Supporting global address space, places, asyncs, clocks, futures etc. Performance goal To be competitive with MPI Release schedule Internal demonstration 12/07 External release 2008

    30. 30 Conclusion New era for concurrency research. Concurrency is now mainstream – it affects millions of developers. Practical research focused on concurrent language design, analysis techniques, type systems, compiler development, application support … can have a major impact.

    31. 31 Acknowledgments Recent Publications: "Concurrent Clustered Programming", V. Saraswat, R. Jagadeesan. CONCUR conference, August 2005. "X10: An Object-Oriented Approach to Non-Uniform Cluster Computing", P. Charles, C. Donawa, K. Ebcioglu, C. Grothoff, A. Kielstra, C. von Praun, V. Saraswat, V. Sarkar. OOPSLA Onward! conference, October 2005. "A Theory of Memory Models", V. Saraswat, R. Jagadeesan, M. Michael, C. von Praun, to appear PPoPP 2007. "Experiences with an SMP Implementation for X10 based on the Java Concurrency Utilities", Rajkishore Barik, Vincent Cave, Christopher Donawa, Allan Kielstra, Igor Peshansky, Vivek Sarkar. Workshop on Programming Models for Ubiquitous Parallelism (PMUP), September 2006. "X10: an Experimental Language for High Productivity Programming of Scalable Systems", K. Ebcioglu, V. Sarkar, V. Saraswat. P-PHEC workshop, February 2005. Tutorials: TiC 2006, PACT 2006, OOPSLA 2006. X10 Core Team: Rajkishore Barik, Vincent Cave, Chris Donawa, Allan Kielstra, Igor Peshansky, Christoph von Praun, Vijay Saraswat, Vivek Sarkar, Tong Wen. X10 Tools: Philippe Charles, Julian Dolby, Robert Fuhrer, Frank Tip, Mandana Vaziri. Emeritus: Kemal Ebcioglu, Christian Grothoff. Research colleagues: R. Bodik, G. Gao, R. Jagadeesan, J. Palsberg, R. Rabbah, J. Vitek. Several others at IBM.

    32. 32 Global RandomAccess

    33. 33 Global Random Access benchmark

    34. 34 Theory of Memory Models

    35. 35 Background: Why memory models? Current architectures optimize for single-thread execution. Sequential Consistency (SC) is not consistent with these optimizations, so a weaker memory model is desired. Fundamental Property (FP): programs whose SC executions have no races should have only SC executions. Who should have responsibility for race-freedom: the implementation or the programmer? The programmer may know a lot about the computation. Idea: make the semantics as weak as possible, while preserving FP. Need precise semantics!

    36. 36 Test Case 7: Single-thread reordering
    (r1=z,r2=x;y=r2 | r3=y;z=r3;x=1)
    r1,r2,y=z,x,x | r3,z,x=y,y,1 -- CO,CO
    r1,r2,y=z,x,x | r3,z=y,y | x=1 -- DE
    x=1; r1,r2,y=z,x,x | r3,z=y,y -- AU
    x,r1,r2,y=1,z,1,1 | r3,z=y,y -- CO
    x,r1,r2=1,z,1 | y=1 | r3,z=y,y -- DE
    x,r1,r2=1,z,1 | y=1 ; r3,z=y,y -- AU
    x,r1,r2=1,z,1 | y,r3,z=1,1,1 -- CO
    y,r3,z=1,1,1; x,r1,r2=1,z,1 -- AU
    y,r3,z,x,r1,r2=1,1,1,1,1,1 -- CO
    The last step is a process with a single step. So: Yes!

    37. 37 Technical overview Model sequential execution through steps Finite functions over partial stores. Model concurrent execution through a pomset of steps Partial order reflects the “happens before” order. Memory Model = Transformations Combine steps Break apart steps Simplify steps Add hb edges Add “links” Develop a formal calculus to establish all possible behaviors. All Causality Cases can be dealt with in a simple fashion. Model synchronization constructs raw and shared variables async, finish atomic, isolated … more constructs?
