
“Towards an SSI for HP Java”


Presentation Transcript


  1. “Towards an SSI for HP Java” • Francis Lau, The University of Hong Kong • With contributions from C.L. Wang, Ricky Ma, and W.Z. Zhu

  2. Cluster Coming of Age • HPC • Cluster the de facto standard equipment • Grid? • Clusters • Fortran or C + MPI the norm • 99% on top of bare-bone Linux or the like • OK if the application is embarrassingly parallel and regular

  3. Cluster for the Masses • Commercial: Data Mining, Financial Modeling, Oil Reservoir Simulation, Seismic Data Processing, Vehicle and Aircraft Simulation • Government: Nuclear Stockpile Stewardship, Climate and Weather, Satellite Image Processing, Forces Modeling • Academic: Fundamental Physics (particles, relativity, cosmology), Biochemistry, Environmental Engineering, Earthquake Prediction • Two modes: • For number crunching in Grande-type applications (superman) • As a CPU farm to support high-throughput computing (poor man)

  4. Cluster Programming • Auto-parallelization tools have limited success • Parallelization a chore but “have to do it” (or let’s hire someone) • Optimization for performance not many users’ cup of tea • Partitioning and parallelization • Mapping • Remapping (experts?)

  5. Amateur Parallel Programming • Common problems • Poor parallelization: few large chunks or many small chunks • Load imbalances: large and small chunks • Meeting the amateurs half-way • They do crude parallelization • System does the rest: mapping/remapping (automatic optimization) • And I/O?

  6. Automatic Optimization • “Feed the fat boy with two spoons, and a few slim ones with one spoon” • But load information could be elusive • Need smart runtime support • Goal is to achieve high performance with good resource utilization and load balancing • Large chunks that are single-threaded are a problem

  7. The Good “Fat Boys” • Large chunks that span multiple nodes • Must be a program with multiple execution “threads” • Threads can be in different nodes – program expands and shrinks • Threads/programs can roam around – dynamic migration • This encourages fine-grain programming (figure: a program spread over cluster nodes like an “amoeba”)

  8. Mechanism and Policy • Mechanism for migration • Traditional process migration • Thread migration • Redirection of I/O and messages • Object sharing between nodes for threads • Policy for good dynamic load balancing • Message traffic a crucial parameter • Predictive • Towards the “single system image” ideal

  9. Single System Image • If user does only crude parallelization and system does the rest … • If processes/threads can roam, and processes expand/shrink … • If I/O (including sockets) can be at any node anytime … • We achieve at least 50% of SSI • The rest is difficult (SSI aspects: single entry point, file system, virtual networking, I/O and memory space, process space, management/programming view, …)

  10. Bon Java! • Java (for HPC) in good hands • JGF Numerics Working Group, IBM Ninja, … • JGF Concurrency/Applications Working Group (benchmarking, MPI, …) • The workshops • Java has many advantages (vs. Fortran and C/C++) • Performance not an issue any more • Threads as first-class citizens! • JVM can be modified • “Java has the greatest potential to deliver an attractive productive programming environment spanning the very broad range of tasks needed by the Grande programmer” – The Java Grande Forum Charter

  11. Process vs. Thread Migration • Process migration easier than thread migration • Threads are tightly coupled • They share objects • Two styles to explore • Process, MPI (“distributed computing”) • Thread, shared objects (“parallel computing”) • Or combined • Boils down to messages vs. distributed shared objects

  12. Two Projects @ HKU • M-JavaMPI – “M” for “Migration” • Process migration • I/O redirection • Extension to grid • No modification of JVM and MPI • JESSICA – “Java-Enabled Single System Image Computing Architecture” • By modifying JVM • Thread migration, Amoeba mode • Global object space, I/O redirection • JIT mode (Version 2)

  13. Design Choices • Bytecode instrumentation: insert code into programs, manually or via pre-processor • JVM extension: make thread state accessible from the Java program; non-transparent; modification of JVM is required • Checkpointing the whole JVM process: powerful but heavy penalty • Modification of JVM (runtime support): totally transparent to the applications; efficient but very difficult to implement

  14. M-JavaMPI • Support transparent Java process migration and provide communication redirection services • Communication using MPI • Implemented as a middleware on top of standard JVM • No modifications of JVM and MPI • Checkpointing the Java process + code insertion by preprocessor

  15. System Architecture

  16. Preprocessing • Bytecode is modified before being passed to the JVM for execution • “Restoration functions” are inserted as exception handlers, in the form of encapsulated “try-catch” statements • Re-arrangement of bytecode, and addition of local variables

  17. The Layers • Java-MPI API layer • Restorable MPI layer • Provides restorable MPI communications • No modification of MPI library • Migration Layer • Captures and saves the execution state of the migrating process in the source node, and restores the execution state of the migrated process in the destination node • Cooperates with the Restorable MPI layer to reconstruct the communication channels of the parallel application
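
As an illustration of this layering (not the actual M-JavaMPI code), the sketch below shows how a restorable send could wrap an underlying Java-MPI binding so that a channel broken by migration is re-established and the operation retried; RawMpi, ChannelBrokenException, and reconnect() are hypothetical names invented for this sketch.

    // Hypothetical sketch of a restorable send in the Restorable MPI layer.
    // RawMpi and ChannelBrokenException are placeholders, not M-JavaMPI APIs.
    public final class RestorableMpi {
        private final RawMpi mpi;              // underlying Java-MPI binding
        public RestorableMpi(RawMpi mpi) { this.mpi = mpi; }

        public void send(byte[] buf, int dest, int tag) {
            while (true) {
                try {
                    mpi.send(buf, dest, tag);  // normal path
                    return;
                } catch (ChannelBrokenException e) {
                    // The peer (or this process) has migrated: ask the MPI daemon
                    // for the new location, rebuild the channel, then retry.
                    mpi.reconnect(dest);
                }
            }
        }
    }

    interface RawMpi {
        void send(byte[] buf, int dest, int tag) throws ChannelBrokenException;
        void reconnect(int dest);
    }

    class ChannelBrokenException extends Exception {}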

  18. State Capturing and Restoring • Program code: re-used in the destination node • Data: captured and restored by using the object serialization mechanism • Execution context: captured by using JVMDI and restored by inserted exception handlers • Eager (all) strategy: For each frame, local variables, referenced objects, the name of the class and class method, and program counter are saved using object serialization
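
To make the eager strategy concrete, here is a minimal sketch (not the actual M-JavaMPI data structures) of a per-frame record saved with standard Java object serialization; the class and field names are illustrative only.

    import java.io.*;
    import java.util.*;

    // Illustrative record of one captured stack frame (cf. slide 18).
    class CapturedFrame implements Serializable {
        String className;            // name of the class
        String methodName;           // name of the class method
        int pc;                      // program counter at capture time
        Object[] localVariables;     // local variables (primitives boxed)
        Object[] referencedObjects;  // objects referenced from this frame
    }

    class StateCapture {
        // Serialize all frames of the migrating process into a byte stream
        // that can be shipped to the destination node and deserialized there.
        static byte[] capture(List<CapturedFrame> frames) throws IOException {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
                oos.writeObject(new ArrayList<>(frames));
            }
            return bos.toByteArray();
        }
    }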

  19. State Capturing using JVMDI

Before preprocessing:

    public class A {
        int a;
        char b;
        …
    }

After preprocessing (restoration code inserted as an exception handler):

    public class A {
        try {
            …
        } catch (RestorationException e) {
            a = saved value of local variable a;
            b = saved value of local variable b;
            pc = saved value of the program counter when the program was suspended;
            jump to the location where the program was suspended;
        }
    }

  20. Message Redirection Model • MPI daemon in each node to support message passing between distributed Java processes • IPC between the Java program and the MPI daemon in the same node through shared memory and semaphores (diagram: client-server arrangement between Java processes and MPI daemons)

  21. Process migration steps (diagram: source node → destination node)

  22. Experiments • PC Cluster • 16-node cluster • 300 MHz Pentium II with 128 MB of memory • Linux 2.2.14 with Sun JDK 1.3.0 • 100 Mb/s Fast Ethernet • All Java programs executed in interpreted mode

  23. Bandwidth: Ping-Pong Test • Native MPI: 10.5 MB/s • Direct Java-MPI binding: 9.2 MB/s • Restorable MPI layer: 7.6 MB/s

  24. Latency: Ping-Pong Test • Native MPI: 0.2 ms • Direct Java-MPI binding: 0.23 ms • Restorable MPI layer: 0.26 ms

  25. Migration Cost: capturing and restoring objects

  26. Migration Cost: capturing and restoring frames

  27. Application Performance • PI calculation • Recursive ray-tracing • NAS integer sort • Parallel SOR

  28. Time spent in calculating PI and ray-tracing with and without the migration layer

  29. Execution time of the NAS program with different problem sizes (16 nodes)

    Problem size            Without M-JavaMPI (sec)     With M-JavaMPI (sec)        Overhead (%)
    (no. of integers)       Total    Comp    Comm       Total    Comp    Comm       Total   Comm
    Class S: 65536          0.023    0.009   0.014      0.026    0.009   0.017      13%     21%
    Class W: 1048576        0.393    0.182   0.212      0.424    0.182   0.242      7.8%    14%
    Class A: 8388608        3.206    1.545   1.66       3.387    1.546   1.840      5.6%    11%

  No noticeable overhead is introduced in the computation part, while the communication part shows an overhead of about 10-20%.

  30. Time spent in executing SOR using different numbers of nodes with and without the migration layer

  31. Cost of Migration • Time spent in executing the SOR program on an array of size 256x256, without and with one migration during the execution

  32. Cost of Migration • Time spent in migration (in seconds) for different applications

    Application     Average migration time (sec)
    PI              2
    Ray-tracing     3
    NAS             2
    SOR             3

  33. Dynamic Load Balancing • A simple test • SOR program was executed using six nodes in an unevenly loaded environment, with one of the nodes executing a computationally intensive program • Without migration: 319 s • With migration: 180 s

  34. In Progress • M-JavaMPI in JIT mode • Develop system modules for automatic dynamic load balancing • Develop system modules for effective fault-tolerance support

  35. Java Virtual Machine • Class Loader: loads class files (application and Java API class files) • Interpreter: executes bytecode • Runtime Compiler: converts bytecode to native code (diagram: class files flow through the class loader; bytecode is either interpreted or compiled to native code)

  36. A Multithreaded Java Program: Threads in the JVM

    public class ProducerConsumerTest {
        public static void main(String[] args) {
            CubbyHole c = new CubbyHole();
            Producer p1 = new Producer(c, 1);
            Consumer c1 = new Consumer(c, 1);
            p1.start();
            c1.start();
        }
    }

  (Diagram: class files are loaded by the class loader into the method area; each thread has its own PC and stack frames; the execution engine runs the threads; objects live in the shared heap.)

  37. Java Memory Model (how to maintain memory consistency between threads) • Each thread (T1, T2) has a per-thread working memory; object master copies live in main memory (the heap) • A variable is loaded from main memory into working memory before use • The variable is modified in T1’s working memory • When T1 performs an unlock, the variable is written back to main memory • When T2 performs a lock, the variable in its working memory is flushed • When T2 then uses the variable, it is loaded from main memory
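
A small example of the rule just described (a sketch for illustration, not taken from the slides): under the Java Memory Model, T2 is only guaranteed to see T1's write after T1 releases and T2 acquires the same lock.

    // T1 writes under the lock; the unlock writes the value back to main memory.
    // T2 reads under the same lock; the lock forces a reload from main memory.
    public class JmmVisibility {
        private final Object lock = new Object();
        private int value;                   // the shared variable

        void writer() {                      // runs in T1
            synchronized (lock) {
                value = 42;                  // modified in T1's working memory
            }                                // unlock: flushed to main memory
        }

        int reader() {                       // runs in T2
            synchronized (lock) {            // lock: stale working copy discarded
                return value;                // loaded from main memory
            }
        }
    }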

  38. Problems in Existing DJVMs • Mostly based on interpreters • Simple but slow • Layered design using a distributed shared memory system (DSM) → cannot be tightly coupled with the JVM • JVM runtime information cannot be channeled to DSM • False sharing if page-based DSM is employed • Page faults block the whole JVM • Programmer to specify thread distribution → lack of transparency • Need to rewrite multithreaded Java applications • No dynamic thread distribution (preemptive thread migration) for load balancing

  39. Related Work • Method shipping: IBM cJVM • Like remote method invocation (RMI): when accessing object fields, the proxy redirects the flow of execution to the node where the object's master copy is located • Executed in interpreter mode • Load balancing problem: affected by the object distribution • Page shipping: Rice U. Java/DSM, HKU JESSICA • Simple; GOS supported by some page-based Distributed Shared Memory (e.g., TreadMarks, JUMP, JiaJia) • JVM runtime information can’t be channeled to DSM • Executed in interpreter mode • Object shipping: Hyperion, Jackal • Leverage some object-based DSM • Executed in native mode: Hyperion translates Java bytecode to C; Jackal compiles Java source code directly to native code

  40. Distributed Java Virtual Machine (DJVM) • JESSICA2: a distributed Java Virtual Machine (DJVM) spanning multiple cluster nodes can provide a true parallel execution environment for multithreaded Java applications, with a Single System Image illusion to Java threads (diagram: Java threads created in a program run over a Global Object Space spanning the PCs of a cluster connected by a high-speed network)

  41. JESSICA2 Main Features • Transparent Java thread migration • Runtime capturing and restoring of thread execution context • No source code modification; no bytecode instrumentation (preprocessing); no new API introduced • Enables dynamic load balancing on clusters • Operated in Just-In-Time (JIT) compilation mode • Global Object Space • A shared global heap spanning all cluster nodes • Adaptive object home migration protocol • I/O redirection

  42. Transparent Thread Migration in JIT Mode • Simple for interpreters (e.g., JESSICA) • Interpreter sits in the bytecode decoding loop, which can be stopped upon checking a migration flag • The full state of a thread is available in the data structures of the interpreter • No register allocation • JIT mode execution makes things complex (JESSICA2) • Native code has no clear bytecode boundary • How to deal with machine registers? • How to organize the stack frames (all are in native form now)? • How to make extracted thread states portable and recognizable by the remote JVM? • How to restore the extracted states (rebuild the stack frames) and restart the execution in native form? • Need to modify the JIT compiler to instrument native code

  43. An overview of JESSICA2 Java thread migration (diagram) • (1) Alert: the load monitor notifies the thread scheduler in the source JVM • (2) Stack analysis and stack capturing of the migrating thread’s frames • (3) The captured frames are shipped by the migration manager to the destination node, where frame parsing and restoration of execution take place • (4a) Object accesses go through the GOS (heap) • (4b) Method code is loaded from NFS into the method area

  44. Essential Functions • Migration point selection • At the start of a loop, basic block, or method • Register context handler • Spill dirty registers at the migration point without invalidation so that native code can continue the use of registers • Use a register-recovering stub at the restoring phase • Variable type deduction • Spill types in stacks using compression • Java frame linking • Discover consecutive Java frames
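
The instrumentation itself happens in JIT-generated native code, but a source-level analogy may help; the sketch below (with the hypothetical names Migration.flagSet() and Migration.captureAndYield()) shows the effect of adding a migration check at the start of each loop iteration, one of the migration points listed above.

    // Conceptual, source-level view of a migration point at a loop head.
    // The real system emits this check in native code ("cmp mflag,0; jz ...").
    class MigrationPointExample {
        static double work(double[] data) {
            double sum = 0.0;
            for (int i = 0; i < data.length; i++) {
                if (Migration.flagSet()) {        // migration checking at the loop start
                    Migration.captureAndYield();  // spill registers/locals, capture frames
                }
                sum += data[i] * data[i];         // ordinary loop body
            }
            return sum;
        }
    }

    // Hypothetical runtime hooks, named only for this illustration.
    class Migration {
        static volatile boolean flag;
        static boolean flagSet() { return flag; }
        static void captureAndYield() { /* capture thread state, resume remotely */ }
    }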

  45. Dynamic Thread State Capturing and Restoring in JESSICA2 (diagram) • During bytecode translation the JIT compiler selects migration points and instruments the generated native code: 1. add migration checking (e.g., “cmp mflag,0; jz ...”), 2. add object checking (e.g., “cmp obj[offset],0; jz ...”), 3. add type and register spilling to stack slots (e.g., “mov 0x110182, slot ...”) • Register-recovering code generation moves spilled slots back into registers on restore (e.g., “mov slot1->reg1; mov slot2->reg2; ...”) • Capturing scans the native thread stack, separating Java frames from C frames; restoring involves linking and constant resolution, with object accesses going through the Global Object Space

  46. How to Maintain Memory Consistency in a Distributed Environment? (diagram: threads T1-T8 run on multiple PCs, each with its own heap and OS, connected by a high-speed network)

  47. Embedded Global Object Space (GOS) • Take advantage of JVM runtime information for optimization (e.g., object types, accessing threads, etc.) • Use threaded I/O interface inside JVM for communication to hide the latency → non-blocking GOS access • OO-based to reduce false sharing • Home-based, compliant with the JVM Memory Model (“Lazy Release Consistency”) • Master heap (home objects) and cache heap (local and cached objects): reduce object access latency
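
A rough sketch of the home-based access path (CacheHeap, HomeNode, and the objectId keys are invented names, not JESSICA2 internals): a read first checks the local cache heap and only fetches the master copy from the object's home node on a miss; on a lock acquire the cached copies are invalidated, as lazy release consistency requires.

    // Illustrative home-based read in a Global Object Space.
    class CacheHeap {
        private final java.util.Map<Long, byte[]> cached = new java.util.HashMap<>();

        byte[] read(long objectId, HomeNode home) {
            byte[] copy = cached.get(objectId);
            if (copy == null) {                  // cache miss
                copy = home.fetch(objectId);     // fetch master copy from the home node
                cached.put(objectId, copy);      // keep a local cached copy
            }
            return copy;                         // later reads hit the cache
        }

        // On lock acquire (lazy release consistency) cached copies are dropped,
        // so the next read re-fetches from the home node.
        void acquire() { cached.clear(); }
    }

    interface HomeNode {
        byte[] fetch(long objectId);             // returns the object's master copy
    }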

  48. Object Cache

  49. Adaptive object home migration • Definition • “home” of an object = the JVM that holds the master copy of an object • Problems • cache objects need to be flushed and re-fetched from the home whenever synchronization happens • Adaptive object home migration • if # of accesses from a thread dominates the total # of accesses to an object, the object home will be migrated to the node where the thread is running
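
A minimal sketch of the policy stated above: keep per-node access counters for an object and move the home when one node's share of the accesses dominates; the class name and the 0.8 threshold are illustrative assumptions, not the actual JESSICA2 parameters.

    // Illustrative adaptive home-migration decision for one shared object.
    class HomeMigrationPolicy {
        private final int[] accessCount;                // accesses per node
        private int total;
        private int home;                               // node holding the master copy
        private static final double THRESHOLD = 0.8;    // assumed dominance threshold

        HomeMigrationPolicy(int nodes, int initialHome) {
            accessCount = new int[nodes];
            home = initialHome;
        }

        // Called on each access; returns the (possibly migrated) home node.
        int recordAccess(int node) {
            accessCount[node]++;
            total++;
            if (node != home && accessCount[node] > THRESHOLD * total) {
                home = node;                            // migrate the home to that node
            }
            return home;
        }
    }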

  50. I/O redirection • Timer • Use the time in the master node as the standard time • Calibrate the time in worker nodes when they register with the master node • File I/O • Use a half-word of the “fd” as the node number • Open file • For read, check local first, then the master node • For write, go to the master node • Read/Write • Go to the node specified by the node number in the fd • Network I/O • Connectionless send: do it locally • Others go to the master
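
To make the fd encoding concrete, here is a small sketch (helper names are invented for illustration) of packing a node number into the upper half-word of a 32-bit file descriptor and recovering both parts when a read or write must be redirected.

    // Illustrative encoding: upper 16 bits = node number, lower 16 bits = local fd.
    final class GlobalFd {
        static int encode(int nodeId, int localFd) {
            return (nodeId << 16) | (localFd & 0xFFFF);
        }
        static int nodeOf(int globalFd)  { return globalFd >>> 16; }
        static int localOf(int globalFd) { return globalFd & 0xFFFF; }
    }

    // Example: GlobalFd.encode(3, 7) yields an fd whose I/O is redirected to
    // node 3, local descriptor 7: nodeOf(fd) == 3, localOf(fd) == 7.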
