
Safe and Efficient Cluster Communication in Java using Explicit Memory Management


Presentation Transcript


1. Safe and Efficient Cluster Communication in Java using Explicit Memory Management
Chi-Chao Chang, Dept. of Computer Science, Cornell University

2. Goal
High-performance cluster computing with safe languages
• parallel and distributed applications
Use off-the-shelf technologies
• Java
  • safe: “better C++”
  • “write once, run everywhere”
  • growing interest for high-performance applications (Java Grande)
• user-level network interfaces (UNIs)
  • direct, protected access to network devices
  • prototypes: U-Net (Cornell), Shrimp (Princeton), FM (UIUC)
  • industry standard: Virtual Interface Architecture (VIA)
• cost-effective clusters: new 256-processor cluster @ Cornell TC

3. Java Networking
[Figure: protocol stack, with Apps over RMI/RPC and Sockets on the Java side, Active Messages/MPI/FM on the C side, all over the UNI and networking devices]
Traditional “front-end” approach
• pick a favorite abstraction (sockets, RMI, MPI) and Java VM
• write a Java front-end to custom or existing native libraries
• good performance, re-use of proven code
• magic in native code, no common solution
Interface Java with network devices
• bottom-up approach
• minimizes the amount of unverified code
• focus on fundamental data-transfer inefficiencies due to:
  1. storage safety
  2. type safety

4. Outline
Thesis Overview
• GC/Native heap separation, object serialization
Experimental Setup: VI Architecture and Marmot
Part I: Array Transfers
(1) Javia-I: Java Interface to the VI Architecture
• respects heap separation
(2) Jbufs: Safe and Explicit Management of Buffers
• Javia-II, matrix multiplication, Active Messages
Part II: Object Transfers
(3) A Case for Specialization
• micro-benchmarks, RMI using Javia-I/II, impact on application suite
(4) Jstreams: In-place De-serialization
• micro-benchmarks, RMI using Javia-III, impact on application suite
Conclusions

5. (1) Storage Safety
Java programs are garbage-collected
• no explicit de-allocation: the GC tracks and frees garbage objects
• programs are oblivious to the GC scheme used: non-copying (e.g. conservative) or copying
• no control over the location of objects
Modern network and I/O devices
• DMA directly from/into user buffers
• native code is necessary to interface with hardware devices

6. (1) Storage Safety
Result: a hard separation between the GC and native heaps
Pin-on-demand only works for send/write operations
• for receive/read operations, the GC must be disabled indefinitely...
[Figure: application memory split into GC heap and native heap, next to the NI. (a) Hard separation: copy-on-demand, data is copied from the GC heap into a pinned native-heap buffer before DMA. (b) Optimization: pin-on-demand, the GC-heap buffer is pinned in place for DMA.]

7. (1) Storage Safety: Effect
Best-case scenario: a 10-40% hit in throughput
• pick your favorite JVM, your fastest network interface, and a pair of 450MHz P-IIs with a commodity OS
• pinning on demand is expensive...

8. (2) Type Safety
Cannot forge a reference to a Java object
• b is an array of bytes
• in C: double *data = (double *)b;
• in Java:
    double[] data = new double[1024/8];
    for (int i = 0, off = 0; i < 1024/8; i++, off += 8) {
      int upper = ((b[off]   & 0xff) << 24) + ((b[off+1] & 0xff) << 16) +
                  ((b[off+2] & 0xff) << 8)  +  (b[off+3] & 0xff);
      int lower = ((b[off+4] & 0xff) << 24) + ((b[off+5] & 0xff) << 16) +
                  ((b[off+6] & 0xff) << 8)  +  (b[off+7] & 0xff);
      data[i] = Double.longBitsToDouble((((long)upper) << 32) + (lower & 0xffffffffL));
    }
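
For reference, a self-contained version of the conversion loop above (the class and method names are ours, not the thesis's); it round-trips a double through a byte array using the same shift-and-mask technique:

    public class TypeSafetyDemo {
        // Big-endian byte[]-to-double[] conversion, as sketched on the slide.
        static double[] bytesToDoubles(byte[] b) {
            double[] data = new double[b.length / 8];
            for (int i = 0, off = 0; i < data.length; i++, off += 8) {
                long bits = 0;
                for (int j = 0; j < 8; j++)
                    bits = (bits << 8) | (b[off + j] & 0xffL);
                data[i] = Double.longBitsToDouble(bits);
            }
            return data;
        }

        public static void main(String[] args) {
            byte[] b = new byte[8];
            long bits = Double.doubleToLongBits(3.25);
            for (int j = 0; j < 8; j++)
                b[j] = (byte) (bits >>> (56 - 8 * j));
            System.out.println(bytesToDoubles(b)[0]);  // prints 3.25
        }
    }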

9. (2) Type Safety
Objects have meta-data
• runtime safety checks (array-bounds, array-store, casts)
In C:
    struct Buffer { int len; char data[1]; };
    struct Buffer *b = malloc(sizeof(struct Buffer) + 1024);
    b->len = 1024;
In Java:
    class Buffer {
      int len;
      byte[] data;
      Buffer(int n) { data = new byte[n]; len = n; }
    }
    Buffer b = new Buffer(1024);
[Figure: in-memory layouts. The C Buffer is one contiguous 1024-byte block; the Java Buffer object carries a vtable and lock word and points to a separate byte[] object with its own meta-data.]

10. (2) Type Safety
Result: Java objects need to be serialized and de-serialized across the network
[Figure: an object is serialized from the GC heap and copied into a pinned native-heap buffer before DMA to the NI.]

11. (2) Type Safety: Effect
Performance hit of one order of magnitude:
• pick your favorite high-level communication abstraction (e.g. Remote Method Invocation)
• pick your favorite JVM, your fastest network interface, and a pair of 450MHz P-IIs

12. Thesis
Use explicit memory management to improve Java communication performance
• Jbufs: safe and explicit management of Java buffers
  • softens the GC/Native heap separation
  • preserves type and storage safety
  • “zero-copy” array transfers
• Jstreams: extends Jbufs to optimize serialization in clusters
  • “zero-copy” de-serialization of arbitrary objects
[Figure: application memory with a user-controlled, pinned region that can move between the GC and native heaps, so DMA proceeds directly between it and the NI.]

13. Outline
Thesis Overview
• GC/Native heap separation, object serialization
Experimental Setup: Giganet cluster and Marmot
Part I: Array Transfers
(1) Javia-I: Java Interface to the VI Architecture
• respects heap separation
(2) Jbufs: Safe and Explicit Management of Buffers
• Javia-II, matrix multiplication, Active Messages
Part II: Object Transfers
(3) A Case for Specialization
• micro-benchmarks, RMI using Javia-I/II, impact on application suite
(4) Jstreams: In-place De-serialization
• micro-benchmarks, RMI using Javia-III, impact on application suite
Conclusions

14. Giganet Cluster
Configuration
• 8 P-II 450MHz, 128MB RAM
• 8 × 1.25 Gbps Giganet GNN-1000 adapters
• one Giganet switch
GNN-1000 adapter: user-level network interface
• Virtual Interface Architecture implemented as a library (Win32 DLL)
Baseline point-to-point performance
• 14µs round-trip latency, 16µs with the switch
• over 100 MBytes/s peak, 85 MBytes/s with the switch

15. Marmot
Java system from Microsoft Research
• not a VM
• static compiler: bytecode (.class) to x86 (.asm)
• linker: .asm files + runtime libraries -> executable (.exe)
• no dynamic loading of classes
• most Dragon-book optimizations, some OO and Java-specific optimizations
Advantages
• source code available
• good performance
• two types of non-concurrent GC (copying, conservative)
• native interface “close enough” to JNI

16. Outline
Thesis Overview
• GC/Native heap separation, object serialization
Experimental Setup: Giganet cluster and Marmot
Part I: Array Transfers
(1) Javia-I: Java Interface to the VI Architecture
• respects heap separation
(2) Jbufs: Safe and Explicit Management of Buffers
• Javia-II, matrix multiplication, Active Messages
Part II: Object Transfers
(3) A Case for Specialization
• micro-benchmarks, RMI using Javia-I/II, impact on application suite
(4) Jstreams: In-place De-serialization
• micro-benchmarks, RMI using Javia-III, impact on application suite
Conclusions

17. Javia-I
Basic architecture
• respects heap separation
• buffer management in native code
• Marmot as an “off-the-shelf” system
• copying GC disabled while in native code
• primitive array transfers only
Send/Recv API
• non-blocking and blocking
• bypass ring accesses
• pin-on-demand
• alloc-recv: allocates a new array on demand
• cannot eliminate copying during recv
[Figure: Javia-I architecture. On the Java side, byte-array references and a send/recv ticket ring per Vi; on the C side, descriptors, send/recv queues, and buffers over VIA.]
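
The slide does not reproduce the Javia-I signatures, but the send/recv variants above suggest an interface roughly like the following sketch (all names and shapes here are our assumptions, not the thesis's API):

    // Illustrative sketch of a Javia-I-style virtual interface.
    class Vi {
        // Blocking send: pins the array region on demand (or copies it
        // into a pre-pinned native buffer), then posts a VIA descriptor.
        native void send(byte[] data, int len);

        // Non-blocking variants return a ticket redeemed on completion.
        native int postSend(byte[] data, int len);
        native void sendWait(int ticket);

        // alloc-recv: allocates a fresh array and copies the message into
        // it, since the message cannot DMA directly into the GC heap.
        native byte[] recvAlloc();
    }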

18. Javia-I: Performance
Basic costs (P-II 450, Windows 2000 b3):
• pin + unpin = (10 + 10)µs, or ~5000 machine cycles
• Marmot: native call = 0.28µs, locks = 0.25µs, array alloc = 0.75µs
Latency (N = transfer size in bytes):
• raw: 16.5µs + 25ns · N
• pin(s): 38.0µs + 38ns · N
• copy(s): 21.5µs + 42ns · N
• copy(s)+alloc(r): 18.0µs + 55ns · N
BW: 75% to 85% of raw for 16KByte transfers

19. Jbufs
Goal
• provide buffer management capabilities to Java without violating its safety properties
• re-use is important: it amortizes the high pinning costs
A jbuf exposes communication buffers to Java programmers:
1. lifetime control: explicit allocation and de-allocation
2. efficient access: direct access as primitive-typed arrays
3. location control: safe de-allocation and re-use by controlling whether or not a jbuf is part of the GC heap
• the heap separation becomes soft and user-controlled

20. Jbufs: Lifetime Control
    public class jbuf {
      public static jbuf alloc(int bytes); /* allocates jbuf outside of GC heap */
      public void free() throws CannotFreeException; /* frees jbuf if it can */
    }
1. jbuf allocation does not result in a Java reference to it
• cannot access the jbuf through the wrapper object
2. a jbuf is not automatically freed when there are no Java references to it
• free has to be called explicitly
[Figure: a handle outside the GC heap points to the jbuf.]

21. Jbufs: Efficient Access
    public class jbuf {
      /* alloc and free omitted */
      public byte[] toByteArray() throws TypedException; /* hands out byte[] ref */
      public int[] toIntArray() throws TypedException;   /* hands out int[] ref */
      . . .
    }
3. (Storage safety) a jbuf remains allocated as long as there are array references into it
• when can we ever free it?
4. (Type safety) a jbuf cannot have two differently typed references to it at any given time
• when can we ever re-use it (e.g. change its reference type)?
[Figure: a Java byte[] reference from the GC heap points into the jbuf.]

22. Jbufs: Location Control
    public class jbuf {
      /* alloc, free, to*Array omitted */
      public void unRef(CallBack cb); /* app intends to free/re-use jbuf */
    }
Idea: use the GC to track references
unRef: the application claims it has no references into the jbuf
• the jbuf is added to the GC heap
• the GC verifies the claim and notifies the application through the callback
• the application can now free or re-use the jbuf
Required GC support: change the scope of the GC heap dynamically
[Figure: three stages. The jbuf starts outside the GC heap with a Java byte[] reference into it; after unRef it is temporarily part of the GC heap; after the GC's callBack it is outside again and free for re-use.]
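
Putting slides 20-22 together, a minimal usage sketch of the jbuf lifecycle. We assume here that CallBack declares a single callback() method; the thesis's exact interface may differ:

    class RecvBuffer implements CallBack {
        private jbuf buf;

        void setup() throws Exception {
            buf = jbuf.alloc(16 * 1024);    // lives outside the GC heap, pinned
            int[] data = buf.toIntArray();  // zero-copy, typed view of the jbuf
            // ... use data for communication ...
            buf.unRef(this);                // claim: no more references into buf
        }

        // Invoked by the GC once it has verified that no references remain.
        public void callback() {
            try {
                buf.free();                 // or call to*Array again to re-use
            } catch (CannotFreeException e) {
                // the claim was wrong: the jbuf is still referenced
            }
        }
    }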

23. Jbufs: Runtime Checks
[State diagram: alloc enters, and free leaves, the unref state; to<p>Array moves unref to ref<p>, which self-loops on to<p>Array and GC; unRef moves ref<p> to to-be-unref<p>, which self-loops on to<p>Array and unRef; the GC* transition returns to-be-unref<p> to unref.]
Type safety: the ref and to-be-unref states are parameterized by primitive type
The GC* transition depends on the type of garbage collector:
• non-copying: transition only if all refs to the array are dropped before GC
• copying: transition occurs after every GC
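
A sketch of how these checks might be enforced at runtime, assuming the three states above. The thesis implements this inside Marmot's runtime; this Java rendering, including the names, is ours:

    enum State { UNREF, REF, TO_BE_UNREF }

    class JbufState {
        private State state = State.UNREF;
        private Class<?> type;  // the <p> parameter of ref<p> / to-be-unref<p>

        synchronized void onToArray(Class<?> p) throws TypedException {
            if (state == State.UNREF) { state = State.REF; type = p; }
            else if (type != p) throw new TypedException(); // one ref type at a time
            // REF and TO_BE_UNREF keep their state (self-loops in the diagram)
        }

        synchronized void onUnRef() {
            if (state == State.REF) state = State.TO_BE_UNREF;
        }

        // The GC* transition: the collector proved no array refs remain.
        synchronized void onGcVerified() { state = State.UNREF; type = null; }

        synchronized void onFree() throws CannotFreeException {
            if (state != State.UNREF) throw new CannotFreeException();
        }
    }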

24. Javia-II
Exploiting jbufs
• explicit pinning/unpinning of jbufs
• only non-blocking send/recvs
[Figure: Javia-II architecture. On the Java side, the send/recv ticket ring tracks jbuf state and array refs per Vi; on the C side, descriptors and send/recv queues over VIA.]

25. Javia-II: Performance
Basic jbuf costs: allocation = 1.2µs, to*Array = 0.8µs, unRef = 2.3µs, GC degradation = 1.2µs/jbuf
Latency (n = transfer size in bytes):
• raw: 16.5µs + 25ns · n
• jbufs: 20.5µs + 25ns · n
• pin(s): 38.0µs + 38ns · n
• copy(s): 21.5µs + 42ns · n
BW within 1% of raw

26. MM: Communication
pMM over Javia-II/jbufs spends at least 25% less time in communication for 256x256 matrices on 8 processors

27. MM: Overall
Cache effects: better communication performance does not always translate into better overall performance

28. Active Messages
    class First extends AMHandler {
      private int first;
      void handler(AMJbuf buf, …) {
        int[] tmp = buf.toIntArray();
        first = tmp[0];
      }
    }

    class Enqueue extends AMHandler {
      private Queue q;
      void handler(AMJbuf buf, …) {
        int[] tmp = buf.toIntArray();
        q.enq(tmp);
      }
    }
Exercising jbufs:
• the user supplies a list of jbufs
• upon message arrival:
  • the jbuf is passed to the handler
  • unRef is invoked after the handler returns
• if the pool is empty, reclaim existing jbufs
• copying is deferred to GC time, and only if needed
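
A hypothetical example of driving the handlers above. The endpoint, registration, and send calls shown here are placeholders; the thesis's Jam interface may look different:

    class AMExample {
        public static void main(String[] args) throws Exception {
            AMEndpoint ep = AMEndpoint.create("myEndpoint"); // placeholder factory
            ep.register(1, new Enqueue());       // handler table: id -> handler
            for (int i = 0; i < 8; i++)          // supply a pool of receive jbufs
                ep.provideJbuf(jbuf.alloc(4096));
            int[] msg = {7, 11, 13};
            ep.send("otherNode", 1, msg);        // invokes Enqueue.handler remotely
        }
    }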

29. AM: Performance
Latency about 15µs higher than Javia
• synchronized access to the buffer pool, endpoint header, flow-control checks, handler id lookup
BW within 10% of peak for 16KByte messages

30. Jbufs: Experience
Efficient access through arrays is useful:
• no indirect access via method invocation
• promotes code re-use of large numerical kernels
• leverages compiler infrastructure for eliminating safety checks
Limitations
• still not as flexible as C buffers
• stale references may confuse programmers
Discussed in the thesis:
• the necessity of explicit de-allocation
• implementation of jbufs in Marmot’s copying collector
• impact on conservative and generational collectors
• extensions to JNI to allow “portable” implementations of jbufs

31. Outline
Thesis Overview
• GC/Native heap separation, object serialization
Experimental Setup: VI Architecture and Marmot
Part I: Array Transfers
(1) Javia-I: Java Interface to the VI Architecture
• respects heap separation
(2) Jbufs: Safe and Explicit Management of Buffers
• Javia-II, matrix multiplication, Active Messages
Part II: Object Transfers
(3) A Case for Specialization on Homogeneous Clusters
• micro-benchmarks, RMI using Javia-I/II, impact on application suite
(4) Jstreams: In-place De-serialization
• micro-benchmarks, RMI using Javia-III, impact on application suite
Conclusions

32. Object Serialization and RMI
Standard JOS protocol
• “heavy-weight” class descriptors are serialized along with objects
• type-checking: classes need not be “equal”, just “compatible”
• the protocol allows for user extensions
Remote Method Invocation
• object-oriented version of Remote Procedure Call
• relies on JOS for argument passing
• the actual parameter object can be a sub-class of the formal parameter class
[Figure: writeObject serializes from the sender’s GC heap across the network; readObject reconstructs the object in the receiver’s GC heap.]
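
This is the standard java.io machinery the thesis measures against; shown here with an in-memory byte stream rather than a socket, just to make the example self-contained:

    import java.io.*;

    public class JosBaseline {
        public static void main(String[] args) throws Exception {
            double[] payload = new double[1024];

            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            ObjectOutputStream out = new ObjectOutputStream(bytes);
            out.writeObject(payload);   // writes class descriptor + data
            out.flush();

            ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()));
            double[] copy = (double[]) in.readObject(); // allocates + copies
            System.out.println(copy.length);
        }
    }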

33. JOS Costs
1. overheads in the tens or hundreds of µs:
• send/recv overhead ≈ 3µs, memcpy of 500 bytes ≈ 0.8µs
2. double[] is 50% more expensive than a byte[] of similar size
3. overheads grow as object sizes grow

34. Impact of Marmot
Impact of Marmot’s optimizations:
• method inlining: up to 66% improvement (already deployed)
• no synchronization whatsoever: up to 21% improvement
• no safety checks whatsoever: up to 15% combined
Better compilation technology is unlikely to reduce the overheads substantially

35. Impact on RMI
• an order of magnitude worse than Javia-I/II
• round-trip latency drops to about 30µs for a null RMI: no JOS!
• peak bandwidth of 22 MBytes/s, about 25% of raw

36. Impact on Applications
A case for specializing serialization for cluster applications:
• overheads an order of magnitude higher than send/recv and memcpy
• RMI performance degraded by one order of magnitude
• 5-15% “estimated” impact on applications
• old adage: “specialize for the common case”

37. Optimizing De-serialization
“In-place” object de-serialization
• specialization for homogeneous clusters and JVMs
Goal
• eliminate copying and allocation of objects
Challenges
• preserve the integrity of the receiving JVM
• permit de-serialization of arbitrary Java objects, with unrestricted usage and without special annotations
• independent of any particular GC scheme
[Figure: writeObject sends the object’s in-memory layout from the sender’s GC heap across the network, so the receiver’s GC heap can use it in place.]

38. Jstreams: write
    public class Jstream extends Jbuf {
      public void writeObject(Object o) /* serializes o onto the stream */
        throws TypedException, ReferencedException;
      public void writeClear() /* clears the stream for writing */
        throws TypedException, ReferencedException;
    }
writeObject
• deep-copies objects: maintains the in-memory layout
• deals with cyclic data structures
• swizzles pointers: offsets to a base address
• replaces object meta-data with a 64-bit class descriptor
• optimization: primitive-typed arrays in jbufs are not copied
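
A send-side sketch using the API above. We assume a Jstream.alloc factory analogous to jbuf.alloc and a Javia-II-style send that takes the jstream; both are our notation, not the thesis's:

    class SendSide {
        void sendArgs(Vi vi) throws Exception {
            Jstream out = Jstream.alloc(16 * 1024);  // assumed factory
            out.writeObject(new int[]{1, 2, 3});     // deep copy, in-memory layout
            vi.send(out);                            // DMA straight from the jstream
            out.writeClear();                        // reset for the next message
        }
    }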

39. Jstreams: read
    public class Jstream extends Jbuf {
      public Object readObject() throws TypedException; /* de-serialization */
      public boolean isJstream(Object o); /* checks if o resides in the stream */
    }
readObject
• replaces class descriptors with meta-data
• unswizzles pointers, array-bounds checking
• after the first readObject, adds the jstream to the GC heap
• tracks references coming out of read objects
• unRef: the user is willing to free or re-use
[Figure: read objects live inside the jstream within the GC heap; unRef hands the jstream to the GC, which signals via callBack when it can be freed or re-used.]
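
The matching receive-side sketch: readObject hands back a reference into the jstream itself, so no copy or allocation happens. The recvJstream call is a placeholder for however a posted jstream is redeemed:

    class RecvSide implements CallBack {
        private Jstream in;

        void handleMessage(Vi vi) throws Exception {
            in = vi.recvJstream();                // placeholder receive call
            int[] data = (int[]) in.readObject(); // in-place: data points into 'in'
            // ... use data; nothing was copied or allocated ...
            in.unRef(this);                       // done: let the GC verify
        }

        public void callback() {
            // safe to re-post 'in' for receives: it has left read mode
        }
    }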

40. Jstreams: Runtime Checks
[State diagram: alloc enters, and free leaves, the unref state; writeObject moves unref to write mode (self-loop on writeObject and GC), and writeClear returns it to unref; readObject moves unref to read mode (self-loop on readObject and unRef); unRef moves read mode to to-be-unref, and the GC* transition returns to-be-unref to unref.]
Modification to Javia-II: prevent DMA from clobbering de-serialized objects
• receive posts are not allowed while a jstream is in read mode
• no changes to the Javia-II architecture

41. Jstreams: Performance
De-serialization costs are constant w.r.t. object size
• 2.6µs for arrays, 3.3µs per list element

42. Jstreams: Impact on RMI
• 4-byte round-trip latency of 45µs (25µs higher than Javia-II)
• 52 MBytes/s for 16KByte arguments

43. Jstreams: Impact on Applications
3-10% improvement in SOR, EM3D, FFT
10% hit in pMM performance
• over 22,000 incoming RMIs, 1000 jstreams in the receive pool, ~26 garbage collections: 15% of total execution time in GC
• generational collection would alleviate GC costs substantially
• receive pool size is hard to tune: trade-offs between GC and locality

44. Jstreams: Experience
Implementation of readObject and writeObject integrated into the JVM
• the protocol is JVM-specific
• a native implementation is faster
Limitations
• not as flexible as Java streams: cannot read and write at the same time
• no “extensible” wire protocols
Discussed in the thesis:
• implementation of jstreams in Marmot’s copying collector
• support for polymorphic RMI: minor changes to the stub compiler
• JNI extensions to allow “portable” implementations of jstreams

45. Related Work
Microsoft J-Direct
• “pinned” arrays defined using source-level annotations
• the JIT produces code to “redirect” array access: expensive
• Berkeley’s Jaguar: efficient code generation with JIT extensions
• security concern: JIT “hacks” may break Java or bytecode safety
Custom JVMs
• many “tricks” are possible (e.g. pinned array factories, pinned and non-pinned heaps): they depend on a particular GC scheme
• jbufs: isolate the minimal support needed from the GC
Memory management
• safe regions (Gay and Aiken): reference counting, no GC
Fast serialization and RMI
• KaRMI (Karlsruhe): fixed JOS, ground-up RMI implementation
• Manta (Vrije U): fast RMI, but a Java dialect

46. Summary
Use of explicit memory management to improve Java communication performance in clusters
• softens the GC/Native heap separation
• preserves type and storage safety
• independent of the GC scheme
• jbufs: zero-copy array transfers
• jstreams: zero-copy de-serialization of arbitrary objects
Framework for building communication software and applications in Java
• Javia-I/II
• parallel matrix multiplication
• Jam: active messages
• Java RMI
• cluster applications: TSP, IDA, SOR, EM3D, FFT, and MM
