The Millipede Project at the Technion, Israel, develops a robust Virtual Parallel Machine (VPM) that utilizes non-dedicated distributed environments, such as clusters of PCs and symmetric multiprocessors (SMPs). It supports multi-threaded parallel programming paradigms and provides shared memory, dynamic job migration, and load sharing to optimize resource utilization. Millipede gives programmers an intuitive abstraction for implementing parallel applications, with efficient communication and synchronization through mechanisms such as the Millipede Job Event Control (MJEC).
The MILLIPEDE Project, Technion, Israel
A Windows-NT based Distributed Virtual Parallel Machine
http://www.cs.technion.ac.il/Labs/Millipede
What is Millipede?
A strong Virtual Parallel Machine: it employs non-dedicated distributed environments.
[Layer diagram: Programs run on an Implementation of Parallel Programming Languages, which runs on the Distributed Environment.]
Programming Paradigms: SPLASH, Cilk/Calipso, CC++, Java, CParPar, ParC, ParFortran90, "Bare Millipede", others
Millipede Layer: Events Mechanism (MJEC), Migration Services (MGS), Distributed Shared Memory (DSM), user-mode threads
Communication Packages: U-Net, Transis, Horus, ...
Operating System Services: Communication, Threads, Page Protection, I/O
So, what's in a VPM?
Check list:
• Using a non-dedicated cluster of PCs (+ SMPs)
• Multi-threaded
• Shared memory
• User-mode
• Strong support for weak memory
• Dynamic page- and job-migration
• Load sharing for maximal locality of reference
• Convergence to optimal level of parallelism
Using a non-dedicated cluster
• Dynamically identify idle machines
• Move work to idle machines
• Evacuate busy machines
• Do everything transparently to the native user
• Co-existence of several parallel applications
Multi-Threaded Environments
Well known:
• Better utilization of resources
• An intuitive and high level of abstraction
• Latency hiding by overlapping computation and communication
• Natural for parallel programming paradigms & environments
• Programmer-defined maximum level of parallelism
• Actual level of parallelism set dynamically; applications scale up and down
• Nested parallelism
• SMPs
Convergence to Optimal Speedup
The tradeoff: a higher level of parallelism vs. better locality of memory reference.
• Optimal speedup is not necessarily achieved with the maximal number of computers
• The achieved level of parallelism depends on the program's needs and on the capabilities of the system
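To make the tradeoff concrete, here is a simple illustrative cost model (ours, not from the slides): with total work W spread over p hosts and a remote-reference overhead C(p) that grows as locality degrades,

$$T(p) = \frac{W}{p} + C(p), \qquad S(p) = \frac{T(1)}{T(p)},$$

so the speedup S(p) peaks at the p where the marginal gain from adding a host no longer exceeds the extra communication cost; that optimum typically lies below the maximal number of machines.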
No/Explicit/Implicit Access Shared Memory

PVM (no shared memory, message passing):

    /* Receive data from master */
    msgtype = 0;
    pvm_recv(-1, msgtype);
    pvm_upkint(&nproc, 1, 1);
    pvm_upkint(tids, nproc, 1);
    pvm_upkint(&n, 1, 1);
    pvm_upkfloat(data, n, 1);
    /* Determine which slave I am (0..nproc-1) */
    for (i = 0; i < nproc; i++)
        if (mytid == tids[i]) { me = i; break; }
    /* Do calculations with data */
    result = work(me, n, data, tids, nproc);
    /* Send result to master */
    pvm_initsend(PvmDataDefault);
    pvm_pkint(&me, 1, 1);
    pvm_pkfloat(&result, 1, 1);
    msgtype = 5;
    master = pvm_parent();
    pvm_send(master, msgtype);
    /* Exit PVM before stopping */
    pvm_exit();

C-Linda (explicit access to shared tuple space):

    /* Retrieve data from DSM */
    rd("init data", ?nproc, ?n, ?data);
    /* Worker id is given at creation, no need to compute it now */
    /* Do calculation, put result in DSM */
    out("result", id, work(id, n, data, nproc));

"Bare" Millipede (implicit access through DSM):

    result = work(milGetMyId(), n, data, milGetTotalIds());
Relaxed Consistency (avoiding false sharing and ping-pong of page copies)
• Sequential, CRUW, Sync(var), Arbitrary-CW: multiple relaxations for different shared variables within the same program
• No broadcast, no central address servers (so it can work efficiently on interconnected LANs)
• New protocols welcome (user defined?!)
• Step-by-step optimization towards maximal parallelism
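A minimal sketch of how per-variable relaxation might look to the programmer. The function names milAlloc/milSetConsistency and the MIL_* constants are assumptions for illustration, not the documented Millipede API; only the four protocol names come from the slide.

    /* Illustrative only: selecting a consistency protocol per
     * shared variable.  All names below except the protocol list
     * are hypothetical. */
    #include <stddef.h>

    typedef enum {
        MIL_SEQUENTIAL,   /* strongest ordering (default)           */
        MIL_CRUW,         /* CRUW relaxation, as named on the slide */
        MIL_SYNC_VAR,     /* consistency tied to a sync variable    */
        MIL_ARBITRARY_CW  /* arbitrary concurrent writes            */
    } mil_protocol;

    extern void *milAlloc(size_t size);                         /* hypothetical */
    extern void  milSetConsistency(void *addr, mil_protocol p); /* hypothetical */

    double *global;   /* e.g., the Global structure of the LU example */

    void setup(void)
    {
        global = milAlloc(1024 * sizeof(double));
        /* Relax only this variable; the rest of the DSM stays sequential. */
        milSetConsistency(global, MIL_ARBITRARY_CW);
    }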
LU Decomposition, 1024x1024 matrix, written in SPLASH: advantages gained when reducing the consistency of a single variable (the Global structure):
MJEC - Millipede Job Event Control
An open mechanism with which various synchronization methods can be implemented:
• A job has a unique system-wide id
• Jobs communicate and synchronize by sending events
• Although a job is mobile, its events follow it and reach its event queue wherever it goes
• Event handlers are context-sensitive
MJEC (cont'd)
Modes:
• In Execution-Mode: arriving events are enqueued
• In Dispatching-Mode: events are dequeued and handled by a user-supplied dispatching routine
MJEC Interface
• Registration and entering dispatch mode: milEnterDispatchingMode((FUNC)foo, void *context)
• Post event: milPostEvent(id target, int event, int data)
• Dispatcher routine syntax: int foo(id origin, int event, int data, void *context)

Dispatch loop (from the flowchart): on entry, ret := func(INIT, context); while ret != EXIT, wait for a pending event, dequeue it, and set ret := func(event, context); once ret == EXIT, finish with ret := func(EXIT, context) and return to Execution Mode.
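A compact C sketch of that loop. The event record, the blocking queue helper, and the sentinel codes are illustrative assumptions; only the interface names and the INIT/EXIT protocol come from the slide.

    /* Sketch of the MJEC dispatch loop reconstructed above. */
    typedef int id;
    typedef int (*FUNC)(id origin, int event, int data, void *context);

    typedef struct { id origin; int event; int data; } mil_event;

    enum { INIT = -1, EXIT = -2 };                        /* assumed codes */
    enum { EXIT_DISPATCHER = 0, STAY_IN_DISPATCHER = 1 };

    extern mil_event mil_wait_for_event(void);   /* blocks: assumed helper */

    void milEnterDispatchingMode(FUNC func, void *context)
    {
        int ret = func(0, INIT, 0, context);      /* ret := func(INIT, context) */
        while (ret != EXIT_DISPATCHER) {
            mil_event e = mil_wait_for_event();   /* wait for a pending event */
            ret = func(e.origin, e.event, e.data, context);
        }
        func(0, EXIT, 0, context);                /* ret := func(EXIT, context) */
    }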
Experience with MJEC
• ParC: ~250 lines; SPLASH: ~120 lines
• Easy implementation of many synchronization methods: semaphores, locks, condition variables, barriers
• Implementation of location-dependent services (e.g., graphical display)
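For instance, a counting semaphore can be built in the same style as the barrier example on the next slides. This sketch is ours: SEM_P, SEM_V, GRANT, SEMSERV, and the queue helpers are assumed names, not part of the Millipede distribution.

    #include <stddef.h>

    typedef int id;
    enum { SEM_P = 1, SEM_V = 2, GRANT = 3 };             /* assumed event codes */
    enum { EXIT_DISPATCHER = 0, STAY_IN_DISPATCHER = 1 };

    typedef struct id_queue id_queue;                     /* assumed helpers */
    extern void milPostEvent(id target, int event, int data);
    extern void milEnterDispatchingMode(int (*f)(id, int, int, void *), void *ctx);
    extern void enqueue(id_queue *q, id j);
    extern id   dequeue(id_queue *q);
    extern int  empty(id_queue *q);
    extern id   SEMSERV;                 /* id of the semaphore-server job */

    typedef struct { int count; id_queue *waiters; } sem_ctx;

    /* Server-side dispatcher: serializes P/V requests through one job. */
    int semaphore_server(id src, int event, int data, void *context)
    {
        sem_ctx *s = (sem_ctx *)context;
        if (event == SEM_P) {                             /* acquire request */
            if (s->count > 0) { s->count--; milPostEvent(src, GRANT, 0); }
            else enqueue(s->waiters, src);
        } else if (event == SEM_V) {                      /* release */
            if (!empty(s->waiters)) milPostEvent(dequeue(s->waiters), GRANT, 0);
            else s->count++;
        }
        return STAY_IN_DISPATCHER;
    }

    /* Client side: block until the server grants the semaphore. */
    int wait_for_grant(id src, int event, int data, void *ctx)
    {
        return event == GRANT ? EXIT_DISPATCHER : STAY_IN_DISPATCHER;
    }

    void sem_wait(void)
    {
        milPostEvent(SEMSERV, SEM_P, 0);
        milEnterDispatchingMode(wait_for_grant, NULL);
    }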
Example - Barriers with MJEC

Job side:

    Barrier() {
        milPostEvent(BARSERV, ARR, 0);
        milEnterDispatchingMode(wait_in_barrier, 0);
    }

    wait_in_barrier(src, event, context) {
        if (event == DEP) return EXIT_DISPATCHER;
        else return STAY_IN_DISPATCHER;
    }

[Diagram: jobs calling BARRIER(...) post ARR events to the Barrier Server's dispatcher.]
Example - Barriers with MJEC (cont'd)

Server side:

    BarrierServer() {
        milEnterDispatchingMode(barrier_server, info);
    }

    barrier_server(src, event, context) {
        if (event == ARR) enqueue(context.queue, src);
        if (should_release(context))
            while (context.cnt > 0) {
                milPostEvent(dequeue(context.queue), DEP, 0);
                context.cnt--;
            }
        return STAY_IN_DISPATCHER;
    }

[Diagram: the Barrier Server's dispatcher posts DEP events back to the waiting jobs.]
Dynamic Page- and Job-Migration • Migration may occur in case of: • Remote memory access • Load imbalance • User comes back from lunch • Improving locality by location rearrangement • Sometimes migration should be disabled • by system: ping-pong, critical section • by programmer: control system
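A hypothetical sketch of programmer-controlled migration locking. milDisableMigration()/milEnableMigration() are assumed names; the slide only states that migration can be disabled by the programmer (e.g., in a control system) or by the system itself.

    extern void milDisableMigration(void);  /* hypothetical */
    extern void milEnableMigration(void);   /* hypothetical */
    extern int  running;
    extern void read_sensors(void);
    extern void update_actuators(void);

    void control_loop(void)
    {
        milDisableMigration();      /* pin this job to its current host */
        while (running) {
            read_sensors();
            update_actuators();     /* latency-sensitive: must stay put */
        }
        milEnableMigration();       /* the job may migrate again */
    }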
Locality of memory reference is THE dominant efficiency factor. Migration can help locality:
[Diagrams: only job migration; only page migration; page & job migration.]
Load Sharing + Max. Locality = Minimum-Weight Multiway Cut
[Diagram: two multiway cuts of the same thread/page graph among hosts p, q, r.]
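For reference, the underlying optimization problem in its standard formulation (our notation, not the slide's): threads and memory pages form the nodes of a weighted graph, edge weights record access frequencies, and the k hosts act as terminals t_1, ..., t_k. The goal is the partition

$$\min_{V = V_1 \uplus \cdots \uplus V_k,\; t_i \in V_i} \;\sum_{\substack{(u,v) \in E \\ u \in V_i,\ v \in V_j,\ i \neq j}} w(u,v)$$

so that every thread and page is assigned to exactly one host and the total weight of cross-host references (the cut) is minimized.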
Problems with the multiway cut model
• NP-hard for #cuts > 2, and we have n > X,000,000; polynomial 2-approximations are known
• Not optimized for load balancing
• Page replicas
• The graph changes dynamically
• Only external accesses are recorded ===> only partial information is available
Our Approach
[Diagram: a job's history of remote accesses: page 1, page 2, page 1, page 0.]
• Record the history of remote accesses
• Use this information when making decisions concerning load balancing/load sharing
• Save old information to avoid repeating bad decisions (learn from mistakes)
• Detect and solve ping-pong situations
• Do everything by piggybacking on communication that is taking place anyway
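A sketch of the kind of record such a history might keep. The struct layout and field names are our illustration, not Millipede's internal format; the slides only say that remote accesses are recorded and old entries are saved.

    /* Illustrative access-history record, piggybacked on DSM traffic. */
    #include <time.h>

    #define HIST_LEN 64                 /* assumed bounded history length */

    typedef struct {
        unsigned page;                  /* page that was accessed remotely */
        unsigned host;                  /* host the access came from       */
        time_t   when;                  /* timestamp of the access         */
    } access_record;

    typedef struct {
        access_record ring[HIST_LEN];   /* ring buffer of recent accesses  */
        int head, count;
    } access_history;

    void record_access(access_history *h, unsigned page, unsigned host)
    {
        access_record *r = &h->ring[h->head];
        r->page = page; r->host = host; r->when = time(NULL);
        h->head = (h->head + 1) % HIST_LEN;
        if (h->count < HIST_LEN) h->count++;
    }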
Ping Pong
Detection (local):
1. Local threads attempt to use the page a short time after it leaves the local host
2. The page leaves the host shortly after arrival
Treatment (by the ping-pong server):
• Collect information regarding all participating hosts and threads
• Try to locate an underloaded target host
• Stabilize the system by locking-in pages/threads
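A sketch of the two local tests described above; the threshold and record fields are assumptions for illustration, not Millipede's implementation.

    #include <stdbool.h>
    #include <time.h>

    #define PP_WINDOW 1   /* seconds; assumed detection-sensitivity threshold */

    typedef struct {
        time_t arrived;          /* when the page last arrived here   */
        time_t departed;         /* when the page last left this host */
    } page_state;

    /* Test 1: a local thread faults on the page soon after it left. */
    bool faulted_soon_after_departure(const page_state *p, time_t now)
    {
        return now - p->departed <= PP_WINDOW;
    }

    /* Test 2: the page is about to leave shortly after it arrived. */
    bool leaving_soon_after_arrival(const page_state *p, time_t now)
    {
        return now - p->arrived <= PP_WINDOW;
    }

    /* If either test fires, the host would notify the ping-pong server,
     * which gathers the participants and locks pages/threads in place
     * to stabilize the system. */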
Optimization: TSP, Effect of Locality (15 cities, Bare Millipede)
[Graph: execution time (sec, 0-4000) vs. number of hosts (1-6) for the NO-FS, OPTIMIZED-FS, and FS configurations.]
In the NO-FS case false sharing is avoided by aligning all allocations to page size. In the other two cases each page is used by 2 threads: in FS no optimizations are used, and in OPTIMIZED-FS the history mechanism is enabled.
TSP on 6 hosts; k is the number of threads falsely sharing a page.

    k   optimized?   # DSM-related msgs   # ping-pong treatment msgs   # thread migrations   execution time (sec)
    2   Yes                5100                  290                           68                    645
    2   No               176120                    0                           23                   1020
    3   Yes                4080                  279                           87                    620
    3   No               160460                    0                           32                   1514
    4   Yes                5060                  343                           99                    690
    4   No               155540                    0                           44                   1515
    5   Yes                6160                  443                          139                    700
    5   No               162505                    0                           55                   1442
Ping Pong Detection Sensitivity
[Graphs: execution time (sec) vs. detection sensitivity (2-20) for two TSP runs, TSP-1 and TSP-2.]
TSP-1: best results are achieved at maximal sensitivity, since all pages are accessed frequently.
TSP-2: since some pages are accessed frequently and others only occasionally, maximal sensitivity causes unnecessary ping-pong treatment and significantly increases execution time.
Applications • Numerical computations: Multigrid • Model checking: BDDs • Compute-intensive graphics: Ray-Tracing, Radiosity • Games, Search trees, Pruning, Tracking, CFD ...
Performance Evaluation (system parameters and open questions)
• L: underloaded; H: overloaded
• Delta (ms): lock-in time
• t/o delta: polling (MGS, DSM)
• msg delta: system pages delta
• T_epoch: max history time
• ???: remove old histories vs. refresh old histories
• L_epoch: histories length
• Page histories vs. job histories
• Migration heuristic: which function?
• Ping-pong: what is the initial noise? at what frequency is it ping-pong?
LU Decomposition 1024x1024 matrix written in SPLASH: Performance improvements when there are few threads on each host
LU Decomposition, 2048x2048 matrix, written in SPLASH: super-linear speedups due to the caching effect.
Jacobi Relaxation 512x512 matrix (using 2 matrices, no false sharing) written in ParC
Overhead of ParC/Millipede on a single host, tested with the Tracking algorithm:
Info...
http://www.cs.technion.ac.il/Labs/Millipede
millipede@cs.technion.ac.il
A release is available at the Millipede site!