The Millipede Project at the Technion, Israel, develops a robust Virtual Parallel Machine (VPM) that utilizes non-dedicated distributed environments, such as clusters of PCs and symmetric multiprocessors (SMPs). It supports multi-threaded parallel programming paradigms and provides shared memory, dynamic job migration, and load sharing to optimize resource utilization. Millipede gives programmers an intuitive abstraction for implementing parallel applications, with efficient communication and synchronization through mechanisms such as the Millipede Job Event Control (MJEC).
The MILLIPEDE Project, Technion, Israel
A Windows-NT based Distributed Virtual Parallel Machine
http://www.cs.technion.ac.il/Labs/Millipede
What is Millipede?
A strong Virtual Parallel Machine: it employs non-dedicated distributed environments.
[Layer diagram: Programs run on an Implementation of Parallel Programming Languages, which runs on the Distributed Environment.]
Programming Paradigms: SPLASH, Cilk/Calipso, CC++, Java, CParPar, ParC, ParFortran90, "Bare Millipede", others
Millipede Layer: Events Mechanism (MJEC), Migration Services (MGS), Distributed Shared Memory (DSM), user-mode threads
Communication Packages: U-Net, Transis, Horus, ...
Operating System Services: Communication, Threads, Page Protection, I/O
So, what's in a VPM?
Check list:
• Using a non-dedicated cluster of PCs (+ SMPs)
• Multi-threaded
• Shared memory
• User-mode
• Strong support for weak memory
• Dynamic page- and job-migration
• Load sharing for maximal locality of reference
• Convergence to optimal level of parallelism
Using a non-dedicated cluster
• Dynamically identify idle machines
• Move work to idle machines
• Evacuate busy machines
• Do everything transparently to the native user
• Co-existence of several parallel applications
Multi-Threaded Environments
Well known:
• Better utilization of resources
• An intuitive and high level of abstraction
• Latency hiding by overlapping computation and communication
• Natural for parallel programming paradigms & environments
• Programmer-defined maximum level of parallelism
• Actual level of parallelism set dynamically; applications scale up and down
• Nested parallelism
• SMPs
Convergence to Optimal Speedup
The tradeoff: a higher level of parallelism vs. better locality of memory reference.
• Optimal speedup is not necessarily achieved with the maximal number of computers
• The achieved level of parallelism depends on the program's needs and on the capabilities of the system
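To make the tradeoff concrete, here is a simple illustrative cost model (ours, not from the slides): with total work W spread over p hosts and a remote-reference overhead C(p) that grows as locality degrades,

$$T(p) = \frac{W}{p} + C(p), \qquad S(p) = \frac{T(1)}{T(p)},$$

so the speedup S(p) peaks at the p where the marginal gain from adding a host no longer exceeds the extra communication cost; that optimum typically lies below the maximal number of machines.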
No/Explicit/Implicit Access Shared Memory

PVM (no shared memory, message passing):

    /* Receive data from master */
    msgtype = 0;
    pvm_recv(-1, msgtype);
    pvm_upkint(&nproc, 1, 1);
    pvm_upkint(tids, nproc, 1);
    pvm_upkint(&n, 1, 1);
    pvm_upkfloat(data, n, 1);
    /* Determine which slave I am (0..nproc-1) */
    for (i = 0; i < nproc; i++)
        if (mytid == tids[i]) { me = i; break; }
    /* Do calculations with data */
    result = work(me, n, data, tids, nproc);
    /* Send result to master */
    pvm_initsend(PvmDataDefault);
    pvm_pkint(&me, 1, 1);
    pvm_pkfloat(&result, 1, 1);
    msgtype = 5;
    master = pvm_parent();
    pvm_send(master, msgtype);
    /* Exit PVM before stopping */
    pvm_exit();

C-Linda (explicit access to shared tuple space):

    /* Retrieve data from DSM */
    rd("init data", ?nproc, ?n, ?data);
    /* Worker id is given at creation, no need to compute it now */
    /* Do calculation, put result in DSM */
    out("result", id, work(id, n, data, nproc));

"Bare" Millipede (implicit access through DSM):

    result = work(milGetMyId(), n, data, milGetTotalIds());
Relaxed Consistency (avoiding false sharing and ping-pong of page copies)
• Sequential, CRUW, Sync(var), Arbitrary-CW: multiple relaxations for different shared variables within the same program
• No broadcast, no central address servers (so it can work efficiently on interconnected LANs)
• New protocols welcome (user defined?!)
• Step-by-step optimization towards maximal parallelism
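A minimal sketch of how per-variable relaxation might look to the programmer. The function names milAlloc/milSetConsistency and the MIL_* constants are assumptions for illustration, not the documented Millipede API; only the four protocol names come from the slide.

    /* Illustrative only: selecting a consistency protocol per
     * shared variable.  All names below except the protocol list
     * are hypothetical. */
    #include <stddef.h>

    typedef enum {
        MIL_SEQUENTIAL,   /* strongest ordering (default)           */
        MIL_CRUW,         /* CRUW relaxation, as named on the slide */
        MIL_SYNC_VAR,     /* consistency tied to a sync variable    */
        MIL_ARBITRARY_CW  /* arbitrary concurrent writes            */
    } mil_protocol;

    extern void *milAlloc(size_t size);                         /* hypothetical */
    extern void  milSetConsistency(void *addr, mil_protocol p); /* hypothetical */

    double *global;   /* e.g., the Global structure of the LU example */

    void setup(void)
    {
        global = milAlloc(1024 * sizeof(double));
        /* Relax only this variable; the rest of the DSM stays sequential. */
        milSetConsistency(global, MIL_ARBITRARY_CW);
    }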
LU Decomposition, 1024x1024 matrix, written in SPLASH: advantages gained when reducing the consistency of a single variable (the Global structure):
MJEC - Millipede Job Event Control
An open mechanism with which various synchronization methods can be implemented:
• A job has a unique system-wide id
• Jobs communicate and synchronize by sending events
• Although a job is mobile, its events follow it and reach its event queue wherever it goes
• Event handlers are context-sensitive
MJEC (cont'd)
Modes:
• In Execution-Mode: arriving events are enqueued
• In Dispatching-Mode: events are dequeued and handled by a user-supplied dispatching routine
MJEC Interface
• Registration and entering dispatch mode: milEnterDispatchingMode((FUNC)foo, void *context)
• Post event: milPostEvent(id target, int event, int data)
• Dispatcher routine syntax: int foo(id origin, int event, int data, void *context)

Dispatch loop (from the flowchart): on entry, ret := func(INIT, context); while ret != EXIT, wait for a pending event, dequeue it, and set ret := func(event, context); once ret == EXIT, finish with ret := func(EXIT, context) and return to Execution Mode.
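A compact C sketch of that loop. The event record, the blocking queue helper, and the sentinel codes are illustrative assumptions; only the interface names and the INIT/EXIT protocol come from the slide.

    /* Sketch of the MJEC dispatch loop reconstructed above. */
    typedef int id;
    typedef int (*FUNC)(id origin, int event, int data, void *context);

    typedef struct { id origin; int event; int data; } mil_event;

    enum { INIT = -1, EXIT = -2 };                        /* assumed codes */
    enum { EXIT_DISPATCHER = 0, STAY_IN_DISPATCHER = 1 };

    extern mil_event mil_wait_for_event(void);   /* blocks: assumed helper */

    void milEnterDispatchingMode(FUNC func, void *context)
    {
        int ret = func(0, INIT, 0, context);      /* ret := func(INIT, context) */
        while (ret != EXIT_DISPATCHER) {
            mil_event e = mil_wait_for_event();   /* wait for a pending event */
            ret = func(e.origin, e.event, e.data, context);
        }
        func(0, EXIT, 0, context);                /* ret := func(EXIT, context) */
    }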
Experience with MJEC
• ParC: ~250 lines; SPLASH: ~120 lines
• Easy implementation of many synchronization methods: semaphores, locks, condition variables, barriers
• Implementation of location-dependent services (e.g., graphical display)
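For instance, a counting semaphore can be built in the same style as the barrier example on the next slides. This sketch is ours: SEM_P, SEM_V, GRANT, SEMSERV, and the queue helpers are assumed names, not part of the Millipede distribution.

    #include <stddef.h>

    typedef int id;
    enum { SEM_P = 1, SEM_V = 2, GRANT = 3 };             /* assumed event codes */
    enum { EXIT_DISPATCHER = 0, STAY_IN_DISPATCHER = 1 };

    typedef struct id_queue id_queue;                     /* assumed helpers */
    extern void milPostEvent(id target, int event, int data);
    extern void milEnterDispatchingMode(int (*f)(id, int, int, void *), void *ctx);
    extern void enqueue(id_queue *q, id j);
    extern id   dequeue(id_queue *q);
    extern int  empty(id_queue *q);
    extern id   SEMSERV;                 /* id of the semaphore-server job */

    typedef struct { int count; id_queue *waiters; } sem_ctx;

    /* Server-side dispatcher: serializes P/V requests through one job. */
    int semaphore_server(id src, int event, int data, void *context)
    {
        sem_ctx *s = (sem_ctx *)context;
        if (event == SEM_P) {                             /* acquire request */
            if (s->count > 0) { s->count--; milPostEvent(src, GRANT, 0); }
            else enqueue(s->waiters, src);
        } else if (event == SEM_V) {                      /* release */
            if (!empty(s->waiters)) milPostEvent(dequeue(s->waiters), GRANT, 0);
            else s->count++;
        }
        return STAY_IN_DISPATCHER;
    }

    /* Client side: block until the server grants the semaphore. */
    int wait_for_grant(id src, int event, int data, void *ctx)
    {
        return event == GRANT ? EXIT_DISPATCHER : STAY_IN_DISPATCHER;
    }

    void sem_wait(void)
    {
        milPostEvent(SEMSERV, SEM_P, 0);
        milEnterDispatchingMode(wait_for_grant, NULL);
    }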
Example - Barriers with MJEC

Job side:

    Barrier() {
        milPostEvent(BARSERV, ARR, 0);
        milEnterDispatchingMode(wait_in_barrier, 0);
    }

    wait_in_barrier(src, event, context) {
        if (event == DEP) return EXIT_DISPATCHER;
        else return STAY_IN_DISPATCHER;
    }

[Diagram: jobs calling BARRIER(...) post ARR events to the Barrier Server's dispatcher.]
Example - Barriers with MJEC (cont'd)

Server side:

    BarrierServer() {
        milEnterDispatchingMode(barrier_server, info);
    }

    barrier_server(src, event, context) {
        if (event == ARR) enqueue(context.queue, src);
        if (should_release(context))
            while (context.cnt > 0) {
                milPostEvent(dequeue(context.queue), DEP, 0);
                context.cnt--;
            }
        return STAY_IN_DISPATCHER;
    }

[Diagram: the Barrier Server's dispatcher posts DEP events back to the waiting jobs.]
Dynamic Page- and Job-Migration • Migration may occur in case of: • Remote memory access • Load imbalance • User comes back from lunch • Improving locality by location rearrangement • Sometimes migration should be disabled • by system: ping-pong, critical section • by programmer: control system
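A hypothetical sketch of programmer-controlled migration locking. milDisableMigration()/milEnableMigration() are assumed names; the slide only states that migration can be disabled by the programmer (e.g., in a control system) or by the system itself.

    extern void milDisableMigration(void);  /* hypothetical */
    extern void milEnableMigration(void);   /* hypothetical */
    extern int  running;
    extern void read_sensors(void);
    extern void update_actuators(void);

    void control_loop(void)
    {
        milDisableMigration();      /* pin this job to its current host */
        while (running) {
            read_sensors();
            update_actuators();     /* latency-sensitive: must stay put */
        }
        milEnableMigration();       /* the job may migrate again */
    }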
Locality of memory reference is THE dominant efficiency factor. Migration can help locality:
[Diagrams: only job migration; only page migration; page & job migration.]
Load Sharing + Max. Locality = Minimum-Weight Multiway Cut
[Diagram: two multiway cuts of the same thread/page graph among hosts p, q, r.]
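For reference, the underlying optimization problem in its standard formulation (our notation, not the slide's): threads and memory pages form the nodes of a weighted graph, edge weights record access frequencies, and the k hosts act as terminals t_1, ..., t_k. The goal is the partition

$$\min_{V = V_1 \uplus \cdots \uplus V_k,\; t_i \in V_i} \;\sum_{\substack{(u,v) \in E \\ u \in V_i,\ v \in V_j,\ i \neq j}} w(u,v)$$

so that every thread and page is assigned to exactly one host and the total weight of cross-host references (the cut) is minimized.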
Problems with the multiway cut model
• NP-hard for #cuts > 2, and we have n > X,000,000; polynomial 2-approximations are known
• Not optimized for load balancing
• Page replicas
• The graph changes dynamically
• Only external accesses are recorded ===> only partial information is available
Our Approach
[Diagram: a job's history of remote accesses: page 1, page 2, page 1, page 0.]
• Record the history of remote accesses
• Use this information when making decisions concerning load balancing/load sharing
• Save old information to avoid repeating bad decisions (learn from mistakes)
• Detect and solve ping-pong situations
• Do everything by piggybacking on communication that is taking place anyway
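A sketch of the kind of record such a history might keep. The struct layout and field names are our illustration, not Millipede's internal format; the slides only say that remote accesses are recorded and old entries are saved.

    /* Illustrative access-history record, piggybacked on DSM traffic. */
    #include <time.h>

    #define HIST_LEN 64                 /* assumed bounded history length */

    typedef struct {
        unsigned page;                  /* page that was accessed remotely */
        unsigned host;                  /* host the access came from       */
        time_t   when;                  /* timestamp of the access         */
    } access_record;

    typedef struct {
        access_record ring[HIST_LEN];   /* ring buffer of recent accesses  */
        int head, count;
    } access_history;

    void record_access(access_history *h, unsigned page, unsigned host)
    {
        access_record *r = &h->ring[h->head];
        r->page = page; r->host = host; r->when = time(NULL);
        h->head = (h->head + 1) % HIST_LEN;
        if (h->count < HIST_LEN) h->count++;
    }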
Ping Pong
Detection (local):
1. Local threads attempt to use the page a short time after it leaves the local host
2. The page leaves the host shortly after arrival
Treatment (by the ping-pong server):
• Collect information regarding all participating hosts and threads
• Try to locate an underloaded target host
• Stabilize the system by locking-in pages/threads
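A sketch of the two local tests described above; the threshold and record fields are assumptions for illustration, not Millipede's implementation.

    #include <stdbool.h>
    #include <time.h>

    #define PP_WINDOW 1   /* seconds; assumed detection-sensitivity threshold */

    typedef struct {
        time_t arrived;          /* when the page last arrived here   */
        time_t departed;         /* when the page last left this host */
    } page_state;

    /* Test 1: a local thread faults on the page soon after it left. */
    bool faulted_soon_after_departure(const page_state *p, time_t now)
    {
        return now - p->departed <= PP_WINDOW;
    }

    /* Test 2: the page is about to leave shortly after it arrived. */
    bool leaving_soon_after_arrival(const page_state *p, time_t now)
    {
        return now - p->arrived <= PP_WINDOW;
    }

    /* If either test fires, the host would notify the ping-pong server,
     * which gathers the participants and locks pages/threads in place
     * to stabilize the system. */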
Optimization: TSP, Effect of Locality (15 cities, Bare Millipede)
[Graph: execution time (sec, 0-4000) vs. number of hosts (1-6) for the NO-FS, OPTIMIZED-FS, and FS configurations.]
In the NO-FS case false sharing is avoided by aligning all allocations to page size. In the other two cases each page is used by 2 threads: in FS no optimizations are used, and in OPTIMIZED-FS the history mechanism is enabled.
TSP on 6 hosts; k is the number of threads falsely sharing a page.

    k   optimized?   # DSM-related msgs   # ping-pong treatment msgs   # thread migrations   execution time (sec)
    2   Yes                5100                  290                           68                    645
    2   No               176120                    0                           23                   1020
    3   Yes                4080                  279                           87                    620
    3   No               160460                    0                           32                   1514
    4   Yes                5060                  343                           99                    690
    4   No               155540                    0                           44                   1515
    5   Yes                6160                  443                          139                    700
    5   No               162505                    0                           55                   1442
Ping Pong Detection Sensitivity
[Graphs: execution time (sec) vs. detection sensitivity (2-20) for two TSP runs, TSP-1 and TSP-2.]
TSP-1: best results are achieved at maximal sensitivity, since all pages are accessed frequently.
TSP-2: since some pages are accessed frequently and others only occasionally, maximal sensitivity causes unnecessary ping-pong treatment and significantly increases execution time.
Applications • Numerical computations: Multigrid • Model checking: BDDs • Compute-intensive graphics: Ray-Tracing, Radiosity • Games, Search trees, Pruning, Tracking, CFD ...
Performance Evaluation (system parameters and open questions)
• L: underloaded; H: overloaded
• Delta (ms): lock-in time
• t/o delta: polling (MGS, DSM)
• msg delta: system pages delta
• T_epoch: max history time
• ???: remove old histories vs. refresh old histories
• L_epoch: histories length
• Page histories vs. job histories
• Migration heuristic: which function?
• Ping-pong: what is the initial noise? at what frequency is it ping-pong?
LU Decomposition 1024x1024 matrix written in SPLASH: Performance improvements when there are few threads on each host
LU Decomposition, 2048x2048 matrix, written in SPLASH: super-linear speedups due to the caching effect.
Jacobi Relaxation 512x512 matrix (using 2 matrices, no false sharing) written in ParC
Overhead of ParC/Millipede on a single host, tested with the Tracking algorithm:
Info...
http://www.cs.technion.ac.il/Labs/Millipede
millipede@cs.technion.ac.il
A release is available at the Millipede site!