
Windows-NT based Distributed Virtual Parallel Machine




  1. The MILLIPEDE Project, Technion, Israel
  Windows-NT based Distributed Virtual Parallel Machine
  http://www.cs.technion.ac.il/Labs/Millipede

  2. What is Millipede? A strong Virtual Parallel Machine: employs non-dedicated distributed environments. The slide shows the stack:
  • Programs
  • Implementation of Parallel Programming Languages
  • Distributed Environment

  3. Programming Paradigms The slide shows the software stack, top to bottom:
  • Parallel programming paradigms: SPLASH, Cilk/Calipso, CC++, Java, CParPar, ParC, ParFortran90, “Bare Millipede”, and others
  • Millipede Layer: Events Mechanism (MJEC), Migration Services (MGS), Distributed Shared Memory (DSM)
  • Communication packages: U-Net, Transis, Horus, …
  • Operating system services: communication, threads, page protection, I/O
  • Software packages: user-mode threads

  4. So, what’s in a VPM? Check list:
  • Using a non-dedicated cluster of PCs (+ SMPs)
  • Multi-threaded
  • Shared memory
  • User-mode
  • Strong support for weak memory
  • Dynamic page- and job-migration
  • Load sharing for maximal locality of reference
  • Convergence to optimal level of parallelism
  (Each item is stamped “Millipede inside” on the slide.)

  5. Using a non-dedicated cluster
  • Dynamically identify idle machines
  • Move work to idle machines
  • Evacuate busy machines
  • Do everything transparently to the native user
  • Co-existence of several parallel applications
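
A minimal sketch of the recruit/evacuate policy this slide implies. It is illustrative only: probe_cpu_load() stands in for whatever the host OS exposes (e.g., NT performance counters), and the threshold is an assumption, not Millipede's actual daemon logic.

    #include <stdio.h>

    #define IDLE_THRESHOLD 0.10   /* assumed: below 10% CPU means "idle" */

    /* Hypothetical probe; a real implementation would query the OS. */
    static double probe_cpu_load(void) { return 0.05; }

    typedef enum { RECRUIT, EVACUATE, STAY } action_t;

    /* Decide what the daemon on this host should do next. */
    static action_t poll_host(int hosting_parallel_jobs)
    {
        double load = probe_cpu_load();
        if (load > IDLE_THRESHOLD && hosting_parallel_jobs)
            return EVACUATE;   /* native user is back: move work away */
        if (load <= IDLE_THRESHOLD && !hosting_parallel_jobs)
            return RECRUIT;    /* machine went idle: move work here   */
        return STAY;
    }

    int main(void)
    {
        printf("action = %d\n", poll_host(0));  /* idle and empty: RECRUIT */
        return 0;
    }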

  6. Multi-Threaded Environments Well known benefits:
  • Better utilization of resources
  • An intuitive and high level of abstraction
  • Latency hiding by overlapping computation and communication
  • Natural for parallel programming paradigms & environments
  • Programmer-defined maximal level of parallelism
  • Actual level of parallelism set dynamically: applications scale up and down
  • Nested parallelism
  • SMPs

  7. Convergence to Optimal Speedup The tradeoff: a higher level of parallelism vs. better locality of memory reference.
  • Optimal speedup is not necessarily achieved with the maximal number of computers
  • The achieved level of parallelism depends on the program's needs and on the capabilities of the system
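
The slide's point can be made concrete with a toy cost model (an illustration, not Millipede's actual heuristic): if compute time shrinks as W/p on p hosts while remote-reference overhead grows as c*p, total time T(p) = W/p + c*p is minimized at p* = sqrt(W/c), typically well below the cluster size.

    #include <math.h>
    #include <stdio.h>

    /* Toy cost model (assumed): T(p) = W/p + c*p; dT/dp = 0 at sqrt(W/c). */
    int main(void)
    {
        double W = 4000.0;   /* total work, arbitrary units          */
        double c = 10.0;     /* per-host communication cost, assumed */
        printf("optimal hosts ~ %.0f\n", sqrt(W / c));  /* 20, not "all" */
        return 0;
    }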

  8. No/Explicit/Implicit Access to Shared Memory The slide contrasts the same worker written three ways.

  PVM (explicit message passing):
      /* Receive data from master */
      msgtype = 0;
      pvm_recv(-1, msgtype);
      pvm_upkint(&nproc, 1, 1);
      pvm_upkint(tids, nproc, 1);
      pvm_upkint(&n, 1, 1);
      pvm_upkfloat(data, n, 1);
      /* Determine which slave I am (0..nproc-1) */
      for (i = 0; i < nproc; i++)
          if (mytid == tids[i]) { me = i; break; }
      /* Do calculations with data */
      result = work(me, n, data, tids, nproc);
      /* Send result to master */
      pvm_initsend(PvmDataDefault);
      pvm_pkint(&me, 1, 1);
      pvm_pkfloat(&result, 1, 1);
      msgtype = 5;
      master = pvm_parent();
      pvm_send(master, msgtype);
      /* Exit PVM before stopping */
      pvm_exit();

  C-Linda (explicit tuple-space access):
      /* Retrieve data from DSM */
      rd("init data", ?nproc, ?n, ?data);
      /* Worker id is given at creation, no need to compute it now */
      /* Do calculation, put result in DSM */
      out("result", id, work(id, n, data, nproc));

  “Bare” Millipede (implicit shared-memory access):
      result = work(milGetMyId(), n, data, milGetTotalIds());

  9. Relaxed Consistency (avoiding false sharing and ping-pong)
  • Protocols: Sequential, CRUW, Sync(var), Arbitrary-CW (the slide illustrates the page copies each allows)
  • Multiple relaxations for different shared variables within the same program
  • No broadcast, no central address servers (so it can work efficiently in interconnected LANs)
  • New protocols welcome (user defined?!)
  • Step-by-step optimization towards maximal parallelism
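
What "multiple relaxations for different shared variables" might look like at the source level. The two calls below are invented for illustration and stubbed so the sketch compiles; the slides name the protocols but not the API.

    #include <stdlib.h>

    typedef enum { SEQUENTIAL, CRUW, SYNC_VAR, ARBITRARY_CW } consistency_t;

    /* Hypothetical calls; real ones would register the region with DSM. */
    static void *milAlloc(size_t size) { return malloc(size); }
    static void milSetConsistency(void *a, consistency_t c) { (void)a; (void)c; }

    void setup_shared_state(void)
    {
        /* Read-mostly lookup table: relax it to CRUW. */
        int *lookup = milAlloc(4096);
        milSetConsistency(lookup, CRUW);

        /* Heavily contended counter: keep it sequentially consistent. */
        int *counter = milAlloc(sizeof *counter);
        milSetConsistency(counter, SEQUENTIAL);
    }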

  10. LU Decomposition 1024x1024 matrix, written in SPLASH: advantages gained when reducing the consistency of a single variable (the Global structure).

  11. MJEC - Millipede Job Event Control An open mechanism with which various synchronization methods can be implemented.
  • A job has a unique systemwide id
  • Jobs communicate and synchronize by sending events
  • Although a job is mobile, its events follow it and reach its events queue wherever it goes
  • Event handlers are context-sensitive

  12. MJEC (cont'd) Modes:
  • In Execution Mode, arriving events are enqueued
  • In Dispatching Mode, events are dequeued and handled by a user-supplied dispatching routine

  13. MJEC Interface
  Registration and entering dispatch mode:
      milEnterDispatchingMode((FUNC)foo, void *context)
  Post event:
      milPostEvent(id target, int event, int data)
  Dispatcher routine syntax:
      int foo(id origin, int event, int data, void *context)
  The slide's flowchart of dispatching mode: first ret := func(INIT, context); then, while ret != EXIT, wait for a pending event, dequeue it, and set ret := func(event, context); once ret == EXIT, finish with ret := func(EXIT, context).
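
The flowchart reads naturally as the loop below. The argument list is abbreviated to (event, context) as in the flowchart itself (the full dispatcher signature also carries origin and data), and wait_for_event is a stub for the event queue.

    #define INIT (-1)   /* pseudo-events from the flowchart */
    #define EXIT (-2)

    typedef int (*FUNC)(int event, void *context);

    static int wait_for_event(void) { return 0; }  /* stub: block, dequeue */

    void milEnterDispatchingMode(FUNC func, void *context)
    {
        int ret = func(INIT, context);     /* entering dispatching mode */
        while (ret != EXIT) {
            int event = wait_for_event();  /* wait until an event pends */
            ret = func(event, context);    /* user-supplied dispatcher  */
        }
        func(EXIT, context);               /* leaving dispatching mode  */
    }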

  14. Experience with MJEC
  • ParC: ~250 lines; SPLASH: ~120 lines
  • Easy implementation of many synchronization methods: semaphores, locks, condition variables, barriers (a lock sketch follows below; the barrier example on the next two slides is from the deck itself)
  • Implementation of location-dependent services (e.g., a graphical display)
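
For instance, a lock can be built on MJEC the same way as the barrier on the next two slides: one server job owns the lock state and clients synchronize with it through events. This is a hedged sketch; ACQ, REL, GRANT, LOCKSERV, the queue fields, and the return codes are assumed names, not the Millipede API.

    /* Event ids and dispatcher return codes, assumed for this sketch. */
    enum { ACQ, REL, GRANT };
    enum { STAY_IN_DISPATCHER, EXIT_DISPATCHER };

    extern void milPostEvent(int target, int event, int data);

    typedef struct { int waiters[64]; int head, tail, held; } lock_ctx;

    /* Dispatcher routine run by the lock-server job. */
    int lock_server(int src, int event, int data, lock_ctx *ctx)
    {
        (void)data;
        if (event == ACQ) {
            if (!ctx->held) { ctx->held = 1; milPostEvent(src, GRANT, 0); }
            else ctx->waiters[ctx->tail++ % 64] = src;  /* wait in line */
        } else if (event == REL) {
            if (ctx->head != ctx->tail)                 /* hand it over */
                milPostEvent(ctx->waiters[ctx->head++ % 64], GRANT, 0);
            else
                ctx->held = 0;
        }
        return STAY_IN_DISPATCHER;
    }

    /* Client side, mirroring Barrier() on the next slide:
     *   Lock()   { milPostEvent(LOCKSERV, ACQ, 0);
     *              milEnterDispatchingMode(wait_for_grant, 0); }
     *   Unlock() { milPostEvent(LOCKSERV, REL, 0); }
     */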

  15. Example - Barriers with MJEC
      Barrier() {
          milPostEvent(BARSERV, ARR, 0);
          milEnterDispatchingMode(wait_in_barrier, 0);
      }
      wait_in_barrier(src, event, context) {
          if (event == DEP) return EXIT_DISPATCHER;
          else return STAY_IN_DISPATCHER;
      }
  (The slide's picture: each job posts an ARR event to the barrier server, then waits in its dispatcher until a DEP event releases it.)

  16. Example - Barriers with MJEC (cont'd)
      BarrierServer() {
          milEnterDispatchingMode(barrier_server, info);
      }
      barrier_server(src, event, context) {
          if (event == ARR) enqueue(context.queue, src);
          if (should_release(context))
              while (context.cnt > 0)
                  milPostEvent(dequeue(context.queue), DEP, 0);
          return STAY_IN_DISPATCHER;
      }
  (The server enqueues arriving jobs; once all have arrived, it posts DEP to release each waiting job.)

  17. Dynamic Page- and Job-Migration
  • Migration may occur in case of: remote memory access; load imbalance; the user coming back from lunch; improving locality by location rearrangement
  • Sometimes migration should be disabled: by the system (ping-pong, critical section) or by the programmer (control system; see the sketch below)
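
The slides do not show the programmer-facing calls for disabling migration, so the two functions below are hypothetical; the point is only that a latency-critical section, such as a control loop, can pin itself to its current host.

    /* Hypothetical calls: the slide says migration can be disabled by
     * the programmer but does not name the API. */
    extern void milDisableMigration(void);
    extern void milEnableMigration(void);

    void control_step(void)
    {
        milDisableMigration();   /* latency-critical: stay on this host */
        /* ... sample sensors, compute the control law, actuate ...     */
        milEnableMigration();
    }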

  18. Locality of memory reference is THE dominant efficiency factor. Migration can help locality.
  (Charts on the slide compare three regimes: only job migration, only page migration, and page & job migration.)

  19. Load Sharing + Maximal Locality = Minimum-Weight Multiway Cut
  (The slide shows the access graph partitioned across hosts p, q, and r, before and after rearrangement.)

  20. Problems with the multiway cut model
  • NP-hard for #cuts > 2, and we have n > X,000,000; polynomial 2-approximations are known
  • Not optimized for load balancing
  • Page replicas
  • The graph changes dynamically
  • Only external accesses are recorded ===> only partial information is available

  21. Our Approach
  • Record the history of remote accesses (the slide sketches per-page access records)
  • Use this information when making decisions concerning load balancing/load sharing
  • Save old information to avoid repeating bad decisions (learn from mistakes)
  • Detect and solve ping-pong situations
  • Do everything by piggybacking on communication that is taking place anyway (see the sketch below)
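
One way to picture the history mechanism. The record layout and ring-buffer policy here are assumptions; only the idea (harvest access records from messages that are sent anyway, consult them when migrating) is taken from the slide.

    /* Illustrative access-history store; field layout is assumed. */
    typedef struct {
        unsigned page;       /* page that was touched            */
        unsigned host;       /* host where the access originated */
        unsigned thread;     /* accessing thread                 */
        unsigned long when;  /* logical timestamp                */
    } access_rec;

    #define HIST 256
    static access_rec history[HIST];
    static unsigned hist_len;

    /* Called on DSM message arrival: harvest the piggybacked record. */
    void record_access(const access_rec *r)
    {
        history[hist_len++ % HIST] = *r;  /* bounded ring: old data ages out */
    }

    /* A migration heuristic can then ask: who touches this page most? */
    unsigned dominant_host(unsigned page)
    {
        unsigned count[64] = {0}, best = 0, i;
        unsigned n = hist_len < HIST ? hist_len : HIST;
        for (i = 0; i < n; i++)
            if (history[i].page == page)
                count[history[i].host % 64]++;
        for (i = 1; i < 64; i++)
            if (count[i] > count[best]) best = i;
        return best;
    }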

  22. Ping Pong
  Detection (local):
  1. Local threads attempt to use the page a short time after it leaves the local host
  2. The page leaves the host shortly after arrival
  Treatment (by the ping-pong server):
  • Collect information regarding all participating hosts and threads
  • Try to locate an underloaded target host
  • Stabilize the system by locking-in pages/threads
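
The two local triggers translate directly into a residence-time test; the millisecond thresholds below are placeholders for whatever the system actually tunes.

    /* Sketch of the slide's two local triggers; thresholds assumed. */
    #define MIN_ABSENCE_MS   50   /* placeholder */
    #define MIN_RESIDENCE_MS 50   /* placeholder */

    typedef struct {
        unsigned long arrived_ms;  /* when the page last arrived here   */
        unsigned long left_ms;     /* when the page last left this host */
    } page_times;

    /* Trigger 1: a local thread faults on the page soon after it left. */
    int faulted_too_soon(const page_times *t, unsigned long now_ms)
    {
        return now_ms - t->left_ms < MIN_ABSENCE_MS;
    }

    /* Trigger 2: the page is about to leave shortly after it arrived. */
    int leaving_too_soon(const page_times *t, unsigned long now_ms)
    {
        return now_ms - t->arrived_ms < MIN_RESIDENCE_MS;
    }

    /* Either trigger notifies the ping-pong server, which gathers the
     * participants, looks for an underloaded host, and locks pages or
     * threads in place to stabilize the system. */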

  23. Optimization: TSP - Effect of Locality (15 cities, Bare Millipede)
  (Chart: execution time in seconds, 0-4000, vs. number of hosts, 1-6, for NO-FS, OPTIMIZED-FS, and FS.)
  In the NO-FS case false sharing is avoided by aligning all allocations to page size. In the other two cases each page is used by 2 threads: in FS no optimizations are used, and in OPTIMIZED-FS the history mechanism is enabled.

  24. TSP on 6 hosts (k = number of threads falsely sharing a page)

      k   optimized?   # DSM-related   # ping-pong          # thread     execution
                       messages        treatment messages   migrations   time (sec)
      2   Yes            5100          290                   68            645
      2   No           176120            0                   23           1020
      3   Yes            4080          279                   87            620
      3   No           160460            0                   32           1514
      4   Yes            5060          343                   99            690
      4   No           155540            0                   44           1515
      5   Yes            6160          443                  139            700
      5   No           162505            0                   55           1442

  25. Ping Pong Detection Sensitivity
  (Chart TSP-1: execution time in seconds, 0-1000, over sensitivity values 2-20.) Best results are achieved at maximal sensitivity, since all pages are accessed frequently.
  (Chart TSP-2: execution time in seconds, 0-1100, over sensitivity values 2-20.) Since part of the pages are accessed frequently and part only occasionally, maximal sensitivity causes unnecessary ping-pong treatment and significantly increases execution time.

  26. Applications
  • Numerical computations: Multigrid
  • Model checking: BDDs
  • Compute-intensive graphics: Ray-Tracing, Radiosity
  • Games, search trees, pruning, tracking, CFD, ...

  27. Performance Evaluation (tuning parameters and open questions)
  • L - underloaded, H - overloaded
  • Delta(ms) - lock-in time
  • t/o delta - polling (MGS, DSM)
  • msg delta - system pages delta
  • T_epoch - max history time; ??? - remove old histories / refresh old histories
  • L_epoch - history length
  • page histories vs. job histories
  • migration heuristic - which function?
  • ping-pong - what is the initial noise? at what frequency is it ping-pong?

  28. LU Decomposition 1024x1024 matrix written in SPLASH: Performance improvements when there are few threads on each host

  29. LU Decomposition 2048x2048 matrix written in SPLASH: super-linear speedups due to the caching effect.

  30. Jacobi Relaxation 512x512 matrix (using 2 matrices, no false sharing) written in ParC

  31. Overhead of ParC/Millipede on a single host, tested with the Tracking algorithm:

  32. Info... http://www.cs.technion.ac.il/Labs/Millipede millipede@cs.technion.ac.il A release is available at the Millipede site!
