
Executing Parallel Programs with Potential Bottlenecks Efficiently






Presentation Transcript


  1. Executing Parallel Programs with Potential Bottlenecks Efficiently
  University of Tokyo: Yoshihiro Oyama, Kenjiro Taura (visiting UCSD), Akinori Yonezawa

  2. Programs We Consider
  • Programs that update shared data frequently with mutex operations (e.g., synchronized methods in Java)
  • Context: implementation of concurrent OO languages on SMPs and DSM machines
  [Figure: many exclusive methods concurrently updating one bottleneck object, e.g., a counter]

  3. Amdahl’s Law
  • 90% of the work can execute in parallel; 10% must execute sequentially (the bottleneck)
  int foo(…) {
    int x = 0, y = 0;
    parallel for (…) { ... }
    lock(); printf(…); unlock();   /* sequential bottleneck */
    parallel for (…) { c[i] = 0; }
    parallel for (…) { baz(5); }
    return x * 2 + y;
  }
  • You expect a 10-times speedup, but 10 times is the most Amdahl’s law allows. Can you really gain even that?

  4. Speedup Curves for Programs with Bottlenecks
  [Figure: real vs. ideal speedup curves over # of PEs]
  • “Excessive” processors may be used!
    ∵ It is difficult to predict dynamic behavior
    ∵ Different phases need different numbers of PEs

  5. Preliminary Experiments using a Simple Counter Program in C
  • Solaris threads & Sun Ultra Enterprise 10000
  • Each processor increments a shared counter in parallel
  • The execution time did not remain constant; it increased dramatically as processors were added.

  6. Goal
  • Efficient execution of programs with bottlenecks
  • Focus: synchronization of methods
  • Bring the time to execute a whole program in parallel closer to the time to execute only the bottlenecks sequentially
  [Figure: bottleneck parts vs. other parts of execution time, on 1 PE and on 50 PEs]

  7. What Problem Should We Solve?
  [Figure: naïve vs. ideal implementation, each split into bottleneck parts and other parts, on 1 PE and 50 PEs]
  • Stop the increase of the time consumed in bottlenecks!

  8. Put It in Prof. Ito’s Terminology!
  • He aims at keeping the PP/M ≧ SP/S property
  • Our work aims at keeping the PP/M ≧ PP/S property: performance on 100 PEs should be higher than that on 1 PE!

  9. Presentation Overview
  • Examples of potential bottlenecks
  • Two naïve schemes and their problems
    • Local-based execution
    • Owner-based execution
  • Our scheme
    • Detachment of requests
    • Priority mechanism using compare-and-swap
    • Two compile-time optimizations
  • Performance evaluation & related work

  10. Examples of Potential Bottleneck Objects
  • Objects introduced to easily reuse MT-unsafe functions in an MT environment
    • Abstract I/O objects
  • Stubs in distributed systems
    • One stub conducts all communications in a site
  • Shared global variables
    • e.g., counters collecting statistics information
  • It is sometimes difficult to eliminate them.

  11. Local-based Execution (e.g., Implementation with Spin-locks)
  [Figure: several methods directly accessing one object’s instance variables]
  • Each PE executes methods by itself → each PE references/updates the object by itself
  • Advantage: no need to move “computation”
  • Disadvantage: cache misses when referencing the object (due to invalidation/update of cache lines by other processors)

  12. Confirmation of Overhead in Local-based Execution
  • C program on Sun Ultra Enterprise 10000
  • The overhead of referencing/updating the object:
    • increases as PEs are added
    • occupies 1/3 of the whole execution time on 60 PEs

  13. Owner-based Execution
  • Owner = the processor currently holding an object’s lock
  • Request = a data structure containing method info
  • Owner present → a non-owner creates and inserts a request
  • Owner absent → the processor becomes the owner and executes the method
  [Figure: one owner and several non-owners around the object]

  14. Owner-based Execution with Simple Blocking Locks
  • Requests are dequeued one by one, with auxiliary locks
  • One processor likely executes multiple methods consecutively
  [Figure: request queue in front of the object’s instance variables]

  15. Advantages/Disadvantages of Owner-based Execution
  • Advantage: fewer cache misses in referencing the object
  • Disadvantages (focusing on the owner’s execution, which typically gives the critical path):
    • overhead to move “computation”
    • synchronization operations for the queue
    • waiting time to manipulate the queue
    • cache misses in reading requests
  • Can they be reduced?

  16. Overview of Our Scheme
  • Improve simple blocking locks:
    • Detach requests → reduce the frequency of mutex operations
    • Give high priority to the owner → reduce the time required to take control of requests
    • Prefetch requests → reduce cache misses in reading requests
  • Our scheme is realized implicitly by the compiler and runtime of a concurrent object-oriented language

  17. Data Structures
  • Requests are managed with a list
  • A 1-word pointer area (lock area) is added to each object
  • Non-owner: creates and inserts a request
  • Owner: picks requests out and executes them
  [Figure: object with a list of requests hanging off its lock area]

  18. Design Policy
  • Battle in a bottleneck: 1 owner vs. 99 non-owners. We should help him!
  • The owner’s behavior determines the critical path
  • We make the owner’s execution fast, above all
  • We allow non-owners’ execution to be slow

  19. Non-owners Inserting a Request
  [Figure: non-owners X, Y, Z inserting requests A, B, C into the object’s list]

  20. Non-owners Inserting a Request
  • Update the list head with compare-and-swap
  • Retry if interrupted
  [Figure: as before]

  21. Non-owners Inserting a Request ♪
  • Update the list head with compare-and-swap; retry if interrupted
  • Non-owners repeat the loop until they succeed
  [Figure: as before]

  22. Owner Detaching Requests
  • Important:
    • The whole list is detached at once
    • The update with swap always succeeds → the owner is never interrupted by other processors
  [Figure: owner detaching the list A, B, C from the object]

  23. Owner Detaching Requests ♪
  • Important:
    • The whole list is detached at once
    • The update with swap always succeeds → the owner is never interrupted by other processors

  24. Owner Executing Requests
  • The detached requests (A, B, C) are executed in turn without mutex operations
  • Meanwhile, non-owners (X, Z) insert new requests without disturbing the owner
  1. No synchronization operations by the owner

  25. Giving Higher Priority to the Owner
  • Insertion by a non-owner (compare-and-swap): may fail many times
  • Detachment by the owner (swap): always succeeds in constant steps
  2. The owner never spins to manipulate requests

  26. Compile-time Optimization (1/2)
  • Prefetch requests:
    ...
    while (req != NULL) {
      PREFETCH(req->next);  /* the next request is prefetched while this one is processed */
      EXECUTE(req);
      req = req->next;
    }
    ...
  3. Reduce cache misses in reading requests

  27. Compile-time Optimization (2/2)
  • Caching instance variables in registers
  • Safe because non-owners do not reference/update the object while the detached requests are processed
  • Two versions of code are provided for one method:
    • Code to process requests: uses instance variables on memory
    • Code to execute a method directly: uses instance variables in registers (IVs passed in registers)

  28. Achieving Similar Effects in Low-level Languages (e.g., in C)
  • “Always spin-lock” approach
    • Wastes CPU cycles and memory bandwidth
    • Risks deadlocks
  • “Find bottlenecks → rewrite code” approach
    • Implements owner-based execution only in bottlenecks
    • Harder than the “support in a high-level language” approach:
      • implementing owner-based execution by hand is troublesome
      • bottlenecks appear dynamically in some programs

  29. Experimental Results (1/2)

  30. Experimental Results (2/2)

  31. Interesting Results using a Simple Counter Program in C
  • With simple blocking locks, waiting time was the largest overhead: 70% of the owner’s whole execution time
  • Our scheme is efficient also on a uniprocessor (execution time):
    • Spin-locks: 641 msec
    • Simple blocking locks: 1025 msec
    • Our scheme: 810 msec

  32. Related Work (1/3): execution of methods invoked in parallel
  • ICC++ [Chien et al. 96]: detects nonexclusive methods through static analysis
  • Concurrent Aggregates [Chien 91]: realizes interleaving through explicit programming
  • Cooperative Technique [Barnes 93]: a PE entering a critical section later “helps” its predecessors
  • These works focus on exposing parallelism among nonexclusive operations; none remarks on performance loss in bottlenecks

  33. Related Work (2/3): efficient spin-locks under contention
  • MCS Lock [Mellor-Crummey et al. 91]: provides a separate spin area for each processor
  • Exponential Backoff [Anderson 90]: a heuristic to “withdraw” processors that failed to acquire the lock; needs some skill to determine parameters
  • These locks give local-based execution → low locality in referencing bottleneck objects

  34. Related Work (3/3): efficient Java monitors
  • Bimodal object-locking [Onodera et al. 98], Thin Locks [Bacon et al. 98]
    • Affected our low-level implementation
    • Use unoptimized “fat locks” for contended objects
  • Meta-locks [Agesen et al. 99]
    • A clever technique similar to MCS locks
    • No busy-waiting even for contended objects
  • Their primary concern is the uncontended case; they do not take locality of object references into account

  35. Summary
  • Serious performance loss in existing schemes:
    • spin-locks: low locality of object references
    • blocking locks: overhead on the contended request queue
  • Very fast execution on contended objects through highly optimized owner-based execution
  • Excellent performance: several times faster than the simple schemes!

  36. Future Work
  • Solving the problem of large memory use in some cases
    • A long list of requests may be formed
    • The problem is common to owner-based schemes; this work focused on time-efficiency, not space-efficiency
    • Simple solution: when memory used for requests ≧ some threshold ⇒ dynamically switch to local-based execution
  • Increasing/decreasing PEs according to execution status
    • The system automatically decides the “best” number of PEs for each program point
    • This eliminates the very existence of excessive processors

  37. The slides from here on are shown during the question period

  38. More Detailed Measurements using a Counter Program in C
  • Solaris threads & Sun Ultra Enterprise 10000
  • Each processor increments a shared counter

  39. No Guarantee of FIFO Order
  • A method invoked later may be executed earlier
  • Simple solution: “reverse” the detached requests
  • Better solutions?
    • Can we use a queue instead of a list?
    • Are 64-bit compare-and-swap/swap necessary?
