
Communication Optimizations for Parallel Computing Using Data Access Information

This talk discusses communication optimizations for parallel computing, focusing on replication, locality, broadcast, concurrent fetches, and latency hiding. The goal is to reduce communication overhead and improve performance.



Presentation Transcript


  1. Communication Optimizations for Parallel Computing Using Data Access Information Martin Rinard Department of Computer Science University of California, Santa Barbara martin@cs.ucsb.edu http://www.cs.ucsb.edu/~martin

  2. Motivation Communication Overhead Can Substantially Degrade the Performance of Parallel Computations

  3. Communication Optimizations Replication Locality Broadcast Concurrent Fetches Latency Hiding

  4. Applying Optimizations
  • Language Implementation (Automatically):
    • Reduces Programming Burden
    • No Portability Problems - Each Implementation Optimized for Current Hardware Platform
  • Programmer (By Hand):
    • Programming Burden
    • Portability Problems

  5. Key Questions • How does the implementation get the information it needs to apply the communication optimizations? • What communication optimization algorithms does the implementation use? • How well do the optimized computations perform?

  6. Goal of Talk Present Experience Automatically Applying Communication Optimizations in Jade

  7. Talk Outline • Jade Language • Message Passing Implementation • Communication Optimization Algorithms • Experimental Results on iPSC/860 • Shared Memory Implementation • Communication Optimization Algorithms • Experimental Results on Stanford DASH • Conclusion

  8. Jade • Portable, Implicitly Parallel Language • Data Access Information • Programmer starts with serial program • Uses Jade constructs to provide information about how parts of program access data • Jade Implementation Uses Data Access Information to Automatically • Extract Concurrency • Synchronize Computation • Apply Communication Optimizations

  9. Jade Concepts
  • Shared Objects
  • Tasks
  • Access Specifications
  withonly { rd(obj1); wr(obj2); } do (obj1, obj2) { computation that reads obj1 and writes obj2 }
  The withonly clause is the task's access specification; the do block is the task body.

  10.-21. Jade Example
  A sequence of animation frames (only partially recoverable from this transcript) steps through the execution of three tasks of the form withonly { rd(...); wr(...); } do (...) { ... }. The frames show each task's access specification being checked against the accesses of earlier tasks: tasks whose specifications declare only reads of the same object become enabled and execute concurrently, while a declared write serializes a task behind the earlier accesses to that object.

  22. Result • At Each Point in the Execution • A Collection of Enabled Tasks • Each Task Has an Access Specification • Jade Implementation • Exploits Information in Access Specifications to Apply Communication Optimizations
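The enabling rule behind the example can be made concrete. Below is a minimal Python sketch (the function and task names are illustrative, not the actual Jade runtime): a task is enabled once no earlier, uncompleted task conflicts with it, where two declared accesses to the same object conflict unless both are reads.

```python
# Sketch of Jade-style dependence detection from access specifications.
# Names and the data layout are illustrative, not the Jade implementation.

def conflicts(spec_a, spec_b):
    """Two access specifications conflict on an object unless
    both declared accesses to it are reads."""
    for obj_a, mode_a in spec_a:
        for obj_b, mode_b in spec_b:
            if obj_a == obj_b and (mode_a == 'wr' or mode_b == 'wr'):
                return True
    return False

def enabled_tasks(tasks):
    """tasks: list of (name, access_spec) in serial program order.
    A task is enabled when no earlier task conflicts with it."""
    enabled = []
    for i, (name, spec) in enumerate(tasks):
        if all(not conflicts(prev_spec, spec)
               for _, prev_spec in tasks[:i]):
            enabled.append(name)
    return enabled

tasks = [
    ('t1', [('p', 'wr'), ('q', 'rd')]),
    ('t2', [('p', 'rd')]),   # serialized behind t1's write of p
    ('t3', [('q', 'rd')]),   # only reads q: can run with t1
]
print(enabled_tasks(tasks))  # ['t1', 't3']
```

Once t1 completes, its specification drops out and t2 becomes enabled, which is the stepwise behavior the animation frames depict.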

  23. Message Passing Implementation • Model of Computation for Implementation • Implementation Overview • Communication Optimizations • Experimental Results for iPSC/860

  24. Model of Computation • Each Processor Has a Private Memory • Processors Communicate by Sending Messages through the Network [Figure: processors, each with a private memory, connected by a network]

  25. Implementation Overview Distributes Objects Across Memories

  26. Implementation Overview Assigns Enabled Tasks to Idle Processors

  27. Implementation Overview Transfers Objects to Accessing Processor: Replicates Objects that Task will Read

  28. Implementation Overview Transfers Objects to Accessing Processor: Migrates Objects that Task will Write

  29. Implementation Overview When all Remote Objects Arrive, Task Executes
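The transfer policy on slides 27-28 can be sketched as follows. This is a simplification in Python under my own naming; the real implementation moves object data in messages, which the sketch only models as updates to ownership and copy sets.

```python
# Sketch of the replicate-reads / migrate-writes policy of the
# message-passing implementation. Processor and object names are
# illustrative; real transfers are network messages.

def transfer_objects(task_spec, owner, copies):
    """task_spec: list of (obj, mode, dest_processor).
    owner: obj -> processor holding the current version.
    copies: obj -> set of processors with a valid copy."""
    for obj, mode, dest in task_spec:
        if mode == 'rd':
            # Replicate: dest gets a copy; existing copies stay valid,
            # so other readers can access the object concurrently.
            copies[obj].add(dest)
        else:  # 'wr'
            # Migrate: dest becomes the owner; other copies are
            # invalidated so the writer holds the only valid version.
            owner[obj] = dest
            copies[obj] = {dest}

owner = {'a': 0, 'b': 0}
copies = {'a': {0}, 'b': {0}}
transfer_objects([('a', 'rd', 1), ('b', 'wr', 2)], owner, copies)
print(owner, copies)
```

After the call, processor 1 holds a replica of object a alongside processor 0, while object b has migrated so that only the writer, processor 2, holds it.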

  30. Optimizations: Goal and Mechanism
  • Adaptive Broadcast - Goal: Parallelize Communication. Mechanism: Broadcast Each New Version of Widely Accessed Objects.
  • Replication - Goal: Enable Tasks to Concurrently Read Same Data. Mechanism: Replicate Data on Reading Processors.
  • Latency Hiding - Goal: Overlap Computation and Communication. Mechanism: Assign Multiple Enabled Tasks to Same Processor.
  • Concurrent Fetch - Goal: Parallelize Communication. Mechanism: Concurrently Transfer Remote Objects that Task will Access.
  • Locality - Goal: Eliminate Communication. Mechanism: Execute Tasks on Processors that have Locally Available Copies of Accessed Objects.

  31. Application-Based Evaluation • Water: Evaluates forces and potentials in a system of liquid water molecules • String: Computes a velocity model of the geology between two oil wells • Ocean: Simulates the role of eddy and boundary currents in influencing large-scale ocean movements • Panel Cholesky: Sparse Cholesky factorization algorithm

  32. Impact of Communication Optimizations
  [Table: rows Adaptive Broadcast, Replication, Latency Hiding, Concurrent Fetch; columns Water, String, Ocean, Panel Cholesky; individual entries not recoverable from this transcript. Key: + Significant Impact, - Negligible Impact; a further marking denotes optimizations Required To Expose Concurrency.]

  33. Locality Optimization • Integrated into Online Scheduler • Scheduler • Maintains Pool of Enabled Tasks • Maintains Pool of Idle Processors • Balances Load by Assigning Enabled Tasks to Idle Processors • Locality Algorithm Affects the Assignment

  34. Locality Concepts
  • Each Object has an Owner: the last processor to write the object. The owner has a current copy of the object.
  • Each Task has a Locality Object: currently the first object in its access specification.
  • The Locality Object Determines the Target Processor: the owner of the locality object.
  • Goal: Execute each task on its target processor.

  35. When Task Becomes Enabled
  • Scheduler Checks Pool of Idle Processors
  • If the Target Processor is Idle, the Target Processor Gets the Task
  • Else if Some Other Processor is Idle, that Processor Gets the Task
  • If No Processor is Idle, the Task is Held in the Pool of Enabled Tasks

  36. When Processor Becomes Idle
  • Scheduler Checks Pool of Enabled Tasks
  • If the Processor is the Target of an Enabled Task, it Gets That Task
  • Else if Other Enabled Tasks Exist, it Gets One of Those Tasks
  • If There Are No Enabled Tasks, the Processor Stays Idle
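The two assignment rules on slides 35-36 amount to preferring the target processor on both sides of the scheduler. A Python sketch under my own naming (the actual scheduler is part of the Jade runtime, whose internals this does not reproduce):

```python
# Sketch of locality-aware assignment in the message-passing scheduler.
# Function and field names are illustrative, not from the Jade runtime.

def assign_enabled_task(task, target, idle, enabled_pool):
    """When a task becomes enabled: prefer its target processor,
    fall back to any idle processor, else hold the task in the pool.
    Returns the chosen processor, or None if the task was held."""
    if target in idle:
        idle.remove(target)
        return target
    if idle:
        return idle.pop()
    enabled_pool.append(task)
    return None

def assign_idle_processor(proc, enabled_pool, targets):
    """When a processor becomes idle: prefer an enabled task that
    targets it, fall back to any enabled task, else stay idle."""
    for task in enabled_pool:
        if targets[task] == proc:
            enabled_pool.remove(task)
            return task
    if enabled_pool:
        return enabled_pool.pop(0)
    return None

targets = {'t1': 1, 't2': 2}
pool = ['t1', 't2']
print(assign_idle_processor(2, pool, targets))  # 't2': targeted task wins
print(assign_idle_processor(3, pool, targets))  # 't1': any task beats idling
```

Note that both rules sacrifice locality before they sacrifice load balance: a non-target processor still gets work rather than sit idle, which matches the figures where fewer than 100% of tasks run on their target processor.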

  37. Implementation Versions • Locality: Implementation uses the Locality Algorithm • No Locality: First Come, First Served Assignment of Enabled Tasks to Idle Processors • Task Placement (Ocean and Panel Cholesky): Programmer assigns tasks to processors

  38. [Figure: Percentage of Tasks Executed on Target Processor on iPSC/860, for water, string, ocean, and panel cholesky; curves compare task placement, locality, and no locality; x-axis: 0-32 processors, y-axis: 0-100%]

  39. [Figure: Communication to Useful Computation Ratio on iPSC/860 (Mbytes/Second/Processor), for water, string, ocean, and panel cholesky; curves compare no locality, locality, and task placement; x-axis: 0-32 processors]

  40. [Figure: Speedup on iPSC/860, for water, string, ocean, and panel cholesky; curves compare task placement, locality, and no locality; x-axis: 0-32 processors]

  41. Shared Memory Implementation • Model of Computation • Locality Optimization • Locality Performance Results

  42. Model of Computation • Single Shared Memory Composed of Memory Modules • Each Memory Module Associated with a Processor • Each Object Allocated in a Memory Module • Processors Communicate by Reading and Writing Objects in the Shared Memory [Figure: processors and their memory modules forming a shared memory that holds the objects]

  43. Locality Algorithm • Integrated into Online Scheduler • Scheduler Runs Distributed Task Queue • Each Processor Has a Queue of Enabled Tasks • Idle Processors Search Task Queues • Locality Algorithm Affects Task Queue Algorithm

  44. Locality Concepts
  • Each Object has an Owner: the processor associated with the memory module that holds the object. Accesses to the object from this processor are satisfied from the local memory module.
  • Each Task has a Locality Object: currently the first object in its access specification.
  • The Locality Object Determines the Target Processor: the owner of the locality object.
  • Goal: Execute each task on its target processor.

  45. When Processor Becomes Idle
  • If Its Own Task Queue is not Empty, Execute the First Task in that Queue
  • Otherwise, Cyclically Search the Other Task Queues
  • If a Remote Task Queue is not Empty, Execute the Last Task in that Queue

  46. When Task Becomes Enabled
  • The Locality Algorithm Inserts the Task into the Task Queue at the Owner of Its Locality Object
  • Tasks with the Same Locality Object are Adjacent in the Queue
  • Goals:
    • Enhance memory locality by executing each task on the owner of its locality object.
    • Enhance cache locality by executing tasks with the same locality object consecutively on the same processor.
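Slides 45-46 together describe a locality-aware distributed task queue with stealing. A Python sketch under the same rules (the queue layout and names are mine, not the Jade runtime's):

```python
from collections import deque

# Sketch of the shared-memory locality scheduler: enabled tasks go to
# the queue of their locality object's owner; idle processors take from
# the front of their own queue and steal from the back of remote queues.
# Names are illustrative, not from the Jade runtime.

def enqueue_enabled(task, locality_owner, queues):
    """Insert an enabled task at the owner of its locality object,
    so tasks with the same locality object end up adjacent."""
    queues[locality_owner].append(task)

def next_task(proc, queues):
    """Own queue first (front), then cyclically search the other
    queues, stealing from the back to disturb their locality least."""
    if queues[proc]:
        return queues[proc].popleft()
    n = len(queues)
    for i in range(1, n):
        victim = (proc + i) % n
        if queues[victim]:
            return queues[victim].pop()  # steal the last task
    return None  # stay idle

queues = [deque() for _ in range(3)]
enqueue_enabled('t1', 1, queues)
enqueue_enabled('t2', 1, queues)
print(next_task(1, queues))  # 't1' (local, front of own queue)
print(next_task(0, queues))  # 't2' (stolen from the back of queue 1)
```

Stealing from the back is the natural reading of "execute the last task" on slide 45: the front of a queue holds the tasks most likely to be cache-warm on the owner, so a thief takes from the other end.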

  47. Evaluation • Same Set of Applications • Water • String • Ocean • Panel Cholesky • Same Locality Versions • Locality • No Locality (Single Task Queue) • Explicit Task Placement (Ocean and Panel Cholesky)

  48. [Figure: Percentage of Tasks Executed on Target Processor on DASH, for water, string, ocean, and panel cholesky; curves compare task placement, locality, and no locality; x-axis: 0-32 processors, y-axis: 0-100%]

  49. [Figure: Task Execution Time on DASH, for water, string, ocean, and panel cholesky; curves compare no locality, locality, and task placement; x-axis: 0-32 processors]

  50. [Figure: Speedup on DASH, for water, string, ocean, and panel cholesky; curves compare task placement, locality, and no locality; x-axis: 0-32 processors]
