
Cooperative Parallelization



Presentation Transcript


  1. Cooperative Parallelization Praveen Yedlapalli, Emre Kultursay, Mahmut Kandemir The Pennsylvania State University

  2. Outline • Motivation • Introduction • Cooperative Parallelization • Programmer’s Input • Evaluation • Conclusion

  3. Motivation • Program parallelization is a difficult task • Automatic parallelization helps in parallelizing sequential applications • Most parallelizing techniques focus on array-based applications • Limited support for parallelizing pointer-intensive applications

  4. Example

     /* Tree Traversal */
     void traverse_tree (Tree *tree) {
       if (tree->left)  traverse_tree(tree->left);
       if (tree->right) traverse_tree(tree->right);
       process(tree);
     }

     /* List Traversal */
     void traverse_list (List *list) {
       List *node = list;
       while (node != NULL) {
         process(node);
         node = node->next;
       }
     }

  5. Introduction • Program parallelization is a two-fold problem • First problem: finding where parallelism is available in the application, if any • Second problem: deciding how to efficiently exploit the available parallelism

  6. Finding Parallelism • Use static analysis to perform dependence checking and identify independent parts of the program • Target regular structures like arrays and for loops • Pointer-intensive code cannot be analyzed accurately with static analysis

  7. Pointer Problem • Pointer-intensive applications typically have • Data structures built from the input • while loops that traverse those data structures • Without points-to information and without loop counts, there is very little we can do at compile time

  8. Exploiting Parallelism • In array-based applications with for loops, sets of iterations are distributed to different threads (a sketch follows below) • In pointer-intensive applications, information about the data structure is needed to run the parallel code
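
As a point of reference for the for-loop case, a minimal sketch of block-distributing array iterations among pthreads; the array size, thread count, and per-element work are illustrative, not taken from the slides.

     #include <pthread.h>

     #define N        1000000
     #define NTHREADS 4

     static double data[N];

     typedef struct { int lo, hi; } range_t;

     /* Each application thread processes one contiguous block of iterations. */
     static void *worker(void *arg) {
         range_t *r = (range_t *) arg;
         for (int i = r->lo; i < r->hi; i++)
             data[i] *= 2.0;               /* stand-in for the real per-element work */
         return NULL;
     }

     static void parallel_for(void) {
         pthread_t tid[NTHREADS];
         range_t   rng[NTHREADS];
         int chunk = (N + NTHREADS - 1) / NTHREADS;

         for (int t = 0; t < NTHREADS; t++) {
             rng[t].lo = t * chunk;
             rng[t].hi = (t + 1) * chunk < N ? (t + 1) * chunk : N;
             pthread_create(&tid[t], NULL, worker, &rng[t]);
         }
         for (int t = 0; t < NTHREADS; t++)
             pthread_join(tid[t], NULL);
     }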

  9. Programmer’s Input • The programmer has a high-level view of the program and can give hints about it • Hints can indicate things like • Whether a loop can be parallelized • Whether function calls are independent • The structure of the working data • All of this information is vital for program parallelization

  10. Application Runtime Information • To efficiently exploit parallelism in pointer-intensive applications we need runtime information • Size and shape of the data structure (dependent on the input) • Points-to information • Using the points-to information, we determine the work distribution

  11. Cooperative Parallelization [Diagram: the programmer (hints), the compiler, and the runtime system cooperate to turn a sequential program into a parallel program]

  12. Cooperative Parallelization • Cooperation between the programmer, the compiler, and the runtime system to identify and efficiently exploit parallelism in pointer-intensive applications • The task of identifying parallelism in the code is delegated to the programmer • The runtime system is responsible for monitoring the program and efficiently executing the parallel code

  13. Application Characteristics • Pointer-intensive applications • A data structure is built from the input • The data structure is traversed several times and nodes are processed • The operations on nodes are typically independent • This fact can be obtained from the programmer as a hint

  14. Tree Example

     int perimeter (QuadTree tree, int size) {
       int retval = 0;
       if (tree->color == grey) {          /* node has children */
         retval += perimeter(tree->nw, size/2);
         retval += perimeter(tree->ne, size/2);
         retval += perimeter(tree->sw, size/2);
         retval += perimeter(tree->se, size/2);
       } else if (tree->color == black) {
         ... /* do something on the node */
       }
       return retval;
     }

     Function from the perimeter benchmark (a quadtree with nw, ne, sw, and se subtrees)

  15. List Example

     void compute_node (node_t *nodelist) {
       int i;
       while (nodelist != NULL) {
         for (i = 0; i < nodelist->from_count; i++) {
           node_t *other_node = nodelist->from_nodes[i];
           double coeff = nodelist->coeffs[i];
           double value = other_node->value;
           nodelist->value -= coeff * value;
         }
         nodelist = nodelist->next;
       }
     }

     Function from the em3d benchmark (a list viewed as sublist 1 … sublist n)

  16. Runtime System • Processing of different parts of the data structure (subproblems) can be done in parallel • The runtime needs access to multiple subproblems • The task of finding these subproblems in the data structure is done by a helper thread

  17. Helper Thread • The helper thread goes over the data structure and finds multiple independent subproblems • The helper thread doesn’t need to traverse the whole data structure to find the subproblems • Using a separate thread for finding the subproblems reduces the overhead
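
A minimal sketch of what the helper thread could do for the linked-list case: walk the list once and record every cut-th node as the head of a subproblem, stopping once enough subproblems are available. node_t is the type from the em3d example; the array size and cut length are illustrative assumptions. An application thread assigned subproblem j would then process nodes from subproblem[j] up to (but not including) subproblem[j+1].

     #define MAX_SUBPROBLEMS 64

     /* Filled in by the helper thread: the head node of each subproblem. */
     static node_t *subproblem[MAX_SUBPROBLEMS];
     static int     num_subproblems;

     /* Helper thread body: record every cut-th node as a subproblem boundary.
      * The helper does no processing itself and may stop early once enough
      * subproblems have been found for the application threads. */
     static void find_subproblems(node_t *head, int cut) {
         int i = 0;
         num_subproblems = 0;
         for (node_t *node = head;
              node != NULL && num_subproblems < MAX_SUBPROBLEMS;
              node = node->next, i++) {
             if (i % cut == 0)
                 subproblem[num_subproblems++] = node;
         }
     }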

  18. Approach [Diagram: sequential execution runs the loop in a single thread; parallel execution uses a helper thread to find subproblems and multiple application threads to run the loop on them]

  19. Code Structure

     helper thread:
       wait for signal from main thread
       find subproblems in the data structure
       signal main thread

     application thread:
       wait for signal from main thread
       work on the subproblems assigned to this thread
       signal main thread

     main thread:
       signal helper thread when data structure is ready
       wait for signal from helper thread
       distribute subproblems to application threads
       signal application threads
       wait for signal from application threads
       merge results from all the application threads
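
One way the signalling above could be realized with POSIX semaphores. This is only a sketch, since the slides do not specify the runtime's actual synchronization mechanism; thread creation and the find/process/merge steps are left as comments.

     #include <pthread.h>
     #include <semaphore.h>

     #define NTHREADS 4

     static sem_t helper_go, helper_done;
     static sem_t worker_go[NTHREADS], worker_done[NTHREADS];

     static void init_sync(void) {             /* all semaphores start at 0 */
         sem_init(&helper_go, 0, 0);
         sem_init(&helper_done, 0, 0);
         for (int t = 0; t < NTHREADS; t++) {
             sem_init(&worker_go[t], 0, 0);
             sem_init(&worker_done[t], 0, 0);
         }
     }

     static void *helper_main(void *arg) {
         sem_wait(&helper_go);                 /* data structure is ready */
         /* find subproblems in the data structure, e.g. find_subproblems(...) */
         sem_post(&helper_done);
         return NULL;
     }

     static void *worker_main(void *arg) {
         int id = *(int *) arg;
         sem_wait(&worker_go[id]);             /* subproblems have been assigned */
         /* work on the subproblems assigned to thread id */
         sem_post(&worker_done[id]);
         return NULL;
     }

     /* Main thread: run once the data structure has been built. */
     static void run_parallel_region(void) {
         sem_post(&helper_go);                 /* signal helper thread             */
         sem_wait(&helper_done);               /* wait until subproblems are known */
         /* distribute subproblems to application threads here */
         for (int t = 0; t < NTHREADS; t++) sem_post(&worker_go[t]);   /* signal workers */
         for (int t = 0; t < NTHREADS; t++) sem_wait(&worker_done[t]); /* wait for them  */
         /* merge results from all the application threads here */
     }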

  20. Profitability • The runtime information collected is used to determine the profitability of parallelization • This decision can be driven by the programmer using a hint • The program is parallelized only if the data structure is “big” enough
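
A sketch of how such a profitability check could look for the list case: count nodes only up to the threshold from the hint, and fall back to the original sequential function if the structure is too small. THRESHOLD and compute_node_parallel are illustrative names, not from the slides.

     #define THRESHOLD 4096   /* would come from the programmer's hint */

     /* Count nodes, stopping as soon as the threshold is reached; for a small
      * structure the sequential version is cheaper than parallelization. */
     static int big_enough(node_t *head, int threshold) {
         int count = 0;
         for (node_t *n = head; n != NULL && count < threshold; n = n->next)
             count++;
         return count >= threshold;
     }

     void compute_node_dispatch(node_t *list) {
         if (big_enough(list, THRESHOLD))
             compute_node_parallel(list);   /* helper + application threads */
         else
             compute_node(list);            /* original sequential function */
     }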

  21. Programmer Hints • Interface between the programmer and the compiler • Should be simple to use with minimal essential information

     #parallel tree  function (threads) (degree) (struct) {children} threshold [reduction]
     #parallel llist function (threads) (struct) (next_node) threshold [number]
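
Applying the two hint forms above to the earlier examples might look like the following; the concrete values (thread count, degree, thresholds, the '+' reduction) are illustrative, since the slides only give the grammar.

     /* Quadtree: 4 threads, branching degree 4, children nw/ne/sw/se,
      * parallelize only if the tree is at least 1000 nodes, reduce retval with '+'. */
     #parallel tree perimeter (4) (4) (QuadTree) {nw, ne, sw, se} 1000 [+]
     int perimeter (QuadTree tree, int size) { ... }

     /* Linked list: 4 threads, node type node_t, next-node field 'next',
      * parallelize only if the list has at least 5000 nodes. */
     #parallel llist compute_node (4) (node_t) (next) 5000
     void compute_node (node_t *nodelist) { ... }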

  22. Automation • Implemented a source-to-source translator • Modified the C language grammar to understand the hints [Diagram: a parser generator takes the modified C grammar and produces the translator, which turns a C program with hints into a parallel program]

  23. Experimental Setup All benchmarks except otter are from the Olden suite

  24. Evaluation [Speedup chart; 15x speedup]

  25. Overheads • The helper thread can be invoked before the main thread reaches the computation, overlapping the overhead of finding the subproblems • The helper thread in general traverses only part of the data structure and takes much less time than the original function

  26. Comparison to OpenMP • OpenMP 3.0 supports task parallelism • Directives can be added to the code to parallelize while loops and recursive functions • OpenMP tasks don’t take application runtime information into consideration • Tasks tend to be fine-grained • Significant performance overhead
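
For contrast, an OpenMP 3.0 task version of the slide-4 list traversal would look roughly like this (List and process() are from that example); every node becomes its own fine-grained task, which is where the overhead mentioned above comes from.

     void traverse_list_omp(List *list) {
         #pragma omp parallel
         #pragma omp single
         {
             for (List *node = list; node != NULL; node = node->next) {
                 #pragma omp task firstprivate(node)
                 process(node);            /* one task per list node */
             }
         }
     }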

  27. Related Work • Speculative parallelization can help in parallelizing programs that are difficult to analyze • That comes at the cost of executing instructions that might not be useful • Power and performance overhead • Our approach is a non-speculative form of parallelization

  28. Conclusion • Traditional parallelization techniques cannot efficiently parallelize pointer-intensive codes • By combining the programmer’s knowledge and application runtime information, we can exploit parallelism in such codes • The idea presented is not limited to trees and linked lists and can be extended to other dynamic structures like graphs

  29. Thank You. Questions?
