This presentation implements and evaluates a language with fine-grain thread creation on a shared-memory parallel computer, examining its effectiveness, scalability, and speed for irregular parallel applications. Case studies on two applications, RNA and CKY, compare solutions with and without fine-grain threads, giving insight into program description cost and scalability. The talk closes with a detailed evaluation of performance and thread management, and conclusions on the experimental results and the overheads identified.
An Implementation and Performance Evaluation of Language with Fine-Grain Thread Creation on Shared Memory Parallel Computer
Yoshihiro Oyama, Kenjiro Taura, Toshio Endo, Akinori Yonezawa
Department of Information Science, Faculty of Science, University of Tokyo
Background
• "Irregular" parallel applications
  • tasks are not identified until runtime
  • synchronization structure is complicated
• Languages with fine-grain threads
  • a promising approach to handle the complexity
Motivation
Q: Are fine-grain threads really effective?
• Easy to describe irregular parallelism?
• Scalable?
• Fast?
Many sophisticated designs and implementation techniques have been proposed so far, but case studies that answer this question are few.
Goal
• Case study to better understand the effectiveness of fine-grain threads
• Approach without fine-grain threads (C + Solaris threads) vs. approach with fine-grain threads (our language Schematic)
• Compared in terms of:
  • program description cost
  • speed on 1 PE
  • scalability on a 64-PE SMP
Overview • Applications ( RNA & CKY ) • Solutions without fine-grain threads • Solutions with fine-grain threads • Performance evaluation
Case Study 1: RNA (protein secondary structure prediction)
• Problem: find the path with the largest weight that satisfies a certain condition
• Algorithm: simple node traversal + pruning over an unbalanced tree
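The traversal-plus-pruning scheme can be sketched as a sequential C routine. This is a minimal sketch, not the authors' code: `node_t`, the weight fields, and the optimistic `bound` argument are all invented for illustration.

```c
/* Hypothetical tree node; the actual RNA search state is more complex. */
typedef struct node {
    int weight;                 /* weight contributed by this node      */
    struct node *children[4];
    int nchildren;
} node_t;

static int best;                /* best total weight found so far       */

/* Depth-first traversal with pruning: a subtree is skipped when even an
 * optimistic bound on its remaining weight cannot beat the current best. */
static void search_node(node_t *n, int acc, int bound)
{
    acc += n->weight;
    if (acc + bound <= best)    /* prune: cannot improve on best        */
        return;
    if (n->nchildren == 0) {    /* leaf: a complete path                */
        if (acc > best)
            best = acc;
        return;
    }
    for (int i = 0; i < n->nchildren; i++)
        search_node(n->children[i], acc, bound);
}
```

Because the tree is unbalanced and pruning cuts off subtrees unpredictably, the work per subtree varies widely, which is what makes parallelizing this traversal irregular.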
Case Study 2: CKY (context-free grammar parser)
• Example input: "She is a girl whose mother is a teacher."
• Parse results are kept in a triangular matrix (actual size ≈ 100)
• Calculating a matrix element depends on previously computed elements
• Calculation time varies significantly from element to element
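The CKY recurrence behind these dependencies can be sketched sequentially as follows. This is a toy illustration, not the authors' parser: `chart`, `rule_t`, the bitmask encoding of nonterminals, and the size limits are all made up, and the lexical cells `chart[i][i+1]` are assumed to be seeded by the caller.

```c
#define N  8           /* max sentence length (toy)                 */

/* A binary rule A -> B C, encoded by nonterminal indices.          */
typedef struct { int a, b, c; } rule_t;

/* chart[i][j]: bitmask of nonterminals deriving words i .. j-1     */
static unsigned chart[N + 1][N + 1];

/* Fill the CKY chart bottom-up.  Cell (i,j) is computed from all
 * split points k, i.e. from pairs of smaller cells -- so its cost
 * grows with the span length j - i, which is why calculation time
 * varies so much from element to element.                          */
static void cky(int n, const rule_t *rules, int nrules)
{
    for (int span = 2; span <= n; span++)
        for (int i = 0; i + span <= n; i++) {
            int j = i + span;
            for (int k = i + 1; k < j; k++)
                for (int r = 0; r < nrules; r++)
                    if ((chart[i][k] >> rules[r].b & 1) &&
                        (chart[k][j] >> rules[r].c & 1))
                        chart[i][j] |= 1u << rules[r].a;
        }
}
```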
Solution without Fine-grain Threads (RNA)
• To "create a thread" for each node, the program manages a shared task pool from which each PE takes work
• Drawback: large overhead of communication with memory (accesses to the shared pool)
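Such a hand-written task pool might look like the sketch below. This is an invented illustration of the C + threads approach, not the authors' code: `task_t`, `POOL_MAX`, and the test-and-set spinlock are all assumptions (the talk used the Solaris thread API).

```c
/* Hypothetical unit of work: a node to search, packaged as a task.   */
typedef struct { void (*fn)(void *); void *arg; } task_t;

#define POOL_MAX 1024

static task_t pool[POOL_MAX];
static int ntasks;
static volatile int lock;       /* simple test-and-set spinlock       */

static void acquire(void)
{
    while (__atomic_exchange_n(&lock, 1, __ATOMIC_ACQUIRE))
        ;                       /* spin until the lock is free        */
}

static void release(void)
{
    __atomic_store_n(&lock, 0, __ATOMIC_RELEASE);
}

/* Every push/pop goes through shared memory under a lock -- the
 * "communication with memory" overhead noted on the slide.           */
static int pool_push(task_t t)
{
    acquire();
    int ok = ntasks < POOL_MAX;
    if (ok) pool[ntasks++] = t;
    release();
    return ok;
}

static int pool_pop(task_t *out)
{
    acquire();
    int ok = ntasks > 0;
    if (ok) *out = pool[--ntasks];
    release();
    return ok;
}
```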
Solution without Fine-grain Threads (CKY)
• Calculating one element requires 0–200 synchronizations
• How to implement each synchronization?
  • small delay → simple spin
  • large delay → block wait
• How to decide between the two strategies?
  • trial & error
  • prediction
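The spin-vs-block tradeoff can be sketched as a spin-then-block wait. This is an assumed illustration using POSIX threads (the talk used Solaris threads); `event_t` and the `SPIN_LIMIT` threshold are invented, and in practice the threshold would be chosen by the trial-and-error or prediction strategies mentioned above.

```c
#include <pthread.h>

/* A one-shot event a consumer waits on and a producer signals.       */
typedef struct {
    int ready;
    pthread_mutex_t m;
    pthread_cond_t  c;
} event_t;

#define SPIN_LIMIT 1000         /* arbitrary illustrative threshold   */

static void event_wait(event_t *e)
{
    /* Spin first: cheap if the producer finishes soon (small delay). */
    for (int i = 0; i < SPIN_LIMIT; i++)
        if (__atomic_load_n(&e->ready, __ATOMIC_ACQUIRE))
            return;
    /* Fall back to blocking: yields the PE for large delays.         */
    pthread_mutex_lock(&e->m);
    while (!e->ready)
        pthread_cond_wait(&e->c, &e->m);
    pthread_mutex_unlock(&e->m);
}

static void event_signal(event_t *e)
{
    pthread_mutex_lock(&e->m);
    __atomic_store_n(&e->ready, 1, __ATOMIC_RELEASE);
    pthread_cond_signal(&e->c);
    pthread_mutex_unlock(&e->m);
}
```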
Language with Fine-grain Threads
• Schematic [Taura et al. 96] = Scheme + future + touch [Halstead 85]
  • future: thread creation (returns a channel)
  • touch: synchronization (waits on the channel)

(define (fib x)
  (if (< x 2)
      1
      (let ((r1 (future (fib (- x 1))))
            (r2 (future (fib (- x 2)))))
        (+ (touch r1) (touch r2)))))
Thread Management in Schematic
• Lazy Task Creation [Mohr et al. 91]
• (figure: future calls accumulate as frames on PE A's stack; an idle PE B steals a task from them)
Synchronization on Registers
• StackThreads [Taura 97]
• (figure: on each PE, synchronized values are passed in registers instead of going through memory)
Synchronization by Code Duplication
Source program: work A; (touch r); work B
The compiler duplicates work B into two versions of the generated code:

work A
if (r has value) {
  work B        /* ver. 1: simple spin -- value already available  */
} else {
  c = closure(cont, fv1, ...);
  put_closure(r, c);
  /* switch to another work */
  ...
}
cont(c, v) {
  work B        /* ver. 2: block wait -- resumed when r is filled  */
}

+ heuristics to decide which version to duplicate
What Description Can Be Omitted in Schematic?
• Management of fine-grain tasks
  • future ⇔ manipulation of a task pool + load balancing (in C + threads)
• Synchronization details
  • touch ⇔ manipulation of a communication medium (+ aggressive optimizations)
Code for Parallel Execution (RNA)

C:
int search_node(...) {
  if (condition) {
    ...
  } else {
    child = ...;
    ...
    search_node(...);
    ...
  }
}

Schematic:
(define (search_node ...)
  (if condition
      'done
      (let ((child ...))
        ...
        (search_node ...)
        ...)))

C: whole program 1566 lines, parallel portion 537 lines (34%)
Schematic: whole program 453 lines, parallel portion 29 lines (6.4%)
Performance Evaluation (Conditions)
• Sun Ultra Enterprise 10000 (UltraSPARC 250 MHz × 64)
• Solaris 2.5.1
• Solaris threads (user-level threads)
• GC time not included
• Runtime type checks omitted
Related Work • ICC++ [Chien et al. 97] • Similar study using 7 apps • Experiments on distributed memory machines • Focus on • namespace management • data locality • object-consistency model
Conclusion
• We demonstrated the usefulness of languages with fine-grain threads
  • task-pool-like execution obtained with a simple description
  • aggressive optimizations for synchronization
• Experimental results
  • a factor of 2.8 slower than C
  • scalability comparable to C