This presentation implements and evaluates a language with fine-grain thread creation on a shared-memory parallel computer, examining its effectiveness, scalability, and speed for irregular parallel applications. Case studies on two applications, RNA and CKY, compare solutions with and without fine-grain threads, giving insight into program description cost and scalability. The talk closes with a detailed evaluation of performance and thread management, and conclusions on the experimental results and the overheads identified.
An Implementation and Performance Evaluation of Language with Fine-Grain Thread Creation on Shared Memory Parallel Computer
Yoshihiro Oyama, Kenjiro Taura, Toshio Endo, Akinori Yonezawa
Department of Information Science, Faculty of Science, University of Tokyo
Background
• "Irregular" parallel applications
  • tasks are not identified until runtime
  • synchronization structure is complicated
• Languages with fine-grain threads
  • a promising approach to handle the complexity
Motivation
Q: Are fine-grain threads really effective?
• Easy to describe irregular parallelism?
• Scalable?
• Fast?
Many sophisticated designs and implementation techniques have been proposed so far, but case studies that answer this question are few.
Goal
• Case study to better understand the effectiveness of fine-grain threads
• Approach without fine-grain threads (C + Solaris threads) vs. approach with fine-grain threads (our language Schematic)
• Compared in terms of:
  • program description cost
  • speed on 1 PE
  • scalability on a 64-PE SMP
Overview • Applications ( RNA & CKY ) • Solutions without fine-grain threads • Solutions with fine-grain threads • Performance evaluation
Case Study 1: RNA (protein secondary structure prediction)
• Problem: find the path with the largest weight that satisfies a certain condition
• Algorithm: simple node traversal + pruning over an unbalanced tree
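The traversal-plus-pruning scheme can be sketched as a sequential C routine. This is a minimal sketch, not the authors' code: `node_t`, the weight fields, and the optimistic `bound` argument are all invented for illustration.

```c
/* Hypothetical tree node; the actual RNA search state is more complex. */
typedef struct node {
    int weight;                 /* weight contributed by this node      */
    struct node *children[4];
    int nchildren;
} node_t;

static int best;                /* best total weight found so far       */

/* Depth-first traversal with pruning: a subtree is skipped when even an
 * optimistic bound on its remaining weight cannot beat the current best. */
static void search_node(node_t *n, int acc, int bound)
{
    acc += n->weight;
    if (acc + bound <= best)    /* prune: cannot improve on best        */
        return;
    if (n->nchildren == 0) {    /* leaf: a complete path                */
        if (acc > best)
            best = acc;
        return;
    }
    for (int i = 0; i < n->nchildren; i++)
        search_node(n->children[i], acc, bound);
}
```

Because the tree is unbalanced and pruning cuts off subtrees unpredictably, the work per subtree varies widely, which is what makes parallelizing this traversal irregular.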
Case Study 2: CKY (context-free grammar parser)
• Example input: "She is a girl whose mother is a teacher."
• Parse results are kept in a triangular matrix (actual size ≈ 100)
• Calculating a matrix element depends on previously computed elements
• Calculation time varies significantly from element to element
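The CKY recurrence behind these dependencies can be sketched sequentially as follows. This is a toy illustration, not the authors' parser: `chart`, `rule_t`, the bitmask encoding of nonterminals, and the size limits are all made up, and the lexical cells `chart[i][i+1]` are assumed to be seeded by the caller.

```c
#define N  8           /* max sentence length (toy)                 */

/* A binary rule A -> B C, encoded by nonterminal indices.          */
typedef struct { int a, b, c; } rule_t;

/* chart[i][j]: bitmask of nonterminals deriving words i .. j-1     */
static unsigned chart[N + 1][N + 1];

/* Fill the CKY chart bottom-up.  Cell (i,j) is computed from all
 * split points k, i.e. from pairs of smaller cells -- so its cost
 * grows with the span length j - i, which is why calculation time
 * varies so much from element to element.                          */
static void cky(int n, const rule_t *rules, int nrules)
{
    for (int span = 2; span <= n; span++)
        for (int i = 0; i + span <= n; i++) {
            int j = i + span;
            for (int k = i + 1; k < j; k++)
                for (int r = 0; r < nrules; r++)
                    if ((chart[i][k] >> rules[r].b & 1) &&
                        (chart[k][j] >> rules[r].c & 1))
                        chart[i][j] |= 1u << rules[r].a;
        }
}
```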
Solution without Fine-grain Threads (RNA)
• To "create a thread" for each node, the program manages a shared task pool from which each PE takes work
• Drawback: large overhead of communication with memory (accesses to the shared pool)
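Such a hand-written task pool might look like the sketch below. This is an invented illustration of the C + threads approach, not the authors' code: `task_t`, `POOL_MAX`, and the test-and-set spinlock are all assumptions (the talk used the Solaris thread API).

```c
/* Hypothetical unit of work: a node to search, packaged as a task.   */
typedef struct { void (*fn)(void *); void *arg; } task_t;

#define POOL_MAX 1024

static task_t pool[POOL_MAX];
static int ntasks;
static volatile int lock;       /* simple test-and-set spinlock       */

static void acquire(void)
{
    while (__atomic_exchange_n(&lock, 1, __ATOMIC_ACQUIRE))
        ;                       /* spin until the lock is free        */
}

static void release(void)
{
    __atomic_store_n(&lock, 0, __ATOMIC_RELEASE);
}

/* Every push/pop goes through shared memory under a lock -- the
 * "communication with memory" overhead noted on the slide.           */
static int pool_push(task_t t)
{
    acquire();
    int ok = ntasks < POOL_MAX;
    if (ok) pool[ntasks++] = t;
    release();
    return ok;
}

static int pool_pop(task_t *out)
{
    acquire();
    int ok = ntasks > 0;
    if (ok) *out = pool[--ntasks];
    release();
    return ok;
}
```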
Solution without Fine-grain Threads (CKY)
• Calculating one element requires 0–200 synchronizations
• How to implement each synchronization?
  • small delay → simple spin
  • large delay → block wait
• How to decide between the two strategies?
  • trial & error
  • prediction
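The spin-vs-block tradeoff can be sketched as a spin-then-block wait. This is an assumed illustration using POSIX threads (the talk used Solaris threads); `event_t` and the `SPIN_LIMIT` threshold are invented, and in practice the threshold would be chosen by the trial-and-error or prediction strategies mentioned above.

```c
#include <pthread.h>

/* A one-shot event a consumer waits on and a producer signals.       */
typedef struct {
    int ready;
    pthread_mutex_t m;
    pthread_cond_t  c;
} event_t;

#define SPIN_LIMIT 1000         /* arbitrary illustrative threshold   */

static void event_wait(event_t *e)
{
    /* Spin first: cheap if the producer finishes soon (small delay). */
    for (int i = 0; i < SPIN_LIMIT; i++)
        if (__atomic_load_n(&e->ready, __ATOMIC_ACQUIRE))
            return;
    /* Fall back to blocking: yields the PE for large delays.         */
    pthread_mutex_lock(&e->m);
    while (!e->ready)
        pthread_cond_wait(&e->c, &e->m);
    pthread_mutex_unlock(&e->m);
}

static void event_signal(event_t *e)
{
    pthread_mutex_lock(&e->m);
    __atomic_store_n(&e->ready, 1, __ATOMIC_RELEASE);
    pthread_cond_signal(&e->c);
    pthread_mutex_unlock(&e->m);
}
```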
Language with Fine-grain Threads
• Schematic [Taura et al. 96] = Scheme + future + touch [Halstead 85]
  • future: thread creation (returns a channel)
  • touch: synchronization (waits on the channel)

(define (fib x)
  (if (< x 2)
      1
      (let ((r1 (future (fib (- x 1))))
            (r2 (future (fib (- x 2)))))
        (+ (touch r1) (touch r2)))))
Thread Management in Schematic
• Lazy Task Creation [Mohr et al. 91]
• (figure: future calls accumulate as frames on PE A's stack; an idle PE B steals a task from them)
Synchronization on Registers
• StackThreads [Taura 97]
• (figure: on each PE, synchronized values are passed in registers instead of going through memory)
Synchronization by Code Duplication
Source program: work A; (touch r); work B
The compiler duplicates work B into two versions of the generated code:

work A
if (r has value) {
  work B        /* ver. 1: simple spin -- value already available  */
} else {
  c = closure(cont, fv1, ...);
  put_closure(r, c);
  /* switch to another work */
  ...
}
cont(c, v) {
  work B        /* ver. 2: block wait -- resumed when r is filled  */
}

+ heuristics to decide which version to duplicate
What Description Can Be Omitted in Schematic?
• Management of fine-grain tasks
  • future ⇔ manipulation of a task pool + load balancing (in C + threads)
• Synchronization details
  • touch ⇔ manipulation of a communication medium (+ aggressive optimizations)
Code for Parallel Execution (RNA)

C:
int search_node(...) {
  if (condition) {
    ...
  } else {
    child = ...;
    ...
    search_node(...);
    ...
  }
}

Schematic:
(define (search_node ...)
  (if condition
      'done
      (let ((child ...))
        ...
        (search_node ...)
        ...)))

C: whole program 1566 lines, parallel portion 537 lines (34%)
Schematic: whole program 453 lines, parallel portion 29 lines (6.4%)
Performance Evaluation (Conditions)
• Sun Ultra Enterprise 10000 (UltraSPARC 250 MHz × 64)
• Solaris 2.5.1
• Solaris threads (user-level threads)
• GC time not included
• Runtime type checks omitted
Related Work • ICC++ [Chien et al. 97] • Similar study using 7 apps • Experiments on distributed memory machines • Focus on • namespace management • data locality • object-consistency model
Conclusion
• We demonstrated the usefulness of languages with fine-grain threads
  • task-pool-like execution obtained with a simple description
  • aggressive optimizations for synchronization
• Experimental results
  • a factor of 2.8 slower than C
  • scalability comparable to C