A New Optimization Technique for the Inspector-Executor Method

Presentation Transcript


  1. A New Optimization Technique for the Inspector-Executor Method
     Daisuke Yokota†, Shigeru Chiba‡, Kozo Itano†
     † University of Tsukuba  ‡ Tokyo Institute of Technology

  2. Computer Simulation is Expensive
     • Physicists run a parallel computer at our campus every day for simulation.
     • Our target parallel computer costs $45,000 every month.
       • That is about $1 per minute, comparable to an international phone call between Japan and Canada.
     • The programs run for a very long time: a week or more.

  3. Hardware for Fast Inter-node Communication
     • Our computer, the SR2201, has such hardware for avoiding the communication bottleneck.
     • It should be used, but in reality it is not, at least at our computer center.
       • It is not used by the compiler: generating optimized code for that hardware is difficult.
       • It is not used by programmers: they are physicists, not computer engineers.

  4. Our HPF Compiler
     • Optimization
       • Utilizing the hardware for inter-node communication
     • Technique
       • The inspector-executor method plus static code optimization
       • Compilation is executed in parallel
     • Target
       • Hitachi SR2201

  5. Optimizations
     • Reducing the amount of exchanged data
       • Our compiler allocates loop iterations to appropriate nodes to minimize communication.
     • Merging multiple messages
       • Our target computer provides hardware support; our compiler tries to use that hardware.
     • Reusing TCWs
       • Another hardware support, which reduces the setup time for each message sent.

  6. Merging Multiple Messages
     • Hardware support: block-stride communication
     • Multiple messages are sent as a single message, provided the data is stored at regular intervals (sketched below).
     [Figure: sender and receiver nodes exchanging merged messages.]
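The slides do not show the SR2201's block-stride interface itself. As a minimal sketch of the same idea, the fragment below uses MPI derived datatypes as a stand-in: the strided elements are described once and shipped as a single message instead of one message per element. The array name, sizes, and the use of MPI are illustrative assumptions, not the compiler's actual runtime calls.

      ! Sketch only: MPI_Type_vector used as an analogy for block-stride
      ! communication; the real target uses SR2201 hardware, not MPI.
      program stride_send_sketch
        use mpi
        implicit none
        integer, parameter :: n = 8
        real    :: a(n, n)
        integer :: rank, rowtype, ierr

        call MPI_Init(ierr)
        call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
        a = real(rank)

        ! n blocks of length 1 at stride n: row 1 of a, described once.
        call MPI_Type_vector(n, 1, n, MPI_REAL, rowtype, ierr)
        call MPI_Type_commit(rowtype, ierr)

        if (rank == 0) then
          ! One merged message instead of n element-wise sends.
          call MPI_Send(a(1, 1), 1, rowtype, 1, 0, MPI_COMM_WORLD, ierr)
        else if (rank == 1) then
          call MPI_Recv(a(1, 1), 1, rowtype, 0, 0, MPI_COMM_WORLD, &
                        MPI_STATUS_IGNORE, ierr)
        end if

        call MPI_Type_free(rowtype, ierr)
        call MPI_Finalize(ierr)
      end program stride_send_sketch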

  7. Reusing TCW
     • TCW: Transfer Control Word
     • The parameters passed to the communication hardware are set up once and reused (sketched below).
       Before optimization:            After optimization:
         do I = 1, ...                   setting
           setting                       do I = 1, ...
           send                            send
         end do                          end do
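The TCW mechanism itself is SR2201-specific, so the sketch below uses MPI persistent requests purely as an analogy: the message parameters ("setting") are registered once outside the loop, and each iteration only triggers the transfer ("send"). Buffer size, loop count, and the MPI calls are assumptions for illustration.

      ! Sketch only: persistent requests as an analogy for reusing a TCW.
      program tcw_reuse_sketch
        use mpi
        implicit none
        real    :: buf(1024)
        integer :: rank, req, i, ierr

        call MPI_Init(ierr)
        call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
        buf = 0.0

        if (rank == 0) then
          ! "setting": communication parameters prepared once, before the loop.
          call MPI_Send_init(buf, size(buf), MPI_REAL, 1, 0, &
                             MPI_COMM_WORLD, req, ierr)
          do i = 1, 10
            ! "send": each iteration reuses the prepared parameters.
            call MPI_Start(req, ierr)
            call MPI_Wait(req, MPI_STATUS_IGNORE, ierr)
          end do
          call MPI_Request_free(req, ierr)
        else if (rank == 1) then
          do i = 1, 10
            call MPI_Recv(buf, size(buf), MPI_REAL, 0, 0, MPI_COMM_WORLD, &
                          MPI_STATUS_IGNORE, ierr)
          end do
        end if

        call MPI_Finalize(ierr)
      end program tcw_reuse_sketch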

  8. Implementation: Original Inspector-Executor Method
     • Goal: parallelize a loop by runtime analysis.
     • Inspector (runs at run time)
       • Determines which array elements must be exchanged among nodes.
       • Hands the resulting data of the analysis to the Executor.
     • Executor
       • Exchanges array elements.
       • Executes the loop body in parallel.
       • Exchanges array elements again.
     (A sketch of this structure follows.)
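A minimal sketch of that structure, assuming an irregular access pattern b(idx(i)) and a local ownership range lo..hi; the names are illustrative, and the actual analysis and communication are more involved than shown.

      ! Sketch only: the inspector records which accesses are remote, the
      ! executor exchanges those elements (communication omitted) and runs the loop.
      subroutine irregular_update(a, b, idx, lo, hi, n)
        implicit none
        integer, intent(in)    :: lo, hi, n      ! elements lo..hi are local
        integer, intent(in)    :: idx(n)
        real,    intent(inout) :: a(n)
        real,    intent(in)    :: b(n)
        integer :: i, nremote
        integer :: remote(n)                     ! the analysis result (a table)

        ! --- Inspector: runs at run time, before every execution of the loop ---
        nremote = 0
        do i = lo, hi
          if (idx(i) < lo .or. idx(i) > hi) then
            nremote = nremote + 1
            remote(nremote) = idx(i)             ! element owned by another node
          end if
        end do

        ! --- Executor: exchange b(remote(1:nremote)) with the owning nodes
        ! (not shown), then execute the loop body in parallel ---
        do i = lo, hi
          a(i) = a(i) + b(idx(i))
        end do
      end subroutine irregular_update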

  9. Our Improved Inspector-Executor Method
     • The Inspector produces statically optimized code of the Executor.
       • What it hands over is optimized executor code, not data.
     • The Inspector runs off-line.
       • Running the Inspector is part of the compilation process.

  10. Static Code Optimization
      • The Inspector performs constant folding when generating the executor code.
      • Constant folding eliminates from the Executor:
        • the table containing the result of the analysis by the Inspector
          • saves memory space (the table is big!)
        • the memory accesses for table lookup
          • better performance
      (A before/after sketch follows.)
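A before/after sketch of what the constant folding buys, with made-up indices; the "after" routine stands for code emitted by the off-line Inspector, where the table entries have been folded in as literals so both the table and its lookups disappear.

      ! Before: generic executor, one table lookup per iteration.
      subroutine executor_table(a, b, tbl, lo, hi, n)
        implicit none
        integer, intent(in)    :: lo, hi, n, tbl(n)
        real,    intent(inout) :: a(n)
        real,    intent(in)    :: b(n)
        integer :: i
        do i = lo, hi
          a(i) = a(i) + b(tbl(i))
        end do
      end subroutine executor_table

      ! After: executor generated for lo = 1, hi = 4 (n >= 8 assumed); the
      ! looked-up indices are literals, so no table is kept in memory at all.
      subroutine executor_folded(a, b, n)
        implicit none
        integer, intent(in)    :: n
        real,    intent(inout) :: a(n)
        real,    intent(in)    :: b(n)
        a(1) = a(1) + b(7)
        a(2) = a(2) + b(3)
        a(3) = a(3) + b(8)
        a(4) = a(4) + b(2)
      end subroutine executor_folded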

  11. OUTER directive
      • Specifies the range of analysis by the Inspector.
      • We assume the program structure fits that of typical simulation programs:
        an OUTER loop that repeats millions of times during the simulation, enclosing
        an INNER loop, the executor, which is the part that is parallelized (sketched below).
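A sketch of the assumed program shape; the spelling of the directive below is our placeholder, not necessarily the compiler's actual syntax. The outer time-step loop repeats with the same communication pattern, so the Inspector analyses its body once, off-line, and only the inner loop is parallelized.

      program outer_sketch
        implicit none
        integer, parameter :: n = 1000, nsteps = 1000000
        real :: a(n), b(n)
        integer :: t, i
        a = 0.0
        b = 1.0

      !OUTER                      ! hypothetical spelling of the OUTER directive
        do t = 1, nsteps          ! OUTER loop: analysed once by the Inspector
          do i = 2, n - 1         ! INNER loop: the executor, run in parallel
            a(i) = 0.5 * (b(i - 1) + b(i + 1))
          end do
          b = a
        end do
      end program outer_sketch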

  12. Restrictions
      • Programmers must guarantee that every iteration of the OUTER loop exchanges the
        same set of array elements among nodes, since the Inspector analyzes only the
        first iteration.
      • The set of exchanged array elements is determined without executing inter-node
        communication: the Inspector does not perform the communication, to keep
        compilation time down.
      • As a consequence, our compiler cannot compile the IS benchmark of the NAS
        Parallel Benchmarks.

  13. Our Compiler Runs on a PC Cluster
      • For executing the Inspector in parallel.
        • The Inspector must analyze a large amount of data.
        • In the original inspector-executor method, the inspector runs in parallel;
          since our inspector is part of the compiler, the compiler itself runs in parallel.

  14. Execution Flow of Our Compiler
      Source Program
        → Generate Inspector                (per node, in parallel)
        → Inspector Log                     (per node)
        → Analysis                          (per node, in parallel)
        → Exchange Information of Messages
        → Code Generation                   (per node, in parallel)
        → Translate into SPMD
        → SPMD Parallel Code

  15. Our Prototype Compiler
      • Input: Fortran77 + HPF + OUTER directive
      • Output: SPMD Fortran code
      • Target machines
        • Compilation: Pentium III 733 MHz x 16 nodes, RedHat 7.1, 100Base Ethernet
        • Execution: Hitachi SR2201, PowerPC-based 150 MHz x 16 nodes

  16. Experiments: pde1 benchmark
      • Poisson equation solver
      • Good for massively parallel computing
        • Regular array accesses
        • High scalability
      • Distributed array accesses are centralized in a small region of source code

  17. Execution Time (pde1)
      [Chart: speedup vs. number of nodes (1 to 16) for our compiler, Hitachi HPF, and a
       linear-speedup reference; annotated execution times are 249 sec for ours and
       137,100 sec for Hitachi HPF.]
      Hitachi's HPF compiler needs more directives for better performance.

  18. Effects of Static Code Optimization (pde1)
      [Chart: reduction of execution time vs. number of nodes.]

  19. Compilation Time (pde1)
      [Chart: compilation time (sec, up to about 250) vs. number of nodes (2 to 16), broken
       down into backend Fortran, sequential, parallel, and data-exchange phases.]
      The long compilation time pays off if the OUTER loop iterates many times.

  20. Experiment: FT-a
      • 3D Fourier transformation
      • Features
        • Irregular array accesses
        • Distributed array accesses are centralized in a small region of source code

  21. Execution Time (FT-a)
      [Chart: speedup vs. number of nodes (1 to 16) for our compiler, Hitachi HPF, and a
       linear-speedup reference; annotated execution times are 46 sec and 4,898 sec.]

  22. Compilation Time (FT-a)
      [Chart: compilation time (sec, up to about 350) vs. number of nodes (2 to 16), broken
       down into backend, sequential, parallel, and data-exchange phases.]

  23. Experiments: BT-a
      • Block tri-diagonal solver
      • Features
        • A small number of irregular array accesses
        • Distributed array accesses are scattered all over the source code

  24. Execution Time (BT-a)
      [Chart: speedup vs. number of nodes (1 to 16) for our compiler, Hitachi HPF, and a
       linear-speedup reference; annotated execution times are 1,430 sec for ours and
       1,370,000 sec for Hitachi HPF.]

  25. Compilation Time (BT-a)
      [Chart: compilation time (sec, up to about 40,000) vs. number of nodes (2 to 16),
       broken down into backend, sequential, parallel, and data-exchange phases.]
      Our compiler cannot achieve good performance here because the Inspector must
      analyze a huge number of array accesses.

  26. Conclusion
      • An HPF compiler that utilizes hardware support for inter-node communication
        • Inspector-executor method with static code optimization
        • The Inspector produces optimized executor code
        • The compiler runs on a PC cluster
      • Experiments
        • The long compilation time is acceptable for simulation programs that run
          for a long time

  27. Backup Slides

  28. Reducing the Amount of Communication (optimization)
      • Loop iterations are distributed so that the amount of communication is minimized.
      • The data partitioning itself is specified with HPF.
      • A preliminary run examines how much communication each assignment would cause.
      [Figure: required communication volume per loop iteration for PE1 and PE2, and the
       processor assigned to each iteration.]

  29. Merging Multiple Messages
      • Our compiler collects several messages and sends them as a single message.
      • Messages in a loop with the INDEPENDENT directive can be merged.
        • This directive specifies that the result of the loop is independent of the
          execution order of its iterations.
      • Our compiler finds block-stride communication by pattern matching to reduce
        the number of messages (see the sketch below).
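A minimal HPF-style sketch of such a loop, with illustrative names and distribution (not taken from the paper): INDEPENDENT asserts that the iterations may run in any order, and the regular b(i-1) access is the kind of pattern that is recognized and merged into block-stride communication.

      program independent_sketch
        implicit none
        integer, parameter :: n = 1024
        real :: a(n), b(n)
      !HPF$ DISTRIBUTE a(BLOCK)
      !HPF$ ALIGN b(:) WITH a(:)
        integer :: i

        b = 1.0
        a = 0.0

      !HPF$ INDEPENDENT
        do i = 2, n
          a(i) = 0.5 * (b(i) + b(i - 1))   ! b(i-1) may live on the neighbouring node
        end do

        print *, a(n / 2)
      end program independent_sketch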

  30. Future Work
      • Reduce the number of messages further.
        • Use block-stride communication more aggressively: even if some redundant data
          is sent, merging many messages into a few can still be a win.
      • Prevent the generated code from growing too long.
        • If the data dependences between processors are too complex, our compiler
          generates too many communication operations.
      • Improve the scalability of compilation time.
        • The inspector log for BT was too large.
      • Experiments with real simulation programs.

  31. CP-PACS/Pilot-3
      • Distributed-memory machines
      • Center for Computational Physics at University of Tsukuba
      • 2048 PEs (CP-PACS), 128 PEs (Pilot-3)
      • Hyper-crossbar network
      • RDMA

  32. Our Optimizer to Solve the Problem
      • Use of special communication devices
        • Parallel machines sometimes have special hardware that reduces the time for
          inter-node communication.
      • Development of compilers for easy and well-known languages
        • Fortran77, simple HPF (High Performance Fortran)
      • Runtime analysis
        • A communication profiler running on a PC cluster

  33. Effects of Static Code Optimization (pde1)
      [Chart: reduction of execution time vs. number of nodes; same as slide 18.]
