
Automatic Optimisation of Parallel Linear Algebra Routines in Systems with Variable Load




  1. Automatic Optimisation of Parallel Linear Algebra Routines in Systems with Variable Load. Jack Dongarra, Kenneth Roche, Javier Cuenca, Domingo Giménez, José González.

  2. Optimisation of Linear Algebra Routines
     • Traditional method: hand-optimisation for each platform
     • Time-consuming
     • Incompatible with hardware evolution
     • Incompatible with changes in the system (architecture and basic libraries)
     • Unsuitable for systems with variable load
     • Misuse by non-expert users

  3. Our Approach
     DESIGN: Modelling the Linear Algebra Routine (LAR): Texec = f(SP, AP, n)
       • SP: System Parameters
       • AP: Algorithmic Parameters
       • n: Problem size
     INSTALLATION: Estimation of SP
     RUN-TIME: Selection of AP values; Execution of LAR

  4. Our Approach
     Static Model of LAR: situation of the platform at installation time.
     LARs: Jacobi methods for the symmetric eigenvalue problem; Gauss elimination; LU factorisation; QR factorisation
     Platforms: Cluster of Workstations; Cluster of PCs; SGI Origin 2000; IBM SP2

  5. Our Approach
     Static Model of LAR: situation of the platform at installation time.
     Dynamic Model of LAR: situation of the platform at run-time.
     LARs: Jacobi methods for the symmetric eigenvalue problem; Gauss elimination; LU factorisation; QR factorisation
     Platforms: Cluster of Workstations; Cluster of PCs; SGI Origin 2000; IBM SP2

  6. DESIGN PROCESS
     LAR: Linear Algebra Routine, made by the LAR designer.
     Example of LAR: Parallel Block LU factorisation
     [Diagram. DESIGN: LAR]

  7. Modelling the LAR
     [Diagram. DESIGN: LAR; Modelling the LAR; MODEL]

  8. Modelling the LAR
     Made by the LAR designer, only once per LAR.
     MODEL: Texec = f(SP, AP, n)
       • SP: System Parameters
       • AP: Algorithmic Parameters
       • n: Problem size
     [Diagram. DESIGN: LAR; Modelling the LAR; MODEL]

  9. Modelling the LAR
     LAR: Parallel Block LU factorisation
     MODEL:
       • SP: k3, k2, ts, tw
       • AP: p, b
       • n: Problem size
     [Diagram. DESIGN: LAR; Modelling the LAR; MODEL]
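As a rough illustration of what a model of the form Texec = f(SP, AP, n) could look like for a parallel block LU factorisation, the sketch below adds an arithmetic term and a communication term. The formula itself, the names `SystemParams`, `AlgParams` and `texec_lu`, and the reading of k3 and k2 as per-operation costs of level-3 and level-2 BLAS kernels are assumptions made for illustration; the slides do not give the authors' actual expression.

```python
import math
from dataclasses import dataclass


@dataclass
class SystemParams:
    k3: float  # assumed: time per arithmetic operation in level-3 BLAS kernels
    k2: float  # assumed: time per arithmetic operation in level-2 BLAS kernels
    ts: float  # assumed: start-up time of a message
    tw: float  # time to send one word


@dataclass
class AlgParams:
    p: int  # number of processors
    b: int  # block size


def texec_lu(sp: SystemParams, ap: AlgParams, n: int) -> float:
    """Hypothetical cost model for block LU; NOT the authors' formula."""
    # Bulk of the work: about 2n^3/3 operations in level-3 kernels, split over p.
    arith = (2.0 * n**3 / (3.0 * ap.p)) * sp.k3
    # Panel factorisations: roughly n^2 * b operations in level-2 kernels.
    arith += (n**2 * ap.b / ap.p) * sp.k2
    # One panel broadcast of about n*b words per block step, log2(p) stages each.
    n_steps = math.ceil(n / ap.b)
    comm = n_steps * math.log2(ap.p) * (sp.ts + n * ap.b * sp.tw)
    return arith + comm
```

With p = 1 the communication term vanishes, so the sketch reduces to a purely arithmetic estimate.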

  10. Implementation of SP-Estimators
     [Diagram. DESIGN: LAR; Modelling the LAR; MODEL; Implementation of SP-Estimators; SP-Estimators]

  11. Implementation of SP-Estimators
     Estimators of Arithmetic-SP: computation kernel of the LAR
       • Similar storage scheme
       • Similar quantity of data
     Estimators of Communication-SP: communication kernel of the LAR
       • Similar kind of communication
       • Similar quantity of data
     [Diagram. DESIGN: LAR; Modelling the LAR; MODEL; Implementation of SP-Estimators; SP-Estimators]

  12. INSTALLATION PROCESS
     Installation process: only once per platform, done by the system manager.
     [Diagram. DESIGN: LAR; Modelling the LAR; MODEL; Implementation of SP-Estimators; SP-Estimators. INSTALLATION: no elements yet]

  13. Estimation of Static-SP
     [Diagram. DESIGN: LAR; Modelling the LAR; MODEL; Implementation of SP-Estimators; SP-Estimators. INSTALLATION: Basic Libraries; Installation-File; Estimation of Static-SP; Static-SP-File]

  14. Estimation of Static-SP
     Basic Libraries:
       • Basic Communication Library: MPI, PVM
       • Basic Linear Algebra Library: reference BLAS, machine-specific BLAS, ATLAS
     Installation File: SP values are obtained using the information (n and AP values) of this file.
     [Diagram. DESIGN: LAR; Modelling the LAR; MODEL; Implementation of SP-Estimators; SP-Estimators. INSTALLATION: Basic Libraries; Installation-File; Estimation of Static-SP; Static-SP-File]

  15. Estimation of Static-SP
     Platform: Cluster of Pentium III + Fast Ethernet
     Basic Libraries: ATLAS and MPI

     Estimation of the Static-SP k3-static (in sec)
       Block size   16      32      64      128
       k3-static    0.0038  0.0033  0.0030  0.0027

     Estimation of the Static-SP tw-static (in sec)
       Message size (Kbytes)  32     256    1024   2048
       tw-static              0.700  0.690  0.680  0.675

     [Diagram. DESIGN: LAR; Modelling the LAR; MODEL; Implementation of SP-Estimators; SP-Estimators. INSTALLATION: Basic Libraries; Installation-File; Estimation of Static-SP; Static-SP-File]
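The two tables above suggest that the Static-SP-File stores each system parameter at a few measured points rather than as a single number. A minimal sketch of how such values could be looked up at run-time follows; the dictionary layout, the function name `lookup_sp` and the nearest-point rule are assumptions, since the slides do not say how intermediate sizes are handled.

```python
# Measured values copied from the tables on this slide (units as stated there).
K3_STATIC = {16: 0.0038, 32: 0.0033, 64: 0.0030, 128: 0.0027}         # key: block size
TW_STATIC = {32: 0.700, 256: 0.690, 1024: 0.680, 2048: 0.675}          # key: message size in Kbytes


def lookup_sp(table: dict, key: float) -> float:
    """Return the value measured at the point closest to `key`.

    Nearest-point lookup is an illustrative choice, not the authors' method.
    """
    nearest = min(table, key=lambda measured: abs(measured - key))
    return table[nearest]


print(lookup_sp(K3_STATIC, 64))    # value measured for block size 64
print(lookup_sp(TW_STATIC, 300))   # closest measured message size is 256 Kbytes
```

Linear interpolation between neighbouring points would be an equally plausible rule; the nearest-point version is used here only because it is the simplest to check by hand.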

  16. RUN-TIME PROCESS
     [Diagram. DESIGN: LAR; Modelling the LAR; MODEL; Implementation of SP-Estimators; SP-Estimators. INSTALLATION: Basic Libraries; Installation-File; Estimation of Static-SP; Static-SP-File. RUN-TIME: no elements yet]

  17. RUN-TIME PROCESS: Static approach
     [Diagram. DESIGN: LAR; Modelling the LAR; MODEL; Implementation of SP-Estimators; SP-Estimators. INSTALLATION: Basic Libraries; Installation-File; Estimation of Static-SP; Static-SP-File. RUN-TIME: Selection of Optimum AP; Optimum-AP]

  18. RUN-TIME PROCESS: Static approach
     [Diagram. DESIGN: LAR; Modelling the LAR; MODEL; Implementation of SP-Estimators; SP-Estimators. INSTALLATION: Basic Libraries; Installation-File; Estimation of Static-SP; Static-SP-File. RUN-TIME: Selection of Optimum AP; Optimum-AP; Execution of LAR]

  19. RUN-TIME PROCESS: Dynamic approach
     [Diagram. DESIGN: LAR; Modelling the LAR; MODEL; Implementation of SP-Estimators; SP-Estimators. INSTALLATION: Basic Libraries; Installation-File; Estimation of Static-SP; Static-SP-File. RUN-TIME: no elements yet]

  20. Call to NWS
     [Diagram. DESIGN: LAR; Modelling the LAR; MODEL; Implementation of SP-Estimators; SP-Estimators. INSTALLATION: Basic Libraries; Installation-File; Estimation of Static-SP; Static-SP-File. RUN-TIME: Call to NWS; NWS Information]

  21. Call to NWS
     The NWS (Network Weather Service) is called and it reports:
       • the fraction of available CPU (fCPU)
       • the current word-sending time (tw-current) for specific n and AP values (n0, AP0)
     Then the fraction of available network is calculated: [formula not preserved in the transcript]
     [Diagram. RUN-TIME: Call to NWS; NWS Information]
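The formula for the fraction of available network did not survive in the transcript. One plausible reading, given that NWS reports tw-current at the same point (n0, AP0) for which a tw-static value exists, is the ratio of the two. The sketch below encodes that reading; the ratio, the cap at 1.0 and the name `network_fraction` are assumptions, not the authors' stated definition.

```python
def network_fraction(tw_static: float, tw_current: float) -> float:
    """Assumed definition: fraction of the network available right now.

    Both times must refer to the same problem size and AP values (n0, AP0).
    A slower current word-sending time gives a smaller fraction.
    """
    if tw_static <= 0 or tw_current <= 0:
        raise ValueError("word-sending times must be positive")
    # Cap at 1.0 so a measurement faster than the installation-time one
    # is treated as a fully available network.
    return min(1.0, tw_static / tw_current)


print(network_fraction(0.7, 0.7))  # unloaded network
print(network_fraction(0.7, 1.4))  # word-sending time has doubled
```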

  22. Call to NWS
     [Diagram. DESIGN: LAR; Modelling the LAR; MODEL; Implementation of SP-Estimators; SP-Estimators. INSTALLATION: Basic Libraries; Installation-File; Estimation of Static-SP; Static-SP-File. RUN-TIME: Call to NWS; NWS Information]

  23. Dynamic Adjustment of SP
     [Diagram. DESIGN: LAR; Modelling the LAR; MODEL; Implementation of SP-Estimators; SP-Estimators. INSTALLATION: Basic Libraries; Installation-File; Estimation of Static-SP; Static-SP-File. RUN-TIME: Call to NWS; NWS Information; Dynamic Adjustment of SP; Current-SP]

  24. Dynamic Adjustment of SP
     The values of the SP are adjusted according to the current situation: [formulas not preserved in the transcript]
     [Diagram. RUN-TIME: Call to NWS; NWS Information; Static-SP-File; Dynamic Adjustment of SP; Current-SP]
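The adjustment formulas are missing from the transcript. A simple scheme consistent with the inputs named on the slide (Static-SP-File plus NWS information) would divide the arithmetic parameters by the fraction of available CPU and the communication parameters by the fraction of available network. The sketch below shows that scheme; the scaling rule, the dictionary keys and the name `adjust_sp` are assumptions for illustration.

```python
def adjust_sp(static_sp: dict, f_cpu: float, f_net: float) -> dict:
    """Scale installation-time SP values to the current load (assumed rule).

    static_sp: dict with keys 'k3', 'k2', 'ts', 'tw' (values from the Static-SP-File).
    f_cpu:     fraction of available CPU, in (0, 1].
    f_net:     fraction of available network, in (0, 1].
    """
    if not (0 < f_cpu <= 1 and 0 < f_net <= 1):
        raise ValueError("fractions must lie in (0, 1]")
    return {
        # Less available CPU means each arithmetic operation takes longer.
        "k3": static_sp["k3"] / f_cpu,
        "k2": static_sp["k2"] / f_cpu,
        # Less available network means each message takes longer.
        "ts": static_sp["ts"] / f_net,
        "tw": static_sp["tw"] / f_net,
    }


current = adjust_sp({"k3": 0.003, "k2": 0.004, "ts": 100.0, "tw": 0.7}, f_cpu=0.5, f_net=1.0)
print(current)
```

Under this rule a fully available platform (both fractions equal to 1) leaves the static values unchanged, which matches the static approach as a special case.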

  25. Dynamic Adjustment of SP
     [Diagram. DESIGN: LAR; Modelling the LAR; MODEL; Implementation of SP-Estimators; SP-Estimators. INSTALLATION: Basic Libraries; Installation-File; Estimation of Static-SP; Static-SP-File. RUN-TIME: Call to NWS; NWS Information; Dynamic Adjustment of SP; Current-SP]

  26. Selection of Optimum AP
     [Diagram. DESIGN: LAR; Modelling the LAR; MODEL; Implementation of SP-Estimators; SP-Estimators. INSTALLATION: Basic Libraries; Installation-File; Estimation of Static-SP; Static-SP-File. RUN-TIME: Call to NWS; NWS Information; Dynamic Adjustment of SP; Current-SP; Selection of Optimum AP; Optimum-AP]
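Selecting the optimum AP amounts to evaluating the model with the current SP values for each candidate (p, b) and keeping the combination with the lowest predicted time. The sketch below does this by exhaustive search; the search strategy, the candidate lists, `toy_model` and the name `select_optimum_ap` are assumptions, as the slides do not describe how the search is carried out.

```python
from itertools import product


def select_optimum_ap(model, sp, n, procs, block_sizes):
    """Return (predicted_time, p, b) minimising model(sp, p, b, n).

    Exhaustive search over all candidate pairs; an illustrative strategy only.
    """
    best = None
    for p, b in product(procs, block_sizes):
        t = model(sp, p, b, n)
        if best is None or t < best[0]:
            best = (t, p, b)
    return best


def toy_model(sp, p, b, n):
    """Hypothetical stand-in for Texec = f(SP, AP, n); not the authors' model."""
    arith = sp["k3"] * 2.0 * n**3 / (3.0 * p)
    comm = (n / b) * (sp["ts"] + sp["tw"] * n * b) * (p - 1)
    return arith + comm


current_sp = {"k3": 1e-9, "ts": 1e-4, "tw": 1e-8}
t, p, b = select_optimum_ap(toy_model, current_sp, 2048, [1, 2, 4, 8], [16, 32, 64, 128])
print(f"predicted time {t:.3f} with p={p}, b={b}")
```

Because the candidate sets are small, a full enumeration costs only a handful of model evaluations, which is negligible next to the execution of the routine itself.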

  27. Execution of LAR
     [Diagram. DESIGN: LAR; Modelling the LAR; MODEL; Implementation of SP-Estimators; SP-Estimators. INSTALLATION: Basic Libraries; Installation-File; Estimation of Static-SP; Static-SP-File. RUN-TIME: Call to NWS; NWS Information; Dynamic Adjustment of SP; Current-SP; Selection of Optimum AP; Optimum-AP; Execution of LAR]

  28. Platform load: different situations studied

     CPU availability
       Situation  nodo1  nodo2  nodo3  nodo4  nodo5  nodo6  nodo7  nodo8
       A          100%   100%   100%   100%   100%   100%   100%   100%
       B          80%    80%    80%    80%    100%   100%   100%   100%
       C          60%    60%    60%    60%    100%   100%   100%   100%
       D          60%    60%    60%    60%    100%   100%   80%    80%
       E          60%    60%    60%    60%    100%   100%   50%    50%

     tw-current
       Situation  Values
       A          0.7 sec
       B          0.8 sec, 0.7 sec
       C          1.8 sec, 0.7 sec
       D          1.8 sec, 0.7 sec, 0.8 sec
       E          1.8 sec, 0.7 sec, 4.0 sec

  29. Optimum AP for the different situations studied

     Block size, by situation of the platform load
       n      A    B    C    D    E
       1024   32   32   64   64   64
       2048   64   64   64   128  128
       3072   64   64   128  128  128

     Number of nodes to use (p = r × c), by situation of the platform load
       n      A    B    C    D    E
       1024   4×2  4×2  2×2  2×2  2×1
       2048   4×2  4×2  2×2  2×2  2×1
       3072   4×2  4×2  2×2  2×2  2×1

  30. Experimental Time: deviations from the Optimum

  31. Experimental Time: deviations from the Optimum

  32. Experimental Time: deviations from the Optimum

  33. Conclusions and Future Work
     • The use of the proposed methodology is viable in systems where the load is stable or variable.
     • Software like NWS is suitable for the adjustment of the system parameters' values obtained at installation time.
     • The heterogeneous load case offers many more possibilities than the one studied.
