
Parallelizing Legacy Applications in Message Passing Programming Model and the Example of MOPAC


Presentation Transcript


  1. Parallelizing Legacy Applications in Message Passing Programming Model and the Example of MOPAC Tseng-Hui (Frank) Lin thlin@npac.syr.edu tsenglin@us.ibm.com

  2. Legacy Applications • Performed functions are still useful • Large user population • Big money invested • Rewriting is expensive • Rewriting is risky • Changed over a long period of time • Modified by different people • Historical code • Dead code • Old concepts • Major bugs already fixed

  3. What Legacy Applications Need • Provide higher resolution • Run bigger data sets • Graphical representation of scientific data • Remain certified

  4. How to Meet the Requirements • Improve performance: parallel computing • Remain certified: change only the performance-critical parts • Better user interface: add a GUI

  5. System Configuration

  6. Distributed vs Shared Memory

  7. Message Passing Programming • Non-parallelizable parts • Data dependences force sequential execution • Not worth parallelizing • Workload distribution • Input data distribution • Distributed computation • Load balance • Results collection (see the MPI sketch below)
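
A minimal MPI sketch of the workload-distribution pattern listed above (illustration only, not the actual MOPAC code; the work() kernel and the block size n_per_node are hypothetical placeholders):

  #include <mpi.h>
  #include <stdlib.h>

  /* Illustration only, not MOPAC code: scatter input blocks, compute
   * locally, gather results.  work() and n_per_node are placeholders. */
  static double work(double x) { return x * x; }

  int main(int argc, char **argv) {
      int rank, nproc;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nproc);

      const int n_per_node = 1000;                 /* assumed block size */
      double *block = malloc(n_per_node * sizeof(double));
      double *all = NULL;
      if (rank == 0) {                             /* placeholder input data */
          all = malloc((size_t)n_per_node * nproc * sizeof(double));
          for (int i = 0; i < n_per_node * nproc; i++) all[i] = (double)i;
      }

      /* Input data distribution: one block per node */
      MPI_Scatter(all, n_per_node, MPI_DOUBLE,
                  block, n_per_node, MPI_DOUBLE, 0, MPI_COMM_WORLD);

      /* Distributed computation on the local block */
      for (int i = 0; i < n_per_node; i++) block[i] = work(block[i]);

      /* Results collection on the root node */
      MPI_Gather(block, n_per_node, MPI_DOUBLE,
                 all, n_per_node, MPI_DOUBLE, 0, MPI_COMM_WORLD);

      free(block); free(all);
      MPI_Finalize();
      return 0;
  }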

  8. Non-Parallelizable Parts: Amdahl's Law
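
The formula on this slide does not survive the transcript; the standard statement of Amdahl's Law, with f the non-parallelizable fraction of the work and p the number of processors, is

  S(p) = \frac{1}{f + \frac{1-f}{p}}, \qquad \lim_{p \to \infty} S(p) = \frac{1}{f}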

  9. MOPAC • Semi-empirical molecular orbital package • MNDO, MINDO/3, AM1, PM3 • MOPAC 3 submitted to QCPE in 1985 • MOPAC 6 ported to many platforms • VMS • UNIX (our work is based on this version) • DOS/Windows • MOPAC 7 is current

  10. MOPAC input file
  L1 : UHF PULAY MINDO3 VECTORS DENSITY LOCAL T=300   (keywords)
  L2 : EXAMPLE OF DATA FOR MOPAC                       (title)
  L3 : MINDO/3 UHF CLOSED-SHELL D2D ETHYLENE           (comments)
  L41: C
  L42: C 1.400118 1
  L43: H 1.098326 1 123.572063 1
  L44: H 1.098326 1 123.572063 1 180.000000 0 2 1 3
  L45: H 1.098326 1 123.572063 1  90.000000 0 1 2 3
  L46: H 1.098326 1 123.572063 1 270.000000 0 1 2 3
       (molecule structure in Z-matrix / internal coordinates)
  L5 : blank line (end of data)

  11. Hartree-Fock Self-Consistent Field • Schrödinger equation • Matrix equation form (3.1.3) • Matrix representation of the Fock matrix (3.1.4)
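
The equations referenced here are lost in the transcript. In standard Roothaan notation they correspond to forms like the following; the exact notation of the thesis equations (3.1.3) and (3.1.4) may differ:

  \hat{H}\Psi = E\Psi
  \mathbf{F}\,\mathbf{C} = \mathbf{S}\,\mathbf{C}\,\boldsymbol{\varepsilon} \quad \text{(matrix equation form)}
  F_{\mu\nu} = H^{\mathrm{core}}_{\mu\nu} + \sum_{\lambda\sigma} P_{\lambda\sigma}\left[(\mu\nu|\lambda\sigma) - \tfrac{1}{2}(\mu\lambda|\nu\sigma)\right] \quad \text{(Fock matrix elements)}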

  12. HF-SCF Procedure S1: Calculate molecular integrals, O(n^4) S2: Guess initial eigenvector C S3: Use C to compute F, O(n^4) S4: Transform F to an orthogonal basis, O(n^3), and diagonalize F to get a new C, O(n^3) S5: Stop if C has converged S6: Guess a new C and go to S3

  13. MOPAC computation • Ab initio HF-SCF • evaluates all integrals rigorously • high accuracy • requires high computing power • limits molecule size • Semi-empirical HF-SCF • uses the same procedure • reduces computational complexity • supports larger molecules

  14. Semi-empirical SCF • Ignore some integrals • Replace integrals with experimental results • Assume the AO basis is orthogonal • S1, S3: O(n^4) => O(n^2) • S4: orthogonalization not needed • New bottleneck: diagonalization, complexity O(n^3)

  15. Parallelization Procedure • Sequential analysis • Time profiling analysis • Program flow analysis • Computational complexity analysis • Parallel analysis • Data dependence resolution • Loop parallelization • Integration • Communication between modules

  16. Sequential Analysis • Time profiling analysis • Picks out the computationally intensive parts • Usually uses smaller input data • Program flow analysis • Verifies the chosen parts are commonly used • No domain expert required • Computational complexity analysis • Workload distribution changes significantly with data size

  17. MOPAC Sequential Analysis Assume the complexity of the rest of the program is O(n^2)

  18. Loop Parallelization • Scalar forward substitution: remove temporary variables • Induction variable substitution: resolve dependences • Loop interchange/merge: enlarge granularity, reduce synchronization • Scalar expansion: resolve data dependence on scalars (see the sketch below) • Variable copying: resolve data dependence on arrays
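
A small before/after example of scalar expansion, one of the transformations listed above (a hypothetical loop, not taken from MOPAC):

  #include <stdlib.h>

  /* Before: the scalar t is written and read in every iteration, so the
   * iterations share storage and cannot be distributed safely. */
  void before(const double *a, const double *b, double *c, int n) {
      double t;
      for (int i = 0; i < n; i++) {
          t = a[i] + b[i];
          c[i] = t * t;
      }
  }

  /* After scalar expansion: t becomes a per-iteration array element; the
   * storage-reuse dependence disappears and the loop can be split across
   * nodes. */
  void after(const double *a, const double *b, double *c, int n) {
      double *t = malloc(n * sizeof(double));
      for (int i = 0; i < n; i++) {
          t[i] = a[i] + b[i];
          c[i] = t[i] * t[i];
      }
      free(t);
  }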

  19. MOPAC Parallelization: DENSIT • Function: compute the density matrix • Two single-level loops inside a two-level loop • Triangular computational space • Merge the outer two-level loop into one loop with range [1..n(n+1)/2] (see the sketch below) • Low computation-to-communication ratio when n is small, so it benefits from low-latency communication
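
A hypothetical sketch of the triangular loop merge described above (not the actual DENSIT code; the per-element work is left as a placeholder):

  #include <math.h>

  /* Collapse the triangular loop over pairs (i, j) with j <= i < n into
   * one loop over k = 0 .. n(n+1)/2 - 1, then give each node a
   * contiguous block of k values. */
  void triangular_block(int n, int rank, int nproc) {
      long total = (long)n * (n + 1) / 2;
      long chunk = (total + nproc - 1) / nproc;
      long lo = (long)rank * chunk;
      long hi = lo + chunk < total ? lo + chunk : total;
      for (long k = lo; k < hi; k++) {
          /* recover (i, j) with j <= i from the linear index k */
          long i = (long)((sqrt(8.0 * (double)k + 1.0) - 1.0) / 2.0);
          while (i * (i + 1) / 2 > k) i--;          /* guard rounding errors */
          while ((i + 1) * (i + 2) / 2 <= k) i++;
          long j = k - i * (i + 1) / 2;
          (void)i; (void)j;   /* per-element density-matrix work would go here */
      }
  }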

  20. MOPAC Parallelization: DIAG • P1: Generate the Fock molecular orbital (FMO) matrix • Higher computation-to-communication ratio • Find the global maximum TINY from the local maxima (see the reduction sketch below) • Matrix FMO must be redistributed for Part 2 • P2: 2x2 rotations to eliminate significant off-diagonal elements • The "if" structure causes load imbalance • The innermost loop must be interchanged outward • Some calculations run on all nodes to save communication
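
A generic illustration of the "global maximum from local maxima" step; whether the original code used MPI for this module is not stated, and the local value here is an arbitrary placeholder:

  #include <mpi.h>
  #include <stdio.h>

  /* Each node computes a local maximum; MPI_Allreduce with MPI_MAX makes
   * the global maximum available on every node. */
  int main(int argc, char **argv) {
      int rank;
      double local_tiny, global_tiny;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      local_tiny = 1.0 + rank;                     /* placeholder local maximum */
      MPI_Allreduce(&local_tiny, &global_tiny, 1,
                    MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
      printf("rank %d: global TINY = %f\n", rank, global_tiny);
      MPI_Finalize();
      return 0;
  }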

  21. MOPAC Parallelization: HQRII • Function: standard eigensolver • R. J. Allen survey • Replaced with the PNNL PeIGS pdspevx() function • Uses the MPI communication library • Exchanges data in small chunks; performs well when n/p > 8 • Implemented in C, which packs the matrix differently (row-major; see the layout sketch below)
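
A generic illustration of the row-major vs column-major mismatch mentioned above: an n-by-n Fortran array stored column-major is re-indexed into the row-major layout a C routine expects. The exact packed storage format expected by PeIGS pdspevx() is not reproduced here.

  #include <stddef.h>

  /* Copy a column-major (Fortran-layout) matrix into row-major (C-layout)
   * storage; element (i, j) moves from fa[j*n + i] to ca[i*n + j]. */
  void colmajor_to_rowmajor(const double *fa, double *ca, int n) {
      for (int i = 0; i < n; i++)          /* row index */
          for (int j = 0; j < n; j++)      /* column index */
              ca[(size_t)i * n + j] = fa[(size_t)j * n + i];
  }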

  22. Integration

  23. Communication between Modules • Parallel - sequential: • Uses TCP/IP • Automatically upgrades to shared memory when possible • Sequential - user interface: • Input and output files • Application/Advanced Visualization System (AVS) remote-module communication • User interface - display: • AVS

  24. MOPAC Control Panel & Module

  25. MOPAC GUI

  26. Data Files and Platform • Platforms: • SGI Power Challenge • IBM SP2

  27. DENSIT Speed-up

  28. DENSIT Speed-up Power Challenge SP2

  29. DIAG Speed-up

  30. DIAG Speed-up Power Challenge SP2

  31. HQRII Speed-up

  32. HQRII Speed-up Power Challenge SP2

  33. Overall Speed-up Power Challenge SP2 Projected, assuming the sequential part is O(n^2)

  34. Overall Speedup Assume the non-parallelizable part is O(1) and O(n^2)
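
One way to read this projection (the slide's own formula is not preserved): with a parallelizable part T_p(n) and a non-parallelizable part T_s(n), the projected speedup on p nodes is

  S(n, p) = \frac{T_s(n) + T_p(n)}{T_s(n) + \frac{T_p(n)}{p}}

If T_p(n) = O(n^3) (the diagonalization) while T_s(n) is only O(1) or O(n^2), the effective serial fraction f(n) = T_s(n) / (T_s(n) + T_p(n)) shrinks as n grows, so the projected overall speedup approaches p for large molecules.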

  35. Related work: IBM • Application: conformational search • Focus: throughput

  36. Related work: SDSC • Focus: performance • Parallelized: • Evaluation of electron repulsion integrals • Calculation of first and second derivatives • Solution of the eigensystem • Platform: 64-node iPSC/860 • Results: • Geometry optimization: speedup = 5.2 • Vibration analysis: speedup = 40.8

  37. Achievements • Parallelize legacy applications from a computer science perspective • Keep the code validated • Performance analysis procedures • Predict performance on large data • Optimize parallel code • Improve performance • Improve the user interface

  38. Future Work • Shared memory model • Web based user interface • Dynamic node allocation • Parallelization of subroutines with lower computational complexity
