

  1. Multilevel Parallelization Architecture of Boundary Element Solver Engines for Cloud Computing Dipanjan Gope, Don MacMillen, Devan Williams, Xiren Wang

  2. Outline • The need for parallelization • Challenges towards effective parallelization • A multilevel parallelization framework for BEM: A compute intensive application • Results and discussion

  3. The Need For Parallelization • The commodity computing environment has undergone a paradigm shift over the last seven years: power-hungry frequency scaling has been replaced by more efficient scaling in the number of computing cores • “No free lunch” for software applications

  4. Parallelization: The Easy Way • “Embarrassingly parallel” class of problems: independent test cases are distributed across cores, each running its own copy of the tool against shared memory (see the sketch below) • Suitable for applications with low memory requirement • What about memory intensive applications? Large SPICE runs, extraction
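
A minimal sketch of this test-case-level parallelism, assuming a hypothetical tool invoked as "tool.exe <testcase>"; the tool name and case files are placeholders, not from the original slides.

    // Embarrassingly parallel launcher (illustrative only): each test case is
    // an independent run of the tool, so cases can be dispatched to separate
    // cores with no inter-process communication.
    #include <cstdlib>
    #include <future>
    #include <string>
    #include <vector>

    int main() {
        std::vector<std::string> cases = {"case1.sp", "case2.sp", "case3.sp", "case4.sp"};
        std::vector<std::future<int>> jobs;
        for (const auto& c : cases) {
            // Launch one independent process per test case ("tool.exe" is a placeholder).
            jobs.push_back(std::async(std::launch::async, [c] {
                return std::system(("tool.exe " + c).c_str());
            }));
        }
        int failures = 0;
        for (auto& j : jobs) failures += (j.get() != 0);
        return failures;
    }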

  5. Parallelization Paradigms • CPU – GPU (FPGA) • Multithreading • Distribution: MPI

  6. Benefits of Effective Parallelization • [Figure comparing an algorithm in which 100% can be efficiently parallelized with an algorithm in which only 50% can be efficiently parallelized]

  7. Outline • The need for parallelization • Challenges towards effective parallelization • A multilevel parallelization framework for BEM: A compute intensive application • Results and discussion

  8. Amdahl’s Law • Scaling on N processors: Speedup = 1 / ((1 - P) + P / N) • P = fraction of the algorithm that can be parallelized; (1 - P) = serial content of the algorithm • Parallel performance of an algorithm is severely affected by any serial content
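
As a quick numerical illustration of the formula above (a sketch, not content from the original slides; N = 100 cores is assumed), the speedup and the implied cloud cost multiplier can be tabulated for the parallel fractions P discussed on the next slide:

    // Amdahl's-law speedup for a few parallel fractions P on N cores.
    #include <cstdio>

    double amdahl_speedup(double P, int N) {
        return 1.0 / ((1.0 - P) + P / N);
    }

    int main() {
        const int N = 100;  // number of cores (assumed for the example)
        for (double P : {0.9, 0.99, 0.999, 1.0}) {
            double s = amdahl_speedup(P, N);
            // Cloud cost relative to a serial run: N cores are rented for
            // 1/s of the serial wall time, i.e. cost multiplier = N / s.
            std::printf("P = %.3f  speedup = %6.2f  cost multiplier = %5.2f\n",
                        P, s, N / s);
        }
        return 0;
    }

For P = 0.99 the speedup saturates near 50x even on 100 cores, while perfect scaling (P = 1) gives 100x at no extra cost, which is the point made on the following slide.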

  9. Amdahl’s Law: Cloud Perspective • Speedup = 1 / ((1 - P) + P / N); cost multiplier (cost of the N-core run relative to a serial run) = N / Speedup • [Figure: speedup and cost-multiplier curves for P = 0.9, 0.99, 0.999 and the ideal case P = 1] • Power of perfect scaling: commodity computing cost does not increase even when solving 100x faster

  10. Outline • The need for parallelization • Challenges towards effective parallelization • A multilevel parallelization framework for BEM: A compute intensive application • Results and discussion

  11. Application: Large-Scale Problems • Signal integrity • Power integrity • EMI • Large-scale package-board analysis

  12. Pulse Boundary Element Field Solver • The surface is discretized into patches (basis functions) • The result is a dense Method of Moments (MoM) matrix (see the sketch below)
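
A minimal sketch of how such a dense system arises, assuming a hypothetical pairwise interaction kernel between basis functions; the actual surface-integral evaluation used by the solver is not shown.

    // Dense MoM-style matrix fill (illustrative sketch): every pair of basis
    // functions interacts, so the matrix has N*N entries and O(N^2) storage.
    #include <cstdlib>
    #include <vector>

    // Placeholder interaction between basis functions i and j (a real BEM
    // kernel would evaluate integrals over the two surface patches here).
    double kernel(int i, int j) {
        return 1.0 / (1.0 + std::abs(i - j));
    }

    std::vector<double> assemble_dense(int N) {
        std::vector<double> Z(static_cast<size_t>(N) * N);
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                Z[static_cast<size_t>(i) * N + j] = kernel(i, j);
        return Z;
    }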

  13. Fast Iterative Matrix Solution • N = number of basis functions (e.g., 150,000) • Large problems: N ~ 1 million

  14. FMM: Fast Solver Physics (1992) • Multi-level algorithm • Astronomy analogy: distant gravitational interactions are represented by an equivalent interaction shell • [Image: courtesy ParticleMan]

  15. FMM: Fast Solver Algorithm (1992) • Sequential algorithm: impedes effective parallelization • Up tree: Q2M at the finest level, then M2M aggregation up to the finest-1 level and beyond • Across tree: M2L translations at each level • Down tree: L2L disaggregation level by level, then L2P at the finest level (see the sketch below)
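
A minimal sketch of that sequential tree traversal, assuming a hypothetical Node type carrying the usual Q2M/M2M/M2L/L2L/L2P operators; it only illustrates the ordering constraint (each pass must finish before the next starts), not the actual translation math.

    // Sequential FMM sweep (illustrative skeleton). The three passes are
    // strictly ordered, which is the serial structure the slide refers to.
    #include <vector>

    struct Node {
        std::vector<Node*> children;
        std::vector<Node*> interaction_list;  // well-separated nodes on the same level
        bool is_leaf() const { return children.empty(); }
        void Q2M() {}                  // charges -> multipole (finest level)
        void M2M(const Node&) {}       // child multipole -> parent multipole
        void M2L(const Node&) {}       // source multipole -> local expansion
        void L2L(Node&) const {}       // parent local -> child local
        void L2P() {}                  // local expansion -> potentials (finest level)
    };

    void up_tree(Node& n) {            // post-order: children before parents
        for (Node* c : n.children) up_tree(*c);
        if (n.is_leaf()) n.Q2M();
        for (Node* c : n.children) n.M2M(*c);
    }

    void across_tree(Node& n) {        // M2L translations on every level
        for (Node* s : n.interaction_list) n.M2L(*s);
        for (Node* c : n.children) across_tree(*c);
    }

    void down_tree(Node& n) {          // pre-order: parents before children
        for (Node* c : n.children) { n.L2L(*c); down_tree(*c); }
        if (n.is_leaf()) n.L2P();
    }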

  16. SVD/QR Based Compression • A dense MoM sub-matrix Zsub (m x n) is factored by SVD/QR into Q (m x r) times R (r x n) • Saving: storage and matrix-vector cost drop from m·n to r·(m + n) when the rank r is small (see the sketch below)
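
A minimal sketch of such a low-rank compression using Eigen's SVD (Eigen and the truncation tolerance are assumptions for illustration, not necessarily what the solver uses):

    // Low-rank compression of a dense sub-block via truncated SVD (sketch).
    #include <Eigen/Dense>

    struct LowRankBlock {
        Eigen::MatrixXd Q;  // m x r
        Eigen::MatrixXd R;  // r x n
    };

    // Keep the smallest rank r such that discarded singular values are below
    // tol times the largest singular value.
    LowRankBlock compress(const Eigen::MatrixXd& Zsub, double tol) {
        Eigen::JacobiSVD<Eigen::MatrixXd> svd(Zsub, Eigen::ComputeThinU | Eigen::ComputeThinV);
        const auto& s = svd.singularValues();
        int r = 0;
        while (r < s.size() && s(r) > tol * s(0)) ++r;
        LowRankBlock b;
        b.Q = svd.matrixU().leftCols(r);                                        // m x r
        b.R = s.head(r).asDiagonal() * svd.matrixV().leftCols(r).transpose();   // r x n
        return b;
    }

    // Compressed matrix-vector product: y = Q * (R * x), cost r * (m + n).
    Eigen::VectorXd apply(const LowRankBlock& b, const Eigen::VectorXd& x) {
        return b.Q * (b.R * x);
    }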

  17. Predetermined Matrix Structure • The block structure of the compressed matrix is pre-determined • Sub-matrix ranks are pre-estimated to reasonable accuracy

  18. OpenMP: Matrix Setup and Solve • Matrix structure is predetermined given the meshed geometry • Static load balancing, corrected by dynamic scheduling (see the sketch below)
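
A minimal sketch of OpenMP-threaded block assembly with dynamic scheduling (the Block type, fill_block, and chunk size are hypothetical); because block ranks are pre-estimated, per-block work is roughly known, and dynamic scheduling corrects the residual imbalance.

    // OpenMP block-level matrix setup with dynamic scheduling (sketch).
    #include <omp.h>
    #include <cstddef>
    #include <vector>

    struct Block { int row0, col0, rows, cols; };   // pre-determined structure

    void fill_block(const Block&) { /* compute and compress one sub-matrix */ }

    void setup_matrix(const std::vector<Block>& blocks) {
        // Dynamic scheduling hands out blocks as threads finish, correcting
        // any imbalance left by a purely static partition of the block list.
        #pragma omp parallel for schedule(dynamic, 4)
        for (std::ptrdiff_t b = 0; b < static_cast<std::ptrdiff_t>(blocks.size()); ++b)
            fill_block(blocks[b]);
    }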

  19. MPI: Matrix Setup and Solve • All nodes read the geometry and mesh • The combined DRAM of the nodes is used to store the whole matrix, increasing capacity • Solve: vectors are broadcast for matrix-vector operations (see the sketch below) • [Diagram: MPI processes, each with an OpenMP master thread and worker threads, proceed through Start → Read Geometry, Mesh → Setup → Solve, with data communication between processes]
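
A minimal sketch of the distributed matrix-vector product implied above (block ownership, apply_owned_blocks, and the reduction pattern are assumptions): the iterate is broadcast from rank 0, each rank applies the blocks held in its local memory, and the partial results are summed.

    // Distributed matrix-vector product for the MPI layer (sketch).
    #include <mpi.h>
    #include <algorithm>
    #include <vector>

    // Each rank applies only the compressed blocks stored in its local DRAM,
    // accumulating its contribution into y (implementation omitted).
    void apply_owned_blocks(const std::vector<double>& x, std::vector<double>& y);

    void distributed_matvec(std::vector<double>& x, std::vector<double>& y) {
        const int n = static_cast<int>(x.size());
        // Broadcast the current iterate so every node can apply its blocks.
        MPI_Bcast(x.data(), n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        std::fill(y.begin(), y.end(), 0.0);
        apply_owned_blocks(x, y);
        // Sum the per-node partial results ("map-reduce" after every product).
        MPI_Allreduce(MPI_IN_PLACE, y.data(), n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    }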

  20. SGE: Parallel Frequency Distribution • Within a node, multiple cores use the shared memory to store and compute a single frequency's matrix • Different frequency matrices live in separate nodes' memory • This exposes an embarrassingly parallel level without wasting shared memory (see the sketch below)
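
A minimal sketch of how a per-frequency solver process could pick its frequency from an SGE array-job task ID (SGE does set SGE_TASK_ID for array jobs, but the frequency list and solver call here are placeholders, not the tool's actual interface):

    // Frequency-level distribution via an SGE array job (sketch): task t of
    // the array solves frequency t from a common frequency list.
    #include <cstdio>
    #include <cstdlib>
    #include <vector>

    void solve_one_frequency(double f_hz) {
        std::printf("solving frequency %.3e Hz\n", f_hz);  // placeholder for the full solve
    }

    int main() {
        std::vector<double> freqs_hz = {1e8, 2e8, 5e8, 1e9};  // placeholder sweep
        const char* id = std::getenv("SGE_TASK_ID");          // set by SGE for array jobs
        int task = id ? std::atoi(id) : 1;                    // SGE task IDs start at 1
        if (task < 1 || task > static_cast<int>(freqs_hz.size())) {
            std::fprintf(stderr, "task %d out of range\n", task);
            return 1;
        }
        solve_one_frequency(freqs_hz[task - 1]);
        return 0;
    }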

  21. Hybrid Parallel Framework • Several layers of parallelism, including embarrassingly parallel layers, are employed in a hybrid framework to work around Amdahl’s law and achieve high scalability • A global controller decides the number and type of compute instances to be used at each level • File R/W for the SGE-based parallel layers is performed in binary

  22. Outline • The need for parallelization • Challenges towards effective parallelization • A multilevel parallelization framework for BEM: A compute intensive application • Results and discussion

  23. Matrix Setup Analysis • Predetermined matrix structure and static-dynamic load balancing achieve close to ideal speedups • Lack of dynamic scheduling across nodes limits the MPI speedup

  24. Matrix Solve Analysis • For a large number of right-hand sides, SGE-based RHS distribution yields better scalability • The map-reduce step after every matrix-vector product limits parallel efficiency

  25. Performance on Full-Board Analysis

  26. Performance on Amazon EC2

  27. Performance on Amazon EC2 • A system LSI package and an early-design board, merged • Meshed with 20,000 basis functions • Broadband frequency simulation for EMI prediction

  28. EMI from Cell-Phone Package

  29. Summary • The future of compute- and memory-intensive algorithms is “parallel” • Revisit algorithms from the perspective of multi-processor efficiency as opposed to single-processor complexity • Even a small serial content can adversely affect the speedup of an algorithm • Employ multiple levels of parallelism to work around Amdahl’s law • The presented algorithm achieves around 6.5x speedup on a single 8-core machine and around 50x speedup in a cloud framework using 72 (18x4) cores
