Effective Automatic Parallelization of Stencil Computations *

Sriram Krishnamoorthy (1), Muthu Baskaran (1), Uday Bondhugula (1), Atanas Rountev (1), J. Ramanujam (2), P. Sadayappan (1). (1) The Ohio State University; (2) Louisiana State University. * Work supported by NSF.


Presentation Transcript


  1. Effective Automatic Parallelization of Stencil Computations* • Sriram Krishnamoorthy (1), Muthu Baskaran (1), Uday Bondhugula (1), Atanas Rountev (1), J. Ramanujam (2), P. Sadayappan (1) • (1) The Ohio State University, (2) Louisiana State University • * Work supported by NSF

  2. Introduction • Stencil computations • Sweep through large data set • Multiple time iterations • Simple load balanced schedule • Tiling – essential to improve data locality • Dependences between tiles • Pipelined execution • Skewed iteration spaces – load imbalance • Solution: Adjust tiling – re-enable concurrent execution

  3. Motivation
     FOR t = 0 TO T-1
       FOR i = 1 TO N-1
         A[t,i] = (A[t,i-1] + A[t,i] + A[t,i+1]) / 3
     (Figure: iteration space with axes t and i)
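The loop nest on this slide can be made runnable as a rough sketch. It assumes a Jacobi-style reading in which each time step reads only the values of the previous step, and it assumes the two boundary cells stay fixed; neither assumption is stated on the slide. The name `jacobi_1d` is hypothetical.

```python
def jacobi_1d(initial, steps):
    """Apply a 3-point averaging stencil `steps` times.

    Assumption: each step reads only the previous step's values, and the
    first and last cells are fixed boundary values.
    """
    current = list(initial)
    n = len(current)
    for _ in range(steps):
        nxt = current[:]  # boundary cells are carried over unchanged
        for i in range(1, n - 1):
            nxt[i] = (current[i - 1] + current[i] + current[i + 1]) / 3
        current = nxt
    return current


if __name__ == "__main__":
    print(jacobi_1d([0.0, 3.0, 0.0], 1))
```

Under this reading every point (t, i) depends on (t-1, i-1), (t-1, i) and (t-1, i+1), so all points of one time step are independent of each other.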

  4. Notation • Iteration space B: n-dim polyhedron • Dependences D: n-dim vectors • Hyperplanes H: n-dim normal vectors • Tile bounded by pairs of hyperplanes

  5. Approach • Concurrent start in non-tiled iteration space • Identify hyperplanes inhibiting concurrent start in tiled space • Replace one face for each inhibiting pair • Overlapped Tiling – Replace “back-face” • Split Tiling – Replace “front-face”

  6. Concurrent Start: Before Tiling • Condition: a boundary that does not carry any dependence

  7. Inter-tile Dependences • Shift vectors • Tile traversal order • Normal to all other hyperplanes • Hyperplane carries dependence • A dependence “pokes” through • Inter-tile dependence vector • Shift vector • Corresponding hyperplane carries dependence
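The notion of a hyperplane carrying a dependence can be illustrated with dot products. This is a minimal sketch that assumes the usual tiling convention: a hyperplane with normal h is a legal tiling hyperplane when h·d >= 0 for every dependence d, and it carries d (the dependence "pokes" through it) when h·d > 0. The dependence vectors below, written as (t, i) offsets, are an assumption based on a Jacobi-style 3-point stencil; the function names are hypothetical.

```python
def dot(h, d):
    """Dot product of a hyperplane normal and a dependence vector."""
    return sum(a * b for a, b in zip(h, d))


def is_legal_hyperplane(h, deps):
    """Assumed legality test: no dependence points backward across h."""
    return all(dot(h, d) >= 0 for d in deps)


def carried_dependences(h, deps):
    """Dependences that cross (are carried by) the hyperplane with normal h."""
    return [d for d in deps if dot(h, d) > 0]


# Assumed dependences of a 1-D, 3-point stencil, as (t, i) offsets.
DEPS = [(1, -1), (1, 0), (1, 1)]

if __name__ == "__main__":
    for normal in [(1, 0), (1, 1), (0, 1)]:
        print(normal, is_legal_hyperplane(normal, DEPS),
              carried_dependences(normal, DEPS))
```

With these vectors the normal (0, 1) is rejected because (1, -1) crosses it backward, which is why the space dimension has to be skewed before time tiling.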

  8. Concurrent Start Inhibition • Concurrent start in original iteration space along a boundary • But that boundary carries an inter-tile dependence • Annotated conditions: a boundary has concurrent start; S_j is an inter-tile dependence; that boundary carries the inter-tile dependence

  9. Companion Hyperplane • Hyperplane that destroys the inter-tile dependence • Swivel a hyperplane “backward” • Dependences carried by original hyperplane are “neutralized” • Incoming dependences become non-incoming • Outgoing dependences become non-outgoing

  10. Overlapped Tiling • Replace “back face” with companion hyperplane • Additional region is shared with preceding tile • Region of preceding tile that caused the dependence • Each new tile independent of preceding tile (“do-all” parallelism) • Increased computation cost and communication volume
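The idea of overlapped tiling can be sketched for a 1-D, 3-point stencil. This is an illustrative specialization, not the general hyperplane construction from the slides: it assumes a Jacobi-style update with fixed boundary cells, and each tile recomputes a halo of `tau` cells on either side so that it needs only values from the start of the time tile. All function names are hypothetical.

```python
def jacobi_reference(initial, steps):
    """Untiled reference: 3-point average, fixed boundary cells (assumed)."""
    current = list(initial)
    for _ in range(steps):
        nxt = current[:]
        for i in range(1, len(current) - 1):
            nxt[i] = (current[i - 1] + current[i] + current[i + 1]) / 3
        current = nxt
    return current


def overlapped_tile(base, lo, hi, tau):
    """Advance cells [lo, hi) by tau steps using only `base` values."""
    n = len(base)
    left, right = max(lo - tau, 0), min(hi + tau, n)  # halo, clamped
    local = base[left:right]
    for _ in range(tau):
        nxt = local[:]
        for j in range(1, len(local) - 1):
            if left + j in (0, n - 1):
                continue  # global boundary cell stays fixed
            nxt[j] = (local[j - 1] + local[j] + local[j + 1]) / 3
        # Cells at the halo edge go stale by one cell per step; after tau
        # steps the staleness has not yet reached [lo, hi).
        local = nxt
    return local[lo - left:hi - left]


def jacobi_overlapped(initial, steps, width, tau):
    """Run `steps` time steps in time tiles of height tau."""
    current = list(initial)
    n = len(current)
    done = 0
    while done < steps:
        h = min(tau, steps - done)
        # Every tile reads only `current`, so this loop is a do-all loop.
        tiles = [overlapped_tile(current, lo, min(lo + width, n), h)
                 for lo in range(0, n, width)]
        current = [x for tile in tiles for x in tile]
        done += h
    return current
```

The halo cells are computed by two neighbouring tiles, which corresponds to the extra computation cost listed on the slide.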

  11. Split Tiling • Replace “front face” with companion hyperplane • Tile split into independent and dependent regions • Execute independent region followed by dependent region • Increased #communications
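Split tiling can be sketched in the same 1-D setting. This is a simplified illustration under stated assumptions, not the general construction: a Jacobi-style 3-point stencil with fixed boundary cells, an array length divisible by the tile width, and a tile width of at least twice the time-tile height. Phase 1 computes the shrinking region of each tile that depends only on the tile's own data; phase 2 fills in the regions around each tile boundary that depend on two neighbouring tiles. All function names are hypothetical.

```python
def avg3(row, i):
    return (row[i - 1] + row[i] + row[i + 1]) / 3


def jacobi_reference(initial, steps):
    """Untiled reference: 3-point average, fixed boundary cells (assumed)."""
    current = list(initial)
    for _ in range(steps):
        nxt = current[:]
        for i in range(1, len(current) - 1):
            nxt[i] = avg3(current, i)
        current = nxt
    return current


def split_time_tile(base, width, h):
    """Advance all cells by h steps in two phases."""
    n = len(base)
    levels = [list(base)] + [[None] * n for _ in range(h)]
    for s in range(1, h + 1):
        levels[s][0], levels[s][n - 1] = base[0], base[n - 1]
    # Phase 1: independent regions; tiles do not read each other's results.
    for lo in range(0, n, width):
        hi = lo + width
        for s in range(1, h + 1):
            start = lo + s if lo > 0 else 1      # no shrink at array edge
            stop = hi - s if hi < n else n - 1
            for i in range(start, stop):
                levels[s][i] = avg3(levels[s - 1], i)
    # Phase 2: dependent regions around each interior tile boundary b.
    for b in range(width, n, width):
        for s in range(1, h + 1):
            for i in range(b - s, b + s):
                levels[s][i] = avg3(levels[s - 1], i)
    return levels[h]


def jacobi_split(initial, steps, width, tau):
    assert len(initial) % width == 0 and width >= 2 * tau
    current = list(initial)
    done = 0
    while done < steps:
        h = min(tau, steps - done)
        current = split_time_tile(current, width, h)
        done += h
    return current
```

No cell is computed twice here, but each time tile needs two rounds of data exchange, one per phase, which matches the increased number of communications noted on the slide.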

  12. Experimental Evaluation • Cluster • 2.8 GHz dual-processor Opteron 254 • 1MB L2 cache; 4GB RAM • Linux 2.6.9; Intel compiler (icc) -O3 • Comparison • Two pipelined schedules – along space and time • 1000 time steps • 1 – 32 processors

  13. Pipelined Execution: Parameters • 64000 elements; 32 processors • Space tile size: 1000 • Time tile size: 16
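As a quick arithmetic check, the tile counts implied by these parameters can be computed as follows. The sketch combines the 1000 time steps from the evaluation setup with the sizes on this slide, and it assumes an even block distribution of elements over processors, which the slides do not state. The function name is hypothetical.

```python
def tile_counts(elements, time_steps, space_tile, time_tile):
    """Number of tiles along space and time, rounding partial tiles up."""
    space_tiles = -(-elements // space_tile)   # ceiling division
    time_tiles = -(-time_steps // time_tile)
    return space_tiles, time_tiles


if __name__ == "__main__":
    space_tiles, time_tiles = tile_counts(64000, 1000, 1000, 16)
    per_processor = 64000 // 32  # assumed even block distribution
    print(space_tiles, time_tiles, per_processor)
```

Under these assumptions each processor owns 2000 elements, that is, two space tiles, and the last time tile is only partially filled.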

  14. Performance with Problem Size

  15. Weak Scaling • Problem size = #procs * 20000 • Horizontal line – Linear Scaling

  16. Conclusion • Time tiling stencils – crucial for data locality • Might inhibit concurrent execution • Presented: Two approaches to enabling concurrent execution • Ongoing work: Modeling relative benefits of the two approaches

  17. Thank You!
