PL/B: Programming for locality and large scale parallelism




Presentation Transcript


  1. PL/B: Programming for locality and large scale parallelism George Almási Luiz A. DeRose José E. Moreira David A. Padua

  2. Overview • Concepts • Examples • Thoughts about implementation • Conclusions

  3. PL/B at a glance • A programming system for distributed-memory machines; focus: numerical computing • Convenient to use: flat learning curve, short development cycle, easy debugging and maintenance • Not too difficult to implement: no “heroic programming” for the compiler • Language extension of a general nature; 1st implementation using MATLAB™ • Programming model: single thread of execution; explicit data layout and distribution; recursive tiling; data distribution primitives • Implementation: master-slave model

  4. Things we didn’t want to do • Not another programming language! • Avoid SPMD: it is difficult to reason about (the global view of communication and computation is not explicit in the SPMD model; “4D spaghetti code”), and MPI is cumbersome (no compiler support; the “assembly language of parallel computing”) • Avoid complex compilers (HPF) • Avoid OpenMP: the wrong abstraction for distributed-memory machines; it could be implemented on top of TreadMarks™-like systems, but it is hard to get efficiency, it requires compiler support, and it is untested and experimental

  5. The Convergence of PL/B and MATLAB™ • Technical simplicity: no compiler work needed for a prototype • Popularity: programmers of parallel machines are familiar with the MATLAB environment • Government interest: parallel MATLAB is part of the PERCS project • Evaluation strategy: IBM’s BlueGene/L is an ideal testbed for scalability • Novelty: MATLAB™ is an excellent language for prototyping conventional algorithms; there is nothing equivalent for parallel algorithms

  6. Hierarchically Tiled Arrays • Constructing HTAs: bottom-up (imposing an HTA shape onto a flat array; always homogeneous) or top-down (structure first, contents later; maybe non-homogeneous) • Matlab™ notation: similar to cell arrays { }; n-dimensional tiled arrays with d-dimensional tiles, d ≤ n; tiling is recursive • Homogeneity of HTAs: adjacent tiles are “compatible” along dimensions of adjacency; not all tiles have to have the same shape • Tiles can be distributed across modules of a parallel system; distribution is always block cyclic
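
  As a sketch of the two construction styles (the exact constructor forms are assumptions, modeled on the hta{...}{...}(...) notation of the next slide):

  % Top-down: declare the tile structure first, fill in contents later
  A = hta{1:2}{1:4,1:3}(1:3);   % 2 top-level tiles, each a 4x3 HTA of length-3 leaf tiles
  A{1}{2,3}(:) = 7;             % tiles can be filled one at a time,
                                % and need not all end up with the same shape

  % Bottom-up: impose an HTA shape onto an existing flat array
  F = rand(8, 6);
  B = hta(F, {[1 5], [1 4]});   % assumed helper form: cut F at rows 1,5 and
                                % columns 1,4 -> a homogeneous 2x2 HTA of 4x3 tiles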

  7. Creating and Accessing HTAs

  A = hta{1:2}{1:4,1:3}(1:3)   % constructor
  x1 = A{2}{4,3}(3)
  x2 = A{:}{2:4,3}(1:2)
  x3 = A{1}(1:4,1:6)
  x4 = A(2,9:11)               % “flattened” access

  8. Distributing HTAs across processors • 3x3 mesh of processors, 15x12 array • Blocked: HTA shape {1:3,1:3}(1:5,1:4) • Block-cyclic in the 2nd dimension: HTA shape {1:3,1:6}(1:5,1:2)
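
  In the notation of slide 7, the two layouts on this slide could be declared as below (a sketch; how the mesh is bound to the HTA is not shown on the slide and is assumed here to follow the default block-cyclic rule):

  % 15x12 array on a 3x3 processor mesh
  A = hta{1:3,1:3}(1:5,1:4);   % blocked: 3x3 tiles of 5x4 elements, one tile per processor
  B = hta{1:3,1:6}(1:5,1:2);   % block-cyclic in dim 2: 3x6 tiles of 5x2 elements,
                               % tile columns 1..6 dealt cyclically to the 3 processor columns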

  9. Summary: Parallel Communication and Computation in PL/B • PL/B programs are single-threaded and contain array operations on HTAs • The host running PL/B is a front end for a distributed machine; processors are arranged in hierarchical meshes • The top levels of HTAs are distributed onto a subset of the existing nodes • Computation statements: all HTA indices refer to the same (local) physical processor; in particular, when all HTA indices are identical, computations are guaranteed to be local • Communication: all other statements; some functions and operators encode both communication and computation, typically MPI-like collective operations
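
  A small sketch of this rule (the HTAs a, b, c are assumed to be distributed identically over an n x n mesh):

  % Computation: identical tile indices on both sides, so every tile is
  % combined with its co-located counterpart -- guaranteed local, no messages
  c{1:n,1:n} = a{1:n,1:n} + b{1:n,1:n};

  % Communication: tile indices differ, so the assignment moves data
  a{1,1} = b{2,1};                            % tile sent from the owner of b{2,1}
  a{:,:} = cshift(a{:,:}, dim=2, shift=1);    % collective circular shift (cf. slide 14)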

  10. Overview • Concepts • Examples • Thoughts about 1st implementation • Conclusions

  11. Tiled Matrix Multiplication

  for I = 1:q:n
    for J = 1:q:n
      for K = 1:q:n
        for i = I:I+q-1
          for j = J:J+q-1
            for k = K:K+q-1
              c(i,j) = c(i,j) + a(i,k)*b(k,j);
            end
          end
        end
      end
    end
  end

  12. Tiled Matrix Multiplication (PL/B)

  c{i,j}, a{i,k}, b{k,j}, and T represent HTA tiles (submatrices). The * operator represents matrix multiplication on HTA tiles.

  for i = 1:m
    for j = 1:m
      T = 0;
      for k = 1:m
        T = T + a{i,k}*b{k,j};
      end
      c{i,j} = T;
    end
  end
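
  A minimal usage sketch for the loop above, assuming n x n matrices split into an m x m grid of q x q tiles; the constructor and the flattened assignment follow slide 7 and are assumptions here:

  n = 12; m = 3; q = n/m;       % n x n matrices, m x m grid of q x q tiles
  a = hta{1:m,1:m}(1:q,1:q);    % tiled operands
  b = hta{1:m,1:m}(1:q,1:q);
  c = hta{1:m,1:m}(1:q,1:q);
  a(:,:) = rand(n);             % fill through "flattened" access
  b(:,:) = rand(n);
  % ... run the i/j/k loop above; c(:,:) is then the flat n x n product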

  13. Cannon’s Algorithm (parallel, tiled matrix multiplication)

  14. Cannon’s Algorithm written down in PL/B

  function [c] = cannon(a,b)
    % a, b are assumed to be distributed on an n*n grid.
    % create an n*n distributed hta for matrix c.
    c{1:n,1:n} = zeros(p,p);                         % communication
    % “parallelogram shift”: rows of a, columns of b
    for i = 2:n
      a{i:n,:} = cshift(a{i:n,:}, dim=2, shift=1);   % communication
      b{:,i:n} = cshift(b{:,i:n}, dim=1, shift=1);   % communication
    end
    % main loop: parallel multiplications, column shift a, row shift b
    for k = 1:n
      c{:,:} = c{:,:} + a{:,:}*b{:,:};               % computation
      a{:,:} = cshift(a{:,:}, dim=2, shift=1);       % communication
      b{:,:} = cshift(b{:,:}, dim=1, shift=1);       % communication
    end
  end
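
  A hypothetical call site for cannon, assuming an n x n mesh with one p x p tile per processor (constructor as on slide 7):

  a = hta{1:n,1:n}(1:p,1:p);   % distributed operands, one tile per processor
  b = hta{1:n,1:n}(1:p,1:p);
  a(:,:) = rand(n*p);          % fill through flattened access
  b(:,:) = rand(n*p);
  c = cannon(a, b);            % c{i,j} is the (i,j) tile of a*b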

  15. Sparse Parallel Matrix-Vector Multiply with vector copy [Figure: the sparse matrix A is distributed by row blocks across processors P1–P4; the vector b is copied to every processor; each processor computes its block of A × b.]

  16. Sparse Parallel MVM with vector copy

  % Distribute a by blocks of rows
  forall i = 1:n
    c{i} = a(DIST(i):DIST(i+1)-1, :);
  end
  % Broadcast vector b
  v{1:n} = b;
  % Local multiply (sparse)
  t{:} = c{:} * v{:};
  % Everybody gets a copy of the result
  forall i = 1:n
    v{i} = t(:);   % flattened t
  end

  Important observation: in MATLAB, sparse computations can be represented as dense computations; the interpreter only performs the necessary operations.
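
  DIST is not defined on this slide; a minimal sketch, assuming it holds n+1 row boundaries for an (approximately) even block-row split of a:

  m = size(a, 1);                        % number of rows of the sparse matrix a
  DIST = round(linspace(1, m+1, n+1));   % DIST(i) = first row owned by tile i
  % c{i} = a(DIST(i):DIST(i+1)-1, :) then gives tile i the rows DIST(i)..DIST(i+1)-1,
  % so the n tiles together cover all m rows exactly once.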

  17. Overview • Concepts • Examples • Thoughts about implementation • Conclusions

  18. Implementation • A “Distributed Array Virtual Machine” (DAVM) implemented on the backend nodes • Multiple types of memory (local, shared, co-arrays, etc.) • Similar to UPC and OpenMP runtimes • DAVM instruction set (bytecode?) • A MATLAB™-based frontend • The MATLAB interpreter runs the show • HTA code can be compiled into DAVM code and distributed to the backend • A MATLAB “toolbox” contains the new data types • Possible changes to MATLAB syntax: as few as we can get away with • forall

  19. Implementation: MATLAB™-based frontend (the Matlab @hta directory) • constructors: hta, tile • operators: *, /, \ • indexing: subsref, subsasgn • collectives: sum, spread, cshift

  20. Anticipating questions • Q: Is PL/B a toy language? A: It is as expressive as SPMD, and it subsumes a large part of MPI: a{1} = b{2} is a message sent from rank 2 to rank 1; x = sum(a{:}) corresponds to MPI_Reduce; x{:} = a corresponds to MPI_Bcast. Many important algorithms can be formulated “better”. • Q: Is PL/B still Matlab™ or a new beast? A: PL/B defines a new data type and operators. MATLAB is a polymorphic language: the new data type is compatible with (a drop-in replacement for) existing data types, and new data types bring new functionality. Think “toolbox” – Matlab users are familiar with the concept. • Porting code to PL/B: changes are going to be fairly localized, and the code will keep working during the transition.

  21. More questions • Q: Debugging and profiling PL/B? A: Debugging PL/B should not be different from debugging a regular MATLAB program. • Q: Performance? A: PL/B has a better chance of scaling than a regular MPI program: most communication primitives are high-level and are going to be optimized. Writing low-level communication code in PL/B is possible, but not a natural thing to do. Implementation is easy for most primitives (on top of MPI).

  22. Conclusion • New and exciting paradigm: • HTA arrays and operators express communication and computation. • Single-threaded code • Master-slave execution model • Anticipate scalability • Minimal to no compiler work needed • About to embark on 1st implementation • Runtime (Distributed Array Virtual Machine) • Interpreted front-end using (unchanged) Matlab™
