PL/B: Programming for locality and large scale parallelism




Presentation Transcript


  1. PL/B: Programming for locality and large scale parallelism George Almási Luiz A. DeRose José E. Moreira David A. Padua

  2. Overview • Concepts • Examples • Thoughts about implementation • Conclusions

  3. PL/B at a glance • A programming system for distributed-memory machines; focus: numerical computing • Convenient to use: flat learning curve, short development cycle, easy debugging and maintenance • Not too difficult to implement: no “heroic programming” for the compiler • Language extension of a general nature; 1st implementation using MATLAB™ • Programming model: single thread of execution; explicit data layout and distribution; recursive tiling; data distribution primitives • Implementation: master-slave model

  4. Things we didn’t want to do • Not another programming language! • Avoid SPMD: it is difficult to reason about (the global view of communication and computation is not explicit in the SPMD model; “4D spaghetti code”), and MPI is cumbersome (no compiler support; the “assembly language of parallel computing”) • Avoid complex compilers (HPF) • Avoid OpenMP: the wrong abstraction for distributed-memory machines; it could be implemented on top of TreadMarks™-like systems, but it is hard to get efficiency, it requires compiler support, and it is untested and experimental

  5. The Convergence of PL/B and MATLAB™ • Technical simplicity: no compiler work needed for a prototype • Popularity: programmers of parallel machines are familiar with the MATLAB environment • Government interest: parallel MATLAB is part of the PERCS project • Evaluation strategy: IBM’s BlueGene/L is an ideal testbed for scalability • Novelty: MATLAB™ is an excellent language for prototyping conventional algorithms; there is nothing equivalent for parallel algorithms

  6. Hierarchically Tiled Arrays • Constructing HTAs: bottom-up (imposing an HTA shape onto a flat array; always homogeneous) or top-down (structure first, contents later; maybe non-homogeneous) • Matlab™ notation: similar to cell arrays { }; n-dimensional tiled arrays with d-dimensional tiles, d ≤ n; tiling is recursive • Homogeneity of HTAs: adjacent tiles are “compatible” along dimensions of adjacency; not all tiles have to have the same shape • Tiles can be distributed across modules of a parallel system; distribution is always block cyclic
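
  As a sketch of the two construction styles (the exact constructor forms are assumptions, modeled on the hta{...}{...}(...) notation of the next slide):

  % Top-down: declare the tile structure first, fill in contents later
  A = hta{1:2}{1:4,1:3}(1:3);   % 2 top-level tiles, each a 4x3 HTA of length-3 leaf tiles
  A{1}{2,3}(:) = 7;             % tiles can be filled one at a time,
                                % and need not all end up with the same shape

  % Bottom-up: impose an HTA shape onto an existing flat array
  F = rand(8, 6);
  B = hta(F, {[1 5], [1 4]});   % assumed helper form: cut F at rows 1,5 and
                                % columns 1,4 -> a homogeneous 2x2 HTA of 4x3 tiles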

  7. Creating and Accessing HTAs

  A = hta{1:2}{1:4,1:3}(1:3)   % constructor
  x1 = A{2}{4,3}(3)
  x2 = A{:}{2:4,3}(1:2)
  x3 = A{1}(1:4,1:6)
  x4 = A(2,9:11)               % “flattened” access

  8. Distributing HTAs across processors • 3x3 mesh of processors, 15x12 array • Blocked: HTA shape {1:3,1:3}(1:5,1:4) • Block-cyclic in the 2nd dimension: HTA shape {1:3,1:6}(1:5,1:2)
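
  In the notation of slide 7, the two layouts on this slide could be declared as below (a sketch; how the mesh is bound to the HTA is not shown on the slide and is assumed here to follow the default block-cyclic rule):

  % 15x12 array on a 3x3 processor mesh
  A = hta{1:3,1:3}(1:5,1:4);   % blocked: 3x3 tiles of 5x4 elements, one tile per processor
  B = hta{1:3,1:6}(1:5,1:2);   % block-cyclic in dim 2: 3x6 tiles of 5x2 elements,
                               % tile columns 1..6 dealt cyclically to the 3 processor columns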

  9. Summary: Parallel Communication and Computation in PL/B • PL/B programs are single-threaded and contain array operations on HTAs • The host running PL/B is a front end for a distributed machine; processors are arranged in hierarchical meshes • The top levels of HTAs are distributed onto a subset of the existing nodes • Computation statements: all HTA indices refer to the same (local) physical processor; in particular, when all HTA indices are identical, computations are guaranteed to be local • Communication: all other statements; some functions and operators encode both communication and computation, typically MPI-like collective operations
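
  A small sketch of this rule (the HTAs a, b, c are assumed to be distributed identically over an n x n mesh):

  % Computation: identical tile indices on both sides, so every tile is
  % combined with its co-located counterpart -- guaranteed local, no messages
  c{1:n,1:n} = a{1:n,1:n} + b{1:n,1:n};

  % Communication: tile indices differ, so the assignment moves data
  a{1,1} = b{2,1};                            % tile sent from the owner of b{2,1}
  a{:,:} = cshift(a{:,:}, dim=2, shift=1);    % collective circular shift (cf. slide 14)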

  10. Overview • Concepts • Examples • Thoughts about 1st implementation • Conclusions

  11. Tiled Matrix Multiplication

  for I = 1:q:n
    for J = 1:q:n
      for K = 1:q:n
        for i = I:I+q-1
          for j = J:J+q-1
            for k = K:K+q-1
              c(i,j) = c(i,j) + a(i,k)*b(k,j);
            end
          end
        end
      end
    end
  end

  12. Tiled Matrix Multiplication (PL/B)

  c{i,j}, a{i,k}, b{k,j}, and T represent HTA tiles (submatrices). The * operator represents matrix multiplication on HTA tiles.

  for i = 1:m
    for j = 1:m
      T = 0;
      for k = 1:m
        T = T + a{i,k}*b{k,j};
      end
      c{i,j} = T;
    end
  end
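
  A minimal usage sketch for the loop above, assuming n x n matrices split into an m x m grid of q x q tiles; the constructor and the flattened assignment follow slide 7 and are assumptions here:

  n = 12; m = 3; q = n/m;       % n x n matrices, m x m grid of q x q tiles
  a = hta{1:m,1:m}(1:q,1:q);    % tiled operands
  b = hta{1:m,1:m}(1:q,1:q);
  c = hta{1:m,1:m}(1:q,1:q);
  a(:,:) = rand(n);             % fill through "flattened" access
  b(:,:) = rand(n);
  % ... run the i/j/k loop above; c(:,:) is then the flat n x n product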

  13. Cannon’s Algorithm (parallel, tiled matrix multiplication)

  14. Cannon’s Algorithm written down in PL/B

  function [c] = cannon(a,b)
    % a, b are assumed to be distributed on an n*n grid.
    % create an n*n distributed hta for matrix c.
    c{1:n,1:n} = zeros(p,p);                         % communication
    % “parallelogram shift”: rows of a, columns of b
    for i = 2:n
      a{i:n,:} = cshift(a{i:n,:}, dim=2, shift=1);   % communication
      b{:,i:n} = cshift(b{:,i:n}, dim=1, shift=1);   % communication
    end
    % main loop: parallel multiplications, column shift a, row shift b
    for k = 1:n
      c{:,:} = c{:,:} + a{:,:}*b{:,:};               % computation
      a{:,:} = cshift(a{:,:}, dim=2, shift=1);       % communication
      b{:,:} = cshift(b{:,:}, dim=1, shift=1);       % communication
    end
  end
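
  A hypothetical call site for cannon, assuming an n x n mesh with one p x p tile per processor (constructor as on slide 7):

  a = hta{1:n,1:n}(1:p,1:p);   % distributed operands, one tile per processor
  b = hta{1:n,1:n}(1:p,1:p);
  a(:,:) = rand(n*p);          % fill through flattened access
  b(:,:) = rand(n*p);
  c = cannon(a, b);            % c{i,j} is the (i,j) tile of a*b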

  15. Sparse Parallel Matrix-Vector Multiply with vector copy [Figure: the sparse matrix A is distributed by row blocks across processors P1–P4; the vector b is copied to every processor; each processor computes its block of A × b.]

  16. Sparse Parallel MVM with vector copy

  % Distribute a by blocks of rows
  forall i = 1:n
    c{i} = a(DIST(i):DIST(i+1)-1, :);
  end
  % Broadcast vector b
  v{1:n} = b;
  % Local multiply (sparse)
  t{:} = c{:} * v{:};
  % Everybody gets a copy of the result
  forall i = 1:n
    v{i} = t(:);   % flattened t
  end

  Important observation: in MATLAB, sparse computations can be represented as dense computations; the interpreter only performs the necessary operations.
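
  DIST is not defined on this slide; a minimal sketch, assuming it holds n+1 row boundaries for an (approximately) even block-row split of a:

  m = size(a, 1);                        % number of rows of the sparse matrix a
  DIST = round(linspace(1, m+1, n+1));   % DIST(i) = first row owned by tile i
  % c{i} = a(DIST(i):DIST(i+1)-1, :) then gives tile i the rows DIST(i)..DIST(i+1)-1,
  % so the n tiles together cover all m rows exactly once.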

  17. Overview • Concepts • Examples • Thoughts about implementation • Conclusions

  18. Implementation • A “Distributed Array Virtual Machine” (DAVM) implemented on the backend nodes • Multiple types of memory (local, shared, co-arrays, etc.) • Similar to UPC and OpenMP runtimes • DAVM instruction set (bytecode?) • A MATLAB™-based frontend • The MATLAB interpreter runs the show • HTA code can be compiled into DAVM code and distributed to the backend • A MATLAB “toolbox” contains the new data types • Possible changes to MATLAB syntax: as few as we can get away with • forall

  19. Implementation: MATLAB™-based frontend (the Matlab @hta directory) • constructors: hta, tile • operators: *, /, \ • indexing: subsref, subsasgn • collectives: sum, spread, cshift

  20. Anticipating questions • Q: Is PL/B a toy language? A: It is as expressive as SPMD, and it subsumes a large part of MPI: a{1} = b{2} is a message sent from rank 2 to rank 1; x = sum(a{:}) corresponds to MPI_Reduce; x{:} = a corresponds to MPI_Bcast. Many important algorithms can be formulated “better”. • Q: Is PL/B still Matlab™ or a new beast? A: PL/B defines a new data type and operators. MATLAB is a polymorphic language: the new data type is compatible with (a drop-in replacement for) existing data types, and new data types bring new functionality. Think “toolbox” – Matlab users are familiar with the concept. • Porting code to PL/B: changes are going to be fairly localized, and the code will keep working during the transition.

  21. More questions • Q: Debugging and profiling PL/B? A: Debugging PL/B should not be different from debugging a regular MATLAB program. • Q: Performance? A: PL/B has a better chance of scaling than a regular MPI program: most communication primitives are high-level and are going to be optimized. Writing low-level communication code in PL/B is possible, but not a natural thing to do. Implementation is easy for most primitives (on top of MPI).

  22. Conclusion • New and exciting paradigm: • HTA arrays and operators express communication and computation. • Single-threaded code • Master-slave execution model • Anticipate scalability • Minimal to no compiler work needed • About to embark on 1st implementation • Runtime (Distributed Array Virtual Machine) • Interpreted front-end using (unchanged) Matlab™
