Simplifying Parallel Programming with Compiler Transformations



  1. Simplifying Parallel Programming with Compiler Transformations
  Matt Frank, University of Illinois
  mif@illinois.edu

  2. What I’m ranting about
  • Transformations that alleviate tedium
  • Analogous to code generation, register allocation, and instruction scheduling
    • (Not really “optimizations”)
  • Mainly:
    • Loop distribution, reassociation, “scalar” expansion, inspector-executor, hashing
  • These cover much more of parallel-language expressivity than you might think

  3. Assumptions
  • Cache-coherent shared-memory many-cores
    • (I’m not addressing distributed-memory issues)
  • Synchronization is somewhat expensive
    • Don’t use barriers gratuitously (but don’t avoid them at all costs)
  • Analysis is not my problem
    • Programmer annotates
  • Non-determinism is outside the realm of this talk
    • No race detection in this talk either

  4. Compiler Flow
  Front-end: type systems and whole-program analysis
    • New information: type systems (e.g. DPJ), domain-specific objects, run-time feedback
  Program Dependence Graph (PDG) based compiler
    • Program analysis (information about high-level program invariants) enables more efficient coherence, checkpointing, q.o.s.
  Runtime/Execution platform (feeds back to the compiler)
    • New capabilities: checkpointing, q.o.s. guarantees

  5. I’m leaving out locality
  Front-end: type systems and whole-program analysis
  Parallelism-exposing transformations (this talk)
  Locality transformations: tiling, etc. (not covered here)
  Runtime/Execution platform

  6. What’s enabled?
  • Loops that contain arbitrary control flow
    • Including early exits, arbitrary function calls, etc.
  • Arbitrary iterators (even sequential ones)
    • Can’t depend on the main body of the computation, though
  • Arbitrary combinations of data-parallel work, scans, and reductions
    • Can use “partial sums” inside the loop
  • Buffered printf

  7. The transformations
  • Scalar expansion
    • Eliminates anti- and output dependences
    • Can be applied to properly scoped aggregates
  • Reassociation
    • Integer reassociation is extraordinarily useful
    • Can use partial sums later in the loop!
  • Loop distribution
    • Think of it as scheduling
  • Inspector-executor
    • As long as the data access pattern is invariant in the loop

  8. You’ve heard of map-reduce

  Before:
    doall i (1..n)
      private j = f(X[i])
      total = total + j

  After scalar expansion:
    shared j[n]
    doall i (1..n)
      j[i] = f(X[i])
    do i (1..n)
      total = total + j[i]
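  In C with OpenMP this transformation looks roughly like the sketch below. It is a hedged illustration, not the compiler's actual output; f, X, n, and the function name are placeholders:

    #include <stdlib.h>

    /* Scalar expansion applied to a reduction: the private j becomes a
     * shared array j_exp[n], so the map is a true doall and only the
     * O(n) sum stays sequential. */
    double reduce_after_expansion(const double *X, int n, double (*f)(double)) {
        double *j_exp = malloc(n * sizeof *j_exp);
        #pragma omp parallel for        /* the doall: no loop-carried deps */
        for (int i = 0; i < n; i++)
            j_exp[i] = f(X[i]);
        double total = 0.0;
        for (int i = 0; i < n; i++)     /* the sequential reduce */
            total += j_exp[i];
        free(j_exp);
        return total;
    }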

  9. How ‘bout scan-map?

  Before (the pointer chase makes this doall illegal as written):
    struct { data; *next; } *p;
    doall p != NULL
      modify(p->data)
      p = p->next

  After splitting into a sequential scan and a parallel map:
    n = 0
    do p != NULL
      a[n++] = p
      p = p->next
    doall i (0..n)
      modify(a[i]->data)
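  A hedged C/OpenMP rendering of the split; node_t, modify, and the assumption that a[] is large enough are inventions of this sketch:

    typedef struct node { double data; struct node *next; } node_t;

    /* Sequential scan collects node pointers; the parallel map then
     * visits them by index.  Assumes a[] can hold the whole list and
     * modify() touches only its own node. */
    void scan_then_map(node_t *head, node_t **a, void (*modify)(node_t *)) {
        int n = 0;
        for (node_t *p = head; p != NULL; p = p->next)
            a[n++] = p;                 /* the sequential scan */
        #pragma omp parallel for        /* the doall over collected nodes */
        for (int i = 0; i < n; i++)
            modify(a[i]);
    }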

  10. Sparse matrix construction

    scan int ptr = 0
    shared float data[m]
    shared int rows[n]
    doall row (1..n)
      private j
      rows[row] = ptr
      for j in non_zeros(row)
        data[ptr] = foo(row, j)
        ptr++

  (diagram: data and rows arrays filled in row order; ptr tracks the next free slot in data)

  11. Partial Sum Expansion

  Before:
    scan int ptr = 0
    shared float data[m]
    shared int rows[n]
    doall row (1..n)
      private j
      rows[row] = ptr
      for j in non_zeros(row)
        data[ptr] = foo(row, j)
        ptr++

  After expanding the partial sum:
    scan int ptr[n]            # scalar expand ptr
    shared float data[m]
    shared int rows[n]
    doall row (1..n)
      private j
      ptr[row] = 0
      rows[row] = rows[row-1] + ptr[row-1]
      for j in non_zeros(row)
        data[rows[row] + ptr[row]] = foo(row, j)
        ptr[row]++

  12. Scalar Expansion (and inner loop fission)

  Before:
    scan int ptr[n]
    shared float data[m]
    shared int rows[n]
    doall row (1..n)
      private j
      ptr[row] = 0
      rows[row] = rows[row-1] + ptr[row-1]
      for j in non_zeros(row)
        data[rows[row] + ptr[row]] = foo(row, j)
        ptr[row]++

  After expanding data into a private buffer and fissioning the inner loop:
    scan int ptr[n]
    shared float data[m]
    shared int rows[n]
    doall row (1..n)
      private j
      private vector mydata
      ptr[row] = 0
      rows[row] = rows[row-1] + ptr[row-1]
      for j in non_zeros(row)
        mydata.pushback(foo(row, j))
        ptr[row]++
      for j (rows[row], rows[row]+ptr[row])
        data[j] = mydata.popfront()

  13. Outer Loop Fission

  Before:
    scan int ptr[n]
    shared float data[m]
    shared int rows[n]
    doall row (1..n)
      private j
      private vector mydata
      ptr[row] = 0
      rows[row] = rows[row-1] + ptr[row-1]
      for j in non_zeros(row)
        mydata.pushback(foo(row, j))
        ptr[row]++
      for j (rows[row], rows[row]+ptr[row])
        data[j] = mydata.popfront()

  After fissioning the sequential scan out of the doall:
    scan int ptr[n]
    shared float data[m]
    shared int rows[n]
    doall row (1..n)
      private j
      private vector mydata
      ptr[row] = 0
      for j in non_zeros(row)
        mydata.pushback(foo(row, j))
        ptr[row]++
    do row (1..n)
      rows[row] = rows[row-1] + ptr[row-1]
    doall row (1..n)
      for j (rows[row], rows[row]+ptr[row])
        data[j] = mydata.popfront()
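  For reference, a C/OpenMP sketch of the fully fissioned pattern: count per row in parallel, prefix-sum sequentially, then fill in parallel at known offsets. nnz_in_row and foo are hypothetical placeholders, and this variant recomputes foo in the fill phase instead of draining the private mydata buffers, purely to keep the sketch self-contained:

    #include <stdlib.h>

    /* Count / prefix-sum / fill: the fissioned form of the construction
     * above.  rows[] must have n+1 entries; nnz_in_row() and foo() are
     * stand-ins for whatever the real program computes. */
    void build_sparse(int n, int *rows, double *data,
                      int (*nnz_in_row)(int), double (*foo)(int, int)) {
        int *cnt = malloc(n * sizeof *cnt);
        #pragma omp parallel for        /* doall: count entries per row */
        for (int row = 0; row < n; row++)
            cnt[row] = nnz_in_row(row);
        rows[0] = 0;                    /* sequential scan (the old ptr) */
        for (int row = 0; row < n; row++)
            rows[row + 1] = rows[row] + cnt[row];
        #pragma omp parallel for        /* doall: fill at known offsets */
        for (int row = 0; row < n; row++)
            for (int j = 0; j < cnt[row]; j++)
                data[rows[row] + j] = foo(row, j);
        free(cnt);
    }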

  14. Concatenation

  (diagram: per-row data and rows segments produced in parallel are concatenated into the shared arrays; only the short scan over ptr is sequential)

  15. printf() is the same pattern

    doall i (1..n)
      private mystring = s(i)
      printf(mystring)

  (diagram: each iteration writes into a private mystrings buffer; the buffers are concatenated into stdout in iteration order)
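  A minimal C/OpenMP sketch of buffered printf, assuming each iteration emits one bounded-length line; LINE_MAX_LEN and the format string are assumptions of the sketch:

    #include <stdio.h>
    #include <stdlib.h>

    #define LINE_MAX_LEN 128    /* assumed bound on one line of output */

    /* Buffered printf: format into per-iteration private slots in
     * parallel, then flush sequentially so the output order matches the
     * sequential program's. */
    void buffered_print(int n) {
        char (*buf)[LINE_MAX_LEN] = malloc(n * sizeof *buf);
        #pragma omp parallel for        /* doall: private formatting */
        for (int i = 0; i < n; i++)
            snprintf(buf[i], LINE_MAX_LEN, "iteration %d\n", i);
        for (int i = 0; i < n; i++)     /* sequential, ordered flush */
            fputs(buf[i], stdout);
        free(buf);
    }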

  16. Sparse array updates

    doall i (1..n)
      private j
      for j in neighbors_of(i)
        private temp = foo(i, j)
        x[i] += temp
        x[j] += temp

  17. Becomes

    doall i (1..n)
      private j
      for j in neighbors_of(i)
        private temp = foo(i, j)
        continue[hash(i)][myproc].push(i, temp)
        continue[hash(j)][myproc].push(j, temp)
    doall p (1..P)
      for t (1..P)
        for (ptr, val) in continue[p][t]
          x[ptr] += val

  (diagram: the P×P continuation matrix, indexed by hash bucket and producing processor)
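  The sketch below is one hedged way to realize the continuation matrix in C/OpenMP. cont[p][t] holds updates destined for bucket p and produced by thread t, so phase 1 has exclusive writers per bin and phase 2 has exclusive owners per bucket. upd_t, bin_t, push(), and the CSR-style neighbor lists are all inventions of this sketch, and hash(i) is simply i % P:

    #include <stdlib.h>
    #include <omp.h>

    typedef struct { int idx; double val; } upd_t;
    typedef struct { upd_t *v; int len, cap; } bin_t;   /* zero-init by caller */

    static void push(bin_t *b, int idx, double val) {
        if (b->len == b->cap) {         /* grow the bin on demand */
            b->cap = b->cap ? 2 * b->cap : 16;
            b->v = realloc(b->v, b->cap * sizeof(upd_t));
        }
        b->v[b->len++] = (upd_t){ idx, val };
    }

    /* Phase 1: push (index, value) pairs into cont[hash][myproc].
     * Phase 2: thread p drains every bin for bucket p, so no two
     * threads ever update the same x[.] concurrently. */
    void scatter_updates(double *x, int n, int P, bin_t **cont,
                         const int *nbr_ptr, const int *nbr,
                         double (*foo)(int, int)) {
        #pragma omp parallel num_threads(P)
        {
            int t = omp_get_thread_num();
            #pragma omp for             /* implicit barrier at loop end */
            for (int i = 0; i < n; i++)
                for (int k = nbr_ptr[i]; k < nbr_ptr[i + 1]; k++) {
                    int j = nbr[k];
                    double temp = foo(i, j);
                    push(&cont[i % P][t], i, temp);   /* hash(i) = i % P */
                    push(&cont[j % P][t], j, temp);
                }
            for (int s = 0; s < P; s++) /* drain own bucket's row */
                for (int k = 0; k < cont[t][s].len; k++)
                    x[cont[t][s].v[k].idx] += cont[t][s].v[k].val;
        }
    }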

  18. Graph updates

    doall i (1..n)
      newvalue = value[i]
      for pred in predecessors[i]
        newvalue = f(newvalue, value[pred])
      value[i] = newvalue

  19. Inspector-Executor
  Polychronopoulos ’88, Saltz ’91, Leung/Zahorjan ’93

    int wavefront[n] = {0}
    do i (1..n)
      wavefront[i] = max(wavefront[i's predecessors]) + 1
    do w (1..maxdepth)
      doall i suchthat wavefront[i] == w
        newvalue = value[i]
        for pred in predecessors[i]
          newvalue = f(newvalue, value[pred])
        value[i] = newvalue
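  A hedged C/OpenMP sketch of the inspector/executor pair, under the slide's condition that the dependence structure is loop-invariant; the CSR-style predecessor lists (pred_ptr, pred), assumed topologically ordered, and the function names are assumptions of the sketch:

    /* Inspector: the wavefront depth of node i is one more than the
     * deepest of its predecessors (source nodes get depth 1).  Assumes
     * every predecessor of i appears before i. */
    int inspect(int n, const int *pred_ptr, const int *pred, int *wf) {
        int maxdepth = 0;
        for (int i = 0; i < n; i++) {
            wf[i] = 1;
            for (int k = pred_ptr[i]; k < pred_ptr[i + 1]; k++)
                if (wf[pred[k]] + 1 > wf[i])
                    wf[i] = wf[pred[k]] + 1;
            if (wf[i] > maxdepth)
                maxdepth = wf[i];
        }
        return maxdepth;
    }

    /* Executor: nodes in the same wavefront are independent, so each
     * depth level runs as a doall. */
    void execute(int n, int maxdepth, const int *wf, const int *pred_ptr,
                 const int *pred, double *value, double (*f)(double, double)) {
        for (int w = 1; w <= maxdepth; w++) {
            #pragma omp parallel for
            for (int i = 0; i < n; i++) {
                if (wf[i] != w) continue;
                double nv = value[i];
                for (int k = pred_ptr[i]; k < pred_ptr[i + 1]; k++)
                    nv = f(nv, value[pred[k]]);
                value[i] = nv;
            }
        }
    }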

  20. Limits of what we know

    doall node in worklist
      modify graph structure

  21. What I’ve shown you
  • Scalar expansion
    • Eliminates anti- and output dependences
    • Can be applied to properly scoped aggregates
  • Reassociation
    • Integer reassociation is extraordinarily useful
    • Can use partial sums later in the loop!
  • Loop distribution
    • Think of it as scheduling
  • Inspector-executor
    • As long as the data access pattern is invariant in the loop

  22. Where next?
  • Relieve tedium
    • (build the compiler, or frameworks, or …)
  • Find new patterns
    • Delaunay triangulation
    • Pick an example application: there will be something new you wish could be transformed automatically
  • Parallel languages beyond “doall” and “reduce”
