
Presentation Transcript


1. Agenda
• Project discussion
• Modeling Critical Sections in Amdahl's Law and its Implications for Multicore Design, S. Eyerman, L. Eeckhout, ISCA'10 [pdf]
• Benchmarking guidelines
• Regular vs. irregular parallel applications

2. Last time: Amdahl's law

Speedup = 1 / ((1 - F) + F / N)

Under what assumptions?
• Code is infinitely parallelizable
• No parallelization overheads
• No synchronization

[Figure: a bar split into a serial part (1 - F) and a parallel part (F)]
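A quick sanity check on this formula, as a minimal Python sketch (the function name and the sample values are mine, not from the slides):

    def amdahl_speedup(f, n):
        """Amdahl's law: f = parallelizable fraction, n = number of cores."""
        return 1.0 / ((1.0 - f) + f / n)

    # Even 99%-parallel code is capped far below 256x on 256 cores:
    print(amdahl_speedup(0.99, 256))  # ~72x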

3. Q: How to design a multicore for maximum speedup, assuming a budget of multiple BCEs (base core equivalents)?
Assumed Perf(R) = √R for a core built from R BCEs.
Two cases: symmetric vs. asymmetric multicore chips.
Area allocations (symmetric, 16 BCEs):
• Sixteen 1-BCE cores
• Four 4-BCE cores
• One 16-BCE core
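For the symmetric case, the speedup formula below is the one from Hill and Marty's "Amdahl's Law in the Multicore Era" (IEEE Computer, 2008), which these slides build on; a minimal sketch, with the f = 0.9 sample value being my own choice:

    import math

    def perf(r):
        return math.sqrt(r)  # assumed performance of an r-BCE core (from the slide)

    def symmetric_speedup(f, n, r):
        """n BCEs total, n/r cores of r BCEs each, parallel fraction f."""
        return 1.0 / ((1.0 - f) / perf(r) + f * r / (perf(r) * n))

    # The three 16-BCE symmetric designs listed above, for f = 0.9:
    for r in (1, 4, 16):
        print(f"{16 // r} core(s) of {r} BCE(s): {symmetric_speedup(0.9, 16, r):.2f}x")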

4. For asymmetric multicore chips:
Serial fraction 1 - F: one core at rate Perf(R), so serial time = (1 - F) / Perf(R).
Parallel fraction F: one core at rate Perf(R) plus N - R cores at rate 1, so parallel time = F / (Perf(R) + N - R).
Therefore, w.r.t. one base core:

Asymmetric Speedup = 1 / ((1 - F) / Perf(R) + F / (Perf(R) + N - R))
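The same kind of sketch for the asymmetric formula just derived (the sweep values are mine):

    import math

    def perf(r):
        return math.sqrt(r)  # assumed performance model from the slides

    def asymmetric_speedup(f, n, r):
        """One r-BCE big core plus (n - r) 1-BCE base cores; parallel fraction f."""
        serial = (1.0 - f) / perf(r)
        parallel = f / (perf(r) + n - r)
        return 1.0 / (serial + parallel)

    # Sweep the big-core size for n = 256 BCEs, f = 0.975:
    for r in (1, 16, 64, 256):
        print(f"r = {r:3d}: {asymmetric_speedup(0.975, 256, r):.1f}x")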

5. [Figure: asymmetric speedup for 256 BCEs; designs labeled by total core count: 256, 253, 241, 193, and 1 core(s), i.e., one big core of 1, 4, 16, 64, or 256 BCEs plus 1-BCE base cores.]

6. Amdahl assumptions:
• Code is infinitely parallelizable
• No parallelization overheads
• No synchronization
Now add synchronization: critical sections, entered at random times (?!).
Without critical sections: f_seq + f_par = 1
With critical sections: f_seq + f_par,ncs + f_par,cs = 1

7. f_seq + f_par,ncs + f_par,cs = 1
The model uses the average time spent in critical sections; the paper also derives an estimate based on the maximum time in critical sections.

8. Normalized execution time on N cores:

T(N) = f_seq + f_par,cs · P_cs · P_ctn + f_par,cs · (1 - P_cs · P_ctn) / N + f_par,ncs / N

The four terms: sequential code, contended critical sections (serialized), uncontended critical sections, and non-critical parallel code.
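Read as a speedup model, my sketch of the slide's four terms (variable names follow the paper's notation; the sample fractions are mine):

    def cs_speedup(f_seq, f_par_ncs, f_par_cs, p_cs, p_ctn, n):
        """Speedup on n cores with critical sections, per the slide's four terms.

        f_seq + f_par_ncs + f_par_cs must equal 1; p_cs * p_ctn is the
        probability that critical-section work is contended and serializes.
        """
        assert abs(f_seq + f_par_ncs + f_par_cs - 1.0) < 1e-9
        time = (f_seq                                   # sequential code
                + f_par_cs * p_cs * p_ctn               # contended critical sections
                + f_par_cs * (1 - p_cs * p_ctn) / n     # uncontended critical sections
                + f_par_ncs / n)                        # non-critical parallel code
        return 1.0 / time

    # With no contention this degenerates to plain Amdahl's law;
    # modest contention cuts the speedup sharply:
    print(cs_speedup(0.01, 0.89, 0.10, p_cs=0.5, p_ctn=0.0, n=256))  # ~72x
    print(cs_speedup(0.01, 0.89, 0.10, p_cs=0.5, p_ctn=0.5, n=256))  # ~26x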

9. Speedup for an asymmetric processor as a function of the big core size (b) and the small core size (s), for different contention rates, assuming 256 BCEs. Fraction of time spent in sequential code: 1%.

10. Design space exploration across symmetric, asymmetric, and ACS multicore processors, varying the fraction of time spent in critical sections and their contention rates. Fraction of time spent in sequential code: 1%. (ACS = Accelerated Critical Sections)

11. Agenda
• Project discussion
• Modeling Critical Sections in Amdahl's Law and its Implications for Multicore Design, S. Eyerman, L. Eeckhout, ISCA'10 [pdf]
• 12 ways to fool the masses
• Regular vs. irregular parallel applications

12. "If you were plowing a field, which would you rather use? Two oxen, or 1024 chickens?" (Attributed to S. Cray)

13. David H. Bailey, "Twelve Ways to Fool the Masses When Giving Performance Results on Parallel Computers", Supercomputing Review, August 1991:
1. Quote only 32-bit performance results, not 64-bit results.
2. Present performance figures for an inner kernel, and then represent these figures as the performance of the entire application.
3. Quietly employ assembly code and other low-level language constructs.
4. Scale up the problem size with the number of processors, but omit any mention of this fact.
5. Quote performance results projected to a full system.
6. Compare your results against scalar, unoptimized code on Crays.
7. When direct run time comparisons are required, compare with an old code on an obsolete system.
8. If MFLOPS rates must be quoted, base the operation count on the parallel implementation, not on the best sequential implementation.
9. Quote performance in terms of processor utilization, parallel speedups or MFLOPS per dollar.
10. Mutilate the algorithm used in the parallel implementation to match the architecture.
11. Measure parallel run times on a dedicated system, but measure conventional run times in a busy environment.
12. If all else fails, show pretty pictures and animated videos, and don't talk about performance.

14. Roadmap
• Project discussion
• Modeling Critical Sections in Amdahl's Law and its Implications for Multicore Design, S. Eyerman, L. Eeckhout, ISCA'10 [pdf]
• 12 ways to fool the masses
• Regular vs. irregular parallel applications

15. Definitions
Regular applications:
• key data structures are vectors and dense matrices
• simple access patterns (e.g., array indices are affine functions of for-loop indices)
• examples: MMM, Cholesky and LU factorizations, stencil codes, FFT, ...
Irregular applications:
• key data structures are lists, priority queues, trees, DAGs, graphs; usually implemented using pointers or references
• complex access patterns
• examples: see next slide

16. Regular application example: stencil computation
E.g., a finite-difference method for solving PDEs:
• discrete representation of the domain: a grid
• values at interior points are updated using values at neighbors; values at boundary points are fixed
Data structure: dense arrays.
Parallelism:
• values at the next time step can be computed simultaneously
• parallelism does not depend on runtime values
• the compiler can find the parallelism: the spatial loops are DO-ALL loops

// Jacobi iteration with 5-point stencil
// initialize array A
for time = 1, nsteps:
    for <i,j> in [2,n-1] x [2,n-1]:
        temp(i,j) = 0.25 * (A(i-1,j) + A(i+1,j) + A(i,j-1) + A(i,j+1))
    for <i,j> in [2,n-1] x [2,n-1]:
        A(i,j) = temp(i,j)

[Figure: grids A (time t_n) and temp (time t_n+1), Jacobi iteration with a 5-point stencil]
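A runnable counterpart of the pseudocode above, in Python/NumPy (the grid size, step count, and boundary values are my own toy choices):

    import numpy as np

    def jacobi(a, nsteps):
        """Jacobi iteration, 5-point stencil; boundary rows/columns stay fixed."""
        a = a.copy()
        for _ in range(nsteps):
            # all interior points are computed from the old grid, so every
            # update within a step is independent: a DO-ALL loop
            temp = 0.25 * (a[:-2, 1:-1] + a[2:, 1:-1] + a[1:-1, :-2] + a[1:-1, 2:])
            a[1:-1, 1:-1] = temp
        return a

    # Toy problem: zero interior, hot top boundary.
    grid = np.zeros((8, 8))
    grid[0, :] = 1.0
    print(jacobi(grid, 100).round(3))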

17. Irregular application example: Delaunay mesh refinement
Iterative refinement to remove badly shaped triangles:

while there are bad triangles do {
    pick a bad triangle;
    find its cavity;
    retriangulate the cavity;  // may create new bad triangles
}

Don't-care non-determinism:
• the final mesh depends on the order in which bad triangles are processed
• applications do not care which mesh is produced
Data structure: a graph in which nodes represent triangles and edges represent triangle adjacencies.
Parallelism:
• bad triangles whose cavities do not overlap can be processed in parallel
• parallelism depends on runtime values
• compilers cannot find this parallelism
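The geometry is beyond a slide, but the worklist pattern itself is small. A toy sketch (the "badness" test and split rule below are stand-ins for real triangle-quality checks and cavity retriangulation, not actual mesh operations) showing that the processing order is a don't-care:

    import random

    def is_bad(x):
        return x > 3  # stand-in for a triangle-quality test

    def refine(x):
        # splitting a bad element may create new (possibly bad) elements,
        # just as retriangulating a cavity can create new bad triangles
        return [x // 2, x - x // 2]

    def worklist_refinement(elements):
        work = list(elements)
        good = []
        while work:
            x = work.pop(random.randrange(len(work)))  # don't-care order
            if is_bad(x):
                work.extend(refine(x))
            else:
                good.append(x)
        return sorted(good)

    # Any processing order yields a valid result; elements whose "cavities"
    # are independent (here: all of them) could be refined in parallel.
    print(worklist_refinement([10, 7, 2]))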
