The Coming Multicore Revolution: What Does it Mean for Programming?

Kathy Yelick

NERSC Director, LBNL

EECS Professor, U.C. Berkeley


Software Issues at Scale

  • Power concerns will dominate all others

    • Concurrency is the most significant knob we have: lower clock, increase parallelism

    • Power density and facility energy

  • Summary Issues for Software

    • 1 EF system: billion-way concurrency, O(1K) cores per chip

    • 1 PF system: millions of threads and O(1K) cores per chip

    • The memory capacity/core ratio may drop significantly

    • Faults will become more prevalent

    • Flops are cheap relative to data movement

    • “Core” is a retro term: “sea of functional units”

      • 1 thread per functional unit

      • Many functional units per thread (SIMD, vectors)

      • Many threads per functional unit (Niagara)


DRAM component density is only doubling every 3 years

Source: IBM

Sequoia Programming Models, 1 May 2008


Two Parallel Language Questions

  • What is the parallel control model?

  • What is the model for sharing/communication?

    (synchronization is implied by message passing, but not by shared memory)

Control models: data parallel (single thread of control), dynamic threads, single program multiple data (SPMD).

Sharing/communication models: shared memory (load/store) and message passing (send/receive).
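A hedged sketch of the two sharing models on one data exchange (buffer names and sizes are illustrative, not from the talk):

/* Message passing: explicit send/receive; synchronization is implied
 * by the matching receive. */
#include <mpi.h>

void exchange_message_passing(double *buf, int n, int rank) {
    if (rank == 0)
        MPI_Send(buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}

/* Shared memory: a plain store into memory another thread can load;
 * any needed synchronization (locks, barriers) must be added explicitly. */
void exchange_shared_memory(double *shared_buf, const double *local, int n) {
    for (int i = 0; i < n; i++)
        shared_buf[i] = local[i];
}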



What's Wrong with MPI Everywhere?

  • We can run 1 MPI process per core

    • This works now (for CMPs) and will work for a while

    • But 100x faster bandwidth and lower latency within a chip

  • How long will it continue working?

    • 4 - 8 cores? Probably. 128 - 1024 cores? Probably not.

    • Depends on bandwidth and memory scaling

  • What is the problem?

    • Latency: some copying required by semantics

    • Memory utilization: partitioning data into separate address spaces requires some replication

      • How big is your per-core subgrid? At 10×10×10, nearly half of the points are surface points, and the surrounding ghost cells are replicated (see the worked estimate after this list)

    • Memory bandwidth: extra state means extra bandwidth

    • Weak scaling: the success model of the "cluster era" will not work in the manycore era -- not enough memory per core

    • Heterogeneity: MPI per CUDA thread-block?

  • Advantage: no new application work; modest infrastructure work (a multicore-optimized MPI)
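A back-of-the-envelope check of the replication cost, assuming a 10×10×10 per-core subgrid with a one-cell halo (the halo depth is an illustrative assumption):

#include <stdio.h>

int main(void) {
    int n = 10;
    long owned     = (long)n * n * n;                   /* 1000 points */
    long interior  = (long)(n - 2) * (n - 2) * (n - 2); /*  512 points */
    long with_halo = (long)(n + 2) * (n + 2) * (n + 2); /* 1728 points */
    printf("surface fraction:     %.2f\n", (double)(owned - interior) / owned);  /* ~0.49 */
    printf("replication overhead: %.2f\n", (double)(with_halo - owned) / owned); /* ~0.73 */
    return 0;
}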


What about Mixed MPI and Threads?

  • Threads: OpenMP, PThreads,…

  • Problems

    • Will OpenMP performance scale with the number of cores / chip?

    • More investment in infrastructure than MPI, but can leverage existing technology

    • Do people want two programming models?

    • Doesn’t go far enough

      • Thread context is a large amount of state compared to vectors/streams

      • One op per instruction vs. 64 ops per instruction
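A minimal hybrid sketch, assuming one MPI process per node with OpenMP threads inside it; the kernel and loop count are illustrative:

#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv) {
    int provided, rank;
    /* Threaded MPI: only the master thread makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local_sum = 0.0, global_sum = 0.0;
    /* OpenMP covers parallelism within the node... */
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < 1000000; i++)
        local_sum += 1.0 / (1.0 + i);

    /* ...MPI covers parallelism across nodes. */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}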


PGAS Languages: Why Use 2 Programming Models when 1 Will Do?

  • Global address space: thread may directly read/write remote data

  • Partitioned: data is designated as local or global

[Figure: partitioned global address space across threads p0, p1, …, pn; each thread has private variables (l) plus its partition of the shared space (x, y, g), all directly addressable by every thread]

  • Remote put and get: never have to say “receive”

    • Remote function invocation? See HPCS languages

  • No less scalable than MPI! (see previous talks)

  • Permits sharing, whereas MPI rules it out!

  • One model rather than two, but if you insist on two:

    • Can call UPC from MPI and vice versa (tested and used)
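A minimal UPC sketch of the model (array size and names are illustrative); shared data is partitioned by affinity but directly readable and writable by every thread:

#include <upc_relaxed.h>

shared double x[100 * THREADS];   /* shared: spread cyclically across threads */
double l;                         /* private: one copy per thread             */

int main(void) {
    int i;
    /* Each thread updates the elements with affinity to it... */
    upc_forall (i = 0; i < 100 * THREADS; i++; &x[i])
        x[i] = MYTHREAD;

    upc_barrier;

    /* ...but may also read (or write) remote shared data directly: a one-sided "get". */
    l = x[(MYTHREAD + 1) % THREADS];
    return 0;
}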


Sharing and Communication Models: PGAS vs. MPI


  • A one-sided put/get message can be handled directly by a network interface with RDMA support

    • Avoid interrupting the CPU or storing data from CPU (preposts)

  • A two-sided message needs to be matched with a receive to identify the memory address at which to put the data

    • Offloaded to Network Interface in networks like Quadrics

    • Need to download match tables to interface (from host)

[Figure: a two-sided message carries a message id and data payload and must be matched by the host CPU; a one-sided put message carries the target memory address with the data payload, so the network interface can deposit it directly]

Joint work with Dan Bonachea


Performance Advantage of One-Sided Communication

  • The put/get operations in PGAS languages (remote read/write) are one-sided (no required interaction from remote proc)

  • This is faster for pure data transfers than two-sided send/receive
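For concreteness, a hedged sketch of a one-sided put in MPI-2 RMA terms (a stand-in for the UPC/GASNet put used in this work; the window and fence synchronization are illustrative assumptions):

#include <mpi.h>

/* Window created once on each rank's buffer, e.g.:
 *   MPI_Win_create(buf, n * sizeof(double), sizeof(double),
 *                  MPI_INFO_NULL, MPI_COMM_WORLD, &win);                  */
void one_sided_put(double *src, int n, int target, MPI_Win win) {
    MPI_Win_fence(0, win);                   /* open an access epoch       */
    MPI_Put(src, n, MPI_DOUBLE,              /* origin buffer              */
            target, 0, n, MPI_DOUBLE, win);  /* target rank, displacement  */
    MPI_Win_fence(0, win);                   /* complete; no receive posted */
}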


3D FFT on BlueGene/P


PGAS Languages for Multicore

  • PGAS languages are a good fit to shared memory machines, including multicore

    • Global address space implemented as reads/writes

    • May also be exploited for processors with an explicit local store rather than a cache, e.g., Cell, GPUs, …

  • Open question in architecture

    • Cache-coherent shared memory

    • Software-controlled local memory (or hybrid)

[Figure: PGAS mapped onto a multicore node: shared partitioned on-chip memory, private on-chip memory (l, m), and shared off-chip DRAM, with DMA transfers between levels]



To Virtualize or Not

  • The fundamental question facing parallel programming models is:

    What should be virtualized?

  • Hardware has finite resources

    • Processor count is finite

    • Register count is finite

    • Fast local memory (cache and DRAM) size is finite

    • Links in the network topology are generally fewer than n²

  • Does the programming model (language+libraries) expose this or hide it?

    • E.g., one thread per core, or many?

      • Many threads may have advantages for load balancing, fault tolerance and latency-hiding

      • But one thread is better for deep memory hierarchies

  • How to get the most out of your machine?


Avoid Global Synchronization

  • Bulk-synchronous programming has too much synchronization

  • Bad for performance

    • Linpack performance on multicore/SMP (left, Dongarra et al.) and distributed memory (right, UPC)

  • Also bad direction for fault tolerance

  • Dynamic load balancing: use it when the problem demands it, not gratuitously


Lessons Learned

  • One-sided communication is faster than 2-sided

    • All about overlap and overhead (FFT example)

  • Global address space can ease programming

    • Dynamic scheduling can tolerate latencies

    • More adaptive and easier to use (fewer knobs)

  • Principled scheduling that takes into account critical path, memory use, cache, etc.

  • Combination of dynamic load balancing with locality control has new challenges

    • Previous work solves load balancing (Cilk) or locality (MPI), but not both together


Autotuning: Extreme Performance Programming

  • Automatic performance tuning

    • Use machine time in place of human time for tuning

    • Search over possible implementations

    • Use performance models to restrict search space

  • Programmers should write programs to generate code, not the code itself

  • Autotuning finds a good-performance solution by heuristics or exhaustive search

    • Perl script generates many versions

    • Generate SIMD-optimized kernels

    • Autotuner analyzes/runs kernels

    • Uses search and heuristics

  • The PERI SciDAC project is incorporating some of these ideas into compilers and domain-specific code-generator libraries, e.g., OSKI for sparse matrices

    • Multicore optimizations not yet in OSKI release
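A minimal sketch of the generate-and-search loop described above, with two hand-written variants standing in for script-generated kernels; the kernel, variant set, and timer are illustrative only:

#include <stdio.h>
#include <time.h>

#define N (1 << 20)
static double a[N], b[N];

static void scale_v1(void) {                       /* baseline                 */
    for (int i = 0; i < N; i++) b[i] = 2.0 * a[i];
}
static void scale_v2(void) {                       /* 4-way unrolled candidate */
    for (int i = 0; i < N; i += 4) {
        b[i]   = 2.0 * a[i];   b[i+1] = 2.0 * a[i+1];
        b[i+2] = 2.0 * a[i+2]; b[i+3] = 2.0 * a[i+3];
    }
}

int main(void) {
    void (*variants[])(void) = { scale_v1, scale_v2 };
    int best = 0; double best_t = 1e30;
    for (int v = 0; v < 2; v++) {
        clock_t t0 = clock();
        for (int rep = 0; rep < 100; rep++) variants[v]();
        double t = (double)(clock() - t0) / CLOCKS_PER_SEC;
        if (t < best_t) { best_t = t; best = v; }  /* keep the fastest variant */
    }
    printf("best variant: %d (%.3f s)\n", best, best_t);
    return 0;
}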


Stencil Code Overview

void stencil3d(double A[], double B[], int nx, int ny, int nz) {
  /* S0, S1: center and neighbor weights, assumed defined elsewhere */
  for (int k = 1; k < nz - 1; k++) {
    for (int j = 1; j < ny - 1; j++) {
      for (int i = 1; i < nx - 1; i++) {
        int c = (k * ny + j) * nx + i;                 /* center index */
        B[c] = S0 * A[c] +
               S1 * (A[c - nx * ny] + A[c + nx * ny]   /* front/back   */
                   + A[c - nx]      + A[c + nx]        /* bottom/top   */
                   + A[c - 1]       + A[c + 1]);       /* left/right   */
      }
    }
  }
}

  • 3D, 7-point Jacobi iteration on a 256³ grid

  • Flop:byte ratio: 0.33 (write allocate), 0.5 (ideal)
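A quick arithmetic check of those ratios, assuming double precision (8 bytes per value) and counting 2 multiplies plus 6 adds per point:

#include <stdio.h>

int main(void) {
    double flops       = 8.0;        /* 2 multiplies + 6 adds per point           */
    double bytes_wa    = 8 + 8 + 8;  /* read A, write-allocate read of B, write B */
    double bytes_ideal = 8 + 8;      /* read A, write B (cache bypass)            */
    printf("write allocate: %.2f   ideal: %.2f\n",
           flops / bytes_wa, flops / bytes_ideal);     /* 0.33 and 0.50 */
    return 0;
}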


Naive Stencil Performance

[Figure: naive stencil performance on Intel Xeon (Clovertown), AMD Opteron (Barcelona), and Sun Niagara2 (Victoria Falls)]

  • For this simple code, all cache-based platforms show poor efficiency and scalability

  • Could lead the programmer to believe a resource limit is being approached


Fully-Tuned Performance

[Figure: fully-tuned stencil performance on Intel Xeon (Clovertown), AMD Opteron (Barcelona), and Sun Niagara2 (Victoria Falls), with speedups over the naive code of roughly 1.9x, 5.4x, and 12.5x respectively]

Optimizations include:

  • NUMA-Aware

  • Padding

  • Unroll/Reordering

  • Thread/Cache Blocking

  • Prefetching

  • SIMDization

  • Cache Bypass

  • Different optimizations have dramatic effects on different architectures

  • Largest optimization benefit seen for the largest core count



Stencil Results

[Figure: single- and double-precision stencil performance and power efficiency across the platforms studied]


Autotuning gives better performance… but is it good performance?


The Roofline Performance Model

[Figure: empty roofline axes; attainable Gflop/s (1 to 128) vs. flop:DRAM byte ratio (1/8 to 16), both on log scales]


The Roofline Performance Model

[Figure: roofline for a generic machine; attainable Gflop/s (0.5 to 256) vs. actual flop:byte ratio (1/8 to 16), with in-core ceilings]

  • The top of the roof is determined by peak computation rate (Double Precision floating point, DP for these algorithms)

  • The instruction mix, lack of SIMD, limited ILP, or failure to use other features required for peak will lower the attainable rate

Ceilings shown: peak DP, mul/add imbalance, w/out SIMD, w/out ILP.


The Roofline Performance Model

[Figure: the same generic-machine roofline, now with bandwidth ceilings added below the peak-BW slope]

  • The sloped part of the roof is determined by peak DRAM bandwidth (STREAM)

  • Lack of proper prefetch, ignoring NUMA, or other things will reduce attainable bandwidth

Ceilings shown: peak DP, mul/add imbalance, w/out SIMD, w/out ILP (compute); peak BW, w/out SW prefetch, w/out NUMA (bandwidth).


The Roofline Performance Model

[Figure: the generic-machine roofline with vertical "posts" marking the arithmetic intensity of particular kernels]

  • The locations of the posts in the building are determined by arithmetic intensity

  • Will vary across algorithms and with bandwidth-reducing optimizations, such as better cache re-use (tiling), compression techniques



Naïve Roofline Model

Gflop/s(AI) = min( peak Gflop/s, Stream BW × AI )

[Figure: naive rooflines (peak DP flat ceiling and hand-optimized Stream bandwidth slope) for Intel Xeon E5345 (Clovertown), Opteron 2356 (Barcelona), Sun T2+ T5140 (Victoria Falls), and IBM QS20 Cell Blade]

  • Unrealistically optimistic model

  • Hand optimized Stream BW benchmark

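The min() above can be evaluated directly as a small sanity check; the peak rate and Stream bandwidth below are placeholders, not measurements from the talk:

#include <math.h>
#include <stdio.h>

/* attainable Gflop/s = min(peak Gflop/s, Stream BW (GB/s) * arithmetic intensity) */
double roofline(double peak_gflops, double stream_bw_gbs, double flops_per_byte) {
    return fmin(peak_gflops, stream_bw_gbs * flops_per_byte);
}

int main(void) {
    double peak = 74.7, bw = 21.3;   /* hypothetical machine */
    for (double ai = 0.125; ai <= 16.0; ai *= 2)
        printf("AI %6.3f -> %6.1f Gflop/s\n", ai, roofline(peak, bw, ai));
    return 0;
}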


Roofline Model for Stencil (out-of-the-box code)

  • Large datasets

  • 2 unit-stride streams

  • No NUMA

  • Little ILP

  • No DLP

  • Far more adds than multiplies (imbalance)

  • Ideal flop:byte ratio of 1/3

  • High locality requirements

  • Capacity and conflict misses will severely impair the flop:byte ratio

[Figure: out-of-the-box stencil performance plotted on the rooflines for Intel Xeon E5345 (Clovertown), Opteron 2356 (Barcelona), Sun T2+ T5140 (Victoria Falls), and IBM QS20 Cell Blade; no naïve SPE implementation exists for the Cell]




Roofline Model for Stencil (NUMA, cache blocking, unrolling, prefetch, …)

  • Cache blocking helps keep the flop:byte ratio as close as possible to 1/3

  • Clovertown has huge caches but is pinned to a lower BW ceiling

  • Cache management is essential when capacity per thread is low

[Figure: stencil performance after NUMA, cache blocking, unrolling, and prefetch optimizations, plotted on the rooflines for the same four platforms]


Roofline Model for Stencil (SIMDization + cache bypass)

  • Make SIMDization explicit

  • Use the cache-bypass instruction movntpd

  • Increases the flop:byte ratio to ~0.5 on x86/Cell

[Figure: fully tuned stencil performance plotted on the rooflines for the same four platforms]
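A hedged sketch of explicit SIMDization plus cache bypass for the stencil's inner loop, using SSE2 intrinsics (the non-temporal store _mm_stream_pd emits movntpd); it assumes B rows are 16-byte aligned, len is even, and the index arithmetic mirrors the stencil3d sketch earlier:

#include <emmintrin.h>

void stencil_row_sse2(const double *A, double *B, int c, int nx, int nxny,
                      int len, double s0, double s1) {
    __m128d vs0 = _mm_set1_pd(s0), vs1 = _mm_set1_pd(s1);
    for (int i = 0; i < len; i += 2) {              /* 2 doubles per SSE register */
        __m128d sum = _mm_add_pd(_mm_loadu_pd(&A[c + i - 1]),
                                 _mm_loadu_pd(&A[c + i + 1]));
        sum = _mm_add_pd(sum, _mm_add_pd(_mm_loadu_pd(&A[c + i - nx]),
                                         _mm_loadu_pd(&A[c + i + nx])));
        sum = _mm_add_pd(sum, _mm_add_pd(_mm_loadu_pd(&A[c + i - nxny]),
                                         _mm_loadu_pd(&A[c + i + nxny])));
        __m128d r = _mm_add_pd(_mm_mul_pd(vs0, _mm_loadu_pd(&A[c + i])),
                               _mm_mul_pd(vs1, sum));
        _mm_stream_pd(&B[c + i], r);                /* non-temporal store: bypass cache */
    }
}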


Autotuning: Extreme Performance Programming

[Figure: autotuning search space; performance as a function of block size (n0 x m0) for dense matrix-matrix multiply]


Rethinking Algorithms: Count Data Movement, Not Flops


Latency- and Bandwidth-Avoiding

  • Communication is limiting factor for scalability

    • Movement of data within memory hierarchy

    • Movement of data between cores/nodes

  • Two aspects of communication

    • Latency: number of discrete events (cache misses, messages)

    • Bandwidth: volume of data moved

  • More efficient algorithms tend to run less efficiently

    • Algorithm efficiency: E.g., O(n) ops on O(n) data

    • Efficiency of computation on machine: % of peak

    • At scale, cannot afford to be algorithmically inefficient

    • Need to run low compute intensity algorithms well, or flops are wasted

Joint work with Jim Demmel, Mark Hoemmen, and Marghoob Mohiyuddin


Avoiding Communication in Sparse Iterative Solvers

  • Consider Sparse Iterative Methods for Ax=b

    • Use Krylov Subspace Methods like GMRES, CG

    • Can we lower the communication costs?

      • Latency of communication, i.e., reduce # messages by computing Aᵏx with one read of remote x

      • Bandwidth to the memory hierarchy, i.e., compute Aᵏx with only one read of A

  • Example: Inner loop is sparse matrix-vector multiply, Ax (=nearest neighbor computation on a graph)

  • Partition into cache blocks or by processor, and take multiple steps

  • Simple example (for simplicity of presentation only): A is the matrix of a 1D mesh ("3-point stencil")

Joint work with Jim Demmel, Mark Hoemmen, and Marghoob Mohiyuddin


Locally Dependent Entries for [x, Ax], A Tridiagonal

[Figure: entries of x, Ax, A²x, …, A⁸x laid out across 2 processors (Proc 1, Proc 2); the highlighted entries of [x, Ax] can be computed without communication]

Joint work with Jim Demmel, Mark Hoemmen, and Marghoob Mohiyuddin


Locally Dependent Entries for [x, Ax, A²x], A Tridiagonal

[Figure: the same layout; the entries of [x, Ax, A²x] that each processor can compute without communication]

Joint work with Jim Demmel, Mark Hoemmen, and Marghoob Mohiyuddin
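A minimal sketch of the idea in these figures, assuming a tridiagonal A with constant coefficients (a, b, c) and a ghost region of k entries on each side; names and layout are illustrative:

/* With k ghost entries per side, a processor computes its rows of
 * x, Ax, ..., A^k x with no further communication; each step, one more
 * entry on each side becomes invalid. */
void local_matrix_powers(const double *x_ext, /* n local entries + k ghosts per side    */
                         double *V,           /* (k+1) rows of width n+2k; row j = A^j x */
                         int n, int k, double a, double b, double c) {
    int w = n + 2 * k;
    for (int i = 0; i < w; i++) V[i] = x_ext[i];           /* row 0 = x        */
    for (int j = 1; j <= k; j++) {
        const double *prev = V + (j - 1) * w;
        double *cur = V + j * w;
        for (int i = j; i < w - j; i++)                     /* shrinking window */
            cur[i] = a * prev[i - 1] + b * prev[i] + c * prev[i + 1];
    }
}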


Avoiding Communication in Sparse Linear Algebra - Summary

  • Take k steps of a Krylov subspace method

    • GMRES, CG, Lanczos, Arnoldi

    • Assume matrix “well-partitioned,” with modest surface-to-volume ratio

    • Parallel implementation

      • Conventional: O(k log p) messages

      • “New”: O(log p) messages - optimal

    • Serial implementation

      • Conventional: O(k) moves of data from slow to fast memory

      • “New”: O(1) moves of data – optimal

  • Can incorporate some preconditioners

    • Hierarchical, semiseparable matrices …

  • Lots of speed up possible (modeled and measured)

    • Price: some redundant computation


Why All Our Problems Are Solved for Dense Linear Algebra - in Theory

  • Thm (Demmel, Dumitriu, Holtz, Kleinberg) (Numer. Math. 2007)

    • Given any matmul running in O(n^ω) ops for some ω > 2, it can be made stable and still run in O(n^(ω+η)) ops, for any η > 0.

      • Current record: ω ≈ 2.38

  • Thm (Demmel, Dumitriu, Holtz) (Numer. Math. 2008)

    • Given any stable matmul running in O(n^(ω+η)) ops, can do backward stable dense linear algebra in O(n^(ω+η)) ops:

      • GEPP, QR (recursive)

      • rank-revealing QR (randomized)

      • (Generalized) Schur decomposition, SVD (randomized)

  • Also reduces communication to O(n^(ω+η))

  • But constants?


How to Waste an Exascale Machine

  • Ignore Little’s Law (waste bandwidth)

  • Over-synchronize (unnecessary barriers)

  • Over-synchronize communication (two-sided vs. one-sided)

  • Waste bandwidth: ignore locality

  • Use algorithms that minimize flops rather than data movement

  • Add a high-overhead runtime system when you don’t need it

