
Attacking the programming model wall

Marco Danelutto

Dept. Computer Science, Univ. of Pisa

Belfast, February 28th 2013

Multicores
  • Moore's law: from components to cores
  • Simpler cores, shared memory, cache coherent, full interconnect
Manycores
  • Even simpler cores, shared memory, cache coherent, regular interconnection, co-processors (via PCIe)
  • Options for cache coherence, more complex inter-core communication protocols
GPUs
  • ALUs + instruction sequencers, large and fast memory access, co-processors (via PCIe)
  • Data parallel computations only
FPGA
  • Low scale manufacturing; accelerators (mission critical sw); GP computing (PCIe co-processors, CPU socket replacement); possibly hosting GP CPUs/cores
  • Non-standard programming tools
Power wall (2)
  • Reducing idle costs
    • E4 CARMA CLUSTER
      • ARM + nVIDIA
      • Spare Watt → GPU
  • Reducing the cooling costs
    • Eurotech AURORA TIGON
      • Intel technology
      • Water cooling
      • Spare Watt → CPU
Programming models
  • Low abstraction level
    • Pros
      • Performance / efficiency
      • Heterogeneous hw targeting
    • Cons
      • Huge application programmer responsibilities
      • Portability (functional, performance)
      • Quantitative parallelism exploitation
  • High abstraction level
    • Pros
      • Expressive power
      • Separation of concerns
      • Qualitative parallelism exploitation
    • Cons
      • Performance / efficiency
      • Hw targeting

Separation of concerns
  • Functional
    • What has to be computed
    • Function from input data to output data
    • Domain specific
    • Application dependent
  • Non-functional
    • How the result is computed
    • Parallelism, power management, security, fault tolerance, …
    • Target hw specific
    • Factorizable

Current programming frameworks

CILK

TBB

OpenMP

MPI

OpenCL

Urgencies

Need for:

  • Parallel programming models
  • Parallel programmers

Structured parallel programming
  • Algorithmic skeletons
    • From the HPC community
    • Started in the early '90s (M. Cole's PhD thesis)
    • Pre-defined parallel patterns, exposed to programmers as programming constructs / library calls
  • Parallel design patterns
    • From the SW engineering community
    • Started in the early '00s
    • "Recipes" to handle parallelism (name, problem, forces, solutions, …)

Algorithmic skeletons
  • Common, parametric, reusable parallelism exploitation patterns (from HPC community)
  • Exposed to programmers as constructs, library calls, objects, higher order functions, components, ...
  • Composable
    • Two tier model: “stream parallel” skeletons with inner “data parallel” skeletons
Sample classical skeletons
  • Stream parallel
    • Parallel computation of different items from an input stream
    • Task/farm (master/worker), Pipeline
  • Data parallel
    • Parallel computation on (possibly overlapped) partitions of the same input data
    • Map, Stencil, Reduce, Scan, MapReduce

Implementing skeletons
  • Template based
    • Skeleton implemented by instantiating a "concurrent activity graph template"
    • Performance models used to instantiate quantitative parameters
    • P3L, Muesli, SkeTo, FastFlow
  • Macro data flow based
    • Skeleton program compiled to a macro data flow graph
    • Rewriting/refactoring during the compilation process
    • Parallel MDF graph interpreter at run time
    • Muskel, Skipper, Skandium

Refactoring skeletons
  • Formally proven rewriting rules

Farm(Δ) = Δ

Pipe(Δ1, Δ2) = SeqComp(Δ1, Δ2)

Pipe(Map(Δ1), Map(Δ2)) = Map(Pipe(Δ1, Δ2))
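
As a worked example (mine, not from the deck) of how these rules chain, a farm of a pipeline of maps rewrites step by step into a single map of a sequential composition, trading pipeline parallelism for lower communication overhead:

Farm(Pipe(Map(Δ1), Map(Δ2))) = Pipe(Map(Δ1), Map(Δ2)) (farm elimination)

= Map(Pipe(Δ1, Δ2)) (map fusion)

= Map(SeqComp(Δ1, Δ2)) (pipe elimination)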

Sample performance models
  • Pipeline service time

max_{i=1..k} { serviceTime(Stage_i) }

  • Pipeline latency

∑_{i=1..k} { serviceTime(Stage_i) }

  • Farm service time

max { taskSchedTime, resGathTime, workerTime / #worker }

  • Map latency

partitionTime + workerTime + gatherTime
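
These models are simple enough to evaluate in a few lines. A minimal sketch in C++ (function names are mine; the numbers in the final comment are illustrative, not measurements from the deck):

#include <algorithm>
#include <numeric>
#include <vector>

// Pipeline: the slowest stage bounds the service time; latency sums all stages.
double pipeServiceTime(const std::vector<double>& stageTimes) {
    return *std::max_element(stageTimes.begin(), stageTimes.end());
}
double pipeLatency(const std::vector<double>& stageTimes) {
    return std::accumulate(stageTimes.begin(), stageTimes.end(), 0.0);
}

// Farm: the bottleneck is task scheduling, result gathering, or the worker pool.
double farmServiceTime(double taskSchedTime, double resGathTime,
                       double workerTime, int nWorkers) {
    return std::max({taskSchedTime, resGathTime, workerTime / nWorkers});
}

// E.g. a pipeline with stage times {2, 5, 3} (ms) has service time 5 ms
// and latency 10 ms.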

Key strengths
  • Full parallel structure of the application exposed to the skeleton framework
    • Exploited by optimizations and by the support for autonomic non-functional concern management
  • Framework responsibility for architecture targeting
    • Write once run everywhere code, with architecture specific compiler and back end (run time) tools
  • Only functional debugging required of application programmers
Parallel design patterns
  • Carefully describe a parallelism exploitation pattern including
    • Applicability
    • Forces
    • Possible implementations / problem solutions
  • As text
  • At different levels of abstraction
Patterns
  • Collapsed into algorithmic skeletons
    • Application programmer → concurrency and algorithm spaces
    • Skeleton implementation (system programmer) → support structures and implementation mechanisms
Structured parallel programming: design patterns

Problem → Design patterns (follow, learn, use) → Low level code, written with programming tools

Structured parallel programming: skeletons

Problem → Skeleton library (instantiate, compose) → High level code

Structured parallel programming

Problem → Design pattern knowledge used to instantiate and compose Skeletons → High level code

Working unstructured
  • Tradeoffs
    • CPU/GPU threads
    • Processes/Threads
    • Coarse/fine grain tasks
  • Target architecture dependent decisions
Threads/processes
  • Creation
    • Thread pool vs. on-the-fly creation
  • Pinning
    • Operating system dependent effectiveness (see the pinning sketch below)
  • Memory management
    • Embarrassingly parallel patterns may benefit from process memory space separation (see the Memory slide below)
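
A minimal pinning sketch for Linux (pthread_setaffinity_np is a non-portable glibc extension, which is precisely why effectiveness is OS dependent; error handling omitted):

#include <pthread.h>
#include <sched.h>

// Pin the calling thread to the given core (Linux/glibc only;
// g++ defines _GNU_SOURCE by default, which these macros need).
void pinToCore(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &set);
}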
Memory
  • Cache friendly algorithms
    • Minimization of cache coherency traffic
    • Data alignment/padding (see the padding sketch below)
  • Memory wall
    • 1-2 memory interfaces per 4-8 cores
    • 4-8 memory interfaces per 60-64 cores (+internal routing)
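
As one concrete instance of the alignment/padding point, a common idiom pads per-worker data to cache line size so that independent updates do not invalidate each other's lines (a sketch assuming 64-byte lines, which is typical but architecture dependent):

#include <atomic>
#include <cstdint>

// One counter per cache line: concurrent updates by different workers
// no longer cause false sharing (and the related coherency traffic).
struct alignas(64) PaddedCounter {
    std::atomic<uint64_t> value{0};
};

PaddedCounter counters[8];   // e.g. one per worker thread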
Synchronization
  • High level, general purpose mechanisms
    • Passive wait
    • High latency
  • Low level mechanisms
    • Active wait
    • Smaller latency
  • When needed
    • Synchronization directly on memory (fences)

(both waiting styles are contrasted in the sketch below)
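
A minimal sketch contrasting the two styles (plain C++11, not FastFlow internals): the condition variable blocks the waiter (passive wait, higher wake-up latency), while the spin loop burns a core but reacts faster (active wait):

#include <atomic>
#include <condition_variable>
#include <mutex>

std::atomic<bool> ready{false};
std::mutex m;
std::condition_variable cv;

// Active wait: low latency, but the core stays busy.
void spinWait() {
    while (!ready.load(std::memory_order_acquire)) { /* spin */ }
}

// Passive wait: the thread sleeps until notified, paying wake-up latency.
void blockingWait() {
    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk, [] { return ready.load(); });
}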
Devising parallelism degree
  • Ideally
    • As many parallel activities as needed to sustain the input data rate
  • Base measures
    • Estimated input pressure, task processing time, communication overhead
  • Compile vs. run time choices
    • Statically devise near-optimal initial values (a rough estimate is sketched below)
    • Adjust the initial settings dynamically, based on run time observations
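
A rough sketch of the static estimate (symbol names are mine, not the deck's): to keep up with tasks arriving every interArrival time units, when each task costs workTime plus commOverhead, about ceil((workTime + commOverhead) / interArrival) parallel activities are needed:

#include <cmath>

// Static first guess of the parallelism degree needed to sustain the
// input rate; run time observations should then adjust this value.
int parallelismDegree(double workTime, double commOverhead,
                      double interArrival) {
    return static_cast<int>(std::ceil((workTime + commOverhead) / interArrival));
}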
NUMA memory exploitation
  • Auto scheduling
    • Idle workers request tasks from a "global" queue
    • Far nodes request fewer tasks than near ones
  • Affinity scheduling
    • Tasks scheduled on the cores that produced them
  • Round robin allocation of dynamically allocated chunks
Behavioural skeletons

The structured parallel algorithm code exposes sensors & actuators.

  • Sensors determine what can be perceived of the computation
  • Actuators determine what can be affected/changed in the computation
  • The NFC autonomic manager reads an ECA (Event Condition Action) rule based program and executes a MAPE loop: at each iteration the ECA rule system is evaluated on the monitored values, possibly operating actions on the structured parallel pattern

Sample rules
  • Event: inter-arrival time changes
  • Condition: arrivals faster than the service time
  • Action: increase the parallelism degree

  • Event: fault at a worker
  • Condition: service time low
  • Action: recruit a new worker resource

(one possible encoding of such rules is sketched below)
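
A minimal sketch of one possible encoding of such rules (a hypothetical C++ representation of mine; the deck does not show the manager's actual rule language):

#include <functional>
#include <vector>

// Event-Condition-Action rule: when the event fires and the condition
// holds on the monitored values, the action operates on the actuators.
struct EcaRule {
    std::function<bool()> event;      // e.g. inter-arrival time changed
    std::function<bool()> condition;  // e.g. arrivals faster than service time
    std::function<void()> action;     // e.g. increase the parallelism degree
};

// One rule-evaluation pass inside an iteration of the MAPE loop.
void mapeIteration(const std::vector<EcaRule>& rules) {
    for (const auto& r : rules)
        if (r.event() && r.condition()) r.action();
}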

Yes, nice, but then?

We have MPI, OpenMP, CUDA, OpenCL …

FastFlow
  • Full C++, skeleton based, streaming parallel processing framework

http://mc-fastflow.sourceforge.net

Bring skeletons to your desk
  • Full POSIX/C++ compliance
    • G++, make, gprof, gdb, pthread, …
  • Reuse existing code
    • Proper wrappers
  • Run from laptops to clusters & clouds
    • Same skeleton structure
Basic abstraction: ff_node

class RedEye: public ff_node {

  int svc_init() { … }        // called once, before the stream starts

  void svc_end() { … }        // called once, after the stream ends

  void * svc(void * task) {   // called on each item of the input stream
    Image * in = (Image *) task;
    Image * out = ….
    return ((void *) out);
  }

};

Basic stream parallel skeletons
  • Farm(Worker, Nw)
    • Embarrassingly parallel computations on streams
    • Computing Worker in parallel (Nw copies)
    • Implemented as an Emitter, a set of Worker replicas, and a Collector
  • Pipeline(Stage1, … , StageN)
    • StageK processes output of Stage(K-1) and delivers to Stage(K+1)
  • Feedback(Skel, Cond)
    • Routes back results from Skel to input or forward to output depending on Cond
Setting up a pipeline

ff_pipeline myImageProcessingPipe;

ff_node * startNode = new Reader(…);
ff_node * redEye = new RedEye();
ff_node * light = new LightCalibration();
ff_node * sharpen = new Sharpen();
ff_node * endNode = new Writer(…);

myImageProcessingPipe.addStage(startNode);
myImageProcessingPipe.addStage(redEye);
myImageProcessingPipe.addStage(light);
myImageProcessingPipe.addStage(sharpen);
myImageProcessingPipe.addStage(endNode);

myImageProcessingPipe.run_and_wait_end();

Refactoring (farm introduction)

// the sequential Sharpen stage is replaced by a farm of Sharpen workers:
ff_farm<> thirdStage;
std::vector<ff_node *> w;

for (int i = 0; i < nworkers; ++i)
  w.push_back(new Sharpen());

thirdStage.add_workers(w);

// instead of myImageProcessingPipe.addStage(sharpen):
myImageProcessingPipe.addStage(&thirdStage);

Refactoring (map introduction)

ff_farm<> thirdStage;
std::vector<ff_node *> w;

for (int i = 0; i < nworkers; ++i)
  w.push_back(new Sharpen());

thirdStage.add_workers(w);

Emitter em;    // scatters data partitions to the workers
Collector co;  // collects results from the workers

thirdStage.add_emitter(&em);
thirdStage.add_collector(&co);

myImageProcessingPipe.addStage(&thirdStage);

FastFlow accelerator
  • Create a suitable skeleton accelerator
  • Offload tasks from main (sequential) business logic code
  • Accelerator exploits the “spare cores” on your machine
FastFlow accelerator

ff_farm<> farm(true);    // create the accelerator

std::vector<ff_node *> w;
for (int i = 0; i < nworkers; ++i)
  w.push_back(new Worker);
farm.add_workers(w);
farm.add_collector(new Collector);

farm.run_then_freeze();  // start the accelerator threads

while (…) {
  ….
  farm.offload(x);       // offload tasks to the accelerator
}

while (farm.load_result(&res)) {
  ….                     // process results as they come back
}

GPU offloading
  • Performance modelling of the percentage of tasks to offload (in a map or in a reduce); the simplest such model is sketched below
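
A sketch of the simplest such model (my formulation, not taken from the deck): if the CPU side processes a task in cpuTime and the GPU side in gpuTime (PCIe transfers included), offloading a fraction f of the tasks balances the two sides when f * gpuTime = (1 - f) * cpuTime, i.e. f = cpuTime / (cpuTime + gpuTime):

// Balanced offload fraction under the simple cost model above.
double offloadFraction(double cpuTime, double gpuTime) {
    return cpuTime / (cpuTime + gpuTime);
}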
Moving to MIC
  • FastFlow ported to Tilera Pro64
    • (paper this afternoon)
  • With minimal intervention to ensure functional portability
  • And some further changes to better exploit hw-specific optimizations

Thanks to Marco Aldinucci, Massimo Torquati, Peter Kilpatrick, Sonia Campa, Giorgio Zoppi, Daniele Buono, Silvia Lametti, Tudor Serban

Any questions?

marcod@di.unipi.it