Static Translation of Stream Programming to a Parallel System

S. M. Farhad, PhD Student
Supervisor: Dr. Bernhard Scholz
Programming Language Group, School of Information Technology, University of Sydney

Uniprocessor Performance

Motivation
[Figure: number of cores per chip (2 to 512, doubling scale) versus year (1970 to 20??). Processors plotted: 4004, 8008, 8080, 8086, 286, 386, 486, Pentium, P2, P3, P4, Athlon, Itanium 1, Itanium 2, Power4, PExtreme, Power6, Yonah, Opteron, Tanglewood, PA-8800, Xbox360, Xeon MP, Broadcom 1480, Opteron 4P, Niagara, Cell, Raw, Cavium Octeon, Raza XLR, Intel Tflops, Cisco CSR-1, Ambric AM2045, Picochip PC102]

Motivation
[Figure: number of cores per chip versus year, 1970 to 20??]
• For uniprocessors, C was:
  • Portable
  • High Performance
  • Composable
  • Malleable
  • Maintainable
• Uniprocessors: C is the common machine language

Motivation
[Figure: number of cores per chip versus year, 1970 to 20??]
• What is the common machine language for multicores?

Common Machine Languages
• Uniprocessors: von-Neumann languages represent the common properties and abstract away the differences
• Compiler tasks listed for uniprocessors: Register Allocation, Instruction Selection, Instruction Scheduling
• Multicores: a stream programming language is a common machine language for multicores

Properties of Stream Programs [W. Thies ‘02]
• A large (possibly infinite) amount of data
• Limited lifespan of each data item
• Little processing of each data item
• A regular, static computation pattern
• Stream program structure is relatively constant
• A lot of opportunities for compiler optimizations

Applications of Stream Programming

Model of Computation
• Synchronous Dataflow [Lee ‘92]
• Graph of autonomous filters
• Communicate via FIFO channels
• Static I/O rates [Edward ‘87]
• Compiler decides on an order of execution (schedule)
• Static estimation of computation
[Figure: example stream graph with filters AtoD, FMDemod, Scatter, LPF1 to LPF3, HPF1 to HPF3, Gather, Adder, Speaker]
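Because the I/O rates are static, a compiler can work out how many times each filter must fire per steady state before any code runs. The following is a minimal sketch of that calculation (the balance equations of synchronous dataflow), not the StreamIt compiler's implementation. The function name `repetition_vector`, the edge tuple layout, and the rates in the example are assumptions made for illustration, and the graph is assumed to be connected.

```python
from fractions import Fraction
from math import lcm  # Python 3.9+


def repetition_vector(edges):
    """Solve r[src] * push == r[dst] * pop for every channel.

    edges: list of (producer, consumer, push_rate, pop_rate) tuples.
    Returns the smallest positive integer firing counts per steady state.
    Hypothetical helper: names and layout are illustrative only.
    """
    # Undirected adjacency list carrying the firing ratio along each channel.
    adj = {}
    for src, dst, push, pop in edges:
        adj.setdefault(src, []).append((dst, Fraction(push, pop)))
        adj.setdefault(dst, []).append((src, Fraction(pop, push)))

    # Fix one filter at rate 1 and propagate ratios through the graph.
    start = edges[0][0]
    rate = {start: Fraction(1)}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt, ratio in adj[node]:
            value = rate[node] * ratio
            if nxt not in rate:
                rate[nxt] = value
                stack.append(nxt)
            elif rate[nxt] != value:
                raise ValueError("inconsistent rates: no periodic schedule")

    # Scale the fractional rates up to integers.
    scale = lcm(*(r.denominator for r in rate.values()))
    return {name: int(r * scale) for name, r in rate.items()}


# Made-up rates: A pushes 2 items per firing, B pops 3 and pushes 1, C pops 2.
print(repetition_vector([("A", "B", 2, 3), ("B", "C", 1, 2)]))
# → {'A': 3, 'B': 2, 'C': 1}
```

Any order of execution that fires each filter the computed number of times, without reading from an empty channel, leaves every FIFO at its starting occupancy and can therefore repeat indefinitely.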

StreamIt Language Overview [Thies ‘04]
• StreamIt is a novel language for streaming
• Exposes parallelism and communication
• Architecture independent
• Modular and composable: simple structures compose to create complex graphs
• Malleable: change program behavior with small modifications
• Constructs: filter; pipeline; splitjoin (parallel computation between a splitter and a joiner); feedback loop (splitter and joiner)
• A nested stage may be any StreamIt language construct

Mapping of Filters to Multicores
• Task Parallelism [Edward ‘87]
• Fine-Grained Data Parallelism [Michael ‘06]
• 3-phase solution [Michael ‘06]
• Orchestrating the Execution of Stream Programs [Kudlur ‘08]

Baseline 1: Task Parallelism
• Inherent task parallelism between two processing pipelines
• Task Parallel Model:
  • Only parallelize explicit task parallelism
  • Fork/join parallelism
• Executing this on a 2 core machine gives ~2x speedup over a single core
[Figure: a Splitter feeding two pipelines, each of BandPass, Compress, Process, Expand, BandStop, merged by a Joiner into an Adder]

Baseline 2: Fine-Grained Data Parallelism
• Each of the filters in the example is stateless
• We can introduce data parallelism
• Fine-grained Data Parallel Model:
  • Fiss each stateless filter N ways (N is number of cores)
  • Remove scatter/gather if possible
• Example: 4 cores
• Each fission group occupies entire machine
[Figure: the two-pipeline graph with BandPass, Compress, Process, Expand and BandStop each fissed between a Splitter and a Joiner, followed by the Adder]
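Fission of a stateless filter can be sketched as a round-robin splitter, N independent copies of the filter, and a round-robin joiner. The snippet below is an illustrative model only: `fiss` is a hypothetical helper, the filter is assumed to pop one item and push one item per firing, and the copies run sequentially here rather than on separate cores.

```python
def fiss(work, n):
    """Wrap a stateless one-in/one-out filter in an n-way fission group.

    Hypothetical helper for illustration; not part of StreamIt.
    """
    def run(stream):
        # Splitter: item i goes to copy i % n.
        lanes = [stream[i::n] for i in range(n)]
        # Each copy processes its own lane with no shared state.
        outputs = [[work(x) for x in lane] for lane in lanes]
        # Joiner: collect results round-robin to restore the original order.
        return [outputs[i % n][i // n] for i in range(len(stream))]
    return run


def double(x):
    # Stand-in for a stateless filter body.
    return 2 * x


samples = list(range(10))
# The fissed version produces exactly what the original filter would.
assert fiss(double, 4)(samples) == [double(x) for x in samples]
```

The equivalence check only holds because the filter keeps no state between firings; a stateful filter would see its inputs in a different order inside each copy.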

3-Phase Solution [Michael ‘06]
Target a 4 core machine
[Figure: stream graph annotated with work estimates: Splitter, two AdaptDFT filters (6 each), Joiner, RectPolar (20), Splitter, two branches of Unwrap (2), Diff (1), Amplify (1), Accum (1), Joiner, PolarRect (20)]
• RectPolar and PolarRect: data parallel
• Unwrap, Diff, Amplify, Accum: data parallel, but too little work!

Data Parallelize
Target a 4 core machine
[Figure: the same graph with the RectPolar stage (work 20) fissed between a Splitter and a Joiner into copies of work 5 each, and the PolarRect stage (work 20) fissed the same way; the AdaptDFT (6 each) and Unwrap, Diff, Amplify, Accum (2, 1, 1, 1) filters are unchanged]

Data + Task Parallel Execution
Target 4 core machine
[Figure: the data-parallelized graph beside a cores-versus-time schedule; the time axis is marked at 21]

Better Mapping
Target 4 core machine
[Figure: the data-parallelized graph beside a cores-versus-time schedule; the time axis is marked at 16]

Phase 3: Coarse-Grained Software Pipelining
[Figure: RectPolar executions arranged into a Prologue followed by a New Steady State]
• New steady-state is free of dependencies
• Schedule new steady-state using a greedy partitioning

Greedy Partitioning [Michael ‘06]
Target 4 core machine
[Figure: the filters to schedule packed onto cores over time; the time axis is marked at 16]
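Once software pipelining has removed dependencies within the steady state, the filters can be treated as independent work items to pack onto cores. One common greedy heuristic, shown here as a sketch and not necessarily the exact procedure of the cited work, takes items in decreasing order of work and places each on the currently least-loaded core. The function name `greedy_partition`, the filter names, and the work estimates are made up for illustration.

```python
import heapq


def greedy_partition(work, cores):
    """Assign filters to cores, heaviest first, always to the lightest core.

    work: dict mapping filter name to its static work estimate.
    Returns (assignment, makespan); hypothetical helper for illustration.
    """
    heap = [(0, c) for c in range(cores)]  # (current load, core id)
    assignment = {c: [] for c in range(cores)}
    for name, cost in sorted(work.items(), key=lambda kv: -kv[1]):
        load, core = heapq.heappop(heap)   # least-loaded core so far
        assignment[core].append(name)
        heapq.heappush(heap, (load + cost, core))
    makespan = max(load for load, _ in heap)
    return assignment, makespan


# Made-up work estimates on a 2 core machine.
estimates = {"f1": 7, "f2": 5, "f3": 4, "f4": 3, "f5": 1}
assignment, makespan = greedy_partition(estimates, 2)
print(assignment)  # → {0: ['f1', 'f4'], 1: ['f2', 'f3', 'f5']}
print(makespan)    # → 10
```

The heuristic is fast and usually close to balanced, but it gives no optimality guarantee and ignores communication between cores.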

Static Translation of Stream Programs [Proposal]
• We study a mathematical model and algorithms to:
  • Resolve bottlenecks in stream programs
  • Map actors of stream programs to processors in a parallel system
  • Compute a schedule for each processor
• Goal is to statically optimize the throughput of a stream program
• Assuming constant input bandwidth

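Research Question: Removing the bottleneck from the stream graph
[Figure: the original stream graph with filters A, B, C, D, in which filter B is the bottleneck, beside the graph after removing the bottleneck, in which filter B is duplicated (B and B́) between a splitter S and a joiner J]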
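The duplication step in the figure can be sketched as a loop over static work estimates. This is a rough model under stated assumptions, not the proposal's algorithm: an actor counts as a bottleneck when its work per copy exceeds a target `period` (standing in for the constant input bandwidth), duplicating an actor k ways divides its work evenly by k, every actor is assumed stateless, and splitter and joiner overhead is ignored. The name `resolve_bottlenecks` and the numbers are hypothetical.

```python
def resolve_bottlenecks(work, period):
    """Duplicate the heaviest actor until no actor exceeds `period`.

    work: dict mapping actor name to work per steady state.
    period: positive time budget per steady state (an assumed stand-in
    for the constant input bandwidth).
    Returns the number of copies chosen for each actor.
    """
    copies = {name: 1 for name in work}
    while True:
        # The actor with the largest work per copy limits throughput.
        worst = max(work, key=lambda a: work[a] / copies[a])
        if work[worst] / copies[worst] <= period:
            return copies  # bottleneck free under this model
        copies[worst] += 1


# Made-up estimates in which B is the bottleneck, as in the figure.
print(resolve_bottlenecks({"A": 2, "B": 8, "C": 3, "D": 1}, period=4))
# → {'A': 1, 'B': 2, 'C': 1, 'D': 1}
```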

Research Method
• Perform a quantitative analysis that detects bottlenecks in the stream graph
• The bottleneck resolver duplicates actors that impose a bottleneck
• The process continues until the program is bottleneck free
• The actors are then mapped to processors via Integer Linear Programming
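The mapping step can be read as an assignment problem: place each actor on exactly one processor so that the most heavily loaded processor is as light as possible. The slides do not give the ILP formulation, so the sketch below is an assumption about the objective, and it swaps the ILP solver for an exhaustive search that is only practical for a handful of actors. Communication cost is ignored, and `best_mapping`, the actor names, and the work values are hypothetical.

```python
from itertools import product


def best_mapping(work, cores):
    """Find an actor-to-core mapping with the minimum maximum load.

    Exhaustive stand-in for an ILP with binary variables x[a][p]
    (actor a on processor p), each actor placed exactly once, and the
    objective of minimizing the largest per-processor load.
    Returns (max_load, mapping).
    """
    names = sorted(work)
    best = None
    # Try every possible assignment of actors to cores.
    for choice in product(range(cores), repeat=len(names)):
        loads = [0] * cores
        for name, core in zip(names, choice):
            loads[core] += work[name]
        cost = max(loads)
        if best is None or cost < best[0]:
            best = (cost, dict(zip(names, choice)))
    return best


# Made-up estimates after duplicating B into B1 and B2.
max_load, mapping = best_mapping({"A": 2, "B1": 4, "B2": 4, "C": 3, "D": 1}, 2)
print(max_load)  # → 7
```

An ILP solver would explore the same search space far more efficiently and can take extra constraints, such as communication cost or memory limits, as additional linear terms.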

Plan
• Background study
• Research question
• Proposal
• Implementation
• Results
• Publication

Questions?
