StreamIt – A Programming Language for the Era of Multicores

Saman Amarasinghe
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology

Today: The Happily Oblivious Average Joe Programmer
• Joe is oblivious about the processor
• Moore’s law brings Joe performance
• Sufficient for Joe’s requirements
• Joe has built a solid boundary between hardware and software
• High-level languages abstract away the processors
• Ex: Java bytecode is machine independent
• This abstraction has provided a lot of freedom for Joe
• Parallel programming is only practiced by a few experts

Joe the Parallel Programmer
• Moore’s law is not bringing any more performance gains
• If Joe needs performance he has to deal with multicores
• Joe has to deal with performance
• Joe has to deal with parallelism

Why Parallelism is Hard
• A huge increase in complexity and work for the programmer
• Programmer has to think about performance!
• Parallelism has to be designed in at every level
• Humans are sequential beings
• Deconstructing problems into parallel tasks is hard for many of us
• Parallelism is not easy to implement
• Parallelism cannot be abstracted or layered away
• Code and data have to be restructured in very different (non-intuitive) ways
• Parallel programs are very hard to debug
• Combinatorial explosion of possible execution orderings
• Race condition and deadlock bugs are non-deterministic and elusive
• Non-deterministic bugs go away in a lab environment and with instrumentation

Compiler-Aware Language Design: The StreamIt Experience
[Figure: stream graph FMDemod -> Scatter -> LPF1, LPF2, LPF3 -> Gather -> Speaker]

StreamIt Project
• Language Semantics / Programmability
  • StreamIt Language (CC 02)
  • Programming Environment in Eclipse (P-PHEC 05)
• Optimizations / Code Generation
  • Phased Scheduling (LCTES 03)
  • Cache Aware Optimization (LCTES 05)
• Domain Specific Optimizations
  • Linear Analysis and Optimization (PLDI 03)
  • Optimizations for bit streaming (PLDI 05)
  • Linear State Space Analysis (CASES 05)
• Parallelism
  • Teleport Messaging (PPOPP 05)
  • Compiling for Communication-Exposed Architectures (ASPLOS 02)
  • Load-Balanced Rendering (Graphics Hardware 05)
• Applications
  • SAR, DSP benchmarks, JPEG, MPEG [IPDPS 06], DES and Serpent [PLDI 05], …
[Figure: compiler flow. StreamIt Program -> Front-end -> Annotated Java -> Stream-Aware Optimizations -> Uniprocessor, Cluster, Raw, and IBM X10 backends. Outputs shown: C, MPI-like C, C per tile + msg code, Streaming X10 runtime]

Compiler-Aware Language Design
• Programmability: boost productivity, enable faster development and rapid prototyping
• Domain-specific optimizations: simple and effective optimizations for domain-specific abstractions
• Enable parallel execution: target multicores, clusters, tiled architectures, DSPs, graphics processors, …

Streaming Application Design
• Structured block level diagram describes computation and flow of data
• Conceptually easy to understand
• Clean abstraction of functionality

add VLD(QC, PT1, PT2);
add splitjoin {
  split roundrobin(N*B, V);
  add pipeline {
    add ZigZag(B);
    add IQuantization(B) to QC;
    add IDCT(B);
    add Saturation(B);
  }
  add pipeline {
    add MotionVectorDecode();
    add Repeat(V, N);
  }
  join roundrobin(B, V);
}
add splitjoin {
  split roundrobin(4*(B+V), B+V, B+V);
  add MotionCompensation(4*(B+V)) to PT1;
  for (int i = 0; i < 2; i++) {
    add pipeline {
      add MotionCompensation(B+V) to PT1;
      add ChannelUpsample(B);
    }
  }
  join roundrobin(1, 1, 1);
}
add PictureReorder(3*W*H) to PT2;
add ColorSpaceConversion(3*W*H);

[Figure: MPEG-2 Decoder stream graph. MPEG bit stream -> VLD (sends picture type to <PT1, PT2> and quantization coefficients to <QC>) -> splitter -> two branches: frequency encoded macroblocks through ZigZag -> IQuantization <QC> -> IDCT -> Saturation, and differentially coded motion vectors through Motion Vector Decode -> Repeat -> joiner (spatially encoded macroblocks, motion vectors) -> splitter -> Motion Compensation <PT1> for each of Y, Cb, Cr, each with a reference picture, Channel Upsample on two of the branches -> joiner -> recovered picture -> Picture Reorder <PT2> -> Color Space Conversion]

StreamIt Philosophy
• Preserve program structure
• Natural for application developers to express
• Leverage program structure to discover parallelism and deliver high performance
• Programs remain clean
• Portable and malleable
[Figure: the MPEG-2 decoder StreamIt code and stream graph, repeated from the Streaming Application Design slide]

StreamIt Philosophy
[Figure: the MPEG-2 decoder StreamIt code and stream graph, repeated from the Streaming Application Design slide, with the final stage labeled "output to player"]

Stream Abstractions in StreamIt
[Figure: the MPEG-2 decoder StreamIt code and stream graph, repeated from the Streaming Application Design slide, annotated to point out the three stream abstractions: filters, pipelines, and splitjoins]

StreamIt Language Highlights • Filters • Pipelines • Splitjoins • Teleport messaging

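Example StreamIt Filter

float->float filter FIR(int N) {
  // weights: coefficient array, declared elsewhere (not shown on the slide)
  work push 1 pop 1 peek N {
    float result = 0;
    for (int i = 0; i < N; i++) {
      result += weights[i] * peek(i);
    }
    push(result);
    pop();
  }
}

[Figure: FIR filter peeking a sliding window over input items 0 to 11 and pushing output items 0, 1]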
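
The peek/pop/push rates of the FIR filter can be illustrated outside StreamIt. The following is a minimal Python sketch, not StreamIt code and not how the StreamIt compiler implements filters; the function name `run_fir` and the use of a `deque` as the input channel are assumptions made for illustration. It fires the work function whenever N items are available to peek, pushes one result, and pops one item.

```python
from collections import deque


def run_fir(weights, stream):
    """Model of 'work push 1 pop 1 peek N' with N = len(weights)."""
    n = len(weights)
    channel = deque()  # input channel of the filter (illustrative)
    output = []
    for item in stream:
        channel.append(item)
        # The work function can fire only when N items can be peeked.
        while len(channel) >= n:
            result = sum(w * channel[i] for i, w in enumerate(weights))  # peek(i)
            output.append(result)  # push(result)
            channel.popleft()      # pop()
    return output


if __name__ == "__main__":
    print(run_fir([1, 1, 1], [1, 2, 3, 4, 5]))
```

Each output depends on a window of N inputs, but only one input is consumed per firing, which is the buffer management the StreamIt version leaves to the compiler.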

FIR Filter in C
• FIR functionality obscured by buffer management details
• Programmer must commit to a particular buffer implementation strategy

void FIR(float* src, float* dest, int* srcIndex, int* destIndex,
         int srcBufferSize, int destBufferSize, int N) {
  /* weights: coefficient array defined elsewhere (not shown on the slide) */
  float result = 0.0;
  for (int i = 0; i < N; i++) {
    result += weights[i] * src[(*srcIndex + i) % srcBufferSize];
  }
  dest[*destIndex] = result;
  *srcIndex = (*srcIndex + 1) % srcBufferSize;
  *destIndex = (*destIndex + 1) % destBufferSize;
}

Example StreamIt Pipeline
• Pipeline
• Connect components in sequence
• Expose pipeline parallelism

float->float pipeline 2D_iDCT(int N) {
  add Column_iDCTs(N);
  add Row_iDCTs(N);
}

[Figure: Column_iDCTs -> Row_iDCTs]

Preserving Program Structure
• Can be reused for JPEG decoding

int->int pipeline BlockDecode(
    portal<InverseQuantisation> quantiserData,
    portal<MacroblockType> macroblockType) {
  add ZigZagUnordering();
  add InverseQuantization() to quantiserData, macroblockType;
  add Saturation(-2048, 2047);
  add MismatchControl();
  add 2D_iDCT(8);
  add Saturation(-256, 255);
}

From Figures 7-1 and 7-4 of the MPEG-2 Specification (ISO 13818-2, P. 61, 66)

In Contrast: C Code Excerpt
• Explicit for-loops iterate through picture frames
• Frames passed through global arrays, handled with pointers
• Mixing of parser, motion compensation, and spatial decoding

EXTERN unsigned char *backward_reference_frame[3];
EXTERN unsigned char *forward_reference_frame[3];
EXTERN unsigned char *current_frame[3];
...etc...

Decode_Picture {
  for (;;) {
    parser()
    for (;;) {
      decode_macroblock();
      motion_compensation();
      if (condition) then break;
    }
  }
  frame_reorder();
}

decode_macroblock() {
  parser();
  motion_vectors();
  for (comp=0;comp<block_count;comp++) {
    parser();
    Decode_MPEG2_Block();
  }
}

motion_vectors() {
  parser();
  decode_motion_vector();
  parser();
}

motion_compensation() {
  for (channel=0;channel<3;channel++)
    form_component_prediction();
  for (comp=0;comp<block_count;comp++) {
    Saturate();
    IDCT();
    Add_Block();
  }
}

Decode_MPEG2_Block() {
  for (int i = 0;; i++) {
    parsing();
    ZigZagUnordering();
    inverseQuantization();
    if (condition) then break;
  }
}

Example StreamIt Splitjoin
• Splitjoin
• Connect components in parallel
• Expose task parallelism and data distribution

float->float splitjoin Row_iDCTs(int N) {
  split roundrobin(N);
  for (int i = 0; i < N; i++) {
    add 1D_iDCT(N);
  }
  join roundrobin(N);
}

[Figure: splitter -> N parallel branches -> joiner]

Example StreamIt Splitjoin floatfloatpipeline 2D_iDCT(int N) { add Column_iDCTs(N); add Row_iDCTs(N); } splitter floatfloatsplitjoin Column_iDCT(int N) { split roundrobin(1); for (int i = 0; i < N; i++) { add 1D_iDCT(N); } join roundrobin(1); } iDCT iDCT iDCT iDCT joiner splitter splitter splitter floatfloatsplitjoin Row_iDCT(int N) { split roundrobin(N); for (int i = 0; i < N; i++) { add 1D_iDCT(N); } join roundrobin(N); } iDCT iDCT iDCT iDCT joiner joiner joiner
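
The difference between `roundrobin(N)` and `roundrobin(1)` is only in how items are dealt to the branches. As a rough illustration in Python (not StreamIt; the helper names `split_roundrobin` and `join_roundrobin` are hypothetical, and a row-major N x N block is assumed), a weight of N hands each branch one row, while a weight of 1 hands each branch one column:

```python
def split_roundrobin(items, n_branches, weight):
    """Deal `weight` consecutive items to each branch in turn."""
    branches = [[] for _ in range(n_branches)]
    for start in range(0, len(items), weight):
        branch = (start // weight) % n_branches
        branches[branch].extend(items[start:start + weight])
    return branches


def join_roundrobin(branches, weight):
    """Collect `weight` items from each branch in turn until all are drained."""
    out = []
    cursors = [0] * len(branches)
    while any(c < len(b) for c, b in zip(cursors, branches)):
        for i, b in enumerate(branches):
            out.extend(b[cursors[i]:cursors[i] + weight])
            cursors[i] += weight
    return out


if __name__ == "__main__":
    block = list(range(9))  # a 3 x 3 block stored row by row
    print(split_roundrobin(block, 3, 3))  # one row per branch
    print(split_roundrobin(block, 3, 1))  # one column per branch
```

Joining with the same weight that was used for splitting restores the original item order, so the two splitjoins can be chained in a pipeline as in 2D_iDCT.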

Teleport Messaging
• Avoids muddling data streams with control-relevant information
• Localized interactions in large applications
• A scalable alternative to global variables or excessive parameter passing
[Figure: stream graph with VLD, IQ, three MC filters, and Order]

Teleport Messaging Overview
• Looks like method call, but timed relative to data in the stream
• Simple and precise for user
• Exposes dependences to compiler
• Adjustable latency
• Can send upstream or downstream

Receiver:
void setPictureType(int p) {
  reconfigure(p);
}

Sender:
TargetFilter x;
if (newPictureType(p)) {
  x.setPictureType(p) @ 0;
}
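
To make "timed relative to data in the stream" concrete, here is a deliberately simplified Python model, not StreamIt's actual delivery rule. It assumes a downstream receiver, one item in and one item out per firing, and that a message sent with latency k while the sender handles item i takes effect just before the receiver handles item i + k. The class and function names are hypothetical.

```python
class TargetFilter:
    """Receiver whose output depends on the current picture type."""

    def __init__(self):
        self.picture_type = 0
        self.pending = {}  # item index -> picture type to apply before that item

    def set_picture_type(self, p):
        self.picture_type = p

    def work(self, index, item):
        # Deliver any message that is due before processing this item.
        if index in self.pending:
            self.set_picture_type(self.pending.pop(index))
        return (item, self.picture_type)


def run(stream, latency=0):
    """stream: list of (item, new_picture_type or None) as seen by the sender."""
    target = TargetFilter()
    channel = []
    # The sender runs arbitrarily far ahead of the receiver...
    for index, (item, new_type) in enumerate(stream):
        if new_type is not None:
            # Stands in for: x.setPictureType(p) @ latency
            target.pending[index + latency] = new_type
        channel.append((index, item))
    # ...yet each message still lines up with the data it refers to.
    return [target.work(index, item) for index, item in channel]


if __name__ == "__main__":
    print(run([("a", None), ("b", 1), ("c", None)], latency=0))
```

The point of the sketch is that the result does not depend on how far ahead the sender executes, which is what lets the compiler reorder and parallelize filters without changing message timing.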

Messaging Equivalent in C
[Figure: structure of the C MPEG decoder. Labels shown: The MPEG Bitstream, File Parsing, Global Variable Space, Decode Picture, Decode Macroblock, Decode Block, ZigZagUnordering, Inverse Quantization, Saturate, IDCT, Decode Motion Vectors, Motion Compensation, Motion Compensation For Single Channel, Frame Reordering, Output Video]

Common Machine Languages
[Table residue: columns "Unicores" and "Multicores"; compiler tasks listed: Register Allocation, Instruction Selection, Instruction Scheduling]
• von Neumann languages represent the common properties and abstract away the differences

Bridging the Abstraction Layers
• StreamIt exposes the data movement
• Graph structure is architecture independent
• StreamIt exposes the parallelism
• Explicit task parallelism
• Implicit but inherent data and pipeline parallelism
• Each multicore is different in granularity and topology
• Communication is exposed to the compiler
• The compiler needs to efficiently bridge the abstraction
• Map the computation and communication pattern of the program to the cores, memory and the communication substrate

Types of Parallelism
Task Parallelism (traditionally thread fork/join)
• Parallelism explicit in algorithm
• Between filters without producer/consumer relationship
Data Parallelism
• Peel iterations of filter, place within scatter/gather pair (fission)
• Can’t parallelize filters with state
Pipeline Parallelism
• Between producers and consumers
• Stateful filters can be parallelized
[Figure: stream graph between Scatter and Gather nodes, with the task-parallel branches labeled "Task"]

Types of Parallelism
Task Parallelism (traditionally thread fork/join)
• Parallelism explicit in algorithm
• Between filters without producer/consumer relationship
Data Parallelism (traditionally data parallel loops)
• Between iterations of a stateless filter
• Place within scatter/gather pair (fission)
• Can’t parallelize filters with state
Pipeline Parallelism (traditionally in hardware)
• Between producers and consumers
• Stateful filters can be parallelized
[Figure: stream graph with Scatter and Gather nodes, with regions labeled "Task", "Data", "Data Parallel", and "Pipeline"]
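
Why fission works for stateless filters but not for stateful ones can be seen in a few lines. The following Python sketch is an illustration only (the helper names are hypothetical and the copies run one after another rather than on separate cores): it scatters items round-robin over several fresh copies of a filter, gathers the results round-robin, and compares against a single sequential copy.

```python
def run_filter(make_work, items):
    """Run one fresh copy of a filter over the items, in order."""
    work = make_work()
    return [work(x) for x in items]


def fiss(make_work, items, ways):
    """Scatter round-robin to `ways` fresh copies, then gather round-robin."""
    parts = [items[i::ways] for i in range(ways)]
    results = [run_filter(make_work, part) for part in parts]
    return [results[i % ways][i // ways] for i in range(len(items))]


def make_scale():
    """Stateless filter: each output depends only on the current input."""
    return lambda x: 2 * x


def make_accumulate():
    """Stateful filter: a running sum carried across firings."""
    total = 0

    def work(x):
        nonlocal total
        total += x
        return total

    return work


if __name__ == "__main__":
    data = [1, 2, 3, 4, 5, 6]
    print(fiss(make_scale, data, 4) == run_filter(make_scale, data))
    print(fiss(make_accumulate, data, 2) == run_filter(make_accumulate, data))
```

The stateless filter gives the same output under fission, while each copy of the stateful filter sees only part of the history and produces different results.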

Problem Statement
Given:
• Stream graph with compute and communication estimate for each filter
• Computation and communication resources of the target machine
Find:
• Schedule of execution for the filters that best utilizes the available parallelism to fit the machine resources

Baseline 1: Task Parallelism
• Inherent task parallelism between two processing pipelines
• Task Parallel Model:
• Only parallelize explicit task parallelism
• Fork/join parallelism
• Executing this on a 2-core machine gives ~2x speedup over a single core
• What about 4, 16, 1024, … cores?
[Figure: Splitter -> two parallel pipelines, each BandPass -> Compress -> Process -> Expand -> BandStop -> Joiner -> Adder]

Evaluation: Task Parallelism
• Raw Microprocessor
• 16 in-order, single-issue cores with D$ and I$
• 16 memory banks, each bank with DMA
• Cycle-accurate simulator
• Parallelism: Not matched to target!
• Synchronization: Not matched to target!

Baseline 2: Fine-Grained Data Parallelism
• Each of the filters in the example is stateless
• Fine-grained Data Parallel Model:
• Fiss each stateless filter N ways (N is number of cores)
• Remove scatter/gather if possible
• We can introduce data parallelism
• Example: 4 cores
• Each fission group occupies entire machine
[Figure: the two-pipeline graph with every filter (BandPass, Compress, Process, Expand, BandStop, Adder) fissed into parallel copies between its own Splitter and Joiner]

Evaluation: Fine-Grained Data Parallelism
• Good Parallelism!
• Too Much Synchronization!

Phase 1: Coarsen the Stream Graph
• Before data-parallelism is exploited
• Fuse stateless pipelines as much as possible without introducing state
• Don’t fuse stateless with stateful
• Don’t fuse a peeking filter with anything upstream
[Figure: Splitter -> two parallel pipelines, each BandPass (Peek) -> Compress -> Process -> Expand -> BandStop (Peek) -> Joiner -> Adder]

Phase 1: Coarsen the Stream Graph
• Before data-parallelism is exploited
• Fuse stateless pipelines as much as possible without introducing state
• Don’t fuse stateless with stateful
• Don’t fuse a peeking filter with anything upstream
• Benefits:
• Reduces global communication and synchronization
• Exposes inter-node optimization opportunities
[Figure: coarsened graph. Splitter -> two parallel branches, each a fused BandPass/Compress/Process/Expand filter followed by BandStop -> Joiner -> Adder]
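
The fusion rules on this slide can be sketched as a greedy left-to-right pass over one pipeline. This Python sketch is an assumption-laden illustration, not the StreamIt compiler's coarsening pass: the `Filter` record and its `stateful` and `peeks` flags are hypothetical, and only the two "don't fuse" rules from the slide are modeled.

```python
from dataclasses import dataclass


@dataclass
class Filter:
    name: str
    stateful: bool = False
    peeks: bool = False


def coarsen(pipeline):
    """Group consecutive filters into fused filters, following the slide's rules."""
    groups = []
    for f in pipeline:
        can_extend = (
            bool(groups)
            and not f.stateful                           # don't fuse stateless with stateful
            and not f.peeks                              # don't fuse a peeking filter with anything upstream
            and not any(g.stateful for g in groups[-1])  # keep stateful filters isolated
        )
        if can_extend:
            groups[-1].append(f)
        else:
            groups.append([f])
    return groups


if __name__ == "__main__":
    branch = [
        Filter("BandPass", peeks=True),
        Filter("Compress"),
        Filter("Process"),
        Filter("Expand"),
        Filter("BandStop", peeks=True),
    ]
    print([[f.name for f in g] for g in coarsen(branch)])
```

On one branch of the example graph this yields a fused BandPass/Compress/Process/Expand filter followed by a separate BandStop, matching the coarsened graph on the slide.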

Phase 2: Data Parallelize
• Data parallelize for 4 cores
• Adder: fiss 4 ways, to occupy entire chip
[Figure: coarsened graph with the Adder replaced by a Splitter, four Adder copies, and a Joiner]

Phase 2: Data Parallelize
• Data parallelize for 4 cores
• Task parallelism! Each fused filter does equal work
• Fiss each filter 2 times to occupy entire chip
[Figure: each of the two fused BandPass/Compress/Process/Expand filters fissed 2 ways between its own Splitter and Joiner, followed by the BandStop filters, the Joiner, and the 4-way fissed Adder]

Phase 2: Data Parallelize
• Data parallelize for 4 cores
• Task-conscious data parallelization
• Preserve task parallelism
• Benefits:
• Reduces global communication and synchronization
• BandStop: task parallelism, each filter does equal work
• Fiss each filter 2 times to occupy entire chip
[Figure: both the fused BandPass/Compress/Process/Expand filters and the BandStop filters fissed 2 ways on each branch, followed by the Joiner and the 4-way fissed Adder]

Evaluation: Coarse-Grained Data Parallelism
• Good Parallelism!
• Low Synchronization!

Simplified Vocoder
• Target a 4 core machine
[Figure: stream graph with work estimates. Splitter -> AdaptDFT (6) and AdaptDFT (6) -> Joiner -> RectPolar (20), data parallel -> Splitter -> two pipelines, each Unwrap (2) -> Diff (1) -> Amplify (1) -> Accum (1), data parallel but too little work -> Joiner -> PolarRect (20), data parallel]

Data Parallelize
• Target a 4 core machine
[Figure: the Vocoder graph with RectPolar (20) and PolarRect (20) each fissed into parallel copies of work 5 between a Splitter and a Joiner; AdaptDFT (6, 6) and the Unwrap (2) -> Diff (1) -> Amplify (1) -> Accum (1) pipelines are unchanged]

Data + Task Parallel Execution
• Target 4 core machine
[Figure: schedule of the data-parallelized Vocoder graph, cores vs. time; one steady state takes 21 time units]

We Can Do Better!
• Target 4 core machine
[Figure: the same work packed more tightly, cores vs. time; one steady state takes 16 time units]

Phase 3: Coarse-Grained Software Pipelining
• New steady-state is free of dependencies
• Schedule new steady-state using a greedy partitioning
[Figure: RectPolar executions from successive iterations, divided into a Prologue and a New Steady State]

Greedy Partitioning
• Target 4 core machine
[Figure: filters to schedule packed onto the cores, cores vs. time; the steady state takes 16 time units]
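
The 21 and 16 on the Vocoder slides can be checked with a short calculation. The sketch below is Python, not the StreamIt compiler's scheduler. It assumes the work estimates read off the slides (AdaptDFT 6, RectPolar and PolarRect 20 each and fissed 4 ways into pieces of 5, Unwrap 2, Diff, Amplify and Accum 1 each), models the data + task parallel schedule as stages that each wait for their slowest member, and uses a longest-first greedy heuristic, which is an assumption since the slides only say "greedy partitioning". All function and variable names are hypothetical.

```python
import heapq

PIPE = [2, 1, 1, 1]  # Unwrap, Diff, Amplify, Accum

# Data + task parallel: stages run one after another on 4 cores.
STAGES = [
    [6, 6],                  # two AdaptDFT filters
    [5, 5, 5, 5],            # RectPolar fissed 4 ways
    [sum(PIPE), sum(PIPE)],  # two Unwrap..Accum pipelines
    [5, 5, 5, 5],            # PolarRect fissed 4 ways
]

# After software pipelining the steady state has no dependencies,
# so every filter is an independent unit of work.
WORK = [6, 6] + [5] * 4 + PIPE * 2 + [5] * 4


def staged_time(stages):
    """Each stage takes as long as its slowest member."""
    return sum(max(stage) for stage in stages)


def greedy_partition(work, cores):
    """Assign the largest remaining item to the least-loaded core."""
    heap = [(0, c) for c in range(cores)]
    assignment = [[] for _ in range(cores)]
    for w in sorted(work, reverse=True):
        load, c = heapq.heappop(heap)
        assignment[c].append(w)
        heapq.heappush(heap, (load + w, c))
    return assignment


def makespan(assignment):
    return max(sum(core) for core in assignment)


if __name__ == "__main__":
    print(staged_time(STAGES))
    print(makespan(greedy_partition(WORK, 4)))
```

Under these assumptions the staged schedule takes 21 time units and the greedy packing takes 16, which is also the lower bound for 62 units of work on 4 cores.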

Evaluation: Coarse-Grained Task + Data + Software Pipelining
• Best Parallelism!
• Lowest Synchronization!

Conclusions
• Computer architecture is at a crossroads
• Once-in-a-lifetime opportunity to redesign from scratch
• How to use the Moore’s law gains to improve programmability?
• Switching to multicores without losing the gains in programmer productivity may be the Grandest of the Grand Challenges
• Half a century of work, still no winning solution
• Will affect everyone!
• Streaming programming model
• Can break the von Neumann bottleneck
• A natural fit for a large class of applications
• An ideal machine language for multicores
• Compiler can extract explicit and inherent parallelism
• Parallelism is abstracted away from architectural details of multicores
• Sustainable speedups (5x to 19x on the 16 core Raw)
• Increased abstraction does not have to sacrifice performance
http://cag.csail.mit.edu/commit/
