
A Case for Teaching Parallel Programming to Freshmen


Presentation Transcript


  1. A Case for Teaching Parallel Programming to Freshmen Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology Workshop on Directions in Multicore Programming Education, Washington D.C. March 8, 2009

  2. One view of parallel programming
     • Multicores are coming (indeed, have already come)
     • Performance gains are no longer automatic and transparent
     • Most programmers have never written a parallel program
     • There are different models for exploiting parallelism, depending upon the application: data parallel, threads, transactional memory (TM), Map-Reduce, …
     The questions, in this view: How do I migrate my software? How do I get performance? How do I educate my programmers? It is all about performance.

  3. Another view of parallel programming
     • Every gadget is concurrent and reactive
       - Many weakly interrelated tasks happen concurrently: a cell phone plays music, receives calls, and browses the web
     • Hitherto independent programs are required to interact
       - What should the music player do when you are browsing the web?
       - Ambiguous specs: it is not clear a priori what a user wants
     • The infrastructure is a parallel database for processing queries and commands
       - It must scale to deal with ever-increasing queries
       - The database is more than just records: many streams of data are constantly being fed in
       - Each interaction requires many queries and transactions
     Even though the substrate is multicore, performance is a secondary issue. Parallelism is obvious, but interactions between modules can be complex even when infrequent.

  4. My take
     • Modeling, simulating, and programming parallel and concurrent systems is a more fundamental problem than how to make use of multicores efficiently
     • Freshman teaching should focus on composing parallel programs; sequential programming should be taught (perhaps) as a way of writing the modules to be composed
     Within a few years, multicores will be viewed as a transparent way of simplifying and speeding up parallel programs (not very different from the way we used to view computers with faster clocks).

  5. The remainder of the talk
     • Parallel programming can be simpler than sequential programming for inherently parallel computations
     • Some untested ideas on what we should teach freshmen

  6. Parallel programming can be easier than sequential programming

  7. H.264 Video Decoder
     [Block diagram: Compressed Bits → NAL unwrap → Parse + CAVLC → Inverse Quant/Transformation → Inter Prediction and Intra Prediction → Deblock Filter → Frames, with Ref Frames fed back into Inter Prediction]
     Different requirements for different environments:
     - QVGA 320x240p (30 fps)
     - DVD 720x480p
     - HD DVD 1280x720p (60-75 fps)
     May be implemented in hardware or software depending upon ...

  8. Sequential code from ffmpeg (20K lines of C out of 200K)
     The pipeline stages (NAL, Parse, IQ/IT, Inter-Predict, Intra-Predict, Deblocking) are stepped one at a time; the programmer is forced to choose a sequential order of evaluation and write the code accordingly (non-trivial):

        void h264decode() {
          int stage = S_NAL;
          while (!eof()) {
            createdOutput = 0;
            stallFromInterPred = 0;
            switch (stage) {
              case S_NAL:     try_NAL();
                              stage = createdOutput ? S_Parse : S_NAL;
                              break;
              case S_Parse:   try_Parse();
                              stage = createdOutput ? S_IQIT : S_NAL;
                              break;
              case S_IQIT:    try_IQIT();
                              stage = createdOutput ? S_Parse : S_Inter;
                              break;
              case S_Inter:   try_Inter();
                              stage = createdOutput ? S_IQIT : S_Intra;
                              stage = stallFromInterPred ? S_Deblock : S_Intra;
                              break;
              case S_Intra:   try_Intra();
                              stage = createdOutput ? S_Inter : S_Deblock;
                              break;
              case S_Deblock: try_deblock();
                              stage = S_Intra;
                              break;
            }
          }
        }

  9. Price of obscuring the parallelism
     • Program structure is difficult to understand
     • Packets are kept and modified in a global heap, which has nothing to do with the logical structure
     • Unscrambling the over-specified control structure for parallelization is beyond the capability of current compiler techniques
     Thread-level data parallelism?

  10. Sleeping threads
      One (p)thread per block (NAL, Parse, IQ/IT, Inter-predict, Intra-predict, Deblock), scheduled by the OS onto P processors, but with no control over the mapping:

        int main() {
          pthread_create(&t_nal,     NULL, NAL,       NULL);
          pthread_create(&t_parse,   NULL, Parse,     NULL);
          pthread_create(&t_iqit,    NULL, IQIT,      NULL);
          pthread_create(&t_inter,   NULL, Interpred, NULL);
          pthread_create(&t_intra,   NULL, Intrapred, NULL);
          pthread_create(&t_deblock, NULL, Deblock,   NULL);
        }

      This is an implementation model.

  11. StreamIt (Amarasinghe & Thies): a more natural expression using filters

        bit->frame pipeline H264Decode {
          add NAL();
          add Parse();
          add IQIT();
          add feedbackloop {
            join roundrobin;
            body pipeline {
              add InterPredict();
              add IntraPredict();
              add Deblock();
            }
            split roundrobin;
          }
        }

      The feedback is problematic! But given the required rates, the StreamIt compiler can do a great job of generating efficient code.

  12. Functional languages (pH): natural expression of parallelism, but too general

        do_H264 :: Stream Chunk -> Stream Frame
        do_H264 inputStream = let
            fMem :: IStructFrameMem MacroBlock
            fMem = makeIStructureMemory
            nalStream     = nal inputStream
            parseStream   = parse nalStream
            iqitStream    = iqit parseStream
            interStream   = inter iqitStream fMem
            intraStream   = intra interStream
            deblockStream = deblock intraStream fMem
          in deblockStream

      FLs provide a solid base for building domain-specific parallel languages, but the language gives neither the programmer nor the compiler any hint about the level of granularity at which the parallelism should be exploited.

  13. An idea we are testing: hardware-design-inspired parallel programming

  14. Hardware-design inspiration
      • Hardware is all about parallelism, but there is no virtualization of resources
        - If one asks for two adders, one gets two adders; to do more than two additions at a time, the adders must be time-multiplexed explicitly
      • Two-level compilation model
        - One can do a design with n adders, but at some stage of compilation n must be specified (instantiated) to generate hardware; each instantiation of n results in a different design
      Analogy: in software one may want to instantiate different code for different problem sizes or different machine configurations (see the sketch below).
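      A minimal sketch of the two-level idea in C (an editorial illustration, not from the talk; the name N_ADDERS and the toy vector-sum kernel are invented for this example). The number of "adders" is fixed at instantiation (compile) time, and work beyond that number is time-multiplexed explicitly, as in hardware; each choice of N_ADDERS yields a different "design":

        #include <stdio.h>

        /* The hardware-style parameter n: fixed at compile time,
           e.g. cc -DN_ADDERS=4 sum.c */
        #ifndef N_ADDERS
        #define N_ADDERS 2
        #endif

        /* Sum len numbers using exactly N_ADDERS accumulators. Each
           accumulator models one physical adder; inputs beyond
           N_ADDERS are multiplexed onto them explicitly. */
        long sum(const long *xs, int len) {
            long acc[N_ADDERS] = {0};
            for (int i = 0; i < len; i++)
                acc[i % N_ADDERS] += xs[i];   /* explicit time multiplexing */
            long total = 0;
            for (int i = 0; i < N_ADDERS; i++)
                total += acc[i];              /* final reduction */
            return total;
        }

        int main(void) {
            long xs[] = {1, 2, 3, 4, 5, 6, 7, 8};
            printf("%ld\n", sum(xs, 8));      /* prints 36 */
            return 0;
        }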

  15. H.264 in Bluespec

        module mkH264 (IH264);
          // Instantiate the modules
          Nal nal <- mkNalUnwrap();
          ...
          DeblockFilter deblock <- mkDeblockFilter();
          FrameMemory frameB <- mkFrameMemoryBuffer();
          // Connect the modules
          mkConnection(nal.out, parse.in);
          mkConnection(parse.out, iqit.in);
          ...
          mkConnection(deblock.mem_client, frameB.mem_writer);
          mkConnection(inter_pred.mem_client, frameB.mem_reader);
          interface in  = nal.in;       // Input goes straight to NAL
          interface out = deblock.out;  // Output comes from Deblock
        endmodule

      • Modularity and dataflow are obvious
      • No sharing of resources
      • No time-multiplexing issue if each module is mapped onto a separate core

  16. H.264 Decoder in Bluespec (Elliott Fleming, Chun Chieh Lin)
      [Same block diagram as slide 7: Compressed Bits → NAL unwrap → Parse + CAVLC → Inverse Quant/Transformation → Inter/Intra Prediction → Deblock Filter → Frames, with Ref Frames fed back into Inter Prediction]
      • 8K lines of Bluespec
      • Decodes 1080p @ 70 fps
      • Area 4.4 mm² (180 nm)
      Are there ideas worth carrying over to parallel SW?
      • Behaviors of modules are composable
      • Each module can be refined separately
      • Any module can be compiled in SW

  17. What should we teach freshmen?

  18. General guidelines
      • Make it easy to express the parallelism present in the application
        - No unnecessary sequentialization
        - No forced grouping of logically separate memories (see the sketch below)
      • Separate out, and deemphasize, the issue of restructuring code for better sequential performance
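      To make "no forced grouping of logically separate memories" concrete, here is a small editorial sketch in C with pthreads (the task names echo the cell-phone example of slide 3; none of this code is from the talk). Each task owns its own state, so the two threads share nothing and the parallelism stays visible:

        #include <pthread.h>
        #include <stdio.h>

        /* Logically separate memories: each task owns its own counter. */
        long music_frames = 0;   /* written only by the music task */
        long web_bytes    = 0;   /* written only by the web task   */

        void *music_task(void *arg) {
            for (int i = 0; i < 1000000; i++)
                music_frames++;          /* touches only its own state */
            return NULL;
        }

        void *web_task(void *arg) {
            for (int i = 0; i < 1000000; i++)
                web_bytes++;             /* touches only its own state */
            return NULL;
        }

        int main(void) {
            pthread_t m, w;
            pthread_create(&m, NULL, music_task, NULL);
            pthread_create(&w, NULL, web_task, NULL);
            pthread_join(m, NULL);
            pthread_join(w, NULL);
            printf("%ld %ld\n", music_frames, web_bytes);
            return 0;
        }

      No lock is needed because no memory is shared; merging the two counters into one shared record, as the ffmpeg decoder does with its global heap, would force a lock or a sequential order that the problem itself never required.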

  19. Topics
      • Finite state machines
        - Choose problems that have a natural solution as an FSM
        - Show composition and interaction of parallel FSMs
      • Dataflow networks with unbounded and bounded edges
        - Show programming of nodes in a sequential language with blocking sends and receives (see the sketch below)
      • Types, modularity, data structures, etc. are important topics but orthogonal to parallelism; these topics should be taught all the time
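      For the dataflow topic, here is an editorial sketch in C with pthreads of a two-node network connected by one bounded edge (the names chan_t, producer, and consumer are invented for this example). Each node is ordinary sequential code; chan_send blocks while the edge is full and chan_recv blocks while it is empty, which is all the synchronization the nodes ever see:

        #include <pthread.h>
        #include <stdio.h>

        #define CAP 4                     /* capacity of the bounded edge */

        typedef struct {                  /* a bounded FIFO edge */
            int buf[CAP];
            int head, tail, count;
            pthread_mutex_t lock;
            pthread_cond_t not_full, not_empty;
        } chan_t;

        void chan_init(chan_t *c) {
            c->head = c->tail = c->count = 0;
            pthread_mutex_init(&c->lock, NULL);
            pthread_cond_init(&c->not_full, NULL);
            pthread_cond_init(&c->not_empty, NULL);
        }

        void chan_send(chan_t *c, int v) {   /* blocks while edge is full */
            pthread_mutex_lock(&c->lock);
            while (c->count == CAP)
                pthread_cond_wait(&c->not_full, &c->lock);
            c->buf[c->tail] = v;
            c->tail = (c->tail + 1) % CAP;
            c->count++;
            pthread_cond_signal(&c->not_empty);
            pthread_mutex_unlock(&c->lock);
        }

        int chan_recv(chan_t *c) {           /* blocks while edge is empty */
            pthread_mutex_lock(&c->lock);
            while (c->count == 0)
                pthread_cond_wait(&c->not_empty, &c->lock);
            int v = c->buf[c->head];
            c->head = (c->head + 1) % CAP;
            c->count--;
            pthread_cond_signal(&c->not_full);
            pthread_mutex_unlock(&c->lock);
            return v;
        }

        chan_t edge;

        void *producer(void *arg) {          /* node 1: plain sequential code */
            for (int i = 0; i < 10; i++)
                chan_send(&edge, i * i);
            chan_send(&edge, -1);            /* end-of-stream marker */
            return NULL;
        }

        void *consumer(void *arg) {          /* node 2: plain sequential code */
            for (int v; (v = chan_recv(&edge)) != -1; )
                printf("%d\n", v);
            return NULL;
        }

        int main(void) {
            chan_init(&edge);
            pthread_t p, q;
            pthread_create(&p, NULL, producer, NULL);
            pthread_create(&q, NULL, consumer, NULL);
            pthread_join(p, NULL);
            pthread_join(q, NULL);
            return 0;
        }

      The nodes never mention threads, locks, or mapping; bounding the edge at CAP is what makes the network implementable with finite memory.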

  20. Some challenges
      • No appropriate language or tools
      • Need to think up new illustrative problems from the ground up: Fibonacci, "Hello world", and matrix multiply won't do

  21. Takeaway
      • Parallel programming is not a special topic in programming
      • Parallel programming is programming
      • Sequential and parallel programming can be introduced together
      • Parallel thinking is as natural as sequential thinking
      Thanks!

  22. Zero-cost parameterization. Example: OFDM-based protocols ((Alfred) Man Chuek Ng, …)
      [Block diagram of a generic OFDM transceiver. TX path, under a TX Controller: MAC → Scrambler → FEC Encoder → Interleaver → Mapper → Pilot & Guard Insertion → IFFT → CP Insertion → D/A. RX path, under an RX Controller: A/D → Synchronizer → S/P → FFT → Channel Estimator → De-Mapper → De-Interleaver → FEC Decoder → De-Scrambler → MAC. Blocks are marked as either standard-specific or potential reuse.]
      • Reusable algorithms with different parameter settings
        - (I)FFT: WiFi 64-pt @ 0.25 MHz; WiMAX 256-pt @ 0.03 MHz; WUSB 128-pt @ 8 MHz
        - FEC: WiFi x^7+x^4+1 convolutional; WiMAX x^15+x^14+1 Reed-Solomon; WUSB x^15+x^14+1 Turbo
      • Different throughput requirements and different algorithms
      • 85% reusable code between WiFi and WiMAX
      • From WiFi to WiMAX in 4 weeks
