
Algorithms-based extension of serial computing education to parallelism


Presentation Transcript


  1. Algorithms-based extension of serial computing education to parallelism
  Uzi Vishkin
  • Using Simple Abstraction to Reinvent Computing for Parallelism, CACM, January 2011, pp. 75-85
  • http://www.umiacs.umd.edu/users/vishkin/XMT/

  2. Commodity computer systems
  If you want your program to run significantly faster … you’re going to have to parallelize it → parallelism is the only game in town.
  But what about the programmer?
  • “The Trouble with Multicore: Chipmakers are busy designing microprocessors that most programmers can't handle” —D. Patterson, IEEE Spectrum, 7/2010
  • “Only heroic programmers can exploit the vast parallelism in current machines” —Report by CSTB, U.S. National Academies, 12/2010
  A San Antonio spin: where would Mr. Maverick be on this issue? Conform with things that do not really work?!

  3. Parallel Random-Access Machine/Model (PRAM)
  • n synchronous processors, all having unit-time access to a shared memory.
  • Each processor also has a local memory.
  • At each time unit, a processor can:
  1. write into the shared memory (i.e., copy one of its local memory registers into a shared memory cell),
  2. read from the shared memory (i.e., copy a shared memory cell into one of its local memory registers), or
  3. do some computation with respect to its local memory.
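  To make the model concrete, here is a minimal plain-C sketch of one synchronous PRAM time unit. The processor count, array names, and the increment computation are illustrative assumptions, not part of the model:

    #include <stddef.h>

    #define N 8                 /* number of processors (illustrative) */

    int shared_mem[N];          /* shared memory cells */
    int local_reg[N];           /* one local register per processor */

    /* Simulate one synchronous time unit in which every processor i
     * reads shared cell i, computes locally, and writes the result
     * back. On a real PRAM all N processors act in the same time unit;
     * the serial loops here only stand in for that lockstep behavior. */
    void pram_round(void) {
        for (size_t i = 0; i < N; i++)      /* read phase */
            local_reg[i] = shared_mem[i];
        for (size_t i = 0; i < N; i++)      /* local compute phase */
            local_reg[i] = local_reg[i] + 1;
        for (size_t i = 0; i < N; i++)      /* write phase */
            shared_mem[i] = local_reg[i];
    }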

  4. So, an algorithm in the PRAM model is presented in terms of a sequence of parallel time units (or “rounds”, or “pulses”); we allow p instructions to be performed at each time unit, one per processor; this means that a time unit consists of a sequence of exactly p instructions to be performed concurrently.
  SV-MaxFlow-82: way too difficult. Contrast, e.g., TCPP 12/2010: “simplest parallel model”.
  Two drawbacks to the PRAM mode:
  (i) It does not reveal how the algorithm will run on PRAMs with a different number of processors; e.g., to what extent will more processors speed the computation, or fewer processors slow it?
  (ii) Fully specifying the allocation of instructions to processors requires a level of detail which might be unnecessary (e.g., a compiler may be able to extract it from lesser detail).
  1st round of discounts ..

  5. Work-Depth presentation of algorithms
  Work-Depth algorithms are also presented as a sequence of parallel time units (or “rounds”, or “pulses”); however, each time unit consists of a sequence of instructions to be performed concurrently, and that sequence may include any number of instructions.
  Why is this enough? See J-92, KKT01, or my class notes.
  SV-MaxFlow-82: still way too difficult.
  Drawback to the WD mode: fully specifying the serial number of each instruction requires a level of detail that may be added later.
  2nd round of discounts ..
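  Why is Work-Depth enough? The scheduling argument behind the J-92/KKT01 answer, spelled out here for completeness: if round t of a WD algorithm contains W_t concurrent operations, a p-processor PRAM can execute that round in ceil(W_t/p) time units, so for depth D(n) and total work W(n):

    T_p(n) \le \sum_{t=1}^{D(n)} \left\lceil \frac{W_t}{p} \right\rceil
           \le \frac{W(n)}{p} + D(n),
    \qquad \text{where } W(n) = \sum_{t=1}^{D(n)} W_t .

  That is, the processor allocation that WD defers costs at most an additive D(n) beyond the ideal W(n)/p.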

  6. Informal Work-Depth (IWD) description
  Similar to Work-Depth, the algorithm is presented in terms of a sequence of parallel time units (or “rounds”); however, at each time unit there is a set containing a number of instructions to be performed concurrently.
  ‘ICE’: descriptions of the set of concurrent instructions can come in many flavors, even implicit ones, where the number of instructions is not obvious.
  The main methodical issue addressed here is how to train CS&E professionals “to think in parallel”. Here is the informal answer: train yourself to provide IWD descriptions of parallel algorithms. The rest is detail (although important) that can be acquired as a skill, by training (perhaps with tools).
  Why is this enough for PRAM? See J-92, KKT01, or my class notes.

  7. Example of Parallel ‘PRAM-like’ Algorithm
  Input: (i) All world airports. (ii) For each airport, all its non-stop flights.
  Find: the smallest number of flights from DCA to every other airport.
  Basic (actually parallel) algorithm:
  Step i: For all airports requiring i-1 flights
            For all their outgoing flights
              Mark (concurrently!) all “yet unvisited” airports as requiring i flights (note the nesting)
  Serial: forces a queue; O(T) time, where T is the total number of flights.
  Parallel: parallel data structures. Inherent serialization: S. Gain relative to serial: (first cut) ~T/S! Decisive also relative to coarse-grained parallelism.
  Note: (i) “Concurrently”, as in natural BFS, is the only change to the serial algorithm. (ii) No “decomposition”/“partition”.
  Mental effort of PRAM-like programming:
  1. sometimes easier than serial;
  2. considerably easier than for any parallel computer currently sold. Understanding falls within the common denominator of other approaches.
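  A C sketch of Step i above, with the concurrent loops marked in comments; all names (bfs_layer, level, flights_from, ...) are illustrative assumptions rather than code from the talk:

    /* One layer of the flights BFS. level[a] == -1 means airport a
     * is "yet unvisited"; level[a] == k means a needs k flights. */
    void bfs_layer(int i, int num_airports,
                   int *level,              /* per-airport flight count */
                   int **flights_from,      /* destinations per airport */
                   int *num_flights_from) { /* out-degree per airport   */
        /* Both loops below are the nested "For all ..." of the slide:
         * on a PRAM, all their iterations run concurrently. */
        for (int a = 0; a < num_airports; a++) {
            if (level[a] != i - 1) continue;     /* airports needing i-1 flights */
            for (int f = 0; f < num_flights_from[a]; f++) {
                int b = flights_from[a][f];
                if (level[b] == -1)              /* "yet unvisited" */
                    level[b] = i;                /* concurrent writers all write i,
                                                    so any one may win (arbitrary CRCW) */
            }
        }
    }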

  8. Elements in my education platform
  • Identify ‘thinking in parallel’ with the basic abstraction behind the SV82b work-depth framework. Note: adopted as the presentation framework in the PRAM algorithms texts J92 and KKT01.
  • Teach as much PRAM algorithmics as the timing and developmental stage of the students permit; extensive ‘dry’ theory homework is required from graduate students, but little from high-school students.
  • Students self-study programming in XMTC (standard C plus 2 commands, spawn and prefix-sum; see the sketch after this slide) and do demanding programming assignments.
  • Provide a programmer’s workflow that links the simple PRAM abstraction with XMTC (even tuned) programming. The synchronous PRAM provides ease of algorithm design and of reasoning about correctness and complexity. Multi-threaded programming relaxes this synchrony for implementation. Since reasoning directly about the soundness and performance of multi-threaded code is known to be error-prone, the workflow only tasks the programmer with establishing that the code behavior matches the PRAM-like algorithm.
  • Unlike PRAM, XMTC is far from ignoring locality. Unlike most approaches, XMTC preempts the harm of locality to the programmer’s productivity.
  • If the XMT architecture is presented at all, it comes only at the end of the course; parallel programming should not be made more difficult than serial programming, which does not require architecture knowledge.
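  A minimal sketch of the two XMTC commands mentioned above, following the conventions of the XMT tutorials: spawn(low, high) starts one virtual thread per index, $ is the spawned thread’s ID, and ps(e, base) is an atomic prefix-sum on a psBaseReg base (e receives the old value of base, base is incremented by e). The example compacts the non-zero elements of an array; exact toolchain syntax may differ from this sketch:

    psBaseReg x;                      /* prefix-sum base: next free slot in B */

    /* Copy the non-zero elements of A[0..n-1] to the front of B. */
    void compact(int *A, int *B, int n) {
        x = 0;
        spawn(0, n - 1) {             /* n concurrent virtual threads */
            int e = 1;
            if (A[$] != 0) {
                ps(e, x);             /* atomically: e = old x; x += 1 */
                B[e] = A[$];          /* e is this thread's private slot */
            }
        }                             /* implicit join: all threads finish here */
    }

  The prefix-sum hands each thread a distinct slot in B without locks, which is what lets the PRAM-style algorithm carry over essentially unchanged.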

  9. Where to find a machine that effectively supports such parallel algorithms?
  • Parallel algorithms researchers realized decades ago that the main reason parallel machines are difficult to program has been that bandwidth between processors/memories is limited. Lower bounds: [VW85, MNV94].
  • [BMM94]: 1. HW vendors see the cost benefit of lowering the performance of interconnects, but grossly underestimate the programming difficulties and the high software development costs implied. 2. Their exclusive focus on runtime benchmarks misses critical costs, including (i) the time to write the code, and (ii) the time to port the code to a different distribution of data or to different machines that require a different distribution of data.
  G. Blelloch, B. Maggs & G. Miller. The hidden cost of low bandwidth communication. In Developing a CS Agenda for HPC (Ed. U. Vishkin). ACM Press, 1994.
  • Patterson, CACM 2004: Latency Lags Bandwidth. HP12: as latency improved by 30-80X, bandwidth improved by 10-25KX → isn’t this great news: the cost benefit of low bandwidth is drastically decreasing.
  • Not so fast. Senior HW engineer, 1/2011: Okay, you do have a ‘convenient’ way to do parallel programming; so what’s the big deal?!
  • Commodity HW → decomposition-first programming doctrine → heroic programmers → sigh …
  Has the ‘bandwidth → ease-of-programming’ opportunity got lost? Do we sugarcoat a salty cake instead of ‘returning it to the baker/store’?

  10. Suggested answers in this talk (soft, more like BMM)
  • Fault line. One side: commodity HW. Other side: this ‘convenient way’.
  • ‘Life’ across the fault line → so, what’s the point of heroic programmers?!
  • ‘Every CS major could program’: ‘no way’ vs. promising evidence.
  • Sooner or later, system vendors will see the connection to their bottom line and abandon directions perceived today as hedging one’s bets.

  11. The fault line: is PRAM too easy or too difficult? The BFS example
  BFS is in the TCPP curriculum, 12/2010. But:
  1. XMT/GPU speed-ups: same silicon area, highly parallel input: 5.4X! Small HW configuration, 20-way parallel input: 109X wrt the same GPU.
  Note: BFS on GPUs is the subject of research papers, but the PRAM version is too easy for a paper. Makes one wonder: why work so hard on a GPU?
  2. BFS using OpenMP. Good news: easy coding (since no meaningful decomposition). Bad news: none of the 42 students in the joint F2010 UIUC/UMD course got any speedups (over serial) on an 8-processor SMP machine. So not only was PRAM too easy: there were no speedups. Also BFS …
  Speedups on a 64-processor XMT, using <= 1/4 of the silicon area of the SMP machine, ranged between 7x and 25x → the ‘PRAM is too difficult’ approach worked.
  Makes one wonder: BFS is unavoidable. Can we (professionals/instructors) really defend teaching/using OpenMP for it? Any other commercial approach?

  12. Chronology around the fault line
  Just right: PRAM model [FW77]
  Too easy:
  • ‘Paracomputer’ [Schwartz80]
  • BSP [Valiant90]
  • LOGP [UC-Berkeley93]
  • Map-Reduce. A success, but not manycore.
  • CLRS-09, 3rd edition
  • TCPP curriculum 2010
  • Nearly all parallel machines to date
  • “.. machines that most programmers cannot handle”
  • “Only heroic programmers”
  Too difficult:
  • SV-82 and V-Thesis81
  • PRAM theory (in effect)
  • CLR-90, 1st edition
  • J-92
  • NESL
  • KKT-01
  • XMT97+: supports the rich PRAM algorithms literature
  • V-11
  Nested parallelism: an issue for both sides; e.g., Cilk.
  Current interest in new “computing stacks”: programmer’s model, programming languages, compilers, architectures, etc.
  Merit of the fault-line image: two pillars holding a building (the stack) must be on the same side of a fault line → chipmakers cannot expect a wealth of algorithms and high programmer’s productivity with architectures for which PRAM is too easy (e.g., ones that force programming for decomposition).

  13. Telling a fault line from the surface
  [Slide diagram: at the surface, the two sides are “PRAM too difficult” and “PRAM too easy” (PRAM as “simplest model”, per TCPP, vs. BSP/Cilk); the underlying fault line is in(e/su)fficient effective bandwidth, with the ICE / WD / PRAM layers sitting above it.]
  Old soft claim, e.g., [BMM94]: the hidden cost of low bandwidth.
  New soft claim: the surface (PRAM easy/difficult) reveals the side with respect to the bandwidth fault line.

  14. Ease of teaching/learning benchmark
  Can any CS major program your manycore? Cannot really avoid it!
  Teachability demonstrated so far for XMT [SIGCSE’10]:
  • To a freshman class with 11 non-CS students. Some programming assignments: merge-sort*, integer-sort* & sample-sort.
  Other teachers:
  • A magnet HS teacher. Downloaded the simulator, assignments, and class notes from the XMT page; self-taught. Recommends: teach XMT first. Easiest to set up (simulator), program, and analyze: the ability to anticipate performance (as in serial). Works not just for the embarrassingly parallel. Also teaches OpenMP, MPI, and CUDA. See also the keynote at CS4HS’09@CMU and the interview with the teacher.
  • High-school & middle-school students (some 10-year-olds) from underrepresented groups, taught by a HS math teacher.
  *Also in Nvidia’s Satish, Harris & Garland, IPDPS09.

  15. Middle School Summer Camp Class Picture, July’09 (20 of 22 students)

  16. From the UIUC/UMD questionnaire
  Split between UIUC and UMD students on: did PRAM algorithms help for XMT programming? UMD students: a strong yes. The majority of Illinois students: no. The exposure of UIUC students to PRAM algorithms and XMT programming was more limited. This may demonstrate that students must be exposed to a minimal amount of parallel algorithms and their programming in order to internalize their merit.
  If this conclusion is valid, it creates tension with:
  1. the pressure on instructors of parallel computing courses to cover several programming paradigms along with their required architecture background;
  2. the tendency to teach “parallel computing” as a hodgepodge of topics, jumping from one to another without teaching anything in depth, contrary to many other CS courses.

  17. Not just talking
  Algorithms: PRAM parallel algorithmic theory. “Natural selection”. A latent, though not widespread, knowledge base. “Work-depth”. SV82 conjectured: the rest (the full PRAM algorithm) is just a matter of skill. Lots of evidence that “work-depth” works; used as the framework in the main PRAM algorithms texts: JaJa92, KKT01. Later: programming & workflow.
  PRAM-On-Chip HW prototypes:
  • 64-core, 75MHz FPGA of the XMT (Explicit Multi-Threaded) architecture [SPAA98..CF08]
  • 128-core interconnection network, IBM 90nm: 9mm×5mm, 400 MHz [HotI07]; fundamental work on asynchrony [NOCS’10]
  • FPGA design → ASIC, IBM 90nm: 10mm×10mm, 150 MHz
  • Rudimentary yet stable compiler. The architecture scales to 1000+ cores on-chip.

  18. But what is the performance penalty for easy programming? Surprise: benefit! vs. GPU [HotPar10]
  • 1024-TCU XMT simulations vs. code by others for the GTX280; < 1 is a slowdown. Sought: similar silicon area & same clock.
  • Postscript regarding BFS:
  • 59X if the average parallelism is 20
  • 111X if XMT is … downscaled to 64 TCUs

  19. Problem acronyms
  • BFS: breadth-first search on graphs
  • Bprop: back-propagation machine learning algorithm
  • Conv: image convolution kernel with separable filter
  • Msort: merge-sort algorithm
  • NW: Needleman-Wunsch sequence alignment
  • Reduct: parallel reduction (sum)
  • Spmv: sparse matrix-vector multiplication

  20. New work: biconnectivity
  Not aware of GPU work. A 12-processor SMP: < 4X speedups. TarjanV log-time PRAM algorithm → practical version → significant modification. Their 1st try: on 12 processors, below serial.
  XMT: >9X to <42X speedups, with the TarjanV practical version. More robust across all inputs than BFS, DFS, etc.
  Significance:
  • log-time PRAM graph algorithms are ahead on speedups.
  • The paper makes a similar case for the Shiloach-V log-time connectivity algorithm, which also beats GPUs on both speed-up and ease (a GPU paper versus a grad-course programming assignment; even a couple of 10th graders implemented SV).
  Even newer result: PRAM max-flow (a hybrid of ShiloachV & GoldbergTarjan) provides unprecedented speedups.
