Presentation Transcript


  1. CS 267 Applications of Parallel Computers, Lecture 28: LogP and the Implementation and Modeling of Parallel Sorts. James Demmel (taken from David Culler, Lecture 18, CS267, 1997). http://www.cs.berkeley.edu/~demmel/cs267_Spr99

  2. Practical Performance Target (circa 1992)
     • Sort one billion large keys in one minute on one thousand processors.
     • A good sort on a workstation can do one million keys in about 10 seconds
       – just fits in memory
       – 16-bit radix sort
     • Performance unit: µs per key per processor
       – roughly 10 µs per key for a single SPARC 2
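
  For scale (my arithmetic, not on the original slide): one billion keys in 60 seconds on 1000 processors is 60 s × 1000 / 10^9 keys = 60 µs per key per processor, so the target allows roughly a 6× budget over the ~10 µs per key a single SPARC 2 needs for a purely local sort.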

  3. Studies on Parallel Sorting
     • PRAM sorts and sorting networks (abstract models)
     • Sorting on machine X / sorting on network Y (machine-specific)
     • LogP sorts
     [Diagram: PRAM model (processors sharing one memory) vs. a distributed machine (processor/memory pairs on a network).]

  4. The Study
     • Interesting parallel sorting algorithms (Bitonic, Column, Histo-radix, Sample)
     • Implement in Split-C
     • Analyze under LogP, using parameters measured for the CM-5
     • Estimate execution time; execute on the CM-5
     • Compare ??

  5. LogP

  6. Deriving the LogP Model
     ° Processing
       – powerful microprocessor, large DRAM, cache => P
     ° Communication
       + significant latency (100's of cycles) => L
       + limited bandwidth (1 – 5% of memory bw) => g
       + significant overhead (10's – 100's of cycles), on both ends => o
       – no consensus on topology => should not exploit structure
       + limited capacity
       – no consensus on programming model => should not enforce one

  7. LogP
     • L: latency in sending a (small) message between modules
     • o: overhead felt by the processor on sending or receiving a message
     • g: gap between successive sends or receives (1/BW)
     • P: number of processor/memory modules
     [Diagram: P processor/memory pairs on a limited-volume interconnection network; at most L/g messages in flight to or from any processor.]
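
  As a concrete reference, here is a minimal C sketch (mine, not from the slides; the names logp_t and msg_time are hypothetical) that records the four parameters and prices a single small message as o + L + o:

    #include <stdio.h>

    typedef struct {
        double L;   /* latency: time a small message spends crossing the network (us) */
        double o;   /* overhead: processor busy time to send or receive a message (us) */
        double g;   /* gap: minimum interval between successive messages, 1/BW (us) */
        int    P;   /* number of processor/memory modules */
    } logp_t;

    /* One small message: sender overhead + network latency + receiver overhead. */
    static double msg_time(const logp_t *m) { return m->o + m->L + m->o; }

    int main(void) {
        logp_t cm5 = { 6.0, 2.2, 4.0, 512 };   /* CM-5 values quoted on slide 14 */
        printf("one small message on the CM-5: %.1f us\n", msg_time(&cm5));
        return 0;
    }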

  8. Using the Model
     ° Send n messages from proc to proc in time 2o + L + g(n-1)
       – each processor does o·n cycles of overhead
       – has (g-o)(n-1) + L available compute cycles
     ° Send n messages from one to many in the same time
     ° Send n messages from many to one in the same time
       – all but L/g processors block, so fewer available cycles
     [Diagram: timeline of o, g, and L for a pipelined stream of messages between two processors.]
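
  The pipelining argument on this slide can be written out directly. A hedged C sketch (function names are mine) of the two quantities it quotes, assuming g >= o as in the slide's picture:

    #include <stdio.h>

    /* Time to send n small messages from one processor to another: the first
     * costs o (send) + L + o (receive); each of the remaining n-1 follows one
     * gap g behind its predecessor. */
    double send_n_time(double L, double o, double g, int n) {
        return 2.0 * o + L + g * (n - 1);
    }

    /* Compute cycles the sender has free while the pipeline drains:
     * (g - o) per later message, plus the final latency L. */
    double sender_slack(double L, double o, double g, int n) {
        return (g - o) * (n - 1) + L;
    }

    int main(void) {
        /* CM-5 numbers from slide 14 */
        printf("100 messages on the CM-5: %.1f us\n", send_n_time(6.0, 2.2, 4.0, 100));
        return 0;
    }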

  9. Use of the Model (cont)
     ° Two processors sending n words to each other (i.e., exchange) in time 2o + L + max(g, 2o)(n-1)
     ° P processors each sending n words to all processors (n/P each) in a static, balanced pattern without conflicts, e.g., transpose, fft, cyclic-to-block, block-to-cyclic: same
     ° Exercise: what's wrong with the formula above? It assumes an optimal pattern of sends/receives, so it could underestimate the time.
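
  A matching sketch for the exchange (again my naming); the slide's own caveat applies: it assumes an optimal schedule of sends and receives, so it can underestimate the real time:

    #include <math.h>   /* fmax */

    /* Two processors exchanging n words: after the first word, each further
     * word is limited either by the gap g or by the 2o a processor spends
     * sending one word and receiving another. */
    double exchange_time(double L, double o, double g, int n) {
        return 2.0 * o + L + fmax(g, 2.0 * o) * (n - 1);
    }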

  10. LogP "philosophy"
     • Think about:
       – the mapping of N words onto P processors
       – computation within a processor, its cost, and balance
       – communication between processors, its cost, and balance
       given a characterization of processor and network performance
     • Do not think about what happens within the network
     This should be good enough!

  11. Typical Sort Exploits the n = N/P grouping ° Significant local computation ° Very general global communication / transformation ° Computation of the transformation

  12. Split-C
     • Explicitly parallel C
     • 2D global address space
       – linear ordering on local spaces
     • Local and global pointers
       – spread arrays too
     • Read/Write
     • Get/Put (overlap compute and comm)
       – x := G; . . . sync();
     • Signaling store (one-way)
       – G :– x; . . . store_sync(); or all_store_sync();
     • Bulk transfer
     • Global comm.
     [Diagram: global address space spanning the local memories of P0 ... Pprocs-1.]

  13. Basic Costs of operations in Split-C
     • Read, Write      x = *G, *G = x        2(L + 2o)
     • Store            *G :– x               L + 2o
     • Get              x := *G; ... sync()   o at issue, o at sync; data arrives after 2L + 2o; successive gets with interval g
     • Bulk store (n words with w words/message)   2o + (n-1)g + L
     • Exchange         2o + 2L + (⌈n/w⌉ - 1 - L/g) max(g, 2o)
     • One to many
     • Many to one
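
  Collected in one place, a hedged C sketch of the costs listed above (function names are mine; n is the number of words moved and w the words per message in the bulk cases):

    #include <math.h>   /* ceil, fmax */

    double cost_read_or_write(double L, double o)   { return 2.0 * (L + 2.0 * o); }  /* round trip */
    double cost_signaling_store(double L, double o) { return L + 2.0 * o; }          /* one-way */
    double cost_get(double L, double o)             { return 2.0 * L + 2.0 * o; }    /* split-phase; paid by sync() */

    double cost_bulk_store(double L, double o, double g, int n) {
        return 2.0 * o + (n - 1) * g + L;
    }

    double cost_exchange(double L, double o, double g, int n, int w) {
        return 2.0 * o + 2.0 * L + (ceil((double)n / w) - 1.0 - L / g) * fmax(g, 2.0 * o);
    }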

  14. LogP model
     • CM-5: L = 6 µs, o = 2.2 µs, g = 4 µs; P varies from 32 to 1024
     • NOW:  L = 8.9 µs, o = 3.8 µs, g = 12.8 µs; P varies up to 100
     • What is the processor performance?
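
  Plugging these numbers in (my arithmetic): a single small message costs about 2o + L = 2(2.2) + 6 = 10.4 µs on the CM-5 and 2(3.8) + 8.9 = 16.5 µs on the NOW, i.e., on the order of the ~10 µs a SPARC 2 spends sorting one key locally (slide 2).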

  15. Sorting

  16. Local Sort Performance (11-bit radix sort of 32-bit numbers)
     [Plot: µs per key vs. log N/P, one curve per key entropy (31, 25.1, 16.9, 10.4, and 6.2 bits); the jump in the curves marks the onset of TLB misses.]
     Entropy = -Σ_i p_i log p_i, where p_i = probability of key i
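
  The entropy used to label the curves is ordinary Shannon entropy of the key-value distribution. A small C sketch (my naming) for computing it from empirical probabilities:

    #include <math.h>

    /* Entropy in bits of a key distribution p[0..k-1] (probabilities summing to 1). */
    double entropy_bits(const double *p, int k) {
        double h = 0.0;
        for (int i = 0; i < k; i++)
            if (p[i] > 0.0) h -= p[i] * log2(p[i]);
        return h;
    }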

  17. Local Computation Parameters - Empirical

     Sort              Parameter    Operation                          µs per key
     Bitonic           swap         simulate cycle butterfly per key   0.025 lg N
                       mergesort    sort bitonic sequence              1.0
     Bitonic & Column  scatter      move key for cyclic-to-block       0.46
                       gather       move key for block-to-cyclic       0.52 if n <= 64K or P <= 64; 1.1 otherwise
     Column            local sort   local radix sort (11 bit)          4.5 if n < 64K; 9.0 - (281000/n) otherwise
                       merge        merge sorted lists                 1.5
                       copy         shift key                          0.5
     Radix             zero         clear histogram bin                0.2
                       hist         produce histogram                  1.2
                       add          produce scan value                 1.0
                       bsum         adjust scan of bins                2.5
     Sample            address      determine destination              4.7
                       compare      compare key to splitter            0.9
                       localsort8   local radix sort of samples        5.0

  18. Bottom Line (Preview)
     • Good fit between predicted and measured (10%)
     • Different sorts for different sorts
       – scaling by processor, input size, sensitivity
     • All are global / local hybrids
       – the local part is hard to implement and model
     [Plot: µs/key vs. N/P (16384 to 1048576) for Bitonic, Column, Radix, and Sample at P = 32 and P = 1024.]

  19. Odd-Even Merge - classic parallel sort
     • N values to be sorted; treat as two lists of M = N/2 and sort each separately: A0 A1 A2 A3 ... AM-1 and B0 B1 B2 B3 ... BM-1
     • Redistribute into even and odd sublists: (A0 A2 ... AM-2), (A1 A3 ... AM-1), (B0 B2 ... BM-2), (B1 B3 ... BM-1)
     • Merge into two sorted lists: E0 E1 E2 E3 ... EM-1 and O0 O1 O2 O3 ... OM-1
     • Pairwise swaps of Ei and Oi will put it in order
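
  A hedged serial C sketch of that recipe (helper names merge2 and odd_even_merge are mine; it assumes the two sorted inputs have the same even length m): merge the even-indexed and odd-indexed sublists separately, then one round of pairwise compare-exchanges finishes the job.

    #include <stdio.h>

    static void merge2(const int *x, int nx, const int *y, int ny, int *out) {
        int i = 0, j = 0, k = 0;
        while (i < nx && j < ny) out[k++] = (x[i] <= y[j]) ? x[i++] : y[j++];
        while (i < nx) out[k++] = x[i++];
        while (j < ny) out[k++] = y[j++];
    }

    void odd_even_merge(const int *a, const int *b, int m, int *out) {
        int ae[m / 2], ao[m / 2], be[m / 2], bo[m / 2], E[m], O[m];
        for (int i = 0; i < m / 2; i++) {
            ae[i] = a[2 * i]; ao[i] = a[2 * i + 1];    /* even/odd sublists of A */
            be[i] = b[2 * i]; bo[i] = b[2 * i + 1];    /* even/odd sublists of B */
        }
        merge2(ae, m / 2, be, m / 2, E);               /* E: merged even sublists */
        merge2(ao, m / 2, bo, m / 2, O);               /* O: merged odd sublists  */
        out[0] = E[0];
        for (int i = 1; i < m; i++) {                  /* compare-exchange E[i], O[i-1] */
            out[2 * i - 1] = (E[i] < O[i - 1]) ? E[i] : O[i - 1];
            out[2 * i]     = (E[i] < O[i - 1]) ? O[i - 1] : E[i];
        }
        out[2 * m - 1] = O[m - 1];
    }

    int main(void) {
        int a[] = {1, 3, 5, 7}, b[] = {2, 4, 6, 8}, out[8];
        odd_even_merge(a, b, 4, out);
        for (int i = 0; i < 8; i++) printf("%d ", out[i]);   /* 1 2 3 4 5 6 7 8 */
        printf("\n");
        return 0;
    }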

  20. Where's the Parallelism?
     [Diagram: the recursion splits 1xN into 2xN/2, then 4xN/4, ...; merging the even (E) and odd (O) sublists and the final pairwise swaps bring it back to 1xN.]

  21. Mapping to a Butterfly (or Hypercube)
     • Start with two sorted sublists: A0 A1 A2 A3 and B0 B1 B2 B3
     • Reverse the order of one list via cross edges: B0 B1 B2 B3 -> B1 B0 B3 B2 -> B3 B2 B1 B0
     • Pairwise swaps on the way back
     [Diagram: example keys 2 3 4 8 | 7 6 5 1 -> 2 3 4 1 | 7 6 5 8 -> 2 1 4 3 | 5 6 7 8 -> 1 2 3 4 5 6 7 8.]

  22. Bitonic Sort with N/P per node
     A bitonic sequence decreases and then increases (or vice versa). Bitonic sequences can be merged like monotonic sequences.

       all_bitonic(int A[PROCS]::[n])
         sort(tolocal(&A[ME][0]), n, 0)
         for (d = 1; d <= logProcs; d++)
           for (i = d-1; i >= 0; i--) {
             swap(A, T, n, pair(i));
             merge(A, T, n, mode(d, i));
           }
         sort(tolocal(&A[ME][0]), n, mask(i));

     [Diagram: alternating sort and swap phases.]
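
  For reference, a serial C sketch of the bitonic merge itself (mine, not the Split-C routine above): for a bitonic sequence whose length n is a power of two, one round of compare-exchanges at distance n/2 leaves two bitonic halves that can be merged recursively.

    /* Sort a bitonic sequence a[0..n-1], n a power of two;
     * ascending is 1 for increasing order, 0 for decreasing. */
    void bitonic_merge(int *a, int n, int ascending) {
        if (n <= 1) return;
        int h = n / 2;
        for (int i = 0; i < h; i++) {
            /* compare-exchange a[i] with a[i+h] into the requested direction */
            if ((a[i] > a[i + h]) == ascending) {
                int t = a[i]; a[i] = a[i + h]; a[i + h] = t;
            }
        }
        bitonic_merge(a, h, ascending);
        bitonic_merge(a + h, h, ascending);
    }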

  23. Bitonic Sort Block Layout
     • lg N/P stages are a local sort
     • remaining stages involve
       – block-to-cyclic, local merges (i - lg N/P cols)
       – cyclic-to-block, local merges (lg N/P cols within a stage)

  24. Analysis of Bitonic • How do you do transpose? • Reading Exercise

  25. Bitonic Sort: time per key
     [Plots: measured and predicted µs/key vs. N/P (16384 to 1048576) for P = 32, 64, 128, 256, 512; y-axis up to about 80 µs/key.]

  26. Bitonic: Breakdown P= 512, random

  27. Bitonic: Effect of Key Distributions P = 64, N/P = 1 M

  28. Column Sort
     Treat the data as an n x P array, with n >= P^2, i.e., N >= P^3.
     (1) Sort  (2) Transpose - block to cyclic  (3) Sort  (4) Transpose - cyclic to block w/o scatter  (5) Sort  (6) Shift  (7) Merge  (8) Unshift
     Work efficient.

  29. Column Sort: Times
     Only works for N >= P^3.
     [Plots: measured and predicted µs/key vs. N/P (16384 to 1048576) for P = 32, 64, 128, 256, 512; y-axis up to about 40 µs/key.]

  30. Column: Breakdown P= 64, random

  31. Column: Key distributions
     [Plot: µs/key vs. key entropy (0, 6, 10, 17, 25, 31 bits), broken down into merge, sorts, remaps, and shifts; P = 64, N/P = 1M.]

  32. Histo-radix sort
     Per pass:
       1. compute local histogram
       2. compute position of 1st member of each bucket in the global array
          – 2^r scans with end-around
       3. distribute all the keys
     Only r = 8, 11, 16 make sense for sorting 32-bit numbers.
     [Diagram: P processors, n = N/P keys each, 2^r histogram buckets.]
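
  A serial C sketch of one such pass over a single processor's keys (my naming; the cross-processor histogram combine and the remote writes in step 3 are omitted here):

    #include <stdlib.h>

    /* One r-bit pass: bucket keys by bits [shift, shift+r) of their value. */
    void radix_pass(const unsigned *in, unsigned *out, int n, int r, int shift) {
        int buckets = 1 << r;
        int *hist = calloc(buckets, sizeof(int));
        int *pos  = calloc(buckets, sizeof(int));
        for (int i = 0; i < n; i++)                        /* 1. local histogram */
            hist[(in[i] >> shift) & (buckets - 1)]++;
        for (int b = 1; b < buckets; b++)                  /* 2. scan: first slot of each bucket */
            pos[b] = pos[b - 1] + hist[b - 1];
        for (int i = 0; i < n; i++) {                      /* 3. distribute keys */
            unsigned d = (in[i] >> shift) & (unsigned)(buckets - 1);
            out[pos[d]++] = in[i];
        }
        free(hist);
        free(pos);
    }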

  33. Histo-Radix Sort (again)
     Each pass:
       • form local histograms
       • form global histogram
       • globally distribute data
     [Diagram: local data and local histograms across the P processors.]

  34. Radix Sort: Times
     [Plots: predicted and measured µs/key vs. N/P (16384 to 1048576) for P = 32, 64, 128, 256, 512; y-axis up to about 140 µs/key.]

  35. Radix: Breakdown

  36. Radix: Key distribution Slowdown due to contention in redistribution

  37. Radix: Stream Broadcast Problem
     Broadcast a stream of n words through P-1 processors: (P-1)(2o + L + (n-1)g)?
     Need to slow the first processor to pipeline well.

  38. What’s the right communication mechanism? • Permutation via writes • consistency model? • false sharing? • Reads? • Bulk Transfers? • what do you need to change in the algorithm? • Network scheduling?

  39. Sample Sort
     1. Compute P-1 values of keys that would split the input into roughly equal pieces:
        – take S ~ 64 samples per processor
        – sort the P·S sampled keys
        – take the keys at positions S, 2S, . . ., (P-1)S as splitters
        – broadcast the splitters
     2. Distribute keys based on the splitters
     3. Local sort
     [4.] possibly reshift
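
  A hedged serial C sketch of step 1 and of the per-key destination test used in step 2 (names are mine; the sampling and the broadcast are taken as given):

    #include <stdlib.h>

    static int cmp_uint(const void *a, const void *b) {
        unsigned x = *(const unsigned *)a, y = *(const unsigned *)b;
        return (x > y) - (x < y);
    }

    /* Choose P-1 splitters from the P*S gathered samples: sort them and
     * take every S-th one. */
    void choose_splitters(unsigned *samples, int P, int S, unsigned *splitters) {
        qsort(samples, (size_t)P * S, sizeof(unsigned), cmp_uint);
        for (int i = 1; i < P; i++)
            splitters[i - 1] = samples[i * S];
    }

    /* Destination processor for a key: index of the first of the P-1 sorted
     * splitters that exceeds it (binary search), or P-1 if none does. */
    int dest_proc(unsigned key, const unsigned *splitters, int P) {
        int lo = 0, hi = P - 1;
        while (lo < hi) {
            int mid = (lo + hi) / 2;
            if (key < splitters[mid]) hi = mid; else lo = mid + 1;
        }
        return lo;
    }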

  40. Sample Sort: Times
     [Plots: measured and predicted µs/key vs. N/P (16384 to 1048576) for P = 32, 64, 128, 256, 512; y-axis up to about 30 µs/key.]

  41. Sample Breakdown
     [Plot: µs/key vs. N/P (16384 to 1048576), broken into Split, Sort, Dist and Split-m, Sort-m, Dist-m components; y-axis up to about 30 µs/key.]

  42. Comparison
     • Good fit between predicted and measured (10%)
     • Different sorts for different sorts
       – scaling by processor, input size, sensitivity
     • All are global / local hybrids
       – the local part is hard to implement and model
     [Plot: µs/key vs. N/P (16384 to 1048576) for Bitonic, Column, Radix, and Sample at P = 32 and P = 1024.]

  43. Conclusions
     • Distributed memory model leads to hybrid global / local algorithms
     • LogP model is good enough for the global part
       – bandwidth (g) or overhead (o) matter most, including end-point contention
       – latency (L) only matters when BW doesn't
       – g is going to be what really matters in the days ahead (NOW)
     • Local computational performance is hard!
       – dominated by effects of the storage hierarchy (TLBs)
       – getting trickier with multilevels: physical address determines L2 cache behavior
       – and with real computers at the nodes (VM)
       – and with variations in model: cycle time, caches, . . .
     • See http://www.cs.berkeley.edu/~culler/papers/sort.ps
     • See http://now.cs.berkeley.edu/Papers2/Postscript/spdt98.ps (disk-to-disk parallel sorting)
