
Presentation Transcript


  1. CS 267 Applications of Parallel Computers. Lecture 5: Sources of Parallelism (continued); Shared-Memory Multiprocessors. Kathy Yelick. http://www.cs.berkeley.edu/~dmartin/cs267/

  2. Outline • Recap • Parallelism and Locality in PDEs • Continuous Variables Depending on Continuous Parameters • Example: The heat equation • Euler’s method • Indirect methods • Shared Memory Machines • Historical Perspective: Centralized Shared Memory • Bus-based Cache-coherent Multiprocessors • Scalable Shared Memory Machines

  3. Recap: Source of Parallelism and Locality • Discrete event system • model is discrete space with discrete interactions • synchronous and asynchronous versions • parallelism over graph of entities; communication for events • Particle systems • discrete entities moving in continuous space and time • parallelism between particles; communication for interactions • ODEs • systems of lumped (discrete) variables, continuous parameters • parallelism in solving (usually sparse) linear systems • graph partitioning for parallelizing the sparse matrix computation • PDEs (today)

  4. Continuous Variables, Continuous Parameters. Examples of such systems include • Heat flow: Temperature(position, time) • Diffusion: Concentration(position, time) • Electrostatic or gravitational potential: Potential(position) • Fluid flow: Velocity, Pressure, Density(position, time) • Quantum mechanics: Wave-function(position, time) • Elasticity: Stress, Strain(position, time)

  5. Example: Deriving the Heat Equation • Consider a simple problem: a bar of uniform material, insulated except at the ends (positions 0 and 1) • Let u(x,t) be the temperature at position x at time t • Heat travels from x to x+h at a rate proportional to the temperature difference, so

      d u(x,t)/dt = C * [ (u(x-h,t) - u(x,t))/h - (u(x,t) - u(x+h,t))/h ] / h

  • As h -> 0, we get the heat equation:  d u(x,t)/dt = C * d2 u(x,t)/dx2
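  The middle step, not spelled out on the slide, is that the bracketed difference of heat flows is the second centered difference of u; written out (same quantities as above, with C the proportionality constant):

      \frac{\partial u(x,t)}{\partial t}
        = \frac{C}{h}\left[ \frac{u(x-h,t)-u(x,t)}{h} - \frac{u(x,t)-u(x+h,t)}{h} \right]
        = C \, \frac{u(x-h,t) - 2u(x,t) + u(x+h,t)}{h^2}
        \;\longrightarrow\; C \, \frac{\partial^2 u(x,t)}{\partial x^2}
        \quad \text{as } h \to 0 .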

  6. Explicit Solution of the Heat Equation • For simplicity, assume C=1 • Discretize both time and position • Use finite differences, with xt[i] as the heat at time t and position i • initial conditions on x0[i] • boundary conditions on xt[0] and xt[1] • At each timestep, compute xt+1[i] = z*xt[i-1] + (1-2*z)*xt[i] + z*xt[i+1], where z = k/h2 • This corresponds to a matrix-vector multiply using nearest neighbors on the grid [figure: space-time grid with rows t=0 through t=5 over points x[0] through x[5]]
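  A minimal serial sketch of one explicit timestep in C (function and variable names are illustrative, not from the course materials; boundary values are simply held fixed):

      /* One explicit timestep of the 1D heat equation with C = 1.
       * x_old and x_new hold the temperature at n grid points; z = k/h^2. */
      void heat_step(const double *x_old, double *x_new, int n, double z) {
          x_new[0]     = x_old[0];          /* Dirichlet boundary values held fixed */
          x_new[n - 1] = x_old[n - 1];
          for (int i = 1; i < n - 1; i++)
              x_new[i] = z * x_old[i - 1] + (1 - 2 * z) * x_old[i] + z * x_old[i + 1];
      }

  For stability, the explicit method needs z = k/h2 <= 1/2, which is the reason (noted on the next slide) that the timesteps must be so small.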

  7. Parallelism in Explicit Method for PDEs • Partition the space (x) into p large chunks, one per processor (see the sketch below) • good load balance (assuming a large number of points relative to p) • minimal communication (only at the chunk boundaries) • Generalizes to • multiple dimensions • arbitrary graphs (= sparse matrices) • Problem with the explicit approach • numerical instability • need to make the timesteps very small
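  A shared-memory sketch of the partitioned update, assuming OpenMP (the static schedule gives each thread one contiguous chunk of x; this is an illustration, not code from the course):

      #include <omp.h>

      /* Explicit step with the grid partitioned into contiguous chunks, one per
       * thread. Only the points at each chunk boundary are read from a
       * neighboring chunk, which is the only communication on a distributed
       * machine. */
      void heat_step_parallel(const double *x_old, double *x_new, int n, double z) {
          #pragma omp parallel for schedule(static)
          for (int i = 1; i < n - 1; i++)
              x_new[i] = z * x_old[i - 1] + (1 - 2 * z) * x_old[i] + z * x_old[i + 1];
          x_new[0] = x_old[0];
          x_new[n - 1] = x_old[n - 1];
      }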

  8. Implicit Solution • As with many (stiff) ODEs, we need an implicit method • This turns into solving the equation (I + (z/2)*T) * xt+1 = (I - (z/2)*T) * xt • where I is the identity matrix and T is the tridiagonal matrix

          [  2 -1             ]
          [ -1  2 -1          ]
      T = [    -1  2 -1       ]
          [       -1  2 -1    ]
          [          -1  2    ]

  • I.e., essentially solving Poisson's equation
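  Because I + (z/2)*T is tridiagonal, each implicit step can be solved serially in O(n) time with the Thomas algorithm; a sketch under the assumption of zero boundary values (names are illustrative):

      /* One implicit step: solve (I + (z/2)T) x_new = (I - (z/2)T) x_old.
       * The system matrix is tridiagonal with diagonal 1+z and off-diagonals -z/2,
       * so a forward sweep plus back substitution (Thomas algorithm) suffices.
       * c and d are scratch arrays of length n; out-of-range neighbors are zero. */
      void implicit_step(const double *x_old, double *x_new,
                         double *c, double *d, int n, double z) {
          double off = -z / 2, diag = 1 + z;
          for (int i = 0; i < n; i++) {                 /* RHS: (I - (z/2)T) x_old */
              double left  = (i > 0)     ? x_old[i - 1] : 0.0;
              double right = (i < n - 1) ? x_old[i + 1] : 0.0;
              d[i] = (1 - z) * x_old[i] + (z / 2) * (left + right);
          }
          c[0] = off / diag;                            /* forward elimination */
          d[0] = d[0] / diag;
          for (int i = 1; i < n; i++) {
              double m = diag - off * c[i - 1];
              c[i] = off / m;
              d[i] = (d[i] - off * d[i - 1]) / m;
          }
          x_new[n - 1] = d[n - 1];                      /* back substitution */
          for (int i = n - 2; i >= 0; i--)
              x_new[i] = d[i] - c[i] * x_new[i + 1];
      }

  The sequential sweep is what makes this direct approach harder to parallelize than the explicit method; the algorithm table two slides below lists the alternatives.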

  9. 2D Implicit Method • Similar to the 1D case, but the matrix T is now the 2D (5-point) Laplacian: 4 on the diagonal and -1 for each nearest neighbor on the grid, i.e. block tridiagonal for a k-by-k grid:

          [  B -I       ]
      T = [ -I  B -I    ]        where B = tridiag(-1, 4, -1) and I is the k-by-k identity
          [    -I  B    ]

  • Multiplying by this matrix (as in the explicit case) is simply a nearest-neighbor computation • To solve this system, there are several techniques (see the table of algorithms on the next slide)
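  The nearest-neighbor structure means T never has to be formed explicitly; a sketch of y = T*x for an n-by-n grid (illustrative code, zero values assumed outside the grid):

      /* y = T*x for the 2D 5-point Laplacian on an n-by-n grid, stored row-major.
       * Each point touches only its four grid neighbors. */
      void apply_T_2d(const double *x, double *y, int n) {
          for (int i = 0; i < n; i++)
              for (int j = 0; j < n; j++) {
                  double v = 4.0 * x[i * n + j];
                  if (i > 0)     v -= x[(i - 1) * n + j];
                  if (i < n - 1) v -= x[(i + 1) * n + j];
                  if (j > 0)     v -= x[i * n + (j - 1)];
                  if (j < n - 1) v -= x[i * n + (j + 1)];
                  y[i * n + j] = v;
              }
      }

  Iterative methods such as Jacobi and conjugate gradients (next slide) need only this matrix-vector product, which is why they parallelize so naturally.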

  10. Algorithms for Solving the Poisson Equation

      Algorithm      Serial      PRAM             Memory     #Procs
      Dense LU       N^3         N                N^2        N^2
      Band LU        N^2         N                N^(3/2)    N
      Jacobi         N^2         N                N          N
      Conj. Grad.    N^(3/2)     N^(1/2) * log N  N          N
      RB SOR         N^(3/2)     N^(1/2)          N          N
      Sparse LU      N^(3/2)     N^(1/2)          N * log N  N
      FFT            N * log N   log N            N          N
      Multigrid      N           log^2 N          N          N
      Lower bound    N           log N            N

  PRAM is an idealized parallel model with zero-cost communication

  11. Administrative • HW2 extended to Monday, Feb. 16th • Break • On to shared memory machines

  12. Programming Recap and History of SMPs

  13. Relationship of Architecture and Programming Model. The layers, from top to bottom:

      Parallel Application
      Programming Model
      User / System Interface (compiler, library, operating system)
      HW / SW interface (communication primitives)
      Hardware

  14. Shared Address Space Programming Model • Collection of processes • Naming: • each can name data in a private address space, and • all can name all data in a common “shared” address space • Operations • uniprocessor operations, plus synchronization operations on shared addresses • lock, unlock, test&set, fetch&add, ... (see the sketch below) • Operations on the shared address space appear to be performed in program order • its own operations appear to be in program order • all see a consistent interleaving of each other’s operations • like timesharing on a uniprocessor; explicit synchronization operations are used when program ordering is not sufficient
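  These synchronization primitives map directly onto today's C11 atomics; a minimal sketch (the function names are illustrative, not part of the slides):

      #include <stdatomic.h>

      /* A spinlock built on test&set (atomic_flag) plus a fetch&add counter:
       * the kinds of operations on shared addresses listed above. */
      static atomic_flag lock_flag = ATOMIC_FLAG_INIT;
      static atomic_int  shared_counter = 0;

      void my_lock(void)     { while (atomic_flag_test_and_set(&lock_flag)) ; /* spin */ }
      void my_unlock(void)   { atomic_flag_clear(&lock_flag); }
      int  next_ticket(void) { return atomic_fetch_add(&shared_counter, 1); }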

  15. Example: shared flag indicating full/empty • Intuitively it is clear that the intention was to convey meaning by the order of the stores • There are no data dependences, so a sequential compiler / architecture would be free to reorder them!

      P1:               P2:
      A = 1;            a: while (flag is 0) do nothing;
      b: flag = 1;      print A;
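  For reference, this handoff is expressed safely in modern C with release/acquire atomics; a sketch (not from the slides) of the explicit synchronization the previous slide calls for:

      #include <stdatomic.h>

      int A = 0;
      atomic_int flag = 0;

      /* P1: the release store on flag cannot be reordered before the store to A. */
      void producer(void) {
          A = 1;
          atomic_store_explicit(&flag, 1, memory_order_release);
      }

      /* P2: the acquire load pairs with the release store, so A is seen as 1. */
      int consumer(void) {
          while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
              ;  /* spin */
          return A;
      }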

  16. Historical Perspective • Diverse spectrum of parallel machines, each designed to implement a particular programming model directly: message passing on hypercubes and grids, data parallel on SIMD machines, shared address space on centralized shared memory • Technological convergence on collections of microprocessors on a scalable interconnection network • Map any programming model to simple hardware, with some specialization [figure: nodes containing a processor P, cache $, memory M, and communication assist CA, each essentially a complete computer, joined by a scalable interconnection network]

  17. 60s Mainframe Multiprocessors • Enhance memory capacity or I/O capabilities by adding memory modules or I/O devices • How do you enhance processing capacity? • Add processors • Already need an interconnect between slow memory banks and processor + I/O channels • cross-bar or multistage interconnection network [figure: processors (Proc) and I/O channels (IOCs) connected to multiple memory modules (Mem) through an interconnect]

  18. Caches: A Solution and New Problems

  19. 70s breakthrough • Caches! [figure: a fast processor P with a cache, connected over an interconnect to slow memory (holding location A with value 17) and to other I/O devices or processors]

  20. Technology Perspective

      Trends:         Capacity         Speed
      Logic:          2x in 3 years    2x in 3 years
      DRAM:           4x in 3 years    1.4x in 10 years
      Disk:           2x in 3 years    1.4x in 10 years

      DRAM generations:
      Year    Size      Cycle Time
      1980    64 Kb     250 ns
      1983    256 Kb    220 ns
      1986    1 Mb      190 ns
      1989    4 Mb      165 ns
      1992    16 Mb     145 ns
      1995    64 Mb     120 ns
      (1000:1 growth in size, only 2:1 in speed)

  21. Bus Bottleneck and Caches • Assume a 100 MB/s bus and a 50 MIPS processor without a cache => 200 MB/s instruction BW per processor => 60 MB/s data BW at 30% load-stores • Suppose a 98% instruction hit rate and a 95% data hit rate (16-byte blocks) => 4 MB/s instruction BW per processor => 12 MB/s data BW per processor => 16 MB/s combined BW per processor, so 8 processors will saturate the bus • The cache provides a bandwidth filter, as well as reducing the average access time [figure: processors with private caches sharing a bus to memory and I/O; 260 MB/s demand per processor before the cache, 16 MB/s after]

  22. Cache Coherence: The Semantic Problem • Scenario: • p1 and p2 both have cached copies of x (as 0) • p1 writes x=1 and then the flag f=1, pulling f into its cache • both of these writes may write through to memory • p2 reads f (bringing it into its cache) to see if it is 1, which it is • p2 therefore reads x, but gets the stale cached copy x=0 [figure: p1's cache holds x=1 and f=1; p2's cache holds f=1 but the stale x=0]

  23. Snoopy Cache Coherence • Bus is a broadcast medium • all caches can watch the others' memory ops • All processors write through: • update the local cache and issue a global bus write, which • updates main memory • invalidates/updates all other caches holding that item • Examples: early Sequent and Encore machines • Caches stay coherent • consistent view of memory! • one shared write at a time • Performance is much worse than uniprocessor write-back caches • since ~15-30% of references are writes, this scheme consumes tremendous bus bandwidth; few processors can be supported

  24. Write-Back/Ownership Schemes • When a single cache has ownership of a block, processor writes do not result in bus writes, thus conserving bandwidth. • reads by others cause it to return to “shared” state • Most bus-based multiprocessors today use such schemes. • Many variants of ownership-based protocols
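  As a rough illustration of how an ownership protocol avoids bus writes, an MSI-style state-transition sketch (for exposition only, not the protocol of any particular machine):

      /* State of one cache block in one cache. Once the block is MODIFIED
       * (owned), local writes cause no bus traffic; a read observed from
       * another processor demotes it back to SHARED. */
      typedef enum { INVALID, SHARED, MODIFIED } block_state;
      typedef enum { PR_READ, PR_WRITE, BUS_READ, BUS_WRITE } event_t;

      block_state next_state(block_state s, event_t e) {
          switch (e) {
          case PR_READ:   return (s == INVALID) ? SHARED : s;  /* miss fetches block     */
          case PR_WRITE:  return MODIFIED;                      /* acquire ownership      */
          case BUS_READ:  return (s == MODIFIED) ? SHARED : s;  /* supply data, demote    */
          case BUS_WRITE: return INVALID;                       /* another cache now owns */
          }
          return s;
      }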

  25. Programming SMPs • Consistent view of shared memory • All addresses equidistant • don't worry about data partitioning • Automatic replication of shared data close to the processor • If the program concentrates on a block of the data set that no one else updates => very fast • Communication occurs only on cache misses • cache misses are slow • the processor cannot distinguish communication misses from regular cache misses • Cache blocks may introduce artifacts • two distinct variables in the same cache block => false sharing (see the sketch below)
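  A sketch of the false-sharing artifact (a 64-byte cache line is assumed; illustrative, not from the slides):

      /* Two counters updated by different threads that happen to share a cache
       * block will ping-pong the block between caches even though no data is
       * actually shared. Padding gives each counter its own block. */
      struct counters_bad  { long a, b; };        /* a and b share one cache block */
      struct counters_good { long a; char pad[64 - sizeof(long)]; long b; };

      void bump_a(struct counters_good *c) { c->a++; }   /* run by thread 0 */
      void bump_b(struct counters_good *c) { c->b++; }   /* run by thread 1 */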

  26. Scalable Cache-Coherence

  27. 90s Scalable, Cache-Coherent Multiprocessors

  28. SGI Origin 2000

  29. 90’s Pushing the bus to the limit: Sun Enterprise

  30. 90’s Pushing the SMP to the masses

  31. Caches and Scientific Computing • Caches tend to perform worst on demanding applications that operate on large data sets • transaction processing • operating systems • sparse matrices • Modern scientific codes use tiling/blocking to become cache friendly • easier for dense codes than for sparse • tiling and parallelism are similar transformations
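  A standard illustration of tiling/blocking (not taken from the course materials): a blocked matrix multiply that keeps a small tile of each matrix resident in cache while it is reused.

      /* Blocked (tiled) matrix multiply, C += A*B, n-by-n, row-major.
       * Each TILE x TILE block is reused many times while it sits in cache. */
      #define TILE 32   /* tile edge, chosen so roughly three tiles fit in the cache */

      void matmul_tiled(int n, const double *A, const double *B, double *C) {
          for (int ii = 0; ii < n; ii += TILE)
              for (int kk = 0; kk < n; kk += TILE)
                  for (int jj = 0; jj < n; jj += TILE)
                      for (int i = ii; i < ii + TILE && i < n; i++)
                          for (int k = kk; k < kk + TILE && k < n; k++) {
                              double aik = A[i * n + k];
                              for (int j = jj; j < jj + TILE && j < n; j++)
                                  C[i * n + j] += aik * B[k * n + j];
                          }
      }

  The same loop restructuring, applied across processors instead of across cache levels, is the partitioning used for parallelism, which is the sense in which tiling and parallelization are similar transformations.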

  32. Scalable Global Address Space

  33. Structured Shared Memory: SPMD • Each process is the same program with the same address space layout [figure: the machine physical address space; each process P0..Pn has a private portion of its address space plus a shared portion mapped to common physical addresses, so a store to shared x by one process is seen by a load of x in another]
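  A minimal SPMD sketch with POSIX threads (an illustration, not course code): every thread runs the same function, globals sit in the shared portion of the address space, and each thread's locals are private to its own stack.

      #include <pthread.h>
      #include <stdio.h>

      #define NTHREADS 4
      #define N 1000

      double shared_data[N];                     /* shared: visible to all threads */

      void *worker(void *arg) {
          long my_id = (long)arg;                /* private: lives on this thread's stack */
          double my_sum = 0.0;
          for (int i = my_id; i < N; i += NTHREADS)
              my_sum += shared_data[i];
          printf("thread %ld partial sum %g\n", my_id, my_sum);
          return NULL;
      }

      int main(void) {
          pthread_t t[NTHREADS];
          for (int i = 0; i < N; i++) shared_data[i] = 1.0;
          for (long i = 0; i < NTHREADS; i++)
              pthread_create(&t[i], NULL, worker, (void *)i);
          for (int i = 0; i < NTHREADS; i++)
              pthread_join(t[i], NULL);
          return 0;
      }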

  34. Large Scale Shared Physical Address • Processor performs a load (Ld R <- Addr) • A pseudo-memory controller turns it into a message transaction (a read request carrying src, dest, and addr) over the scalable network; the remote pseudo-processor controller performs the memory operation and replies with the data (a response carrying tag and data) • Examples: BBN Butterfly, Cray T3D [figure: nodes each with processor P, MMU, cache $, and memory M, plus pseudo-memory and pseudo-processor controllers, connected by a scalable network]

  35. Cray T3D Node • 3D torus of pairs of PEs, which share the network interface and BLT (block transfer engine); up to 2048 PEs, 64 MB each • 150 MHz DEC Alpha (64-bit), 8 KB instruction + 8 KB data caches, 43-bit virtual address • 32- and 64-bit memory operations plus byte operations • non-blocking stores + memory barrier, prefetch (prefetch queue: 16 x 64), load-lock / store-conditional • DTB, PE# + FC, DRAM with 32-bit physical address (5 + 27) • message queue: 4080 x 4 x 64 • special registers: swaperand, fetch&add, barrier [figure: node block diagram with processor P, MMU, cache $, and request/response in/out paths to the network]

  36. The Cray T3D • 2048 Alphas (150 MHz, 16 or 64 MB each) + fast network • 43-bit virtual address space, 32-bit physical • 32-bit and 64-bit load/store + byte manipulation on regs. • no L2 cache • non-blocking stores, load/store re-ordering, memory fence • load-lock / store-conditional • Direct global memory access via external segment regs • DTB annex, 32 entries, remote processor number and mode • atomic swap between special local reg and memory • special fetch&inc register • global-OR, global-AND barriers • Prefetch Queue • Block Transfer Engine • User-level Message Queue

  37. T3D Local Read (average latency) • No TLB! • Line size: 32 bytes • L1 cache size: 8 KB • Cache access time: 6.7 ns (1 cycle) • Memory access time: 155 ns (23 cycles) • DRAM page miss: 100 ns (15 cycles)

  38. T3D Remote Read (uncached) • 610 ns (91 cycles): 3-4x a local memory read! • 100 ns DRAM-page miss • Network latency: an additional 13-20 ns (2-3 cycles) per hop

  39. Bulk Read Options

  40. Where are things going? • High end • collections of almost-complete workstations/SMPs on a high-speed network • with a specialized communication assist integrated with the memory system to provide global access to shared data • Mid range • almost all servers are bus-based cache-coherent SMPs • high-end servers are replacing the bus with a network • Sun Enterprise 10000, IBM J90, HP/Convex SPP • the volume approach is the Pentium Pro quad pack + SCI ring • Sequent, Data General • Low end • the SMP desktop is here • Major change ahead • SMP on a chip as a building block
