
Presentation Transcript


  1. CS 267 Applications of Parallel Computers. Lecture 5: Sources of Parallelism (continued); Shared-Memory Multiprocessors. Kathy Yelick. http://www.cs.berkeley.edu/~dmartin/cs267/

  2. Outline • Recap • Parallelism and Locality in PDEs • Continuous Variables Depending on Continuous Parameters • Example: The heat equation • Euler’s method • Indirect methods • Shared Memory Machines • Historical Perspective: Centralized Shared Memory • Bus-based Cache-coherent Multiprocessors • Scalable Shared Memory Machines

  3. Recap: Source of Parallelism and Locality • Discrete event system • model is discrete space with discrete interactions • synchronous and asynchronous versions • parallelism over graph of entities; communication for events • Particle systems • discrete entities moving in continuous space and time • parallelism between particles; communication for interactions • ODEs • systems of lumped (discrete) variables, continuous parameters • parallelism in solving (usually sparse) linear systems • graph partitioning for parallelizing the sparse matrix computation • PDEs (today)

  4. Continuous Variables, Continuous Parameters. Examples of such systems include • Heat flow: Temperature(position, time) • Diffusion: Concentration(position, time) • Electrostatic or gravitational potential: Potential(position) • Fluid flow: Velocity, Pressure, Density(position, time) • Quantum mechanics: Wave-function(position, time) • Elasticity: Stress, Strain(position, time)

  5. Example: Deriving the Heat Equation • Consider a simple problem: a bar of uniform material, insulated except at the ends (positions 0 and 1) • Let u(x,t) be the temperature at position x at time t • Heat travels from x to x+h at a rate proportional to the temperature difference, so

      d u(x,t)/dt = C * [ (u(x-h,t) - u(x,t))/h - (u(x,t) - u(x+h,t))/h ] / h

  • As h -> 0, we get the heat equation:  d u(x,t)/dt = C * d2 u(x,t)/dx2
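  The middle step, not spelled out on the slide, is that the bracketed difference of heat flows is the second centered difference of u; written out (same quantities as above, with C the proportionality constant):

      \frac{\partial u(x,t)}{\partial t}
        = \frac{C}{h}\left[ \frac{u(x-h,t)-u(x,t)}{h} - \frac{u(x,t)-u(x+h,t)}{h} \right]
        = C \, \frac{u(x-h,t) - 2u(x,t) + u(x+h,t)}{h^2}
        \;\longrightarrow\; C \, \frac{\partial^2 u(x,t)}{\partial x^2}
        \quad \text{as } h \to 0 .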

  6. Explicit Solution of the Heat Equation • For simplicity, assume C=1 • Discretize both time and position • Use finite differences, with xt[i] as the heat at time t and position i • initial conditions on x0[i] • boundary conditions on xt[0] and xt[1] • At each timestep, compute xt+1[i] = z*xt[i-1] + (1-2*z)*xt[i] + z*xt[i+1], where z = k/h2 • This corresponds to a matrix-vector multiply using nearest neighbors on the grid [figure: space-time grid with rows t=0 through t=5 over points x[0] through x[5]]
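  A minimal serial sketch of one explicit timestep in C (function and variable names are illustrative, not from the course materials; boundary values are simply held fixed):

      /* One explicit timestep of the 1D heat equation with C = 1.
       * x_old and x_new hold the temperature at n grid points; z = k/h^2. */
      void heat_step(const double *x_old, double *x_new, int n, double z) {
          x_new[0]     = x_old[0];          /* Dirichlet boundary values held fixed */
          x_new[n - 1] = x_old[n - 1];
          for (int i = 1; i < n - 1; i++)
              x_new[i] = z * x_old[i - 1] + (1 - 2 * z) * x_old[i] + z * x_old[i + 1];
      }

  For stability, the explicit method needs z = k/h2 <= 1/2, which is the reason (noted on the next slide) that the timesteps must be so small.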

  7. Parallelism in Explicit Method for PDEs • Partition the space (x) into p large chunks, one per processor (see the sketch below) • good load balance (assuming a large number of points relative to p) • minimal communication (only at the chunk boundaries) • Generalizes to • multiple dimensions • arbitrary graphs (= sparse matrices) • Problem with the explicit approach • numerical instability • need to make the timesteps very small
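  A shared-memory sketch of the partitioned update, assuming OpenMP (the static schedule gives each thread one contiguous chunk of x; this is an illustration, not code from the course):

      #include <omp.h>

      /* Explicit step with the grid partitioned into contiguous chunks, one per
       * thread. Only the points at each chunk boundary are read from a
       * neighboring chunk, which is the only communication on a distributed
       * machine. */
      void heat_step_parallel(const double *x_old, double *x_new, int n, double z) {
          #pragma omp parallel for schedule(static)
          for (int i = 1; i < n - 1; i++)
              x_new[i] = z * x_old[i - 1] + (1 - 2 * z) * x_old[i] + z * x_old[i + 1];
          x_new[0] = x_old[0];
          x_new[n - 1] = x_old[n - 1];
      }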

  8. Implicit Solution • As with many (stiff) ODEs, we need an implicit method • This turns into solving the equation (I + (z/2)*T) * xt+1 = (I - (z/2)*T) * xt • where I is the identity matrix and T is the tridiagonal matrix

          [  2 -1             ]
          [ -1  2 -1          ]
      T = [    -1  2 -1       ]
          [       -1  2 -1    ]
          [          -1  2    ]

  • I.e., essentially solving Poisson's equation
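  Because I + (z/2)*T is tridiagonal, each implicit step can be solved serially in O(n) time with the Thomas algorithm; a sketch under the assumption of zero boundary values (names are illustrative):

      /* One implicit step: solve (I + (z/2)T) x_new = (I - (z/2)T) x_old.
       * The system matrix is tridiagonal with diagonal 1+z and off-diagonals -z/2,
       * so a forward sweep plus back substitution (Thomas algorithm) suffices.
       * c and d are scratch arrays of length n; out-of-range neighbors are zero. */
      void implicit_step(const double *x_old, double *x_new,
                         double *c, double *d, int n, double z) {
          double off = -z / 2, diag = 1 + z;
          for (int i = 0; i < n; i++) {                 /* RHS: (I - (z/2)T) x_old */
              double left  = (i > 0)     ? x_old[i - 1] : 0.0;
              double right = (i < n - 1) ? x_old[i + 1] : 0.0;
              d[i] = (1 - z) * x_old[i] + (z / 2) * (left + right);
          }
          c[0] = off / diag;                            /* forward elimination */
          d[0] = d[0] / diag;
          for (int i = 1; i < n; i++) {
              double m = diag - off * c[i - 1];
              c[i] = off / m;
              d[i] = (d[i] - off * d[i - 1]) / m;
          }
          x_new[n - 1] = d[n - 1];                      /* back substitution */
          for (int i = n - 2; i >= 0; i--)
              x_new[i] = d[i] - c[i] * x_new[i + 1];
      }

  The sequential sweep is what makes this direct approach harder to parallelize than the explicit method; the algorithm table two slides below lists the alternatives.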

  9. 2D Implicit Method • Similar to the 1D case, but the matrix T is now the 2D (5-point) Laplacian: 4 on the diagonal and -1 for each nearest neighbor on the grid, i.e. block tridiagonal for a k-by-k grid:

          [  B -I       ]
      T = [ -I  B -I    ]        where B = tridiag(-1, 4, -1) and I is the k-by-k identity
          [    -I  B    ]

  • Multiplying by this matrix (as in the explicit case) is simply a nearest-neighbor computation • To solve this system, there are several techniques (see the table of algorithms on the next slide)
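  The nearest-neighbor structure means T never has to be formed explicitly; a sketch of y = T*x for an n-by-n grid (illustrative code, zero values assumed outside the grid):

      /* y = T*x for the 2D 5-point Laplacian on an n-by-n grid, stored row-major.
       * Each point touches only its four grid neighbors. */
      void apply_T_2d(const double *x, double *y, int n) {
          for (int i = 0; i < n; i++)
              for (int j = 0; j < n; j++) {
                  double v = 4.0 * x[i * n + j];
                  if (i > 0)     v -= x[(i - 1) * n + j];
                  if (i < n - 1) v -= x[(i + 1) * n + j];
                  if (j > 0)     v -= x[i * n + (j - 1)];
                  if (j < n - 1) v -= x[i * n + (j + 1)];
                  y[i * n + j] = v;
              }
      }

  Iterative methods such as Jacobi and conjugate gradients (next slide) need only this matrix-vector product, which is why they parallelize so naturally.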

  10. Algorithms for Solving the Poisson Equation

      Algorithm      Serial      PRAM             Memory     #Procs
      Dense LU       N^3         N                N^2        N^2
      Band LU        N^2         N                N^(3/2)    N
      Jacobi         N^2         N                N          N
      Conj. Grad.    N^(3/2)     N^(1/2) * log N  N          N
      RB SOR         N^(3/2)     N^(1/2)          N          N
      Sparse LU      N^(3/2)     N^(1/2)          N * log N  N
      FFT            N * log N   log N            N          N
      Multigrid      N           log^2 N          N          N
      Lower bound    N           log N            N

  PRAM is an idealized parallel model with zero-cost communication

  11. Administrative • HW2 extended to Monday, Feb. 16th • Break • On to shared memory machines

  12. Programming Recap and History of SMPs

  13. Relationship of Architecture and Programming Model. The layers, from top to bottom:

      Parallel Application
      Programming Model
      User / System Interface (compiler, library, operating system)
      HW / SW interface (communication primitives)
      Hardware

  14. Shared Address Space Programming Model • Collection of processes • Naming: • each can name data in a private address space, and • all can name all data in a common “shared” address space • Operations • uniprocessor operations, plus synchronization operations on shared addresses • lock, unlock, test&set, fetch&add, ... (see the sketch below) • Operations on the shared address space appear to be performed in program order • its own operations appear to be in program order • all see a consistent interleaving of each other’s operations • like timesharing on a uniprocessor; explicit synchronization operations are used when program ordering is not sufficient
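  These synchronization primitives map directly onto today's C11 atomics; a minimal sketch (the function names are illustrative, not part of the slides):

      #include <stdatomic.h>

      /* A spinlock built on test&set (atomic_flag) plus a fetch&add counter:
       * the kinds of operations on shared addresses listed above. */
      static atomic_flag lock_flag = ATOMIC_FLAG_INIT;
      static atomic_int  shared_counter = 0;

      void my_lock(void)     { while (atomic_flag_test_and_set(&lock_flag)) ; /* spin */ }
      void my_unlock(void)   { atomic_flag_clear(&lock_flag); }
      int  next_ticket(void) { return atomic_fetch_add(&shared_counter, 1); }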

  15. Example: shared flag indicating full/empty • Intuitively it is clear that the intention was to convey meaning by the order of the stores • There are no data dependences, so a sequential compiler / architecture would be free to reorder them!

      P1:               P2:
      A = 1;            a: while (flag is 0) do nothing;
      b: flag = 1;      print A;
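  For reference, this handoff is expressed safely in modern C with release/acquire atomics; a sketch (not from the slides) of the explicit synchronization the previous slide calls for:

      #include <stdatomic.h>

      int A = 0;
      atomic_int flag = 0;

      /* P1: the release store on flag cannot be reordered before the store to A. */
      void producer(void) {
          A = 1;
          atomic_store_explicit(&flag, 1, memory_order_release);
      }

      /* P2: the acquire load pairs with the release store, so A is seen as 1. */
      int consumer(void) {
          while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
              ;  /* spin */
          return A;
      }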

  16. Historical Perspective • Diverse spectrum of parallel machines, each designed to implement a particular programming model directly: message passing on hypercubes and grids, data parallel on SIMD machines, shared address space on centralized shared memory • Technological convergence on collections of microprocessors on a scalable interconnection network • Map any programming model to simple hardware, with some specialization [figure: nodes containing a processor P, cache $, memory M, and communication assist CA, each essentially a complete computer, joined by a scalable interconnection network]

  17. 60s Mainframe Multiprocessors • Enhance memory capacity or I/O capabilities by adding memory modules or I/O devices • How do you enhance processing capacity? • Add processors • Already need an interconnect between slow memory banks and processor + I/O channels • cross-bar or multistage interconnection network [figure: processors (Proc) and I/O channels (IOCs) connected to multiple memory modules (Mem) through an interconnect]

  18. Caches: A Solution and New Problems

  19. 70s breakthrough • Caches! [figure: a fast processor P with a cache, connected over an interconnect to slow memory (holding location A with value 17) and to other I/O devices or processors]

  20. Technology Perspective

      Trends:         Capacity         Speed
      Logic:          2x in 3 years    2x in 3 years
      DRAM:           4x in 3 years    1.4x in 10 years
      Disk:           2x in 3 years    1.4x in 10 years

      DRAM generations:
      Year    Size      Cycle Time
      1980    64 Kb     250 ns
      1983    256 Kb    220 ns
      1986    1 Mb      190 ns
      1989    4 Mb      165 ns
      1992    16 Mb     145 ns
      1995    64 Mb     120 ns
      (1000:1 growth in size, only 2:1 in speed)

  21. Bus Bottleneck and Caches • Assume a 100 MB/s bus and a 50 MIPS processor without a cache => 200 MB/s instruction BW per processor => 60 MB/s data BW at 30% load-stores • Suppose a 98% instruction hit rate and a 95% data hit rate (16-byte blocks) => 4 MB/s instruction BW per processor => 12 MB/s data BW per processor => 16 MB/s combined BW per processor, so 8 processors will saturate the bus • The cache provides a bandwidth filter, as well as reducing the average access time [figure: processors with private caches sharing a bus to memory and I/O; 260 MB/s demand per processor before the cache, 16 MB/s after]

  22. Cache Coherence: The Semantic Problem • Scenario: • p1 and p2 both have cached copies of x (as 0) • p1 writes x=1 and then the flag f=1, pulling f into its cache • both of these writes may write through to memory • p2 reads f (bringing it into its cache) to see if it is 1, which it is • p2 therefore reads x, but gets the stale cached copy x=0 [figure: p1's cache holds x=1 and f=1; p2's cache holds f=1 but the stale x=0]

  23. Snoopy Cache Coherence • Bus is a broadcast medium • all caches can watch the others' memory ops • All processors write through: • update the local cache and issue a global bus write, which • updates main memory • invalidates/updates all other caches holding that item • Examples: early Sequent and Encore machines • Caches stay coherent • consistent view of memory! • one shared write at a time • Performance is much worse than uniprocessor write-back caches • since ~15-30% of references are writes, this scheme consumes tremendous bus bandwidth; few processors can be supported

  24. Write-Back/Ownership Schemes • When a single cache has ownership of a block, processor writes do not result in bus writes, thus conserving bandwidth. • reads by others cause it to return to “shared” state • Most bus-based multiprocessors today use such schemes. • Many variants of ownership-based protocols
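  As a rough illustration of how an ownership protocol avoids bus writes, an MSI-style state-transition sketch (for exposition only, not the protocol of any particular machine):

      /* State of one cache block in one cache. Once the block is MODIFIED
       * (owned), local writes cause no bus traffic; a read observed from
       * another processor demotes it back to SHARED. */
      typedef enum { INVALID, SHARED, MODIFIED } block_state;
      typedef enum { PR_READ, PR_WRITE, BUS_READ, BUS_WRITE } event_t;

      block_state next_state(block_state s, event_t e) {
          switch (e) {
          case PR_READ:   return (s == INVALID) ? SHARED : s;  /* miss fetches block     */
          case PR_WRITE:  return MODIFIED;                      /* acquire ownership      */
          case BUS_READ:  return (s == MODIFIED) ? SHARED : s;  /* supply data, demote    */
          case BUS_WRITE: return INVALID;                       /* another cache now owns */
          }
          return s;
      }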

  25. Programming SMPs • Consistent view of shared memory • All addresses equidistant • don't worry about data partitioning • Automatic replication of shared data close to the processor • If the program concentrates on a block of the data set that no one else updates => very fast • Communication occurs only on cache misses • cache misses are slow • the processor cannot distinguish communication misses from regular cache misses • Cache blocks may introduce artifacts • two distinct variables in the same cache block => false sharing (see the sketch below)
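  A sketch of the false-sharing artifact (a 64-byte cache line is assumed; illustrative, not from the slides):

      /* Two counters updated by different threads that happen to share a cache
       * block will ping-pong the block between caches even though no data is
       * actually shared. Padding gives each counter its own block. */
      struct counters_bad  { long a, b; };        /* a and b share one cache block */
      struct counters_good { long a; char pad[64 - sizeof(long)]; long b; };

      void bump_a(struct counters_good *c) { c->a++; }   /* run by thread 0 */
      void bump_b(struct counters_good *c) { c->b++; }   /* run by thread 1 */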

  26. Scalable Cache-Coherence

  27. 90s Scalable, Cache-Coherent Multiprocessors

  28. SGI Origin 2000

  29. 90’s Pushing the bus to the limit: Sun Enterprise

  30. 90’s Pushing the SMP to the masses

  31. Caches and Scientific Computing • Caches tend to perform worst on demanding applications that operate on large data sets • transaction processing • operating systems • sparse matrices • Modern scientific codes use tiling/blocking to become cache friendly • easier for dense codes than for sparse • tiling and parallelism are similar transformations
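  A standard illustration of tiling/blocking (not taken from the course materials): a blocked matrix multiply that keeps a small tile of each matrix resident in cache while it is reused.

      /* Blocked (tiled) matrix multiply, C += A*B, n-by-n, row-major.
       * Each TILE x TILE block is reused many times while it sits in cache. */
      #define TILE 32   /* tile edge, chosen so roughly three tiles fit in the cache */

      void matmul_tiled(int n, const double *A, const double *B, double *C) {
          for (int ii = 0; ii < n; ii += TILE)
              for (int kk = 0; kk < n; kk += TILE)
                  for (int jj = 0; jj < n; jj += TILE)
                      for (int i = ii; i < ii + TILE && i < n; i++)
                          for (int k = kk; k < kk + TILE && k < n; k++) {
                              double aik = A[i * n + k];
                              for (int j = jj; j < jj + TILE && j < n; j++)
                                  C[i * n + j] += aik * B[k * n + j];
                          }
      }

  The same loop restructuring, applied across processors instead of across cache levels, is the partitioning used for parallelism, which is the sense in which tiling and parallelization are similar transformations.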

  32. Scalable Global Address Space

  33. Structured Shared Memory: SPMD • Each process is the same program with the same address space layout [figure: the machine physical address space; each process P0..Pn has a private portion of its address space plus a shared portion mapped to common physical addresses, so a store to shared x by one process is seen by a load of x in another]
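  A minimal SPMD sketch with POSIX threads (an illustration, not course code): every thread runs the same function, globals sit in the shared portion of the address space, and each thread's locals are private to its own stack.

      #include <pthread.h>
      #include <stdio.h>

      #define NTHREADS 4
      #define N 1000

      double shared_data[N];                     /* shared: visible to all threads */

      void *worker(void *arg) {
          long my_id = (long)arg;                /* private: lives on this thread's stack */
          double my_sum = 0.0;
          for (int i = my_id; i < N; i += NTHREADS)
              my_sum += shared_data[i];
          printf("thread %ld partial sum %g\n", my_id, my_sum);
          return NULL;
      }

      int main(void) {
          pthread_t t[NTHREADS];
          for (int i = 0; i < N; i++) shared_data[i] = 1.0;
          for (long i = 0; i < NTHREADS; i++)
              pthread_create(&t[i], NULL, worker, (void *)i);
          for (int i = 0; i < NTHREADS; i++)
              pthread_join(t[i], NULL);
          return 0;
      }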

  34. Large Scale Shared Physical Address • Processor performs a load (Ld R <- Addr) • A pseudo-memory controller turns it into a message transaction (a read request carrying src, dest, and addr) over the scalable network; the remote pseudo-processor controller performs the memory operation and replies with the data (a response carrying tag and data) • Examples: BBN Butterfly, Cray T3D [figure: nodes each with processor P, MMU, cache $, and memory M, plus pseudo-memory and pseudo-processor controllers, connected by a scalable network]

  35. Cray T3D Node • 3D torus of pairs of PEs, which share the network interface and BLT (block transfer engine); up to 2048 PEs, 64 MB each • 150 MHz DEC Alpha (64-bit), 8 KB instruction + 8 KB data caches, 43-bit virtual address • 32- and 64-bit memory operations plus byte operations • non-blocking stores + memory barrier, prefetch (prefetch queue: 16 x 64), load-lock / store-conditional • DTB, PE# + FC, DRAM with 32-bit physical address (5 + 27) • message queue: 4080 x 4 x 64 • special registers: swaperand, fetch&add, barrier [figure: node block diagram with processor P, MMU, cache $, and request/response in/out paths to the network]

  36. The Cray T3D • 2048 Alphas (150 MHz, 16 or 64 MB each) + fast network • 43-bit virtual address space, 32-bit physical • 32-bit and 64-bit load/store + byte manipulation on regs. • no L2 cache • non-blocking stores, load/store re-ordering, memory fence • load-lock / store-conditional • Direct global memory access via external segment regs • DTB annex, 32 entries, remote processor number and mode • atomic swap between special local reg and memory • special fetch&inc register • global-OR, global-AND barriers • Prefetch Queue • Block Transfer Engine • User-level Message Queue

  37. T3D Local Read (average latency) • No TLB! • Line size: 32 bytes • L1 cache size: 8 KB • Cache access time: 6.7 ns (1 cycle) • Memory access time: 155 ns (23 cycles) • DRAM page miss: 100 ns (15 cycles)

  38. T3D Remote Read (uncached) • 610 ns (91 cycles): 3-4x a local memory read! • 100 ns DRAM-page miss • Network latency: an additional 13-20 ns (2-3 cycles) per hop

  39. Bulk Read Options

  40. Where are things going? • High end • collections of almost-complete workstations/SMPs on a high-speed network • with a specialized communication assist integrated with the memory system to provide global access to shared data • Mid range • almost all servers are bus-based cache-coherent SMPs • high-end servers are replacing the bus with a network • Sun Enterprise 10000, IBM J90, HP/Convex SPP • the volume approach is the Pentium Pro quad pack + SCI ring • Sequent, Data General • Low end • the SMP desktop is here • Major change ahead • SMP on a chip as a building block
