
Understanding Parallel Computer Architecture

This lecture provides an overview of parallel computer architecture: communication architecture, the major programming models, and the general structure and fundamental issues. It contrasts abstraction with implementation, examines the two facets of computer architecture, and presents a layered perspective spanning parallel applications (CAD, databases, scientific modeling), programming models, the communication abstraction, and the underlying hardware.



Presentation Transcript


  1. Parallel Computer Architecture (并行计算机体系结构) • Lecture 2 • March 4, 2019 • Wu Junmin (jmwu@ustc.edu.cn)

  2. Overview • Understanding Communication Architecture • Major programming models • General structure and fundamental issues • Abstraction vs. Implementation

  3. Parallel Computer Architecture • Two facets of Computer Architecture: • Defines critical abstractions • especially at the HW/SW boundary • the set of operations and the data types they operate on • Organizational structure that realizes these abstractions • Parallel Computer Arch. = Comp. Arch. + Communication Arch. • Comm. architecture has the same two facets • communication abstraction • primitives at the user/system and HW/SW boundaries

  4. Layered Perspective of PCA (layers, top to bottom) • Parallel applications: CAD, database, scientific modeling, multiprogramming • Programming models: shared address, message passing, data parallel • Compilation or library • Communication abstraction (user/system boundary) • Operating systems support • (hardware/software boundary) • Communication hardware • Physical communication medium

  5. Communication Architecture User/System Interface + Organization • User/System Interface: • Comm. primitives exposed to user-level by hw and system-level sw • Implementation: • Organizational structures that implement the primitives: HW or OS • How optimized are they? How integrated into processing node? • Structure of network • Goals: • Performance • Broad applicability • Programmability • Scalability • Low Cost

  6. Communication Abstraction • User-level communication primitives provided • Realizes the programming model • Mapping exists between language primitives of the programming model and these primitives • Supported directly by hw, or via OS, or via user sw • Lots of debate about what to support in sw and where to draw the gap between layers • Today: • Hw/sw interface tends to be flat, i.e. complexity roughly uniform • Compilers and software play important roles as bridges today • Technology trends exert strong influence • Result is convergence in organizational structure • Relatively simple, general-purpose communication primitives

  7. Understanding Parallel Architecture • Traditional taxonomies not very useful • Programming models not enough, nor hardware structures • Same one can be supported by radically different architectures • Compilers, libraries, programs • Design of user/system and hardware/software interface • Constrained from above by progr. models and below by technology • Guiding principles provided by layers • What primitives are provided at communication abstraction • How programming models map to these • How they are mapped to hardware

  8. Overview • Understanding Communication Architecture • Major programming models • General structure and fundamental issues • Abstraction vs. Implementation

  9. History • (Figure: application software and system software built on divergent architectures: systolic arrays, SIMD, message passing, dataflow, shared memory) • Parallel architectures tied closely to programming models • Divergent architectures, with no predictable pattern of growth • Mid-80s renaissance

  10. Programming Model • Conceptualization of the machine that programmer uses in coding applications • How parts cooperate and coordinate their activities • Specifies communication and synchronization operations • Multiprogramming • no communication or synch. at program level • Shared address space • like bulletin board • Message passing • like letters or phone calls, explicit point to point • Data parallel: • more regimented, global actions on data • Several agents perform an action on separate elements of a data set simultaneously and then exchange information globally before continuing en masse. • Implemented with shared address space or message passing

  11. Shared Address Space Model

  12. Shared Memory => Shared Addr. Space • Primitives map one-to-one onto the comm. abstraction • Why is it attractive? • Ease of programming • Memory capacity increased by adding modules • I/O by adding controllers and devices • Add processors for processing! • For higher-throughput multiprogramming, or parallel programs

  13. Shared Physical Memory • Any processor can directly reference any memory location • Any I/O controller can access any memory • Operating system can run on any processor, or all • OS uses shared memory to coordinate • Communication occurs implicitly as a result of loads and stores • What about application processes?

  14. Shared Virtual Address Space • Process = address space plus thread of control • Virtual-to-physical mapping can be established so that processes share portions of the address space • User-kernel or multiple processes • Multiple threads of control in one address space • Popular approach to structuring OSes • Now standard application capability (ex: POSIX threads, Java Thread) • Writes to shared addresses visible to other threads • Natural extension of the uniprocessor model • conventional memory operations for communication • special atomic operations for synchronization • also plain loads/stores
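For example, a minimal sketch of this model, assuming POSIX threads (the names shared_counter and worker are illustrative, not from the lecture): two threads communicate through an ordinary shared variable, with a mutex as the special synchronization operation.

    #include <pthread.h>
    #include <stdio.h>

    /* Shared portion of the address space: visible to both threads. */
    static int shared_counter = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 1000; i++) {
            pthread_mutex_lock(&lock);    /* synchronization operation */
            shared_counter++;             /* ordinary store, visible to the other thread */
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("shared_counter = %d\n", shared_counter);  /* expect 2000 */
        return 0;
    }

Built with a pthreads-capable compiler (e.g. cc -pthread), both threads see the same shared_counter because their virtual addresses map to the same physical memory.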

  15. Structured Shared Address Space • (Figure: virtual address spaces of processes P0..Pn mapped onto the machine's physical address space; loads and stores to the shared portion reach common physical addresses, while each process keeps its own private portion) • Ad hoc parallelism used in system code • Most parallel applications have structured SAS • Same program on each processor • shared variable X means the same thing to each thread

  16. Shared Memory: code segment for computing π

    #define N 100000
    main() {
        double local, pi = 0.0, w;
        long i;
    A:  w = 1.0 / N;
    B:  #pragma parallel
        #pragma shared(pi, w)
        #pragma local(i, local)
        {
        #pragma pfor iterate(i = 0; N; 1)
            for (i = 0; i < N; i++) {
    P:          local = (i + 0.5) * w;
    Q:          local = 4.0 / (1.0 + local * local);
            }
    C:  #pragma critical
            pi = pi + local;
        }
    D:  printf("pi is %f\n", pi * w);
    }
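The same computation in a minimal modern sketch, assuming an OpenMP-capable C compiler (this uses a standard reduction clause in place of the lecture's SGI-style pragmas and explicit critical section):

    #include <stdio.h>
    #define N 100000

    int main(void)
    {
        double pi = 0.0;
        double w = 1.0 / N;

        /* Each thread accumulates a private partial sum; the reduction
           clause combines the partial sums at the end of the loop. */
        #pragma omp parallel for reduction(+:pi)
        for (long i = 0; i < N; i++) {
            double x = (i + 0.5) * w;
            pi += 4.0 / (1.0 + x * x);
        }

        printf("pi is %f\n", pi * w);
        return 0;
    }

Compiled with, e.g., gcc -fopenmp, this prints an approximation of π.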

  17. Historical Development • "Mainframe" approach • Motivated by multiprogramming • Extends the crossbar used for memory and I/O • Processor cost-limited => crossbar • Bandwidth scales with p • High incremental cost • use multistage networks instead (latency and bandwidth trade-off) • "Minicomputer" approach • Almost all microprocessor systems have a bus • Motivated by multiprogramming, transaction processing • Used heavily for parallel computing • Called symmetric multiprocessor (SMP) • Latency larger than for a uniprocessor • Bus is the bandwidth bottleneck • caching is key: coherence problem • Low incremental cost

  18. Engineering: Intel Pentium Pro Quad • All coherence and multiprocessing glue in processor module • Highly integrated, targeted at high volume • Low latency and bandwidth

  19. MultiCore

  20. Engineering: SUN Enterprise • Proc + mem cards and I/O cards • 16 cards of either type • All memory accessed over the bus, so symmetric • Higher bandwidth, higher latency bus

  21. Scaling Up • (Figures: "dance hall" organization with processors and caches on one side of the network and memory modules on the other, vs. distributed memory with a memory module attached to each processor node) • Problem is interconnect: cost (crossbar) or bandwidth (bus) • Dance-hall: bandwidth still scalable, but lower cost than crossbar • latencies to memory uniform, but uniformly large • Distributed memory or non-uniform memory access (NUMA) • Construct shared address space out of simple message transactions across a general-purpose network (e.g. read-request, read-response) • Caching shared (particularly nonlocal) data?

  22. Engineering: Cray T3E • Scale up to 1024 processors, 480MB/s links • Memory controller generates request message for non-local references • No hardware mechanism for coherence • SGI Origin etc. provide this

  23. SGI Altix UV 1000

  24. Generic Architecture • (Figure: a generic node of processor, cache, and memory attached to a scalable network; systolic arrays, SIMD, message passing, dataflow, and shared memory all converge toward this organization)

  25. Message Passing Architectures • (Figure: nodes of processor, cache, and memory connected by a network) • Complete computer as building block, including I/O • Communication via explicit I/O operations • Programming model • direct access only to private address space (local memory) • communication via explicit messages (send/receive) • High-level block diagram • Communication integration? • Mem, I/O, LAN, Cluster • Easier to build and scale than SAS • Programming model more removed from basic hardware operations • Library or OS intervention

  26. Message-Passing Abstraction • (Figure: process P executes send X, Q, t from local address X; process Q executes receive Y, P, t into local address Y; the tag t matches the two) • Sender specifies the buffer to be transmitted and the receiving process • Receiver specifies the sending process and the application storage to receive into • Memory-to-memory copy, but need to name processes • Optional tag on send and matching rule on receive • User process names local data and entities in process/tag space too • In simplest form, the send/recv match achieves a pairwise synch event • Other variants too • Many overheads: copying, buffer management, protection
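A concrete sketch of this abstraction, assuming MPI (the buffer name value and tag 99 are illustrative): process 0 names the destination and a tag in its send; process 1 names the source and the matching tag in its receive.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank;
        double value = 3.14;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* Sender: buffer, destination process, tag. */
            MPI_Send(&value, 1, MPI_DOUBLE, 1, 99, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* Receiver: local storage, source process, matching tag. */
            MPI_Recv(&value, 1, MPI_DOUBLE, 0, 99, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("process 1 received %f from process 0\n", value);
        }

        MPI_Finalize();
        return 0;
    }

In the simplest (blocking) form, the matched send/receive pair also serves as the pairwise synchronization event described above.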

  27. Message Passing: code segment for computing π

    #include <stdio.h>
    #include "mpi.h"
    #define N 100000
    int main(int argc, char *argv[]) {
        double local = 0.0, pi, w, temp = 0.0;
        long i;
        int taskid, numtask;
    A:  w = 1.0 / N;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &taskid);
        MPI_Comm_size(MPI_COMM_WORLD, &numtask);
    B:  for (i = taskid; i < N; i = i + numtask) {
    P:      temp = (i + 0.5) * w;
    Q:      local = 4.0 / (1.0 + temp * temp) + local;
        }
    C:  MPI_Reduce(&local, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    D:  if (taskid == 0) printf("pi is %f\n", pi * w);
        MPI_Finalize();
    }

  28. Evolution of Message-Passing Machines • Early machines: FIFO on each link • HW close to prog. model • synchronous ops • topology central (hypercube algorithms) • CalTech Cosmic Cube (Seitz, CACM Jan. 1985)

  29. Diminishing Role of Topology • Shift to general links • DMA, enabling non-blocking ops • Buffered by system at destination until recv • Store&forward routing • Diminishing role of topology • Any-to-any pipelined routing • node-to-network interface dominates communication time • Simplifies programming • Allows richer design space • grids vs. hypercubes • Intel iPSC/1 -> iPSC/2 -> iPSC/860 • Store&forward: H x (T0 + n/B) vs. pipelined: T0 + H x D + n/B (H hops, per-hop delay D, message size n, bandwidth B, overhead T0)
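To see why, a quick worked example with assumed, purely illustrative numbers: T0 = 10 µs, n/B = 10 µs (say n = 1 KB at 100 MB/s), H = 10 hops, per-hop delay D = 0.1 µs. Store-and-forward costs H x (T0 + n/B) = 10 x 20 = 200 µs, while pipelined routing costs T0 + H x D + n/B = 10 + 1 + 10 = 21 µs; with pipelining the hop count contributes only 1 µs, which is why topology matters far less.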

  30. Example Intel Paragon

  31. Building on the mainstream: IBM SP-2 • Made out of essentially complete RS6000 workstations • Network interface integrated in I/O bus (bw limited by I/O bus)

  32. Berkeley NOW • 100 Sun Ultra2 workstations • Intelligent network interface • proc + mem • Myrinet Network • 160 MB/s per link • 300 ns per hop

  33. Toward Architectural Convergence • Evolution and role of software have blurred boundary • Send/recv supported on SAS machines via buffers • Can construct global address space on MP (GA -> P | LA) • Page-based (or finer-grained) shared virtual memory • Hardware organization converging too • Tighter NI integration even for MP (low-latency, high-bandwidth) • Hardware SAS passes messages • Even clusters of workstations/SMPs are parallel systems • Emergence of fast system area networks (SAN) • Programming models distinct, but organizations converging • Nodes connected by general network and communication assists • Implementations also converging, at least in high-end machines

  34. Convergence: Generic Parallel Architecture • Node: processor(s), memory system, plus communication assist • Network interface and communication controller • Scalable network • Convergence allows lots of innovation, within framework • Integration of assist with node, what operations, how efficiently... • For example

  35. Programming Models Serve to Impose Structure on Programs

  36. Data Parallel Systems • Programming model • Operations performed in parallel on each element of data structure • Logically single thread of control, performs sequential or parallel steps • Conceptually, a processor associated with each data element • Architectural model • Array of many simple, cheap processors with little memory each • Processors don’t sequence through instructions • Attached to a control processor that issues instructions • Specialized and general communication, cheap global synchronization • Original motivations • Matches simple differential equation solvers • Centralize high cost of instruction fetch/sequencing

  37. Data Parallel: code segment for computing π

    main() {
        long i, j, t, N = 100000;
        double local[N], temp[N], pi, w;
    A:  w = 1.0 / N;
    B:  forall (i = 0; i < N; i++) {
    P:      local[i] = (i + 0.5) * w;
    Q:      temp[i] = 4.0 / (1.0 + local[i] * local[i]);
        }
    C:  pi = sum(temp);
    D:  printf("pi is %f\n", pi * w);
    }

  38. Connection Machine (Tucker, IEEE Computer, Aug. 1988)

  39. Evolution and Convergence • SIMD popular when the cost savings of a centralized sequencer were high • 1960s, when the CPU filled a cabinet • Replaced by vectors in the mid-70s • More flexible w.r.t. memory layout and easier to manage • Revived in the mid-80s when 32-bit datapath slices just fit on a chip • Simple, regular applications have good locality • Programming model converges with SPMD (single program multiple data) • need fast global synchronization • Structured global address space, implemented with either SAS or MP

  40. CM-5 • Repackaged SparcStation • 4 per board • Fat-Tree network • Control network for global synchronization

  41. Dataflow Architecture • (Figure: dataflow graph for a = (b+1)*(b-c); d = c*e; f = a*d, with +, -, and * nodes connected by data dependences) • Program is represented by a graph of essential data dependences • An instruction may execute whenever its data operands are available • The graph is spread arbitrarily over a collection of processors
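A minimal sketch of the firing rule for the graph above, in plain C (node names like mul_a and the send_token helper are illustrative, not from any real dataflow machine): each node counts its outstanding operands and fires the moment the count reaches zero; a real machine would enqueue ready nodes and match tokens in hardware rather than fire them recursively.

    #include <stdio.h>

    /* One dataflow node: an operation fires when both operands arrive. */
    typedef struct Node {
        char op;              /* '+', '-', '*' */
        double operand[2];
        int missing;          /* operands still outstanding */
        struct Node *dest;    /* where the result token goes (NULL = final) */
        int dest_port;        /* which operand slot at the destination */
    } Node;

    /* Deliver a token; if the destination now has all operands, fire it. */
    static void send_token(Node *n, int port, double value)
    {
        if (!n) { printf("result f = %f\n", value); return; }
        n->operand[port] = value;
        if (--n->missing == 0) {                      /* firing rule */
            double a = n->operand[0], b = n->operand[1];
            double r = (n->op == '+') ? a + b : (n->op == '-') ? a - b : a * b;
            send_token(n->dest, n->dest_port, r);     /* propagate result token */
        }
    }

    int main(void)
    {
        /* Graph for a = (b+1)*(b-c); d = c*e; f = a*d with b=2, c=3, e=4. */
        Node mul_f = {'*', {0, 0}, 2, NULL,   0};     /* f = a * d        */
        Node mul_a = {'*', {0, 0}, 2, &mul_f, 0};     /* a = (b+1)*(b-c)  */
        Node mul_d = {'*', {0, 0}, 2, &mul_f, 1};     /* d = c * e        */
        Node add   = {'+', {0, 0}, 2, &mul_a, 0};     /* b + 1            */
        Node sub   = {'-', {0, 0}, 2, &mul_a, 1};     /* b - c            */

        /* Inject input tokens; execution order is driven purely by
           operand availability, not by program order. */
        send_token(&add, 0, 2.0);   send_token(&add, 1, 1.0);    /* b, 1 */
        send_token(&sub, 0, 2.0);   send_token(&sub, 1, 3.0);    /* b, c */
        send_token(&mul_d, 0, 3.0); send_token(&mul_d, 1, 4.0);  /* c, e */
        return 0;
    }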

  42. Basic Execution Pipeline • (Figure: token queue -> waiting-matching -> instruction fetch from program store -> execute -> form token -> network, feeding back into the token queue) • Token: a message consisting of data and the tag of its destination node

  43. Systolic Architecture • (Figure: a linear systolic array for the convolution Y(i) = w1*x(i) + w2*x(i+1) + w3*x(i+2) + w4*x(i+3); the x values stream through cells holding weights w1..w4, and each cell computes xout = x; x = xin; yout = yin + w*xin) • A given algorithm is represented directly as a collection of specialized computational units connected in a regular, space-efficient pattern • Data moves through the system at regular "heartbeats"
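A minimal software sketch of the cell equations above, assuming four weights and a short input stream (the array names and values are illustrative): it shows what each cell contributes (yout = yin + w*xin) but flattens the heartbeat-level pipelining into ordinary nested loops.

    #include <stdio.h>

    #define NW 4          /* number of weights / cells */
    #define NX 8          /* length of the input stream */

    /* One systolic cell step: the partial result picks up w * xin as it
       passes through (yout = yin + w * xin); x itself passes on unchanged. */
    static double cell(double w, double xin, double yin)
    {
        return yin + w * xin;
    }

    int main(void)
    {
        double w[NW] = {1.0, 2.0, 3.0, 4.0};        /* w1..w4 */
        double x[NX] = {1, 2, 3, 4, 5, 6, 7, 8};    /* input stream */

        /* Each output Y(i) is formed by flowing a partial sum through the
           chain of cells while the corresponding x values stream past. */
        for (int i = 0; i + NW <= NX; i++) {
            double y = 0.0;
            for (int k = 0; k < NW; k++)
                y = cell(w[k], x[i + k], y);        /* one heartbeat per cell */
            printf("Y(%d) = %.1f\n", i, y);
        }
        return 0;
    }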

  44. Flynn’s Taxonomy • Classify by number of instruction streams x number of data streams: SISD, SIMD, MISD, MIMD

  45. SIMD

  46. MIMD

  47. MISD

  48. Overview • Understanding Communication Architecture • Major programming models • General structure and fundamental issues • Abstraction vs. Implementation

  49. Generic Architecture • (Figure repeated: systolic arrays, SIMD, message passing, dataflow, and shared memory all converging on the generic node-plus-network organization)

  50. Convergence: Generic Parallel Architecture • Node: processor(s), memory system, plus communication assist • Network interface and communication controller • Scalable network • Convergence allows lots of innovation, within framework • Integration of assist with node, what operations, how efficiently... • For example
