
Computer architecture II




  1. Computer Architecture II: Introduction

  2. Recap • Importance of parallelism • Architecture classification • Flynn (SISD, SIMD, MISD, MIMD) • Memory access (shared memory, message passing) • Clusters • Grids • Top500

  3. Today's plan • Parallel architecture convergence (Culler's classification) • Shared memory (single address space) • Message passing • Data parallel (SIMD) • Data flow • Systolic

  4. Convergence of Architectural Models • Culler's classification of parallel architectures • Shared address space • Message passing • Data parallel • Others: • Dataflow • Systolic arrays • Examine programming model, motivation, intended applications, and contributions to convergence

  5. Where Is Parallel Architecture Going? • Old view: divergent architectures (systolic arrays, SIMD, message passing, dataflow, shared memory), no predictable pattern of growth, each requiring its own application and system software • Uncertainty of direction paralyzed parallel software development!

  6. New view: convergence of parallel architectures • Systolic arrays, SIMD, message passing, dataflow, and shared memory all converge toward a generic architecture

  7. Parallel computer • Last class' definition: "A parallel computer is a collection of processing elements that cooperate to solve large problems fast" • Extend the sequential computer architecture with a communication architecture • Computer architecture has two important aspects: • Abstractions: hardware/software, user/system • Implementation of these abstractions • Communication architecture as well: • Abstractions: communication and synchronization operations • Implementations of these abstractions • Programming model: • Abstractions • Implementations of these abstractions

  8. Modern Layered Framework: layers of architectural abstraction • [Figure: parallel applications (CAD, database, scientific modeling, multiprogramming) sit on programming models (multiprogramming, shared address, message passing, data parallel); the communication abstraction below them (compilation or library) spans the user/system boundary; operating systems support and communication hardware span the hardware/software boundary, down to the physical communication medium]

  9. Programming Model • What the programmer uses when coding applications • Specifies communication and synchronization • Examples: • Multiprogramming: no communication or synchronization at the program level • Shared address space: like a bulletin board • Message passing: like letters or phone calls, explicit point to point • Data parallel: global simultaneous actions on data • Implemented with shared address space or message passing

  10. Modern Layered Framework • [Figure: the layered diagram from slide 8, shown again]

  11. Communication Abstraction • The programming model is built on a communication abstraction • Possibilities: • Supported directly by hardware • OS (sockets) • User software • Combination of OS and hardware: page fault + OS handler • Earlier: the communication abstraction was oriented toward the programming model • Today: compilers and software play important roles as bridges (MPI/OpenMP)

  12. Shared Address Space (SAS) Architectures • Any processor can directly reference any memory location • Communication occurs implicitly as a result of loads and stores • Convenient: • Location transparency • Similar programming model to time-sharing on uniprocessors • Except processes run on different processors • Naturally provided on a wide range of platforms • History dates at least to precursors of mainframes in the early 60s • Wide range of scale: few to hundreds of processors • Popularly known as shared memory machines or model • Memory may be physically distributed among processors • UMA • NUMA

  13. SAS-UMA (Uniform Memory Access) • Any processor can directly reference any memory location • Theoretically the same access time for all accesses • [Figure: processors P1, P2, …, Pn connected through an interconnect to memory modules M1, M2, …, Mk]

  14. SAS-NUMA (Non-Uniform Memory Access) • Any processor can directly reference any memory location (including the memory of remote processors) • NI integrated into the memory system • IMPORTANT DIFFERENCE! On message-passing machines P1 cannot access M2 (NI integrated into the I/O system) • Different access times for local and remote memory • [Figure: processing elements PE1…PEn, each pairing a processor Pi with a local memory Mi, connected by an interconnect]

  15. SAS Memory Model • Process: virtual address space plus one or more threads of control • Portions of the address spaces of processes are shared • Writes to a shared address are visible to the other threads (in other processes too) • Natural extension of the uniprocessor model: • Communication: read/write the memory • Synchronization: special atomic operations (we come back to this later) • ONE OS uses shared memory to coordinate processes • [Figure: virtual address spaces of processes P0…Pn; each process's shared portion maps to common physical addresses in the machine's physical address space, while each private portion (P private 0 … P private n) maps to private memory; loads and stores to the shared portion implement communication]
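The load/store communication and atomic synchronization described above can be sketched with ordinary threads, which share one address space within a process. A minimal sketch (the variable and function names are illustrative, and the lock stands in for the hardware's atomic operations):

```python
import threading

# Threads share the process's address space: "communication" is simply
# reading and writing the same variable, and synchronization uses an
# atomic primitive (here, a lock around the read-modify-write).
counter = 0
lock = threading.Lock()

def worker(increments):
    global counter
    for _ in range(increments):
        with lock:            # atomic update of the shared location
            counter += 1

threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 4000: every thread's writes are visible to all the others
```

Without the lock, the four read-modify-write sequences could interleave and lose updates, which is exactly why the slide calls out special atomic operations for synchronization.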

  16. Communication Hardware • Natural extension of the uniprocessor • We already have processors, one or more memory modules, and I/O controllers connected by a hardware interconnect of some sort • Memory capacity increases by adding modules, I/O by adding controllers • Add processors for processing!

  17. History • "Mainframe" approach • Motivated by multiprogramming • Extends the crossbar used for memory and I/O bandwidth • Bandwidth scales with the number of processors • Originally processor cost was high • Later, the cost of the crossbar dominated: use a multistage network • Multistage: reduces the incremental cost, increases latency • "Minicomputer" approach • Almost all microprocessor systems have a bus • Used heavily for parallel computing • Called symmetric multiprocessor (SMP) • Latency larger than for a uniprocessor • The bus is the bandwidth bottleneck • Caching is key: coherence problem • Low incremental cost • [Figure: crossbar connecting processors and I/O controllers to memory modules; bus-based SMP with a cache ($) per processor]

  18. SAS UMA Example: Intel Pentium Pro Quad • All coherence and multiprocessing in the processor module • Highly integrated, targeted at high volume • Low latency and bandwidth

  19. SAS UMA Example: Sun UltraSPARC-based Enterprise • 16 cards of either type: processors + memory, or I/O • All memory accessed over the bus, so symmetric • Higher bandwidth, higher latency bus

  20. NUMA • [Figure: processing elements PE1…PEn, each with a processor Pi, cache Ci, and local memory Mi, connected by an interconnect]

  21. SAS-NUMA Example: Cray T3E • Scales up to 1024 processors, 480 MB/s links • The memory controller generates a communication request for non-local references (no local caching of remote data; the SGI Origin has it) • No hardware mechanism for coherence (the SGI Origin has one)

  22. Message Passing Architectures • High-level block diagram similar to distributed-memory SAS • NIC integrated into the I/O system, needn't be in the memory system • Like clusters, but with tighter integration • Easier to build than scalable SAS • [Figure: processing elements PE1…PEn, each pairing a processor Pi with a private memory Mi, connected by an interconnect]

  23. Message Passing Architectures • Communication via explicit I/O operations • In the SAS case, through memory accesses • Programming model: • Directly access only the private address space (local memory) • Communication via explicit messages (send/receive) • Farther from the hardware operations • Library (MPI) • OS intervention (e.g., page fault handling in page-based DSM)

  24. Message-Passing Abstraction • Send specifies the buffer to be transmitted and the receiving process • Recv specifies the sending process and a buffer to receive into • Optional tag on send and matching rule on receive • Many overheads: copying, buffer management, protection • [Figure: process P executes send X, Q, t (local address X, destination process Q, tag t); process Q executes a matching receive Y, P, t into its local address Y]
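The send/recv pairing with a tag-based matching rule can be mimicked with per-process mailboxes. This is an illustrative sketch only (queue-backed, with threads standing in for processes P and Q), not how a real message-passing machine moves data:

```python
import queue
import threading

# Each "process" owns a private mailbox; send() deposits a message and
# recv() blocks until a message with a matching sender and tag arrives.
mailbox = {"P": queue.Queue(), "Q": queue.Queue()}

def send(data, dest, src, tag):
    mailbox[dest].put((src, tag, data))

def recv(me, src, tag):
    while True:
        s, t, data = mailbox[me].get()   # blocks until a message arrives
        if s == src and t == tag:        # matching rule: sender and tag
            return data
        mailbox[me].put((s, t, data))    # no match: requeue and keep waiting

def process_p():
    send([1, 2, 3], dest="Q", src="P", tag=7)   # "send X, Q, t"

result = []
def process_q():
    result.append(recv("Q", src="P", tag=7))    # "receive Y, P, t"

tp = threading.Thread(target=process_p)
tq = threading.Thread(target=process_q)
tp.start(); tq.start(); tp.join(); tq.join()
print(result[0])  # [1, 2, 3]
```

The overheads the slide lists show up even here: the data is copied into and out of the queue, the queue is buffer management, and the per-process mailboxes provide the protection.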

  25. Evolution of Message-Passing Machines • Early machines: • Store and forward • FIFO on each link • Hardware close to the programming model • Synchronous operations • Only neighboring nodes could be named! • Replaced by DMA, enabling non-blocking operations • Buffered by the system at the destination until recv • Diminishing role of topology: • Topology less important • All nodes named • Pipelined wormhole routing (asynchronous MP) • Cost is in the node-network interface • Simplifies programming (earlier you had to map your program onto the topology)

  26. Example: IBM SP-2 • Made out of essentially complete RS6000 workstations • Network interface integrated into the I/O bus (bandwidth limited by the I/O bus) • 8×8 crossbar switch

  27. Example: Intel Paragon

  28. SAS & MP Architectural Convergence • SAS machines • SOFTWARE: MP send/recv supported via buffers • HARDWARE: at the lowest level, even hardware SAS passes hardware messages • MP machines • SOFTWARE: SAS global address space constructed on MP (software DSM) • Page-based (or finer-grained) shared virtual memory • HARDWARE: tighter NI integration even for MP (low latency, high bandwidth) • Due to the emergence of fast system area networks (SAN) • Traditionally the NI is integrated into the memory system for SAS NUMA systems • Clusters of SMP workstations

  29. Data Parallel Systems (SIMD) • Architectural model • SIMD: array of many simple, cheap processors with little memory each • Processors don't sequence through instructions • Attached to a control processor that issues instructions • Specialized and general communication, cheap global synchronization • Original motivations • Matches simple differential-equation solvers • We'll see the Ocean current simulation • Centralizes the high cost of instruction fetch/sequencing

  30. Data Parallel Systems • Programming model • Operations performed in parallel on each element of a data structure • Logically a single thread of control, performing sequential or parallel steps • Conceptually, a processor is associated with each data element • After a phase of computation, all the processors synchronize

  31. Application of Data Parallelism • Each PE contains an employee record with his/her salary • Each PE has a condition flag: execute the instruction or not • Ex: work in parallel on several employee records If salary > 100K then salary = salary * 1.05 else salary = salary * 1.10 • Logically, the whole operation is a single step • Some processors are enabled for the arithmetic operation, others disabled • Other examples: • Differential equations, linear algebra, … • Document searching, graphics, image processing, … • Last machines: • Thinking Machines CM-1, CM-2 (and CM-5) • Maspar MP-1 and MP-2
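The salary update above can be written as a single logical step with per-element condition flags; a plain-Python sketch (the salary values are made up for illustration):

```python
salaries = [80_000, 120_000, 95_000, 150_000]

# Step 1: every PE evaluates the condition and sets its enable flag.
flags = [s > 100_000 for s in salaries]

# Step 2: PEs whose flag is set apply *1.05; the disabled PEs are then
# enabled (flags inverted) and apply *1.10 instead.
updated = [s * 1.05 if f else s * 1.10 for s, f in zip(salaries, flags)]

print([round(u) for u in updated])  # [88000, 126000, 104500, 157500]
```

Although written as two list comprehensions here, on a SIMD machine each step is one broadcast instruction executed simultaneously by all enabled PEs.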

  32. Data parallel machines evolution • The architecture has disappeared today, but the programming model is still popular • Popular when the cost savings of a centralized sequencer were high • 60s, when a CPU was a cabinet • Replaced by vectors in the mid-70s • More flexible memory layout and easier to manage • No need to map the problem onto the infrastructure • Revived in the mid-80s when a 32-bit datapath fit on a chip • Modern microprocessors more attractive today • Other reasons for the demise • Simple, regular applications have good locality and can do well anyway • Loss of applicability due to hardwiring data parallelism • MIMD machines are as effective for data parallelism, and more general

  33. Convergence • Programming model • Still exists, separated from the hardware • Converges to SPMD (single program, multiple data) • Map the local data structure onto the hardware machine model • HPF, OpenMP • Needed fast global synchronization • Global address space, implemented with either SAS or MP

  34. Dataflow Architectures • Represent computation (the program) as a graph of essential dependences • Logical processor at each node, activated by the availability of its operands • Messages (tokens) carrying the tag of the next instruction are sent to the next processor • The tag is compared with others in the matching store; a match fires execution

  35. Data-flow architectures • Key characteristics • Name operations anywhere in the machine • Support synchronization for independent operations • Dynamic scheduling at the machine level • The architectures died out • Problems • Operations have locality across them; it is useful to group them together • Handling complex data structures like arrays • Complexity of the matching store and memory units • Expose too much parallelism • Too fine-grained • Hurts locality

  36. Data-flow architectures convergence • Converged to use conventional processors and memory • Support for a large, dynamic set of threads mapped to processors • Typically a shared address space as well • Separation of the programming model from the hardware (like data-parallel) • Lasting contributions: • Integration of communication with thread (handler) generation • Tightly integrated communication and fine-grained synchronization • Data flow is a useful concept for software (compilers etc.)

  37. Systolic Architectures • Replace the single processor with an array of regular processing elements • Orchestrate data flow for high throughput with less memory access • Different from pipelining: • Nonlinear array structure, multidirectional data flow; each PE may have a (small) local instruction and data memory • Different from SIMD: each PE may do something different • Initial motivation: VLSI enables inexpensive special-purpose chips • Represent algorithms directly by chips connected in a regular pattern

  38. Systolic Arrays (contd.) Example: systolic array for 1-D convolution, y(i) = w1*x(i) + w2*x(i+1) + w3*x(i+2) + w4*x(i+3) • [Figure: inputs x1…x8 stream through a chain of PEs holding weights w4, w3, w2, w1; each PE computes x_out = x_in and y_out = y_in + w * x_in, producing outputs y1, y2, y3] • Practical realizations (e.g., iWARP from CMU-Intel) use quite general processors • Enable a variety of algorithms on the same hardware • Dedicated interconnect channels • Data transfer directly from register to register across a channel • Specialized, with the same problems as SIMD • General-purpose systems work well for the same algorithms (locality etc.)
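The per-PE rule on the slide (x_out = x_in, y_out = y_in + w * x_in) can be checked with a small simulation. This sketch streams each input window through the PE chain rather than modeling the cycle-by-cycle pipeline, and the weights and inputs are made up:

```python
def pe(weight, x_in, y_in):
    """One processing element: x_out = x_in, y_out = y_in + weight * x_in."""
    return x_in, y_in + weight * x_in

def systolic_convolve(w, x):
    """1-D convolution y(i) = w1*x(i) + w2*x(i+1) + ... via a PE chain."""
    ys = []
    for i in range(len(x) - len(w) + 1):
        y = 0
        # The window x[i], x[i+1], ... flows through the chain of PEs,
        # each holding one weight, and the partial sum y accumulates.
        for k in range(len(w)):
            _, y = pe(w[k], x[i + k], y)
        ys.append(y)
    return ys

w = [2, 1, 3, 4]                    # weights w1..w4, one per PE
x = [1, 2, 3, 4, 5, 6, 7, 8]
print(systolic_convolve(w, x))
```

The key systolic property preserved here is that each PE performs only its one multiply-accumulate and passes data to its neighbor; no PE ever touches global memory for intermediate results.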

  39. Recap: Generic Multiprocessor Architecture
