
Synchronization and Communication in the T3E Multiprocessor


Presentation Transcript


  1. Synchronization and Communication in the T3E Multiprocessor

  2. Background • T3E is the second of Cray’s massively scalable multiprocessors (after the T3D) • Both scale up to 2048 processing elements • Shared-memory systems, programmable using message passing (PVM or MPI, for portability) or shared memory (HPF)

  3. Challenges • The T3E (and T3D) attempted to overcome the inherent limitations of building very large multiprocessors from commodity microprocessors • Memory interface: a cache-line-based design makes references to single words inefficient • Typical address spaces are too small for very large systems • Non-cached references are often desirable (e.g. sending a message to another processor)

  4. T3D Strengths (used in T3E) • External structure in each PE to expand the address space • Shared address space • 3D torus interconnect • Pipelined remote memory access with a prefetch queue and non-cached stores

  5. T3D: Room for improvement • Over-engineered dedicated barrier network • Only one outstanding cache-line fill at a time (low load bandwidth) • Too many different mechanisms for accessing remote memory • Low single-node performance • Special hardware features left unoptimized (block transfer engine, DTB Annex, dedicated message queues and registers)

  6. T3E Overview • Each PE contains an Alpha 21164, local memory, and control and routing chips • Network links are time-multiplexed at 5X the system frequency • Self-hosted, running Unicos/mk • No remote caching and no board-level caches

  7. E-Registers • Extend the physical address space • Increase attainable memory pipelining • Enable high single-word bandwidth • Provide mechanisms for data distribution, messaging, and atomic memory operations • Overall, they replace the T3D’s collection of inefficient special-purpose structures with one unified mechanism

  8. Operations with E-Registers • The processor first stores the operands into the appropriate E-registers • It then issues one more store to initiate the operation • The store’s address specifies the command and the source or destination E-register • The store’s data specifies a pointer to the already-stored operands plus a remote address index (sketched below)
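  The two-store sequence is easiest to see in pseudocode. The sketch below is illustrative only: the base address, command code, and bit packings are hypothetical stand-ins, not the real T3E encodings.

    #include <stdint.h>

    #define EREG_CMD_BASE 0xFFF1000000ULL   /* hypothetical command-issue region */
    #define CMD_GET       0x1ULL            /* hypothetical command code */

    static inline void uncached_store(uint64_t addr, uint64_t data) {
        *(volatile uint64_t *)(uintptr_t)addr = data;
    }

    /* Issue a single-word Get whose result lands in E-register `ereg`. */
    void ereg_get(unsigned ereg, uint64_t operand_block, uint64_t addr_index) {
        /* Step 1 (already done by ordinary stores): the operands,
         * such as the centrifuge mask and base address, sit in the
         * E-register block pointed to by `operand_block`. */

        /* Step 2: one more store launches the operation. The store's
         * ADDRESS encodes the command and the destination E-register;
         * its DATA carries the operand-block pointer and the remote
         * address index. */
        uint64_t cmd_addr = EREG_CMD_BASE | (CMD_GET << 16) | ereg;
        uncached_store(cmd_addr, (operand_block << 32) | addr_index);
    }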

  9. Address Translation • Global virtual addresses and virtual PE numbers are formed outside the processor • A centrifuge separates address bits for efficient data distribution (modeled below) • Placing the memory location on the data bus, rather than the address bus, enables an address space larger than the processor’s own
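  A small software model makes the centrifuge concrete: index bits selected by a mask are compacted into a virtual PE number, and the remaining bits are compacted into a local offset. The function below is an illustrative model of that behavior, not the hardware implementation.

    #include <stdint.h>

    /* Software model of the centrifuge: mask-selected bits of `index`
     * become the virtual PE number, the rest become the local offset. */
    void centrifuge(uint64_t index, uint64_t mask,
                    uint64_t *pe, uint64_t *offset) {
        uint64_t p = 0, o = 0;
        int pbit = 0, obit = 0;
        for (int i = 0; i < 64; i++) {
            uint64_t bit = (index >> i) & 1;
            if ((mask >> i) & 1)
                p |= bit << pbit++;   /* masked bits: PE number */
            else
                o |= bit << obit++;   /* unmasked bits: local offset */
        }
        *pe = p;
        *offset = o;
    }

    /* Example: mask = 0x7 << 6 distributes consecutive 64-word blocks
     * across 8 PEs in round-robin fashion. */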

  10. Remote Reads/Writes • All transfers are done by reading into E-registers (Gets) or writing from E-registers to memory (Puts) • Vector forms transfer 8 words with arbitrary stride (e.g. every 3rd word) • The large number of E-registers allows deep pipelining of Gets and Puts • Throughput is limited by the bus interface (256B per 26.7ns) • Single-word load bandwidth is high: sparse words can be gathered into contiguous E-registers and then moved into cache, instead of fetching a whole cache line per word
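  These Gets and Puts surfaced to programmers through Cray’s shmem one-sided library, the ancestor of today’s OpenSHMEM. A minimal sketch of a strided Get, using the OpenSHMEM spelling of the call:

    #include <shmem.h>

    long src[24];   /* symmetric array: same address on every PE */
    long dst[8];

    void gather_every_third(int remote_pe) {
        /* Target stride 1, source stride 3, 8 elements. On the T3E
         * this kind of transfer maps onto a vector Get issued through
         * E-registers. */
        shmem_long_iget(dst, src, 1, 3, 8, remote_pe);
    }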

  11. Atomic Memory Operations • Fetch_&_Inc, Fetch_&_Add, Compare_&_Swap, Masked_Swap • Can be performed on any memory location • Performed like any E-register operation • Operands in E-registers • Triggered via store, sent over network • Result sent back and stored in specified E-register
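  These are the same primitives exposed by the shmem atomics. A sketch of a distributed ticket counter built on them (OpenSHMEM spellings assumed):

    #include <shmem.h>

    long counter = 0;   /* symmetric variable; the copy on PE 0 is used */

    long next_ticket(void) {
        /* Fetch_&_Inc: executes in PE 0's memory system and returns
         * the old value, with no involvement from PE 0's processor. */
        return shmem_long_finc(&counter, 0);
    }

    long add_n(long n) {
        return shmem_long_fadd(&counter, n, 0);   /* Fetch_&_Add */
    }

    int try_claim(long *flag, long my_id) {
        /* Compare_&_Swap: claim `flag` on PE 0 if it is still 0. */
        return shmem_long_cswap(flag, 0, my_id, 0) == 0;
    }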

  12. Messaging • T3D: a single queue at a fixed location, of fixed size • T3E: an arbitrary number of queues, mapped to normal memory, of any size up to 128 MB • T3D: all incoming messages generated interrupts, adding significant penalties • T3E: three options: interrupt, don’t interrupt (messages detected via polling), or interrupt after a threshold number of messages

  13. Messaging Specifics • Each message queue is controlled by a Message Queue Control Word (MQCW) • A message is assembled in 8 E-registers, then a SEND is issued with the address of the target MQCW • Queues are managed in software, which avoids the OS entirely when polling is used (see the sketch below)
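  The software-managed queue is what makes the polling option cheap: the consumer simply compares its head pointer against a tail that hardware advances on each arrival. A minimal sketch, with an entirely hypothetical MQCW layout:

    #include <stdint.h>

    typedef struct {
        volatile uint64_t tail;  /* advanced by hardware as messages arrive */
        uint64_t head;           /* advanced by software as messages drain */
        uint64_t size;           /* queue capacity in messages */
    } mqcw_t;                    /* hypothetical layout, for illustration */

    typedef struct { uint64_t word[8]; } message_t;   /* 8-word message */

    /* Copies out one pending message and returns 1, or returns 0 if
     * the queue is empty. No interrupt and no OS involvement. */
    int poll_queue(mqcw_t *q, message_t *ring, message_t *out) {
        if (q->head == q->tail)
            return 0;
        *out = ring[q->head % q->size];
        q->head++;
        return 1;
    }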

  14. Synchronization • Hardware support for barriers and eurekas (a eureka is a notification from any one processor to the whole group) • 32 barrier synchronization units (BSUs) at each processor, accessed as memory-mapped registers • Synchronization packets travel on a dedicated high-priority virtual channel • Propagated through a logical tree embedded in the 3D torus interconnect

  15. Synchronization • A simple barrier involves 2 states • Each processor in the group first arms its BSU (S_ARM) • Once all are armed, the network notifies everyone of completion and the BSUs return to S_BAR • A eureka requires 3 states, to ensure one eureka is received by all before the next is issued • A eureka notification is therefore immediately followed by a barrier (see the sketch below)
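  From a single PE’s point of view, the two-state barrier reduces to one store and one spin. The BSU pointer below is a hypothetical placeholder for one of the 32 memory-mapped BSU registers:

    typedef enum { S_BAR = 0, S_ARM = 1 } bsu_state_t;

    extern volatile bsu_state_t *bsu;   /* memory-mapped BSU register */

    void hardware_barrier(void) {
        *bsu = S_ARM;            /* arm: announce arrival to the tree */
        while (*bsu != S_BAR)    /* the network flips the BSU back to */
            ;                    /* S_BAR once the whole group is armed */
    }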

  16. Performance • Using more E-registers allows greater pipelining and bandwidth, up to a limit imposed by the control logic • Effective bandwidth rises sharply with transfer size, as fixed overhead and startup latency are amortized

  17. Performance • Several million AMOs per second are required to saturate the memory system and drive latency up • Transfer bandwidth is independent of stride, except when the stride maps successive references to the same memory bank(s) (strides that are multiples of 4 or 8)

  18. Performance • Very high message bandwidth is sustained without a latency increase • The hardware barrier is many times faster than an efficient software barrier (roughly 15X at 1024 PEs)

  19. Conclusions • E-registers enable a highly pipelined memory system and provide a common interface for all global memory operations • Both messaging and standard shared-memory operations are supported • A fast hardware barrier is provided at almost no extra cost • The absence of remote caching eliminates the need for bulky coherence mechanisms and helps make 2048-PE systems feasible • The paper provides no quantitative comparison with alternative systems
