
Synchronization and Communication in the T3E Multiprocessor


Presentation Transcript


  1. Synchronization and Communication in the T3E Multiprocessor

  2. Background • T3E is the second of Cray’s massively scalable multiprocessors (after the T3D) • Both scale up to 2048 processing elements • Shared-memory systems, programmable using message passing (PVM or MPI, for portability) or shared memory (HPF)

  3. Challenges • The T3E (and T3D) attempted to overcome the inherent limitations of building very large multiprocessors from commodity microprocessors • Memory interface: a cache-line-based design makes references to single words inefficient • Typical address spaces are too small for very large systems • Non-cached references are often desirable (e.g. sending a message to another processor)

  4. T3D Strengths (used in T3E) • External structure in each PE to expand the address space • Shared address space • 3D torus interconnect • Pipelined remote memory access with a prefetch queue and non-cached stores

  5. T3D: Room for improvement • Over-engineered dedicated barrier network • Only one outstanding cache-line fill at a time (low load bandwidth) • Too many different mechanisms for accessing remote memory • Low single-node performance • Special hardware features left unoptimized (block transfer engine, DTB Annex, dedicated message queues and registers)

  6. T3E Overview • Each PE contains an Alpha 21164, local memory, and control and routing chips • Network links are time-multiplexed at 5X the system frequency • Self-hosted, running Unicos/mk • No remote caching and no board-level caches

  7. E-Registers • Extend the physical address space • Increase attainable memory pipelining • Enable high single-word bandwidth • Provide mechanisms for data distribution, messaging, and atomic memory operations • Overall, they replace the T3D’s collection of inefficient special-purpose structures with one unified mechanism

  8. Operations with E-Registers • The processor first stores the operands into the appropriate E-registers • It then issues one more store to initiate the operation • The store’s address specifies the command and the source or destination E-register • The store’s data specifies a pointer to the already-stored operands plus a remote address index (sketched below)
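  The two-store sequence is easiest to see in pseudocode. The sketch below is illustrative only: the base address, command code, and bit packings are hypothetical stand-ins, not the real T3E encodings.

    #include <stdint.h>

    #define EREG_CMD_BASE 0xFFF1000000ULL   /* hypothetical command-issue region */
    #define CMD_GET       0x1ULL            /* hypothetical command code */

    static inline void uncached_store(uint64_t addr, uint64_t data) {
        *(volatile uint64_t *)(uintptr_t)addr = data;
    }

    /* Issue a single-word Get whose result lands in E-register `ereg`. */
    void ereg_get(unsigned ereg, uint64_t operand_block, uint64_t addr_index) {
        /* Step 1 (already done by ordinary stores): the operands,
         * such as the centrifuge mask and base address, sit in the
         * E-register block pointed to by `operand_block`. */

        /* Step 2: one more store launches the operation. The store's
         * ADDRESS encodes the command and the destination E-register;
         * its DATA carries the operand-block pointer and the remote
         * address index. */
        uint64_t cmd_addr = EREG_CMD_BASE | (CMD_GET << 16) | ereg;
        uncached_store(cmd_addr, (operand_block << 32) | addr_index);
    }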

  9. Address Translation • Global virtual addresses and virtual PE numbers are formed outside the processor • A centrifuge separates address bits for efficient data distribution (modeled below) • Placing the memory location on the data bus, rather than the address bus, enables an address space larger than the processor’s own
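  A small software model makes the centrifuge concrete: index bits selected by a mask are compacted into a virtual PE number, and the remaining bits are compacted into a local offset. The function below is an illustrative model of that behavior, not the hardware implementation.

    #include <stdint.h>

    /* Software model of the centrifuge: mask-selected bits of `index`
     * become the virtual PE number, the rest become the local offset. */
    void centrifuge(uint64_t index, uint64_t mask,
                    uint64_t *pe, uint64_t *offset) {
        uint64_t p = 0, o = 0;
        int pbit = 0, obit = 0;
        for (int i = 0; i < 64; i++) {
            uint64_t bit = (index >> i) & 1;
            if ((mask >> i) & 1)
                p |= bit << pbit++;   /* masked bits: PE number */
            else
                o |= bit << obit++;   /* unmasked bits: local offset */
        }
        *pe = p;
        *offset = o;
    }

    /* Example: mask = 0x7 << 6 distributes consecutive 64-word blocks
     * across 8 PEs in round-robin fashion. */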

  10. Remote Reads/Writes • All transfers are done by reading into E-registers (Gets) or writing from E-registers to memory (Puts) • Vector forms transfer 8 words with arbitrary stride (e.g. every 3rd word) • The large number of E-registers allows deep pipelining of Gets and Puts • Throughput is limited by the bus interface (256B per 26.7ns) • Single-word load bandwidth is high: sparse words can be gathered into contiguous E-registers and then moved into cache, instead of fetching a whole cache line per word
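  These Gets and Puts surfaced to programmers through Cray’s shmem one-sided library, the ancestor of today’s OpenSHMEM. A minimal sketch of a strided Get, using the OpenSHMEM spelling of the call:

    #include <shmem.h>

    long src[24];   /* symmetric array: same address on every PE */
    long dst[8];

    void gather_every_third(int remote_pe) {
        /* Target stride 1, source stride 3, 8 elements. On the T3E
         * this kind of transfer maps onto a vector Get issued through
         * E-registers. */
        shmem_long_iget(dst, src, 1, 3, 8, remote_pe);
    }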

  11. Atomic Memory Operations • Fetch_&_Inc, Fetch_&_Add, Compare_&_Swap, Masked_Swap • Can be performed on any memory location • Performed like any E-register operation • Operands in E-registers • Triggered via store, sent over network • Result sent back and stored in specified E-register
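  These are the same primitives exposed by the shmem atomics. A sketch of a distributed ticket counter built on them (OpenSHMEM spellings assumed):

    #include <shmem.h>

    long counter = 0;   /* symmetric variable; the copy on PE 0 is used */

    long next_ticket(void) {
        /* Fetch_&_Inc: executes in PE 0's memory system and returns
         * the old value, with no involvement from PE 0's processor. */
        return shmem_long_finc(&counter, 0);
    }

    long add_n(long n) {
        return shmem_long_fadd(&counter, n, 0);   /* Fetch_&_Add */
    }

    int try_claim(long *flag, long my_id) {
        /* Compare_&_Swap: claim `flag` on PE 0 if it is still 0. */
        return shmem_long_cswap(flag, 0, my_id, 0) == 0;
    }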

  12. Messaging • T3D: a single queue at a fixed location, of fixed size • T3E: an arbitrary number of queues, mapped to normal memory, of any size up to 128 MB • T3D: all incoming messages generated interrupts, adding significant penalties • T3E: three options: interrupt, don’t interrupt (messages detected via polling), or interrupt after a threshold number of messages

  13. Messaging Specifics • Each message queue is controlled by a Message Queue Control Word (MQCW) • A message is assembled in 8 E-registers, then a SEND is issued with the address of the target MQCW • Queues are managed in software, which avoids the OS entirely when polling is used (see the sketch below)
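  The software-managed queue is what makes the polling option cheap: the consumer simply compares its head pointer against a tail that hardware advances on each arrival. A minimal sketch, with an entirely hypothetical MQCW layout:

    #include <stdint.h>

    typedef struct {
        volatile uint64_t tail;  /* advanced by hardware as messages arrive */
        uint64_t head;           /* advanced by software as messages drain */
        uint64_t size;           /* queue capacity in messages */
    } mqcw_t;                    /* hypothetical layout, for illustration */

    typedef struct { uint64_t word[8]; } message_t;   /* 8-word message */

    /* Copies out one pending message and returns 1, or returns 0 if
     * the queue is empty. No interrupt and no OS involvement. */
    int poll_queue(mqcw_t *q, message_t *ring, message_t *out) {
        if (q->head == q->tail)
            return 0;
        *out = ring[q->head % q->size];
        q->head++;
        return 1;
    }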

  14. Synchronization • Hardware support for barriers and eurekas (a eureka is a notification from any one processor to the whole group) • 32 barrier synchronization units (BSUs) at each processor, accessed as memory-mapped registers • Synchronization packets travel on a dedicated high-priority virtual channel • Propagated through a logical tree embedded in the 3D torus interconnect

  15. Synchronization • A simple barrier involves 2 states • Each processor in the group first arms its BSU (S_ARM) • Once all are armed, the network notifies everyone of completion and the BSUs return to S_BAR • A eureka requires 3 states, to ensure one eureka is received by all before the next is issued • A eureka notification is therefore immediately followed by a barrier (see the sketch below)
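  From a single PE’s point of view, the two-state barrier reduces to one store and one spin. The BSU pointer below is a hypothetical placeholder for one of the 32 memory-mapped BSU registers:

    typedef enum { S_BAR = 0, S_ARM = 1 } bsu_state_t;

    extern volatile bsu_state_t *bsu;   /* memory-mapped BSU register */

    void hardware_barrier(void) {
        *bsu = S_ARM;            /* arm: announce arrival to the tree */
        while (*bsu != S_BAR)    /* the network flips the BSU back to */
            ;                    /* S_BAR once the whole group is armed */
    }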

  16. Performance • Using more E-registers allows greater pipelining and bandwidth, up to a limit imposed by the control logic • Effective bandwidth rises sharply with transfer size, as fixed overhead and startup latency are amortized

  17. Performance • Several million AMOs per second are required to saturate the memory system and drive latency up • Transfer bandwidth is independent of stride, except when the stride maps successive references to the same memory bank(s) (strides that are multiples of 4 or 8)

  18. Performance • Very high message bandwidth is sustained without a latency increase • The hardware barrier is many times faster than an efficient software barrier (roughly 15X at 1024 PEs)

  19. Conclusions • E-registers enable a highly pipelined memory system and provide a common interface for all global memory operations • Both messaging and standard shared-memory operations are supported • A fast hardware barrier is provided at almost no extra cost • The absence of remote caching eliminates the need for bulky coherence mechanisms and helps make 2048-PE systems feasible • The paper provides no quantitative comparison with alternative systems
