CSE 661 PAPER PRESENTATION ON-CHIP INTERCONNECTION ARCHITECTURE OF THE TILE PROCESSOR By D. Wentzlaff et al. Presented by SALAMI, Hamza Onoruoiza (g201002240)
OUTLINE OF PRESENTATION • Introduction • Tile64 Architecture • Interconnect Hardware • Network Uses • Network to Tile Interface • Receive-side Hardware Demultiplexing • Protection • Shared Memory Communication and Ordering • Interconnect Software • Communication Interface • Applications • Conclusion
INTRODUCTION • The Tile Processor’s five on-chip 2D mesh networks differ from: • the traditional bus-based scheme, which requires a global broadcast and hence does not scale beyond 8 – 16 cores • the 1D ring, which also fails to scale because its bisection BW stays constant • A 2D mesh can support few or many processors with minimal changes to the network structure
TILE64 ARCHITECTURE • 2D grid of 64 identical compute elements (tiles) arranged in an 8 x 8 mesh • 1 GHz clock, 3-way VLIW tiles; 192 billion 32-bit instructions/sec in aggregate (64 tiles × 3 ops × 1 GHz) • 4.8 MB distributed cache, per-tile TLB • Supports DMA and virtual memory • Tiles may run independent OSs, or may be combined to run a multiprocessor OS such as SMP Linux • Shared memory: cores directly access other cores’ caches through the on-chip interconnects
TILE64 ARCHITECTURE (2) • Off-chip memory BW ≤ 200 Gbps • I/O BW ≥ 40 Gbps
TILE64 ARCHITECTURE (3) [Block diagram of the TILE64 chip] Courtesy: http://www.tilera.com/products/processors/TILE64
INTERCONNECT HARDWARE • 5 low-latency mesh networks • Each network connects a tile in five directions: north, south, east, west, and to the tile’s processor • Each connection consists of two 32-bit unidirectional links
INTERCONNECT HARDWARE (2) • 1.28 Tb/s of BW in and out of a single tile
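(Consistent with the figures above, assuming the four compass-direction ports per tile: 5 networks × 4 links × 2 directions × 32 bits × 1 GHz = 1.28 Tb/s.)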
NETWORK USES • 4 dynamic networks • Packet header contains the destination’s (x, y) coordinate and the packet length (≤ 128 words) • Flow controlled, reliable delivery • UDN (user dynamic network): low-latency communication between user-land processes without OS intervention • IDN (I/O dynamic network): direct communication with I/O devices • MDN (memory dynamic network): communication with off-chip memory • TDN (tile dynamic network): direct tile-to-tile transfers; requests travel over the TDN, responses over the MDN • 1 static network (STN) • Carries streams of data instead of packets • A route is set up first, then streams are sent over it (circuit switched) • Also a user-land network
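As a rough illustration of the dynamic-network header described above, a minimal C sketch follows; the exact field layout and type names are assumptions, since the slide specifies only the (x, y) destination and the length.

```c
/* Sketch of a dynamic-network packet header (hypothetical layout).
 * The slide specifies only that the header carries the destination
 * tile's (x, y) coordinate and a packet length of up to 128 words. */
#include <stdint.h>

typedef struct {
    uint8_t dest_x;  /* destination tile column in the 8 x 8 mesh */
    uint8_t dest_y;  /* destination tile row */
    uint8_t length;  /* payload length in 32-bit words (<= 128) */
} dyn_header_t;
```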
LOGICAL VS. PHYSICAL NETWORKS • 5 physically independent networks • Made affordable by the abundance of free nearest-neighbor on-chip wiring • Buffer space accounts for about 60% of a network’s area, yet each complete network occupies only about 1.1% of the tile’s area • On-chip links are more reliable than off-chip ones, so less buffering is needed to manage link failures
NETWORK TO TILE INTERFACE • Tiles have register-mapped access to the on-chip networks: instructions can read/write the UDN, IDN, or STN directly as register operands • The MDN and TDN are used indirectly, by the cache system on a cache miss
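The register-mapped interface can be pictured with a small conceptual sketch. This is not Tilera's actual mechanism or API: real network accesses are plain register operands in instructions, approximated here with hypothetical volatile FIFO addresses (UDN_TX and UDN_RX are invented names).

```c
#include <stdint.h>

/* Hypothetical FIFO "registers" standing in for the real register-
 * mapped network ports; the addresses are illustrative only. */
#define UDN_TX ((volatile uint32_t *)0xF0000000u)
#define UDN_RX ((volatile uint32_t *)0xF0000004u)

/* Writing injects a word into the network; reading dequeues one
 * (the real hardware stalls the reader until data arrives). */
static inline void udn_send_word(uint32_t w) { *UDN_TX = w; }
static inline uint32_t udn_recv_word(void)   { return *UDN_RX; }
```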
RECEIVE-SIDE HARDWARE DEMULTIPLEXING • Tag word = (sending node, stream number, message type) • Receiving hardware demultiplexes each message into the appropriate queue using its tag • On a tag miss, the data is sent to a ‘catch-all’ queue and an interrupt is raised • The UDN has 4 demux queues plus one ‘catch-all’ • The IDN has 2 demux queues plus one ‘catch-all’ • 128 words of receive-side buffering per tile
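The demux step can be modeled in software as below (a behavioral sketch, not hardware RTL; the 4-queue count follows the slide).

```c
#include <stdbool.h>
#include <stdint.h>

#define UDN_DEMUX_QUEUES 4   /* per the slide: 4 queues + catch-all */

typedef struct {
    uint32_t tag;   /* packed (sending node, stream number, message type) */
    bool     valid;
} demux_entry_t;

static demux_entry_t udn_demux[UDN_DEMUX_QUEUES];

/* Returns the queue index an incoming tag maps to, or -1 for the
 * catch-all queue, in which case hardware would raise an interrupt. */
static int demux(uint32_t tag) {
    for (int q = 0; q < UDN_DEMUX_QUEUES; q++)
        if (udn_demux[q].valid && udn_demux[q].tag == tag)
            return q;
    return -1;   /* tag miss: catch-all + interrupt */
}
```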
PROTECTION • The Tile Architecture implements the Multicore Hardwall (MH) • MH protects the UDN, IDN, and STN links • Standard memory protection mechanisms are used for the MDN and TDN • MH blocks attempts to send traffic over a hardwalled link, then signals an interrupt to system software • Protection is implemented on outbound links
SHARED MEMORY COMMUNICATION AND ORDERING • On-chip distributed shared cache • Data can be retrieved from: • the local cache • the home tile (request sent through the TDN); a line not cached locally is available on chip only at its home tile, where coherence is maintained • main memory • No ordering is guaranteed between the networks and shared memory • Memory-fence instructions are used to enforce ordering where needed
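A small sketch of fencing shared-memory stores before a signal; the slide names memory-fence instructions, and __sync_synchronize() is used here as a generic GCC full barrier standing in for the Tile-specific instruction.

```c
#include <stdint.h>

volatile uint32_t shared_flag;   /* consumer polls this flag */

void publish(uint32_t *shared_buf, uint32_t value) {
    shared_buf[0] = value;   /* store travels over the memory networks */
    __sync_synchronize();    /* fence: make the store visible first... */
    shared_flag = 1;         /* ...then signal the consumer */
}
```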
INTERCONNECT SOFTWARE • The C-based iLib library provides communication primitives implemented via the UDN • Lightweight socket-like streaming channels for streaming algorithms • An MPI-like message-passing interface for ad hoc messaging
COMMUNICATION INTERFACES • iLib socket • Long-lived FIFO point-to-point connection between two processes • Good for producer-consumer relationships • Multiple senders to one receiver are also possible; good for forwarding results to a single node for aggregation • Raw channels: low overhead; can use only the space available in the hardware buffer • Buffered channels: higher overhead, but buffering in memory makes virtualization possible
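The FIFO, point-to-point semantics of a raw channel can be modeled with a self-contained single-producer/single-consumer ring (a sketch, not the iLib API; the 128-word capacity mirrors the per-tile buffering noted earlier, and real concurrent use would also need memory barriers).

```c
#include <stdint.h>

#define CHAN_WORDS 128   /* mirrors the 128-word receive-side buffer */

typedef struct {
    uint32_t buf[CHAN_WORDS];
    volatile unsigned head, tail;   /* single producer, single consumer */
} channel_t;

static int chan_send(channel_t *c, uint32_t w) {
    unsigned next = (c->head + 1) % CHAN_WORDS;
    if (next == c->tail) return -1;   /* full: a raw channel would stall */
    c->buf[c->head] = w;
    c->head = next;
    return 0;
}

static int chan_recv(channel_t *c, uint32_t *w) {
    if (c->tail == c->head) return -1;   /* empty: receiver would stall */
    *w = c->buf[c->tail];
    c->tail = (c->tail + 1) % CHAN_WORDS;
    return 0;
}
```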
COMMUNICATION INTERFACES (2) • Message-passing API • Similar to MPI • Messages can be sent from any node to any other at any time • No need to establish connections • Implementation • Sender: sends a packet with the message key and size • The receiver’s catch-all queue interrupts its processor • If a receive with this key is pending, the receiver sends a packet telling the sender to begin the transfer • Else, the notification is saved • When ilib_msg_receive() is later called with the same key, a packet interrupts the sender to begin the transfer
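A self-contained model of this rendezvous protocol is sketched below; on_announce and msg_receive are illustrative names, not iLib's documented API.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_PENDING 16

typedef struct { int key; size_t size; bool used; } notify_t;
static notify_t pending[MAX_PENDING];   /* saved sender notifications */

/* Catch-all interrupt handler side: a (key, size) packet arrived and
 * no receive was pending, so the notification is saved. */
static void on_announce(int key, size_t size) {
    for (int i = 0; i < MAX_PENDING; i++)
        if (!pending[i].used) {
            pending[i] = (notify_t){ key, size, true };
            return;
        }
}

/* Receiver side: if a matching announce was saved, consume it; the
 * real library would now send a packet telling the sender to begin
 * the bulk transfer. */
static bool msg_receive(int key, size_t *size) {
    for (int i = 0; i < MAX_PENDING; i++)
        if (pending[i].used && pending[i].key == key) {
            *size = pending[i].size;
            pending[i].used = false;
            return true;    /* signal sender to begin transfer */
        }
    return false;           /* no announce yet: keep waiting */
}
```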
COMMUNICATION INTERFACES (3) • The UDN’s maximum BW is 4 bytes/cycle • Raw channels reach a maximum of 3.93 bytes/cycle; the loss is overhead from the header word and tag word • Buffered channels: additional overhead of memory reads/writes • Message passing: additional overhead of interrupting the receiving tile
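(Assuming one header word and one tag word of overhead per maximum-length 128-word packet, the payload rate works out to 126/128 × 4 ≈ 3.94 bytes/cycle, in line with the 3.93 figure quoted above.)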
APPLICATIONS • Corner turn • Reorganizes a distributed array from striping along one dimension to striping along the other • Each core sends data to every other core (all-to-all) • Important factors • Network used for data distribution (TDN via shared memory, or UDN via raw channels) • Network used for tile synchronization (STN or UDN)
APPLICATIONS (2) • Raw channels, STN synch: best performance; raw channels have minimum overhead, and the STN keeps synchronization messages from interfering with data • Raw channels, UDN synch: the UDN carries both data and synchronization messages, so extra overhead data is needed to distinguish the two • Shared memory: simpler to program, but each word of user data incurs four extra words to manage the network and avoid deadlock
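The exchange pattern can be sketched in a single address space as below; NCORES and N are illustrative values, and on TILE64 each transposed block copy would be a network transfer (TDN or UDN) rather than a local write.

```c
#define NCORES 8    /* illustrative core count */
#define N      64   /* illustrative dimension, divisible by NCORES */

/* Core `me` owns a row stripe of src; it scatters transposed blocks
 * into every destination core's column stripe of dst (all-to-all). */
static void corner_turn(const float src[N][N], float dst[N][N], int me) {
    int stripe = N / NCORES;
    for (int dest = 0; dest < NCORES; dest++)
        for (int r = me * stripe; r < (me + 1) * stripe; r++)
            for (int c = dest * stripe; c < (dest + 1) * stripe; c++)
                dst[c][r] = src[r][c];   /* transposed placement */
}
```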
APPLICATIONS (3) • Dot product • Pairwise element multiplication, followed by addition of all the products • 65,536-element dot product • The shared-memory version scales worse because of its higher communication overhead • From 2 to 4 tiles, speedup is sublinear because the dataset already fits entirely in the tiles’ L2 caches
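A minimal sketch of the partitioning: each tile computes a partial sum over its slice of the 65,536 elements, and the partials are then aggregated (here serially in one loop; on TILE64 they would be forwarded over a channel to a single tile).

```c
#include <stddef.h>

#define NELEM  65536   /* element count from the slide */
#define NCORES 4       /* illustrative tile count */

static float partial_dot(const float *a, const float *b,
                         size_t lo, size_t hi) {
    float s = 0.0f;
    for (size_t i = lo; i < hi; i++)
        s += a[i] * b[i];   /* multiply pairwise, accumulate */
    return s;
}

float dot(const float *a, const float *b) {
    float sum = 0.0f;
    size_t chunk = NELEM / NCORES;
    for (int t = 0; t < NCORES; t++)   /* one slice per tile */
        sum += partial_dot(a, b, t * chunk, (t + 1) * chunk);
    return sum;
}
```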
CONCLUSION • The Tile Processor uses an unconventional mesh-based architecture to achieve high on-chip communication BW • Effective use of that BW is possible due to the synergy between the hardware architecture and the software APIs (iLib)