CSE 661 PAPER PRESENTATION ON-CHIP INTERCONNECTION ARCHITECTURE OF THE TILE PROCESSOR By D. Wentzlaff et al. Presented by SALAMI, Hamza Onoruoiza (g201002240)
OUTLINE OF PRESENTATION • Introduction • Tile64 Architecture • Interconnect Hardware • Network Uses • Network to Tile Interface • Receive-side Hardware Demultiplexing • Protection • Shared Memory Communication and Ordering • Interconnect Software • Communication Interface • Applications • Conclusion
INTRODUCTION • The Tile Processor’s five on-chip 2D mesh networks differ from: • the traditional bus-based scheme, which requires a global broadcast and hence does not scale beyond 8 – 16 cores • the 1D ring, which also fails to scale because its bisection BW stays constant • A 2D mesh can support few or many processors with minimal changes to the network structure
TILE64 ARCHITECTURE • 2D grid of 64 identical compute elements (tiles) arranged in an 8 x 8 mesh • 1 GHz clock, 3-way VLIW tiles; 192 billion 32-bit instructions/sec in aggregate (64 tiles × 3 ops × 1 GHz) • 4.8 MB distributed cache, per-tile TLB • Supports DMA and virtual memory • Tiles may run independent OSs, or may be combined to run a multiprocessor OS such as SMP Linux • Shared memory: cores directly access other cores’ caches through the on-chip interconnects
TILE64 ARCHITECTURE (2) • Off-chip memory BW ≤ 200 Gbps • I/O BW ≥ 40 Gbps
TILE64 ARCHITECTURE (3) [Block diagram of the TILE64 chip] Courtesy: http://www.tilera.com/products/processors/TILE64
INTERCONNECT HARDWARE • 5 low-latency mesh networks • Each network connects a tile in five directions: north, south, east, west, and to the tile’s processor • Each connection consists of two 32-bit unidirectional links
INTERCONNECT HARDWARE (2) • 1.28 Tb/s of BW in and out of a single tile
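(Consistent with the figures above, assuming the four compass-direction ports per tile: 5 networks × 4 links × 2 directions × 32 bits × 1 GHz = 1.28 Tb/s.)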
NETWORK USES • 4 dynamic networks • Packet header contains the destination’s (x, y) coordinate and the packet length (≤ 128 words) • Flow controlled, reliable delivery • UDN (user dynamic network): low-latency communication between user-land processes without OS intervention • IDN (I/O dynamic network): direct communication with I/O devices • MDN (memory dynamic network): communication with off-chip memory • TDN (tile dynamic network): direct tile-to-tile transfers; requests travel over the TDN, responses over the MDN • 1 static network (STN) • Carries streams of data instead of packets • A route is set up first, then streams are sent over it (circuit switched) • Also a user-land network
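As a rough illustration of the dynamic-network header described above, a minimal C sketch follows; the exact field layout and type names are assumptions, since the slide specifies only the (x, y) destination and the length.

```c
/* Sketch of a dynamic-network packet header (hypothetical layout).
 * The slide specifies only that the header carries the destination
 * tile's (x, y) coordinate and a packet length of up to 128 words. */
#include <stdint.h>

typedef struct {
    uint8_t dest_x;  /* destination tile column in the 8 x 8 mesh */
    uint8_t dest_y;  /* destination tile row */
    uint8_t length;  /* payload length in 32-bit words (<= 128) */
} dyn_header_t;
```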
LOGICAL VS. PHYSICAL NETWORKS • 5 physically independent networks • Made affordable by the abundance of free nearest-neighbor on-chip wiring • Buffer space accounts for about 60% of a network’s area, yet each complete network occupies only about 1.1% of the tile’s area • On-chip links are more reliable than off-chip ones, so less buffering is needed to manage link failures
NETWORK TO TILE INTERFACE • Tiles have register-mapped access to the on-chip networks: instructions can read/write the UDN, IDN, or STN directly as register operands • The MDN and TDN are used indirectly, by the cache system on a cache miss
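The register-mapped interface can be pictured with a small conceptual sketch. This is not Tilera's actual mechanism or API: real network accesses are plain register operands in instructions, approximated here with hypothetical volatile FIFO addresses (UDN_TX and UDN_RX are invented names).

```c
#include <stdint.h>

/* Hypothetical FIFO "registers" standing in for the real register-
 * mapped network ports; the addresses are illustrative only. */
#define UDN_TX ((volatile uint32_t *)0xF0000000u)
#define UDN_RX ((volatile uint32_t *)0xF0000004u)

/* Writing injects a word into the network; reading dequeues one
 * (the real hardware stalls the reader until data arrives). */
static inline void udn_send_word(uint32_t w) { *UDN_TX = w; }
static inline uint32_t udn_recv_word(void)   { return *UDN_RX; }
```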
RECEIVE-SIDE HARDWARE DEMULTIPLEXING • Tag word = (sending node, stream number, message type) • Receiving hardware demultiplexes each message into the appropriate queue using its tag • On a tag miss, the data is sent to a ‘catch-all’ queue and an interrupt is raised • The UDN has 4 demux queues plus one ‘catch-all’ • The IDN has 2 demux queues plus one ‘catch-all’ • 128 words of receive-side buffering per tile
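The demux step can be modeled in software as below (a behavioral sketch, not hardware RTL; the 4-queue count follows the slide).

```c
#include <stdbool.h>
#include <stdint.h>

#define UDN_DEMUX_QUEUES 4   /* per the slide: 4 queues + catch-all */

typedef struct {
    uint32_t tag;   /* packed (sending node, stream number, message type) */
    bool     valid;
} demux_entry_t;

static demux_entry_t udn_demux[UDN_DEMUX_QUEUES];

/* Returns the queue index an incoming tag maps to, or -1 for the
 * catch-all queue, in which case hardware would raise an interrupt. */
static int demux(uint32_t tag) {
    for (int q = 0; q < UDN_DEMUX_QUEUES; q++)
        if (udn_demux[q].valid && udn_demux[q].tag == tag)
            return q;
    return -1;   /* tag miss: catch-all + interrupt */
}
```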
PROTECTION • The Tile Architecture implements the Multicore Hardwall (MH) • MH protects the UDN, IDN, and STN links • Standard memory protection mechanisms are used for the MDN and TDN • MH blocks attempts to send traffic over a hardwalled link, then signals an interrupt to system software • Protection is implemented on outbound links
SHARED MEMORY COMMUNICATION AND ORDERING • On-chip distributed shared cache • Data can be retrieved from: • the local cache • the home tile (request sent through the TDN); a line not cached locally is available on chip only at its home tile, where coherence is maintained • main memory • No ordering is guaranteed between the networks and shared memory • Memory-fence instructions are used to enforce ordering where needed
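A small sketch of fencing shared-memory stores before a signal; the slide names memory-fence instructions, and __sync_synchronize() is used here as a generic GCC full barrier standing in for the Tile-specific instruction.

```c
#include <stdint.h>

volatile uint32_t shared_flag;   /* consumer polls this flag */

void publish(uint32_t *shared_buf, uint32_t value) {
    shared_buf[0] = value;   /* store travels over the memory networks */
    __sync_synchronize();    /* fence: make the store visible first... */
    shared_flag = 1;         /* ...then signal the consumer */
}
```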
INTERCONNECT SOFTWARE • The C-based iLib library provides communication primitives implemented via the UDN • Lightweight socket-like streaming channels for streaming algorithms • An MPI-like message-passing interface for ad hoc messaging
COMMUNICATION INTERFACES • iLib socket • Long-lived FIFO point-to-point connection between two processes • Good for producer-consumer relationships • Multiple senders to one receiver are also possible; good for forwarding results to a single node for aggregation • Raw channels: low overhead; can use only the space available in the hardware buffer • Buffered channels: higher overhead, but buffering in memory makes virtualization possible
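The FIFO, point-to-point semantics of a raw channel can be modeled with a self-contained single-producer/single-consumer ring (a sketch, not the iLib API; the 128-word capacity mirrors the per-tile buffering noted earlier, and real concurrent use would also need memory barriers).

```c
#include <stdint.h>

#define CHAN_WORDS 128   /* mirrors the 128-word receive-side buffer */

typedef struct {
    uint32_t buf[CHAN_WORDS];
    volatile unsigned head, tail;   /* single producer, single consumer */
} channel_t;

static int chan_send(channel_t *c, uint32_t w) {
    unsigned next = (c->head + 1) % CHAN_WORDS;
    if (next == c->tail) return -1;   /* full: a raw channel would stall */
    c->buf[c->head] = w;
    c->head = next;
    return 0;
}

static int chan_recv(channel_t *c, uint32_t *w) {
    if (c->tail == c->head) return -1;   /* empty: receiver would stall */
    *w = c->buf[c->tail];
    c->tail = (c->tail + 1) % CHAN_WORDS;
    return 0;
}
```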
COMMUNICATION INTERFACES (2) • Message-passing API • Similar to MPI • Messages can be sent from any node to any other at any time • No need to establish connections • Implementation • Sender: sends a packet with the message key and size • The receiver’s catch-all queue interrupts its processor • If a receive with this key is pending, the receiver sends a packet telling the sender to begin the transfer • Else, the notification is saved • When ilib_msg_receive() is later called with the same key, a packet interrupts the sender to begin the transfer
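A self-contained model of this rendezvous protocol is sketched below; on_announce and msg_receive are illustrative names, not iLib's documented API.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_PENDING 16

typedef struct { int key; size_t size; bool used; } notify_t;
static notify_t pending[MAX_PENDING];   /* saved sender notifications */

/* Catch-all interrupt handler side: a (key, size) packet arrived and
 * no receive was pending, so the notification is saved. */
static void on_announce(int key, size_t size) {
    for (int i = 0; i < MAX_PENDING; i++)
        if (!pending[i].used) {
            pending[i] = (notify_t){ key, size, true };
            return;
        }
}

/* Receiver side: if a matching announce was saved, consume it; the
 * real library would now send a packet telling the sender to begin
 * the bulk transfer. */
static bool msg_receive(int key, size_t *size) {
    for (int i = 0; i < MAX_PENDING; i++)
        if (pending[i].used && pending[i].key == key) {
            *size = pending[i].size;
            pending[i].used = false;
            return true;    /* signal sender to begin transfer */
        }
    return false;           /* no announce yet: keep waiting */
}
```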
COMMUNICATION INTERFACES (3) • The UDN’s maximum BW is 4 bytes/cycle • Raw channels reach a maximum of 3.93 bytes/cycle; the loss is overhead from the header word and tag word • Buffered channels: additional overhead of memory reads/writes • Message passing: additional overhead of interrupting the receiving tile
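(Assuming one header word and one tag word of overhead per maximum-length 128-word packet, the payload rate works out to 126/128 × 4 ≈ 3.94 bytes/cycle, in line with the 3.93 figure quoted above.)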
APPLICATIONS • Corner turn • Reorganizes a distributed array from striping along one dimension to striping along the other • Each core sends data to every other core (all-to-all) • Important factors • Network used for data distribution (TDN via shared memory, or UDN via raw channels) • Network used for tile synchronization (STN or UDN)
APPLICATIONS (2) • Raw channels, STN synch: best performance; raw channels have minimum overhead, and the STN keeps synchronization messages from interfering with data • Raw channels, UDN synch: the UDN carries both data and synchronization messages, so extra overhead data is needed to distinguish the two • Shared memory: simpler to program, but each word of user data incurs four extra words to manage the network and avoid deadlock
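The exchange pattern can be sketched in a single address space as below; NCORES and N are illustrative values, and on TILE64 each transposed block copy would be a network transfer (TDN or UDN) rather than a local write.

```c
#define NCORES 8    /* illustrative core count */
#define N      64   /* illustrative dimension, divisible by NCORES */

/* Core `me` owns a row stripe of src; it scatters transposed blocks
 * into every destination core's column stripe of dst (all-to-all). */
static void corner_turn(const float src[N][N], float dst[N][N], int me) {
    int stripe = N / NCORES;
    for (int dest = 0; dest < NCORES; dest++)
        for (int r = me * stripe; r < (me + 1) * stripe; r++)
            for (int c = dest * stripe; c < (dest + 1) * stripe; c++)
                dst[c][r] = src[r][c];   /* transposed placement */
}
```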
APPLICATIONS (3) • Dot product • Pairwise element multiplication, followed by addition of all the products • 65,536-element dot product • The shared-memory version scales worse because of its higher communication overhead • From 2 to 4 tiles, speedup is sublinear because the dataset already fits entirely in the tiles’ L2 caches
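A minimal sketch of the partitioning: each tile computes a partial sum over its slice of the 65,536 elements, and the partials are then aggregated (here serially in one loop; on TILE64 they would be forwarded over a channel to a single tile).

```c
#include <stddef.h>

#define NELEM  65536   /* element count from the slide */
#define NCORES 4       /* illustrative tile count */

static float partial_dot(const float *a, const float *b,
                         size_t lo, size_t hi) {
    float s = 0.0f;
    for (size_t i = lo; i < hi; i++)
        s += a[i] * b[i];   /* multiply pairwise, accumulate */
    return s;
}

float dot(const float *a, const float *b) {
    float sum = 0.0f;
    size_t chunk = NELEM / NCORES;
    for (int t = 0; t < NCORES; t++)   /* one slice per tile */
        sum += partial_dot(a, b, t * chunk, (t + 1) * chunk);
    return sum;
}
```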
CONCLUSION • The Tile Processor uses an unconventional mesh-based architecture to achieve high on-chip communication BW • Effective use of that BW is possible due to the synergy between the hardware architecture and the software APIs (iLib)