
CSE 661 PAPER PRESENTATION

This paper presentation discusses the on-chip interconnection architecture of the Tile processor, including the tile architecture, interconnect hardware, network uses, communication interfaces, and applications. The presentation also covers protection and shared memory communication and ordering.



Presentation Transcript


  1. CSE 661 PAPER PRESENTATION: ON-CHIP INTERCONNECTION ARCHITECTURE OF THE TILE PROCESSOR, by D. Wentzlaff et al. Presented by SALAMI, Hamza Onoruoiza (g201002240)

  2. OUTLINE OF PRESENTATION • Introduction • Tile64 Architecture • Interconnect Hardware • Network Uses • Network to Tile Interface • Receive-side Hardware Demultiplexing • Protection • Shared Memory Communication and Ordering • Interconnect Software • Communication Interface • Applications • Conclusion

  3. INTRODUCTION
  • The Tile Processor's five on-chip 2D mesh networks differ from:
    • traditional bus-based schemes, which require a global broadcast and therefore do not scale beyond 8-16 cores
    • 1D rings, which also fail to scale because bisection bandwidth stays constant as cores are added
  • A 2D mesh can support few or many processors with minimal changes to the network structure

  4. TILE64 ARCHITECTURE
  • 2D grid of 64 identical compute elements (tiles) arranged in an 8 x 8 mesh
  • 1 GHz clock, 3-way VLIW: 192 billion 32-bit instructions/sec
  • 4.8 MB distributed cache; per-tile TLB
  • Supports DMA and virtual memory
  • Tiles may run independent OSs, or may be combined to run a multiprocessor OS such as SMP Linux
  • Shared memory: cores directly access other cores' caches through the on-chip interconnects
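  The peak instruction rate is just the product of the figures above: 64 tiles × 3 instructions/cycle (3-way VLIW) × 1 GHz = 192 × 10^9 32-bit instructions per second.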

  5. TILE64 ARCHITECTURE (2)
  • Off-chip memory BW: up to 200 Gb/s
  • I/O BW: at least 40 Gb/s

  6. TILE64 ARCHITECTURE (3) Courtesy: http://www.tilera.com/products/processors/TILE64

  7. INTERCONNECT HARDWARE
  • Five low-latency mesh networks
  • Each network connects a tile in five directions: north, south, east, west, and the processor
  • Each link consists of two 32-bit unidirectional links

  8. INTERCONNECT HARDWARE (2)
  • 1.28 Tb/s of bandwidth in and out of a single tile
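  One accounting consistent with the hardware on the previous slide, assuming one 32-bit word per link per 1 GHz cycle: 5 networks × 4 neighbor directions × 2 unidirectional links × 32 bits × 1 GHz = 1,280 Gb/s ≈ 1.28 Tb/s per tile.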

  9. NETWORK USES
  • Four dynamic networks:
    • Each packet's header carries the destination's (x, y) coordinates and the packet length (≤ 128 words)
    • Flow controlled, with reliable delivery
    • UDN: low-latency communication between userland processes, with no OS intervention
    • IDN: direct communication with I/O devices
    • MDN: communication with off-chip memory
    • TDN: direct tile-to-tile cache transfers; requests travel on the TDN, responses return on the MDN
  • One static network (STN):
    • Carries streams of data rather than packets
    • A route is set up first, then data is streamed over it (circuit switched)
    • Also a userland network
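  As an illustration of the header described above, here is a hypothetical C view of a dynamic-network packet header. The talk names the fields but not their widths or positions, so the layout below is an assumption:

    /* Hypothetical layout: field widths and ordering are assumptions;
     * only the fields themselves come from the talk. */
    typedef struct {
        unsigned dest_x : 8;   /* destination tile's x coordinate */
        unsigned dest_y : 8;   /* destination tile's y coordinate */
        unsigned length : 8;   /* payload length in 32-bit words (<= 128) */
        unsigned unused : 8;   /* padding to a full 32-bit header word */
    } dyn_packet_header_t;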

  10. LOGICAL VS. PHYSICAL NETWORKS
  • Five physically independent networks
  • Nearest-neighbor on-chip wiring is plentiful, so extra physical networks are cheap
  • Buffer space accounts for about 60% of a network's area, while each complete network occupies only about 1.1% of the tile's area
  • On-chip links are more reliable than off-chip ones, so less buffering is needed to manage link failures

  11. NETWORK TO TILE INTERFACE
  • Network access is register mapped: instructions can read/write directly from/to the UDN, IDN, or STN
  • The MDN and TDN are used indirectly, when a cache miss generates memory traffic

  12. RECEIVE-SIDE HARDWARE DEMULTIPLEXING
  • Tag word = (sending node, stream number, message type)
  • Receiving hardware demultiplexes each message into the appropriate queue using its tag
  • On a tag miss, the data is sent to a 'catch-all' queue and an interrupt is raised
  • UDN has 4 demux queues plus one catch-all; IDN has 2 demux queues plus one catch-all
  • 128 words of receive-side buffering per tile
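  A minimal C sketch of the tag-matching step, purely for illustration; the real demultiplexing happens in hardware, and all names below are ours:

    extern void raise_interrupt(void);   /* hypothetical trap to software */

    #define NUM_QUEUES 4                 /* UDN: 4 demux queues + 1 catch-all */

    typedef struct { unsigned sender, stream, type; } tag_t;

    /* Returns the index of the queue a message should land in. */
    int demux(tag_t tag, const tag_t queue_tags[NUM_QUEUES])
    {
        for (int q = 0; q < NUM_QUEUES; q++)
            if (queue_tags[q].sender == tag.sender &&
                queue_tags[q].stream == tag.stream &&
                queue_tags[q].type   == tag.type)
                return q;                /* tag hit: matching demux queue */
        raise_interrupt();               /* tag miss: data goes to the */
        return NUM_QUEUES;               /* catch-all queue, interrupt raised */
    }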

  13. RECEIVE-SIDE HARDWARE DEMULTIPLEXING(2)

  14. PROTECTION
  • The Tile architecture implements the Multicore Hardwall (MH)
  • MH protects the UDN, IDN, and STN links
  • Standard memory protection mechanisms cover the MDN and TDN
  • MH blocks any attempt to send traffic over a hardwalled link, then signals an interrupt to system software
  • Protection is implemented on outbound links
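  A rough software model of the outbound-link check, for illustration only; the real Multicore Hardwall is a hardware mechanism, and raise_interrupt() is a hypothetical stand-in for the trap to system software:

    extern void raise_interrupt(void);   /* hypothetical */

    typedef struct {
        int hardwalled[4];   /* one protection bit per outbound direction: N, S, E, W */
    } link_state_t;

    /* Returns 1 if traffic may leave on direction `dir`; otherwise the
     * traffic is blocked (not silently dropped) and system software is
     * interrupted to decide what to do. */
    int hardwall_check(const link_state_t *ls, int dir)
    {
        if (ls->hardwalled[dir]) {
            raise_interrupt();
            return 0;
        }
        return 1;
    }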

  15. SHARED MEMORY COMMUNICATION AND ORDERING
  • On-chip distributed shared cache
  • Data may be retrieved from:
    • the local cache
    • the home tile (request sent over the TDN), when the data is cached only at its home tile; coherence is maintained at the home tile
    • main memory
  • No ordering is guaranteed between the networks and shared memory
  • Memory-fence instructions are used to enforce ordering where it matters
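  A minimal sketch of fence usage, assuming the Tilera toolchain's __insn_mf() memory-fence intrinsic and a hypothetical udn_send_word() helper (neither is named in the talk):

    extern void udn_send_word(int dest_x, int dest_y, unsigned word); /* hypothetical */

    volatile unsigned shared_buf[128];    /* homed in the distributed shared cache */

    void publish(const unsigned *data)
    {
        for (int i = 0; i < 128; i++)
            shared_buf[i] = data[i];      /* stores travel over the memory networks */
        __insn_mf();                      /* fence: commit the stores ... */
        udn_send_word(1, 1, 1);           /* ... before the UDN "ready" notification */
    }

  Without the fence, the UDN notification could arrive before the stores become visible, since the networks and shared memory are not ordered with respect to each other.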

  16. INTERCONNECT SOFTWARE
  • The C-based iLib library provides communication primitives implemented via the UDN
  • Lightweight, socket-like streaming channels for streaming algorithms
  • An MPI-like message-passing interface for ad hoc messaging

  17. COMMUNICATION INTERFACES
  • iLib socket:
    • A long-lived, FIFO, point-to-point connection between two processes
    • Well suited to producer-consumer relationships
    • A multiple-sender, single-receiver configuration is also possible; useful for forwarding results to a single node for aggregation
  • Raw channels: lowest overhead; capacity is limited to the available hardware buffer space
  • Buffered channels: higher overhead, but buffering in memory lets channel capacity be virtualized
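  A sketch of a producer-consumer pair over a raw channel. iLib itself is real, but these particular type and function names are modeled on its style rather than taken from the talk:

    #include "ilib.h"   /* iLib header; exact name assumed */

    /* Producer tile */
    void producer(void)
    {
        ilib_rawchan_sendport_t port;               /* hypothetical type */
        ilib_rawchan_open_sender("results", &port); /* hypothetical call */
        for (unsigned i = 0; i < 1000; i++)
            ilib_rawchan_send_word(port, i);        /* blocks if the hardware buffer is full */
    }

    /* Consumer tile */
    void consumer(void)
    {
        ilib_rawchan_recvport_t port;
        ilib_rawchan_open_receiver("results", &port);
        for (unsigned i = 0; i < 1000; i++) {
            unsigned v = ilib_rawchan_recv_word(port);
            /* ... aggregate v ... */
        }
    }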

  18. COMMUNICATION INTERFACES (2)
  • Message-passing API
    • Similar to MPI
    • A message can be sent from any node to any other node at any time
    • No need to establish connections first
  • Implementation (a rendezvous protocol):
    • Sender: sends a packet carrying the message key and size
    • Receiver: the catch-all queue interrupts the processor
      • If a receive for this key is already pending, a packet is sent back to the sender to begin the transfer
      • Otherwise, the notification is saved; when ilib_msg_receive() is later called with the same key, a packet interrupts the sender to begin the transfer
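  A sketch of the API usage; ilib_msg_receive() is named above, while ilib_msg_send() and both signatures are assumptions in the MPI-like spirit described:

    #include <stddef.h>
    #include "ilib.h"   /* iLib header; exact name assumed */

    enum { KEY_RESULT = 42 };   /* application-chosen message key */

    void send_results(int receiver, float *buf, size_t n)
    {
        /* Hypothetical signature: destination, key, buffer, byte count. */
        ilib_msg_send(receiver, KEY_RESULT, buf, n * sizeof *buf);
    }

    void recv_results(int sender, float *buf, size_t n)
    {
        /* Rendezvous: if the sender's key/size notification arrived first,
         * this call sends the packet that interrupts the sender and starts
         * the actual data transfer. */
        ilib_msg_receive(sender, KEY_RESULT, buf, n * sizeof *buf);
    }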

  19. COMMUNICATION INTERFACES(3)

  20. COMMUNICATION INTERFACES (4)
  • The UDN's maximum bandwidth is 4 bytes/cycle (one 32-bit word per cycle)
  • Raw channels reach 3.93 bytes/cycle; the loss comes from the header word and tag word
  • Buffered channels: additional overhead from memory reads/writes
  • Message passing: additional overhead from interrupting the receiving tile
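  The raw-channel figure is consistent with two words of per-packet overhead: with one header word and one tag word on a maximum-length 128-word packet, 126/128 × 4 bytes/cycle ≈ 3.94 bytes/cycle, close to the 3.93 reported (the talk does not spell out the exact accounting).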

  21. COMMUNICATION INTERFACES(5)

  22. APPLICATIONS
  • Corner turn
    • Reorganizes a distributed array from one dimension to the other (an all-to-all exchange, as in a distributed matrix transpose)
    • Each core sends data to every other core
  • Important factors
    • Network used for distribution (TDN via shared memory, or UDN via raw channels)
    • Network used for tile synchronization (STN or UDN)
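  A minimal sketch of the exchange pattern in plain C; send_block() and recv_block() are hypothetical stand-ins for raw-channel or shared-memory transfers:

    #define T   8    /* participating tiles */
    #define BLK 64   /* 32-bit words per block */

    extern void send_block(int dest, const unsigned *blk);  /* hypothetical */
    extern void recv_block(int src, unsigned *blk);         /* hypothetical */

    /* Tile `me` holds one row of T blocks and must end up holding one
     * column: it sends block j to tile j and receives tile j's block
     * number `me` in return. */
    void corner_turn(int me, unsigned in[T][BLK], unsigned out[T][BLK])
    {
        for (int step = 0; step < T; step++) {
            int j = (me + step) % T;   /* stagger partners to avoid hot spots */
            send_block(j, in[j]);
            recv_block(j, out[j]);
        }
    }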

  23. APPLICATIONS (2)
  • Raw channels + STN synchronization: best performance; raw channels add minimal overhead, and the STN keeps synchronization messages from interfering with data
  • Raw channels + UDN synchronization: the UDN carries both data and synchronization messages, so extra header data is needed to distinguish the two
  • Shared memory: simpler to program, but each word of user data incurs four extra words to manage the network and avoid deadlock

  24. APPLICATIONS (3)
  • Dot product
    • Pairwise element multiplication, followed by the addition of all products
    • 65,536-element dot product
  • Shared memory scales poorly here due to its higher communication overhead
  • From 2 to 4 tiles, speedup is superlinear because the dataset then fits entirely in the tiles' combined L2 caches
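  A minimal sketch of the decomposition in plain C; on TILE64 each slice would run on its own tile, with the partial sums combined over the network:

    #define N 65536

    /* Each tile computes the partial dot product of its slice. */
    float dot_partial(const float *a, const float *b, int lo, int hi)
    {
        float sum = 0.0f;
        for (int i = lo; i < hi; i++)
            sum += a[i] * b[i];        /* pairwise multiply, then accumulate */
        return sum;
    }

    /* With T tiles, tile t handles elements [t*N/T, (t+1)*N/T); the T
     * partial sums are then gathered and added on a single tile. */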

  25. CONCLUSION
  • The Tile processor uses an unconventional architecture to achieve very high on-chip communication bandwidth
  • That bandwidth can be used effectively thanks to the synergy between the hardware architecture and the software APIs (iLib)
