
Infiniband architecture



  1. Infiniband architecture • Specification: Infiniband Architecture Specification, Release 1.2 (Oct. 5, 2004), available from the Infiniband Trade Association (http://www.infinibandta.org) • Potential improvements

  2. Infiniband architecture overview

  3. Infiniband architecture overview • Components: • Links • Channel adapters • Switches • Routers • The specification allows for Infiniband wide area networks, but Infiniband is mostly adopted as a system/storage area network. • Topology: • Irregular • Regular: fat tree • Link speed: • 2.5 Gbps (1X), 10 Gbps (4X), and 30 Gbps (12X).
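
The three link speeds are multiples of one 2.5 Gbps lane. A minimal sketch of that arithmetic follows; the 8b/10b encoding factor used for the data rate is an assumption that does not appear on the slide, and `link_rates` is a hypothetical helper name.

```python
# Per-lane signalling rate from the slide: 2.5 Gbps.
LANE_SIGNAL_GBPS = 2.5

def link_rates(lanes):
    """Return (signalling rate, data rate) in Gbps for a link with `lanes` lanes."""
    signal = LANE_SIGNAL_GBPS * lanes
    # Assumption (not on the slide): 8b/10b encoding, so 8 of every 10 bits carry data.
    data = signal * 8 / 10
    return signal, data

for width in (1, 4, 12):
    signal, data = link_rates(width)
    print(f"{width}X: {signal:g} Gbps signalling, {data:g} Gbps data")
```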

  4. Layers: somewhat similar to TCP/IP • Physical layer • Link layer • Error detection (CRC checksum) • Flow control (credit based) • Switching, virtual lanes (VL) • Forwarding table computed by the subnet manager • Not adaptive • Network layer: routing across subnets. • Not used in the cluster environment • Transport layer • Reliable/unreliable, connection/datagram • Verbs: interface between adapters and OS/users
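
Credit-based flow control at the link layer can be illustrated with a toy model: the sender transmits only while it holds credits, so a full receiver stalls the sender instead of dropping packets. This is a simplified sketch; the class `CreditLink` and its one-credit-per-packet accounting are assumptions for illustration and ignore per-VL credits.

```python
from collections import deque

class CreditLink:
    """Toy model of one link: the sender may transmit only while it holds credits."""

    def __init__(self, receiver_buffers):
        self.credits = receiver_buffers   # credits advertised by the receiver
        self.rx_buffer = deque()          # packets waiting at the receiver

    def try_send(self, packet):
        if self.credits == 0:
            return False                  # sender stalls; nothing is dropped
        self.credits -= 1
        self.rx_buffer.append(packet)
        return True

    def drain_one(self):
        """Receiver forwards one packet and returns its credit to the sender."""
        packet = self.rx_buffer.popleft()
        self.credits += 1
        return packet

link = CreditLink(receiver_buffers=2)
sent = [link.try_send(p) for p in ("p0", "p1", "p2")]
print(sent)                    # the third send is refused: no credits left
link.drain_one()               # receiver frees one buffer
print(link.try_send("p2"))     # the retry is accepted
```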

  5. Packet format: • Local Route Header (LRH): 8 bytes. Used for local routing by switches within an IBA subnet • Global Route Header (GRH): 40 bytes. Used for routing between subnets • Base Transport Header (BTH): 12 bytes, for IBA transport • Reliable Datagram Extended Transport Header (RDETH): 4 bytes, only for reliable datagram • Datagram Extended Transport Header (DETH): 8 bytes • RDMA Extended Transport Header (RETH): 16 bytes • Atomic, ACK, Atomic ACK • Immediate Data Extended Transport Header: 4 bytes, optimized for small packets. • Invalidate • Invariant CRC and variant CRC: • The invariant CRC covers fields that do not change in transit; the variant CRC also covers fields that can change.
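
To get a feel for how thin these headers are, the sizes above can be summed for a given packet type. The sketch below is illustrative only: the CRC sizes (4-byte invariant CRC, 2-byte variant CRC), the header combination chosen for an RDMA write, and the 2048-byte payload are assumptions not stated on the slide.

```python
# Header sizes in bytes, as listed on the slide.
HEADER_BYTES = {"LRH": 8, "GRH": 40, "BTH": 12, "RDETH": 4,
                "DETH": 8, "RETH": 16, "ImmDt": 4}
# Assumption (not on the slide): invariant CRC is 4 bytes, variant CRC is 2 bytes.
CRC_BYTES = 4 + 2

def overhead(headers, payload_bytes):
    """Return (header + CRC bytes, fraction of the packet that is payload)."""
    total = sum(HEADER_BYTES[h] for h in headers) + CRC_BYTES
    return total, payload_bytes / (payload_bytes + total)

# Hypothetical example: first packet of an RDMA write inside one subnet (no GRH).
rdma_write = ["LRH", "BTH", "RETH"]
total, efficiency = overhead(rdma_write, payload_bytes=2048)
print(total, round(efficiency, 4))
```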

  6. Local Route Header: • Switching based on the destination port address (LID) • Multipath switching by allocating multiple LIDs to one port

  7. Local Route Header: • Switching based on the destination port address (LID) • Multipath switching by allocating multiple LIDs to one port • GRH: same address format as IPv6 (16-byte addresses)
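
LID-based switching and multipath via multiple LIDs can be pictured as a table lookup. In this sketch the table contents, the LID values, and the helper names `output_port` and `pick_dlid` are all hypothetical; in a real subnet the subnet manager computes and installs the forwarding entries.

```python
# Hypothetical forwarding table for one switch: destination LID -> output port.
forwarding_table = {
    0x10: 1,   # host A, first LID  -> path through port 1
    0x11: 2,   # host A, second LID -> alternative path through port 2
    0x20: 3,   # host B
}

def output_port(dlid):
    """Look up the output port for a packet's destination LID."""
    return forwarding_table[dlid]

def pick_dlid(base_lid, num_lids, flow_id):
    """Sender-side choice among the LIDs assigned to one destination port."""
    return base_lid + flow_id % num_lids

for flow in range(4):
    dlid = pick_dlid(0x10, 2, flow)
    print(f"flow {flow}: DLID {dlid:#x} -> port {output_port(dlid)}")
```

Because the switch only looks at the destination LID, the sender selects the path simply by choosing which of the destination's LIDs to address.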

  8. Base transport header:

  9. Verbs • OS/users access the adapter through verbs • Communication mechanism: Queue Pair (QP) • Supports the four types of services, including reliable connection service • Each connection takes one QP on each end. • Each QP has a send queue and a receive queue. • Users can post send requests to the send queue and receive requests to the receive queue. • Three types of send operations: SEND, RDMA (WRITE, READ, ATOMIC), MEMORY-BINDING • One receive operation (matching SEND)

  10. Queue Pair: • The result status of an operation (send/receive) is stored in the completion queue. • Send and receive queues can bind to different completion queues. • Related system-level verbs: • Open QP, create completion queue, open HCA, open protection domain, register memory, allocate memory window, etc. • User-level verbs: • Post send/receive request, poll for completion.

  11. To communicate: • Make system calls to set everything up (open QP, bind QP to port, bind completion queues, connect local QP to remote QP, register memory, etc.). • Post send/receive requests. • Check completion. • What if a packet arrives before a receive request is posted? • Not specified in the standard • The right response should be a ‘receiver not ready (RNR)’ error. The sender is back-pressured in this case.
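
The post/poll cycle and the receiver-not-ready case can be mimicked with a toy model. This is not the verbs API: `ToyQP`, `connect`, and `process_one_send` are hypothetical names, and the model surfaces RNR as a completion entry only to make it visible, whereas a real adapter would typically retry on its own.

```python
from collections import deque

class ToyQP:
    """Toy queue pair: a send queue, a receive queue and one completion queue."""

    def __init__(self):
        self.send_queue = deque()
        self.recv_queue = deque()
        self.completion_queue = deque()
        self.peer = None

    def post_recv(self, buffer):
        self.recv_queue.append(buffer)

    def post_send(self, data):
        self.send_queue.append(data)

    def process_one_send(self):
        """Stand-in for the adapter: deliver the head of the send queue to the peer."""
        if not self.peer.recv_queue:
            # No receive request posted: report RNR and keep the send queued.
            self.completion_queue.append(("send", "RNR"))
            return
        data = self.send_queue.popleft()
        buffer = self.peer.recv_queue.popleft()
        buffer[:len(data)] = data
        self.completion_queue.append(("send", "OK"))
        self.peer.completion_queue.append(("recv", "OK"))

    def poll(self):
        """Return the next completion entry, or None if there is none."""
        return self.completion_queue.popleft() if self.completion_queue else None

def connect(a, b):
    a.peer, b.peer = b, a

sender, receiver = ToyQP(), ToyQP()
connect(sender, receiver)
sender.post_send(b"hello")
sender.process_one_send()          # receiver has no buffer posted yet
first = sender.poll()
buf = bytearray(8)
receiver.post_recv(buf)
sender.process_one_send()          # the retry now succeeds
second = sender.poll()
print(first, second, receiver.poll(), bytes(buf[:5]))
```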

  12. Infiniband has a perfect software interface (Chien'94 paper): • The network subsystem realizes all user-level functionality. • User-level access to the network interface: a few machine instructions accomplish the transmission task without involving the OS. • The network supports in-order delivery and fault tolerance. • Buffer management is pushed out to the user.

  13. SilverStorm 9024: • 24 4X ports (10 Gbps) or 8 12X ports (30 Gbps) • Switch type: cut-through • Switch latency: < 140 ns • Switch bandwidth: 480 Gbps • Forwarding table size: 48K • VL support: 8 + 1 management

  14. SilverStorm 9240: • 24 expansion slots, each expansion module with 12 4X ports or 4 12X ports (24 x 12 = 288, a 288 by 288 switch) • Switch type: cut-through • Switch latency: < 140 ns to < 420 ns • Switch bandwidth: 5.76 Tbps • Forwarding table size: 48K • VL support: 8 + 1 management
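
The quoted switch bandwidths are consistent with port count times link speed, counted in both directions. The bidirectional counting is an assumption inferred from the numbers, not something the slides state, and `aggregate_gbps` is a hypothetical helper.

```python
def aggregate_gbps(ports, link_gbps):
    """Aggregate switch bandwidth, counting both directions of every port (assumed)."""
    return ports * link_gbps * 2

print(aggregate_gbps(24, 10))        # 9024 with 24 4X ports: 480
print(aggregate_gbps(8, 30))         # 9024 with 8 12X ports: 480
print(aggregate_gbps(24 * 12, 10))   # 9240 with 288 4X ports: 5760, i.e. 5.76 Tbps
```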

  15. Potential improvements on Infiniband using compiled communication • Improving the internal Infiniband fabric: • Offline routing for static patterns (a static SM for a reduced traffic pattern) can be beneficial for irregular networks. • Simplify the layer architecture with a direct link model (for known patterns): the header can be simplified, although this may not matter much because the Infiniband layers are thin. • Simplify the protection mechanism. • Circuit-switched Infiniband. • A reliable communication protocol is still needed. • Potential benefits can be evaluated by simulation.

  16. Improving the messaging software (software to hardware interface): no chance. • Improving the MPI implementation over Infiniband: similar to our current work on Ethernet • Message scheduling for collective/point-to-point communications based on the network topology. • Exploiting NIC features (buffers in the NIC, multicast) • Reducing the number of instructions in a library routine makes sense. Compiled communication can be used to optimize the MPI library. • Compiled communication can help improve the library implementation (e.g. reducing the number of message copies, early request posting, using RDMA, etc.).

  17. One particular project: • Design algorithms for the Infiniband subnet manager • Improving routing performance for the Infiniband subnet manager (SM). • Objective: minimize the maximum channel load for a given traffic pattern • Optimize according to a given pattern: the traffic pattern in an application is usually not all-to-all • Default routing used in IBA SM • For a sparse traffic pattern, the maximum channel load can usually be minimized using the minimum interference principle. • Need to extend minimum interference routing to load-balanced, deadlock-free routing. • The best way to realize the IBA SM is still unclear at this time; we can probably contribute here. • Irregular network or fat tree network
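
The objective, maximum channel load for a given traffic pattern, can be computed directly from a set of routes. The sketch below uses a simple greedy path choice as a stand-in; it is not the minimum interference algorithm referred to on the slide, and the topology, node names, and function names are hypothetical.

```python
from collections import Counter

def max_channel_load(routes):
    """routes: list of paths (lists of nodes). Return the load of the busiest directed link."""
    load = Counter()
    for path in routes:
        for link in zip(path, path[1:]):
            load[link] += 1
    return max(load.values()) if load else 0

def route_greedily(flows, candidates):
    """For each flow, pick the candidate path that keeps the maximum channel load lowest."""
    chosen = []
    for flow in flows:
        best = min(candidates[flow], key=lambda p: max_channel_load(chosen + [p]))
        chosen.append(best)
    return chosen

# Hypothetical two-level topology: hosts a, b, c, d; leaf switches s1, s2; spines t1, t2.
candidates = {
    ("a", "c"): [["a", "s1", "t1", "s2", "c"], ["a", "s1", "t2", "s2", "c"]],
    ("b", "d"): [["b", "s1", "t1", "s2", "d"], ["b", "s1", "t2", "s2", "d"]],
}
flows = list(candidates)
naive = [candidates[f][0] for f in flows]        # every flow takes its first path
balanced = route_greedily(flows, candidates)
print(max_channel_load(naive), max_channel_load(balanced))
```

With this sparse two-flow pattern, sending both flows through the same spine doubles the load on the shared links, while spreading them over the two spines keeps every link at load one.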
