Experimental Evaluation of Infiniband as Local- and Wide-Area Transport

Steven Carter, Makia Minich, Nageswara Rao

Oak Ridge National Laboratory

{scarter,minich,rao}@ornl.gov
Motivation
  • The Department of Energy established the Leadership Computing Facility at ORNL’s Center for Computational Sciences to field a 1PF supercomputer
  • The design chosen, the Cray XT series, includes an internal Lustre filesystem capable of sustaining reads and writes of 240GB/s
  • The problem with making the filesystem part of the machine is that it limits the flexibility of the Lustre filesystem and increases the complexity of the Cray
  • The problem with decoupling the filesystem from the machine is the high cost of connecting it via 10GE at the required speeds
Solution – Infiniband LAN/WAN
  • The Good:
      • Cool Name
      • Unified Fabric/IO Virtualization (i.e. low-latency interconnect, storage, and IP on one wire)
      • Faster link speeds (4x DDR = 16Gb/s, 4x QDR = 32Gb/s; see the arithmetic after this list)
      • HCA does much of the heavy lifting
      • Roughly a tenth the cost for similar bandwidth
      • Higher port-density switches
  • The Bad:
      • IB sounds too much like IP
      • Lacks some of the accoutrements of IP/Ethernet (e.g. firewalls, routers, and sniffers (they exist but are hard to come by))
      • Cray does not support Infiniband (but we can fix that)
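As a back-of-envelope check of those link rates (our arithmetic, assuming the standard 8b/10b link encoding, under which 80% of the signaling rate is payload):

$$4\ \mathrm{lanes} \times 5\,\mathrm{Gb/s} \times \tfrac{8}{10} = 16\,\mathrm{Gb/s}\ \text{(4x DDR)},\qquad 4 \times 10\,\mathrm{Gb/s} \times \tfrac{8}{10} = 32\,\mathrm{Gb/s}\ \text{(4x QDR)}$$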
CCS network roadmap summary

Ethernet core scaled to match wide-area connectivity and archive; Infiniband core scaled to match central file system and data transfer.

[Diagram: Ethernet core [O(10GB/s)] and Infiniband core [O(100GB/s)] connecting Jaguar, Baker, Lustre, the High-Performance Storage System (HPSS), Viz, and a Gateway.]
CCS Network 2007

[Diagram: the 2007 CCS network, with Infiniband and Ethernet cores connecting Jaguar, Spider10, Spider60, Viz, HPSS, Devel, and E2E. IB bundles range from 3 SDR to 128 DDR; Ethernet connections come in bundles of 4 to 48 links.]
CCS IB network 2008

[Diagram: the planned 2008 CCS IB network connecting Jaguar, Baker, Spider10, Spider240, Devel, E2E, Viz, and HPSS. Bundles include 48-96 SDR (16-32 SDR/link), 20 SDR, 24 SDR, 87 SDR, 32 DDR, 50 DDR, 64 DDR, and 300 DDR (50 DDR/link).]
Porting IB to the Cray XT3
  • PCI-X HCA required (PCIe should be available in the PF machine)
  • IB is not a standard option on the Cray XT3. Although the XT3's service nodes are based on SuSE Linux Enterprise 9, Cray kernel modifications make the kernel incompatible with stock versions of OFED.
  • In order to compile OFED on the XT3 service node, the symbols bad_dma_address and dev_change_flags need to be exported from the I/O node's kernel (see the sketch after this list).
  • Furthermore, the OFED source code needs to be modified to recognize the particular version of the kernel run by the XT3's I/O nodes.
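A minimal sketch of what that kernel-side change looks like, assuming typical 2.6-era source locations; the actual patch targets Cray's modified SLES 9 kernel, and the file paths named in the comments are illustrative, not Cray's:

```c
/* Illustrative kernel patch: add EXPORT_SYMBOL() for the two symbols
 * OFED's loadable modules link against.  EXPORT_SYMBOL makes a kernel
 * symbol visible to modules such as the OFED IB drivers. */

/* e.g. in arch/x86_64/kernel/pci-gart.c, where bad_dma_address is
 * already defined (location is an assumption); the patch only adds
 * the export: */
EXPORT_SYMBOL(bad_dma_address);

/* e.g. in net/core/dev.c, where dev_change_flags() is defined
 * (again an assumed location); the patch only adds the export: */
EXPORT_SYMBOL(dev_change_flags);
```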
XT3 IB LAN Testing

[Diagram: Rizzo (XT3) and Spider (Linux cluster) connected through a Voltaire 9024 switch.]

Utilizing a server on Spider (a commodity x86_64 Linux cluster), we were able to demonstrate the first successful Infiniband implementation on the XT3.

The basic RDMA test (as provided by the OpenFabrics Enterprise Distribution) shows the maximum bandwidth we could achieve (about 900MB/s unidirectionally) from the XT3's 133MHz PCI-X bus.
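As a sanity check on that figure (a back-of-envelope estimate of ours, assuming the standard 64-bit PCI-X bus):

$$133\,\mathrm{MHz} \times 8\,\mathrm{bytes} \approx 1064\,\mathrm{MB/s}\ \text{theoretical peak}$$

so roughly 900MB/s puts the HCA at about 85% of the bus limit, consistent with the PCI-X bus being the limiting factor.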

LAN Observations
  • XT3's performance is good (better than 10GE) for RDMA
  • XT3's performance is not as good for verb-level UC/UD/RC tests
  • XT3's IPoIB/SDP performance is worse than normal
  • XT3's poor performance might be a result of the PCI-X HCA (known to be sub-optimal) or its relatively anaemic processor (single processor on the XT3 vs. dual processors on the generic x86_64 host)
  • In general, IB should fit our LAN needs by providing good performance to Lustre and allowing the use of legacy applications such as HPSS over SDP (see the sketch after this list)
  • Although the XT3's performance is not ideal, it is as good as 10GE, and the XT3 can leverage its ~100 I/O nodes to make up the difference.
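For legacy applications, OFED can transparently redirect ordinary TCP sockets over SDP by preloading libsdp (LD_PRELOAD=libsdp.so), with no code changes. Alternatively, code can request SDP explicitly. A minimal sketch of the explicit route, assuming OFED's AF_INET_SDP address family value of 27 (the peer address and port below are hypothetical):

```c
/* Minimal sketch: a TCP-style client socket carried over SDP instead
 * of IP.  AF_INET_SDP is the address family registered by OFED's SDP
 * module; the value 27 matches common OFED releases but is an
 * assumption, as are the peer address and port. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

#ifndef AF_INET_SDP
#define AF_INET_SDP 27                  /* assumed OFED value */
#endif

int main(void)
{
    int fd = socket(AF_INET_SDP, SOCK_STREAM, 0);
    if (fd < 0) {
        perror("socket(AF_INET_SDP)");  /* fails unless the sdp module is loaded */
        return 1;
    }

    /* Addressing is identical to TCP; only the family differs. */
    struct sockaddr_in peer;
    memset(&peer, 0, sizeof(peer));
    peer.sin_family = AF_INET_SDP;
    peer.sin_port   = htons(5000);                    /* hypothetical port */
    inet_pton(AF_INET, "10.10.10.1", &peer.sin_addr); /* hypothetical peer */

    if (connect(fd, (struct sockaddr *)&peer, sizeof(peer)) < 0)
        perror("connect");
    close(fd);
    return 0;
}
```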
IB over WAN Testing

[Diagram: Lustre and end-to-end (E2E) traffic carried over 4x Infiniband SDR through a pair of Obsidian Longbow devices onto an OC-192 SONET circuit of the DOE UltraScience Network, between Ciena CD-CI nodes at ORNL and Sunnyvale (SNV), with Voltaire 9288 and 9024 switches at the ends.]

  • Placed 2 x Obsidian Longbow devices between the Voltaire 9024 and the Voltaire 9288
  • Provisioned loopback circuits of various lengths on the DOE UltraScience Network and ran tests
  • RDMA test results:
      • Local (Longbow to Longbow): 7.5 Gb/s
      • ORNL <-> ORNL (0.2 miles): 7.5 Gb/s
      • ORNL <-> Chicago (1400 miles): 7.46 Gb/s
      • ORNL <-> Seattle (6600 miles): 7.23 Gb/s
      • ORNL <-> Sunnyvale (8600 miles): 7.2 Gb/s
Chicago Loopback (1400 miles)
  • UC: 7.5 Gb/s
  • RC: 7.5 Gb/s
  • IPoIB: ~450 Mb/s
  • SDP: ~500 Mb/s

Seattle Loopback (6600 miles)
  • UC: 7.2 Gb/s
  • RC: 6.4 Gb/s
  • IPoIB: 120 Mb/s
  • SDP: 120 Mb/s

Sunnyvale Loopback (8600 miles)
  • UC: 7.2 Gb/s
  • RC: 6.8 Gb/s
  • IPoIB: 75 Mb/s
  • SDP: 95 Mb/s
WAN Observations
  • The Obsidian Longbows appear to be extending sufficient link-level credits (UC works great)
  • RC only performs well at large message sizes
  • There seems to be a maximum number of messages allowed in flight (~250)
  • RC performance does not increase rapidly enough even when the message cap is not an issue
  • SDP performance is likely poor due to the small message size used (page size) and the poor RC performance
  • IPoIB performance is likely due to interaction between IB and known long-distance TCP/IP problems
  • What is the problem? (See the back-of-envelope calculation after this list.)
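A back-of-envelope calculation makes the message cap's effect concrete (our arithmetic, assuming a round-trip time of roughly 120 ms for the longer loops, in line with the delay Obsidian simulated below). With ~250 outstanding page-sized (4KB) messages, throughput is capped at

$$\frac{250 \times 4\,\mathrm{KB}}{0.12\,\mathrm{s}} \approx 8\,\mathrm{MB/s} \approx 68\,\mathrm{Mb/s},$$

the right order of magnitude for the observed IPoIB and SDP rates. Filling a 7.5 Gb/s pipe at that RTT requires roughly $7.5\,\mathrm{Gb/s} \times 0.12\,\mathrm{s} / 8 \approx 112\,\mathrm{MB}$ in flight, which only large messages can supply under the cap; this is consistent with RC performing well only at large message sizes.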
Obsidian's Observations (Jason Gunthorpe)
  • Simulated distance with delay built into the Longbow
  • On Mellanox HCAs, the best performance is achieved with 240 operations outstanding at once on a single RC QP. Greater values result in 10% less bandwidth
  • The behaviour is different on PathScale HCAs (RC, 120ms delay):
    • 240 outstanding, 2MB message size: 679.002 MB/sec
    • 240 outstanding, 256K message size: 494.573 MB/sec
    • 250 outstanding, 256K message size: 515.604 MB/sec
    • 5000 outstanding, 32K message size: 763.325 MB/sec
    • 2500 outstanding, 32K message size: 645.292 MB/sec
  • Patching was required to fix QP timeout and 2K ACK window
  • Patched code yielded a max of 680 MB/s w/o delay
Conclusion
  • Infiniband makes a great data center interconnect (supports legacy applications via IPoIB and SDP)
  • There does not appear to be the same intrinsic problem with IB over long distances as there is with IP/Ethernet
  • The problem appears to be in the Mellanox HCA
  • This problem must be addressed to effectively use IB for Lustre and legacy applications
Contact

Steven Carter

Network Task Lead

Center for Computational Sciences

Oak Ridge National Laboratory

(865) 576-2672

scarter@ornl.gov
