Experimental Evaluation of Infiniband as Local- and Wide-Area Transport

Steven Carter, Makia Minich, Nageswara Rao

Oak Ridge National Laboratory

{scarter,minich,rao}@ornl.gov


Motivation

  • The Department of Energy established the Leadership Computing Facility at ORNL’s Center for Computational Sciences to field a 1PF supercomputer

  • The design chosen, the Cray XT series, includes an internal Lustre filesystem capable of sustaining reads and writes of 240GB/s

  • The problem with making the filesystem part of the machine is that it limits the flexibility of the Lustre filesystem and increases the complexity of the Cray

  • The problem with decoupling the filesystem from the machine is the high cost of connecting it via 10GE at the required speeds


Solution – Infiniband LAN/WAN

  • The Good:

    • Cool Name

    • Unified Fabric/IO Virtualization (i.e. low-latency interconnect, storage, and IP on one wire)

    • Faster link speeds (4x DDR = 16Gb/s, 4x QDR = 32Gb/s)

    • HCA does much of the heavy lifting

    • Nearly 10x less cost for similar bandwidth

    • Higher port density switches

  • The Bad:

    • IB sounds too much like IP

    • Lacks some of the accoutrements of IP/Ethernet (e.g. firewalls, routers, and sniffers; they exist but are hard to come by)

    • Cray does not support Infiniband (we can fix that)


    CCS network roadmap summary

    • Ethernet core scaled to match wide-area connectivity and archive [O(10GB/s)]

    • Infiniband core scaled to match central file system and data transfer [O(100GB/s)]

    [Diagram: Ethernet and Infiniband cores connecting Jaguar, Baker, the gateway, Lustre, the High-Performance Storage System (HPSS), and Viz]


    CCS Network 2007

    [Diagram: 2007 Infiniband and Ethernet topology connecting Jaguar, Spider10, Spider60, Viz, HPSS, Devel, and E2E; Infiniband link widths include 3 SDR, 20 SDR, 24 SDR, 48-96 SDR, 87 SDR, 32 DDR, 64 DDR, and 128 DDR]


    CCS IB network 2008

    [Diagram: planned 2008 Infiniband topology connecting Jaguar, Baker, Spider10, Spider240, Devel, E2E, Viz, and HPSS; link widths include 20 SDR, 24 SDR, 87 SDR, 48-96 SDR (16-32 SDR/link), 32 DDR, 50 DDR, 64 DDR, and 300 DDR (50 DDR/link)]


    Porting IB to the Cray XT3

    • PCI-X HCA required (PCIe should be available in the PF machine)

    • IB is not a standard option on the Cray XT3. Although the XT3's service nodes are based on SuSE Linux Enterprise 9, Cray kernel modifications make the kernel incompatible with the stock version of OFED.

    • In order to compile OFED on the XT3 service node, the symbols bad_dma_address and dev_change_flags need to be exported from the I/O node's kernel (a sketch of the kind of change involved follows this list).

    • Furthermore, the OFED source code needs to be modified to recognize the particular version of the kernel run by the XT3's I/O nodes.
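    To make "exported" concrete: the change amounts to adding EXPORT_SYMBOL() lines to the kernel sources that define those symbols, so the OFED modules can resolve them at load time. A minimal illustration follows; the exact files in the Cray-modified SLES 9 kernel are an assumption here, not taken from the slides.

    /* Illustrative kernel-patch fragment, not the actual Cray/OFED patch.
     * EXPORT_SYMBOL() makes a kernel symbol visible to loadable modules. */
    #include <linux/module.h>

    /* In the x86_64 DMA-mapping code that defines bad_dma_address: */
    EXPORT_SYMBOL(bad_dma_address);

    /* In net/core/dev.c, which defines dev_change_flags(): */
    EXPORT_SYMBOL(dev_change_flags);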


    XT3 IB LAN Testing

    [Diagram: Rizzo (XT3) and the Spider Linux cluster connected through a Voltaire 9024 switch]

    Utilizing a server on Spider (a commodity x86_64 Linux cluster), we were able to show the first successful Infiniband implementation on the XT3.

    The basic RDMA test (as provided by the Open Fabrics Enterprise Distribution) allows us to see the maximum bandwidth that we could achieve from the XT3's 133MHz PCI-X bus: about 900MB/s unidirectionally.
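    As an aside (our own sketch, not part of the slides): before running the OFED bandwidth test on a service node, it is worth confirming that the HCA is visible and its port is up. A minimal libibverbs check looks roughly like this:

    /* Sketch: list IB devices and report port 1 state, width, and speed.
     * Build with: cc check_hca.c -libverbs (file name is ours). */
    #include <stdio.h>
    #include <infiniband/verbs.h>

    int main(void)
    {
        int num, i;
        struct ibv_device **devs = ibv_get_device_list(&num);

        if (!devs || num == 0) {
            fprintf(stderr, "no IB devices found\n");
            return 1;
        }
        for (i = 0; i < num; i++) {
            struct ibv_context *ctx = ibv_open_device(devs[i]);
            struct ibv_port_attr port;

            if (ctx && ibv_query_port(ctx, 1, &port) == 0)
                printf("%s: port 1 state=%d width=%d speed=%d\n",
                       ibv_get_device_name(devs[i]), port.state,
                       port.active_width, port.active_speed);
            if (ctx)
                ibv_close_device(ctx);
        }
        ibv_free_device_list(devs);
        return 0;
    }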


    Verb Level LAN Testing (XT3)

    • UC/RC: 4.4 Gb/s

    • IPoIB: 2.2 Gb/s

    • SDP: 3.6 Gb/s


    Verb Level LAN Testing (generic x86_64)

    • RC/UC: 7.5 Gb/s

    • IPoIB: 2.4 Gb/s

    • SDP: 4.6 Gb/s


    LAN Observations

    • XT3's performance is good (better than 10GE) for RDMA

    • XT3's performance is not as good for verb level UC/UD/RC tests

    • XT3's IPoIB/SDP performance is worse than normal

    • XT3's poor performance might be a result of the PCI-X HCA (known to be sub-optimal) or its relatively anaemic processor (single processor on the XT3 vs. dual processors on the generic x86_64 host)

    • In general, IB should fit our LAN needs by providing good performance to Lustre and allowing legacy applications such as HPSS to run over SDP (a minimal SDP socket example follows this list)

    • Although the XT3's performance is not ideal, it is as good as 10GE, and the XT3 is able to leverage its ~100 I/O nodes to make up the difference.
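    To make the SDP point concrete, here is a minimal sketch of a legacy-style TCP client opened explicitly on SDP (our assumption of typical OFED usage, not taken from the slides; the address and port are placeholders). In practice, unmodified binaries are usually redirected transparently with the libsdp preload library instead of being recompiled.

    /* Sketch: open a stream socket on SDP instead of TCP. OFED's SDP module
     * historically registered address family 27 (AF_INET_SDP); the sockaddr
     * itself stays AF_INET, as with the transparent libsdp redirection. */
    #include <stdio.h>
    #include <unistd.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>

    #ifndef AF_INET_SDP
    #define AF_INET_SDP 27          /* value used by OFED's SDP implementation */
    #endif

    int main(void)
    {
        struct sockaddr_in peer = { 0 };
        int fd;

        peer.sin_family = AF_INET;
        peer.sin_port   = htons(5000);                      /* placeholder port */
        inet_pton(AF_INET, "192.0.2.10", &peer.sin_addr);   /* placeholder IPoIB address */

        fd = socket(AF_INET_SDP, SOCK_STREAM, 0);
        if (fd < 0 || connect(fd, (struct sockaddr *)&peer, sizeof(peer)) < 0) {
            perror("sdp connect");
            return 1;
        }
        write(fd, "hello over SDP\n", 15);
        close(fd);
        return 0;
    }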


    IB over WAN testing

    [Diagram: a Voltaire 9288 and a Voltaire 9024 at ORNL joined over 4x Infiniband SDR through a pair of Obsidian Longbows; the Longbows ride an OC-192 SONET circuit on the DOE UltraScience Network, looped through Ciena CD-CI nodes at ORNL and Sunnyvale (SNV), with Lustre and End-to-End hosts attached to the switches]

    • Placed 2 x Obsidian Longbow devices between the Voltaire 9024 and the Voltaire 9288

    • Provisioned loopback circuits of various lengths on the DOE UltraScience Network and ran tests.

    • RDMA Test Results:

      Local (Longbow to Longbow): 7.5 Gbps

      ORNL <-> ORNL (0.2 miles): 7.5 Gbps

      ORNL <-> Chicago (1400 miles): 7.46 Gbps

      ORNL <-> Seattle (6600 miles): 7.23 Gbps

      ORNL <-> Sunnyvale (8600 miles): 7.2 Gbps


    Chicago Loopback (1400 miles)

    • UC: 7.5 Gb/s

    • RC: 7.5 Gb/s

    • IPoIB: ~450 Mb/s

    • SDP: ~500 Mb/s


    Seattle Loopback (6600 miles)

    • UC: 7.2 Gb/s

    • RC: 6.4 Gb/s

    • IPoIB: 120 Mb/s

    • SDP: 120 Mb/s


    Sunnyvale Loopback (8600 miles)

    • UC: 7.2 Gb/s

    • RC: 6.8 Gb/s

    • IPoIB: 75 Mb/s

    • SDP: 95 Mb/s


    Maximum Number of Messages in Flight

    • Seattle: 250

    • Sunnyvale: 240


    WAN Observations

    • The Obsidian Longbows appear to be extending sufficient link-level credits (UC works great)

    • RC only performs well at large message sizes

    • There seems to be a maximum number of messages allowed in flight (~250)

    • RC performance does not increase rapidly enough even when the message cap is not an issue

    • SDP performance is likely poor due to the small message size used (page size) and poor RC performance (see the back-of-the-envelope sketch after this list)

    • IPoIB performance is likely due to the interaction between IB and known TCP/IP problems

    • So what is the problem?
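    A back-of-the-envelope model (ours, not from the slides) ties these observations together: a window-limited stream delivers roughly messages-in-flight × message-size / RTT, capped by the 4x SDR payload rate. Assuming an RTT on the order of the 120 ms delay Obsidian used in simulation and the ~250-message ceiling, page-size messages land in the same range as the measured SDP/IPoIB numbers, while 2 MB messages can still fill the link:

    /* Back-of-the-envelope sketch; the RTT and payload rate are assumed
     * round numbers of the right order, not measured values from the deck. */
    #include <stdio.h>

    int main(void)
    {
        const double rtt  = 0.12;      /* seconds, order of the long loopback paths */
        const double link = 950e6;     /* ~7.6 Gb/s of 4x SDR payload, in bytes/s   */
        const double cap  = 250;       /* observed ceiling on messages in flight    */
        const double size[] = { 4096, 256 * 1024, 2 * 1024 * 1024 };
        int i;

        for (i = 0; i < 3; i++) {
            double window_limited = cap * size[i] / rtt;                 /* bytes/s */
            double achievable = window_limited < link ? window_limited : link;

            printf("%9.0f-byte messages: ~%5.0f Mb/s\n",
                   size[i], achievable * 8.0 / 1e6);
        }
        return 0;
    }

    With 4 KB (page-size) messages the model gives roughly 70 Mb/s, the same order as the measured ~95 Mb/s SDP result, while 256 KB and 2 MB messages reach multiple Gb/s; this is consistent with the small-message explanation above.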


    Obsidian's Observations (Jason Gunthorpe)

    • Simulated distance with delay built into the Longbow

    • On Mellanox HCAs, the best performance is achieved with 240 operations outstanding at once on a single RC QP. Greater values result in 10% less bandwidth

    • The behaviour is different on PathScale HCAs (RC, 120ms delay):

      • 240 outstanding, 2MB message size: 679.002 MB/sec

      • 240 outstanding, 256K message size: 494.573 MB/sec

      • 250 outstanding, 256K message size: 515.604 MB/sec

      • 5000 outstanding, 32K message size: 763.325 MB/sec

      • 2500 outstanding, 32K message size: 645.292 MB/sec

    • Patching was required to fix the QP timeout and the 2K ACK window (the verbs timeout attribute is sketched after this list)

    • Patched code yielded a max of 680 MB/s w/o delay
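    For readers unfamiliar with the knob involved: the RC ACK timeout is an ordinary verbs attribute applied when the QP is moved to RTS. The sketch below shows the standard libibverbs attribute in question (our illustration, not the actual OFED/Longbow patch); the timeout is 4.096 µs × 2^attr.timeout, so a value of 17 gives roughly half a second, comfortably above a ~120 ms wide-area round trip.

    /* Sketch: raise the local ACK timeout on an RC QP during the RTR->RTS
     * transition so retries are not triggered by WAN-scale round-trip times. */
    #include <infiniband/verbs.h>

    static int set_wan_timeout(struct ibv_qp *qp, uint32_t sq_psn)
    {
        struct ibv_qp_attr attr = {
            .qp_state      = IBV_QPS_RTS,
            .timeout       = 17,   /* 4.096 us * 2^17 ~ 0.5 s */
            .retry_cnt     = 7,
            .rnr_retry     = 7,
            .sq_psn        = sq_psn,
            .max_rd_atomic = 1,
        };

        return ibv_modify_qp(qp, &attr,
                             IBV_QP_STATE | IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT |
                             IBV_QP_RNR_RETRY | IBV_QP_SQ_PSN |
                             IBV_QP_MAX_QP_RD_ATOMIC);
    }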


    Conclusion

    • Infiniband makes a great data center interconnect (supports legacy applications via IPoIB and SDP)

    • There does not appear to be the same intrinsic problem with IB as there is with IP/Ethernet over long distances

    • The problem appears to be in the Mellanox HCA

    • This problem must be addressed to effectively use IB for Lustre and legacy applications


    Contact

    Steven Carter

    Network Task Lead

    Center for Computational Sciences

    Oak Ridge National Laboratory

    (865) 576-2672

    scarter@ornl.gov
