
Early Experiences with NFS over RDMA





  1. Early Experiences with NFS over RDMA. OpenFabrics Workshop, San Francisco, September 25, 2006. Sandia National Laboratories, CA. Helen Y. Chen, Dov Cohen, Joe Kenny, Jeff Decker, and Noah Fischer. SAND 2006-4293C

  2. Outline • Motivation • RDMA technologies • NFS over RDMA • Testbed hardware and software • Preliminary results and analysis • Conclusion • Ongoing work and future plans

  3. What is NFS • A network-attached storage file access protocol layered on RPC, typically carried over UDP/TCP over IP • Allows files to be shared among multiple clients across LANs and WANs • A standard, stable, and mature protocol adopted on cluster platforms

  4. NFS Scalability Concerns in Large Clusters (diagram: applications 1 through N issuing concurrent I/O to a single NFS server) • Large number of concurrent requests from parallel applications • Parallel I/O requests serialized by NFS to a large extent • Need RDMA and pNFS

  5. How DMA Works

  6. How RDMA Works

  7. Why NFS over RDMA • NFS moves big chunks of data, incurring many copies with each RPC • Cluster computing demands high bandwidth and low latency • RDMA offloads protocol processing and relieves the host memory and I/O bus • A must for 10/20 Gbps networks

  8. The NFS RDMA Architecture (protocol-stack diagram: NFSv2, NFSv3, NFSv4, NLM, and NFSACL layered over RPC/XDR, carried over UDP, TCP, or RDMA) • NFS is a family of protocols layered over RPC • XDR encodes RPC requests and results onto RPC transports • NFS RDMA is implemented as a new RPC transport mechanism • Selection of transport is an NFS mount option. Brent Callaghan, Theresa Lingutla-Raj, Alex Chiu, Peter Staubach, Omer Asad, “NFS over RDMA”, ACM SIGCOMM 2003 Workshops, August 25-27, 2003
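As a concrete illustration of transport selection by mount option, on recent Linux kernels an RDMA mount looks roughly like the sketch below. The `proto=rdma` option and port 20049 follow current nfs(5) conventions; the 2006 release candidate used in this study may have used a different syntax.

```shell
# Sketch, assuming a modern Linux NFS/RDMA client stack.
modprobe xprtrdma                # load the client-side RPC/RDMA transport

# Mount the same export over RDMA (20049 is the conventional NFS/RDMA port)
# and, for comparison, over TCP.
mount -t nfs -o proto=rdma,port=20049 server:/export /mnt/nfs-rdma
mount -t nfs -o proto=tcp server:/export /mnt/nfs-tcp
```

The point of the architecture is exactly this: applications and the NFS protocol itself are unchanged, and only the RPC transport underneath differs between the two mounts.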

  9. This Study

  10. OpenFabrics Software Stack (diagram: the OpenFabrics stack from hardware-specific drivers and verbs APIs, through mid-layer services, up to upper-layer protocols and applications, covering both InfiniBand HCAs and iWARP R-NICs, with kernel bypass at the user and kernel verbs levels) • Upper-layer protocols include IPoIB, SDP, SRP, iSER, RDS, and NFS-RDMA RPC, serving sockets-based apps, MPIs, block storage, clustered databases, and file systems • Key acronyms: SA - Subnet Administrator; MAD - Management Datagram; SMA - Subnet Manager Agent; PMA - Performance Manager Agent; UDAPL - User Direct Access Programming Library; IPoIB - IP over InfiniBand; SDP - Sockets Direct Protocol; SRP - SCSI RDMA Protocol (initiator); iSER - iSCSI RDMA Protocol (initiator); RDS - Reliable Datagram Service; CMA - Connection Manager Abstraction; HCA - Host Channel Adapter; R-NIC - RDMA NIC • Offers a common, open-source, and open-development RDMA application programming interface

  11. Testbed Key Hardware • Mainboard: Tyan Thunder K8WE (S2895) • CPU: dual 2.2 GHz AMD Opteron, Socket 940 • Memory: 8 GB ATP 1 GB PC3200 DDR SDRAM on the NFS server and 2 GB CORSAIR CM725D512RLP-3200/M on the clients • IB switch: Flextronics InfiniScale III 24-port switch • IB HCA: Mellanox MT25208 InfiniHost III Ex

  12. Testbed Key Software • Kernel: Linux with the deadline I/O scheduler • NFS/RDMA release candidate 4 • oneSIS used to boot all the nodes • OpenFabrics IB stack, svn revision 7442

  13. Testbed Configuration (diagram: one NFS server connected to the clients through an IB switch) • One NFS server and up to four clients • NFS/TCP vs. NFS/RDMA • IPoIB and IB RDMA running SDR • Ext2 with software RAID0 backend • Clients ran IOzone, writing and reading 64 KB records with a 5 GB aggregate file size • to eliminate cache effects on the clients • to maintain consistent disk I/O on the server, allowing evaluation of the NFS/RDMA transport without being constrained by disk I/O • System resources monitored using vmstat at 2 s intervals
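The per-client benchmark run described above can be sketched with IOzone and vmstat as follows. Flag spellings follow the current IOzone and vmstat man pages; the exact 2006 invocations are not recorded in the slides, and the mount point name is illustrative.

```shell
# Sketch of one client's workload: sequential write (-i 0) and read (-i 1),
# 64 KB records, 5 GB file, against the NFS mount under test.
iozone -i 0 -i 1 -r 64k -s 5g -f /mnt/nfs-rdma/iozone.tmp

# Resource monitoring at 2-second intervals on server and clients,
# as used for the CPU, interrupt, and context-switch figures.
vmstat 2 > vmstat.log &
```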

  14. Local, NFS, and NFS/RDMA Throughput • Reads were served from the server cache, reflecting raw RPC transport performance • The TCP RPC transport achieved ~180 MB/s (1.4 Gb/s) of throughput • The RDMA RPC transport was capable of delivering ~700 MB/s (5.6 Gb/s) of throughput • RPCNFSDCOUNT=8 • /proc/sys/sunrpc/svc_rdma/max_requests=16
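The two server-side settings listed above would be applied along these lines (the /proc path is taken from the slide; using `rpc.nfsd` to set the thread count is a common convention, often driven by the RPCNFSDCOUNT variable in distribution init scripts):

```shell
# Run eight NFS server (nfsd) threads, matching RPCNFSDCOUNT=8.
rpc.nfsd 8

# Cap outstanding RPC/RDMA requests at 16, per the slide.
echo 16 > /proc/sys/sunrpc/svc_rdma/max_requests
```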

  15. NFS & NFS/RDMA Server Disk I/O • Writes incurred disk I/O, issued according to the deadline scheduler • The NFS/RDMA server has a higher incoming data rate, and thus a higher block I/O output rate to disk • The NFS/RDMA data rate was bottlenecked by the storage I/O rate, as indicated by the higher IOWAIT time

  16. NFS vs. NFS/RDMA Client Interrupt and Context Switch • NFS/RDMA incurred ~1/8 of the interrupts and completed in a little more than 1/2 of the time • NFS/RDMA showed higher context-switch rates, indicating faster processing of application requests • Higher throughput compared to NFS!

  17. Client CPU Efficiency • CPU seconds per MB of transfer: (Δt) × (Σ %CPU / 100) / file size • Write: NFS = 0.00375, NFS/RDMA = 0.00144 → 61.86% more efficient! • Read: NFS = 0.00435, NFS/RDMA = 0.00107 → 75.47% more efficient! • Improved application performance

  18. Server CPU Efficiency • CPU seconds per MB of transfer: (Δt) × (Σ %CPU / 100) / file size • Write: NFS = 0.00564, NFS/RDMA = 0.00180 → 68.10% more efficient! • Read: NFS = 0.00362, NFS/RDMA = 0.00055 → 84.70% more efficient! • Improved system performance
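The efficiency percentages on these two slides follow directly from the CPU-seconds-per-MB figures; a quick check (small differences from the quoted numbers come from rounding in the slides' raw data):

```python
# Relative CPU savings of NFS/RDMA vs. NFS, from the per-MB CPU
# cost figures quoted on the client and server efficiency slides.
costs = {
    "client write": (0.00375, 0.00144),
    "client read":  (0.00435, 0.00107),
    "server write": (0.00564, 0.00180),
    "server read":  (0.00362, 0.00055),
}
for name, (nfs, rdma) in costs.items():
    saving = 100 * (1 - rdma / nfs)   # percent less CPU per MB moved
    print(f"{name}: NFS/RDMA {saving:.1f}% more efficient")
```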

  19. Scalability Test – Throughput • To minimize the impact of disk I/O, the 5 GB aggregate was held constant: one client wrote/read 5 GB, two 2.5 GB each, three 1.67 GB each, four 1.25 GB each • Rewrite and reread results were ignored due to client-side cache effects
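The per-client file sizes above are just the fixed 5 GB aggregate divided among the clients, which keeps server-side disk I/O comparable across runs:

```python
# Per-client file size for an N-client run with a fixed 5 GB aggregate.
AGGREGATE_GB = 5.0
for n_clients in range(1, 5):
    per_client = AGGREGATE_GB / n_clients
    print(f"{n_clients} client(s): {per_client:.2f} GB each")
# 1 -> 5.00, 2 -> 2.50, 3 -> 1.67, 4 -> 1.25, matching the slide.
```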

  20. Scalability Test – Server I/O • The NFS RDMA transport demonstrated faster processing of concurrent RPC I/O requests and responses from and to the four clients than NFS • Concurrent NFS/RDMA writes were impacted more by our slow storage, as indicated by the close-to-80% CPU IOWAIT times

  21. Scalability Test – Server CPU • NFS/RDMA incurred ~1/2 the CPU overhead for about half of the duration, yet delivered 4 times the aggregate throughput compared to NFS • NFS/RDMA write performance was impacted more by the backend storage than NFS, as indicated by ~70% vs. ~30% CPU time spent waiting for I/O to complete

  22. Preliminary Conclusion • Compared to NFS, NFS/RDMA demonstrated impressive CPU efficiency and promising scalability • NFS/RDMA will improve application- and system-level performance! • NFS/RDMA can easily take advantage of the bandwidth of 10/20 gigabit networks for large file accesses

  23. Ongoing Work • SC06 participation • HPC Storage Challenge Finalist • Micro benchmark • MPI Applications with POSIX and/or MPI I/O • Xnet NFS/RDMA demo over IB and iWARP

  24. Future Plans • Initiate study of NFSv4 pNFS performance with RDMA storage • Blocks (SRP, iSER) • File (NFSv4/RDMA) • Object (iSCSI-OSD)?

  25. Why NFSv4 • NFSv3: use of the ancillary Network Lock Manager (NLM) protocol adds complexity and limits scalability for parallel I/O; the absence of an attribute-caching requirement squelches performance • NFSv4: integrated lock management allows the byte-range locking required for parallel I/O; compound operations improve the efficiency of data movement and …

  26. Why Parallel NFS (pNFS) • pNFS extends NFSv4 • Minimum extension to allow out-of-band I/O • Standards-based scalable I/O solution • Asymmetric, out-of-band solutions offer scalability • Control path (open/close) different from Data Path (read/write)

  27. Acknowledgement • The authors would like to thank the following for their technical input • Tom Talpey and James Lentini from NetApp • Tom Tucker from Open Grid Computing • James Ting from Mellanox • Matt Leininger and Mitch Sukalski from Sandia
