1 / 30

Persistent Memory over Fabrics An Application-centric view

Persistent Memory over Fabrics An Application-centric view. Paul Grun Cray, Inc OpenFabrics Alliance Vice C hair. Agenda. OpenFabrics Alliance Intro OpenFabrics Software Introducing OFI - the OpenFabrics Interfaces Project OFI Framework Overview – Framework, Providers

paniz
Download Presentation

Persistent Memory over Fabrics An Application-centric view

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Persistent Memory over Fabrics An Application-centric view Paul Grun Cray, Inc OpenFabrics Alliance Vice Chair

  2. Agenda OpenFabrics Alliance Intro OpenFabrics Software Introducing OFI - the OpenFabrics Interfaces Project OFI Framework Overview – Framework, Providers Delving into Data Storage / Data Access Three Use Cases A look at Persistent Memory

  3. OpenFabrics Alliance The OpenFabrics Alliance (OFA) is an open source-based organization that develops, tests, licenses, supports and distributes OpenFabrics Software (OFS). The Alliance’s mission is to develop and promote software that enables maximum application efficiency by delivering wire-speed messaging, ultra-low latencies and maximum bandwidth directly to applications with minimal CPU overhead.   https://openfabrics.org/index.php/organization.html

  4. OpenFabrics Alliance – selected statistics • Founded in 2004 • Leadership • Susan Coulter, LANL – Chair • Paul Grun, Cray Inc. – Vice Chair • Bill Lee, Mellanox Inc. – Treasurer • Chris Beggio, Sandia – Secretary (acting) • 14 active Directors/Promoters (Intel, IBM, HPE, NetApp, Oracle, Unisys, Nat’l Labs…) • Major Activities • Develop and support open source network stacks for high performance networking • OpenFabrics Software - OFS • Interoperability program (in concert with the University of New Hampshire InterOperability Lab) • Annual Workshop – March 27-31, Austin TX • Technical Working Groups • OFIWG, DS/DA, EWG, OFVWG, IWG • https://openfabrics.org/index.php/working-groups.html (archives, presentations, all publicly available) today’s focus

  5. OpenFabrics Software (OFS) OpenFabrics Software Open Source APIs and software for advanced networks Emerged along with the nascent InfiniBand industry in 2004 Soon thereafter expanded to include other IB-based networks such as RoCE and iWARP Wildly successful, to the point that RDMA technology is now being integrated upstream. Clearly, people like RDMA. OFED distribution and support: managed by the Enterprise Working Group – EWG Verbs development: managed by the OpenFabrics Verbs Working Group - OFVWG verbs api uverbs kverbs IB iWARP IB RoCE phy wire IB ENET ENET

  6. Historically, network APIs have been developed ad hoc as part of the development of a new network. To wit: today’s Verbs API is the implementation of the verbs semantics specified in the InfiniBand Architecture. But what if a network API was developed that catered specifically to the needs of its consumers, and the network beneath it was allowed to develop organically? What would be the consumer requirements? What would such a resulting API look like?

  7. Introducing the OpenFabrics Interfaces Project • OpenFabric Interfaces Project (OFI) • Proposed jointly by Cray and Intel, August 2013 • Chartered by the OpenFabrics Alliance, w/ Cray and Intel as co-chairs • Objectives • ‘Transport neutral’ network APIs • ‘Application centric’, driven by application requirements “Transport Neutral, Application-Centric” OFI Charter Develop, test, and distribute: • An extensible, open source framework that provides access to high-performance fabric interfaces and services • Extensible, open source interfaces aligned with ULP and application needs for high-performance fabric services OFIWG will not create specifications, but will work with standards bodies to create interoperability as needed 7

  8. OpenFabrics Software (OFS) OpenFabrics Software verbs ofi api Result: An extensible interface driven by application requirements Support for multiple fabrics Exposes an interface written in the language of the application uverbs libfabric kverbs kfabric . . . IB provider provider provider IB iWARP RoCE . . . phy wire IB ENET ENET fabric fabric fabric

  9. OFI Framework • OFI consists of two major components: • a set of defined APIs (and some functions) – libfabric, kfabric • a set of wire-specific ‘providers’ - ‘OFI providers’ • Think of the ‘providers’ as an implementation of the API on a given wire Application a series of interfaces to access fabric services MPI API Service I/F Service I/F Service I/F Service I/F OFI Providers … Provider A Provider B Provider C hardware Provider exports the services defined by the API NIC NIC NIC

  10. OpenFabrics Interfaces - Providers All providers expose the same interfaces, None of them expose details of the underlying fabric. ofi libfabric / kfabric libfabric kfabric sockets provider GNI provider verbs provider verbs provider (RoCEv2, iWARP) usnicprovider . . . … provider provider provider . . . ENET ARIES IB ENET fabric fabric fabric ENET Current providers: sockets, verbs, usnic, gni, mlx, psm, psm2, udp, bgq

  11. OFI Framework – a bit more detail

  12. OFI Project Overview – work group structure OFI Project Data Storage, Data Access Distributed and Parallel Computing Legacy Apps Data Analysis • Sockets apps • IP apps • Structured data • Unstructured data Data Storage - object, file, block - storage class memory Data Access - remote persistent memory • Msg Passing • MPI applications Shared memory - PGAS languages DS/DA WG OFI WG Application-centric design means that the working groups are driven by use cases: Data Storage / Data Access, Distributed and Parallel Computing…

  13. OFI Project Overview – work group structure OFI Project Data Storage, Data Access Distributed and Parallel Computing Legacy Apps Data Analysis • Sockets apps • IP apps • Structured data • Unstructured data Data Storage - object, file, block - storage class memory Data Access - remote persistent memory • Msg Passing • MPI applications Shared memory - PGAS languages kernel mode ‘kfabric’ user mode ‘libfabric’ (Not quite right because you can imagine user mode storage, e.g. CEPH)

  14. OFI Project Overview – work group structure OFI Project Data Storage, Data Access Distributed and Parallel Computing Legacy Apps Data Analysis • Sockets apps • IP apps • Structured data • Unstructured data Data Storage - object, file, block - storage class memory Data Access - remote persistent memory • Msg Passing • MPI applications Shared memory - PGAS languages kernel mode ‘kfabric’ user mode ‘libfabric’ Since the topic today is Persistent Memory…

  15. Data Storage, Data Access? Data Storage / Data Access DS - object, file, block - storage class mem DA - persistent memory storage client user or kernel app POSIX I/F load/store, memcopy… DIMM DIMM NVDIMM NVDIMM file system virtual switch NVM devices (SSDs…) Reminder: libfabric: User mode library kfabric : Kernel modules provider NIC remote storage class memory, persistent memory

  16. DS/DA’s Charter remote PM remote storage local mem tolerance for latency uS pS mS Data Access Data Storage At what point does remote storage begin to look instead like remote persistent memory? How would applications treat a pool of remote persistent memory? Key Use Cases: Lustre NVMe Persistent Memory

  17. Data Storage Example - LNET Basic LNET stack Cray Aries stack TCP/IP stack IB stack kfabric stack Lustre Lustre Lustre, DVS… LNET app Lustre An LND for every fabric ptlrpc … … … … LNET … … … … kfabric LND socklnd o2iblnd gnilnd N/W API api sockets verbs provider N/W hardware Gemini Ethernet IB, Ethernet fabric Backport LNET to kfabric, and we’re done. Any kfabric provider after that will work with Lustre.

  18. Data Storage Example – NVMe verbs ofi kverbs kfabric IB . . . iWARP IB provider provider VERBS provider RoCE . . . IB ENET ENET fabric fabric fabric

  19. Data Storage – enhanced APIs for storage kernel application VFS / Block I/O / Network FS / LNET • Places where the OFA can help: • kfabric as a second native API for NVMe • kfabric as a possible ‘LND’ for LNET SCSI NVMe iSCSI NVMe/F kfabric kverbs sockets provider NIC HCA NIC, RNIC fabric-specific device IP IB fabric RoCE, iWarp

  20. A look at Persistent Memory remote PM remote storage local mem latency tolerance uS pS mS • Applications tolerate long delays for storage, but assume very low latency for memory • Storage systems are generally asynchronous, target driven, optimized for cost • Memory systems are synchronous, and highly optimized to deliver the lowest imaginable latency with no CPU stalls • Persistent Memory over fabrics is somewhere in between: • Much faster than storage, but not as fast as local memory • How to treat PM over fabrics? • Build tremendously fast remote memory networks, or • Find ways to hide the latency, making it look for the most part like memory, but cheaper and remote

  21. Two interesting PM use cases • Consumers that handle remote memory naturally… • SHMEM-based applications, PGAS languages… • Requirements: • optimized completion semantics to indicate that data is globally visible, • semantics to commit data to persistent memory • completion semantics to indicate persistence • …and those that don’t (but we wish they did) • HA use case • Requirements: • exceedingly fast remote memory bus, or a way to hide latency • the latter requires an ability to control when data is written to PM

  22. PM high availability use case user kernel user load/store file access mem mapping memory controller file system user NVDIMM NVDIMM DIMM DIMM cache writes store, store, store… libfabric kfabric flush provider provider persistent memory NIC persistent memory

  23. Data Access – completion semantics I/O bus mem bus PM device(s) user app • For ‘normal’ fabrics, the responder returns an ACK when the data has been received by the end point. • It may not be globally visible, and/or • It may not yet be persistent • Need an efficient mechanism for indicating when the data is globally visible, and is persistent PM client PM server provider NIC NIC

  24. Data Access – key fabric requirements Objectives: • Client controls commits to persistent memory • Distinct indications of global visibility, persistence • Optimize protocols to avoid round trips Responder Requester app memory persistent memory I/O pipeline fi_send, fi_rma requester (consumer) EP EP completion ack-a data is received Caution: OFA does not define wire protocols, it only defines the semantics seen by the consumer ack-b data is visible ack-c data is persistent

  25. Possible converged I/O stack kernel application user app user app kernel application VFS / Block I/O / Network FS / LNET byte access byte access VFS / Block Layer SRP, iSER, NVMe/F, NFSoRDMA, SMB Direct, LNET/LND,… iSCSI SCSI NVMe ulp libfabric kfabric sockets * kverbs provider provider PCIe mem bus PCIe SSD HBA SSD fabric-specific device NIC, RNIC HCA NIC NVDIMM local I/O local byte addressable remote byte addressable remote I/O * kfabric verbs provider exists today

  26. Discussion – “Between Two Ferns” Doug Voigt - SNIA NVM PM TWG Chair Paul Grun – OFA vice Chair, co-chair OFIWG, DS/DA Are we going in the right direction? Next steps?

  27. Thank You

  28. Backup

  29. kfabric API – very similar to existing libfabric kfabric API http://downloads.openfabrics.org/WorkGroups/ofiwg/dsda_kfabric_architecture/ Consumer APIs • kfi_getinfo() kfi_fabric() kfi_domain() kfi_endpoint() kfi_cq_open() kfi_ep_bind() • kfi_listen() kfi_accept() kfi_connect() kfi_send() kfi_recv() kfi_read() kfi_write() • kfi_cq_read() kfi_cq_sread() kfi_eq_read() kfi_eq_sread() kfi_close() … Provider APIs • kfi_provider_register()During kfi provider module load a call to kfi_provider_register() supplies the kfiapi with a dispatch vector for kfi_* calls. • kfi_provider_deregister()During kfi provider module unload/cleanup kfi_provider_deregister() destroys the kfi_* runtime linkage for the specific provider (ref counted).

  30. Data Storage – NVMe/F today kernel application VFS / Block I/O / Network FS / LNET NVMe/F is an extension, allowing access to an NVMe device over a fabric NVMe/F leverages the characteristics of verbs to good effect for verbs-based fabrics SCSI NVMe iSCSI NVMe/F kverbs sockets NIC HCA NIC, RNIC IP IB RoCE, iWarp

More Related