
Implementing PCI I/O Virtualization Standards

Implementing PCI I/O Virtualization Standards. Mike Krause and Renato Recio, PCI SIG IOV Work Group Co-chairs. Today’s Approach To IOV. Virtualization Intermediaries (VI) and hypervisors are used to safely share IO. Under this approach, one or more System Images share the PCI device through a VI.


Presentation Transcript


  1. Implementing PCI I/O Virtualization Standards. Mike Krause and Renato Recio, PCI SIG IOV Work Group Co-chairs

  2. Today’s Approach To IOV
     • Virtualization Intermediaries (VI) and hypervisors are used to safely share IO
     • Under this approach
       • One or more System Images (SI) share the PCI device through a VI
       • Virtualization enablers are not needed in either the Root Complex (RC) or the PCIe Device
       • The VI is involved in all IO transactions and performs all IO virtualization functions
     [Diagram: VI Based PCI Device Sharing Example. System Images share the adapter through a VI; MMIO and DMA operations go through the VI. Today’s PCIe Device has one or more Functions and may not be aware that it is being shared.]

  3. Performance Of Today’s IOV
     • VI based IOV adds path length on every IO operation
     • Native IOV significantly improves performance
       • For the charted example, Native IOV doubles throughput and reduces latency by up to half
     • Several factors are increasing the use of virtualization (e.g., more cores per socket, customer simplification requirements, …)
       • Making Native IOV even more important in the future
     [Charts: Throughput* and Latency*, Native IOV vs. VI Based IOV]
     *Source: Self-Virtualized I/O: High Performance, Scalable I/O Virtualization in Multi-core Systems; R. Himanshu, I. Ganev, K. Schwan - Georgia Tech and J. Xenidis - IBM

  4. PCI SIG IOV Overview
     • PCI SIG is standardizing mechanisms that enable PCIe Devices to be directly shared, with no run-time overheads
       • Single-Root IOV – Direct sharing between SIs on a single system
       • Multi-Root IOV – Direct sharing between SIs on multiple systems
     • PCI-SIG IOV Specification covers “north-side” of the Device
     • “PCI IOV Usage Models and Implementations” session will cover examples of how PCI IOV specifications can be used to virtualize PCIe Devices
     [Diagrams: PCIe Single-Root IOV (SIs and a VI above one Hypervisor and PCIe RC, sharing a PCIe IOV Capable Device); PCIe Multi-Root IOV (Blades, each with its own PCIe RC, sharing PCIe MR-IOV Capable Enet and FC Devices through a PCIe Topology; Ethernet LAN and FC SANs below)]

  5. Terminology
     • System Image (SI)
       • SW, e.g., a guest OS, to which virtual and physical devices can be assigned
     • Virtualization Intermediary (VI)
       • Performs resource allocation, isolation, management and event handling
     • PCIM – PCI Manager
       • Controls configuration, management and error handling of PFs and VFs
       • May be in SW and/or Firmware
       • May be integrated into a VI
     • Translation Agent (TA)
       • Uses the ATPT to translate PCI Bus Addresses into platform addresses
     • Address Translation and Protection Table (ATPT)
       • Validates access rights of incoming PCI memory transactions
       • Translates PCI Addresses into platform physical addresses
     [Diagram: SIs, VI and SR-PCIM above the Hypervisor; Processor and Memory; PCIe Root Complex containing the TA and ATPT; PCIe Ports, a PCIe Switch and PCIe Devices with Functions (F)]

  6. Terminology… Continued
     • Single-Root IOV (SR-IOV) - A PCIe hierarchy with one or more components that support the SR-IOV Capability
     • Multi-Root Aware (MRA) - A PCIe component that supports the MR-IOV capability
     • Multi-Root IOV (MR-IOV) - A PCIe Topology containing one or more PCIe Virtual Hierarchies
     • Virtual Hierarchy (VH) - A portion of an MR Topology assigned to a PCIe hierarchy, where each VH has its own PCI Memory, IO, and Configuration Space
     [Diagrams: PCIe Single-Root IOV (PCIe IOV Capable Device below one PCIe RC); PCIe Multi-Root IOV (Blades with their own PCIe RCs sharing a PCIe MRA Device, a PCIe Device and a PCIe SR-IOV MRA Device through a PCIe Fabric)]

  7. Terminology… Continued
     • Address Translation Cache (ATC)
       • Cache of recent translations
     • Physical Function (PF)
       • Function that supports SR-IOV
       • Used to manage VFs
     • Virtual Function (VF)
       • Function that supports SR-IOV and shares resources with the PF it is associated with
     • Base Function (BF)
       • Function that supports MR-IOV
       • Used to manage PFs and VFs on MRA Devices
     [Diagrams: PCIe SR-IOV Capable Device (PCIe Port, PF0, VF1 … VFN, each VF with its own ATC and Resources, Internal Routing); PCIe MR-IOV Capable Device (PCIe Port, VH0 with BF0, VH1 … VHN each with PF0 and VF1 … VFN, Internal Routing)]

  8. PCI IOV Related Mechanisms
     • Function Level Reset (FLR)
     • Alternative Routing ID Interpretation (ARI)
     • Address Translation Services
     • Single-Root IO Virtualization
     • Multi-Root IO Virtualization

  9. FLR And ARI
     • FLR - Provides Function level granularity on resets
       • All software readable state must be cleared by an FLR
       • All outstanding transactions associated with the Function referenced by the FLR must be completed when the FLR is returned as completed
     • ARI - Extends Function number field from 3 to 8 bits
       • Allows up to 256 Functions or VFs per PCIe Device
     [Diagram: VI and SIs above the Hypervisor; PCIe Root Complex with ATPT; PCIe Ports; a PCIe Device with Function1 to Function3; a PCIe Device (FC HBA) with Config Mgt, PF0, VF1 to VF3, Routing and FC Ports to FC SANs; FLRs target individual Functions]

  10. Address Translation Services
     • ATS is used to cache PCIe Memory Address translations in a PCI Device
     • Consists of three new PCIe transactions
       • Request Translation Transaction – Used by a PCIe Device to request a translation of an untranslated address
       • Translated DMA Transaction – Used to perform a DMA that references a translated address
       • Invalidate Translation Transaction – Used to invalidate a previously exposed translated address
     [Diagram: PCIe Device with Address Translation Services. VI and SIs above the Hypervisor; PCIe Root Complex with ATPT; PCIe Port; PCIe Device with Routing, PF0, VF1 to VF3, ATC1 to ATC3 and a Downstream Port]

  11. PCI Single-Root IOV Overview

  12. Today’s PCI Device
     • Function 0 is required
     • Overview of Function Attributes
       • Each Function has its own configuration and PCIe memory address space
       • Up to 8 PCI Functions with unique configuration space / BAR / etc.
         • ARI Capability enables up to 256 Functions to be supported
       • Support INTx, MSI, MSI-X or a combination of MSI and MSI-X
       • Function dependencies through vendor specific mechanisms
       • Cannot be directly shared by SIs
       • Vendor specific mechanisms to associate Functions to “South-side” resources
     [Diagram: PCIe Device with PCIe Port, F0, F1 … FN, each with its own ATC and Resources, Internal Routing and a Downstream Port]

  13. SR-IOV Device Overview
     • Function 0 is required
     • Overview of VF attributes
       • Each VF has its own configuration and PCIe memory address space
       • VFs share a contiguous PCIe memory address space
       • Up to ~2^16 Virtual Functions
         • ARI enables up to 256
         • IOV enables additional Bus Numbers to be associated
       • Function dependencies defined through standard mechanism
       • Support MSI and MSI-X
       • Can be directly shared by SIs
       • Vendor specific mechanisms to associate Functions to “South-side” resources
     [Diagram: PCIe SR-IOV Capable Device with PCIe Port, PF0, VF1 … VFN, each VF with its own ATC and Resources, Internal Routing and a Downstream Port]

  14. SR-IOV Device Discovery
     • Each PF must have an SR-IOV Extended Capability structure
     • A Device may have a mix of Functions and PFs
     • Each Function and PF consumes one Routing Identifier (RID)
     • For a Device that isn’t ARI capable, Functions and PFs must be in the first 8 Functions; if the Device is ARI capable, the number of Functions and PFs supported is as defined in the base specification
     [Diagram: SR-PCIM, SI A and SI B above the PCI Root; PCIe Topology; Device with PF0, PF1, F2, PF3]
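As a rough illustration of the discovery step, the sketch below walks the PCIe extended capability list of a raw configuration-space image and reports whether a Function carries the SR-IOV Extended Capability, which would mark it as a PF. The helper names are hypothetical. The layout it assumes (list starting at offset 0x100, capability ID in bits 15:0, next pointer in bits 31:20, SR-IOV ID 0x0010) follows the released PCIe specifications, not anything stated on the slide.

```python
import struct

SRIOV_EXT_CAP_ID = 0x0010  # assumed: extended capability ID assigned to SR-IOV
EXT_CAP_START = 0x100      # assumed: extended capabilities begin at this offset


def find_ext_capability(cfg, cap_id):
    """Return the config-space offset of extended capability `cap_id`, or None."""
    offset = EXT_CAP_START
    visited = set()  # guards against a malformed, looping list
    while offset >= EXT_CAP_START and offset not in visited and offset + 4 <= len(cfg):
        visited.add(offset)
        (header,) = struct.unpack_from("<I", cfg, offset)
        if header in (0, 0xFFFFFFFF):    # empty list, or nothing responded
            return None
        if header & 0xFFFF == cap_id:    # bits 15:0 hold the capability ID
            return offset
        offset = (header >> 20) & 0xFFC  # bits 31:20 hold the next offset
    return None


def is_physical_function(cfg):
    """Hypothetical helper: treat a Function with the SR-IOV capability as a PF."""
    return find_ext_capability(cfg, SRIOV_EXT_CAP_ID) is not None
```

An SR-PCIM-like tool could run `is_physical_function` over each enumerated Function to separate PFs from ordinary Functions such as F2 in the diagram.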

  15. VF Discovery - Part 1
     • The TotalVFs field is used to discover the number of Active VFs that a PF could have
     • The InitialVFs field is used to discover the number of Active VFs that a PF initially has
     • Note for a Device that isn’t MR capable*: 1 ≤ InitialVFs = TotalVFs
     [Diagram: SR-PCIM, SI A and SI B above the PCI Root; PCIe Topology; Device with PF0, PF1, F2, PF3]

  16. BAR Discovery
     • Each PF has its own independent set of BARs in its standard configuration space and an MSE bit for those BARs
     • The VFs share a BAR set and have an MSE bit that controls the memory space of all the VFs
     • The BAR set that is shared by all the VFs resides in the PF’s SR-IOV capabilities
     [Diagram: SR-PCIM, SI A and SI B above the PCI Root; PCIe Topology; Device with PF0, PF1, F2, PF3]

  17. Supported Page Sizes
     • Supported Page Sizes is used to discover the page sizes supported by the VFs associated with the PF
     • When this field is read, the PF must return the page sizes it can support
     • This field will be used during the IOV configuration phase to align VF BAR apertures on system page boundaries (more later)
     [Diagram: SR-PCIM, SI A and SI B above the PCI Root; PCIe Topology; Device with PF0, PF1, F2, PF3]

  18. Configuring VFs
     • VFs for a PF are enabled by writing NumVFs and then setting the “VF Enable” bit
     • When VFs are enabled, the PCIe Device must associate NumVFs worth of VFs with the PF
     • If VF Migration Capable and VF Migration Enable are set, then NumVFs must be 1 ≤ NumVFs ≤ TotalVFs
     • Otherwise NumVFs must be 1 ≤ NumVFs ≤ InitialVFs
     [Diagram: SR-PCIM, SI A and SI B above the PCI Root; PCIe Topology; Device with PF0, PF1, F2, PF3; control bits VF Migration Capable, VF Migration Enable, VF Enable]
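The NumVFs range rule on this slide can be written down directly. The following is a minimal sketch with hypothetical function and parameter names; it only checks the bounds and does not touch hardware.

```python
def max_num_vfs(initial_vfs, total_vfs, migration_capable, migration_enabled):
    """Upper bound for NumVFs: TotalVFs with migration active, else InitialVFs."""
    if migration_capable and migration_enabled:
        return total_vfs
    return initial_vfs


def validate_num_vfs(num_vfs, initial_vfs, total_vfs,
                     migration_capable=False, migration_enabled=False):
    """Return True if num_vfs may be written before setting VF Enable."""
    limit = max_num_vfs(initial_vfs, total_vfs,
                        migration_capable, migration_enabled)
    return 1 <= num_vfs <= limit
```

For example, with InitialVFs = 4 and TotalVFs = 8, a request for 8 VFs passes only when both migration bits are set.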

  19. Configuring The VFs’ BARs
     • System Page Size defines the page size the system will use
       • The value of System Page Size must be one of the Supported Page Sizes
     • System Page Size is used by the PF to align the MMIO aperture defined by each BAR to a system page boundary
     • VF BARs behave as in PCI 3.0 Spec’s PCI BARs, except that a VF BAR describes the aperture for multiple VFs (see next page)
     [Diagram: SR-PCIM, SI A and SI B above the PCI Root; PCIe Topology; Device with PF0, PF1, F2, PF3]

  20. PF And VF BAR Semantics
     • The memory aperture required for each VF BAR can be determined by writing all “1”s and then reading the VF BAR
       • Behaves as in the base PCI Spec
     • The address written into each VF BAR is used by the Device to set the starting address for that BAR on the first VF
     • The differences between VF BARs and PCI 3.0 Spec BARs are
       • For each VF BAR, the memory space associated with the 2nd and higher VFs is derived from the starting address of the first VF and the memory space aperture
       • The VF BAR’s MMIO space is not enabled until VF Enable and VF MSE have been set
     BAR1 VF N SA = BAR1 VF 1 SA + (N - 1) x (BAR1 VF MA)
     where BAR1 VF 1 SA is the address written into VF BAR1 and MA is the memory aperture of VF BAR1.
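A short worked sketch of the two calculations on this slide: sizing the aperture from the value read back after writing all 1s, and deriving the starting address of VF N. It assumes a 32-bit memory BAR whose low four bits are type flags, and it assumes VF 1 starts exactly at the address written into the VF BAR, with each later VF following one aperture further on. Function names are hypothetical.

```python
def bar_aperture_from_probe(readback):
    """Aperture size of a 32-bit memory BAR, from the value read after writing all 1s."""
    size_mask = readback & 0xFFFFFFF0        # drop the low flag bits
    return ((~size_mask) & 0xFFFFFFFF) + 1   # lowest writable bit gives the size


def vf_bar_start(first_vf_start, aperture, n):
    """Starting address of this BAR for VF n (VFs are numbered from 1)."""
    if n < 1:
        raise ValueError("VF numbers start at 1")
    return first_vf_start + (n - 1) * aperture
```

With a readback of 0xFFFFF000 the aperture is 4 KiB, so if 0xE0000000 is written into the VF BAR, VF 3 starts at 0xE0002000.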

  21. VF Discovery – Part 2
     • The values in NumVFs and VF ARI Enable affect the values for First VF Offset and VF Stride*
     • First VF Offset and VF Stride are defined by the Device
     • Following is the equation for determining the RID of VF N:
       VF N RID = [PF RID + First VF Offset + (N - 1) * VF Stride] modulo 2^16
       where all arithmetic used is 16 bit unsigned, dropping all carries
     *Note: After setting NumVFs and VF ARI Enable, SR-PCIM can read these two fields to determine how many busses will be consumed by the PF’s VFs
     [Diagram: SR-PCIM, SI A and SI B above the PCI Root; PCIe Topology; Device with PF0, PF1, F2, PF3]
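The RID equation can be sketched in a few lines. The functions below are hypothetical helpers; the sample values in the note that follows are invented for illustration, not taken from the presentation.

```python
def vf_rid(pf_rid, first_vf_offset, vf_stride, n):
    """Routing ID of VF n (numbered from 1), using 16-bit unsigned arithmetic."""
    return (pf_rid + first_vf_offset + (n - 1) * vf_stride) & 0xFFFF


def vf_bus_numbers(pf_rid, first_vf_offset, vf_stride, num_vfs):
    """Sorted bus numbers (upper 8 bits of the RID) occupied by a PF's VFs."""
    buses = {vf_rid(pf_rid, first_vf_offset, vf_stride, n) >> 8
             for n in range(1, num_vfs + 1)}
    return sorted(buses)
```

Given a PF at RID 0x0200 with an assumed First VF Offset of 0x80 and VF Stride of 2, VF 1 lands at 0x0280 and VF 65 at 0x0300, so 65 VFs spill from bus 2 onto bus 3. This is the kind of bus-count check the slide's note describes.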

  22. PF And VF BAR Semantics PCIe RID Number
     • PF and VF RIDs must not overlap given any valid NumVFs setting across all PFs of a Device
     • As in base, an SR-IOV Device captures Bus # from any Type 0 config request
     • VFs may reside on a different bus number than the associated PF
     • If the switch above the Device
       • supports the ARI Forwarding bit, RIDs are interpreted as 8 bit Bus # and 8 bit Function #
       • doesn’t support ARI, RIDs are interpreted as BDF#
     • Bus # can be used to assign VFs, but PFs must be on the 1st Bus assigned to the Device
     [Diagram labels: PF 0 RID = 0200, PF 1 RID = 0201]
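To make the two RID interpretations concrete, here is a small sketch with a hypothetical function name. It assumes the conventional non-ARI split of 8-bit Bus, 5-bit Device and 3-bit Function, and the ARI split of 8-bit Bus plus 8-bit Function number with the Device number implicitly 0.

```python
def decode_rid(rid, ari):
    """Split a 16-bit Routing ID into bus, device and function fields."""
    bus = (rid >> 8) & 0xFF
    if ari:
        # ARI: the low 8 bits are all Function number; Device is implicitly 0.
        return {"bus": bus, "device": 0, "function": rid & 0xFF}
    # Non-ARI: classic Bus / Device / Function (BDF) layout.
    return {"bus": bus, "device": (rid >> 3) & 0x1F, "function": rid & 0x7}
```

The diagram's PF 1 RID of 0201 decodes to bus 2, function 1 either way; a RID such as 0x0288 reads as function 0x88 under ARI but as device 17, function 0 without it.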

  23. Reset Mechanisms • Three reset mechanisms are supported • Conventional Reset – Resets all PF and VF state • FLR that targets a VF – Resets a single VF • FLR that targets a PF – Resets a PF and its associated VFs

  24. Conventional Reset
     • A Fundamental or Hot Reset to an SR-IOV Device shall cause all Function, PF, and VF context to be reset to its original, power-on state
     • If a PF has its VFs enabled and a Fundamental or Hot Reset is issued to the Device, the Device must reset all PF and VF state, e.g.:
       • The PF must disable its SR-IOV capabilities and revert to being a PCI Function
       • Settable SR-IOV capabilities (e.g., NumVFs) are reset to default values
       • MSE and BME are both off
       • All BAR values used by the PF and VFs are indeterminate
       • All interrupts are disabled

  25. FLRs To VFs And PFs
     • An FLR that targets a VF must be supported
       • Software may use an FLR to reset a VF
       • An FLR that targets a VF must
         • Not affect the VF’s existence (e.g., it still consumes a RID)
         • Not affect any address assigned to it. That is, the VF’s BAR registers and MSE are unaffected by the FLR
     • An FLR that targets a PF must be supported
       • Software may use an FLR to reset a PF
       • An FLR that targets a PF must
         • Reset the PF’s SR-IOV Extended Capabilities
         • Leave the VFs no longer allocated or enabled

  26. PCI Multi-Root IOV Overview

  27. Today’s PCIe Topology
     • Strict Tree
       • Root Ports connect to Switches
       • Switches connect to Devices
       • Root Ports can also connect to Devices
       • Switches can also connect to more Switches
     • Single Software Management Entity
       • Runs above the Root
     • Implicit Message Routing
       • Route to Root
       • Broadcast from Root

  28. Single Switch MR Topology
     [Diagram labels: Root A, Root B, Root C, Root D, SI, MR Switch, MR Device (x3), PCIe Device, FC SAN (x2), Ethernet LAN (x2)]

  29. Two Switch MR Topology
     [Diagram labels: Root A, Root B, Root C, Root D, Left and Right switches, MR Device (x3), PCIe Device, FC SAN (x2), Ethernet LAN (x2)]

  30. TLP Labeling
     • TLP Prefix on MR Links
       • After Seq #, before TLP Hdr
       • Sent / Resent with TLP
     • Prefix Contains
       • VH Number (link local)
       • VL Number (flow control)
       • Global Key (error checking)
     • Added at MR Ingress
       • First MR Component seen
     • Dropped at MR Egress
       • Last MR Component seen
     [Diagram: packet layout. STP, Sequence #, TLP Prefix, PCIe TLP Header, TLP Data (optional), ECRC (optional), LCRC, END]

  31. SR-IOV Device Overview MR-IOVCapable Device PCIe Port PF0 PF0VF1 : DownstreamPort Internal Routing Internal Routing PF0VFN

  32. MR-IOV Device Overview
     • Each Virtual Hierarchy (VH) has a full PCIe address space
       • Configuration, Memory and I/O Space
     • Up to 2^8 Virtual Hierarchies
     • VH0 is used to manage MR-IOV features of the Device
       • BF in VH0 manages associated PFs and VFs
       • Other Functions optional in VH0
     • VH1 to VHmax have the same PFs at the same Function #s
       • Each PF in each VH has its own values for InitialVFs, TotalVFs, NumVFs, VF Stride and VF Offset
     [Diagram: MR-IOV Capable Device with PCIe Port, VH0 BF0, VH1 PF0 with VF1 … VFN, VHK PF0 with VF1 … VFM, Internal Routing and a Downstream Port]

  33. MR-IOV Device Configuration • Initially, only VH0 is enabled • Within VH0, MR-PCIM enumerates Config Space • Locate all BFs by looking for MR-IOV Capability For each Device located… • In BF0 • Determine MaxVH, set NumVH • Enable additional VLs (if available) • Configure VL arbitration (optional) • Configure per VH mapping of VCID to VL • Configure per VH Global Keys • In every BF • In each VH, provision VC Resources (optional) • If hardware present, configure Initial VF mapping

  34. PCI Specification Schedule ATS SR-IOV MR-IOV

  35. Schedules • ATS Specification released 3/8/2007 • SR-IOV Specification • Following PCI SIG Specification Process • Draft 0.7 Completed • Draft 0.9 May, 2007 • Version 1.0 early 3Q/2007 • MR-IOV Specification • Following PCI SIG Specification Process. • Draft 0.5 Completed • Draft 0.7 2Q/2007 • Draft 0.9 late 2Q/2007 • Version 1.0 late 3Q/2007

  36. Call To Action • Please participate in the PCI-SIG Specification Development Process • For more information please go to www.pcisig.com • Thank you for attending

  37. Workgroup Members • AMD • Broadcom • Emulex • HP • IBM • IDT • Intel • LSI Logic • Microsoft • NextIO • Neterion • Nvidia • PLX • Qlogic • Stargen • Sun • VMWare

  38. Questions

  39. MR Topics Back-ups

  40. Switch TLP Routing
     • Three step routing decision:
       • MR Input Map
         • { Input Port, Input VH# } → { VS, VS Input Port }
         • MR-PCIM Manages
       • VS PCIe Map
         • { VS Input Port, TLP Hdr } → { VS Output Port }
         • VH Software Manages using PCIe rules
       • MR Output Map
         • { VS, VS Output Port } → { Output Port, Output VH# }
         • MR-PCIM Manages
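The three-step decision can be sketched as three lookups chained together. This is a toy model under stated assumptions: the two MR maps are plain dictionaries, and the per-VS PCIe routing is passed in as a callable `vs_route`, a hypothetical stand-in for the normal PCIe address and ID routing rules.

```python
def route_tlp(input_port, input_vh, tlp_hdr, input_map, vs_route, output_map):
    """Route one TLP through an MR switch modeled as three lookups."""
    # Step 1, MR Input Map: {Input Port, Input VH#} -> {VS, VS Input Port}
    vs, vs_in = input_map[(input_port, input_vh)]
    # Step 2, VS PCIe Map: {VS Input Port, TLP Hdr} -> {VS Output Port}
    vs_out = vs_route(vs, vs_in, tlp_hdr)
    # Step 3, MR Output Map: {VS, VS Output Port} -> {Output Port, Output VH#}
    return output_map[(vs, vs_out)]


def example_vs_route(vs, vs_in, tlp_hdr):
    """Invented rule for illustration: split traffic on one address boundary."""
    return 2 if tlp_hdr["address"] >= 0x80000000 else 1
```

Because only the first and last maps mention physical ports and VH numbers, software inside a VH sees an ordinary PCIe switch while MR-PCIM controls how that virtual switch is placed on the physical one.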

  41. Hot Reset
     • PCIe Hot Reset is broadcast downstream
       • Switch Upstream port → all Downstream ports
       • Uses Phy Training Sequence
       • Handshake to detect link partner received
     • MR Reset is per VH
       • VS Upstream port → all VS Downstream ports
       • Uses Reset DLLP
       • Handshake to detect link partner received
       • Link stays up, other VHs unaffected

  42. Flow Control
     • PCIe uses Virtual Channels (VC) for QoS, traffic isolation, etc.
       • MR extends this to Virtual Links (VL)
     • MR adds per VH, VL flow control
     • Traffic Gate
       • PCIe: sufficient VC Credits to send TLP
       • MR: sufficient VL Credits to send TLP and sufficient VH, VL Credits to send TLP
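The traffic gate amounts to one credit check in PCIe and two in MR. The sketch below is deliberately simplified: it treats credits as a single integer pool, whereas real PCIe flow control tracks header and data credits per transaction type. The function names are hypothetical.

```python
def can_send_pcie(credits_needed, vc_credits):
    """PCIe gate: the Virtual Channel must hold enough credits for the TLP."""
    return vc_credits >= credits_needed


def can_send_mr(credits_needed, vl_credits, vh_vl_credits):
    """MR gate: both the VL pool and the per-(VH, VL) pool must have enough credits."""
    return vl_credits >= credits_needed and vh_vl_credits >= credits_needed
```

The second check is what keeps one Virtual Hierarchy from consuming all of the buffering on a Virtual Link it shares with others.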

  43. Management / Authorization
     • Devices are managed using VH0
     • Switches are managed using the Upstream port of any Authorized VS
     • Multiple VSs may be simultaneously authorized
       • Supports MR-PCIM failover
     • At failover, the new MR-PCIM remaps VH0 so it can manage MR Devices

  44. Virtual Hot Plug • Virtual Hot Plug is implemented in each Downstream Port of each VS • Software in VH sees PCIe Slot Control Regs • MR-PCIM controls virtual “buttons & lights” • e.g. pushing the virtual Attention Button, detecting virtual slot power state, … • Physical Hot-Plug is implemented in each Physical Port • Managed by MR-PCIM • Optional (same as PCIe)

  45. Power Management • Each PF has a power state • if all PFs in low power state, link can go to low power state (like PCIe multi-function rules) • PM_PME Messages sometimes trigger WAKE# • e.g., some Root powered itself off but a shared device below it was only virtually powered off (Device still being used by other VHs)

  46. © 2007 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
