[PATCH v16 0/4] RDMA: Add dma-buf support
Xiong, Jianxin
jianxin.xiong at intel.com
Mon Jan 11 15:24:18 UTC 2021
Jason, will this series be able to get into 5.12?
> -----Original Message-----
> From: Xiong, Jianxin <jianxin.xiong at intel.com>
> Sent: Tuesday, December 15, 2020 1:27 PM
> To: linux-rdma at vger.kernel.org; dri-devel at lists.freedesktop.org
> Cc: Xiong, Jianxin <jianxin.xiong at intel.com>; Doug Ledford <dledford at redhat.com>; Jason Gunthorpe <jgg at ziepe.ca>; Leon Romanovsky
> <leon at kernel.org>; Sumit Semwal <sumit.semwal at linaro.org>; Christian Koenig <christian.koenig at amd.com>; Vetter, Daniel
> <daniel.vetter at intel.com>
> Subject: [PATCH v16 0/4] RDMA: Add dma-buf support
>
> This is the sixteenth version of the patch set. Changelog:
>
> v16:
> * Add "select DMA_SHARED_BUFFER" to Kconfig when IB UMEM is enabled.
> This fixes the auto build test error with a random config.
>
> v15: https://www.spinics.net/lists/linux-rdma/msg98369.html
> * Rebase to the latest linux-rdma 'for-next' branch (commit 0583531bb9ef)
> to pick up RDMA core and mlx5 updates
> * Let ib_umem_dmabuf_get() return 'struct ib_umem_dmabuf *' instead of
> 'struct ib_umem *'
> * Move the check of on demand paging support to mlx5_ib_reg_user_mr_dmabuf()
> * Check iova alignment at the entry point of the uverb command so that
> mlx5_umem_dmabuf_default_pgsz() can always succeed
>
> v14: https://www.spinics.net/lists/linux-rdma/msg98265.html
> * Check return value of dma_fence_wait()
> * Fix a dma-buf leak in ib_umem_dmabuf_get()
> * Fix return value type cast for ib_umem_dmabuf_get()
> * Return -EOPNOTSUPP instead of -EINVAL for unimplemented functions
> * Remove an unnecessary use of unlikely()
> * Remove a left-over commit message resulting from rebase
>
> v13: https://www.spinics.net/lists/linux-rdma/msg98227.html
> * Rebase to the latest linux-rdma 'for-next' branch (5.10.0-rc6+)
> * Check for device on-demand paging capability at the entry point of
> the new verbs command to avoid calling the device's reg_user_mr_dmabuf()
> method when CONFIG_INFINIBAND_ON_DEMAND_PAGING is disabled.
>
> v12: https://www.spinics.net/lists/linux-rdma/msg97943.html
> * Move the prototype of function ib_umem_dmabuf_release() to ib_umem.h
> and remove umem_dmabuf.h
> * Break a line that is too long
>
> v11: https://www.spinics.net/lists/linux-rdma/msg97860.html
> * Rework the parameter checking code inside ib_umem_dmabuf_get()
> * Fix incorrect error handling in the new verbs command handler
> * Put a duplicated code sequence for checking iova and setting page size
> into a function
> * In the invalidation callback, check whether the buffer has been mapped,
> which ensures the presence of a valid driver mr
> * The patch that checks for dma_virt_ops is dropped because it is no
> longer needed
> * The patch that documents that dma-buf size is fixed has landed at:
> https://cgit.freedesktop.org/drm/drm-misc/commit/?id=476b485be03c
> and thus is no longer included here
> * The matching user space patch set is sent separately
>
> v10: https://www.spinics.net/lists/linux-rdma/msg97483.html
> * Don't map the pages in ib_umem_dmabuf_get(); use the size information
> of the dma-buf object to validate the umem size instead
> * Use PAGE_SIZE directly instead of using ib_umem_find_best_pgsz() when
> the MR is created since the pages have not been mapped yet and dma-buf
> requires PAGE_SIZE anyway
> * Always call mlx5_umem_find_best_pgsz() after mapping the pages to
> verify that the page size requirement is satisfied
> * Add a patch to document that dma-buf size is fixed
>
> v9: https://www.spinics.net/lists/linux-rdma/msg97432.html
> * Clean up the code for sg list in-place modification
> * Prevent dma-buf pages from being mapped multiple times
> * Map the pages in ib_umem_dmabuf_get() so that improper values of
> address/length/iova can be caught early
> * Check for unsupported flags in the new uverbs command
> * Add missing uverbs_finalize_uobj_create()
> * Sort uverbs objects by name
> * Fix formatting issue -- unnecessary alignment of '='
> * Unmap pages in mlx5_ib_fence_dmabuf_mr()
> * Remove address range checking from pagefault_dmabuf_mr()
>
> v8: https://www.spinics.net/lists/linux-rdma/msg97370.html
> * Modify the dma-buf sg list in place to get a proper umem sg list and
> restore it before calling dma_buf_unmap_attachment()
> * Validate the umem sg list with ib_umem_find_best_pgsz()
> * Remove the logic for slicing the sg list at runtime
>
> v7: https://www.spinics.net/lists/linux-rdma/msg97297.html
> * Rebase on top of latest mlx5 MR patch series
> * Slice dma-buf sg list at runtime instead of creating a new list
> * Preload the buffer page mapping when the MR is created
> * Move the 'dma_virt_ops' check into dma_buf_dynamic_attach()
>
> v6: https://www.spinics.net/lists/linux-rdma/msg96923.html
> * Move the dma-buf invalidation callback from the core to the device
> driver
> * Move mapping update from work queue to pagefault handler
> * Add dma-buf based MRs to the xarray of mmkeys so that the pagefault
> handler can be reached
> * Update the new driver method and uverbs command signature by changing
> the parameter 'addr' to 'offset'
> * Modify the sg list returned from dma_buf_map_attachment() based on
> the parameters 'offset' and 'length'
> * Don't import dma-buf if 'dma_virt_ops' is used by the dma device
> * The patch that clarifies dma-buf sg lists alignment has landed at
> https://cgit.freedesktop.org/drm/drm-misc/commit/?id=ac80cd17a615
> and thus is no longer included with this set
>
> v5: https://www.spinics.net/lists/linux-rdma/msg96786.html
> * Fix a few warnings reported by kernel test robot:
> - no previous prototype for function 'ib_umem_dmabuf_release'
> - no previous prototype for function 'ib_umem_dmabuf_map_pages'
> - comparison of distinct pointer types in 'check_add_overflow'
> * Add a comment for the wait between getting the dma-buf sg table and
> updating the NIC page table
>
> v4: https://www.spinics.net/lists/linux-rdma/msg96767.html
> * Add a new ib_device method reg_user_mr_dmabuf() instead of expanding
> the existing method reg_user_mr()
> * Use a separate code flow for dma-buf instead of adding special cases
> to the ODP memory region code path
> * In the invalidation callback, the new mapping is updated as a whole
> using a work queue instead of being updated at page granularity in the
> page fault handler
> * Use dma_resv_get_excl() and dma_fence_wait() to ensure the content of
> the pages has been moved to the new location before the new mapping is
> programmed into the NIC (see the sketch after this list)
> * Add code to the ODP page fault handler to check the mapping status
> * The new access flag added in v3 is removed.
> * The checking for on-demand paging support in the new uverbs command
> is removed because it is implied by implementing the new ib_device
> method
> * Clarify that dma-buf sg lists are page aligned
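>
> For illustration, the fence wait mentioned above boils down to something
> like the following sketch (not the exact driver code; dma_resv_get_excl()
> is the name of the helper in this era, and the caller is assumed to hold
> the dma-buf reservation lock):
>
>     static int wait_for_buffer_move(struct dma_buf *dmabuf)
>     {
>             struct dma_fence *fence;
>
>             /* The exclusive fence signals when the exporter has
>              * finished moving the buffer contents. */
>             fence = dma_resv_get_excl(dmabuf->resv);
>             if (!fence)
>                     return 0;
>             /* Wait non-interruptibly before programming the NIC. */
>             return dma_fence_wait(fence, false);
>     }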
>
> v3: https://www.spinics.net/lists/linux-rdma/msg96330.html
> * Use dma_buf_dynamic_attach() instead of dma_buf_attach()
> * Use on-demand paging mechanism to avoid pinning the GPU memory
> * Instead of adding a new parameter to the device method for memory
> registration, pass all the attributes including the file descriptor
> as a structure
> * Define a new access flag for dma-buf based memory region
> * Check for on-demand paging support in the new uverbs command
>
> v2: https://www.spinics.net/lists/linux-rdma/msg93643.html
> * The Kconfig option is removed. There is no dependency issue since
> the dma-buf driver is always enabled.
> * The declaration of new data structure and functions is reorganized to
> minimize the visibility of the changes.
> * The new uverbs command now goes through ioctl() instead of write().
> * The rereg functionality is removed.
> * Instead of adding a new device method for dma-buf specific registration,
> the existing method is extended to accept an extra parameter.
> * The correct function is now used for address range checking.
>
> v1: https://www.spinics.net/lists/linux-rdma/msg90720.html
> * The initial patch set
> * Implement core functions for importing and mapping dma-buf
> * Use dma-buf static attach interface
> * Add two ib_device methods reg_user_mr_fd() and rereg_user_mr_fd()
> * Add two uverbs commands via the write() interface
> * Add Kconfig option
> * Add dma-buf support to mlx5 device
>
> When enabled, an RDMA-capable NIC can perform peer-to-peer transactions over PCIe to access local memory located on another
> device. This can often lead to better performance than using a system memory buffer for RDMA and copying data between the buffer and
> device memory.
>
> The current kernel RDMA stack uses get_user_pages() to pin the physical pages backing the user buffer and dma_map_sg_attrs() to obtain
> the DMA addresses for memory access. This usually doesn't work for peer device memory due to the lack of associated page structures.
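>
> For illustration, that pinning flow is roughly the following sketch
> (error handling and unwind are elided; 'addr', 'npages', and 'dev' are
> assumed to be in scope, and this is not the exact ib_umem_get() code):
>
>     struct page **pages = kvmalloc_array(npages, sizeof(*pages), GFP_KERNEL);
>     struct sg_table sgt;
>
>     /* Pin the pages backing [addr, addr + npages * PAGE_SIZE). This
>      * is the step that fails for most peer device memory, which has
>      * no associated page structures. */
>     get_user_pages_fast(addr, npages, FOLL_WRITE | FOLL_LONGTERM, pages);
>
>     /* Build an sg list and obtain the DMA addresses for the HCA. */
>     sg_alloc_table_from_pages(&sgt, pages, npages, 0,
>                               npages << PAGE_SHIFT, GFP_KERNEL);
>     dma_map_sg_attrs(dev, sgt.sgl, sgt.orig_nents, DMA_BIDIRECTIONAL, 0);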
>
> Several mechanisms exist today to facilitate device memory access.
>
> ZONE_DEVICE is a new zone for device memory in the memory management subsystem. It allows pages from device memory to be
> described with specialized page structures, but what can be done with these page structures may differ from system memory.
> ZONE_DEVICE is further specialized into multiple memory types, such as one type for PCI p2pmem/p2pdma and one type for HMM.
>
> PCI p2pmem/p2pdma uses ZONE_DEVICE to represent device memory residing in a PCI BAR and provides a set of calls to publish, discover,
> allocate, and map such memory for peer-to-peer transactions. One feature of the API is that the buffer is allocated by the side that does the
> DMA transfer. This works well for the storage use case, but is awkward for GPU-NIC communication, where the buffer is typically
> allocated by the GPU driver rather than the NIC driver.
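>
> As a hedged sketch of that allocation model (assuming a 5.12-era kernel;
> 'client_pdev' is an illustrative NIC device, and error handling is
> elided), the side doing the DMA both finds a provider and allocates from
> its BAR:
>
>     /* Find a p2pmem provider that can reach this client device ... */
>     struct pci_dev *provider = pci_p2pmem_find(&client_pdev->dev);
>
>     /* ... and allocate the DMA buffer from the provider's PCI BAR. */
>     void *buf = pci_alloc_p2pmem(provider, SZ_64K);
>
>     /* ... peer-to-peer DMA to/from buf ... */
>
>     pci_free_p2pmem(provider, buf, SZ_64K);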
>
> Heterogeneous Memory Management (HMM) utilizes mmu_interval_notifier and ZONE_DEVICE to support a shared virtual address space and
> page migration between system memory and device memory. HMM doesn't support pinning device memory because pages located on a
> device must be able to migrate to system memory when accessed by the CPU. Peer-to-peer access is currently not supported by HMM.
>
> Dma-buf is a standard mechanism for sharing buffers among different device drivers. The buffer to be shared is exported by the owning
> driver and imported by the driver that wants to use it. The exporter provides a set of ops that the importer can call to pin and map the
> buffer. In addition, a file descriptor can be associated with a dma-buf object as the handle that can be passed to user space.
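>
> A minimal sketch of the importer-side flow (static, i.e. pinning,
> attachment; error handling elided; 'fd' is the dma-buf file descriptor
> received from user space and 'dev' the importing device):
>
>     struct dma_buf *dmabuf = dma_buf_get(fd);
>     struct dma_buf_attachment *attach = dma_buf_attach(dmabuf, dev);
>     struct sg_table *sgt = dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL);
>
>     /* ... program the DMA addresses in sgt into the device ... */
>
>     dma_buf_unmap_attachment(attach, sgt, DMA_BIDIRECTIONAL);
>     dma_buf_detach(dmabuf, attach);
>     dma_buf_put(dmabuf);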
>
> This patch series adds a dma-buf importer role to the RDMA driver in an attempt to support RDMA using device memory such as GPU VRAM.
> Dma-buf is chosen for a few reasons: first, the API is relatively simple and allows a lot of flexibility in implementing the buffer
> manipulation ops. Second, it doesn't require page structures. Third, dma-buf is already supported in many GPU drivers. However, we are
> aware that existing GPU drivers don't allow pinning device memory via the dma-buf interface.
> Pinning would simply cause the backing storage to migrate to system RAM.
> True peer-to-peer access is only possible using dynamic attach, which requires on-demand paging (ODP) support from the NIC to work. For
> this reason, this series only works with ODP-capable NICs.
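>
> A hedged sketch of the dynamic-attach variant this series relies on
> (names and the callback body are illustrative, not the mlx5
> implementation):
>
>     static void my_move_notify(struct dma_buf_attachment *attach)
>     {
>             /* The exporter is about to move the buffer: tear down
>              * the NIC mapping here; the next ODP page fault will
>              * re-map the buffer at its new location. */
>     }
>
>     static const struct dma_buf_attach_ops my_attach_ops = {
>             .allow_peer2peer = true,  /* permit P2P DMA to the exporter */
>             .move_notify     = my_move_notify,
>     };
>
>     static int my_map_dynamic(struct dma_buf *dmabuf, struct device *dev,
>                               void *priv)
>     {
>             struct dma_buf_attachment *attach;
>             struct sg_table *sgt;
>
>             attach = dma_buf_dynamic_attach(dmabuf, dev, &my_attach_ops,
>                                             priv);
>             if (IS_ERR(attach))
>                     return PTR_ERR(attach);
>
>             /* Dynamic importers map/unmap under the reservation lock. */
>             dma_resv_lock(dmabuf->resv, NULL);
>             sgt = dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL);
>             dma_resv_unlock(dmabuf->resv);
>             return PTR_ERR_OR_ZERO(sgt);
>     }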
>
> This series consists of four patches. The first patch adds the common code for importing dma-buf from a file descriptor and mapping the
> dma-buf pages. Patch 2 adds the new driver method reg_user_mr_dmabuf().
> Patch 3 adds a new uverbs command for registering a dma-buf based memory region. Patch 4 adds dma-buf support to the mlx5 driver.
>
> Related user space RDMA library changes are provided as a separate patch series.
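>
> For context, the user-space side boils down to a single new verb; a
> hedged sketch follows (the API in that separate rdma-core series may
> still change):
>
>     /* 'dmabuf_fd' is the fd exported by the GPU driver, 'pd' an open
>      * protection domain, 'length' the size of the region to register. */
>     struct ibv_mr *mr = ibv_reg_dmabuf_mr(pd, 0 /* offset */, length,
>                                           0 /* iova */, dmabuf_fd,
>                                           IBV_ACCESS_LOCAL_WRITE |
>                                           IBV_ACCESS_REMOTE_READ);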
>
> Jianxin Xiong (4):
> RDMA/umem: Support importing dma-buf as user memory region
> RDMA/core: Add device method for registering dma-buf based memory
> region
> RDMA/uverbs: Add uverbs command for dma-buf based MR registration
> RDMA/mlx5: Support dma-buf based userspace memory region
>
> drivers/infiniband/Kconfig | 1 +
> drivers/infiniband/core/Makefile | 2 +-
> drivers/infiniband/core/device.c | 1 +
> drivers/infiniband/core/umem.c | 3 +
> drivers/infiniband/core/umem_dmabuf.c | 174 ++++++++++++++++++++++++++
> drivers/infiniband/core/uverbs_std_types_mr.c | 117 ++++++++++++++++-
> drivers/infiniband/hw/mlx5/main.c | 2 +
> drivers/infiniband/hw/mlx5/mlx5_ib.h | 18 +++
> drivers/infiniband/hw/mlx5/mr.c | 112 ++++++++++++++++-
> drivers/infiniband/hw/mlx5/odp.c | 89 ++++++++++++-
> include/rdma/ib_umem.h | 48 ++++++-
> include/rdma/ib_verbs.h | 6 +-
> include/uapi/rdma/ib_user_ioctl_cmds.h | 14 +++
> 13 files changed, 573 insertions(+), 14 deletions(-)
> create mode 100644 drivers/infiniband/core/umem_dmabuf.c
>
> --
> 1.8.3.1