[RFC PATCH v3 1/4] RDMA/umem: Support importing dma-buf as user memory region

Xiong, Jianxin jianxin.xiong at intel.com
Tue Oct 6 15:26:21 UTC 2020


> -----Original Message-----
> From: Daniel Vetter <daniel at ffwll.ch>
> Sent: Tuesday, October 06, 2020 2:22 AM
> To: Xiong, Jianxin <jianxin.xiong at intel.com>
> Cc: Jason Gunthorpe <jgg at ziepe.ca>; Leon Romanovsky <leon at kernel.org>; linux-rdma at vger.kernel.org; dri-devel at lists.freedesktop.org;
> Doug Ledford <dledford at redhat.com>; Vetter, Daniel <daniel.vetter at intel.com>; Christian Koenig <christian.koenig at amd.com>
> Subject: Re: [RFC PATCH v3 1/4] RDMA/umem: Support importing dma-buf as user memory region
> 
> On Mon, Oct 05, 2020 at 04:18:11PM +0000, Xiong, Jianxin wrote:
> > > -----Original Message-----
> > > From: Jason Gunthorpe <jgg at ziepe.ca>
> > > Sent: Monday, October 05, 2020 6:13 AM
> > > To: Xiong, Jianxin <jianxin.xiong at intel.com>
> > > Cc: linux-rdma at vger.kernel.org; dri-devel at lists.freedesktop.org;
> > > Doug Ledford <dledford at redhat.com>; Leon Romanovsky
> > > <leon at kernel.org>; Sumit Semwal <sumit.semwal at linaro.org>; Christian
> > > Koenig <christian.koenig at amd.com>; Vetter, Daniel
> > > <daniel.vetter at intel.com>
> > > Subject: Re: [RFC PATCH v3 1/4] RDMA/umem: Support importing dma-buf
> > > as user memory region
> > >
> > > On Sun, Oct 04, 2020 at 12:12:28PM -0700, Jianxin Xiong wrote:
> > > > Dma-buf is a standard cross-driver buffer sharing mechanism that
> > > > can be used to support peer-to-peer access from RDMA devices.
> > > >
> > > > Device memory exported via dma-buf is associated with a file descriptor.
> > > > This is passed to the user space as a property associated with the
> > > > buffer allocation. When the buffer is registered as a memory
> > > > region, the file descriptor is passed to the RDMA driver along
> > > > with other parameters.
> > > >
> > > > Implement the common code for importing dma-buf object and mapping
> > > > dma-buf pages.
> > > >
> > > > Signed-off-by: Jianxin Xiong <jianxin.xiong at intel.com>
> > > > Reviewed-by: Sean Hefty <sean.hefty at intel.com>
> > > > Acked-by: Michael J. Ruhl <michael.j.ruhl at intel.com>
> > > > ---
> > > >  drivers/infiniband/core/Makefile      |   2 +-
> > > >  drivers/infiniband/core/umem.c        |   4 +
> > > >  drivers/infiniband/core/umem_dmabuf.c | 291 ++++++++++++++++++++++++++++++++++
> > > >  drivers/infiniband/core/umem_dmabuf.h |  14 ++
> > > >  drivers/infiniband/core/umem_odp.c    |  12 ++
> > > >  include/rdma/ib_umem.h                |  19 ++-
> > > >  6 files changed, 340 insertions(+), 2 deletions(-)
> > > >  create mode 100644 drivers/infiniband/core/umem_dmabuf.c
> > > >  create mode 100644 drivers/infiniband/core/umem_dmabuf.h
> > >
> > > I think this is using ODP too literally; dmabuf isn't going to need
> > > fine-grained page faults, and I'm not sure this locking scheme is OK - ODP is horrifically complicated.
> > >
> >
> > > If this is the approach, then I think we should make dmabuf its own
> > > standalone API, reg_user_mr_dmabuf().
> >
> > That was the original approach in the first version. We can go back to it.
> >
> > >
> > > The implementation in mlx5 will be much more understandable; it
> > > would just do dma_buf_dynamic_attach() and program the XLT exactly the same as a normal umem.
> > >
> > > The move_notify() simply zaps the XLT and triggers a work item to reload
> > > it after the move. Locking is provided by the dma_resv_lock. Only a small disruption to the page fault handler is needed.
> > >
> >
> > We considered such a scheme but didn't go that way due to the lack of
> > notification when the move is done, so the work wouldn't know
> > when it could reload.
> >
> > Now that I think about it again, we could probably signal the reload in the page fault handler.
> 
> For reinstating the pages you need:
> 
> - dma_resv_lock, this prevents anyone else from issuing new moves or
>   anything like that
> - dma_resv_get_excl + dma_fence_wait to wait for any pending moves to
>   finish. gpus generally don't wait on the cpu, but block the dependent
>   dma operations from being scheduled until that fence has fired. But for rdma
>   odp I think you need the cpu wait in your worker here.
> - get the new sg list, write it into your ptes
> - dma_resv_unlock to make sure you're not racing with a concurrent
>   move_notify
> 
> You can also grab multiple dma_resv_locks atomically, but I think the odp rdma model doesn't require that (gpus need that).
> 
> Note that you're allowed to allocate memory with GFP_KERNEL while holding dma_resv_lock, so this shouldn't pose any issues. You are
> otoh not allowed to cause userspace faults (so no gup/pup or copy*user with faulting enabled). So all in all this shouldn't be any worse than
> calling pup for a normal umem.
> 
> Unlike mmu notifier the caller holds dma_resv_lock already for you around the move_notify callback, so you shouldn't need any additional
> locking in there (aside from what you need to zap the ptes and flush hw tlbs).
> 
> Cheers, Daniel
> 

Hi Daniel, thanks for providing the details. I would have missed the dma_resv_get_excl + dma_fence_wait part otherwise. 
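
Roughly what I have in mind for the reload worker, following those steps. This is
just a sketch: the ib_umem_dmabuf fields and the program_hw_xlt() hook below are
placeholder names for illustration, not the actual patch code.

#include <linux/dma-buf.h>
#include <linux/dma-fence.h>
#include <linux/dma-resv.h>

/* struct ib_umem_dmabuf is the importer context from the patch's umem_dmabuf.h */
static int ib_umem_dmabuf_reload(struct ib_umem_dmabuf *umem_dmabuf)
{
        struct dma_buf *dmabuf = umem_dmabuf->attach->dmabuf;
        struct dma_fence *fence;
        struct sg_table *sgt;
        long rc;
        int ret = 0;

        /* 1. Lock out any new moves. */
        dma_resv_lock(dmabuf->resv, NULL);

        /* 2. CPU-wait for any pending move to finish. */
        fence = dma_resv_get_excl(dmabuf->resv);
        if (fence) {
                rc = dma_fence_wait(fence, false);
                if (rc < 0) {
                        ret = rc;
                        goto out_unlock;
                }
        }

        /* 3. Get the new sg list and write it into the device PTEs / XLT. */
        sgt = dma_buf_map_attachment(umem_dmabuf->attach, DMA_BIDIRECTIONAL);
        if (IS_ERR(sgt)) {
                ret = PTR_ERR(sgt);
                goto out_unlock;
        }
        ret = program_hw_xlt(umem_dmabuf, sgt); /* placeholder driver hook */

out_unlock:
        /* 4. Drop the lock only after the HW mapping is in place, so we
         *    can't race with a concurrent move_notify().
         */
        dma_resv_unlock(dmabuf->resv);
        return ret;
}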

> >
> > > > +	dma_resv_lock(umem_dmabuf->attach->dmabuf->resv, NULL);
> > > > +	sgt = dma_buf_map_attachment(umem_dmabuf->attach,
> > > > +				     DMA_BIDIRECTIONAL);
> > > > +	dma_resv_unlock(umem_dmabuf->attach->dmabuf->resv);
> > >
> > > This doesn't look right; this lock has to be held until the HW is
> > > programmed.
> >
> > The mapping remains valid until it is invalidated again. There is a sequence number check before programming the HW.
> >
> > >
> > > The use of atomic looks probably wrong as well.
> >
> > Do you mean umem_dmabuf->notifier_seq? Could you elaborate on the concern?
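
To make the "sequence number check" mentioned above concrete, the pattern I have
in mind is roughly the retry loop below. This is only an illustration: the hw_lock
mutex and the zap_hw_xlt()/program_hw_xlt() helpers are made-up names, not code
from the patch.

/* invalidation side, called from move_notify() with dma_resv_lock held */
static void umem_dmabuf_invalidate(struct ib_umem_dmabuf *umem_dmabuf)
{
        mutex_lock(&umem_dmabuf->hw_lock);      /* hypothetical per-MR lock */
        umem_dmabuf->notifier_seq++;
        zap_hw_xlt(umem_dmabuf);                /* zap PTEs, flush HW TLBs */
        mutex_unlock(&umem_dmabuf->hw_lock);
}

/* mapping side: retry if an invalidation slipped in before the HW was programmed */
static int umem_dmabuf_program(struct ib_umem_dmabuf *umem_dmabuf)
{
        unsigned long seq;

        do {
                seq = READ_ONCE(umem_dmabuf->notifier_seq);
                /* ... build the translation from the current dma_list ... */

                mutex_lock(&umem_dmabuf->hw_lock);
                if (seq != READ_ONCE(umem_dmabuf->notifier_seq)) {
                        mutex_unlock(&umem_dmabuf->hw_lock);
                        continue;       /* invalidated underneath us, retry */
                }
                program_hw_xlt(umem_dmabuf);
                mutex_unlock(&umem_dmabuf->hw_lock);
                return 0;
        } while (1);
}
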
> >
> > >
> > > > +	k = 0;
> > > > +	total_pages = ib_umem_odp_num_pages(umem_odp);
> > > > +	for_each_sg(umem->sg_head.sgl, sg, umem->sg_head.nents, j) {
> > > > +		addr = sg_dma_address(sg);
> > > > +		pages = sg_dma_len(sg) >> page_shift;
> > > > +		while (pages > 0 && k < total_pages) {
> > > > +			umem_odp->dma_list[k++] = addr | access_mask;
> > > > +			umem_odp->npages++;
> > > > +			addr += page_size;
> > > > +			pages--;
> > >
> > > This isn't fragmenting the sg into a page list properly; it won't work
> > > for unaligned things.
> >
> > I thought the addresses were aligned, but I will add explicit alignment handling here.
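
For the explicit alignment, something along these lines is what I'm considering:
a sketch on top of the loop quoted above, where rejecting unaligned entries with
-EINVAL is only one possible policy.

        for_each_sg(umem->sg_head.sgl, sg, umem->sg_head.nents, j) {
                addr = sg_dma_address(sg);
                /* explicit check: both the DMA address and the length must be
                 * aligned to the page size used for the dma_list
                 */
                if (!IS_ALIGNED(addr, page_size) ||
                    !IS_ALIGNED(sg_dma_len(sg), page_size))
                        return -EINVAL;

                pages = sg_dma_len(sg) >> page_shift;
                while (pages > 0 && k < total_pages) {
                        umem_odp->dma_list[k++] = addr | access_mask;
                        umem_odp->npages++;
                        addr += page_size;
                        pages--;
                }
        }
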
> >
> > >
> > > And really we don't need the dma_list for this case; with a fixed
> > > whole-mapping DMA SGL, a normal umem sgl is OK and the normal umem XLT programming in mlx5 is fine.
> >
> > The dma_list is used by both populate_mtt() and mlx5_ib_invalidate_range(), which are used for XLT
> > programming and invalidating (zapping), respectively.
> >
> > >
> > > Jason
> 
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

