[PATCH v7 4/5] RDMA/mlx5: Support dma-buf based userspace memory region

Xiong, Jianxin jianxin.xiong at intel.com
Thu Nov 5 00:36:25 UTC 2020


> -----Original Message-----
> From: Jason Gunthorpe <jgg at ziepe.ca>
> Sent: Wednesday, November 04, 2020 4:07 PM
> To: Xiong, Jianxin <jianxin.xiong at intel.com>
> Cc: linux-rdma at vger.kernel.org; dri-devel at lists.freedesktop.org; Doug Ledford <dledford at redhat.com>; Leon Romanovsky
> <leon at kernel.org>; Sumit Semwal <sumit.semwal at linaro.org>; Christian Koenig <christian.koenig at amd.com>; Vetter, Daniel
> <daniel.vetter at intel.com>
> Subject: Re: [PATCH v7 4/5] RDMA/mlx5: Support dma-buf based userspace memory region
> 
> On Wed, Nov 04, 2020 at 02:06:34PM -0800, Jianxin Xiong wrote:
> > +	umem = ib_umem_dmabuf_get(&dev->ib_dev, offset, length, fd, access_flags,
> > +				  &mlx5_ib_dmabuf_attach_ops);
> > +	if (IS_ERR(umem)) {
> > +		mlx5_ib_dbg(dev, "umem get failed (%ld)\n", PTR_ERR(umem));
> > +		return ERR_PTR(PTR_ERR(umem));
> > +	}
> > +
> > +	mr = alloc_mr_from_cache(pd, umem, virt_addr, access_flags);
> 
> It is very subtle, but this calls mlx5_umem_find_best_pgsz() which calls ib_umem_find_best_pgsz() which goes over the SGL to determine
> the page size to use.
> 

When this is called here, the umem sglist is still NULL because dma_buf_map_attachment()
is not called until a page fault occurs. In patch 1/5, ib_umem_find_best_pgsz()
has been modified to always return PAGE_SIZE for dma-buf based MRs.
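
For reference, the patch 1/5 change is essentially a short-circuit at the top of
ib_umem_find_best_pgsz(), along these lines (sketch only; the exact check used to
detect a dma-buf umem may differ from the posted patch):

	unsigned long ib_umem_find_best_pgsz(struct ib_umem *umem,
					     unsigned long pgsz_bitmap,
					     unsigned long virt)
	{
		/*
		 * A dma-buf umem has no SGL yet -- dma_buf_map_attachment()
		 * runs later, on the first page fault -- so report the
		 * system page size (the flag name below is illustrative).
		 */
		if (umem->is_dmabuf)
			return PAGE_SIZE;

		/* ... existing best-page-size computation over the SGL ... */
	}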

> As part of this it does validation of the IOVA vs first page offset vs first page dma address. These little details come into play if the IOVA and
> offset are not PAGE_SIZE aligned, which is very possible if the dma buf exporter or system PAGE_SIZE is over 4k.
> 
> In other words, the dma_address of the first SGE must be the page aligned starting point of the MR. Since the 'skip' approach is being done
> when breaking the SGL into blocks, ib_umem_find_best_pgsz() sees an invalid page size.
> 
> Slicing it has to be done in a way that gives a properly formed SGL.
> 
> My suggestion is to just change the SGL in place. Iterate to the starting SGE in the SGL and assign it to the sg table, modify it to have an offset
> dma_address and a reduced length.
> 
> Count the number of SGEs that span the remaining range and use that as the new nmap.
> 
> Edit the last SGE to have a reduced length.

Do you still think modifying the SGL in place is needed, given the above explanation? I do see
some benefits of doing so -- hiding the discrepancy between the sgl and addr/length from the device
drivers and avoiding special handling in the code that uses the sgl.
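
If we do go with the in-place edit, undoing it at unmap time looks straightforward,
roughly like this (sketch; the wrapper struct and the first_*/last_* fields follow
your example below, so the names are illustrative):

	static void ib_umem_dmabuf_revert_sgl(struct ib_umem_dmabuf *umem_p)
	{
		/* undo the head trim of the first SGE */
		if (umem_p->first_sg) {
			umem_p->first_sg->dma_address = umem_p->first_dma_address;
			sg_dma_len(umem_p->first_sg) = umem_p->first_dma_length;
			umem_p->first_sg->length = umem_p->first_length;
			umem_p->first_sg = NULL;
		}
		/* undo the tail trim of the last SGE */
		if (umem_p->last_sg) {
			sg_dma_len(umem_p->last_sg) = umem_p->last_dma_length;
			umem_p->last_sg->length = umem_p->last_length;
			umem_p->last_sg = NULL;
		}
	}

so the exporter gets its original SGL back before dma_buf_unmap_attachment() is called.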

> 
> Upon unmap undo the edits so the exporter doesn't see the mangled SGL.
> 
> It would be saner if the exporter could do this, but oh well.
> 
> Approximately like this:
> 
> 	struct ib_umem *umem = &umem_p->umem;
> 	struct scatterlist *sg;
> 	int i;
> 
> 	for_each_sg(umem_p->umem.sg_head.sgl, sg, umem_p->umem.nmap, i) {
> 		if (!umem_p->first_sg &&
> 		    cur + sg_dma_len(sg) > ALIGN_DOWN(umem->address, PAGE_SIZE)) {
> 			unsigned long offset;
> 
> 			umem_p->first_sg = sg;
> 			umem_p->first_dma_address = sg->dma_address;
> 			umem_p->first_dma_length = sg_dma_len(sg);
> 			umem_p->first_length = sg->length;
> 			offset = ALIGN_DOWN(umem->address, PAGE_SIZE) - cur;
> 			sg->dma_address += offset;
> 			sg_dma_len(sg) -= offset;
> 			sg->length -= offset;
> 			cur += offset;
> 		}
> 		if (ALIGN(umem->address + umem->length, PAGE_SIZE) <= cur + sg_dma_len(sg)) {
> 			unsigned long trim;
> 
> 			umem_p->last_sg = sg;
> 			umem_p->last_dma_length = sg_dma_len(sg);
> 			umem_p->last_length = sg->length;
> 			trim =  cur + sg_dma_len(sg) - ALIGN(umem->address + umem->length, PAGE_SIZE);
> 			sg_dma_len(sg) -= trim;
> 			sg->length -= trim;
> 			return npages;
> 		}
> 		cur += sg_dma_len(sg);
> 	}
> 	/*
> 	 * It is essential that the length of the SGL exactly match
> 	 * the adjusted page aligned length of umem->length.
> 	 */
> 	return -EINVAL;
> 
> Further, this really only works if the umem->page_size is locked to 4k because this doesn't have code to resize the MKEY, or change the
> underlying page size when the SGL changes.

Yes, now it's locked to 4K. 

> 
> So, I'd say put something like the above in the core code to validate and properly form the umem->sgl
> 
> Then modify the alloc_mr_from_cache to use only PAGE_SIZE:
> 
> 	if (umem->is_dma_buf)
> 		page_size = ib_umem_find_best_pgsz(umem, PAGE_SIZE, iova);
> 	else
> 		page_size = mlx5_umem_find_best_pgsz(umem, mkc, log_page_size, 0, iova);
> 

This should already be addressed by patch 1/5.
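
Since mlx5_umem_find_best_pgsz() is just a wrapper that ends up in
ib_umem_find_best_pgsz(), the existing call already resolves to PAGE_SIZE for a
dma-buf umem once patch 1/5 is applied, i.e. (illustrative, not a proposed change):

	/* unchanged driver code in alloc_mr_from_cache() */
	page_size = mlx5_umem_find_best_pgsz(umem, mkc, log_page_size, 0, iova);
	/*
	 * -> ib_umem_find_best_pgsz(umem, <mlx5 page-size bitmap>, iova)
	 * -> PAGE_SIZE for a dma-buf umem (patch 1/5)
	 */

so no umem->is_dma_buf branch is needed in the driver.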

Thanks,
Jianxin

> Jason

