[Mesa-dev] [RFC PATCH] mesa: Export BOs in RW mode

Daniel Vetter daniel at ffwll.ch
Wed Jul 3 16:18:11 UTC 2019


On Wed, Jul 3, 2019 at 6:11 PM Steven Price <steven.price at arm.com> wrote:
>
> On 03/07/2019 15:33, Boris Brezillon wrote:
> > On Wed, 3 Jul 2019 15:13:25 +0100
> > Steven Price <steven.price at arm.com> wrote:
> >
> >> On 03/07/2019 14:56, Boris Brezillon wrote:
> >>> On Wed, 3 Jul 2019 07:45:32 -0600
> >>> Rob Herring <robh+dt at kernel.org> wrote:
> >>>
> >>>> On Wed, Jul 3, 2019 at 7:34 AM Boris Brezillon
> >>>> <boris.brezillon at collabora.com> wrote:
> >>>>>
> >>>>> Exported BOs might be imported back, then mmap()-ed to be written
> >>>>> to. Most drivers handle that by mmap()-ing the GEM handle after it's
> >>>>> been imported, but, according to [1], this is illegal.
> >>>>
> >>>> It's not illegal, but it's supposed to go through the dmabuf mmap
> >>>> functions.
> >>>
> >>> That's basically what I'm proposing here, just didn't post the patch
> >>> skipping the GET_OFFSET step and doing the mmap() on the dmabuf FD
> >>> instead of the DRM-node one, but I have it working for panfrost.
> >>
> >> If we want to we could make the Panfrost kernel driver internally call
> >> dma_buf_mmap() so that mapping using the DRM-node "just works". This is
> >> indeed what the kbase driver does.
> >
> > Well, userspace should at least skip DRM_IOCTL_PANFROST_MMAP_BO (or
> > ignore its return code), so calling mmap() on the dmabuf FD instead of
> > the DRM-node FD shouldn't be that hard.
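
(Rough userspace sketch of that idea, purely for illustration; the helper
name, and the assumption that the import path already knows the dma-buf fd
and buffer size, are mine rather than actual panfrost code:)

    #include <stddef.h>
    #include <sys/mman.h>

    /* Sketch: map an imported buffer through its dma-buf fd directly,
     * skipping DRM_IOCTL_PANFROST_MMAP_BO + mmap() on the DRM-node fd. */
    static void *map_imported_bo(int dmabuf_fd, size_t size)
    {
            void *map = mmap(NULL, size, PROT_READ | PROT_WRITE,
                             MAP_SHARED, dmabuf_fd, 0);

            return map == MAP_FAILED ? NULL : map;
    }
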
>
> What I was suggesting is that user space would still call
> DRM_IOCTL_PANFROST_MMAP_BO to get an offset which it then uses in a call to
> mmap(..., drm_node_fd, offset). The kernel could detect that the buffer
> is imported and call the exporter for the actual mmap() functionality.
>
> The alternative is that user space 'simply' remembers that a buffer is
> imported and keeps the file descriptor around so that it can instead
> directly mmap() the dma_buf fd. Which is certainly easiest from the
> kernel's perspective (and was what I assumed panfrost was doing - I
> should have checked more closely!).
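
(For the first option, a hedged kernel-side sketch of "detect that the
buffer is imported and call the exporter", loosely modelled on what the
GEM shmem helpers do; the function name and surrounding driver glue are
assumptions, not the actual panfrost patch:)

    #include <linux/dma-buf.h>
    #include <drm/drm_gem.h>

    /* Sketch: in the driver's GEM mmap path, forward imported objects to
     * the exporter so that mmap() on the DRM node "just works". */
    static int example_gem_mmap(struct drm_gem_object *obj,
                                struct vm_area_struct *vma)
    {
            if (obj->import_attach) {
                    /* Let the exporter install its own vm_ops and decide
                     * on caching attributes for the mapping. */
                    return dma_buf_mmap(obj->import_attach->dmabuf, vma, 0);
            }

            /* ... native (driver-allocated) mapping path ... */
            return -EINVAL;
    }
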
>
> >>>> However, none of the drivers I've looked at (etnaviv, msm,
> >>>> v3d, vgem) do that. It probably works because it's the same driver
> >>>> doing the import and export, or because both drivers have essentially
> >>>> the same implementations.
> >>>
> >>> Yes, but maybe that's something we should start fixing if mmap()-ing
> >>> the dmabuf is the recommended solution.
> >>
> >> I'm open to options here. User space calling mmap() on the dma_buf file
> >> descriptor should always be safe (the exporter can do whatever is
> >> necessary to make it work). If that happens then the patches I posted
> >> close off the DRM node version which could be broken if the exporter
> >> needs to do anything to prepare the buffer for CPU access (i.e. cache
> >> maintenance).
> >
> > Talking about CPU <-> GPU syncs, I was wondering if the
> > mmap(gem_handle) step was providing any guarantee that would
> > allow us to ignore all the cache maintenance operations that are
> > required when mmap()-ing a dmabuf directly. Note that in both cases the
> > dmabuf is imported.
>
> In theory the exporter should do whatever is required to ensure that the
> CPU is synchronised when a user space mapping exists. There are some
> issues here though:
>
> * In theory the kernel driver should map the dma_buf purely for the
> duration that a job is using the buffer (and unmap immediately after).
> This gives the exporter the knowledge of when the GPU is using the
> memory and allows the exporter to page out the memory if necessary.
> In practice this map/unmap operation is expensive (updating the GPU's
> page tables) so most drivers don't actually bother and keep the memory
> mapped. This means the exporter cannot tell when the buffer is used or
> move the pages.

Yeah, refaulting is too slow if you care in the least about performance.
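
(For reference, the "in theory" per-job flow described above, sketched
with the generic dma-buf attachment API; the function name and the job
submission details are made up for illustration:)

    #include <linux/dma-buf.h>
    #include <linux/err.h>

    /* Sketch: map the dma-buf only while a job needs it and unmap right
     * after, so the exporter always knows whether the GPU is using the
     * memory. Real drivers tend to keep the mapping around instead,
     * because redoing the GPU page tables per job is expensive. */
    static int run_job_with_dmabuf(struct dma_buf *buf, struct device *gpu)
    {
            struct dma_buf_attachment *att;
            struct sg_table *sgt;

            att = dma_buf_attach(buf, gpu);
            if (IS_ERR(att))
                    return PTR_ERR(att);

            sgt = dma_buf_map_attachment(att, DMA_BIDIRECTIONAL);
            if (IS_ERR(sgt)) {
                    dma_buf_detach(buf, att);
                    return PTR_ERR(sgt);
            }

            /* ... program GPU page tables from sgt, submit job, wait ... */

            dma_buf_unmap_attachment(att, sgt, DMA_BIDIRECTIONAL);
            dma_buf_detach(buf, att);
            return 0;
    }
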

> * The CPU mappings can be faulted on demand (performing the necessary
> CPU cache invalidate if needed) and shot down to allow moving the
> memory. In theory when the GPU needs the memory it should map the buffer
> and the exporter can then shoot down the mappings, perform the CPU cache
> clean and then allow the GPU to use the memory. A subsequent CPU access
> would then refault the page, ensuring a CPU cache invalidate so the
> latest data is visible.

We thought that was the answer, until it was clear it's not. dma-buf
mmap isn't coherent; you need to call the begin/end ioctls.
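
(Concretely, those are the DMA_BUF_IOCTL_SYNC begin/end ioctls from
<linux/dma-buf.h>; a minimal userspace sketch of a CPU write through a
dma-buf mapping, with the helper name being illustrative only:)

    #include <linux/dma-buf.h>
    #include <string.h>
    #include <sys/ioctl.h>

    /* Sketch: bracket CPU access to a dma-buf mmap() with the sync
     * ioctls so the exporter can do any required cache maintenance. */
    static int cpu_write(int dmabuf_fd, void *map, const void *data, size_t len)
    {
            struct dma_buf_sync sync = {
                    .flags = DMA_BUF_SYNC_START | DMA_BUF_SYNC_WRITE,
            };

            if (ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync))
                    return -1;

            memcpy(map, data, len);

            sync.flags = DMA_BUF_SYNC_END | DMA_BUF_SYNC_WRITE;
            return ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync) ? -1 : 0;
    }
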

> * The majority of exporters are simple and deal with uncached memory
> (e.g. frame buffers) or are actually exporting back to the same driver
> (e.g. window surfaces). In these situations either the driver already
> has the necessary "magic" to deal with caches (e.g. kbase provides
> explicit cache maintenance operations), or it's "uncached" anyway so it
> doesn't matter. This means that hardly anyone tests with the complex
> cases...
>
> From a user space ABI perspective, my understanding is that a dma_buf mmap()
> mapping should be coherent, and user space isn't expected to do anything to
> make it work. Obviously any importing device might have its own coherency
> details which will be up to the ABI of that device (e.g. Mali has caches
> which may need to be flushed - this is usually done at the start/end of
> a job chain, so coherency is not guaranteed while the job chain is running).

See my other reply, but this isn't correct. dma-buf has explicit cache
maintenance ops. It's just that generally everyone (ok, display-only
drivers using the cma helpers) ends up exporting coherent memory and
that's why this works. Doesn't make it a bright idea imo ...
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

