[Mesa-dev] [RFC PATCH] mesa: Export BOs in RW mode

Wed Jul 3 15:07:51 UTC 2019

On Wed, 3 Jul 2019 15:50:08 +0100
Steven Price <steven.price at arm.com> wrote:

> On 03/07/2019 15:33, Boris Brezillon wrote:
> > On Wed, 3 Jul 2019 15:13:25 +0100
> > Steven Price <steven.price at arm.com> wrote:
> >   
> >> On 03/07/2019 14:56, Boris Brezillon wrote:  
> >>> On Wed, 3 Jul 2019 07:45:32 -0600
> >>> Rob Herring <robh+dt at kernel.org> wrote:
> >>>     
> >>>> On Wed, Jul 3, 2019 at 7:34 AM Boris Brezillon
> >>>> <boris.brezillon at collabora.com> wrote:    
> >>>>>
> >>>>> Exported BOs might be imported back, then mmap()-ed to be written
> >>>>> too. Most drivers handle that by mmap()-ing the GEM handle after it's
> >>>>> been imported, but, according to [1], this is illegal.      
> >>>>
> >>>> It's not illegal, but is supposed to go thru the dmabuf mmap
> >>>> functions.    
> >>>
> >>> That's basically what I'm proposing here, just didn't post the patch
> >>> skipping the GET_OFFSET step and doing the mmap() on the dmabuf FD
> >>> instead of the DRM-node one, but I have it working for panfrost.    
> >>
> >> If we want to we could make the Panfrost kernel driver internally call
> >> dma_buf_mmap() so that mapping using the DRM-node "just works". This is
> >> indeed what the kbase driver does.  
> > 
> > Well, userspace should at least skip DRM_IOCTL_PANFROST_MMAP_BO (or
> > ignore its return code), so calling mmap() on the dmabuf FD instead of
> > the DRM-node FD shouldn't be that hard.  
> 
> What I was suggesting is that user space would still call
> DRM_IOCTL_PANFROST_MMAP_BO to get an offset which uses in a call to
> mmap(..., drm_node_fd, offset). The kernel could detect that the buffer
> is imported and call the exporter for the actual mmap() functionality.

Oops, sorry, brain fart. I thought it was DRM_IOCTL_PANFROST_MMAP_BO
that was failing, but it's actually the mmap() call, so providing this
wrapper kernel-side should work.

> 
> The alternative is that user space 'simply' remembers that a buffer is
> imported and keeps the file descriptor around so that it can instead
> directly mmap() the dma_buf fd. Which is certainly easiest from the
> kernel's perspective (and was what I assumed panfrost was doing - I
> should have checked more closely!).
> 
> >>>> However, none of the driver I've looked at (etnaviv, msm,
> >>>> v3d, vgem) do that. It probably works because it's the same driver
> >>>> doing the import and export or both drivers have essentially the same
> >>>> implementations.    
> >>>
> >>> Yes, but maybe that's something we should start fixing if mmap()-ing
> >>> the dmabuf is the recommended solution.    
> >>
> >> I'm open to options here. User space calling mmap() on the dma_buf file
> >> descriptor should always be safe (the exporter can do whatever is
> >> necessary to make it work). If that happens then the patches I posted
> >> close off the DRM node version which could be broken if the exporter
> >> needs to do anything to prepare the buffer for CPU access (i.e. cache
> >> maintenance).  
> > 
> > Talking about CPU <-> GPU syncs, I was wondering if the
> > mmap(gem_handle) step was providing any guarantee that would
> > allow us to ignore all the cache maintenance operations that are
> > required when mmap()-ing a dmabuf directly. Note that in both cases the
> > dmabuf is imported.  
> 
> In theory the exporter should do whatever is required to ensure that the
> CPU is synchronised when a user space mapping exists. There are some
> issues here though:
> 
> * In theory the kernel driver should map the dma_buf purely for the
> duration that a job is using the buffer (and unmap immediately after).
> This gives the exporter the knowledge of when the GPU is using the
> memory and allows the exporter to page out of the memory if necessary.
> In practise this map/unmap operation is expensive (updating the GPU's
> page tables) so most drivers don't actually bother and keep the memory
> mapped. This means the exporter cannot tell when the buffer is used or
> move the pages.
> 
> * The CPU mappings can be faulted on demand (performing the necessary
> CPU cache invalidate if needed) and shot-down to allow moving the
> memory. In theory when the GPU needs the memory it should map the buffer
> and the exporter can then shoot down the mappings, perform the CPU cache
> clean and then allow the GPU to use the memory. A subsequent CPU access
> would then refault the page, ensuring a CPU cache invalidate so the
> latest data is visible.
> 
> * The majority of exporters are simple and deal with uncached memory
> (e.g. frame buffers) or are actually exporting back to the same driver
> (e.g. window surfaces). In these situations either the driver already
> has the necessary "magic" to deal with caches (e.g. kbase provides
> explicit cache maintenance operations), or it's "uncached" anyway so it
> doesn't matter. This means that hardly anyone tests with the complex
> cases...
> 
> From a user space ABI, my understanding is that a dma_buf mmap() mapping
> should be coherent, and user space isn't expected to do anything to make
> it work. Obviously any importing device might have it's own coherency
> details which will be up to the ABI of that device (e.g. Mali has caches
> which may need to be flushed - this is usually done at the start/end of
> a job chain, so coherency is not guaranteed while the job chain is running).

Thanks for the detailed explanation.