[Mesa-dev] [RFC PATCH] mesa: Export BOs in RW mode

Steven Price steven.price at arm.com
Thu Jul 4 09:26:36 UTC 2019


On 03/07/2019 17:18, Daniel Vetter wrote:
> On Wed, Jul 3, 2019 at 6:11 PM Steven Price <steven.price at arm.com> wrote:
[...]
>> In theory the exporter should do whatever is required to ensure that the
>> CPU is synchronised when a user space mapping exists. There are some
>> issues here though:
>>
>> * In theory the kernel driver should map the dma_buf purely for the
>> duration that a job is using the buffer (and unmap immediately after).
>> This gives the exporter the knowledge of when the GPU is using the
>> memory and allows the exporter to page out the memory if necessary.
>> In practice this map/unmap operation is expensive (updating the GPU's
>> page tables) so most drivers don't actually bother and keep the memory
>> mapped. This means the exporter cannot tell when the buffer is used or
>> move the pages.
> 
> Yeah refaulting is too slow if you care the least about performance.
> 
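
(For reference, the per-job model described above is roughly the
following, sketched with the generic dma-buf importer API rather than
any particular driver; dev and dmabuf are placeholders and error
handling is omitted:

    struct dma_buf_attachment *attach;
    struct sg_table *sgt;

    attach = dma_buf_attach(dmabuf, dev);       /* once, at import time */

    /* before each job: map, and build GPU page tables from sgt */
    sgt = dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL);

    /* ... job runs on the GPU ... */

    /* after the job: tear the GPU mapping down again */
    dma_buf_unmap_attachment(attach, sgt, DMA_BIDIRECTIONAL);

It's the map/unmap pair around every job that most drivers skip because
of the page table update cost.)
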
>> * The CPU mappings can be faulted on demand (performing the necessary
>> CPU cache invalidate if needed) and shot-down to allow moving the
>> memory. In theory when the GPU needs the memory it should map the buffer
>> and the exporter can then shoot down the mappings, perform the CPU cache
>> clean and then allow the GPU to use the memory. A subsequent CPU access
>> would then refault the page, ensuring a CPU cache invalidate so the
>> latest data is visible.
> 
> We thought that was the answer, until it was clear it's not. dma-buf
> mmap isn't coherent, you need to call the begin/end ioctls.
> 
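
(As an aside, the in-kernel equivalent of those ioctls is the
dma_buf_begin_cpu_access()/dma_buf_end_cpu_access() pair - any CPU
access to the buffer is supposed to be bracketed like this so the
exporter gets a chance to do cache maintenance. Rough sketch, dmabuf is
a placeholder and error handling is minimal:

    if (dma_buf_begin_cpu_access(dmabuf, DMA_FROM_DEVICE))
        return;     /* exporter couldn't make the buffer CPU visible */

    /* ... CPU reads of the buffer contents ... */

    dma_buf_end_cpu_access(dmabuf, DMA_FROM_DEVICE);
)
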
>> * The majority of exporters are simple and deal with uncached memory
>> (e.g. frame buffers) or are actually exporting back to the same driver
>> (e.g. window surfaces). In these situations either the driver already
>> has the necessary "magic" to deal with caches (e.g. kbase provides
>> explicit cache maintenance operations), or it's "uncached" anyway so it
>> doesn't matter. This means that hardly anyone tests with the complex
>> cases...
>>
>> From a user space ABI perspective, my understanding is that a dma_buf mmap()
>> should be coherent, and user space isn't expected to do anything to make
>> it work. Obviously any importing device might have its own coherency
>> details which will be up to the ABI of that device (e.g. Mali has caches
>> which may need to be flushed - this is usually done at the start/end of
>> a job chain, so coherency is not guaranteed while the job chain is running).
> 
> See my other reply, but this isn't correct. dma-buf has explicit cache
> maintenance ops. It's just that generally everyone (ok, display only
> drivers using the cma helpers) ends up exporting coherent memory and
> that's why this works. Doesn't make it a bright idea imo ...

Sorry, I'd completely forgotten about the DMA_BUF_IOCTL_SYNC ioctl when
I wrote this. But to a large extent this ship has already sailed, and
indeed the current users of dma_buf_mmap() implicitly assume that no
sync is necessary (since there's no mechanism to forward the syncs onto
the exporter). The comment in dma-buf.c itself says:

 *   The assumption in the current dma-buf interfaces is that redirecting the
 *   initial mmap is all that's needed. A survey of some of the existing
 *   subsystems shows that no driver seems to do any nefarious thing like
 *   syncing up with outstanding asynchronous processing on the device or
 *   allocating special resources at fault time. So hopefully this is good
 *   enough, since adding interfaces to intercept pagefaults and allow pte
 *   shootdowns would increase the complexity quite a bit.
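
The "redirecting the initial mmap" that comment refers to is the
dma_buf_mmap() helper: the importing driver's own mmap handler just
hands the whole VMA over to the exporter, something like the following
(sketch, not taken from any specific driver; obj and vma are
placeholders):

    /* in the importer's mmap path, for an object backed by a dma_buf */
    if (obj->dma_buf)
        return dma_buf_mmap(obj->dma_buf, vma, 0);

Beyond that point there is no hook for forwarding sync operations.
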

I'd be happy to kill off dma_buf_mmap() and require user space to mmap
via the dma_buf handle (and also therefore to handle the manual sync
ioctls), but I'm not sure how we achieve that while maintaining
backwards compatibility.
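
For completeness, what that would look like from user space today
(sketch using the uapi in <linux/dma-buf.h>; fd is the exported dma-buf
fd, size its length, and error handling is omitted):

    #include <linux/dma-buf.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>

    void *ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    struct dma_buf_sync sync = { .flags = DMA_BUF_SYNC_START | DMA_BUF_SYNC_RW };
    ioctl(fd, DMA_BUF_IOCTL_SYNC, &sync);

    /* ... CPU access through ptr ... */

    sync.flags = DMA_BUF_SYNC_END | DMA_BUF_SYNC_RW;
    ioctl(fd, DMA_BUF_IOCTL_SYNC, &sync);
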

Thanks,

Steve

