about mmap dma-buf and sync
thellstrom at vmware.com
Thu Aug 20 22:25:07 PDT 2015
On 08/20/2015 10:34 PM, Jerome Glisse wrote:
> On Thu, Aug 20, 2015 at 09:39:12PM +0200, Thomas Hellstrom wrote:
>> On 08/20/2015 04:53 PM, Jerome Glisse wrote:
>>> On Thu, Aug 20, 2015 at 08:48:23AM +0200, Thomas Hellstrom wrote:
>>>> Hi, Tiago!
>>>> On 08/20/2015 12:33 AM, Tiago Vignatti wrote:
>>>>> Hey Thomas, you haven't answered my email about making SYNC_* mandatory:
>>>> Hmm, for some reason it doesn't show up in my mail app, but I found it
>>>> in the archives. An attempt to explain the situation from the vmwgfx
>>>> The fact that the interface is generic means that people will start
>>>> using it for the zero-copy case. There has been a couple of more or less
>>>> hackish attempts to do this before, and if it's a _driver_ interface we
>>>> don't need to be that careful but if it is a _generic_ interface we need
>>>> to be very careful to make it fit *all* the hardware out there and that
>>>> we make all potential users use the interface in a way that conforms
>>>> with the interface specification.
>>>> What will happen otherwise is that apps written for coherent fast
>>>> hardware might, for example, ignore calling the SYNC api, just because
>>>> the app writer only cared about his own hardware on which the app works
>>>> fine. That would fail miserably if the same app was run on incoherent
>>>> hardware, or the incoherent hardware driver maintainers would be forced
>>>> to base an implementation on page-faults which would be very slow.
>>>> So assume the following use case: An app updates a 10x10 area using the
>>>> CPU on a 1600x1200 dma-buf, and it will then use the dma-buf for
>>>> texturing. On some hardware the dma-buf might be tiled in a very
>>>> specific way, on vmwgfx the dma-buf is a GPU buffer on the host, only
>>>> accessible using DMA. On vmwgfx the SYNC operation must carry out a
>>>> 10x10 DMA from the host GPU buffer to a guest CPU buffer before the CPU
>>>> write and a DMA back again after the write, before GPU usage. On the
>>>> tiled architecture the SYNC operation must untile before CPU access and
>>>> probably tile again before GPU access.
>>>> If we now have a one-dimensional SYNC api, in this particular case we'd
>>>> either need to sync a far too large area (1600x10) or call SYNC 10 times
>>>> before writing, and then again after writing. If the app forgot to call
>>>> SYNC we must error.
>>>> So to summarize, the fact that the interface is generic IMO means:
>>>> 1) Any user must be able to make valid assumptions about the internal
>>>> format of the dma-buf. (untiled, color format, stride etc.)
>>>> 2) Any user *must* call SYNC before and after CPU access. On coherent
>>>> architectures, the SYNC is a NULL operation anyway, and that should be
>>>> documented somewhere so that maintainers of drivers of uncoherent
>>>> architectures have somewhere to point their fingers.
>>>> 3) Driver-specific implementations must be allowed to error (segfault)
>>>> if SYNC has not been used.
>>> I think here you are too lax, the common code must segfault or
>>> error badly if SYNC has not been use in all cases even on cache
>>> coherent arch. The device driver sync callback can still be a
>>> no operation. But i think that we need to insist strongly on a
>>> proper sync call being made (and we should forbid empty area
>>> sync call). This would be the only way to make sure userspace
>>> behave properly as otherwise we endup in the situation you were
>>> describing above, where the app is design on a cache coherent
>>> arch and works fine there but broke in subtle way on non cache
>>> coherent arch and app developer is clueless of why.
>>> I just do not trust userspace.
>> I agree and ideally i'd want it this way as well. The question is, is it
>> possible? Can this be done without a fault() handler in the generic
>> kernel code?
>>>> 4) The use-case stated above clearly shows the benefit of a
>>>> 2-dimensional sync interface (we want to sync the 10x10 region), but
>>>> what if someone in the future wants to use this interface for a 3D
>>>> texture? Will a 2D sync suffice? Can we make the SYNC interface
>>>> extendable in a way that an enum sync_type member defines the layout of
>>>> the argument, and initially we implement only 1d, 2d sync, leaving 3d
>>>> for the future?
>>>> Also, I agree there is probably no good way to generically implement an
>>>> error if SYNC has not been called. That needs to be left as an option to
>>> I think there is, just forbid any further use of the dma buffer, mark
>>> it as invalid and printk a big error. Userspace app developer will
>>> quickly see that something is wrong and looking at kernel log should
>>> explain why.
>> The problem is if someone calls mmap() and then decides to not access
>> the buffer before munmap() or GPU usage. How do we recognize that case
>> and separate it from a CPU access occured outside a sync? We could, as
>> mentioned above have a fault() handler in the generic kernel code and
>> make sure drivers don't populate PTEs until the first fault(). Another
>> option would be to scan all PTEs for an "accessed" flag, but I'm not
>> even sure all CPU architectures have such a flag?
>> But there might be a simpler solution that I have overlooked?
> All arch have a dirty flag, so you could check that, as all we
> really care about is CPU write access. So all you need is clearing
> the dirty bit after each successfull GPU command stream ioctl
> and checking that no dirty bit is set in a region not cover by
> a flush(). Note that the clear and check can happen in the same
> function as part of buffer validation for the GPU command stream.
Actually, I do think we care about reading as well, since reading
without flushing anything written by the GPU will
yield garbage on a non-coherent architecture. Instead we might check for
the PAGE_ACCESSED bit.
So the full version of this would keep track of all synced regions and
at buffer validation time, or unmap time error if there
is a page completely outside the union of all synced regions in any VMA
belonging to the dma-buf address space, then unmark all accessed PTEs
and invalidate TLBs accordingly, typically causing a global TLB flush.
Yeah, we could probably put together a helper function that does this,
but it will be expensive to run and the whole point of this API is to be
able to improve performance.
IMO we should probably do this and after relevant performance testing
decide whether to either expose a way to turn it off, either using a
compile-time option or run-time, like a flag in the SYNC ioctl.
More information about the dri-devel