about mmap dma-buf and sync

Thomas Hellstrom thellstrom at vmware.com
Mon Aug 24 02:59:57 PDT 2015

On 08/24/2015 03:41 AM, Jerome Glisse wrote:
> On Sat, Aug 22, 2015 at 12:00:21AM +0200, Thomas Hellstrom wrote:
>> On 08/21/2015 06:00 PM, Jerome Glisse wrote:
>>> On Fri, Aug 21, 2015 at 04:15:53PM +0200, Thomas Hellstrom wrote:
>>>> On 08/21/2015 03:32 PM, Jerome Glisse wrote:
>>>>> On Fri, Aug 21, 2015 at 07:25:07AM +0200, Thomas Hellstrom wrote:
>>>>>> On 08/20/2015 10:34 PM, Jerome Glisse wrote:
>>>>>>> On Thu, Aug 20, 2015 at 09:39:12PM +0200, Thomas Hellstrom wrote:
>>>>>>>> On 08/20/2015 04:53 PM, Jerome Glisse wrote:
>>>>>>>>> On Thu, Aug 20, 2015 at 08:48:23AM +0200, Thomas Hellstrom wrote:
>>>>>>>>>> Hi, Tiago!
>>>>>>>>>> On 08/20/2015 12:33 AM, Tiago Vignatti wrote:
>>>>>>>>>>> Hey Thomas, you haven't answered my email about making SYNC_* mandatory:
>>>>>>>>>>> http://lists.freedesktop.org/archives/dri-devel/2015-August/088376.html
>>>>>>>>>> Hmm, for some reason it doesn't show up in my mail app, but I found it
>>>>>>>>>> in the archives. An attempt to explain the situation from the vmwgfx
>>>>>>>>>> perspective.
>>>>>>>>>> The fact that the interface is generic means that people will start
>>>>>>>>>> using it for the zero-copy case. There has been a couple of more or less
>>>>>>>>>> hackish attempts to do this before, and if it's a _driver_ interface we
>>>>>>>>>> don't need to be that careful but if it is a _generic_ interface we need
>>>>>>>>>> to be very careful to make it fit *all* the hardware out there and that
>>>>>>>>>> we make all potential users use the interface in a way that conforms
>>>>>>>>>> with the interface specification.
>>>>>>>>>> What will happen otherwise is that apps written for coherent fast
>>>>>>>>>> hardware might, for example, ignore calling the SYNC api, just because
>>>>>>>>>> the app writer only cared about his own hardware on which the app works
>>>>>>>>>> fine. That would fail miserably if the same app was run on incoherent
>>>>>>>>>> hardware, or the incoherent hardware driver maintainers would be forced
>>>>>>>>>> to base an implementation on page-faults which would be very slow.
>>>>>>>>>> So assume the following use case: An app updates a 10x10 area using the
>>>>>>>>>> CPU on a 1600x1200 dma-buf, and it will then use the dma-buf for
>>>>>>>>>> texturing. On some hardware the dma-buf might be tiled in a very
>>>>>>>>>> specific way, on vmwgfx the dma-buf is a GPU buffer on the host, only
>>>>>>>>>> accessible using DMA. On vmwgfx the SYNC operation must carry out a
>>>>>>>>>> 10x10 DMA from the host GPU buffer to a guest CPU buffer before the CPU
>>>>>>>>>> write and a DMA back again after the write, before GPU usage. On the
>>>>>>>>>> tiled architecture the SYNC operation must untile before CPU access and
>>>>>>>>>> probably tile again before GPU access.
>>>>>>>>>> If we now have a one-dimensional SYNC api, in this particular case we'd
>>>>>>>>>> either need to sync a far too large area (1600x10) or call SYNC 10 times
>>>>>>>>>> before writing, and then again after writing. If the app forgot to call
>>>>>>>>>> SYNC we must error.
>>>>>>>>>> So to summarize, the fact that the interface is generic IMO means:
>>>>>>>>>> 1) Any user must be able to make valid assumptions about the internal
>>>>>>>>>> format of the dma-buf. (untiled, color format, stride etc.)
>>>>>>>>>> 2) Any user *must* call SYNC before and after CPU access. On coherent
>>>>>>>>>> architectures, the SYNC is a NULL operation anyway, and that should be
>>>>>>>>>> documented somewhere so that maintainers of drivers for incoherent
>>>>>>>>>> architectures have somewhere to point their fingers.
>>>>>>>>>> 3) Driver-specific implementations must be allowed to error (segfault)
>>>>>>>>>> if SYNC has not been used.
>>>>>>>>> I think here you are too lax; the common code must segfault or
>>>>>>>>> error badly if SYNC has not been used, in all cases, even on
>>>>>>>>> cache-coherent archs. The device driver sync callback can still
>>>>>>>>> be a no-op. But I think we need to insist strongly on a proper
>>>>>>>>> sync call being made (and we should forbid empty-area sync
>>>>>>>>> calls). This is the only way to make sure userspace behaves
>>>>>>>>> properly, as otherwise we end up in the situation you were
>>>>>>>>> describing above, where an app is designed on a cache-coherent
>>>>>>>>> arch and works fine there but breaks in subtle ways on a
>>>>>>>>> non-cache-coherent arch, and the app developer is clueless why.
>>>>>>>>> I just do not trust userspace.
>>>>>>>> I agree, and ideally I'd want it this way as well. The question
>>>>>>>> is, is it possible? Can this be done without a fault() handler in
>>>>>>>> the generic kernel code?
>>>>>>>>>> 4) The use-case stated above clearly shows the benefit of a
>>>>>>>>>> 2-dimensional sync interface (we want to sync the 10x10 region), but
>>>>>>>>>> what if someone in the future wants to use this interface for a 3D
>>>>>>>>>> texture? Will a 2D sync suffice? Can we make the SYNC interface
>>>>>>>>>> extendable in a way that an enum sync_type member defines the layout of
>>>>>>>>>> the argument, and initially we implement only 1d, 2d sync, leaving 3d
>>>>>>>>>> for the future?
>>>>>>>>>> Also, I agree there is probably no good way to generically implement an
>>>>>>>>>> error if SYNC has not been called. That needs to be left as an option to
>>>>>>>>>> drivers.
>>>>>>>>> I think there is: just forbid any further use of the dma-buf,
>>>>>>>>> mark it as invalid, and printk a big error. The userspace app
>>>>>>>>> developer will quickly see that something is wrong, and looking
>>>>>>>>> at the kernel log should explain why.
>>>>>>>> The problem is if someone calls mmap() and then decides not to
>>>>>>>> access the buffer before munmap() or GPU usage. How do we
>>>>>>>> recognize that case and distinguish it from a CPU access that
>>>>>>>> occurred outside a sync? We could, as mentioned above, have a
>>>>>>>> fault() handler in the generic kernel code and make sure drivers
>>>>>>>> don't populate PTEs until the first fault(). Another option would
>>>>>>>> be to scan all PTEs for an "accessed" flag, but I'm not even sure
>>>>>>>> all CPU architectures have such a flag?
>>>>>>>> But there might be a simpler solution that I have overlooked?
>>>>>>> All archs have a dirty flag, so you could check that, as all we
>>>>>>> really care about is CPU write access. So all you need is to clear
>>>>>>> the dirty bit after each successful GPU command stream ioctl and
>>>>>>> check that no dirty bit is set in a region not covered by a
>>>>>>> flush(). Note that the clear and check can happen in the same
>>>>>>> function as part of buffer validation for the GPU command stream.
>>>>>> Actually, I do think we care about reading as well, since reading
>>>>>> without flushing anything written by the GPU will
>>>>>> yield garbage on a non-coherent architecture. Instead we might check for
>>>>>> the PAGE_ACCESSED bit.
>>>>> I do not think read access after GPU use matters. If people think it
>>>>> does, then it is simple: at buffer validation we invalidate the CPU
>>>>> page table mappings, and the prepare-access ioctl repopulates them
>>>>> after checking that the GPU is flushed.
>>>>>> So the full version of this would keep track of all synced regions
>>>>>> and, at buffer validation or unmap time, error if there is a page
>>>>>> completely outside the union of all synced regions in any VMA
>>>>>> belonging to the dma-buf address space, then unmark all accessed
>>>>>> PTEs and invalidate TLBs accordingly, typically causing a global
>>>>>> TLB flush.
>>>>>> Yeah, we could probably put together a helper function that does
>>>>>> this, but it will be expensive to run, and the whole point of this
>>>>>> API is to improve performance.
>>>>> No need for such complexity: the sync ioctl would walk the page
>>>>> table and clear the dirty bit for all pages in the synchronized
>>>>> range. So once you get to the command ioctl you know that all dirty
>>>>> bits must be clear. Note this is only a page table walk in all
>>>>> cases. I do not consider this to be that expensive.
>>>> The expensive part is the TLB flushing after the PTE update, which
>>>> typically needs to be performed on all cores. Although without a
>>>> benchmark I'm not sure how expensive. It might be that it doesn't
>>>> matter. After all, the vmwgfx and, I believe, the qxl driver both use
>>>> a similar approach for fbdev user-space access, but that has never
>>>> been considered performance-critical...
>>> There is no need to flush the TLB! At least not for the write case.
>>> Clearing the dirty bit is cache-coherent. TLBs are about caching the
>>> page table directory tree. To convince yourself, look at
>>> page_mkclean_one() in mm/rmap.c which, by the way, is almost what we
>>> want (remove the mmu notifier part and the write-protection part, and
>>> force a real CPU cache flush if needed, based on a device flag).
>> Hmm? Last time I visited this code, page_mkclean_one() was eventually
>> calling flush_tlb_page() through ptep_clear_flush_notify(). I was under
>> the impression that you always have to flush the TLB when you clear a
>> PTE status bit, but typically not when you set it. Are you saying that
>> that's not the case?
> Well, this comes down to hardware implementation details, and my memory
> is that the AMD/Intel docs say that setting the dirty bit is a locked
> read-modify-write operation, i.e. it is atomic. Now there is a bit of
> unknown about how exactly this happens (does the hardware re-traverse
> the full directory hierarchy? does it keep a pointer to the page table
> entry? ...).
> So basically this means that at validation time, if any CPU wrote to
> any page inside the range, we would see the dirty bit set; we do not
> need to flush anything to see that happening (unless things are mapped
> write-combine).
> Well, there is a race window if a CPU is writing to the buffer while we
> are doing the validation on another CPU. But there is no way around
> that race except to unmap and flush the TLB, which is not what we want.
> Now, there is something I am not sure about: does the CPU keep the TLB
> entry around with the dirty bit set and not redo the locked RMW
> sequence, or does it do the locked RMW sequence every time it writes to
> a cacheline? I would guess the former, if only for perf reasons. In
> that case we would need a TLB flush, but the Linux kernel already
> offers helpers to optimize this away.
> All this kind of makes me think that there is no sane way around this.
> I mean, we are trying to work around potentially broken userspace,
> mostly because most of the GPU driver is in userspace and we can't
> trust it. I think this is just a lost battle, and it would be better to
> just provide the API and say that you must sync; userspace that does
> not will bear the consequences sooner or later.
> Cheers,
> Jérôme

It would still be possible to track down those cases where SYNC is
completely ignored, by traversing the page tables looking for
PAGE_ACCESSED without any SYNC whatsoever. Sure, userspace could find
ways around that, but it would still force people to read up on the
SYNC api.

