about mmap dma-buf and sync

Jerome Glisse j.glisse at gmail.com
Sun Aug 23 18:41:49 PDT 2015


On Sat, Aug 22, 2015 at 12:00:21AM +0200, Thomas Hellstrom wrote:
> On 08/21/2015 06:00 PM, Jerome Glisse wrote:
> > On Fri, Aug 21, 2015 at 04:15:53PM +0200, Thomas Hellstrom wrote:
> >> On 08/21/2015 03:32 PM, Jerome Glisse wrote:
> >>> On Fri, Aug 21, 2015 at 07:25:07AM +0200, Thomas Hellstrom wrote:
> >>>> On 08/20/2015 10:34 PM, Jerome Glisse wrote:
> >>>>> On Thu, Aug 20, 2015 at 09:39:12PM +0200, Thomas Hellstrom wrote:
> >>>>>> On 08/20/2015 04:53 PM, Jerome Glisse wrote:
> >>>>>>> On Thu, Aug 20, 2015 at 08:48:23AM +0200, Thomas Hellstrom wrote:
> >>>>>>>> Hi, Tiago!
> >>>>>>>>
> >>>>>>>> On 08/20/2015 12:33 AM, Tiago Vignatti wrote:
> >>>>>>>>> Hey Thomas, you haven't answered my email about making SYNC_* mandatory:
> >>>>>>>>>
> >>>>>>>>> http://lists.freedesktop.org/archives/dri-devel/2015-August/088376.html
> >>>>>>>> Hmm, for some reason it doesn't show up in my mail app, but I found it
> >>>>>>>> in the archives. An attempt to explain the situation from the vmwgfx
> >>>>>>>> perspective.
> >>>>>>>>
> >>>>>>>> The fact that the interface is generic means that people will start
> >>>>>>>> using it for the zero-copy case. There have been a couple of more or less
> >>>>>>>> hackish attempts to do this before, and if it's a _driver_ interface we
> >>>>>>>> don't need to be that careful but if it is a _generic_ interface we need
> >>>>>>>> to be very careful to make it fit *all* the hardware out there and that
> >>>>>>>> we make all potential users use the interface in a way that conforms
> >>>>>>>> with the interface specification.
> >>>>>>>>
> >>>>>>>> What will happen otherwise is that apps written for coherent fast
> >>>>>>>> hardware might, for example, ignore calling the SYNC api, just because
> >>>>>>>> the app writer only cared about his own hardware on which the app works
> >>>>>>>> fine. That would fail miserably if the same app was run on incoherent
> >>>>>>>> hardware, or the incoherent hardware driver maintainers would be forced
> >>>>>>>> to base an implementation on page-faults which would be very slow.
> >>>>>>>>
> >>>>>>>> So assume the following use case: An app updates a 10x10 area using the
> >>>>>>>> CPU on a 1600x1200 dma-buf, and it will then use the dma-buf for
> >>>>>>>> texturing. On some hardware the dma-buf might be tiled in a very
> >>>>>>>> specific way; on vmwgfx the dma-buf is a GPU buffer on the host, only
> >>>>>>>> accessible using DMA. On vmwgfx the SYNC operation must carry out a
> >>>>>>>> 10x10 DMA from the host GPU buffer to a guest CPU buffer before the CPU
> >>>>>>>> write and a DMA back again after the write, before GPU usage. On the
> >>>>>>>> tiled architecture the SYNC operation must untile before CPU access and
> >>>>>>>> probably tile again before GPU access.
> >>>>>>>>
> >>>>>>>> If we now have a one-dimensional SYNC api, in this particular case we'd
> >>>>>>>> either need to sync a far too large area (1600x10) or call SYNC 10 times
> >>>>>>>> before writing, and then again after writing. If the app forgot to call
> >>>>>>>> SYNC we must error.
> >>>>>>>>
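To make the cost of a 1-dimensional interface concrete, syncing that
10x10 rectangle at (x, y) would look roughly like this (4 bytes per
pixel assumed; the range struct and ioctl name below are made up purely
for illustration, nothing of the sort exists):

    /* One sync call per row of the 10x10 dirty rectangle. */
    struct sync_range { uint64_t offset, length; };   /* made up */

    const int bpp = 4, pitch = 1600 * bpp;
    for (int row = 0; row < 10; row++) {
            struct sync_range r = {
                    .offset = (uint64_t)(y + row) * pitch + x * bpp,
                    .length = 10 * bpp,
            };
            ioctl(dmabuf_fd, DMA_BUF_SYNC_RANGE, &r); /* made up */
    }

A 2D-aware call taking x/y/width/height would collapse this into a
single ioctl and let the driver DMA or untile exactly the touched
rectangle.
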
> >>>>>>>> So to summarize, the fact that the interface is generic IMO means:
> >>>>>>>>
> >>>>>>>> 1) Any user must be able to make valid assumptions about the internal
> >>>>>>>> format of the dma-buf. (untiled, color format, stride etc.)
> >>>>>>>> 2) Any user *must* call SYNC before and after CPU access. On coherent
> >>>>>>>> architectures, the SYNC is a NULL operation anyway, and that should be
> >>>>>>>> documented somewhere so that maintainers of drivers of incoherent
> >>>>>>>> architectures have somewhere to point their fingers.
> >>>>>>>> 3) Driver-specific implementations must be allowed to error (segfault)
> >>>>>>>> if SYNC has not been used.
> >>>>>>> I think here you are too lax; the common code must segfault or
> >>>>>>> error badly if SYNC has not been used, in all cases, even on a
> >>>>>>> cache coherent arch. The device driver sync callback can still
> >>>>>>> be a no-op. But I think we need to insist strongly on a proper
> >>>>>>> sync call being made (and we should forbid empty-area sync
> >>>>>>> calls). This is the only way to make sure userspace behaves
> >>>>>>> properly; otherwise we end up in the situation you were
> >>>>>>> describing above, where the app is designed on a cache coherent
> >>>>>>> arch and works fine there but breaks in subtle ways on a non
> >>>>>>> cache coherent arch, and the app developer is clueless as to why.
> >>>>>>>
> >>>>>>> I just do not trust userspace.
> >>>>>> I agree, and ideally I'd want it this way as well. The question is: is it
> >>>>>> possible? Can this be done without a fault() handler in the generic
> >>>>>> kernel code?
> >>>>>>
> >>>>>>>> 4) The use-case stated above clearly shows the benefit of a
> >>>>>>>> 2-dimensional sync interface (we want to sync the 10x10 region), but
> >>>>>>>> what if someone in the future wants to use this interface for a 3D
> >>>>>>>> texture? Will a 2D sync suffice? Can we make the SYNC interface
> >>>>>>>> extendable in a way that an enum sync_type member defines the layout of
> >>>>>>>> the argument, and initially we implement only 1d, 2d sync, leaving 3d
> >>>>>>>> for the future?
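
One possible shape for such an extendable argument, purely as a sketch
(none of these struct, enum, or field names exist; they are
placeholders):

    enum sync_type {
            SYNC_TYPE_1D,   /* linear: offset/length             */
            SYNC_TYPE_2D,   /* rectangle: x/y/width/height       */
            /* SYNC_TYPE_3D could be added later for 3D textures */
    };

    struct sync_arg {
            uint32_t type;  /* enum sync_type, selects union layout */
            uint32_t flags; /* read/write, start/end of CPU access  */
            union {
                    struct { uint64_t offset, length; } sync_1d;
                    struct { uint32_t x, y, width, height; } sync_2d;
                    uint64_t reserved[4]; /* room for future layouts */
            };
    };

The type member selects the union layout, so a 3D variant could be
added later without a new ioctl, as long as it fits in the reserved
space.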
> >>>>>>>>
> >>>>>>>> Also, I agree there is probably no good way to generically implement an
> >>>>>>>> error if SYNC has not been called. That needs to be left as an option to
> >>>>>>>> drivers.
> >>>>>>> I think there is: just forbid any further use of the dma-buf,
> >>>>>>> mark it as invalid, and printk a big error. The app developer
> >>>>>>> will quickly see that something is wrong, and looking at the
> >>>>>>> kernel log should explain why.
> >>>>>> The problem is if someone calls mmap() and then decides to not access
> >>>>>> the buffer before munmap() or GPU usage. How do we recognize that case
> >>>>>> and separate it from a CPU access that occurred outside a sync? We could,
> >>>>>> as mentioned above, have a fault() handler in the generic kernel code and
> >>>>>> make sure drivers don't populate PTEs until the first fault(). Another
> >>>>>> option would be to scan all PTEs for an "accessed" flag, but I'm not
> >>>>>> even sure all CPU architectures have such a flag?
> >>>>>>
> >>>>>> But there might be a simpler solution that I have overlooked?
> >>>>> All archs have a dirty flag, so you could check that, as all we
> >>>>> really care about is CPU write access. So all you need is to clear
> >>>>> the dirty bit after each successful GPU command stream ioctl
> >>>>> and to check that no dirty bit is set in a region not covered by
> >>>>> a flush(). Note that the clear and the check can happen in the same
> >>>>> function as part of buffer validation for the GPU command stream.
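
Spelling the clear-and-check idea out in pseudo-kernel-C (the page
table walker and the region lookup below are invented for illustration;
pte_dirty(), pte_mkclean() and set_pte_at() are the real primitives):

    /* From buffer validation in the GPU command stream ioctl:
     * error out if a dirty PTE lies outside every synced region,
     * then clear the dirty bits so the next submission starts
     * clean. */
    for_each_buf_pte(vma, addr, pte) {          /* invented walker */
            if (!pte_dirty(*pte))
                    continue;
            if (!in_synced_region(buf, addr))   /* invented lookup */
                    return -EINVAL; /* CPU wrote outside any sync */
            set_pte_at(vma->vm_mm, addr, pte, pte_mkclean(*pte));
    }

Whether clearing the dirty bits this way additionally requires a TLB
flush is exactly the question discussed below.
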
> >>>> Actually, I do think we care about reading as well, since reading
> >>>> without flushing anything written by the GPU will
> >>>> yield garbage on a non-coherent architecture. Instead we might check for
> >>>> the PAGE_ACCESSED bit.
> >>> I do not think read access after GPU use matters. If people think it
> >>> does, then it is simple: at buffer validation we invalidate the CPU
> >>> page table mappings, and the prepare-access ioctl repopulates them
> >>> after checking that the GPU is flushed.
> >>>
> >>>> So the full version of this would keep track of all synced regions
> >>>> and, at buffer validation or unmap time, error if there is a page
> >>>> completely outside the union of all synced regions in any VMA
> >>>> belonging to the dma-buf address space, then unmark all accessed PTEs
> >>>> and invalidate TLBs accordingly, typically causing a global TLB flush.
> >>>> Yeah, we could probably put together a helper function that does this,
> >>>> but it will be expensive to run and the whole point of this API is to be
> >>>> able to improve performance.
> >>> No need for such complexity: the sync ioctl would go over the page
> >>> table and clear the dirty bit for all pages in the range being
> >>> synchronized. So once you get to the command ioctl you know that all
> >>> dirty bits must be clear. Note this is only a page table walk in all
> >>> cases; I do not consider this to be that expensive.
> >>>
> >>>
> >> The expensive part is the TLB flushing after the PTE update, which
> >> typically needs to be performed on all cores. Although without a
> >> benchmark I'm not sure how expensive. It might be that it doesn't
> >> matter. After all, the vmwgfx and, I believe, the qxl drivers both use
> >> a similar approach for fbdev user-space access, but that has never been
> >> considered performance-critical...
> > There is no need to flush the TLB! At least not for the write case.
> > Clearing the dirty bit is cache coherent. TLBs are about caching the
> > page table directory tree. To convince yourself, look at
> > page_mkclean_one() in mm/rmap.c which, by the way, is almost what we
> > want (remove the mmu notifier part and the write-protection part, and
> > force a real CPU cache flush if needed based on a device flag).
> 
> Hmm? Last time I visited this code, page_mkclean_one() was eventually
> calling flush_tlb_page() through ptep_clear_flush_notify(). I was under
> the impression that you always have to flush the TLB when you clear a
> PTE status bit, but typically not when you set it. Are you saying that
> that's not the case?

Well, this comes down to hardware implementation details, and my
recollection is that the AMD/Intel docs say that setting the dirty bit
is a locked read-modify-write operation, i.e. it is atomic. Now there is
a bit of unknown in how exactly this happens (does the hardware
re-traverse the full directory hierarchy? does it keep a pointer to the
page table entry? ...).

So basically this means that at validation time, if any CPU wrote to
any page inside the range, we would see the dirty bit set; we do not
need to flush anything to see that happen (unless things are mapped
write-combine).

Well, there is a race window if a CPU is writing to the buffer while we
are doing the validation on another CPU. But there is no way around that
race except to unmap and flush the TLB, which is not what we want.

Now, there is something I am not sure about: does the CPU keep the TLB
entry around with the dirty bit set and not redo the locked RMW
sequence, or does it do the locked RMW sequence every time it writes to
a cacheline? I would guess the former, if only for performance reasons.
In that case we would need a TLB flush, but the Linux kernel already
offers helpers to optimize this away.


All this kind of makes me think that there is no sane way around this.
I mean, we are trying to work around potentially broken userspace,
mostly because most of the GPU driver is in userspace and we can't
trust it. I think this is just a lost battle, and it would be better to
just provide the API, say that you must sync, and let userspace that
does not sync bear the consequences sooner or later.
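
For reference, the documented contract for userspace would then look
roughly like this (all the names below are placeholders for whatever
the final interface ends up being):

    struct dma_buf_sync { uint64_t flags; };        /* placeholder */

    uint32_t *ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                         MAP_SHARED, dmabuf_fd, 0);

    struct dma_buf_sync sync = { .flags = SYNC_START | SYNC_WRITE };
    ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync);    /* before CPU access */

    ptr[0] = 0xcafe;                                /* the CPU write     */

    sync.flags = SYNC_END | SYNC_WRITE;
    ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync);    /* after CPU access  */

Skip either call and things happen to work on coherent hardware but
break subtly on vmwgfx or tiled hardware, which is the consequence
userspace gets to bear.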

Cheers,
Jérôme

