[Mesa-dev] [PATCH 1/2] i965/bufmgr: Explicitly wait instead of using I915_GEM_SET_DOMAIN.

Chris Wilson chris at chris-wilson.co.uk
Fri Jul 21 20:28:02 UTC 2017


Quoting Kenneth Graunke (2017-07-18 06:14:52)
> With the advent of asynchronous maps, domain tracking doesn't make a
> whole lot of sense.  Buffers can be in use on both the CPU and GPU at
> the same time.  In order to avoid blocking, we stopped using set_domain
> for asynchronous mappings, which means that the kernel's tracking now
> contains lies.  We can't properly track it in userspace either, as the kernel
> can change domains on us spontaneously (for example, when un-swapping).
> 
> I915_GEM_SET_DOMAIN combines two aspects: cache flushing, and waiting
> for a buffer to be idle.  In order to stop using it, we'll need to do
> clflushing ourselves, and use I915_GEM_WAIT to wait on buffers.
> 
> For cache-coherent buffers (most on LLC systems), we don't need to do
> any clflushing - the CPU and GPU views are coherent.  For non-coherent
> buffers (most on non-LLC systems), we currently only use the CPU for
> read-only maps, and we explicitly clflush when necessary.
> ---
>  src/mesa/drivers/dri/i965/brw_bufmgr.c | 23 ++++++-----------------
>  1 file changed, 6 insertions(+), 17 deletions(-)
> 
> Here's a strawman patch for Chris to shoot down. :)  I'm probably
> missing something obvious, but it seems like our existing
> gen_invalidate_range when CPU mapping non-coherent buffers ought
> to be sufficient.  If we were doing CPU write mapping, we'd
> definitely need more tracking, but we don't, so...
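
(For reference, the clflush side of the plan for read-only CPU maps of
non-coherent buffers amounts to something like the sketch below; the
helper name and cacheline handling are illustrative, the real helper
being gen_invalidate_range.)

#include <stddef.h>
#include <stdint.h>
#include <immintrin.h>

#define CACHELINE_SIZE 64

/* Drop any stale cachelines covering [start, start + size) before reading
 * through a CPU map of a non-coherent (non-LLC) bo, then fence so the
 * flushes are ordered before the subsequent reads.
 */
static inline void
invalidate_range(void *start, size_t size)
{
   char *p = (char *) ((uintptr_t) start & ~(uintptr_t) (CACHELINE_SIZE - 1));
   char *end = (char *) start + size;

   while (p < end) {
      _mm_clflush(p);
      p += CACHELINE_SIZE;
   }
   _mm_mfence();
}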

Here's what set_domain does in the kernel:

 1. pins the backing storage (acquiring pages outside of the
   struct_mutex)

 2. waits either for read/write access, including inter-device waits

 3. updates the domain, clflushing as required

 4. marks the object as used (for swapping)

 5. turns off FBC/PSR/fancy scanout caching
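
For reference, on the userspace side all of the above is triggered by a
single ioctl; a rough sketch (the helper name and error handling are
illustrative, not the exact brw_bufmgr.c code):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <xf86drm.h>
#include <i915_drm.h>

/* Move the bo into the CPU read (and optionally write) domain; this is
 * the call that waits for idle and clflushes as required.
 */
static int
set_domain_cpu(int fd, uint32_t gem_handle, bool write)
{
   struct drm_i915_gem_set_domain sd;

   memset(&sd, 0, sizeof(sd));
   sd.handle = gem_handle;
   sd.read_domains = I915_GEM_DOMAIN_CPU;
   sd.write_domain = write ? I915_GEM_DOMAIN_CPU : 0;

   return drmIoctl(fd, DRM_IOCTL_I915_GEM_SET_DOMAIN, &sd);
}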

As discussed, (1) is of little relevance, but a set-domain just after
creation is equally effective for the few bos that are not recycled.

(3) is already handled by mesa, and (4) we don't care that much: if we
swap we've lost anyway, and the LRU tracking from GPU activity is
probably more relevant. Note also that since the use of ASYNC, the kernel
domains are already inaccurate. (This will only cause us trouble when we
write through the GTT and cause a readback via some other means.)

For (2), we lose out on the possibility of doing read-read optimisations,
which will probably go unnoticed. We can extend the wait ioctl for
read-only support if desired (and that will give me a good use case for a
patch I already have to add such support).
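
The explicit wait that replaces (2) is just the wait ioctl; a sketch of
what a brw_bo_wait()-style helper boils down to (timeout_ns == -1 means
wait indefinitely):

#include <stdint.h>
#include <xf86drm.h>
#include <i915_drm.h>

/* Block until the bo is idle for both reads and writes; a read-only
 * variant would need the ioctl extension mentioned above.
 */
static int
wait_bo_idle(int fd, uint32_t gem_handle, int64_t timeout_ns)
{
   struct drm_i915_gem_wait wait = {
      .bo_handle = gem_handle,
      .timeout_ns = timeout_ns,
   };

   return drmIoctl(fd, DRM_IOCTL_I915_GEM_WAIT, &wait);
}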

The loss of scanout invalidation is the only potential trouble. We have
already made the adjustment to not accommodate direct access to
scanouts through the brw_bo_map facility (with the removal of SW_FINISH,
though that is supposed to be replaced by the compositor calling
DIRTYFB). The effect of removing intel_fb_obj_invalidate() should be
delayed output if the screen is using FBC/PSR. (Note that if the screen is
using FBC/PSR, ideally we want to always use a backbuffer and flip.)
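
(On the compositor side that flush is a one-liner via libdrm; sketch
below, with fb_id being the KMS framebuffer id and a NULL clip list
dirtying the whole framebuffer.)

#include <stdint.h>
#include <xf86drmMode.h>

/* Tell the kernel the frontbuffer changed so FBC/PSR are flushed for
 * that framebuffer.
 */
static int
flush_frontbuffer(int fd, uint32_t fb_id)
{
   return drmModeDirtyFB(fd, fb_id, NULL, 0);
}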

To map_gtt(), I would add something like

/*
 * Write access through the GTT is not quite fully coherent.  On low power
 * systems, especially modern Atoms, we can observe reads from RAM before
 * the write via the GTT has landed.  A write memory barrier that flushes
 * the Write Combining Buffer (i.e. sfence/mfence) is not sufficient to
 * order the later read after the write, as the GTT write suffers a small
 * delay through the GTT indirection.  The kernel uses an uncached mmio
 * read to ensure the GTT write is ordered with reads (either by the GPU,
 * WB or WC) and unconditionally flushes prior to execbuf submission.
 * However, if we are not informing the kernel about our GTT writes, it
 * will not flush before earlier access, such as when using the cmdparser.
 * Similarly, we need to be careful if we should ever issue a CPU read
 * immediately following a GTT write.
 *
 * Telling the kernel about write access also has one more important
 * side-effect.  Upon receiving notification about the write, it cancels
 * any scanout buffering for FBC/PSR and friends.  Later FBC/PSR is then
 * flushed by either SW_FINISH or DIRTYFB.  The presumption is that we
 * never write to the actual scanout via a mmapping, only to a backbuffer,
 * and so all the FBC/PSR tracking is handled on the buffer exchange
 * instead.
 */

Considering the above, the paranoid approach would be to add the
if (bo->map_gtt) READ_ONCE() trick to the clflush in brw_bo_map_cpu, and
similarly, when replacing pwrite, to explicitly flush the batch. The plan
is that we don't write to those objects via GTT in the first place!
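
Something along these lines is what I have in mind for the paranoid path
(sketch only; the struct below merely stands in for the relevant brw_bo
fields):

#include <stdint.h>

struct bo_like {
   void *map_cpu;
   void *map_gtt;
};

/* Before clflushing or reading via the CPU map, do a single uncached read
 * through the GTT mapping (if one exists) so that any earlier WC write and
 * the GTT indirection delay are flushed out first.
 */
static inline void
order_gtt_writes_before_cpu_access(struct bo_like *bo)
{
   if (bo->map_gtt) {
      /* the READ_ONCE() trick: one volatile read through the aperture */
      (void) *(volatile uint32_t *) bo->map_gtt;
   }
}
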
-Chris

