[PATCH 2/5] drm/nouveau: add reservation to nouveau_gem_ioctl_cpu_prep

Daniel Vetter daniel at ffwll.ch
Wed Jan 22 07:09:51 PST 2014


On Wed, Jan 22, 2014 at 01:52:51PM +0100, Thomas Hellstrom wrote:
> On 01/22/2014 01:38 PM, Maarten Lankhorst wrote:
> > op 22-01-14 13:11, Thomas Hellstrom schreef:
> >> On 01/22/2014 11:58 AM, Maarten Lankhorst wrote:
> >>> op 22-01-14 11:27, Thomas Hellstrom schreef:
> >>>> On 01/22/2014 10:55 AM, Maarten Lankhorst wrote:
> >>>>> op 22-01-14 10:40, Thomas Hellstrom schreef:
> >>>>>> On 01/22/2014 09:19 AM, Maarten Lankhorst wrote:
> >>>>>>> op 21-01-14 18:44, Thomas Hellstrom schreef:
> >>>>>>>> On 01/21/2014 04:29 PM, Maarten Lankhorst wrote:
> >>>>>>>>> Hey,
> >>>>>>>>>
> >>>>>>>>> op 21-01-14 16:17, Thomas Hellstrom schreef:
> >>>>>>>>>> Maarten, for this and the other patches in this series,
> >>>>>>>>>>
> >>>>>>>>>> I seem to recall we have this discussion before?
> >>>>>>>>>> IIRC I stated that reservation was a too heavy-weight lock to
> >>>>>>>>>> hold to
> >>>>>>>>>> determine whether a buffer was idle? It's a pretty nasty
> >>>>>>>>>> thing to
> >>>>>>>>>> build in.
> >>>>>>>>>>
> >>>>>>>>> I've sent this patch after determining that this already didn't
> >>>>>>>>> end up
> >>>>>>>>> being heavyweight.
> >>>>>>>>> Most places were already using the fence_lock and reservation, I
> >>>>>>>>> just
> >>>>>>>>> fixed up the few
> >>>>>>>>> places that didn't hold a reservation while waiting.
> >>>>>>>>> Converting the
> >>>>>>>>> few places that didn't
> >>>>>>>>> ended up being trivial, so I thought I'd submit it.
> >>>>>>>> Actually the only *valid* reason for holding a reservation when
> >>>>>>>> waiting
> >>>>>>>> for idle is
> >>>>>>>> 1) You want to block further command submission on the buffer.
> >>>>>>>> 2) You want to switch GPU engine and don't have access to gpu
> >>>>>>>> semaphores
> >>>>>>>> / barriers.
> >>>>>>>>
> >>>>>>>> Reservation has the nasty side effect that it blocks command
> >>>>>>>> submission
> >>>>>>>> and pins the buffer (in addition now makes the evict list
> >>>>>>>> traversals
> >>>>>>>> skip the buffer) which in general is *not* necessary for most wait
> >>>>>>>> cases, so we should instead actually convert the wait cases that
> >>>>>>>> don't
> >>>>>>>> fulfill 1) and 2) above in the other direction if we have
> >>>>>>>> performance
> >>>>>>>> and latency-reduction in mind. I can't see how a spinlock
> >>>>>>>> protecting a
> >>>>>>>> fence pointer or fence list is stopping you from using RW
> >>>>>>>> fences as
> >>>>>>>> long
> >>>>>>>> as the spinlock is held while manipulating the fence list?
> >>>>>>>>
> >>>>>>> You wish. Fine I'll enumerate all cases of ttm_bo_wait (with the
> >>>>>>> patchset, though) and enumerate if they can be changed to work
> >>>>>>> without
> >>>>>>> reservation or not.
> >>>>>>>
> >>>>>>> ttm/ttm_bo.c
> >>>>>>> ttm_bo_cleanup_refs_or_queue: needs reservation and ttm_bo_wait to
> >>>>>>> finish for the direct destroy fastpath, if either fails it needs
> >>>>>>> to be
> >>>>>>> queued. Cannot work without reservation.
> >>>>>> Doesn't block and no significant reservation contention expected.
> >>>>>>
> >>>>>>> ttm_bo_cleanup_refs_and_unlock: already drops reservation to wait,
> >>>>>>> doesn't need to re-acquire. Simply reordering ttm_bo_wait until
> >>>>>>> after
> >>>>>>> re-reserve is enough.
> >>>>>> Currently follows the above rules.
> >>>>>>
> >>>>>>> ttm_bo_evict: already has the reservation, cannot be dropped since
> >>>>>>> only trylock is allowed. Dropping reservation would cause badness,
> >>>>>>> cannot be converted.
> >>>>>> Follows rule 2 above. We're about to move the buffer and if that
> >>>>>> can't
> >>>>>> be pipelined using the GPU (which TTM currently doesn't allow), we
> >>>>>> need
> >>>>>> to wait. Although eviction should be low priority compared to new
> >>>>>> command submission, so I can't really see why we couldn't wait
> >>>>>> before
> >>>>>> trying to reserve here?
> >>>>>>
> >>>>>>> ttm_bo_move_buffer: called from ttm_bo_validate, cannot drop
> >>>>>>> reservation for same reason as ttm_bo_evict. It might be part of a
> >>>>>>> ticketed reservation so really don't drop lock here.
> >>>>>> Part of command submission and as such follows rule 2 above. If
> >>>>>> we can
> >>>>>> pipeline the move with the GPU, no need to wait (but needs to be
> >>>>>> implemented, of course).
> >>>>>>
> >>>>>>> ttm_bo_synccpu_write_grab: the wait could be converted to be done
> >>>>>>> afterwards, without  fence_lock. But in this case reservation could
> >>>>>>> take the role of fence_lock too,
> >>>>>>>
> >>>>>>> so no separate fence_lock would be needed.
> >>>>>> With the exception that reservation is more likely to be contended.
> >>>>> True but rule 1.
> >>>>>>> ttm_bo_swapout: see ttm_bo_evict.
> >>>>>>>
> >>>>>>> ttm/ttm_bo_util.c:
> >>>>>>> ttm_bo_move_accel_cleanup: calls ttm_bo_wait, cannot drop lock, see
> >>>>>>> ttm_bo_move_buffer, can be called from that function.
> >>>>>> Rule 2.
> >>>>>>
> >>>>>>> ttm/ttm_bo_vm.c
> >>>>>>> ttm_bo_vm_fault_idle: I guess you COULD drop the reservation here,
> >>>>>>> but
> >>>>>>> you already had the reservation, so a similar optimization to
> >>>>>>> ttm_bo_synccpu_write_grab could be done without requiring
> >>>>>>> fence_lock.
> >>>>>>> If you would write it like that, you would end up with a patch
> >>>>>>> similar
> >>>>>>> to drm/nouveau: add reservation to nouveau_gem_ioctl_cpu_prep. I
> >>>>>>> think
> >>>>>>> we should do this, an
> >>>>>>>
> >>>>>>> Ok, so the core does NOT need fence_lock because we can never drop
> >>>>>>> reservations except in synccpu_write_grab and maybe
> >>>>>>> ttm_bo_vm_fault_idle, but even in those cases reservation is
> >>>>>>> done. So
> >>>>>>> that could be used instead of fence_lock.
> >>>>>>>
> >>>>>>> nouveau_gem_ioctl_cpu_prep:
> >>>>>>> Either block on a global spinlock or a local reservation lock.
> >>>>>>> Doesn't
> >>>>>>> matter much which, I don't see the need to keep a global lock for
> >>>>>>> this function...
> >>>>>>> 2 cases can happen in the trylock reservation failure case:
> >>>>>>> buffer is
> >>>>>>> not reserved, so it's not in the process of being evicted.
> >>>>>>> buffer is
> >>>>>>> reserved, which means it's being used in command submission right
> >>>>>>> now,
> >>>>>>> or in one of the functions described above (eg not idle).
> >>>>>>>
> >>>>>>> nouveau_gem_pushbuf_reloc_apply:
> >>>>>>> has to call ttm_bo_wait with reservation, cannot be dropped.
> >>>>>>>
> >>>>>>> So for core ttm and nouveau the fence_lock is never needed, radeon
> >>>>>>> has
> >>>>>>> only 1 function that calls ttm_bo_wait which uses a reservation
> >>>>>>> too.
> >>>>>>> It doesn't need the fence_lock either.
> >>>>>> And vmwgfx now also has a synccpu IOCTL (see drm-next).
> >>>>>>
> >>>>>> So assuming that we converted the functions that can be converted to
> >>>>>> wait outside of reservation, the same way you have done with
> >>>>>> Nouveau,
> >>>>>> leaving the ones that fall under 1) and 2) above, I would still
> >>>>>> argue
> >>>>>> that a spinlock should be used because taking a reservation may
> >>>>>> implicitly mean wait for gpu, and could give bad performance- and
> >>>>>> latency characteristics. You shouldn't need to wait for gpu to check
> >>>>>> for
> >>>>>> buffer idle.
> >>>>> Except that without reservation you can't tell if the buffer is
> >>>>> really
> >>>>> idle, or is currently being
> >>>>> used as part of some command submission/eviction before the fence
> >>>>> pointer is set.
> >>>>>
> >>>> Yes, but when that matters, you're either in case 1 or case 2 again.
> >>>> Otherwise, when you release the reservation, you still don't know.
> >>>> A typical example of this is the vmwgfx synccpu ioctl, where you can
> >>>> either choose to block command submission (not used currently)
> >>>> or not (user-space inter-process synchronization). The former is a
> >>>> case
> >>>> 1 wait and holds reservation while waiting for idle and then ups
> >>>> cpu_writers. The latter waits without reservation for previously
> >>>> submitted rendering to finish.
> >>> Yeah you could, but what exactly are you waiting on then? If it's some
> >>> specific existing rendering,
> >>> I would argue that you should create an android userspace fence during
> >>> command submission,
> >>> or provide your own api to block on a specific fence in userspace.
> >>>
> >>> If you don't then I think taking a reservation is not unreasonable. In
> >>> the most common case the buffer is
> >>> idle and not reserved, so it isn't contested. The actual waiting
> >>> itself can be done without reservation held,
> >>> by taking a reference on the fence.
> >> Yeah, here is where we disagree. I'm afraid people will start getting
> >> sloppy with reservations and use them to protect more stuff, and after a
> >> while they start wondering why the GPU command queue drains...
> >>
> >> Perhaps we could agree on a solution (building on one of your original
> >> ideas) where we require reservation to modify the fence pointers, and
> >> the buffer object moving flag, but the structure holding the fence
> >> pointer(s) is RCU safe, so that the pointers can be safely read under an
> >> rcu lock.
> > I think not modifying the fence pointer without reservation would be
> > safest.
> > I also don't think readers need the capability to clear sync_obj, this
> > might
> > simplify the implementation some.
> >
> > But my preferred option is getting rid of sync_obj completely, and
> > move to
> > using reservation_object->fence_shared/exclusive, like the incomplete
> > proof
> > of concept conversion done in nouveau. But then I do need to grab the
> > reservation lock to touch things, because fences may be set by the
> > i915 driver
> > I share the reservation_object with.
> >
> > Alternatively could vmwgfx hold a spinlock when decrementing fence
> > refcount instead?
> > Then we wouldn't need this in the core, and vmwgfx could use:
> 
> Maarten,
> requiring reservation to access the fence pointers really turns my gut!
> Being able to read them under rcu is a remedy, but something I figure
> would be the default and recommended thing to do. Not a vmware
> exception. This is about as far as I'm prepared to go.

Let me jump into your discussion and have a bit of fun too ;-)

More seriously I think we should take a step back and look at the larger
picture: The overall aim is to allow cross-device shared dma-bufs to get
fenced/reserved/whatever. Which means the per-device fence_lock ttm is
currently using won't work any more. So we need to change things a bit.

I see a few solutions. Note that I haven't checked the implications for
existing drivers (especially ttm) in detail, so please correct me when
some of these ideas are horrible to implement:

- Make fence_lock a global thing instead of per-device. Probably not what
  we want given that dma-buf (and also all the ttm state) has more
  fine-grained locking.

- Remove the fence_lock and protect fences with the reservation lock the
  dma-buf already has. Has the appeal of being the simplest solution, at
  least if we exclude the One Lock to Rule Them all approach ;-)

- Add a new per-buffer spinlock just to protect the fences. Could end up
  being rather costly for the non-contended common case where we just want
  to push tons of buffers through execbuf ioctls.

- Allow fences attached to dma-bufs to be accessed read-only (to grab
  references of them and wait on them) using rcu as protection. I think we
  need some trickery with kref_get_unless_zero to make sure the
  rcu-delayed freeing of fences doesn't race in bad ways with lockless
  kref_gets. Another approach would be to rcu-delay all kref_puts, but I
  don't think we want that.
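
To make that last option a bit more concrete, here's a rough sketch of
what the rcu read side could look like. All structure and function names
below are made up purely for illustration, this is not the actual
ttm/dma-buf code:

#include <linux/kref.h>
#include <linux/rcupdate.h>

struct dummy_fence {
	struct kref refcount;
	struct rcu_head rcu;	/* freed with kfree_rcu/call_rcu */
	/* driver specific state ... */
};

struct dummy_resv {
	/* only ever written with the reservation held */
	struct dummy_fence __rcu *fence_excl;
};

/* read side: grab a fence reference without taking the reservation */
static struct dummy_fence *dummy_fence_get_rcu(struct dummy_resv *resv)
{
	struct dummy_fence *fence;

	rcu_read_lock();
	fence = rcu_dereference(resv->fence_excl);
	/*
	 * The rcu grace period keeps the memory around, but the refcount
	 * may already have dropped to zero, so only take a reference if
	 * it hasn't.
	 */
	if (fence && !kref_get_unless_zero(&fence->refcount))
		fence = NULL;
	rcu_read_unlock();

	return fence;	/* caller waits on it, then drops the reference */
}

The write side would still need the reservation (or some other lock) so
that the pointer and the fence it points at stay consistent.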

Personally I prefer the 2nd approach since it's the simplest, while not
being un-scalable like the first. My experience with the single lock in
i915, where any contention and especially any waiting while holding locks
is massively exaggerated, is that lock-dropping games around the common
gpu wait points are sufficient. Actually in almost all cases the
fence_lock wouldn't be sufficient for us anyway, since we also need to
check buffer placement, mmaps and similar things.
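
The lock-dropping pattern I mean is roughly the following (again with
made-up names, just to show the shape of it, not real driver code):

#include <linux/types.h>

struct dummy_bo;
struct dummy_fence;

/* hypothetical helpers standing in for the real reserve/fence api */
int dummy_bo_reserve(struct dummy_bo *bo);
void dummy_bo_unreserve(struct dummy_bo *bo);
struct dummy_fence *dummy_bo_fence_get(struct dummy_bo *bo);
int dummy_fence_wait(struct dummy_fence *fence, bool intr);
void dummy_fence_put(struct dummy_fence *fence);

static int dummy_bo_wait_idle(struct dummy_bo *bo)
{
	struct dummy_fence *fence;
	int ret;

	ret = dummy_bo_reserve(bo);	/* short critical section */
	if (ret)
		return ret;

	fence = dummy_bo_fence_get(bo);	/* reference taken under the lock */
	dummy_bo_unreserve(bo);		/* dropped before the long gpu wait */

	if (!fence)
		return 0;		/* buffer already idle */

	ret = dummy_fence_wait(fence, true);
	dummy_fence_put(fence);
	return ret;
}

So the reservation is only held for the pointer chasing, never across the
actual wait.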

Now I see that there are valid cases where we want the lowest possible
overhead for waiting on or just checking for outstanding rendering. OpenCL
with its fancy/explicit synchronization model seems to be the prime
example that usually pops up. For such uses I think it's better to just
directly expose fences to userspace and completely eschew any indirection
through buffer objects.
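
Purely as an illustration of what I mean (this is not an existing or
proposed interface, all names are invented), think of an execbuf ioctl
that can hand back a pollable fence fd which userspace waits on or passes
around directly, without ever going back through the buffer objects:

/* illustrative only, not real uapi */
struct drm_dummy_execbuffer {
	__u64 buffers_ptr;	/* userptr to the buffer list */
	__u32 buffer_count;
	__u32 flags;
#define DUMMY_EXEC_FENCE_OUT	(1 << 0)
	/*
	 * out: when FENCE_OUT is set, an fd representing completion of
	 * this batch; poll() it, wait on it, or pass it to another
	 * process without touching the buffer objects again.
	 */
	__s32 fence_fd;
	__u32 pad;
};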

That leaves the issue of stalling unrelated processes when they try to
look up fences while another process is heavily thrashing vram and so holds
tons of reservations on non-shared objects and blocks waiting for the gpu
to complete more batches. But from my cursory understanding non-UMA
platforms currently (at least with the drivers we have and the memory
management logic) drop off a rather steep cliff. So I fear that
micro-optimizing this case is investing complexity into the wrong place.
At least in i915 we've implemented quite a pile of tricks to smooth off
the gtt thrashing cliff before even considering improving lock contention
and lock holding times.

So overall I'm heavily favouring the simple approach of just reusing the
reservation ww mutex to protect fence state, but I'm definitely not
rejecting more complex approaches out of hand. I just think that we should
have solid data to justify the complexity.

Finally if we can't reach an agreement here I guess we could duct-tape
something together where ttm objects only used by a single driver are
protected by the fence_lock and other, shared buffers are protected by the
reservation. It won't be pretty, but the impact should be fairly
contained. Especially since many paths in ttm currently grab both locks
anyway, so wouldn't need any special handling.
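
I.e. roughly something like this in the wait paths, reusing the made-up
helpers from the sketches above (hand-waving, all field names invented):

#include <linux/spinlock.h>

struct dummy_bo_device {
	spinlock_t fence_lock;		/* old per-device lock */
};

struct dummy_bo {
	struct dummy_bo_device *bdev;
	struct dummy_fence *fence;	/* protected by bdev->fence_lock */
	bool is_shared;			/* exported/imported via dma-buf? */
};

/* another made-up helper: plain reference grab on a known-live fence */
struct dummy_fence *dummy_fence_get(struct dummy_fence *fence);

static struct dummy_fence *dummy_bo_fence_lookup(struct dummy_bo *bo)
{
	struct dummy_fence *fence = NULL;

	if (!bo->is_shared) {
		/* driver-private bo: keep the per-device fence_lock */
		spin_lock(&bo->bdev->fence_lock);
		if (bo->fence)
			fence = dummy_fence_get(bo->fence);
		spin_unlock(&bo->bdev->fence_lock);
	} else if (dummy_bo_reserve(bo) == 0) {
		/* shared via dma-buf: fences protected by the reservation */
		fence = dummy_bo_fence_get(bo);
		dummy_bo_unreserve(bo);
	}

	return fence;
}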

I hope this helps to move the discussion forward.

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

