[PATCH 2/5] drm/nouveau: add reservation to nouveau_gem_ioctl_cpu_prep

Maarten Lankhorst maarten.lankhorst at canonical.com
Wed Jan 22 04:38:45 PST 2014


On 22-01-14 13:11, Thomas Hellstrom wrote:
> On 01/22/2014 11:58 AM, Maarten Lankhorst wrote:
>> On 22-01-14 11:27, Thomas Hellstrom wrote:
>>> On 01/22/2014 10:55 AM, Maarten Lankhorst wrote:
>>>> On 22-01-14 10:40, Thomas Hellstrom wrote:
>>>>> On 01/22/2014 09:19 AM, Maarten Lankhorst wrote:
>>>>>> On 21-01-14 18:44, Thomas Hellstrom wrote:
>>>>>>> On 01/21/2014 04:29 PM, Maarten Lankhorst wrote:
>>>>>>>> Hey,
>>>>>>>>
>>>>>>>> On 21-01-14 16:17, Thomas Hellstrom wrote:
>>>>>>>>> Maarten, for this and the other patches in this series,
>>>>>>>>>
>>>>>>>>> I seem to recall we had this discussion before?
>>>>>>>>> IIRC I stated that reservation was too heavy-weight a lock to
>>>>>>>>> hold while determining whether a buffer was idle? It's a pretty
>>>>>>>>> nasty thing to build in.
>>>>>>>>>
>>>>>>>> I sent this patch after determining that it didn't end up being
>>>>>>>> heavyweight. Most places were already using the fence_lock and
>>>>>>>> reservation; I just fixed up the few places that didn't hold a
>>>>>>>> reservation while waiting. Converting those few places turned out
>>>>>>>> to be trivial, so I thought I'd submit it.
>>>>>>> Actually the only *valid* reasons for holding a reservation when
>>>>>>> waiting for idle are
>>>>>>> 1) You want to block further command submission on the buffer.
>>>>>>> 2) You want to switch GPU engine and don't have access to gpu
>>>>>>> semaphores / barriers.
>>>>>>>
>>>>>>> Reservation has the nasty side effect that it blocks command
>>>>>>> submission and pins the buffer (and in addition now makes the evict
>>>>>>> list traversals skip the buffer), which in general is *not*
>>>>>>> necessary for most wait cases. So we should instead actually
>>>>>>> convert the wait cases that don't fulfill 1) and 2) above in the
>>>>>>> other direction, if we have performance and latency-reduction in
>>>>>>> mind. I can't see how a spinlock protecting a fence pointer or
>>>>>>> fence list would stop you from using RW fences, as long as the
>>>>>>> spinlock is held while manipulating the fence list?
>>>>>>>
>>>>>> You wish. Fine, I'll enumerate all cases of ttm_bo_wait (with the
>>>>>> patchset applied, though) and note whether or not they can be
>>>>>> changed to work without reservation.
>>>>>>
>>>>>> ttm/ttm_bo.c
>>>>>> ttm_bo_cleanup_refs_or_queue: needs the reservation and ttm_bo_wait
>>>>>> to finish for the direct destroy fastpath; if either fails it needs
>>>>>> to be queued. Cannot work without reservation.
>>>>> Doesn't block and no significant reservation contention expected.
>>>>>
>>>>>> ttm_bo_cleanup_refs_and_unlock: already drops the reservation to
>>>>>> wait, and doesn't need to re-acquire it. Simply reordering
>>>>>> ttm_bo_wait until after the re-reserve is enough.
>>>>> Currently follows the above rules.
>>>>>
>>>>>> ttm_bo_evict: already holds the reservation, which cannot be
>>>>>> dropped since only trylock is allowed. Dropping the reservation
>>>>>> would cause badness, so this cannot be converted.
>>>>> Follows rule 2 above. We're about to move the buffer, and if that
>>>>> can't be pipelined using the GPU (which TTM currently doesn't allow),
>>>>> we need to wait. That said, eviction should be low priority compared
>>>>> to new command submission, so I can't really see why we couldn't
>>>>> wait before trying to reserve here?
>>>>>
>>>>>> ttm_bo_move_buffer: called from ttm_bo_validate, cannot drop the
>>>>>> reservation for the same reason as ttm_bo_evict. It might be part
>>>>>> of a ticketed reservation, so we really can't drop the lock here.
>>>>> Part of command submission and as such follows rule 2 above. If we can
>>>>> pipeline the move with the GPU, no need to wait (but needs to be
>>>>> implemented, of course).
>>>>>
>>>>>> ttm_bo_synccpu_write_grab: the wait could be converted to be done
>>>>>> afterwards, without fence_lock. But in this case reservation could
>>>>>> take the role of fence_lock too, so no separate fence_lock would be
>>>>>> needed.
>>>>> With the exception that reservation is more likely to be contended.
>>>> True, but that falls under rule 1.
>>>>>> ttm_bo_swapout: see ttm_bo_evict.
>>>>>>
>>>>>> ttm/ttm_bo_util.c:
>>>>>> ttm_bo_move_accel_cleanup: calls ttm_bo_wait and cannot drop the
>>>>>> lock; see ttm_bo_move_buffer, from which it can be called.
>>>>> Rule 2.
>>>>>
>>>>>> ttm/ttm_bo_vm.c
>>>>>> ttm_bo_vm_fault_idle: I guess you COULD drop the reservation here,
>>>>>> but you already had the reservation, so a similar optimization to
>>>>>> ttm_bo_synccpu_write_grab could be done without requiring
>>>>>> fence_lock. If you wrote it like that, you would end up with a patch
>>>>>> similar to "drm/nouveau: add reservation to
>>>>>> nouveau_gem_ioctl_cpu_prep". I think we should do this.
>>>>>>
>>>>>> Ok, so the core does NOT need fence_lock, because we can never drop
>>>>>> reservations except in synccpu_write_grab and maybe
>>>>>> ttm_bo_vm_fault_idle, but even in those cases the reservation is
>>>>>> taken. So that could be used instead of fence_lock.
>>>>>>
>>>>>> nouveau_gem_ioctl_cpu_prep:
>>>>>> Either block on a global spinlock or a local reservation lock.
>>>>>> Doesn't matter much which; I don't see the need to keep a global
>>>>>> lock for this function...
>>>>>> Two cases can happen when we trylock the reservation: the buffer is
>>>>>> not reserved, so it's not in the process of being evicted; or the
>>>>>> buffer is reserved, which means it's being used in command
>>>>>> submission right now, or in one of the functions described above
>>>>>> (i.e. not idle).
>>>>>>
>>>>>> nouveau_gem_pushbuf_reloc_apply:
>>>>>> has to call ttm_bo_wait with the reservation held; it cannot be
>>>>>> dropped.
>>>>>>
>>>>>> So for core ttm and nouveau the fence_lock is never needed; radeon
>>>>>> has only one function that calls ttm_bo_wait, and it holds a
>>>>>> reservation too, so it doesn't need the fence_lock either.
>>>>> And vmwgfx now also has a synccpu IOCTL (see drm-next).
>>>>>
>>>>> So assuming that we converted the functions that can be converted to
>>>>> wait outside of reservation, the same way you have done with Nouveau,
>>>>> leaving the ones that fall under 1) and 2) above, I would still argue
>>>>> that a spinlock should be used, because taking a reservation may
>>>>> implicitly mean waiting for the gpu, and could give bad performance
>>>>> and latency characteristics. You shouldn't need to wait for the gpu
>>>>> to check whether a buffer is idle.
>>>> Except that without reservation you can't tell if the buffer is
>>>> really idle, or is currently being used as part of some command
>>>> submission/eviction before the fence pointer is set.
>>>>
>>> Yes, but when that matters, you're either in case 1 or case 2 again.
>>> Otherwise, when you release the reservation, you still don't know.
>>> A typical example of this is the vmwgfx synccpu ioctl, where you can
>>> either choose to block command submission (not used currently)
>>> or not (user-space inter-process synchronization). The former is a case
>>> 1 wait and holds reservation while waiting for idle and then ups
>>> cpu_writers. The latter waits without reservation for previously
>>> submitted rendering to finish.
>> Yeah, you could, but what exactly are you waiting on then? If it's some
>> specific existing rendering, I would argue that you should create an
>> Android userspace fence during command submission, or provide your own
>> API to block on a specific fence in userspace.
>>
>> If you don't, then I think taking a reservation is not unreasonable. In
>> the most common case the buffer is idle and not reserved, so it isn't
>> contended. The actual waiting itself can be done without the
>> reservation held, by taking a reference on the fence.
> Yeah, here is where we disagree. I'm afraid people will start getting
> sloppy with reservations and use them to protect more stuff, and after a
> while they start wondering why the GPU command queue drains...
>
> Perhaps we could agree on a solution (building on one of your original
> ideas) where we require reservation to modify the fence pointers and the
> buffer object moving flag, but the structure holding the fence
> pointer(s) is RCU safe, so that the pointers can be safely read under an
> RCU read lock.
I think not modifying the fence pointer without reservation would be safest.
I also don't think readers need the capability to clear sync_obj, which
might simplify the implementation somewhat.
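
Something like this is what I'd imagine the read side looking like. Just a
sketch: I'm borrowing vmwgfx's fence type and the fence->ref naming from the
snippet further down, and assuming writers publish the pointer with
rcu_assign_pointer() while holding the reservation:

static struct vmw_fence_obj *bo_fence_get_rcu(struct ttm_buffer_object *bo)
{
	struct vmw_fence_obj *fence;

	rcu_read_lock();
	/* Writers set sync_obj under the reservation and publish it
	 * with rcu_assign_pointer(); readers never clear it. */
	fence = rcu_dereference(bo->sync_obj);
	/* The fence may already be on its way to destruction; only
	 * take a reference if the refcount is still nonzero. */
	if (fence && !kref_get_unless_zero(&fence->ref))
		fence = NULL;
	rcu_read_unlock();

	/* Non-NULL means the caller now holds a reference and can
	 * wait on the fence with no locks held at all. */
	return fence;
}

The fence itself would then have to be freed with kfree_rcu() (or otherwise
only after a grace period) for this to be safe.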

But my preferred option is getting rid of sync_obj completely and moving to
using reservation_object->fence_shared/exclusive, like the incomplete
proof-of-concept conversion done in nouveau. But then I do need to grab the
reservation lock to touch things, because fences may be set by the i915
driver I share the reservation_object with.
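
For illustration, waiting for idle through the reservation_object would look
roughly like the sketch below. Rough sketch only: fence_excl, fence_shared
and fence_wait() are names from the not-yet-merged fence series, so the
details may still change:

static int bo_wait_idle_resv(struct ttm_buffer_object *bo)
{
	struct reservation_object *resv = bo->resv;
	int i, ret;

	/* Hold the reservation: another driver sharing this
	 * reservation_object (e.g. i915) may be updating the fences. */
	ret = ttm_bo_reserve(bo, true, false, false, NULL);
	if (ret)
		return ret;

	/* Wait for all readers (shared fences)... */
	for (i = 0; i < resv->fence_shared_count; i++) {
		ret = fence_wait(resv->fence_shared[i], true);
		if (ret)
			goto out_unreserve;
	}

	/* ...and for the last writer (exclusive fence). */
	if (resv->fence_excl)
		ret = fence_wait(resv->fence_excl, true);

out_unreserve:
	ttm_bo_unreserve(bo);
	return ret;
}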

Alternatively, could vmwgfx hold a spinlock when decrementing the fence
refcount instead? Then we wouldn't need this in the core, and vmwgfx could
use:

spin_lock(&vmw_fence_lock);
fence = ACCESS_ONCE(bo->sync_obj);
/* Only take a reference if the fence isn't already being freed. */
if (fence && !kref_get_unless_zero(&fence->ref))
	fence = NULL;
spin_unlock(&vmw_fence_lock);

internally in that function, preserving the old semantics but without
unsetting sync_obj if no reservation is held. Full RCU might be slightly
overkill.
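
Fleshed out, the whole wait-for-idle path would then look something like
this sketch (vmw_fence_lock is hypothetical here, and I'm guessing at the
exact vmw_fence_obj_wait() arguments):

static int vmw_bo_wait_idle(struct ttm_buffer_object *bo,
			    bool interruptible)
{
	struct vmw_fence_obj *fence;
	int ret;

	spin_lock(&vmw_fence_lock);
	fence = ACCESS_ONCE(bo->sync_obj);
	if (fence && !kref_get_unless_zero(&fence->ref))
		fence = NULL;
	spin_unlock(&vmw_fence_lock);

	/* No fence (or one about to be destroyed): buffer is idle. */
	if (!fence)
		return 0;

	/* Wait with no locks held, then drop our reference. */
	ret = vmw_fence_obj_wait(fence, DRM_VMW_FENCE_FLAG_EXEC, false,
				 interruptible, 10 * HZ);
	vmw_fence_obj_unreference(&fence);
	return ret;
}

The point being that the spinlock is only held for the pointer dereference
and the refcount bump; the actual wait happens with no locks at all.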

~Maarten


