[Intel-gfx] [PATCH 08/24] drm/i915: Use per object locking in execbuf, v8.
Tvrtko Ursulin
tvrtko.ursulin at linux.intel.com
Mon Apr 27 13:57:27 UTC 2020
On 21/04/2020 11:46, Maarten Lankhorst wrote:
> Now that execbuf submission has been changed slightly to allow us to do
> all pinning in one place, we can simply add ww locking on top of
> struct_mutex. All we have to do is add a separate path for -EDEADLK
> handling, which needs to unpin all GEM BOs before dropping the lock and
> then start over.
>
> This finally allows us to do parallel submission, but because not
> all of the pinning code uses the ww ctx yet, we cannot completely
> drop struct_mutex.
>
> Changes since v1:
> - Keep struct_mutex for now. :(
> Changes since v2:
> - Make sure we always lock the ww context in slowpath.
> Changes since v3:
> - Don't call __eb_unreserve_vma in eb_move_to_gpu now; this can be
> done on normal unlock path.
> - Unconditionally release vmas and context.
> Changes since v4:
> - Rebased on top of struct_mutex reduction.
> Changes since v5:
> - Remove training wheels.
> Changes since v6:
> - Fix accidentally broken -ENOSPC handling.
> Changes since v7:
> - Handle gt buffer pool better.
>
> Signed-off-by: Maarten Lankhorst <maarten.lankhorst at linux.intel.com>
> ---
>  .../gpu/drm/i915/gem/i915_gem_execbuffer.c | 330 ++++++++++--------
>  drivers/gpu/drm/i915/i915_gem.c            |   6 +
>  drivers/gpu/drm/i915/i915_gem.h            |   1 +
>  3 files changed, 195 insertions(+), 142 deletions(-)
>
[snip]
> +static int eb_validate_vmas(struct i915_execbuffer *eb)
> +{
> +	unsigned int i;
> +	int err;
> +
> +	INIT_LIST_HEAD(&eb->unbound);
> +
> +	for (i = 0; i < eb->buffer_count; i++) {
> +		struct drm_i915_gem_exec_object2 *entry = &eb->exec[i];
> +		struct eb_vma *ev = &eb->vma[i];
> +		struct i915_vma *vma = ev->vma;
> +
> +		err = i915_gem_object_lock(vma->obj, &eb->ww);
> +		if (err)
> +			return err;
> +
> +		if (eb_pin_vma(eb, entry, ev)) {
Semi-random place to ask...
Why is it necessary to hold all the object locks across the whole of execbuf?
If at this stage it locked only one object at a time - just to pin each vma and then unlock it - it would be safe to proceed unlocked until the final stage, which, as in today's eb, could then lock all the objects when updating obj->resv. As long as all the VMAs are pinned they are safe for relocations and so on.
I am wondering whether this approach would improve the regressions in some IGTs. Worth a try?
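Something along these lines - a minimal sketch only, where eb_reserve_short_locks(), eb_pin_one() and eb_publish() are hypothetical names, not anything from this patch:

static int eb_reserve_short_locks(struct i915_execbuffer *eb)
{
	unsigned int i;
	int err;

	/* Stage 1: hold each object lock only long enough to pin its vma. */
	for (i = 0; i < eb->buffer_count; i++) {
		struct i915_vma *vma = eb->vma[i].vma;

		err = dma_resv_lock_interruptible(vma->resv, NULL);
		if (err)
			return err;

		err = eb_pin_one(eb, &eb->vma[i]); /* hypothetical helper */
		dma_resv_unlock(vma->resv);
		if (err)
			return err;
	}

	/*
	 * Stage 2: relocations etc. run unlocked; the pinned vmas cannot
	 * be unbound underneath us.
	 *
	 * Stage 3: ww-lock all objects in one go, as today's
	 * eb_move_to_gpu() does, but only to update obj->resv.
	 */
	return eb_publish(eb); /* hypothetical helper */
}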
It probably wouldn't help gem_exec_nop/parallel, since that one is a single object shared between N threads. Not yet sure why that one is hit so hard. Is it just because of the tracking ww_mutex adds on top of a plain mutex? I definitely see more mutex spinning and more sleeping with this series, which is perhaps unexpected:
tip:
real 0m25.103s
user 0m4.743s
sys 1m39.908s
series:
real 0m25.134s
user 0m1.586s
sys 0m32.950s
At the same time, the series has much worse throughput:
tip:
average (individually): 5.505us
rcs0: 2799616 cycles, 7.145us
vecs0: 2839552 cycles, 7.044us
vcs0: 2283520 cycles, 8.759us
bcs0: 1284096 cycles, 15.580us
vcs1: 2404352 cycles, 8.321us
series:
average (individually): 5.553us
bcs0: 481280 cycles, 41.589us
vcs0: 463872 cycles, 43.162us
vecs0: 460800 cycles, 43.452us
vcs1: 461824 cycles, 43.356us
rcs0: 457728 cycles, 43.746us
tip:
3.93% gem_exec_nop [i915] [k] i915_gem_do_execbuffer
3.59% gem_exec_nop [kernel.vmlinux] [k] mutex_spin_on_owner
series:
5.49% gem_exec_nop [kernel.vmlinux] [k] mutex_spin_on_owner
1.03% gem_exec_nop [i915] [k] i915_gem_do_execbuffer
More spinning, more sleeping, less throughput. And with this test it is effectively a single struct_mutex versus a single shared object lock, suggesting ww_mutex is much worse than a plain mutex in the contended case by default - I mean with a single object only. Does that make sense?
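For reference, what the two contended paths boil down to in this test - again just a sketch, with submit() standing in for the real execbuf work:

static void submit(void); /* stand-in for the actual submission */

/* tip: N threads serialize on the single struct_mutex. */
static void submit_tip(struct drm_i915_private *i915)
{
	mutex_lock(&i915->drm.struct_mutex);
	submit();
	mutex_unlock(&i915->drm.struct_mutex);
}

/* series: N threads serialize on the shared object's reservation
 * ww_mutex, each also paying for a per-submission acquire context. */
static void submit_series(struct i915_vma *vma)
{
	struct ww_acquire_ctx ctx;

	ww_acquire_init(&ctx, &reservation_ww_class);
	/* With one object there is a single lock, so -EDEADLK cannot
	 * happen, but every acquisition still goes through the ww
	 * machinery and its context tracking. */
	ww_mutex_lock(&vma->resv->lock, &ctx);
	submit();
	ww_mutex_unlock(&vma->resv->lock);
	ww_acquire_fini(&ctx);
}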
[snip]
>  static int eb_move_to_gpu(struct i915_execbuffer *eb)
>  {
>  	const unsigned int count = eb->buffer_count;
> -	struct ww_acquire_ctx acquire;
> -	unsigned int i;
> +	unsigned int i = count;
>  	int err = 0;
>
> -	ww_acquire_init(&acquire, &reservation_ww_class);
> -
> -	for (i = 0; i < count; i++) {
> -		struct eb_vma *ev = &eb->vma[i];
> -		struct i915_vma *vma = ev->vma;
> -
> -		err = ww_mutex_lock_interruptible(&vma->resv->lock, &acquire);
> -		if (err == -EDEADLK) {
> -			GEM_BUG_ON(i == 0);
> -			do {
> -				int j = i - 1;
> -
> -				ww_mutex_unlock(&eb->vma[j].vma->resv->lock);
> -
> -				swap(eb->vma[i], eb->vma[j]);
> -			} while (--i);
So pulling the deadlocking object to the head of the array is not useful in the new scheme?
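(For what it's worth, the canonical ww_mutex slowpath makes that reordering unnecessary: the acquire context keeps its age ticket across the retry, so after unlocking everything you sleep on the contended lock with ww_mutex_lock_slow() and start over, and the oldest context eventually wins. A generic sketch per Documentation/locking/ww-mutex-design.rst, not the patch's actual code:)

/* Caller has done ww_acquire_init(ctx, &reservation_ww_class). */
static int lock_all_with_backoff(struct ww_acquire_ctx *ctx,
				 struct ww_mutex **locks, unsigned int count)
{
	struct ww_mutex *contended = NULL;
	unsigned int i;
	int err;

retry:
	for (i = 0; i < count; i++) {
		if (locks[i] == contended) {
			/* Already held via ww_mutex_lock_slow() below. */
			contended = NULL;
			continue;
		}

		err = ww_mutex_lock_interruptible(locks[i], ctx);
		if (err) {
			struct ww_mutex *busy = locks[i];

			/* Drop everything we hold so far... */
			while (i--)
				ww_mutex_unlock(locks[i]);
			if (contended)
				ww_mutex_unlock(contended);
			if (err != -EDEADLK)
				return err;

			/* ...we lost to an older context: wait for the
			 * contended lock, then take everything again. */
			ww_mutex_lock_slow(busy, ctx);
			contended = busy;
			goto retry;
		}
	}

	ww_acquire_done(ctx);
	return 0;
}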
Regards,
Tvrtko