[Intel-gfx] [PATCH 08/24] drm/i915: Use per object locking in execbuf, v8.
Tvrtko Ursulin
tvrtko.ursulin at linux.intel.com
Mon Apr 27 13:57:27 UTC 2020
On 21/04/2020 11:46, Maarten Lankhorst wrote:
> Now that execbuf submission has been changed slightly to allow us to do
> all pinning in one place, we can simply add ww locking on top of
> struct_mutex. All we have to do is add a separate path for -EDEADLK
> handling, which needs to unpin all GEM BOs before dropping the lock and
> then start over.
>
> This finally allows us to do parallel submission, but because not
> all of the pinning code uses the ww ctx yet, we cannot completely
> drop struct_mutex.
>
> Changes since v1:
> - Keep struct_mutex for now. :(
> Changes since v2:
> - Make sure we always lock the ww context in slowpath.
> Changes since v3:
> - Don't call __eb_unreserve_vma in eb_move_to_gpu now; this can be
> done on normal unlock path.
> - Unconditionally release vmas and context.
> Changes since v4:
> - Rebased on top of struct_mutex reduction.
> Changes since v5:
> - Remove training wheels.
> Changes since v6:
> - Fix accidentally broken -ENOSPC handling.
> Changes since v7:
> - Handle gt buffer pool better.
>
> Signed-off-by: Maarten Lankhorst <maarten.lankhorst at linux.intel.com>
> ---
>  .../gpu/drm/i915/gem/i915_gem_execbuffer.c | 330 ++++++++++--------
>  drivers/gpu/drm/i915/i915_gem.c            |   6 +
>  drivers/gpu/drm/i915/i915_gem.h            |   1 +
>  3 files changed, 195 insertions(+), 142 deletions(-)
>
[snip]
> +static int eb_validate_vmas(struct i915_execbuffer *eb)
> +{
> +	unsigned int i;
> +	int err;
> +
> +	INIT_LIST_HEAD(&eb->unbound);
> +
> +	for (i = 0; i < eb->buffer_count; i++) {
> +		struct drm_i915_gem_exec_object2 *entry = &eb->exec[i];
> +		struct eb_vma *ev = &eb->vma[i];
> +		struct i915_vma *vma = ev->vma;
> +
> +		err = i915_gem_object_lock(vma->obj, &eb->ww);
> +		if (err)
> +			return err;
> +
> +		if (eb_pin_vma(eb, entry, ev)) {
Semi-random place to ask...
Why is it necessary to hold all the object locks across the whole of execbuf?
If at this stage it locked only one object at a time - just to pin each vma and then unlock it - it would be safe to proceed unlocked until the final stage, which, as in today's eb, could then lock all the objects when updating obj->resv. As long as all the VMAs are pinned they are safe for relocations and so on.
I am wondering whether this approach would improve the regressions in some IGTs. Worth a try?
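Something along these lines - a minimal sketch only, where eb_reserve_short_locks(), eb_pin_one() and eb_publish() are hypothetical names, not anything from this patch:

static int eb_reserve_short_locks(struct i915_execbuffer *eb)
{
	unsigned int i;
	int err;

	/* Stage 1: hold each object lock only long enough to pin its vma. */
	for (i = 0; i < eb->buffer_count; i++) {
		struct i915_vma *vma = eb->vma[i].vma;

		err = dma_resv_lock_interruptible(vma->resv, NULL);
		if (err)
			return err;

		err = eb_pin_one(eb, &eb->vma[i]); /* hypothetical helper */
		dma_resv_unlock(vma->resv);
		if (err)
			return err;
	}

	/*
	 * Stage 2: relocations etc. run unlocked; the pinned vmas cannot
	 * be unbound underneath us.
	 *
	 * Stage 3: ww-lock all objects in one go, as today's
	 * eb_move_to_gpu() does, but only to update obj->resv.
	 */
	return eb_publish(eb); /* hypothetical helper */
}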
It probably wouldn't help gem_exec_nop/parallel, since that one is a single object shared between N threads. Not yet sure why that one is hit so hard. Is it just because of the tracking ww_mutex adds on top of a plain mutex? I definitely see more mutex spinning and more sleeping with this series, which is perhaps unexpected:
tip:
real 0m25.103s
user 0m4.743s
sys 1m39.908s
series:
real 0m25.134s
user 0m1.586s
sys 0m32.950s
At the same time, the series has much worse throughput:
tip:
average (individually): 5.505us
rcs0: 2799616 cycles, 7.145us
vecs0: 2839552 cycles, 7.044us
vcs0: 2283520 cycles, 8.759us
bcs0: 1284096 cycles, 15.580us
vcs1: 2404352 cycles, 8.321us
series:
average (individually): 5.553us
bcs0: 481280 cycles, 41.589us
vcs0: 463872 cycles, 43.162us
vecs0: 460800 cycles, 43.452us
vcs1: 461824 cycles, 43.356us
rcs0: 457728 cycles, 43.746us
tip:
3.93% gem_exec_nop [i915] [k] i915_gem_do_execbuffer
3.59% gem_exec_nop [kernel.vmlinux] [k] mutex_spin_on_owner
series:
5.49% gem_exec_nop [kernel.vmlinux] [k] mutex_spin_on_owner
1.03% gem_exec_nop [i915] [k] i915_gem_do_execbuffer
More spinning, more sleeping, less throughput. And with this test it is effectively a single struct_mutex versus a single shared object lock, suggesting ww_mutex is much worse than a plain mutex in the contended case by default - I mean with a single object only. Does that make sense?
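For reference, what the two contended paths boil down to in this test - again just a sketch, with submit() standing in for the real execbuf work:

static void submit(void); /* stand-in for the actual submission */

/* tip: N threads serialize on the single struct_mutex. */
static void submit_tip(struct drm_i915_private *i915)
{
	mutex_lock(&i915->drm.struct_mutex);
	submit();
	mutex_unlock(&i915->drm.struct_mutex);
}

/* series: N threads serialize on the shared object's reservation
 * ww_mutex, each also paying for a per-submission acquire context. */
static void submit_series(struct i915_vma *vma)
{
	struct ww_acquire_ctx ctx;

	ww_acquire_init(&ctx, &reservation_ww_class);
	/* With one object there is a single lock, so -EDEADLK cannot
	 * happen, but every acquisition still goes through the ww
	 * machinery and its context tracking. */
	ww_mutex_lock(&vma->resv->lock, &ctx);
	submit();
	ww_mutex_unlock(&vma->resv->lock);
	ww_acquire_fini(&ctx);
}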
[snip]
>  static int eb_move_to_gpu(struct i915_execbuffer *eb)
>  {
>  	const unsigned int count = eb->buffer_count;
> -	struct ww_acquire_ctx acquire;
> -	unsigned int i;
> +	unsigned int i = count;
>  	int err = 0;
>
> -	ww_acquire_init(&acquire, &reservation_ww_class);
> -
> -	for (i = 0; i < count; i++) {
> -		struct eb_vma *ev = &eb->vma[i];
> -		struct i915_vma *vma = ev->vma;
> -
> -		err = ww_mutex_lock_interruptible(&vma->resv->lock, &acquire);
> -		if (err == -EDEADLK) {
> -			GEM_BUG_ON(i == 0);
> -			do {
> -				int j = i - 1;
> -
> -				ww_mutex_unlock(&eb->vma[j].vma->resv->lock);
> -
> -				swap(eb->vma[i], eb->vma[j]);
> -			} while (--i);
So pulling the deadlocking object to the head of the array is not useful in the new scheme?
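(For what it's worth, the canonical ww_mutex slowpath makes that reordering unnecessary: the acquire context keeps its age ticket across the retry, so after unlocking everything you sleep on the contended lock with ww_mutex_lock_slow() and start over, and the oldest context eventually wins. A generic sketch per Documentation/locking/ww-mutex-design.rst, not the patch's actual code:)

/* Caller has done ww_acquire_init(ctx, &reservation_ww_class). */
static int lock_all_with_backoff(struct ww_acquire_ctx *ctx,
				 struct ww_mutex **locks, unsigned int count)
{
	struct ww_mutex *contended = NULL;
	unsigned int i;
	int err;

retry:
	for (i = 0; i < count; i++) {
		if (locks[i] == contended) {
			/* Already held via ww_mutex_lock_slow() below. */
			contended = NULL;
			continue;
		}

		err = ww_mutex_lock_interruptible(locks[i], ctx);
		if (err) {
			struct ww_mutex *busy = locks[i];

			/* Drop everything we hold so far... */
			while (i--)
				ww_mutex_unlock(locks[i]);
			if (contended)
				ww_mutex_unlock(contended);
			if (err != -EDEADLK)
				return err;

			/* ...we lost to an older context: wait for the
			 * contended lock, then take everything again. */
			ww_mutex_lock_slow(busy, ctx);
			contended = busy;
			goto retry;
		}
	}

	ww_acquire_done(ctx);
	return 0;
}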
Regards,
Tvrtko