[Intel-gfx] [PATCH] drm/i915/selftests: Allow engine reset failure to do a GT reset in hangcheck selftest

Tue Oct 26 19:48:24 UTC 2021

On 10/11/2021 16:47, Matthew Brost wrote:
> The hangcheck selftest blocks per engine resets by setting magic bits in
> the reset flags. This is incorrect for GuC submission because if the GuC
> fails to reset an engine we would like to do a full GT reset. Do no set
> these magic bits when using GuC submission.
>
> Side note this lockless algorithm with magic bits to block resets really
> should be ripped out.
As a first step, I am seeing a pointless BUILD_BUG but no BUILD_BUG at 
all for what really needs to be verified. Specifically, in 
intel_gt_handle_error, inside the engine reset loop, there is:
                         BUILD_BUG_ON(I915_RESET_MODESET >= 
I915_RESET_ENGINE);

Given that the above two values are explicit #defines of '1' and '2'. 
I'm not seeing any value to this assert. On the other hand, what I am 
not seeing anywhere is an assert that 'I915_RESET_ENGINE + max_engine_id 
< I915_WEDGED_ON_INIT'. That being the thing that would actually go 
horribly wrong if the engine count increased too far. Seems like there 
should be one of those in intel_engines_init_mmio, using 
ARRAY_SIZE(intel_engines) as the max id.

It looks like 'busy-reset' and 'reset-idle' parts of 'igt_ctx_sseu' in 
gem/selftests/i915_gem_context.c also mess around with these flags and 
then try to issue a manual engine reset. Presumably those tests are also 
going to have issues with GuC submission.

The workarounds, mocs and reset selftests also do strange things. Those 
ones did get updated to support GuC submission by not attempting manual 
engine resets but using the GuC based hang detection instead. However, 
it seems like they would also suffer the same deadlock scenario if the 
GuC based reset were to fail.

I'm wondering if the better fix is to remove the use of the 
I915_RESET_ENGINE flags completely when using GuC submission. So far as 
I can tell, they are only used (outside of selftest hackery) in 
intel_gt_handle_error to guard against multiple concurrent resets within 
i915. Guarding against multiple concurrent resets in GuC is the GuC's 
job. That leaves GuC based engine reset concurrent with i915 based full 
GT reset. But that is fine because the full GT reset toasts all engines 
AND the GuC. So it doesn't matter what GuC might or might not have been 
doing at the time. The only other issue is multiple concurrent full GT 
resets, but that is protected against by I915_RESET_BACKOFF.

So my thought is to add an 'if(!guc_submission)' wrapper around the set 
and clear of the reset flags immediately before/after the call to 
intel_gt_reset_global().

Fixing it there means the selftests can do what they like with the flags 
without causing problems for GuC submission. It also means being one 
step closer to removing the dodgy lockless locking completely, at least 
when using GuC submission.

John.

>
> Signed-off-by: Matthew Brost <matthew.brost at intel.com>
> ---
>   drivers/gpu/drm/i915/gt/selftest_hangcheck.c | 12 ++++++++----
>   1 file changed, 8 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/selftest_hangcheck.c b/drivers/gpu/drm/i915/gt/selftest_hangcheck.c
> index 7e2d99dd012d..90a03c60c80c 100644
> --- a/drivers/gpu/drm/i915/gt/selftest_hangcheck.c
> +++ b/drivers/gpu/drm/i915/gt/selftest_hangcheck.c
> @@ -734,7 +734,8 @@ static int __igt_reset_engine(struct intel_gt *gt, bool active)
>   		reset_engine_count = i915_reset_engine_count(global, engine);
>   
>   		st_engine_heartbeat_disable(engine);
> -		set_bit(I915_RESET_ENGINE + id, &gt->reset.flags);
> +		if (!using_guc)
> +			set_bit(I915_RESET_ENGINE + id, &gt->reset.flags);
>   		count = 0;
>   		do {
>   			struct i915_request *rq = NULL;
> @@ -824,7 +825,8 @@ static int __igt_reset_engine(struct intel_gt *gt, bool active)
>   			if (err)
>   				break;
>   		} while (time_before(jiffies, end_time));
> -		clear_bit(I915_RESET_ENGINE + id, &gt->reset.flags);
> +		if (!using_guc)
> +			clear_bit(I915_RESET_ENGINE + id, &gt->reset.flags);
>   		st_engine_heartbeat_enable(engine);
>   		pr_info("%s: Completed %lu %s resets\n",
>   			engine->name, count, active ? "active" : "idle");
> @@ -1042,7 +1044,8 @@ static int __igt_reset_engines(struct intel_gt *gt,
>   		yield(); /* start all threads before we begin */
>   
>   		st_engine_heartbeat_disable_no_pm(engine);
> -		set_bit(I915_RESET_ENGINE + id, &gt->reset.flags);
> +		if (!using_guc)
> +			set_bit(I915_RESET_ENGINE + id, &gt->reset.flags);
>   		do {
>   			struct i915_request *rq = NULL;
>   			struct intel_selftest_saved_policy saved;
> @@ -1165,7 +1168,8 @@ static int __igt_reset_engines(struct intel_gt *gt,
>   			if (err)
>   				break;
>   		} while (time_before(jiffies, end_time));
> -		clear_bit(I915_RESET_ENGINE + id, &gt->reset.flags);
> +		if (!using_guc)
> +			clear_bit(I915_RESET_ENGINE + id, &gt->reset.flags);
>   		st_engine_heartbeat_enable_no_pm(engine);
>   
>   		pr_info("i915_reset_engine(%s:%s): %lu resets\n",