[igt-dev] [PATCH i-g-t] i915/hangman: Fix gt hang/error tests

Thu Aug 4 17:34:32 UTC 2022

On Thu, Aug 04, 2022 at 06:57:32PM +0200, Kamil Konieczny wrote:
>Hi Umesh,
>
>On 2022-07-29 at 12:27:41 -0700, Nerlige Ramappa, Umesh wrote:
>> From: Umesh Nerlige Ramappa <umesh.nerlige.ramappa at intel.com>
>>
>> gt-engine-hang and gt-engine-error tests were still using reset=2
>> setting so that ended up just doing an engine reset. Fix the tests so
>> that they actually do the gt specific reset and set the expectations
>> correctly.
>>
>> In some rare failures, some of the background spinners show up as
>> innocent and keep spinning after the hang recovery. This can only happen
>> if the spinners did not start for some reason. Check that the spinners
>> actually started before submitting a hanging batch.
>
>imho this is a fix, it is worth to split into separate patch.
>
>>
>> For engine specific hang, only one context is marked guilty, but for a
>> gt hang all contexts are marked guilty. Check for different expected
>> behavior for engine vs. gt reset.
>
>This is true only for GuC submission on new platforms.
>When execlist is used, these tests are the same as engine-engine-*
>as you noted at beginning.
>
>>
>> v2:
>> - gt-reset resets all contexts on all engines. The execlist implementation
>>   of gt-engine-* tests expected that a preemptible background spinner
>>   running on the target engine should be marked innocent. While i915 can
>>   mark such a context for execlist mode, GuC scheduling does not guarantee
>>   that the background spinner can be marked as innocent. Since the state
>>   of the background spinner depends on the scheduling backend, do no
>>   validate the state of the background spinner for the target engine.
>>
>> v3:
>> - omit checking background spinner state only for gt-reset case
>>
>> v4: (Kamil)
>> - check has_gpu_reset and gem_scheduler_has_preemption for the
>>   gt-engine-hang and gt-engine-error tests.
>>
>> Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa at intel.com>
>> ---
>>  tests/i915/i915_hangman.c | 75 +++++++++++++++++++++++++++++++--------
>>  1 file changed, 60 insertions(+), 15 deletions(-)
>>
>> diff --git a/tests/i915/i915_hangman.c b/tests/i915/i915_hangman.c
>> index c7d69fdd..bdf47bc0 100644
>> --- a/tests/i915/i915_hangman.c
>> +++ b/tests/i915/i915_hangman.c
>> @@ -295,9 +295,15 @@ static void context_unban(int fd, unsigned ctx)
>>  	gem_context_set_param(fd, &param);
>>  }
>>
>> +enum reset_type {
>> +	GT_RESET = 1,
>> +	ENGINE_RESET = 2,
>> +};
>> +
>>  static void
>>  test_engine_hang(const intel_ctx_t *ctx,
>> -		 const struct intel_execution_engine2 *e, unsigned int flags)
>> +		 const struct intel_execution_engine2 *e, unsigned int flags,
>> +		 enum reset_type reset)
>
>Using bool as param are no good, imho we should split this into
>two functions, see below after your comment on
>egine-engine-hang.
>
>>  {
>>  	const struct intel_execution_engine2 *other;
>>  	const intel_ctx_t *local_ctx[GEM_MAX_ENGINES];
>> @@ -309,6 +315,7 @@ test_engine_hang(const intel_ctx_t *ctx,
>>  	igt_skip_on(flags & IGT_SPIN_INVALID_CS &&
>>  		    gem_engine_has_cmdparser(device, &ctx->cfg, e->flags));
>>
>> +	igt_assert(reset == GT_RESET || reset == ENGINE_RESET);
>>  	/*
>>  	 * Fill all engines with background load.
>>  	 * This verifies that independent engines are unaffected and gives
>> @@ -324,7 +331,10 @@ test_engine_hang(const intel_ctx_t *ctx,
>>  				      .ahnd = ahndN,
>>  				      .ctx = local_ctx[num_ctx],
>>  				      .engine = other->flags,
>> -				      .flags = IGT_SPIN_FENCE_OUT);
>> +				      .flags = IGT_SPIN_FENCE_OUT |
>> +					       IGT_SPIN_POLL_RUN);
>> +		igt_spin_busywait_until_started(spin);
>> +
>
>This is a fix, please split it to separate patch.
>
>>  		num_ctx++;
>>
>>  		igt_list_move(&spin->link, &list);
>> @@ -344,14 +354,44 @@ test_engine_hang(const intel_ctx_t *ctx,
>>  	igt_assert_eq(sync_fence_status(spin->out_fence), -EIO);
>>  	igt_spin_free(device, spin);
>>
>> -	/* But no other engines/clients should be affected */
>> -	igt_list_for_each_entry_safe(spin, next, &list, link) {
>> +	/*
>> +	 * engine-engine-hang: Other engines/clients should not be affected for
>> +	 * engine reset, so innocent contexts complete successfully once the
>> +	 * spinner is ended.
>> +	 *
>> +	 * gt-engine-hang: All engines/clients are guilty and complete with a
>> +	 * -EIO fence status, however the background task that was submitted to
>> +	 * the target engine is innocent and is expected to complete
>> +	 * successfully.
>> +	 */
>
>From this description maybe it it worth to split this function
>into two separate functions : gt_engine_hang and engine_engine_hang.
>Latter one will be old and in former one we could add different
>bahaviour. I will add Janusz to Cc for this topic.
>
>> +	igt_list_for_each_entry_safe_reverse(spin, next, &list, link) {
>
>Why do you change this into _reverse ?

Just retaining whatever was used earlier.

>
>> +		bool innocent = reset == ENGINE_RESET;
>
>When we split functionality then we can remove this check.

yes.

>
>> +		int expect = innocent ? 1 : -EIO;
>> +
>>  		ahndN = spin->opts.ahnd;
>> -		igt_assert(sync_fence_wait(spin->out_fence, 0) == -ETIME);
>> -		igt_spin_end(spin);
>> +		if (innocent) {
>> +			igt_assert_eq(sync_fence_wait(spin->out_fence, 0), -ETIME);
>> +			igt_spin_end(spin);
>> +		}
>>
>> -		igt_assert(sync_fence_wait(spin->out_fence, 500) == 0);
>> -		igt_assert_eq(sync_fence_status(spin->out_fence), 1);
>> +		if (reset == GT_RESET && spin->opts.engine == e->flags) {
>---------------------------------------  ^
>Here you will check execlist or GuC ? This shows we need to split
>functionality to avoid logic errors and further simplify our
>checks.
>
>> +			/*
>> +			 * gt-reset resets everything. The execlist
>> +			 * implementation of gt-engine-* tests expect that a
>> +			 * preemptible background spinner running on the target
>> +			 * engine should be marked innocent. While i915 can mark
>> +			 * such a context innocent for execlist mode, GuC
>> +			 * scheduling does not guarantee that the background
>> +			 * spinner can be marked as innocent. Since the state of
>> +			 * the background spinner depends on the scheduling
>> +			 * backend, do no validate the state of the background
>> +			 * spinner for the target engine when doing a gt-reset.
>> +			 */
>> +			igt_spin_end(spin);
>> +		} else {
>> +			igt_assert_eq(sync_fence_wait(spin->out_fence, 500), 0);
>> +			igt_assert_eq(sync_fence_status(spin->out_fence), expect);
>> +		}
>>  		igt_spin_free(device, spin);
>>  		put_ahnd(ahndN);
>>  	}
>> @@ -439,6 +479,10 @@ static void do_tests(const char *name, const char *prefix,
>>  {
>>  	const struct intel_execution_engine2 *e;
>>  	char buff[256];
>> +	enum reset_type reset = ENGINE_RESET;
>> +
>> +	if (!strncmp(prefix, "gt", 2))
>> +		reset = GT_RESET;
>>
>>  	snprintf(buff, sizeof(buff), "Per engine error capture (%s reset)", name);
>>  	igt_describe(buff);
>> @@ -461,13 +505,13 @@ static void do_tests(const char *name, const char *prefix,
>>  		};
>>
>>  		igt_require(gem_scheduler_has_preemption(device));
>> -		igt_params_set(device, "reset", "%u", -1);
>>                  ioctl(device, DRM_IOCTL_I915_GETPARAM, &gp);
>> -		igt_require(has_gpu_reset > 1);
>> +		igt_require((reset == GT_RESET && has_gpu_reset == 1) ||
>> +			    (reset == ENGINE_RESET && has_gpu_reset > 1));
>
>Move above checks into separate function (avoid code duplication),
>so here:
>		check_reset(reset);
>and
>check_reset(enum reset_type reset)
>{
>	...put data declarations here...
>
>	ioctl(device, DRM_IOCTL_I915_GETPARAM, &gp);
>	igt_require((reset == GT_RESET && has_gpu_reset == 1) ||
>		    (reset == ENGINE_RESET && has_gpu_reset > 1));
>}
>
>>
>>  		for_each_ctx_engine(device, ctx, e) {
>>  			igt_dynamic_f("%s", e->name)
>> -				test_engine_hang(ctx, e, 0);
>> +				test_engine_hang(ctx, e, 0, reset);
>
>It is better to make this into:
>				gt_reset ? gt_engine_hang(ctx, e, 0) : engine_engine_hang(ctx, e, 0);
>
>and before that set:
>
>	bool gt_reset = reset == GT_RESET && gem_using_guc_submission(device) &&
>				 intel_gen(intel_get_drm_devid(i915)) >= 12;
>
>imho we want to check both old execlist and new GuC behaviour.
>
>>  		}
>>  	}
>>
>> @@ -481,13 +525,14 @@ static void do_tests(const char *name, const char *prefix,
>>  			.value = &has_gpu_reset,
>>  		};
>>
>> -		igt_params_set(device, "reset", "%u", -1);
>> -		ioctl(device, DRM_IOCTL_I915_GETPARAM, &gp);
>> -		igt_require(has_gpu_reset > 1);
>> +		igt_require(gem_scheduler_has_preemption(device));
>
>Why is it needed here ?

The reset mechanisms rely on periodic preemption. If preemption is not 
available, then the resets will not work, so we need to check this. This 
is also pulled from older code. Without this SNB CI fails for these 
tests. 

>
>> +                ioctl(device, DRM_IOCTL_I915_GETPARAM, &gp);
>-- ^
>Use tabs, not spaces here.
>
>> +		igt_require((reset == GT_RESET && has_gpu_reset == 1) ||
>> +			    (reset == ENGINE_RESET && has_gpu_reset > 1));
>
>Instead of above use check_reset(reset) here.
>
>>
>>  		for_each_ctx_engine(device, ctx, e) {
>>  			igt_dynamic_f("%s", e->name)
>> -				test_engine_hang(ctx, e, IGT_SPIN_INVALID_CS);
>> +				test_engine_hang(ctx, e, IGT_SPIN_INVALID_CS, reset);
>
>Same here, call gt_engine_hang or engine_engine_hang instead
>(see above). What happens when there are no preemption ?
>
>>  		}
>>  	}
>>  }

overall, I agree that the test would need to be refactored to separate 
out the gt/engine tests as well as execlist/guc checks. That would make 
it better to maintain. I would do that as a follow up since this test is 
still needed for BAT.

Thanks,
Umesh

>> --
>> 2.36.1
>>
>
>Regards,
>Kamil
>