[igt-dev] [PATCH i-g-t v3 2/2] tests/i915/gem_ctx_exec: Skip test for shared reset domain

John Harrison john.c.harrison at intel.com
Wed Jan 19 19:25:34 UTC 2022


On 1/17/2022 17:54, Dixit, Ashutosh wrote:
> On Mon, 17 Jan 2022 00:57:00 -0800, <priyanka.dandamudi at intel.com> wrote:
>> @@ -292,11 +293,15 @@ static void nohangcheck_hostile(int i915)
>> 	ahnd = get_reloc_ahnd(i915, ctx->id);
>>
>> 	igt_require(__enable_hangcheck(dir, false));
>> +	has_sh_do = has_shared_reset_domain(i915, ctx);
>>
>> 	for_each_ctx_engine(i915, ctx, e) {
>> 		igt_spin_t *spin;
>> 		int new;
>>
>> +		if (has_sh_do && (e->class == I915_ENGINE_CLASS_RENDER ||
>> +			    e->class == I915_ENGINE_CLASS_COMPUTE))
>> +			continue;
> If we need to do this, let's not try to overly optimize things (which will
> clutter up the code since has_shared_reset_domain() may get called from
> multiple places) and just pass the engine into the function, something
> like:
>
> bool has_shared_reset_domain(int fd, struct intel_execution_engine2 *e2,
> 			     const intel_ctx_t *ctx)
> {
> 	const struct intel_execution_engine2 *e;
> 	int count = 0;
>
> 	if (!(e2->class == I915_ENGINE_CLASS_RENDER ||
> 		e2->class == I915_ENGINE_CLASS_COMPUTE))
> 		return false;
>
> 	for_each_ctx_engine(fd, ctx, e)
> 		if (e->class == I915_ENGINE_CLASS_RENDER ||
> 			e->class == I915_ENGINE_CLASS_COMPUTE)
> 		count++;
>
> 	return count >= 2;
> }
>
> And then in the loop above just have:
>
> 	if (has_shared_reset_domain(fd, e, ctx))
> 		continue;
>
> Also, I may be wrong, but it looks to me that this test should pass even if
> we have shared resets, the test is not even using engine resets. What
> failure are we seeing and on which platforms and with how many gt's? Does
> the test fail only on DG2 or also on other products with RCS + CCS? Can you
> paste the output after running the failing test?
>
> Any idea anyone why this test would fail with a shared reset domain? Could
> it be failing due to a different reason than a shared reset domain as
> people pointed out on previous versions of the patch? In that case this
> patch is incorrect.

Presumably the test is actually testing context persistence? If so, I 
would say the best option right now is to add an 
'is_persistance_fixed()' helper which currently returns false on modern 
platforms and causes any persistence related test to skip. The KMD 
implementation of persistence is broken. It vaguely hangs together for 
execlist mode with no dependent engines but it is totally broken for GuC 
submission (because part of the implementation is buried deep within the 
execlist backend) and the current design can never work for engines with 
dependent resets.

Once persistence has been completely re-written (and greatly simplified) 
in the KMD, we can then review the persistence related IGTs and decide 
whether each is still valid or is now pointless. I suspect this one 
would count as pointless. It is called 'nohangcheck' and seems to be 
testing persistence behaviour when the heartbeat is explicitly disabled. 
Part of the persistence re-write is to remove any dependence on the 
state of the heartbeat.

Without looking at this specific test in detail, I can't say exactly 
what might be happening here, but it seems likely that it is 
specifically a DG2 issue (and thus should only skip on DG2). The 
persistence implementation involves sending super high priority pings to 
engines. Assuming the workload on the target engine is pre-emptible, the 
ping should make it through. But if this is a DG2 and the target engine 
is CCS# then that pre-emption requires a pre-emption on RCS as well. If 
the test has put a non-pre-emtible background workload on RCS, then RCS 
is never going to switch, thus CCS can't switch either. Hence 
pre-emption timeouts, engine resets and test failures. In this case, the 
failure is not even specifically related to dependent engine resets.

One option is to make all the workloads pre-emptible but there is an 
argument that the point of the test is to ensure no interference at all 
including unwanted pre-emptions. Thus, the test is correctly failing 
because the hardware cannot provide context isolation.

John.



More information about the igt-dev mailing list