[igt-dev] [PATCH i-g-t v2 1/2] tests/gem_exec_fence: Check stored values only for valid workloads
Janusz Krzysztofik
janusz.krzysztofik at linux.intel.com
Fri Jul 29 08:24:04 UTC 2022
Hi Karolina,
On Friday, 29 July 2022 09:38:43 CEST Karolina Drobnik wrote:
> Hi Janusz,
>
> Thanks a lot for taking a look at the patch.
>
> On 28.07.2022 18:56, Janusz Krzysztofik wrote:
> > On Tuesday, 26 July 2022 12:13:11 CEST Karolina Drobnik wrote:
> >> From: Chris Wilson <chris at chris-wilson.co.uk>
> >>
> >> test_fence_await verifies that a fence used to pipeline work signals
> >> correctly. The await-hang and nb-await-hang test cases inject a GPU
> >> hang, which causes an erroneous state, meaning the fence will be
> >> signaled without execution. The error causes an instant reset of the
> >> command streamer for the hanging workload. This revealed a problem
> >> with how we verify the fence state and results. The test assumes that
> >> the error notification happens after a hang is declared, which takes
> >> longer than submitting the next set of fences, making the test pass
> >> every time. With the immediate reset, this might not happen, so the
> >> assertion fails, as there are no active fences in the GPU hang case.
> >>
> >> Move the check for an active fence to the path for non-hanging
> >> workloads, and verify results only in this scenario.
> >>
> >> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> >> Signed-off-by: Karolina Drobnik <karolina.drobnik at intel.com>
> >> ---
> >> tests/i915/gem_exec_fence.c | 14 ++++++++------
> >> 1 file changed, 8 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/tests/i915/gem_exec_fence.c b/tests/i915/gem_exec_fence.c
> >> index d46914c2..260aa82c 100644
> >> --- a/tests/i915/gem_exec_fence.c
> >> +++ b/tests/i915/gem_exec_fence.c
> >> @@ -350,18 +350,20 @@ static void test_fence_await(int fd, const intel_ctx_t *ctx,
> >> /* Long, but not too long to anger preemption disable checks */
> >> usleep(50 * 1000); /* 50 ms, typical preempt reset is 150+ms */
> >>
> >> - /* Check for invalidly completing the task early */
> >> - igt_assert(fence_busy(spin->out_fence));
> >> - for (int n = 0; n < i; n++)
> >> - igt_assert_eq_u32(out[n], 0);
> >> + if ((flags & HANG) == 0) {
> >> + /* Check for invalidly completing the task early */
> >> + igt_assert(fence_busy(spin->out_fence));
> >> + for (int n = 0; n < i; n++)
> >> + igt_assert_eq_u32(out[n], 0);
> >
> > AFAICU, in the 'hang' variant of the scenario we skip the above asserts
> > because the spin batch could have already hanged, then its out fence
> > already signalled and store batches waiting for that signal already
> > executed. If that's the case, how does this variant of the
> > gem_exec_fence test assert that the fence actually worked as expected?
>
> With this change, yes, we would skip them. Still, the store batches
> wouldn't be executed, as we reset the CS on hang as a part of the error
> handling. For valid jobs, we expect to (1) see an active fence at the
> beginning of the request, (2) get a signaled fence after the wait, (3)
> store out[i] == i. With a hang, (1) and (3) would be false.
>
> In this particular loop, we could have garbage here with a hang, not 0s
> (although, from my tests, it looks like they are zeroed, but maybe I got
> lucky).
OK, so I missed the fact that the store batches won't be executed at all due
to the reset of the whole command stream, which also kills those batches. But
my question is still valid: once we omit those checks as invalid from the
*await-hang variants, how do those variants still exercise fencing? IOW, how
are those variants supposed to ever fail should something be wrong with the
i915 implementation of fencing specifically?
>
> >>
> >> - if ((flags & HANG) == 0)
> >> igt_spin_end(spin);
> >> + }
> >>
> >> igt_waitchildren();
> >>
> >> gem_set_domain(fd, scratch, I915_GEM_DOMAIN_GTT, 0);
> >> - while (i--)
> >> + igt_assert(!fence_busy(spin->out_fence));
> >
> > We only check that the out fence of the presumably hanged spin batch
> > no longer blocks execution of the store batches.
>
> This check applies to all workloads; all of them should be done with
> their work at this point.
OK, but since that's the only explicit assert in the *-hang processing path,
does it tell us anything about fencing working or not? I think it doesn't,
and unless I'm wrong, I think such scenarios hardly belong in
gem_exec_fence. Otherwise, I think we should at least add descriptions of
those subtests, providing some information on what is actually exercised.
Thanks,
Janusz
>
> >> + while ((flags & HANG) == 0 && i--)
> >
> > Besides, why don't we at least assert successful results of the store
> > batches? Do we expect them not to have done their job correctly when
> > completed after the hang of the spin batch has occurred?
>
> We don't expect them to store anything meaningful, because we get a
> reset. So, this check only applies to well-behaved jobs.
>
> All the best,
> Karolina
>
> > Am I missing something?
> >
> > Thanks,
> > Janusz
> >
> >
> >> igt_assert_eq_u32(out[i], i);
> >> munmap(out, 4096);
> >>