[igt-dev] [PATCH i-g-t v2 1/2] tests/gem_exec_fence: Check stored values only for valid workloads

Fri Jul 29 15:23:01 UTC 2022

On Friday, 29 July 2022 13:32:37 CEST Karolina Drobnik wrote:
> Hi Janusz,
> 
> On 29.07.2022 10:24, Janusz Krzysztofik wrote:
> > Hi Karolina,
> > 
> > On Friday, 29 July 2022 09:38:43 CEST Karolina Drobnik wrote:
> >> Hi Janusz,
> >>
> >> Thanks a lot for taking a look at the patch.
> >>
> >> On 28.07.2022 18:56, Janusz Krzysztofik wrote:
> >>> On Tuesday, 26 July 2022 12:13:11 CEST Karolina Drobnik wrote:
> >>>> From: Chris Wilson <chris at chris-wilson.co.uk>
> >>>>
> >>>> test_fence_await verifies if a fence used to pipeline work signals
> >>>> correctly. await-hang and nb-await-hang test cases inject GPU hang,
> >>>> which causes an erroneous state, meaning the fence will be signaled
> >>>> without execution. The error causes an instant reset of the command
> >>>> streamer for the hanging workload. This revealed a problem with how
> >>>> we verify the fence state and results. The test assumes that the
> >>>> error notification happens after a hang is declared, which takes
> >>>> longer than submitting the next set of fences, making the test pass
> >>>> every time. With the immediate reset, this might not happen, so the
> >>>> assertion fails, as there are no active fences in the GPU hang case.
> >>>>
> >>>> Move the check for active fence to the path for non-hanging workload,
> >>>> and verify results only in this scenario.
> >>>>
> >>>> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> >>>> Signed-off-by: Karolina Drobnik <karolina.drobnik at intel.com>
> >>>> ---
> >>>>    tests/i915/gem_exec_fence.c | 14 ++++++++------
> >>>>    1 file changed, 8 insertions(+), 6 deletions(-)
> >>>>
> >>>> diff --git a/tests/i915/gem_exec_fence.c b/tests/i915/gem_exec_fence.c
> >>>> index d46914c2..260aa82c 100644
> >>>> --- a/tests/i915/gem_exec_fence.c
> >>>> +++ b/tests/i915/gem_exec_fence.c
> >>>> @@ -350,18 +350,20 @@ static void test_fence_await(int fd, const
> > intel_ctx_t
> >>> *ctx,
> >>>>    	/* Long, but not too long to anger preemption disable checks */
> >>>>    	usleep(50 * 1000); /* 50 ms, typical preempt reset is 150+ms */
> >>>>    
> >>>> -	/* Check for invalidly completing the task early */
> >>>> -	igt_assert(fence_busy(spin->out_fence));
> >>>> -	for (int n = 0; n < i; n++)
> >>>> -		igt_assert_eq_u32(out[n], 0);
> >>>> +	if ((flags & HANG) == 0) {
> >>>> +		/* Check for invalidly completing the task early */
> >>>> +		igt_assert(fence_busy(spin->out_fence));
> >>>> +		for (int n = 0; n < i; n++)
> >>>> +			igt_assert_eq_u32(out[n], 0);
> >>>
> >>> AFAICU, in the 'hang' variant of the scenario we skip the above asserts
> >>> because the spin batch could have already hanged, then its out fence
> > already
> >>> signalled and store batches waiting for that signal already executed.  If
> >>> that's the case, how do this variant of gem_exec_fence test asserts that
> > the
> >>> fence actually worked as expected?
> >>
> >> With this change, yes, we would skip them. Still, the store batches
> >> wouldn't be executed, as we reset the CS on hang as a part of the error
> >> handling. For valid jobs, we expect to (1) see an active fence at the
> >> beginning of the request, (2) get a signaled fence after the wait, (3)
> >> store out[i] == i. With a hang, (1) and (3) would be false.
> >>
> >> In this particular loop, we could have garbage here with hang, not 0s
> >> (although, from my tests it looks like they are zeroed, but maybe I got
> >> lucky)
> > 
> > OK, so I missed the fact that the store batches won't be executed at all due
> > to reset of the whole command stream that also kills those batches.  But my
> > question is still valid: as soon as we omit those checks as not valid from
> > *await-hang variants, how do those variants still exercise fencing?  IOW, how
> > are those variants supposed to ever fail should something be wrong with i915
> > implementation of fencing specifically?
> 
> They would fail in the case where a hang happened but the fence is still 
> active, so it's this last assert you're referring to.
> 
> >>
> >>>>    
> >>>> -	if ((flags & HANG) == 0)
> >>>>    		igt_spin_end(spin);
> >>>> +	}
> >>>>    
> >>>>    	igt_waitchildren();
> >>>>    
> >>>>    	gem_set_domain(fd, scratch, I915_GEM_DOMAIN_GTT, 0);
> >>>> -	while (i--)
> >>>> +	igt_assert(!fence_busy(spin->out_fence));
> >>>
> >>> We only check that the out fence of the presumably hanged spin batch no
> > longer
> >>> blocks execution of store batches.
> >>
> >> This check applies to all workloads, all of them should be done with
> >> work at this point
> > 
> > OK, but since that's the only explicit assert in the *-hang processing path,
> > does it tell us anything about fencing working or not?  
> 
> It says that we were given an active fence, we wait at it and hope it 
> signals when an error is reported. Like I said, we can't check the 
> results itself, as they are meaningless with the reset. If we have an 
> active fence at this point, that's bad, and the test should fail.
> 
> > I think it doesn't,
> > and as long as I'm not wrong, I think such scenarios hardly belong to
> > gem_exec_fence.  
> 
> Hm, I'm not sure if I follow, but this exact transition (from active -> 
> (record error) -> signal) is one of the possible scenarios we wish to 
> test. 

OK, so we check if an active fence is signalled on error.  But then, what does 
'active' mean here?  Do we consider a fence active as soon as it has been 
exported to userspace?  Or only after it has been imported back from userspace 
by at least one consumer?  Assuming the former (as I guess), what do we need 
the store batches for in these now modified *await-hang scenarios?  What extra 
value do those scenarios provide compared to (nb-)?wait-hang ?

Thanks,
Janusz

> Or, do you mean that this test case doesn't test 
> drm_i915_gem_exec_fence? This test suite exercises different scenarios 
> of using fences implemented with sync_files. Maybe this could be split 
> up, but these seem to be connected, so they ended up in one file.
> 
> > Otherwise, I think we should at least add descriptions of
> > those subtests, providing some information on what is actually exercised.
> 
> The hang cases reuse test_fence_await which has _some_ description in 
> the basic cases. But I agree, it would be nice to have more 
> documentation for other subtests, but it's out of scope of this fix.
> 
> Many thanks,
> Karolina
> 
> > Thanks,
> > Janusz
> > 
>