[igt-dev] [PATCH i-g-t v2 1/2] tests/gem_exec_fence: Check stored values only for valid workloads

Mon Aug 1 14:43:43 UTC 2022

On Monday, 1 August 2022 15:39:36 CEST Karolina Drobnik wrote:
> Hi,
> 
> On 01.08.2022 13:54, Janusz Krzysztofik wrote:
> > On Monday, 1 August 2022 11:51:27 CEST Karolina Drobnik wrote:
> >> Hi,
> >>
> >> On 01.08.2022 10:11, Janusz Krzysztofik wrote:
> >>> On Monday, 1 August 2022 07:53:54 CEST Karolina Drobnik wrote:
> >>>> Hi,
> >>>>
> >>>> On 29.07.2022 17:23, Janusz Krzysztofik wrote:
> >>>>> On Friday, 29 July 2022 13:32:37 CEST Karolina Drobnik wrote:
> >>>>>> Hi Janusz,
> >>>>>>
> >>>>>> On 29.07.2022 10:24, Janusz Krzysztofik wrote:
> >>>>>>> Hi Karolina,
> >>>>>>>
> >>>>>>> On Friday, 29 July 2022 09:38:43 CEST Karolina Drobnik wrote:
> >>>>>>>> Hi Janusz,
> >>>>>>>>
> >>>>>>>> Thanks a lot for taking a look at the patch.
> >>>>>>>>
> >>>>>>>> On 28.07.2022 18:56, Janusz Krzysztofik wrote:
> >>>>>>>>> On Tuesday, 26 July 2022 12:13:11 CEST Karolina Drobnik wrote:
> >>>>>>>>>> From: Chris Wilson <chris at chris-wilson.co.uk>
> >>>>>>>>>>
> >>>>>>>>>> test_fence_await verifies if a fence used to pipeline work signals
> >>>>>>>>>> correctly. await-hang and nb-await-hang test cases inject GPU hang,
> >>>>>>>>>> which causes an erroneous state, meaning the fence will be signaled
> >>>>>>>>>> without execution. The error causes an instant reset of the command
> >>>>>>>>>> streamer for the hanging workload. This revealed a problem with how
> >>>>>>>>>> we verify the fence state and results. The test assumes that the
> >>>>>>>>>> error notification happens after a hang is declared, which takes
> >>>>>>>>>> longer than submitting the next set of fences, making the test pass
> >>>>>>>>>> every time. With the immediate reset, this might not happen, so the
> >>>>>>>>>> assertion fails, as there are no active fences in the GPU hang case.
> >>>>>>>>>>
> >>>>>>>>>> Move the check for active fence to the path for non-hanging workload,
> >>>>>>>>>> and verify results only in this scenario.
> >>>>>>>>>>
> >>>>>>>>>> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> >>>>>>>>>> Signed-off-by: Karolina Drobnik <karolina.drobnik at intel.com>
> >>>>>>>>>> ---
> >>>>>>>>>>       tests/i915/gem_exec_fence.c | 14 ++++++++------
> >>>>>>>>>>       1 file changed, 8 insertions(+), 6 deletions(-)
> >>>>>>>>>>
> >>>>>>>>>> diff --git a/tests/i915/gem_exec_fence.c b/tests/i915/
> >>> gem_exec_fence.c
> >>>>>>>>>> index d46914c2..260aa82c 100644
> >>>>>>>>>> --- a/tests/i915/gem_exec_fence.c
> >>>>>>>>>> +++ b/tests/i915/gem_exec_fence.c
> >>>>>>>>>> @@ -350,18 +350,20 @@ static void test_fence_await(int fd, const
> >>>>>>> intel_ctx_t
> >>>>>>>>> *ctx,
> >>>> (...)
> >>>>>>>>>>       
> >>>>>>>>>> -	if ((flags & HANG) == 0)
> >>>>>>>>>>       		igt_spin_end(spin);
> >>>>>>>>>> +	}
> >>>>>>>>>>       
> >>>>>>>>>>       	igt_waitchildren();
> >>>>>>>>>>       
> >>>>>>>>>>       	gem_set_domain(fd, scratch, I915_GEM_DOMAIN_GTT, 0);
> >>>>>>>>>> -	while (i--)
> >>>>>>>>>> +	igt_assert(!fence_busy(spin->out_fence));
> >>>>>>>>>
> >>>>>>>>> We only check that the out fence of the presumably hanged spin batch
> >>> no
> >>>>>>> longer
> >>>>>>>>> blocks execution of store batches.
> >>>>>>>>
> >>>>>>>> This check applies to all workloads, all of them should be done with
> >>>>>>>> work at this point
> >>>>>>>
> >>>>>>> OK, but since that's the only explicit assert in the *-hang processing
> >>> path,
> >>>>>>> does it tell us anything about fencing working or not?
> >>>>>>
> >>>>>> It says that we were given an active fence, we wait at it and hope it
> >>>>>> signals when an error is reported. Like I said, we can't check the
> >>>>>> results itself, as they are meaningless with the reset. If we have an
> >>>>>> active fence at this point, that's bad, and the test should fail.
> >>>>>>
> >>>>>>> I think it doesn't,
> >>>>>>> and as long as I'm not wrong, I think such scenarios hardly belong to
> >>>>>>> gem_exec_fence.
> >>>>>>
> >>>>>> Hm, I'm not sure if I follow, but this exact transition (from active ->
> >>>>>> (record error) -> signal) is one of the possible scenarios we wish to
> >>>>>> test.
> >>>>>
> >>>>> OK, so we check if an active fence is signalled on error.  But then, what
> >>> does
> >>>>> 'active' mean here?
> >>>>
> >>>> Active means that the fence was sent and we wait to hear back from the
> >>>> GPU. If you'd like to see what are the state transitions in fences, you
> >>>> can check out Mainline Explicit Fencing series of blog posts/presentation.
> >>>>
> >>>>> Do we consider a fence active as soon as it has been
> >>>>> exported to userspace?  Or only after it has been imported back from
> >>> userspace
> >>>>> by at least one consumer?
> >>>>
> >>>> We assume them to be active as soon as the fence's fd is returned from
> >>>> the driver to userspace (we deal with out-fences here).
> >>>>
> >>>>> Assuming the former (as I guess), what do we need
> >>>>> the store batches for in these now modified *await-hang scenarios?  What
> >>> extra
> >>>>> value do those scenarios provide compared to (nb-)?wait-hang ?
> >>>>
> >>>> Hm, I don't quite understand the question here -- are you asking why do
> >>>> we check the store results? test_fence_await is reused by other
> >>>> subtests, not only the hanging cases, and we expect to see the stores.
> >>>
> >>> The patch introduces two processing paths to test_fence_await(), one of them
> >>> for *-hang variants.  In that *-hang processing path, why do we still submit
> >>> batches that depend on the fence if we don't examine their execution and
> >>> results, only their completion?
> >>
> >> Ah, OK, now I get it. Do you suggest that we should also add a check for
> >> HANG flag for igt_store_word block? But we still need to send something
> >> to test this case.
> > 
> > Why do we need to send something?  What that something should look like?  IOW,
> > what are we trying to exercise with this case?  How is it supposed to be
> > different from wait-hang scenario?
> 
> test_fence_await tests pipelining work (here, writing something to 
> memory in parallel) and cross-checks its status with what's reported by 
> the fence, 

True for *await (no HANG) subtests, but in *await-hang variants I still can't 
recognize checks in place as cross-checks and what value they add compared to 
wait-hang.  Does a check for invalidated work no longer queued nor running 
tell us anything about the fence behavior, i.e., if and how it affected / 
interacted with that work?  I think it doesn't, then, I'm not sure if we still 
need such limited scenarios.  Please convince me that we do and I'll be happy 
to push the series with my S-o-b: added.

Thanks,
Janusz 

> whereas test_fence_busy checks if the fence signals to 
> userspace after hang (a more general case). I'm sorry, I didn't make 
> this distinction clear in the beginning.
> 
> >>
> >>> They get completed regardless of fencing working or not, I believe, > then that part of the *-hang processing path
> >>> related to store batches seems useless for me in a test focused on fencing.
> >>
> >> Do they get completed? We'll get a CS reset and this batch won't
> >> complete.
> > 
> > OK, by completed I meant no longer queued nor running.
> 
> OK, I see -- yes, that's correct
> 
> Thanks,
> Karolina
> 
> >> Yes, in the case of hangs we just check the fence,
> > 
> > Again, how this is supposed to be different from wait-hang scenario?
> > 
> > Thanks,
> > Janusz
> > 
> >> but in
> >> general we do care about these stores. This patch aims to fix the
> >> problem uncovered during manual testing (*-hang variants are not run in
> >> CI), it does not try to rewrite the test (because that's what would be
> >> required to completely separate these two test cases).
> >>
> >> Many thanks,
> >> Karolina
> >>
> >>> Thanks,
> >>> Janusz
> >>>
> >>>>
> >>>> Many thanks,
> >>>> Karolina
> >>>>
> >>>>> Thanks,
> >>>>> Janusz
> >>>>
> >>>
> >>>
> >>>
> >>>
> >>
> > 
> > 
> > 
> > 
>