[igt-dev] [PATCH i-g-t v2 1/2] tests/gem_exec_fence: Check stored values only for valid workloads
Janusz Krzysztofik
janusz.krzysztofik at linux.intel.com
Wed Aug 3 07:21:03 UTC 2022
On Tuesday, 2 August 2022 12:20:29 CEST Karolina Drobnik wrote:
> Hi,
>
> On 01.08.2022 16:43, Janusz Krzysztofik wrote:
> > On Monday, 1 August 2022 15:39:36 CEST Karolina Drobnik wrote:
> >> Hi,
> >>
> >> On 01.08.2022 13:54, Janusz Krzysztofik wrote:
> >>> On Monday, 1 August 2022 11:51:27 CEST Karolina Drobnik wrote:
> >>>> Hi,
> >>>>
> >>>> On 01.08.2022 10:11, Janusz Krzysztofik wrote:
> >>>>> On Monday, 1 August 2022 07:53:54 CEST Karolina Drobnik wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> On 29.07.2022 17:23, Janusz Krzysztofik wrote:
> >>>>>>> On Friday, 29 July 2022 13:32:37 CEST Karolina Drobnik wrote:
> >>>>>>>> Hi Janusz,
> >>>>>>>>
> >>>>>>>> On 29.07.2022 10:24, Janusz Krzysztofik wrote:
> >>>>>>>>> Hi Karolina,
> >>>>>>>>>
> >>>>>>>>> On Friday, 29 July 2022 09:38:43 CEST Karolina Drobnik wrote:
> >>>>>>>>>> Hi Janusz,
> >>>>>>>>>>
> >>>>>>>>>> Thanks a lot for taking a look at the patch.
> >>>>>>>>>>
> >>>>>>>>>> On 28.07.2022 18:56, Janusz Krzysztofik wrote:
> >>>>>>>>>>> On Tuesday, 26 July 2022 12:13:11 CEST Karolina Drobnik wrote:
> >>>>>>>>>>>> From: Chris Wilson <chris at chris-wilson.co.uk>
> >>>>>>>>>>>>
> >>>>>>>>>>>> test_fence_await verifies if a fence used to pipeline work
> >>>>>>>>>>>> signals correctly. await-hang and nb-await-hang test cases
> >>>>>>>>>>>> inject a GPU hang, which causes an erroneous state, meaning the
> >>>>>>>>>>>> fence will be signaled without execution. The error causes an
> >>>>>>>>>>>> instant reset of the command streamer for the hanging workload.
> >>>>>>>>>>>> This revealed a problem with how we verify the fence state and
> >>>>>>>>>>>> results. The test assumes that the error notification happens
> >>>>>>>>>>>> after a hang is declared, which takes longer than submitting the
> >>>>>>>>>>>> next set of fences, making the test pass every time. With the
> >>>>>>>>>>>> immediate reset, this might not happen, so the assertion fails,
> >>>>>>>>>>>> as there are no active fences in the GPU hang case.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Move the check for the active fence to the path for non-hanging
> >>>>>>>>>>>> workloads, and verify results only in this scenario.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> >>>>>>>>>>>> Signed-off-by: Karolina Drobnik <karolina.drobnik at intel.com>
> >>>>>>>>>>>> ---
> >>>>>>>>>>>> tests/i915/gem_exec_fence.c | 14 ++++++++------
> >>>>>>>>>>>> 1 file changed, 8 insertions(+), 6 deletions(-)
> >>>>>>>>>>>>
> >>>>>>>>>>>> diff --git a/tests/i915/gem_exec_fence.c b/tests/i915/gem_exec_fence.c
> >>>>>>>>>>>> index d46914c2..260aa82c 100644
> >>>>>>>>>>>> --- a/tests/i915/gem_exec_fence.c
> >>>>>>>>>>>> +++ b/tests/i915/gem_exec_fence.c
> >>>>>>>>>>>> @@ -350,18 +350,20 @@ static void test_fence_await(int fd, const intel_ctx_t *ctx,
> >>>>>> (...)
> >>>>>>>>>>>>
> >>>>>>>>>>>> - if ((flags & HANG) == 0)
> >>>>>>>>>>>> igt_spin_end(spin);
> >>>>>>>>>>>> + }
> >>>>>>>>>>>>
> >>>>>>>>>>>> igt_waitchildren();
> >>>>>>>>>>>>
> >>>>>>>>>>>> gem_set_domain(fd, scratch, I915_GEM_DOMAIN_GTT, 0);
> >>>>>>>>>>>> - while (i--)
> >>>>>>>>>>>> + igt_assert(!fence_busy(spin->out_fence));
> >>>>>>>>>>>
> >>>>>>>>>>> We only check that the out fence of the presumably hung spin
> >>>>>>>>>>> batch no longer blocks execution of store batches.
> >>>>>>>>>>
> >>>>>>>>>> This check applies to all workloads; all of them should be done
> >>>>>>>>>> with their work at this point
> >>>>>>>>>
> >>>>>>>>> OK, but since that's the only explicit assert in the *-hang
> >>>>>>>>> processing path, does it tell us anything about fencing working or
> >>>>>>>>> not?
> >>>>>>>>
> >>>>>>>> It says that we were given an active fence, we wait on it and hope
> >>>>>>>> it signals when an error is reported. Like I said, we can't check
> >>>>>>>> the results themselves, as they are meaningless with the reset. If
> >>>>>>>> we have an active fence at this point, that's bad, and the test
> >>>>>>>> should fail.
> >>>>>>>>
> >>>>>>>>> I think it doesn't, and unless I'm wrong, I think such scenarios
> >>>>>>>>> hardly belong in gem_exec_fence.
> >>>>>>>>
> >>>>>>>> Hm, I'm not sure if I follow, but this exact transition (from
> >>>>>>>> active -> (record error) -> signal) is one of the possible
> >>>>>>>> scenarios we wish to test.
> >>>>>>>
> >>>>>>> OK, so we check if an active fence is signalled on error. But then,
> >>>>>>> what does 'active' mean here?
> >>>>>>
> >>>>>> Active means that the fence was sent and we wait to hear back from
> >>>>>> the GPU. If you'd like to see what the state transitions in fences
> >>>>>> are, you can check out the Mainline Explicit Fencing series of blog
> >>>>>> posts/presentations.
> >>>>>>
> >>>>>>> Do we consider a fence active as soon as it has been exported to
> >>>>>>> userspace? Or only after it has been imported back from userspace by
> >>>>>>> at least one consumer?
> >>>>>>
> >>>>>> We assume them to be active as soon as the fence's fd is returned
> >>>>>> from the driver to userspace (we deal with out-fences here).
> >>>>>>
> >>>>>>> Assuming the former (as I guess), what do we need the store batches
> >>>>>>> for in these now modified *await-hang scenarios? What extra value do
> >>>>>>> those scenarios provide compared to (nb-)?wait-hang?
> >>>>>>
> >>>>>> Hm, I don't quite understand the question here -- are you asking why
> >>>>>> we check the store results? test_fence_await is reused by other
> >>>>>> subtests, not only the hanging cases, and we expect to see the
> >>>>>> stores.
> >>>>>
> >>>>> The patch introduces two processing paths to test_fence_await(), one
> >>>>> of them for *-hang variants. In that *-hang processing path, why do we
> >>>>> still submit batches that depend on the fence if we don't examine
> >>>>> their execution and results, only their completion?
> >>>>
> >>>> Ah, OK, now I get it. Do you suggest that we should also add a check
> >>>> for the HANG flag in the igt_store_word block? But we still need to
> >>>> send something to test this case.
> >>>
> >>> Why do we need to send something? What should that something look like?
> >>> IOW, what are we trying to exercise with this case? How is it supposed
> >>> to be different from the wait-hang scenario?
> >>
> >> test_fence_await tests pipelining work (here, writing something to
> >> memory in parallel) and cross-checks its status with what's reported by
> >> the fence,
> >
> > True for *await (no HANG) subtests, but in *await-hang variants I still
> > can't see the checks in place as cross-checks, nor what value they add
> > compared to wait-hang.
>
> wait_hang and await_hang are different in the kind of work they do.
> test_fence_busy is a test with a simple batch buffer (with no actual
> writes) that is _iteratively_ executed on different physical engines,
> whereas test_fence_await spawns work on different engines in _parallel_.
> It's a different scenario from the fence's perspective, as we have one
> fence on the spinner and pass it to other engines which (asynchronously)
> wait on it. The value added here is the check for 1 fence - many engines
> (wait_hang has a 1-1 mapping).
>
> > Does a check for invalidated work that is no longer queued nor running
> > tell us anything about the fence behavior, i.e., if and how it affected /
> > interacted with that work? I think it doesn't, and then I'm not sure if
> > we still need such limited scenarios.
>
> I don't think I quite understand -- we check if we've sent a hanging
> workload, which is expected to result in an error; there's no additional
> check for the work itself (how could one check it?). As the error is
> raised from any of the engines (I'm talking about the await case), the
> fence is expected to signal, and this is what we check. The fence itself
> can't interact with that work.
>
> > Please convince me that we do and I'll be happy
> > to push the series with my S-o-b: added.
>
> I have commit rights now, so I can push it myself,
OK, let's take a different approach then. Please feel free to push the series
yourself with the R-b:s collected so far, and I'll try to update and restore
the disabled checks in a follow-up patch.
Thanks,
Janusz
> but I want to address
> all the comments before moving on.
>
> If you have further questions or my answers are not satisfactory (which
> is fair), you can reach out to the author. He knows all the bits and
> pieces of this test suite. I'm more of a messenger here.
>
> All the best,
> Karolina
>
> > Thanks,
> > Janusz
>