[igt-dev] [PATCH i-g-t v2 1/2] tests/gem_exec_fence: Check stored values only for valid workloads
Karolina Drobnik
karolina.drobnik at intel.com
Tue Aug 2 10:20:29 UTC 2022
Hi,
On 01.08.2022 16:43, Janusz Krzysztofik wrote:
> On Monday, 1 August 2022 15:39:36 CEST Karolina Drobnik wrote:
>> Hi,
>>
>> On 01.08.2022 13:54, Janusz Krzysztofik wrote:
>>> On Monday, 1 August 2022 11:51:27 CEST Karolina Drobnik wrote:
>>>> Hi,
>>>>
>>>> On 01.08.2022 10:11, Janusz Krzysztofik wrote:
>>>>> On Monday, 1 August 2022 07:53:54 CEST Karolina Drobnik wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On 29.07.2022 17:23, Janusz Krzysztofik wrote:
>>>>>>> On Friday, 29 July 2022 13:32:37 CEST Karolina Drobnik wrote:
>>>>>>>> Hi Janusz,
>>>>>>>>
>>>>>>>> On 29.07.2022 10:24, Janusz Krzysztofik wrote:
>>>>>>>>> Hi Karolina,
>>>>>>>>>
>>>>>>>>> On Friday, 29 July 2022 09:38:43 CEST Karolina Drobnik wrote:
>>>>>>>>>> Hi Janusz,
>>>>>>>>>>
>>>>>>>>>> Thanks a lot for taking a look at the patch.
>>>>>>>>>>
>>>>>>>>>> On 28.07.2022 18:56, Janusz Krzysztofik wrote:
>>>>>>>>>>> On Tuesday, 26 July 2022 12:13:11 CEST Karolina Drobnik wrote:
>>>>>>>>>>>> From: Chris Wilson <chris at chris-wilson.co.uk>
>>>>>>>>>>>>
>>>>>>>>>>>> test_fence_await verifies that a fence used to pipeline work signals
>>>>>>>>>>>> correctly. await-hang and nb-await-hang test cases inject a GPU hang,
>>>>>>>>>>>> which causes an erroneous state, meaning the fence will be signaled
>>>>>>>>>>>> without execution. The error causes an instant reset of the command
>>>>>>>>>>>> streamer for the hanging workload. This revealed a problem with how
>>>>>>>>>>>> we verify the fence state and results. The test assumes that the
>>>>>>>>>>>> error notification happens after a hang is declared, which takes
>>>>>>>>>>>> longer than submitting the next set of fences, making the test pass
>>>>>>>>>>>> every time. With the immediate reset, this might not happen, so the
>>>>>>>>>>>> assertion fails, as there are no active fences in the GPU hang case.
>>>>>>>>>>>>
>>>>>>>>>>>> Move the check for an active fence to the path for non-hanging
>>>>>>>>>>>> workloads, and verify results only in this scenario.
>>>>>>>>>>>>
>>>>>>>>>>>> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
>>>>>>>>>>>> Signed-off-by: Karolina Drobnik <karolina.drobnik at intel.com>
>>>>>>>>>>>> ---
>>>>>>>>>>>> tests/i915/gem_exec_fence.c | 14 ++++++++------
>>>>>>>>>>>> 1 file changed, 8 insertions(+), 6 deletions(-)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/tests/i915/gem_exec_fence.c b/tests/i915/gem_exec_fence.c
>>>>>>>>>>>> index d46914c2..260aa82c 100644
>>>>>>>>>>>> --- a/tests/i915/gem_exec_fence.c
>>>>>>>>>>>> +++ b/tests/i915/gem_exec_fence.c
>>>>>>>>>>>> @@ -350,18 +350,20 @@ static void test_fence_await(int fd, const intel_ctx_t *ctx,
>>>>>>>>>>>> (...)
>>>>>>>>>>>>
>>>>>>>>>>>> -		if ((flags & HANG) == 0)
>>>>>>>>>>>>  			igt_spin_end(spin);
>>>>>>>>>>>> +	}
>>>>>>>>>>>>
>>>>>>>>>>>>  	igt_waitchildren();
>>>>>>>>>>>>
>>>>>>>>>>>>  	gem_set_domain(fd, scratch, I915_GEM_DOMAIN_GTT, 0);
>>>>>>>>>>>> -	while (i--)
>>>>>>>>>>>> +	igt_assert(!fence_busy(spin->out_fence));
>>>>>>>>>>>
>>>>>>>>>>> We only check that the out fence of the presumably hanged spin batch
>>>>>>>>>>> no longer blocks execution of store batches.
>>>>>>>>>>
>>>>>>>>>> This check applies to all workloads; all of them should be done
>>>>>>>>>> with their work at this point.
>>>>>>>>>
>>>>>>>>> OK, but since that's the only explicit assert in the *-hang
>>>>>>>>> processing path, does it tell us anything about fencing working or not?
>>>>>>>>
>>>>>>>> It says that we were given an active fence; we wait on it and expect
>>>>>>>> it to signal when an error is reported. Like I said, we can't check
>>>>>>>> the results themselves, as they are meaningless after the reset. If
>>>>>>>> we still have an active fence at this point, that's bad, and the test
>>>>>>>> should fail.
>>>>>>>>
>>>>>>>>> I think it doesn't, and unless I'm wrong, such scenarios hardly
>>>>>>>>> belong in gem_exec_fence.
>>>>>>>>
>>>>>>>> Hm, I'm not sure if I follow, but this exact transition (from active ->
>>>>>>>> (record error) -> signal) is one of the possible scenarios we wish to
>>>>>>>> test.
>>>>>>>
>>>>>>> OK, so we check if an active fence is signalled on error. But then,
>>>>>>> what does 'active' mean here?
>>>>>>
>>>>>> Active means that the fence was sent and we are waiting to hear back
>>>>>> from the GPU. If you'd like to see what the state transitions in
>>>>>> fences are, you can check out the Mainline Explicit Fencing series of
>>>>>> blog posts/presentations.
>>>>>>
>>>>>>> Do we consider a fence active as soon as it has been exported to
>>>>>>> userspace? Or only after it has been imported back from userspace by
>>>>>>> at least one consumer?
>>>>>>
>>>>>> We assume them to be active as soon as the fence's fd is returned from
>>>>>> the driver to userspace (we deal with out-fences here).
>>>>>>
>>>>>>> Assuming the former (as I guess), what do we need the store batches
>>>>>>> for in these now modified *await-hang scenarios? What extra value do
>>>>>>> those scenarios provide compared to (nb-)?wait-hang?
>>>>>>
>>>>>> Hm, I don't quite understand the question here -- are you asking why
>>>>>> we check the store results? test_fence_await is reused by other
>>>>>> subtests, not only the hanging cases, and we expect to see the stores.
>>>>>
>>>>> The patch introduces two processing paths to test_fence_await(), one of them
>>>>> for *-hang variants. In that *-hang processing path, why do we still submit
>>>>> batches that depend on the fence if we don't examine their execution and
>>>>> results, only their completion?
>>>>
>>>> Ah, OK, now I get it. Do you suggest that we should also add a check for
>>>> the HANG flag around the igt_store_word block? But we still need to send
>>>> something to test this case.
>>>
>>> Why do we need to send something? What should that something look like?
>>> IOW, what are we trying to exercise with this case? How is it supposed to
>>> be different from the wait-hang scenario?
>>
>> test_fence_await tests pipelining work (here, writing something to
>> memory in parallel) and cross-checks its status with what's reported by
>> the fence,
>
> True for *await (no HANG) subtests, but in the *await-hang variants I still
> can't recognize the checks in place as cross-checks, nor see what value
> they add compared to wait-hang.
wait-hang and await-hang are different in the kind of work they do.
test_fence_busy is a test with a simple batch buffer (with no actual
writes) that is _iteratively_ executed on different physical engines,
whereas test_fence_await spawns work on different engines in _parallel_.
It's a different scenario from the fence's perspective, as we have one
fence on the spinner and pass it to the other engines, which
(asynchronously) wait on it. The value added here is the check for one
fence against many engines (wait-hang has a 1:1 mapping); see the sketch
below.
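
To make that topology concrete, here is a minimal sketch of the
one-fence/many-engines pattern using the raw i915 execbuf UAPI. Note
that prepare_spinner() and prepare_store() are hypothetical stand-ins
for the batch setup the real test does with igt_spin_new() and
igt_store_word():

#include <sys/ioctl.h>
#include <drm/i915_drm.h>

/* Hypothetical helpers standing in for the test's batch setup. */
struct drm_i915_gem_execbuffer2 *prepare_spinner(int fd);
struct drm_i915_gem_execbuffer2 *prepare_store(int fd, unsigned int engine);

static int submit_with_shared_fence(int fd, const unsigned int *engines,
				    int nengines)
{
	struct drm_i915_gem_execbuffer2 *eb;
	int out_fence;

	/* Submit the spinner and export its fence to userspace. */
	eb = prepare_spinner(fd);
	eb->flags |= I915_EXEC_FENCE_OUT;
	if (ioctl(fd, DRM_IOCTL_I915_GEM_EXECBUFFER2_WR, eb))
		return -1;
	out_fence = eb->rsvd2 >> 32; /* upper half returns the fence fd */

	/* Every store batch, each on its own engine, waits on that fence. */
	for (int i = 0; i < nengines; i++) {
		eb = prepare_store(fd, engines[i]);
		eb->flags |= I915_EXEC_FENCE_IN;
		eb->rsvd2 = out_fence; /* lower half passes the in-fence fd */
		if (ioctl(fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, eb))
			return -1;
	}

	return out_fence;
}

In wait-hang, by contrast, each engine gets its own spinner and its own
fence, so nothing exercises that fan-out.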
> Does a check that invalidated work is no longer queued nor running tell us
> anything about the fence behavior, i.e., if and how it affected /
> interacted with that work? I think it doesn't, and in that case I'm not
> sure if we still need such limited scenarios.
I don't think I quite understand -- we check that we've sent a hanging
workload, which is expected to result in an error; there's no additional
check for the work itself (how could one check it?). As the error is
raised from any of the engines (I'm talking about the await case), the
fence is expected to signal, and this is what we check. The fence itself
can't interact with that work.
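
For what it's worth, the signalling check itself is cheap: a sync_file
fd polls readable (POLLIN) once its fence has signalled, so "still busy"
is simply "not yet readable" under a zero timeout. A minimal sketch of
such a helper, equivalent in spirit to the test's fence_busy():

#include <poll.h>
#include <stdbool.h>

/* A sync_file fd becomes readable once its fence has signalled. */
static bool fence_busy(int fence)
{
	struct pollfd pfd = { .fd = fence, .events = POLLIN };

	return poll(&pfd, 1, 0) == 0;
}

After igt_waitchildren() the hang must have been declared and the fence
signalled either way, hence the unconditional
igt_assert(!fence_busy(spin->out_fence)) in the patch.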
> Please convince me that we do and I'll be happy
> to push the series with my S-o-b: added.
I have commit rights now, so I can push it myself, but I want to address
all the comments before moving on.
If you have further questions or my answers are not satisfactory (which
is fair), you can reach out to the author. He knows all the bits and
pieces of this test suite. I'm more of a messenger here.
All the best,
Karolina
> Thanks,
> Janusz