[igt-dev] [PATCH i-g-t v2 1/2] tests/gem_exec_fence: Check stored values only for valid workloads

Mon Aug 1 13:39:36 UTC 2022

Hi,

On 01.08.2022 13:54, Janusz Krzysztofik wrote:
> On Monday, 1 August 2022 11:51:27 CEST Karolina Drobnik wrote:
>> Hi,
>>
>> On 01.08.2022 10:11, Janusz Krzysztofik wrote:
>>> On Monday, 1 August 2022 07:53:54 CEST Karolina Drobnik wrote:
>>>> Hi,
>>>>
>>>> On 29.07.2022 17:23, Janusz Krzysztofik wrote:
>>>>> On Friday, 29 July 2022 13:32:37 CEST Karolina Drobnik wrote:
>>>>>> Hi Janusz,
>>>>>>
>>>>>> On 29.07.2022 10:24, Janusz Krzysztofik wrote:
>>>>>>> Hi Karolina,
>>>>>>>
>>>>>>> On Friday, 29 July 2022 09:38:43 CEST Karolina Drobnik wrote:
>>>>>>>> Hi Janusz,
>>>>>>>>
>>>>>>>> Thanks a lot for taking a look at the patch.
>>>>>>>>
>>>>>>>> On 28.07.2022 18:56, Janusz Krzysztofik wrote:
>>>>>>>>> On Tuesday, 26 July 2022 12:13:11 CEST Karolina Drobnik wrote:
>>>>>>>>>> From: Chris Wilson <chris at chris-wilson.co.uk>
>>>>>>>>>>
>>>>>>>>>> test_fence_await verifies if a fence used to pipeline work signals
>>>>>>>>>> correctly. await-hang and nb-await-hang test cases inject GPU hang,
>>>>>>>>>> which causes an erroneous state, meaning the fence will be signaled
>>>>>>>>>> without execution. The error causes an instant reset of the command
>>>>>>>>>> streamer for the hanging workload. This revealed a problem with how
>>>>>>>>>> we verify the fence state and results. The test assumes that the
>>>>>>>>>> error notification happens after a hang is declared, which takes
>>>>>>>>>> longer than submitting the next set of fences, making the test pass
>>>>>>>>>> every time. With the immediate reset, this might not happen, so the
>>>>>>>>>> assertion fails, as there are no active fences in the GPU hang case.
>>>>>>>>>>
>>>>>>>>>> Move the check for active fence to the path for non-hanging workload,
>>>>>>>>>> and verify results only in this scenario.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
>>>>>>>>>> Signed-off-by: Karolina Drobnik <karolina.drobnik at intel.com>
>>>>>>>>>> ---
>>>>>>>>>>       tests/i915/gem_exec_fence.c | 14 ++++++++------
>>>>>>>>>>       1 file changed, 8 insertions(+), 6 deletions(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/tests/i915/gem_exec_fence.c b/tests/i915/
>>> gem_exec_fence.c
>>>>>>>>>> index d46914c2..260aa82c 100644
>>>>>>>>>> --- a/tests/i915/gem_exec_fence.c
>>>>>>>>>> +++ b/tests/i915/gem_exec_fence.c
>>>>>>>>>> @@ -350,18 +350,20 @@ static void test_fence_await(int fd, const
>>>>>>> intel_ctx_t
>>>>>>>>> *ctx,
>>>> (...)
>>>>>>>>>>       
>>>>>>>>>> -	if ((flags & HANG) == 0)
>>>>>>>>>>       		igt_spin_end(spin);
>>>>>>>>>> +	}
>>>>>>>>>>       
>>>>>>>>>>       	igt_waitchildren();
>>>>>>>>>>       
>>>>>>>>>>       	gem_set_domain(fd, scratch, I915_GEM_DOMAIN_GTT, 0);
>>>>>>>>>> -	while (i--)
>>>>>>>>>> +	igt_assert(!fence_busy(spin->out_fence));
>>>>>>>>>
>>>>>>>>> We only check that the out fence of the presumably hanged spin batch
>>> no
>>>>>>> longer
>>>>>>>>> blocks execution of store batches.
>>>>>>>>
>>>>>>>> This check applies to all workloads, all of them should be done with
>>>>>>>> work at this point
>>>>>>>
>>>>>>> OK, but since that's the only explicit assert in the *-hang processing
>>> path,
>>>>>>> does it tell us anything about fencing working or not?
>>>>>>
>>>>>> It says that we were given an active fence, we wait at it and hope it
>>>>>> signals when an error is reported. Like I said, we can't check the
>>>>>> results itself, as they are meaningless with the reset. If we have an
>>>>>> active fence at this point, that's bad, and the test should fail.
>>>>>>
>>>>>>> I think it doesn't,
>>>>>>> and as long as I'm not wrong, I think such scenarios hardly belong to
>>>>>>> gem_exec_fence.
>>>>>>
>>>>>> Hm, I'm not sure if I follow, but this exact transition (from active ->
>>>>>> (record error) -> signal) is one of the possible scenarios we wish to
>>>>>> test.
>>>>>
>>>>> OK, so we check if an active fence is signalled on error.  But then, what
>>> does
>>>>> 'active' mean here?
>>>>
>>>> Active means that the fence was sent and we wait to hear back from the
>>>> GPU. If you'd like to see what are the state transitions in fences, you
>>>> can check out Mainline Explicit Fencing series of blog posts/presentation.
>>>>
>>>>> Do we consider a fence active as soon as it has been
>>>>> exported to userspace?  Or only after it has been imported back from
>>> userspace
>>>>> by at least one consumer?
>>>>
>>>> We assume them to be active as soon as the fence's fd is returned from
>>>> the driver to userspace (we deal with out-fences here).
>>>>
>>>>> Assuming the former (as I guess), what do we need
>>>>> the store batches for in these now modified *await-hang scenarios?  What
>>> extra
>>>>> value do those scenarios provide compared to (nb-)?wait-hang ?
>>>>
>>>> Hm, I don't quite understand the question here -- are you asking why do
>>>> we check the store results? test_fence_await is reused by other
>>>> subtests, not only the hanging cases, and we expect to see the stores.
>>>
>>> The patch introduces two processing paths to test_fence_await(), one of them
>>> for *-hang variants.  In that *-hang processing path, why do we still submit
>>> batches that depend on the fence if we don't examine their execution and
>>> results, only their completion?
>>
>> Ah, OK, now I get it. Do you suggest that we should also add a check for
>> HANG flag for igt_store_word block? But we still need to send something
>> to test this case.
> 
> Why do we need to send something?  What that something should look like?  IOW,
> what are we trying to exercise with this case?  How is it supposed to be
> different from wait-hang scenario?

test_fence_await tests pipelining work (here, writing something to 
memory in parallel) and cross-checks its status with what's reported by 
the fence, whereas test_fence_busy checks if the fence signals to 
userspace after hang (a more general case). I'm sorry, I didn't make 
this distinction clear in the beginning.

>>
>>> They get completed regardless of fencing working or not, I believe, > then that part of the *-hang processing path
>>> related to store batches seems useless for me in a test focused on fencing.
>>
>> Do they get completed? We'll get a CS reset and this batch won't
>> complete.
> 
> OK, by completed I meant no longer queued nor running.

OK, I see -- yes, that's correct

Thanks,
Karolina

>> Yes, in the case of hangs we just check the fence,
> 
> Again, how this is supposed to be different from wait-hang scenario?
> 
> Thanks,
> Janusz
> 
>> but in
>> general we do care about these stores. This patch aims to fix the
>> problem uncovered during manual testing (*-hang variants are not run in
>> CI), it does not try to rewrite the test (because that's what would be
>> required to completely separate these two test cases).
>>
>> Many thanks,
>> Karolina
>>
>>> Thanks,
>>> Janusz
>>>
>>>>
>>>> Many thanks,
>>>> Karolina
>>>>
>>>>> Thanks,
>>>>> Janusz
>>>>
>>>
>>>
>>>
>>>
>>
> 
> 
> 
>