[igt-dev] [PATCH i-g-t v2 1/2] tests/gem_exec_fence: Check stored values only for valid workloads
Karolina Drobnik
karolina.drobnik at intel.com
Fri Jul 29 07:38:43 UTC 2022
Hi Janusz,
Thanks a lot for taking a look at the patch.
On 28.07.2022 18:56, Janusz Krzysztofik wrote:
> On Tuesday, 26 July 2022 12:13:11 CEST Karolina Drobnik wrote:
>> From: Chris Wilson <chris at chris-wilson.co.uk>
>>
>> test_fence_await verifies that a fence used to pipeline work signals
>> correctly. The await-hang and nb-await-hang test cases inject a GPU hang,
>> which causes an erroneous state, meaning the fence will be signaled
>> without execution. The error causes an instant reset of the command
>> streamer for the hanging workload. This revealed a problem with how
>> we verify the fence state and results. The test assumes that the
>> error notification happens after a hang is declared, which takes
>> longer than submitting the next set of fences, making the test pass
>> every time. With the immediate reset, this might not happen, so the
>> assertion fails, as there are no active fences in the GPU hang case.
>>
>> Move the check for an active fence to the non-hanging workload path,
>> and verify the results only in this scenario.
>>
>> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
>> Signed-off-by: Karolina Drobnik <karolina.drobnik at intel.com>
>> ---
>> tests/i915/gem_exec_fence.c | 14 ++++++++------
>> 1 file changed, 8 insertions(+), 6 deletions(-)
>>
>> diff --git a/tests/i915/gem_exec_fence.c b/tests/i915/gem_exec_fence.c
>> index d46914c2..260aa82c 100644
>> --- a/tests/i915/gem_exec_fence.c
>> +++ b/tests/i915/gem_exec_fence.c
>> @@ -350,18 +350,20 @@ static void test_fence_await(int fd, const intel_ctx_t *ctx,
>> 	/* Long, but not too long to anger preemption disable checks */
>> 	usleep(50 * 1000); /* 50 ms, typical preempt reset is 150+ms */
>>
>> -	/* Check for invalidly completing the task early */
>> -	igt_assert(fence_busy(spin->out_fence));
>> -	for (int n = 0; n < i; n++)
>> -		igt_assert_eq_u32(out[n], 0);
>> +	if ((flags & HANG) == 0) {
>> +		/* Check for invalidly completing the task early */
>> +		igt_assert(fence_busy(spin->out_fence));
>> +		for (int n = 0; n < i; n++)
>> +			igt_assert_eq_u32(out[n], 0);
>
> AFAICU, in the 'hang' variant of the scenario we skip the above asserts
> because the spin batch could have already hung, its out fence could have
> already signalled, and the store batches waiting for that signal could
> have already executed. If that's the case, how does this variant of the
> gem_exec_fence test assert that the fence actually worked as expected?
With this change, yes, we would skip them. Still, the store batches
wouldn't be executed, as we reset the CS on hang as part of the error
handling. For valid jobs, we expect to (1) see an active fence at the
beginning of the request, (2) get a signaled fence after the wait, and
(3) store out[i] == i. With a hang, (1) and (3) would be false.
In this particular loop, we could have garbage with a hang rather than
0s (although, from my tests, it looks like they are zeroed, but maybe I
got lucky).
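To make that concrete, here is a condensed sketch of the tail of
test_fence_await as patched, pieced together from the hunks in this
diff (fence_busy() is the test's local helper; spin, out, i and flags
are the function's existing locals), with steps (1)-(3) marked:

	if ((flags & HANG) == 0) {
		/* (1) The fence must still be active: no early completion. */
		igt_assert(fence_busy(spin->out_fence));
		for (int n = 0; n < i; n++)
			igt_assert_eq_u32(out[n], 0); /* stores still gated */

		igt_spin_end(spin); /* let the spinner complete normally */
	}

	igt_waitchildren();

	gem_set_domain(fd, scratch, I915_GEM_DOMAIN_GTT, 0);
	/* (2) After the wait, the fence must have signalled for everyone. */
	igt_assert(!fence_busy(spin->out_fence));
	/* (3) Only valid jobs must have landed their stores. */
	while ((flags & HANG) == 0 && i--)
		igt_assert_eq_u32(out[i], i);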
>>
>> -	if ((flags & HANG) == 0)
>> 		igt_spin_end(spin);
>> +	}
>>
>> 	igt_waitchildren();
>>
>> 	gem_set_domain(fd, scratch, I915_GEM_DOMAIN_GTT, 0);
>> -	while (i--)
>> +	igt_assert(!fence_busy(spin->out_fence));
>
> We only check that the out fence of the presumably hanged spin batch no longer
> blocks execution of store batches.
This check applies to all workloads; all of them should be done with
their work at this point.
>> +	while ((flags & HANG) == 0 && i--)
>
> Besides, why don't we at least assert successful results of the store
> batches? Do we expect them not to have done their job correctly when
> they complete after the hang of the spin batch has occurred?
We don't expect them to store anything meaningful, because we get a
reset. So, this check only applies to well-behaved jobs.
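In other words, the only invariant the hang path keeps is that the
fence must not stay busy forever; a minimal sketch of that expectation
(same locals as above):

	/* Hang case: the CS was reset, so the fence signals without
	 * execution and the out[] contents are undefined -- they are
	 * deliberately not checked here.
	 */
	igt_assert(!fence_busy(spin->out_fence));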
All the best,
Karolina
> Am I missing something?
>
> Thanks,
> Janusz
>
>
>> 		igt_assert_eq_u32(out[i], i);
>> 	munmap(out, 4096);
>>
>>