[i-g-t, v2, 1/1] tests/intel/xe_exec_threads: wait for all submissions to complete

Yang, Fei fei.yang at intel.com
Wed Oct 30 00:15:41 UTC 2024


> On Mon, Oct 28, 2024 at 03:53:49PM -0700, fei.yang at intel.com wrote:
>> From: Fei Yang <fei.yang at intel.com>
>>
>> In test_compute_mode, there is a one-second sleep waiting for all the
>> submissions to complete, but that is not reliable, especially on
>> pre-silicon platforms where the GPU can be a lot slower. Instead we
>> should wait for the ufence to make sure the GPU is inactive before
>> unbinding the BO.
>>
>> Signed-off-by: Fei Yang <fei.yang at intel.com>
>> ---
>>  tests/intel/xe_exec_threads.c | 29 ++++++++++++++++++++++-------
>>  1 file changed, 22 insertions(+), 7 deletions(-)
>>
>> diff --git a/tests/intel/xe_exec_threads.c b/tests/intel/xe_exec_threads.c
>> index 413d6626b..03043c53e 100644
>> --- a/tests/intel/xe_exec_threads.c
>> +++ b/tests/intel/xe_exec_threads.c
>> @@ -340,7 +340,7 @@ test_compute_mode(int fd, uint32_t vm, uint64_t addr, uint64_t userptr,
>>              xe_exec(fd, &exec);
>>
>>              if (flags & REBIND && i && !(i & 0x1f)) {
>> -                    for (j = i - 0x20; j <= i; ++j)
>> +                    for (j = i == 0x20 ? 0 : i - 0x1f; j <= i; ++j)
>
> The change here doesn't seem to be related to or explained in the commit message as far as I can see.  Right now, every time the outer loop is on a non-zero iteration that's a multiple of 0x20, it runs this inner loop to wait for the last group of execs to complete.  As written today, the test accidentally waits twice for some of the execs (e.g., after exec 0x20 the test waits for 0x0-0x20, but after exec 0x40 it waits for 0x20-0x40, waiting on 0x20 a second time).  It looks like this is an attempt to fix that by starting the inner loop from i-0x1f instead of i-0x20, with a special case the first time around to ensure we don't miss exec 0x0.
>

Yes, this change is to avoid waiting on the same fence twice. I can separate it into a different patch.
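
To make the wait windows concrete (an illustration derived from the patch
above, assuming n_execs is large enough to hit several windows):

        i = 0x20: waits on execs 0x00..0x20 (the special case picks up exec 0)
        i = 0x40: waits on execs 0x21..0x40
        i = 0x60: waits on execs 0x41..0x60

so each fence is waited on exactly once, whereas the old i - 0x20 lower
bound waited on the boundary execs (0x20, 0x40, ...) twice.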

>
> The logic in this test is already pretty confusing and poorly explained.
> Maybe it would be cleaner to replace loop variable j with a running 'last_wait' variable that we use to catch up with all necessary waits?
> E.g.,
>
>         int last_wait = -1;
>         ...
>         for (i = 0; i < n_execs; i++) {
>                 ...
>                 do {
>                         last_wait++;
>                         xe_wait_ufence(... &data[last_wait].exec_sync ...);
>                 } while (last_wait < i);
>                 ...
>
> That said, this change still seems unrelated to the change described in the commit message, so maybe this should be a separate patch?

There is another piece of code further down, inside if (flags & INVALIDATE && i && !(i & 0x1f)),
that does exactly the same thing; I just copied it over because it is correct.
I think we should keep the code consistent.
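
For reference, that INVALIDATE branch presumably has the same shape as the
fixed REBIND one quoted above (a sketch based on this thread, not checked
against the tree):

        if (flags & INVALIDATE && i && !(i & 0x1f)) {
                for (j = i == 0x20 ? 0 : i - 0x1f; j <= i; ++j)
                        xe_wait_ufence(fd, &data[j].exec_sync,
                                       USER_FENCE_VALUE,
                                       exec_queues[e], fence_timeout);
                /* ... invalidate the userptr mapping ... */
        }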

>>                              xe_wait_ufence(fd, &data[j].exec_sync,
>>                                             USER_FENCE_VALUE,
>>                                             exec_queues[e], fence_timeout);
>> @@ -404,16 +404,31 @@ test_compute_mode(int fd, uint32_t vm, uint64_t addr, uint64_t userptr,
>>              }
>>      }
>>
>> -    j = flags & INVALIDATE ?
>> -            (flags & RACE ? n_execs / 2 + 1 : n_execs - 1) : 0;
>> +    j = 0; /* wait for all submissions to complete */
>> +    if (flags & INVALIDATE)
>> +            /*
>> +             * For !RACE cases xe_wait_ufence has been called in the above
>> +             * for-loop for all but the last batch of submissions. For RACE
>> +             * cases we need to wait for the second half of the submissions
>> +             * to complete. There is a potential race here because the first
>> +             * half of the submissions might have updated the fence at the
>> +             * old physical location while the test was remapping the buffer
>> +             * to a different physical location, but xe_wait_ufence only
>> +             * checks the fence at the new location, which would then never
>> +             * be updated. We have to assume the first half of the submissions
>> +             * complete before the second half.
>
> Doesn't this assumption just bring us back to the same bug we were trying to fix here?  We don't have any guarantees on scheduling order or speed, so the earlier execs might happen to get scheduled later than the more recent ones, and we don't really know whether the work is finished if the syncs aren't reliable.  We can't tell whether those execs are still truly running, or whether our copy of data from the old buffer to the new buffer read the pre-completion value a split second before the GPU wrote something new to the old buffer, in which case we lose that racing GPU update when we switch to the new buffer.
>
> It seems like the proper solution here is really to have a completely separate bind at a separate location, one that doesn't move or get clobbered, that holds the syncs for all of the execs.  If there's a bug in the driver, we'll still see it because we'll get faults from the 0xc0ffee data write.
>

Yes, there is still a chance of the race here, but so far it has worked in my
tests, so I sent this as a quick fix and left a comment.
I'm working on another patch that separates the batch buffer and the sync data,
but I don't want to put too much pressure on this issue, so my intention was to
merge this first if possible and then take the time to make sure the subsequent
patch works perfectly.

-Fei

>
> Matt
>
>> +             */
>> +            j = (flags & RACE) ? (n_execs / 2 + 1) : (((n_execs - 1) & ~0x1f) + 1);
>> +    else if (flags & REBIND)
>> +            /*
>> +             * For REBIND cases xe_wait_ufence has been called in above for-loop
>> +             * except the last batch of submissions.
>> +             */
>> +            j = ((n_execs - 1) & ~0x1f) + 1;
>> +
>>      for (i = j; i < n_execs; i++)
>>              xe_wait_ufence(fd, &data[i].exec_sync, USER_FENCE_VALUE,
>>                             exec_queues[i % n_exec_queues], fence_timeout);
>>
>> -    /* Wait for all execs to complete */
>> -    if (flags & INVALIDATE)
>> -            sleep(1);
>> -
>>      sync[0].addr = to_user_pointer(&data[0].vm_sync);
>>      xe_vm_unbind_async(fd, vm, 0, 0, addr, bo_size, sync, 1);
>>      xe_wait_ufence(fd, &data[0].vm_sync, USER_FENCE_VALUE, 0,
>> fence_timeout);
>> --
>> 2.25.1
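
For what it's worth, a quick sanity check of the ((n_execs - 1) & ~0x1f) + 1
starting index (illustration only, assuming n_execs > 0x20): with
n_execs = 0x80 the in-loop waits fire at i = 0x20, 0x40 and 0x60, covering
fences 0x00..0x60, and (0x7f & ~0x1f) + 1 = 0x60 + 1 = 0x61, so the final
loop waits on the remaining fences 0x61..0x7f.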

