[i-g-t, v2, 1/1] tests/intel/xe_exec_threads: wait for all submissions to complete

Matt Roper matthew.d.roper at intel.com
Tue Oct 29 23:38:34 UTC 2024


On Mon, Oct 28, 2024 at 03:53:49PM -0700, fei.yang at intel.com wrote:
> From: Fei Yang <fei.yang at intel.com>
> 
> In test_compute_mode, there is a one-second sleep waiting for all
> the submissions to complete, but that is not reliable, especially
> on pre-si platforms where the GPU could be a lot slower. Instead we
> should wait for the ufence to make sure the GPU is inactive before
> unbinding the BO.
> 
> Signed-off-by: Fei Yang <fei.yang at intel.com>
> ---
>  tests/intel/xe_exec_threads.c | 29 ++++++++++++++++++++++-------
>  1 file changed, 22 insertions(+), 7 deletions(-)
> 
> diff --git a/tests/intel/xe_exec_threads.c b/tests/intel/xe_exec_threads.c
> index 413d6626b..03043c53e 100644
> --- a/tests/intel/xe_exec_threads.c
> +++ b/tests/intel/xe_exec_threads.c
> @@ -340,7 +340,7 @@ test_compute_mode(int fd, uint32_t vm, uint64_t addr, uint64_t userptr,
>  		xe_exec(fd, &exec);
>  
>  		if (flags & REBIND && i && !(i & 0x1f)) {
> -			for (j = i - 0x20; j <= i; ++j)
> +			for (j = i == 0x20 ? 0 : i - 0x1f; j <= i; ++j)

The change here doesn't seem to be related to, or explained in, the
commit message as far as I can see.  Right now, every time the outer
loop is on a non-zero iteration that's a multiple of 0x20, it runs this
inner loop to wait for that last group of execs to complete.  As
written today, the test accidentally waits twice on some of the execs
(e.g., after exec 0x20 the test waits for 0x0-0x20, but after exec 0x40
it waits for 0x20-0x40, waiting on 0x20 a second time).  It looks like
this hunk is an attempt to fix that by starting the inner loop from
i - 0x1f instead of i - 0x20, with a special case the first time
through so we don't miss 0x0.
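
Concretely, with the current i - 0x20 lower bound:

        i == 0x20:  j = 0x00 .. 0x20   -> waits on 0x00-0x20
        i == 0x40:  j = 0x20 .. 0x40   -> waits on 0x20-0x40 (0x20 again)
        i == 0x60:  j = 0x40 .. 0x60   -> waits on 0x40-0x60 (0x40 again)
        ...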

The logic in this test is already pretty confusing and poorly explained.
Maybe it would be cleaner to replace loop variable j with a running
'last_wait' variable that we use to catch up with all necessary waits?
E.g.,

        int last_wait = -1;
        ...
        for (i = 0; i < n_execs; i++) {
                ...
                do { 
                        last_wait++;
                        xe_wait_ufence(... &data[last_wait].exec_sync ...);
                } while (last_wait < i);
                ...
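
Spelled out a bit further, purely as a sketch (reusing the identifiers
from the hunk above and assuming the wait still only happens on the
REBIND path every 0x20 execs, as it does today):

        int last_wait = -1;

        for (i = 0; i < n_execs; i++) {
                ...
                xe_exec(fd, &exec);

                if (flags & REBIND && i && !(i & 0x1f)) {
                        /* catch up on every exec not yet waited on,
                         * including the current one
                         */
                        do {
                                last_wait++;
                                xe_wait_ufence(fd, &data[last_wait].exec_sync,
                                               USER_FENCE_VALUE,
                                               exec_queues[e], fence_timeout);
                        } while (last_wait < i);
                        ...
                }
        }

That way every exec index is waited on exactly once and the i == 0x20
special case in this patch isn't needed.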

That said, this change still seems unrelated to the change described in
the commit message, so maybe this should be a separate patch?


>  				xe_wait_ufence(fd, &data[j].exec_sync,
>  					       USER_FENCE_VALUE,
>  					       exec_queues[e], fence_timeout);
> @@ -404,16 +404,31 @@ test_compute_mode(int fd, uint32_t vm, uint64_t addr, uint64_t userptr,
>  		}
>  	}
>  
> -	j = flags & INVALIDATE ?
> -		(flags & RACE ? n_execs / 2 + 1 : n_execs - 1) : 0;
> +	j = 0; /* wait for all submissions to complete */
> +	if (flags & INVALIDATE)
> +		/*
> +		 * For !RACE cases xe_wait_ufence has been called in above for-loop
> +		 * except the last batch of submissions. For RACE cases we will need
> +		 * to wait for the second half of the submissions to complete. There
> +		 * is a potential race here because the first half submissions might
> +		 * have updated the fence in the old physical location while the test
> +		 * is remapping the buffer from a different physical location, but the
> +		 * wait_ufence only checks the fence from the new location which would
> +		 * never be updated. We have to assume the first half of the submissions
> +		 * complete before the second half.

Doesn't this assumption just bring us back to the same bug we were
trying to fix here?  We don't have any guarantees on scheduling order or
speed, so the earlier execs might just happen to get scheduled later
than the more recent ones, and we don't really know whether the work is
finished or not if the syncs aren't reliable.  We can't tell whether
they're still truly running, or whether our copy of data from the old
buffer to the new buffer read the pre-completion value a split second
before the GPU wrote something new out to the old buffer; in that case,
when we switch to the new buffer we lose that racing GPU update.

It seems like the proper solution here is really to have a completely
separate bind at a separate location, one that doesn't move or get
clobbered, that holds the syncs for all of the execs.  If there's a bug
in the driver, we'll still see it because we'll get faults from the
0xc0ffee data writes.
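
Very roughly, and only as a sketch (the exec_syncs name here is made
up; everything else is what the test already uses), the idea would be:

        /*
         * User fences live in their own allocation/binding that the
         * INVALIDATE/RACE paths never remap or scribble over; only the
         * data BO keeps racing with the invalidation.
         */
        uint64_t *exec_syncs;   /* one slot per exec, at a stable location */
        ...
        for (i = 0; i < n_execs; i++) {
                ...
                sync[0].addr = to_user_pointer(&exec_syncs[i]);
                xe_exec(fd, &exec);
                ...
        }
        ...
        /* the final catch-up wait reads fences the test never touches */
        for (i = 0; i < n_execs; i++)
                xe_wait_ufence(fd, &exec_syncs[i], USER_FENCE_VALUE,
                               exec_queues[i % n_exec_queues], fence_timeout);

Then the waits stay trustworthy no matter how the data BO is being
invalidated or remapped underneath, and a driver bug still shows up as
a fault on the 0xc0ffee data writes.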


Matt

> +		 */
> +		j = (flags & RACE) ? (n_execs / 2 + 1) : (((n_execs - 1) & ~0x1f) + 1);
> +	else if (flags & REBIND)
> +		/*
> +		 * For REBIND cases xe_wait_ufence has been called in above for-loop
> +		 * except the last batch of submissions.
> +		 */
> +		j = ((n_execs - 1) & ~0x1f) + 1;
> +
>  	for (i = j; i < n_execs; i++)
>  		xe_wait_ufence(fd, &data[i].exec_sync, USER_FENCE_VALUE,
>  			       exec_queues[i % n_exec_queues], fence_timeout);
>  
> -	/* Wait for all execs to complete */
> -	if (flags & INVALIDATE)
> -		sleep(1);
> -
>  	sync[0].addr = to_user_pointer(&data[0].vm_sync);
>  	xe_vm_unbind_async(fd, vm, 0, 0, addr, bo_size, sync, 1);
>  	xe_wait_ufence(fd, &data[0].vm_sync, USER_FENCE_VALUE, 0, fence_timeout);
> -- 
> 2.25.1
> 

-- 
Matt Roper
Graphics Software Engineer
Linux GPU Platform Enablement
Intel Corporation

