[PATCH i-g-t v2] tests/intel/xe_wedged: Avoid racy cleanup

Mon Oct 7 18:13:53 UTC 2024

-----Original Message-----
From: igt-dev <igt-dev-bounces at lists.freedesktop.org> On Behalf Of Matt Roper
Sent: Monday, October 7, 2024 10:48 AM
To: igt-dev at lists.freedesktop.org
Cc: Roper, Matthew D <matthew.d.roper at intel.com>; Vivi, Rodrigo <rodrigo.vivi at intel.com>
Subject: [PATCH i-g-t v2] tests/intel/xe_wedged: Avoid racy cleanup
> 
> The wedged-at-any-timeout subtest uses wedged_mode=2 and triggers an
> engine hang via IGT spinner, which means the hardware can become wedged
> at any time after the spinner batch is submitted to the kernel.  Any
> further ioctl submissions from the test are effectively racing with the
> GPU and might succeed if they run before the GPU/driver notices the
> hang, or might fail with -ECANCELED if the hang has already been
> detected.  This is also true for the cleanup at the end of the
> simple_hang() function (i.e., execqueue destruction, gem close, etc.).
> The IGT wrappers for those cleanup ioctls will terminate the test if
> they fail, causing a bogus test failure.
> 
> As soon as the spinner batch is submitted to the kernel, we can't
> guarantee that we have time to sneak in any more ioctls, so just drop
> the cleanup actions; they aren't really needed.  In a similar vein, we
> technically shouldn't assume that a 1 second wait will guarantee that
> we've reached wedged status (it almost certainly will, but arbitrary
> delays of "this is probably long enough" are never good).  Instead, do a
> syncobj_wait on the spinner; that might complete normally or might
> return ECANCELED, either of which is a signal that we can move on with
> the rest of the test.
> 
> v2:
>  - Rebase to reconcile conflict with rebind refactor.
> 
> Cc: Rodrigo Vivi <rodrigo.vivi at intel.com>
> Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2780
> Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2747
> Signed-off-by: Matt Roper <matthew.d.roper at intel.com>
> ---
>  tests/intel/xe_wedged.c | 44 +++++++++++++++++++++++++++++------------
>  1 file changed, 31 insertions(+), 13 deletions(-)
> 
> diff --git a/tests/intel/xe_wedged.c b/tests/intel/xe_wedged.c
> index ba790aa8d..88e5d47f2 100644
> --- a/tests/intel/xe_wedged.c
> +++ b/tests/intel/xe_wedged.c
> @@ -142,7 +142,7 @@ simple_exec(int fd, struct drm_xe_engine_class_instance *eci)
>  }
>  
>  static void
> -simple_hang(int fd)
> +simple_hang(int fd, struct drm_xe_sync *sync)
>  {
>  	struct drm_xe_engine_class_instance *eci = &xe_engine(fd, 0)->instance;
>  	uint32_t vm;
> @@ -163,6 +163,11 @@ simple_hang(int fd)
>  	struct xe_spin_opts spin_opts = { .preempt = false };
>  	int err;
>  
> +	if (sync) {
> +		exec_hang.syncs = to_user_pointer(sync);
> +		exec_hang.num_syncs = 1;
> +	}
> +
>  	vm = xe_vm_create(fd, 0, 0);
>  	bo_size = xe_bb_size(fd, sizeof(*data));
>  	bo = xe_bo_create(fd, vm, bo_size,
> @@ -180,11 +185,6 @@ simple_hang(int fd)
>  	do {
>  		err =  igt_ioctl(fd, DRM_IOCTL_XE_EXEC, &exec_hang);
>  	} while (err && errno == ENOMEM);
> -
> -	xe_exec_queue_destroy(fd, hang_exec_queue);
> -	munmap(data, bo_size);
> -	gem_close(fd, bo);
> -	xe_vm_destroy(fd, vm);
>  }
>  
>  /**
> @@ -226,19 +226,37 @@ igt_main
>  	}
>  
>  	igt_subtest_f("wedged-at-any-timeout") {
> +		struct drm_xe_sync hang_sync = {
> +			.type = DRM_XE_SYNC_TYPE_SYNCOBJ,
> +			.flags = DRM_XE_SYNC_FLAG_SIGNAL,
> +		};
> +		int err;
> +
>  		igt_require(igt_debugfs_exists(fd, "wedged_mode", O_RDWR));
>  		ignore_wedged_in_dmesg();
>  
> +		hang_sync.handle = syncobj_create(fd, 0);
> +
>  		igt_debugfs_write(fd, "wedged_mode", "2");
> -		simple_hang(fd);
> +		simple_hang(fd, &hang_sync);
> +
>  		/*
> -		 * Any ioctl after the first timeout on wedged_mode=2 is blocked
> -		 * so we cannot relly on sync objects. Let's wait a bit for
> -		 * things to settle before we confirm device as wedged and
> -		 * rebind.
> +		 * Wait for the hang to be detected.  If the hang has already
> +		 * taken place, this will return ECANCELED and we can just move
> +		 * on immediately.
>  		 */
> -		sleep(1);
> +		err = syncobj_wait_err(fd, &hang_sync.handle, 1, INT64_MAX, 0);
> +		if (err)
> +			igt_assert_eq(err, -ECANCELED);
> +
> +		/* Other ioctls should also be returning ECANCELED now */
>  		igt_assert_neq(simple_ioctl(fd), 0);
> +		igt_assert_eq(errno, ECANCELED);
> +
> +		/*
> +		 * Rebind the device and ensure proper operation is restored
> +		 * for all engines.
> +		 */
>  		fd = xe_sysfs_driver_do(fd, pci_slot, XE_SYSFS_DRIVER_REBIND);
>  		igt_assert_eq(simple_ioctl(fd), 0);
>  		xe_for_each_engine(fd, hwe)
> @@ -252,7 +270,7 @@ igt_main
>  		igt_assert_eq(simple_ioctl(fd), 0);
>  		igt_debugfs_write(fd, "wedged_mode", "1");
>  		ignore_wedged_in_dmesg();
> -		simple_hang(fd);
> +		simple_hang(fd, NULL);

I'm guessing we don't need a sync here because we're expecting the hang to
not have an effect on the next ioctl in wedged_mode 1.
Reviewed-by: Jonathan Cavitt <jonathan.cavitt at intel.com>
-Jonathan Cavitt

>  		igt_assert_eq(simple_ioctl(fd), 0);
>  	}
>  
> -- 
> 2.46.2
> 
>