[PATCH i-g-t v2] tests/intel/xe_wedged: Avoid racy cleanup
Cavitt, Jonathan
jonathan.cavitt at intel.com
Mon Oct 7 18:13:53 UTC 2024
-----Original Message-----
From: igt-dev <igt-dev-bounces at lists.freedesktop.org> On Behalf Of Matt Roper
Sent: Monday, October 7, 2024 10:48 AM
To: igt-dev at lists.freedesktop.org
Cc: Roper, Matthew D <matthew.d.roper at intel.com>; Vivi, Rodrigo <rodrigo.vivi at intel.com>
Subject: [PATCH i-g-t v2] tests/intel/xe_wedged: Avoid racy cleanup
>
> The wedged-at-any-timeout subtest uses wedged_mode=2 and triggers an
> engine hang via IGT spinner, which means the hardware can become wedged
> at any time after the spinner batch is submitted to the kernel. Any
> further ioctl submissions from the test are effectively racing with the
> GPU and might succeed if they run before the GPU/driver notices the
> hang, or might fail with -ECANCELED if the hang has already been
> detected. This is also true for the cleanup at the end of the
> simple_hang() function (i.e., execqueue destruction, gem close, etc.).
> The IGT wrappers for those cleanup ioctls will terminate the test if
> they fail, causing a bogus test failure.
>
> As soon as the spinner batch is submitted to the kernel, we can't
> guarantee that we have time to sneak in any more ioctls, so just drop
> the cleanup actions; they aren't really needed. In a similar vein, we
> technically shouldn't assume that a 1 second wait will guarantee that
> we've reached wedged status (it almost certainly will, but arbitrary
> delays of "this is probably long enough" are never good). Instead, do a
> syncobj_wait on the spinner; that might complete normally or might
> return ECANCELED, either of which is a signal that we can move on with
> the rest of the test.
>
> v2:
> - Rebase to reconcile conflict with rebind refactor.
>
> Cc: Rodrigo Vivi <rodrigo.vivi at intel.com>
> Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2780
> Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2747
> Signed-off-by: Matt Roper <matthew.d.roper at intel.com>
> ---
> tests/intel/xe_wedged.c | 44 +++++++++++++++++++++++++++++------------
> 1 file changed, 31 insertions(+), 13 deletions(-)
>
> diff --git a/tests/intel/xe_wedged.c b/tests/intel/xe_wedged.c
> index ba790aa8d..88e5d47f2 100644
> --- a/tests/intel/xe_wedged.c
> +++ b/tests/intel/xe_wedged.c
> @@ -142,7 +142,7 @@ simple_exec(int fd, struct drm_xe_engine_class_instance *eci)
> }
>
> static void
> -simple_hang(int fd)
> +simple_hang(int fd, struct drm_xe_sync *sync)
> {
> struct drm_xe_engine_class_instance *eci = &xe_engine(fd, 0)->instance;
> uint32_t vm;
> @@ -163,6 +163,11 @@ simple_hang(int fd)
> struct xe_spin_opts spin_opts = { .preempt = false };
> int err;
>
> + if (sync) {
> + exec_hang.syncs = to_user_pointer(sync);
> + exec_hang.num_syncs = 1;
> + }
> +
> vm = xe_vm_create(fd, 0, 0);
> bo_size = xe_bb_size(fd, sizeof(*data));
> bo = xe_bo_create(fd, vm, bo_size,
> @@ -180,11 +185,6 @@ simple_hang(int fd)
> do {
> err = igt_ioctl(fd, DRM_IOCTL_XE_EXEC, &exec_hang);
> } while (err && errno == ENOMEM);
> -
> - xe_exec_queue_destroy(fd, hang_exec_queue);
> - munmap(data, bo_size);
> - gem_close(fd, bo);
> - xe_vm_destroy(fd, vm);
> }
>
> /**
> @@ -226,19 +226,37 @@ igt_main
> }
>
> igt_subtest_f("wedged-at-any-timeout") {
> + struct drm_xe_sync hang_sync = {
> + .type = DRM_XE_SYNC_TYPE_SYNCOBJ,
> + .flags = DRM_XE_SYNC_FLAG_SIGNAL,
> + };
> + int err;
> +
> igt_require(igt_debugfs_exists(fd, "wedged_mode", O_RDWR));
> ignore_wedged_in_dmesg();
>
> + hang_sync.handle = syncobj_create(fd, 0);
> +
> igt_debugfs_write(fd, "wedged_mode", "2");
> - simple_hang(fd);
> + simple_hang(fd, &hang_sync);
> +
> /*
> - * Any ioctl after the first timeout on wedged_mode=2 is blocked
> - * so we cannot relly on sync objects. Let's wait a bit for
> - * things to settle before we confirm device as wedged and
> - * rebind.
> + * Wait for the hang to be detected. If the hang has already
> + * taken place, this will return ECANCELED and we can just move
> + * on immediately.
> */
> - sleep(1);
> + err = syncobj_wait_err(fd, &hang_sync.handle, 1, INT64_MAX, 0);
> + if (err)
> + igt_assert_eq(err, -ECANCELED);
> +
> + /* Other ioctls should also be returning ECANCELED now */
> igt_assert_neq(simple_ioctl(fd), 0);
> + igt_assert_eq(errno, ECANCELED);
> +
> + /*
> + * Rebind the device and ensure proper operation is restored
> + * for all engines.
> + */
> fd = xe_sysfs_driver_do(fd, pci_slot, XE_SYSFS_DRIVER_REBIND);
> igt_assert_eq(simple_ioctl(fd), 0);
> xe_for_each_engine(fd, hwe)
> @@ -252,7 +270,7 @@ igt_main
> igt_assert_eq(simple_ioctl(fd), 0);
> igt_debugfs_write(fd, "wedged_mode", "1");
> ignore_wedged_in_dmesg();
> - simple_hang(fd);
> + simple_hang(fd, NULL);
I'm guessing we don't need a sync here because we're expecting the hang to
not have an effect on the next ioctl in wedged_mode 1.
Reviewed-by: Jonathan Cavitt <jonathan.cavitt at intel.com>
-Jonathan Cavitt
> igt_assert_eq(simple_ioctl(fd), 0);
> }
>
> --
> 2.46.2
>
>
More information about the igt-dev
mailing list