[PATCH i-g-t] tests/intel/xe_wedged: Avoid racy cleanup
Matt Roper
matthew.d.roper at intel.com
Fri Oct 4 00:17:16 UTC 2024
The wedged-at-any-timeout subtest uses wedged_mode=2 and triggers an
engine hang via IGT spinner, which means the hardware can become wedged
at any time after the spinner batch is submitted to the kernel. Any
further ioctl submissions from the test are effectively racing with the
GPU and might succeed if they run before the GPU/driver notices the
hang, or might fail with -ECANCELED if the hang has already been
detected. This is also true for the cleanup at the end of the
simple_hang() function (i.e., execqueue destruction, gem close, etc.).
The IGT wrappers for those cleanup ioctls will terminate the test if
they fail, causing a bogus test failure.
As soon as the spinner batch is submitted to the kernel, we can't
guarantee that we have time to sneak in any more ioctls, so just drop
the cleanup actions; they aren't really needed. In a similar vein, we
technically shouldn't assume that a 1 second wait will guarantee that
we've reached wedged status (it almost certainly will, but arbitrary
delays of "this is probably long enough" are never good). Instead, do a
syncobj_wait on the spinner; that might complete normally or might
return ECANCELED, either of which is a signal that we can move on with
the rest of the test.
Cc: Rodrigo Vivi <rodrigo.vivi at intel.com>
Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2780
Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2747
Signed-off-by: Matt Roper <matthew.d.roper at intel.com>
---
tests/intel/xe_wedged.c | 44 +++++++++++++++++++++++++++++------------
1 file changed, 31 insertions(+), 13 deletions(-)
diff --git a/tests/intel/xe_wedged.c b/tests/intel/xe_wedged.c
index a3f7a697f..3678abf13 100644
--- a/tests/intel/xe_wedged.c
+++ b/tests/intel/xe_wedged.c
@@ -170,7 +170,7 @@ simple_exec(int fd, struct drm_xe_engine_class_instance *eci)
}
static void
-simple_hang(int fd)
+simple_hang(int fd, struct drm_xe_sync *sync)
{
struct drm_xe_engine_class_instance *eci = &xe_engine(fd, 0)->instance;
uint32_t vm;
@@ -191,6 +191,11 @@ simple_hang(int fd)
struct xe_spin_opts spin_opts = { .preempt = false };
int err;
+ if (sync) {
+ exec_hang.syncs = to_user_pointer(sync);
+ exec_hang.num_syncs = 1;
+ }
+
vm = xe_vm_create(fd, 0, 0);
bo_size = xe_bb_size(fd, sizeof(*data));
bo = xe_bo_create(fd, vm, bo_size,
@@ -208,11 +213,6 @@ simple_hang(int fd)
do {
err = igt_ioctl(fd, DRM_IOCTL_XE_EXEC, &exec_hang);
} while (err && errno == ENOMEM);
-
- xe_exec_queue_destroy(fd, hang_exec_queue);
- munmap(data, bo_size);
- gem_close(fd, bo);
- xe_vm_destroy(fd, vm);
}
/**
@@ -252,19 +252,37 @@ igt_main
}
igt_subtest_f("wedged-at-any-timeout") {
+ struct drm_xe_sync hang_sync = {
+ .type = DRM_XE_SYNC_TYPE_SYNCOBJ,
+ .flags = DRM_XE_SYNC_FLAG_SIGNAL,
+ };
+ int err;
+
igt_require(igt_debugfs_exists(fd, "wedged_mode", O_RDWR));
ignore_wedged_in_dmesg();
+ hang_sync.handle = syncobj_create(fd, 0);
+
igt_debugfs_write(fd, "wedged_mode", "2");
- simple_hang(fd);
+ simple_hang(fd, &hang_sync);
+
/*
- * Any ioctl after the first timeout on wedged_mode=2 is blocked
- * so we cannot relly on sync objects. Let's wait a bit for
- * things to settle before we confirm device as wedged and
- * rebind.
+ * Wait for the hang to be detected. If the hang has already
+ * taken place, this will return ECANCELED and we can just move
+ * on immediately.
*/
- sleep(1);
+ err = syncobj_wait_err(fd, &hang_sync.handle, 1, INT64_MAX, 0);
+ if (err)
+ igt_assert_eq(err, -ECANCELED);
+
+ /* Other ioctls should also be returning ECANCELED now */
igt_assert_neq(simple_ioctl(fd), 0);
+ igt_assert_eq(errno, ECANCELED);
+
+ /*
+ * Rebind the device and ensure proper operation is restored
+ * for all engines.
+ */
fd = rebind_xe(fd);
igt_assert_eq(simple_ioctl(fd), 0);
xe_for_each_engine(fd, hwe)
@@ -278,7 +296,7 @@ igt_main
igt_assert_eq(simple_ioctl(fd), 0);
igt_debugfs_write(fd, "wedged_mode", "1");
ignore_wedged_in_dmesg();
- simple_hang(fd);
+ simple_hang(fd, NULL);
igt_assert_eq(simple_ioctl(fd), 0);
}
--
2.46.2
More information about the igt-dev
mailing list