[PATCH 0/4] Introduce a wedged state
Rodrigo Vivi
rodrigo.vivi at intel.com
Wed Apr 3 15:07:28 UTC 2024
Let's introduce a wedged state in Xe.
First of all, we need more robust protection here.
One of the biggest pain points developers currently face with
Xe is that when something goes wrong and a gt-reset cannot solve
the issue, or when GuC load has failed, we would like to
stay in a system where we can debug and reload the module,
instead of being forced to reboot the system and start over.
The first patch of this series is enough for this protection,
and it is ready for that.
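
As a rough illustration of the protection described above, here is a
userspace C sketch of a device that declares itself wedged when a reset
fails, refuses new work while wedged, and becomes usable again after a
rebind. Every name in it (toy_device, toy_gt_reset, toy_submit,
toy_rebind) is hypothetical and not taken from the xe driver, and the
error codes are assumptions chosen only for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a device; this is not struct xe_device. */
struct toy_device {
	bool wedged;      /* set once recovery has failed */
	bool reset_fails; /* fault-injection knob used only by this sketch */
};

/* Mark the device as wedged. Calling it twice is harmless. */
static void toy_declare_wedged(struct toy_device *dev)
{
	assert(dev != NULL);
	dev->wedged = true;
}

/* Attempt a reset; if it fails, wedge the device instead of retrying. */
static int toy_gt_reset(struct toy_device *dev)
{
	assert(dev != NULL);
	if (dev->reset_fails) {
		toy_declare_wedged(dev);
		return -EIO;
	}
	return 0;
}

/* Entry point for new work. The error code is an assumption. */
static int toy_submit(struct toy_device *dev)
{
	assert(dev != NULL);
	if (dev->wedged)
		return -ECANCELED;
	return 0;
}

/* Model of an unbind + bind cycle (or module reload): start clean. */
static void toy_rebind(struct toy_device *dev)
{
	assert(dev != NULL);
	dev->wedged = false;
	dev->reset_fails = false;
}
```

The point of the sketch is that the wedged flag is sticky: nothing but
an explicit rebind clears it, so the machine stays up and debuggable
while further accesses are rejected.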
The next reason for this series is that in many cases,
software validation teams need to work in a system where
the memory is intact after the first hang. For this
case, the next change in this series adds a very aggressive
wedged mode, where GuC is blocked from attempting a reset and
our xe driver doesn't do any clean-up other than signaling
some fences.
In this aggressive mode it is expected that some memory will
not be clean and even some leaks might happen. Later, during
rebind and even module reload, many kernel warnings will pop up.
But again, the target here is not a safe failure path,
but only a specialized debug of the GPU hang. While
writing this, I realized that I need to add a config protection
so that this mode can only be used in debug+expert mode. This will
come soon while addressing any comments that this new version
of the series might receive.
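
To make the difference between the two modes concrete, the following
userspace C sketch models how a hang might be handled under each one.
The mode numbers 1 and 2 follow this cover letter; the enum names, the
toy_gpu structure and its counters are made up for illustration, and
signaling the fence in both modes is a simplification of this sketch,
not a statement about the driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Mode numbers follow the cover letter; the names are hypothetical. */
enum toy_wedged_mode {
	TOY_WEDGE_ON_FAILED_RECOVERY = 1, /* wedge only if the reset fails */
	TOY_WEDGE_ON_ANY_HANG = 2,        /* wedge at once, never reset */
};

/* Hypothetical device model with counters so the behavior is observable. */
struct toy_gpu {
	enum toy_wedged_mode mode;
	bool wedged;
	bool reset_works;             /* fault-injection knob */
	unsigned int resets;          /* resets attempted so far */
	unsigned int fences_signaled; /* waiters released so far */
};

static void toy_handle_hang(struct toy_gpu *gpu)
{
	assert(gpu != NULL);

	/* Simplification: release the waiter in both modes. */
	gpu->fences_signaled++;

	if (gpu->mode == TOY_WEDGE_ON_ANY_HANG) {
		/* Aggressive mode: no reset and no clean-up, so state is kept. */
		gpu->wedged = true;
		return;
	}

	/* Mode 1: try to recover first, wedge only when recovery fails. */
	gpu->resets++;
	if (!gpu->reset_works)
		gpu->wedged = true;
}
```

In mode 2 the reset counter never moves, which is the property the
validation use case cares about: whatever was in memory at the time of
the first hang is left alone.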
Also, this series comes with some IGT tests (sent to the igt-dev
mailing list):
1. basic-wedged: tests mode 1 with a failure injected in the
   gt-reset, then checks that accesses are blocked,
   does a rebind to restore access, and
   confirms with a simple exec.
2. wedged-at-any-timeout: tests the aggressive mode 2 described
   above, configured through debugfs,
   by running a simple hang case. It then
   checks for the blocked access,
   rebinds to restore the device, and
   confirms with a simple good
   execution.
$ sudo ./build/tests/xe_wedged
IGT-Version: 1.28-ged62883ec967 (x86_64) (Linux: 6.9.0-rc1+ x86_64)
Using IGT_SRANDOM=1712103638 for randomisation
Opened device: /dev/dri/card0
Starting subtest: basic-wedged
Subtest basic-wedged: SUCCESS (2.855s)
Starting subtest: wedged-at-any-timeout
Subtest wedged-at-any-timeout: SUCCESS (4.580s)
Thanks,
Rodrigo.
Rodrigo Vivi (4):
drm/xe: Introduce a simple wedged state
drm/xe: declare wedged upon GuC load failure
drm/xe: Force wedged state and block GT reset upon any GPU hang
drm/xe: Introduce the wedged_mode debugfs
drivers/gpu/drm/xe/xe_debugfs.c | 56 ++++++++++++++++++++
drivers/gpu/drm/xe/xe_device.c | 55 ++++++++++++++++++++
drivers/gpu/drm/xe/xe_device.h | 7 +++
drivers/gpu/drm/xe/xe_device_types.h | 10 ++++
drivers/gpu/drm/xe/xe_gt.c | 5 +-
drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c | 3 ++
drivers/gpu/drm/xe/xe_guc.c | 41 +++++++--------
drivers/gpu/drm/xe/xe_guc_ads.c | 57 ++++++++++++++++++++-
drivers/gpu/drm/xe/xe_guc_ads.h | 1 +
drivers/gpu/drm/xe/xe_guc_ct.c | 8 +++
drivers/gpu/drm/xe/xe_guc_pc.c | 3 ++
drivers/gpu/drm/xe/xe_guc_submit.c | 47 ++++++++++++++---
drivers/gpu/drm/xe/xe_module.c | 5 ++
drivers/gpu/drm/xe/xe_module.h | 1 +
14 files changed, 267 insertions(+), 32 deletions(-)
--
2.44.0