[PATCH 0/4] Introduce a wedged state

Rodrigo Vivi rodrigo.vivi at intel.com
Wed Apr 3 15:07:28 UTC 2024


Let's introduce a wedged state in Xe.

First of all, we need some more robust protection here.
One of the biggest pains developers currently face with
Xe is that when something goes wrong and gt-reset cannot solve
the issue, or when GuC load has failed, we would like to
stay in a system where we can debug and reload the module,
instead of being forced to reboot the system and start over.

The first patch of this series is enough for this protection,
and it is ready for that.

The next reasoning for this series is that in many cases,
software validation teams need to work in a system where
the memory is intact after the first hang. For this
case, the next change in this series adds a very aggressive
wedged mode, where GuC is blocked from attempting a reset and
our xe driver doesn't do any clean-up other than just signaling
some fences.
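
As a rough illustration of the two behaviours described so far, the
sketch below models the policy in plain userspace C. It is not code from
this series: every name in it (toy_device, toy_wedged_mode,
toy_handle_hang, toy_handle_guc_load_failure) is hypothetical, the mode
numbers are taken from the IGT description in this letter, and treating
a GuC load failure as wedging in every mode is an assumption based only
on the patch title.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mode values; only modes 1 and 2 are mentioned in this letter. */
enum toy_wedged_mode {
	TOY_WEDGE_ON_CRITICAL = 1, /* wedge only when recovery itself fails */
	TOY_WEDGE_ON_ANY_HANG = 2, /* aggressive: wedge on the first hang */
};

struct toy_device {
	enum toy_wedged_mode mode;
	bool wedged;         /* once set, stays set in this model */
	unsigned int resets; /* number of GT resets attempted */
};

/*
 * Called when a hang is detected. reset_would_succeed stands in for the
 * outcome of a real GT reset. Returns true if the device ends up wedged.
 */
bool toy_handle_hang(struct toy_device *dev, bool reset_would_succeed)
{
	assert(dev != NULL);

	if (dev->wedged)
		return true;

	if (dev->mode == TOY_WEDGE_ON_ANY_HANG) {
		/* No reset and no clean-up: keep state intact for debugging. */
		dev->wedged = true;
		return true;
	}

	dev->resets++;
	if (!reset_would_succeed)
		dev->wedged = true; /* the reset could not fix it: stop here */

	return dev->wedged;
}

/* Assumption: a GuC load failure wedges the device regardless of mode. */
void toy_handle_guc_load_failure(struct toy_device *dev)
{
	assert(dev != NULL);
	dev->wedged = true;
}
```

In mode 1 a hang followed by a successful reset leaves the toy device
usable, while in mode 2 the first hang wedges it without any reset
attempt, which is the property validation teams need.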

In this aggressive mode it is expected that some memory will
not be clean and even some leaks might happen. Later, during
rebind and even module reload, many kernel warnings will pop up.
But again, the target here is not a safe failure path,
but only a specialized debug of the GPU hang. While
writing this, I realized that I need to add a config protection
so that this mode can only be used in debug+expert mode. This will
come soon while addressing any comments that this new version
of the series might receive.

Also, this series comes with some IGT tests (sent to the igt-dev
mailing list):
1. basic-wedged: tests mode 1 with a failure injected in the
   gt-reset, then checks that accesses are blocked,
   does a rebind to restore access, and
   confirms with a simple exec.
2. wedged-at-any-timeout: tests the aggressive mode 2 described
   above, configuring it through debugfs, and
   then running a simple-hang case. It then
   also checks for the blocked access,
   rebinds to restore the
   device, and confirms with a simple good
   execution.
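
The sequence both subtests share (wedge, check that access is blocked,
rebind, run a good exec) can be sketched as a small userspace model. This
is not the IGT code: sim_device, sim_access, sim_rebind and
sim_wedge_and_recover are hypothetical names, and the -ECANCELED return
value for a blocked access is an assumption made for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct sim_device {
	bool wedged;
};

/* Stand-in for any userspace access: refused while the device is wedged. */
int sim_access(const struct sim_device *dev)
{
	assert(dev != NULL);
	return dev->wedged ? -ECANCELED : 0; /* error code is an assumption */
}

/* Stand-in for unbinding and rebinding the driver: the device starts clean. */
void sim_rebind(struct sim_device *dev)
{
	assert(dev != NULL);
	dev->wedged = false;
}

/* Walks the steps shared by both subtests; returns 0 if every check passes. */
int sim_wedge_and_recover(struct sim_device *dev)
{
	assert(dev != NULL);

	dev->wedged = true;        /* 1. hang or injected reset failure */
	if (sim_access(dev) == 0)
		return -1;         /* 2. access must be blocked */
	sim_rebind(dev);           /* 3. rebind restores the device */
	return sim_access(dev);    /* 4. a simple exec works again */
}
```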

$ sudo ./build/tests/xe_wedged
IGT-Version: 1.28-ged62883ec967 (x86_64) (Linux: 6.9.0-rc1+ x86_64)
Using IGT_SRANDOM=1712103638 for randomisation
Opened device: /dev/dri/card0
Starting subtest: basic-wedged
Subtest basic-wedged: SUCCESS (2.855s)
Starting subtest: wedged-at-any-timeout
Subtest wedged-at-any-timeout: SUCCESS (4.580s)

Thanks,
Rodrigo.

Rodrigo Vivi (4):
  drm/xe: Introduce a simple wedged state
  drm/xe: declare wedged upon GuC load failure
  drm/xe: Force wedged state and block GT reset upon any GPU hang
  drm/xe: Introduce the wedged_mode debugfs

 drivers/gpu/drm/xe/xe_debugfs.c             | 56 ++++++++++++++++++++
 drivers/gpu/drm/xe/xe_device.c              | 55 ++++++++++++++++++++
 drivers/gpu/drm/xe/xe_device.h              |  7 +++
 drivers/gpu/drm/xe/xe_device_types.h        | 10 ++++
 drivers/gpu/drm/xe/xe_gt.c                  |  5 +-
 drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c |  3 ++
 drivers/gpu/drm/xe/xe_guc.c                 | 41 +++++++--------
 drivers/gpu/drm/xe/xe_guc_ads.c             | 57 ++++++++++++++++++++-
 drivers/gpu/drm/xe/xe_guc_ads.h             |  1 +
 drivers/gpu/drm/xe/xe_guc_ct.c              |  8 +++
 drivers/gpu/drm/xe/xe_guc_pc.c              |  3 ++
 drivers/gpu/drm/xe/xe_guc_submit.c          | 47 ++++++++++++++---
 drivers/gpu/drm/xe/xe_module.c              |  5 ++
 drivers/gpu/drm/xe/xe_module.h              |  1 +
 14 files changed, 267 insertions(+), 32 deletions(-)

-- 
2.44.0


