[PATCH 00/34] GC per queue reset

Friedrich Vock friedrich.vock at gmx.de
Thu Jul 18 16:29:26 UTC 2024


Hi,

On 18.07.24 16:06, Alex Deucher wrote:
> This adds preliminary support for GC per queue reset.  In this
> case, only the jobs currently in the queue are lost.  If this
> fails, we fall back to a full adapter reset.

First of all, thank you so much for working on this! It's great to
finally see progress in making GPU resets better.

I've just taken this patchset (together with your other
patchsets[1][2][3]) for a quick spin on my Navi21 with the GPU reset
tests[4] I had written a while ago - the current patchset sadly seems
to have some regressions with respect to recovery there.

I ran the tests under my Plasma Wayland session once - this triggered a
list double-add in drm_sched_stop (calltrace follows):

? die (arch/x86/kernel/dumpstack.c:421 arch/x86/kernel/dumpstack.c:434 arch/x86/kernel/dumpstack.c:447)
? do_trap (arch/x86/kernel/traps.c:113 arch/x86/kernel/traps.c:154)
? __list_add_valid_or_report (lib/list_debug.c:35 (discriminator 1))
? do_error_trap (./arch/x86/include/asm/traps.h:58 arch/x86/kernel/traps.c:175)
? __list_add_valid_or_report (lib/list_debug.c:35 (discriminator 1))
? exc_invalid_op (arch/x86/kernel/traps.c:266)
? __list_add_valid_or_report (lib/list_debug.c:35 (discriminator 1))
? asm_exc_invalid_op (./arch/x86/include/asm/idtentry.h:568)
? __list_add_valid_or_report (lib/list_debug.c:35 (discriminator 1))
? __list_add_valid_or_report (lib/list_debug.c:35 (discriminator 1))
drm_sched_stop (./include/linux/list.h:151 ./include/linux/list.h:169 drivers/gpu/drm/scheduler/sched_main.c:617)
amdgpu_device_gpu_recover (drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:5808)
amdgpu_job_timedout (drivers/gpu/drm/amd/amdgpu/amdgpu_job.c:103)
drm_sched_job_timedout (drivers/gpu/drm/scheduler/sched_main.c:569)
process_one_work (kernel/workqueue.c:2633)
worker_thread (kernel/workqueue.c:2700 (discriminator 2) kernel/workqueue.c:2787 (discriminator 2))
? __pfx_worker_thread (kernel/workqueue.c:2733)
kthread (kernel/kthread.c:388)
? __pfx_kthread (kernel/kthread.c:341)
ret_from_fork (arch/x86/kernel/process.c:147)
? __pfx_kthread (kernel/kthread.c:341)
ret_from_fork_asm (arch/x86/entry/entry_64.S:251)
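
For anyone not familiar with the list debug checks: below is a minimal
userspace sketch of the kind of validation that fires in the trace above.
It is not the kernel's code. The names list_node, list_init,
list_add_valid and list_add_checked are made up for illustration, and the
checks are a simplified version of what I understand
__list_add_valid_or_report to do.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct list_head. */
struct list_node {
	struct list_node *next, *prev;
};

/* An empty circular list points at itself in both directions. */
static void list_init(struct list_node *head)
{
	head->next = head;
	head->prev = head;
}

/*
 * Simplified sanity check before inserting 'node' between 'prev' and 'next'.
 * Returns false if the neighbours are inconsistent or if 'node' is already
 * one of them, which is the "double add" case.
 */
static bool list_add_valid(const struct list_node *node,
			   const struct list_node *prev,
			   const struct list_node *next)
{
	if (next->prev != prev || prev->next != next)
		return false;	/* neighbours do not point at each other */
	if (node == prev || node == next)
		return false;	/* node is already linked at this position */
	return true;
}

/* Insert 'node' right after 'head', refusing insertions that fail the check. */
static bool list_add_checked(struct list_node *node, struct list_node *head)
{
	struct list_node *next = head->next;

	if (!list_add_valid(node, head, next))
		return false;

	next->prev = node;
	node->next = next;
	node->prev = head;
	head->next = node;
	return true;
}
```

Adding the same node to the head of a list twice in a row makes the second
list_add_checked() call return false, since the node is already head->next.
Note that this sketch only catches a node that is still adjacent to the
insertion point; a node that sits deeper in the list would go unnoticed.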

When running the tests without a desktop environment active, the
double-add disappeared, but the GPU reset still didn't go well - the TTY
remained frozen and the kernel log contained a few messages like:

[drm] *ERROR* [CRTC:90:crtc-0] flip_done timed out

which I guess means at least the display subsystem is hung.

Hope this info is enough to repro/investigate.

Thanks,
Friedrich

[1] https://lore.kernel.org/amd-gfx/20240717203740.14059-1-alexander.deucher@amd.com/T/#t
[2] https://lore.kernel.org/amd-gfx/20240717203847.14600-1-alexander.deucher@amd.com/T/#t
[3] https://lore.kernel.org/amd-gfx/230ee72e-4f7f-4894-a789-2e1e5788344f@amd.com/T/#t
[4] https://gitlab.steamos.cloud/holo/HangTestSuite


>
> Alex Deucher (19):
>    drm/amdgpu/mes: add API for legacy queue reset
>    drm/amdgpu/mes11: add API for legacy queue reset
>    drm/amdgpu/mes12: add API for legacy queue reset
>    drm/amdgpu/mes: add API for user queue reset
>    drm/amdgpu/mes11: add API for user queue reset
>    drm/amdgpu/mes12: add API for user queue reset
>    drm/amdgpu: add new ring reset callback
>    drm/amdgpu: add per ring reset support (v2)
>    drm/amdgpu/gfx11: add ring reset callbacks
>    drm/amdgpu/gfx11: rename gfx_v11_0_gfx_init_queue()
>    drm/amdgpu/gfx10: add ring reset callbacks
>    drm/amdgpu/gfx10: rework reset sequence
>    drm/amdgpu/gfx9: add ring reset callback
>    drm/amdgpu/gfx9.4.3: add ring reset callback
>    drm/amdgpu/gfx12: add ring reset callbacks
>    drm/amdgpu/gfx12: fallback to driver reset compute queue directly
>    drm/amdgpu/gfx11: enter safe mode before touching CP_INT_CNTL
>    drm/amdgpu/gfx11: add a mutex for the gfx semaphore
>    drm/amdgpu/gfx11: export gfx_v11_0_request_gfx_index_mutex()
>
> Jiadong Zhu (13):
>    drm/amdgpu/gfx11: wait for reset done before remap
>    drm/amdgpu/gfx10: remap queue after reset successfully
>    drm/amdgpu/gfx10: wait for reset done before remap
>    drm/amdgpu/gfx9: remap queue after reset successfully
>    drm/amdgpu/gfx9: wait for reset done before remap
>    drm/amdgpu/gfx9.4.3: remap queue after reset successfully
>    drm/amdgpu/gfx_9.4.3: wait for reset done before remap
>    drm/amdgpu/gfx: add a new kiq_pm4_funcs callback for reset_hw_queue
>    drm/amdgpu/gfx9: implement reset_hw_queue for gfx9
>    drm/amdgpu/gfx9.4.3: implement reset_hw_queue for gfx9.4.3
>    drm/amdgpu/mes: modify mes api for mmio queue reset
>    drm/amdgpu/mes: implement amdgpu_mes_reset_hw_queue_mmio
>    drm/amdgpu/mes11: implement mmio queue reset for gfx11
>
> Prike Liang (2):
>    drm/amdgpu: increase the reset counter for the queue reset
>    drm/amdgpu/gfx11: fallback to driver reset compute queue directly (v2)
>
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |   1 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h    |   6 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  18 +++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c    |  88 ++++++++++++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h    |  37 +++++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |   2 +
>   drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c     | 158 ++++++++++++++++++++-
>   drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c     | 117 +++++++++++++--
>   drivers/gpu/drm/amd/amdgpu/gfx_v11_0.h     |   3 +
>   drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c     |  95 ++++++++++++-
>   drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c      | 126 +++++++++++++++-
>   drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c    | 125 +++++++++++++++-
>   drivers/gpu/drm/amd/amdgpu/mes_v11_0.c     | 132 +++++++++++++++++
>   drivers/gpu/drm/amd/amdgpu/mes_v12_0.c     |  54 +++++++
>   14 files changed, 930 insertions(+), 32 deletions(-)
>




