[PATCH 00/34] GC per queue reset

Fri Jul 19 13:39:37 UTC 2024

On Thu, Jul 18, 2024 at 1:00 PM Friedrich Vock <friedrich.vock at gmx.de> wrote:
>
> Hi,
>
> On 18.07.24 16:06, Alex Deucher wrote:
> > This adds preliminary support for GC per queue reset.  In this
> > case, only the jobs currently in the queue are lost.  If this
> > fails, we fall back to a full adapter reset.
>
> First of all, thank you so much for working on this! It's great to
> finally see progress in making GPU resets better.
>
> I've just taken this patchset (together with your other
> patchsets[1][2][3]) for a quick spin on my Navi21 with the GPU reset
> tests[4] I had written a while ago - the current patchset sadly seems
> to have some regressions WRT recovery there.
>
> I ran the tests under my Plasma Wayland session once - this triggered a
> list double-add in drm_sched_stop (calltrace follows):

I think this should fix the double add:

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 7107c4d3a3b6..555d3b671bdb 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -88,6 +88,8 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
                                drm_sched_start(&ring->sched, true);
                        goto exit;
                }
+               if (amdgpu_ring_sched_ready(ring))
+                       drm_sched_start(&ring->sched, true);
        }

        if (amdgpu_device_should_recover_gpu(ring->adev)) {
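
For context on the check that fired in the quoted trace
(__list_add_valid_or_report), here is a minimal user-space sketch of how
adding the same node to a list twice can be detected.  It is only an
illustration, not the kernel implementation: list_node, list_init,
list_add_checked and demo_double_add are made-up names, and the "pending"
and "job" variables merely stand in for a scheduler's pending list and a
job's list entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a circular doubly linked list node. */
struct list_node {
	struct list_node *prev;
	struct list_node *next;
};

/* An empty list is a head that points at itself. */
static void list_init(struct list_node *head)
{
	head->prev = head;
	head->next = head;
}

/*
 * Insert "new" right after "head", but refuse to do so when the
 * neighbours are inconsistent or when "new" is already one of them.
 * Returns true if the node was added, false if the add was rejected.
 */
static bool list_add_checked(struct list_node *new, struct list_node *head)
{
	struct list_node *prev = head;
	struct list_node *next = head->next;

	/* The two neighbours must still point at each other. */
	if (next->prev != prev || prev->next != next)
		return false;

	/* Adding a node that is already adjacent is a double add. */
	if (new == prev || new == next)
		return false;

	next->prev = new;
	new->next = next;
	new->prev = prev;
	prev->next = new;
	return true;
}

/* Returns 1 if the second add of the same node was rejected. */
static int demo_double_add(void)
{
	struct list_node pending, job;
	bool first, second;

	list_init(&pending);
	first = list_add_checked(&job, &pending);   /* normal insert */
	second = list_add_checked(&job, &pending);  /* same node again */

	assert(first);
	return second ? 0 : 1;
}
```

Calling demo_double_add() from a small main() is enough to try it out.  In
this sketch the second insert is rejected because the node is already the
first element after the head, so the list stays intact instead of being
corrupted.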


>
> ? die (arch/x86/kernel/dumpstack.c:421 arch/x86/kernel/dumpstack.c:434 arch/x86/kernel/dumpstack.c:447)
> ? do_trap (arch/x86/kernel/traps.c:113 arch/x86/kernel/traps.c:154)
> ? __list_add_valid_or_report (lib/list_debug.c:35 (discriminator 1))
> ? do_error_trap (./arch/x86/include/asm/traps.h:58 arch/x86/kernel/traps.c:175)
> ? __list_add_valid_or_report (lib/list_debug.c:35 (discriminator 1))
> ? exc_invalid_op (arch/x86/kernel/traps.c:266)
> ? __list_add_valid_or_report (lib/list_debug.c:35 (discriminator 1))
> ? asm_exc_invalid_op (./arch/x86/include/asm/idtentry.h:568)
> ? __list_add_valid_or_report (lib/list_debug.c:35 (discriminator 1))
> ? __list_add_valid_or_report (lib/list_debug.c:35 (discriminator 1))
> drm_sched_stop (./include/linux/list.h:151 ./include/linux/list.h:169 drivers/gpu/drm/scheduler/sched_main.c:617)
> amdgpu_device_gpu_recover (drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:5808)
> amdgpu_job_timedout (drivers/gpu/drm/amd/amdgpu/amdgpu_job.c:103)
> drm_sched_job_timedout (drivers/gpu/drm/scheduler/sched_main.c:569)
> process_one_work (kernel/workqueue.c:2633)
> worker_thread (kernel/workqueue.c:2700 (discriminator 2) kernel/workqueue.c:2787 (discriminator 2))
> ? __pfx_worker_thread (kernel/workqueue.c:2733)
> kthread (kernel/kthread.c:388)
> ? __pfx_kthread (kernel/kthread.c:341)
> ret_from_fork (arch/x86/kernel/process.c:147)
> ? __pfx_kthread (kernel/kthread.c:341)
> ret_from_fork_asm (arch/x86/entry/entry_64.S:251)
>
> When running the tests without a desktop environment active, the
> double-add disappeared, but the GPU reset still didn't go well - the TTY
> remained frozen and the kernel log contained a few messages like:
>
> [drm] *ERROR* [CRTC:90:crtc-0] flip_done timed out

I don't think the display hardware is hung; I think it's a fence
signalling issue after the reset.  We are investigating some
limitations we are seeing in the handling of fences.

>
> which I guess means at least the display subsystem is hung.
>
> Hope this info is enough to repro/investigate.

Thanks for testing!

Alex

>
> Thanks,
> Friedrich
>
> [1] https://lore.kernel.org/amd-gfx/20240717203740.14059-1-alexander.deucher@amd.com/T/#t
> [2] https://lore.kernel.org/amd-gfx/20240717203847.14600-1-alexander.deucher@amd.com/T/#t
> [3] https://lore.kernel.org/amd-gfx/230ee72e-4f7f-4894-a789-2e1e5788344f@amd.com/T/#t
> [4] https://gitlab.steamos.cloud/holo/HangTestSuite
>
>
> >
> > Alex Deucher (19):
> >    drm/amdgpu/mes: add API for legacy queue reset
> >    drm/amdgpu/mes11: add API for legacy queue reset
> >    drm/amdgpu/mes12: add API for legacy queue reset
> >    drm/amdgpu/mes: add API for user queue reset
> >    drm/amdgpu/mes11: add API for user queue reset
> >    drm/amdgpu/mes12: add API for user queue reset
> >    drm/amdgpu: add new ring reset callback
> >    drm/amdgpu: add per ring reset support (v2)
> >    drm/amdgpu/gfx11: add ring reset callbacks
> >    drm/amdgpu/gfx11: rename gfx_v11_0_gfx_init_queue()
> >    drm/amdgpu/gfx10: add ring reset callbacks
> >    drm/amdgpu/gfx10: rework reset sequence
> >    drm/amdgpu/gfx9: add ring reset callback
> >    drm/amdgpu/gfx9.4.3: add ring reset callback
> >    drm/amdgpu/gfx12: add ring reset callbacks
> >    drm/amdgpu/gfx12: fallback to driver reset compute queue directly
> >    drm/amdgpu/gfx11: enter safe mode before touching CP_INT_CNTL
> >    drm/amdgpu/gfx11: add a mutex for the gfx semaphore
> >    drm/amdgpu/gfx11: export gfx_v11_0_request_gfx_index_mutex()
> >
> > Jiadong Zhu (13):
> >    drm/amdgpu/gfx11: wait for reset done before remap
> >    drm/amdgpu/gfx10: remap queue after reset successfully
> >    drm/amdgpu/gfx10: wait for reset done before remap
> >    drm/amdgpu/gfx9: remap queue after reset successfully
> >    drm/amdgpu/gfx9: wait for reset done before remap
> >    drm/amdgpu/gfx9.4.3: remap queue after reset successfully
> >    drm/amdgpu/gfx_9.4.3: wait for reset done before remap
> >    drm/amdgpu/gfx: add a new kiq_pm4_funcs callback for reset_hw_queue
> >    drm/amdgpu/gfx9: implement reset_hw_queue for gfx9
> >    drm/amdgpu/gfx9.4.3: implement reset_hw_queue for gfx9.4.3
> >    drm/amdgpu/mes: modify mes api for mmio queue reset
> >    drm/amdgpu/mes: implement amdgpu_mes_reset_hw_queue_mmio
> >    drm/amdgpu/mes11: implement mmio queue reset for gfx11
> >
> > Prike Liang (2):
> >    drm/amdgpu: increase the reset counter for the queue reset
> >    drm/amdgpu/gfx11: fallback to driver reset compute queue directly (v2)
> >
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |   1 +
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h    |   6 +
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  18 +++
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c    |  88 ++++++++++++
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h    |  37 +++++
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |   2 +
> >   drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c     | 158 ++++++++++++++++++++-
> >   drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c     | 117 +++++++++++++--
> >   drivers/gpu/drm/amd/amdgpu/gfx_v11_0.h     |   3 +
> >   drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c     |  95 ++++++++++++-
> >   drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c      | 126 +++++++++++++++-
> >   drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c    | 125 +++++++++++++++-
> >   drivers/gpu/drm/amd/amdgpu/mes_v11_0.c     | 132 +++++++++++++++++
> >   drivers/gpu/drm/amd/amdgpu/mes_v12_0.c     |  54 +++++++
> >   14 files changed, 930 insertions(+), 32 deletions(-)
> >
>
>
>