[PATCH 00/34] GC per queue reset
Christopher Snowhill
chris at kode54.net
Tue Jul 23 08:50:18 UTC 2024
Alex Deucher <alexdeucher at gmail.com> writes:
> On Thu, Jul 18, 2024 at 10:15 AM Alex Deucher <alexander.deucher at amd.com> wrote:
>>
>> This adds preliminary support for GC per queue reset. In this
>> case, only the jobs currently in the queue are lost. If this
>> fails, we fall back to a full adapter reset.
>
> Also available here via git:
> https://gitlab.freedesktop.org/agd5f/linux/-/commits/amd-staging-drm-next-queue-reset
I just tested this, after hitting the double-add crash while trying to
reset after a GPU hang. It doesn't seem to recover gracefully from this
particular GPU hang, but at least it now resets properly. I'm still not
going to try it against KDE / Plasma 6.1.3 on Arch, as that
loves to hang if there's any Xwayland involved in the GPU reset event.
However, under labwc-git with my own PR applied to it, it recovers okay,
though Xwayland eventually crashes and is restarted by labwc. Here's a
dmesg log excerpt of the reset and recovery event:
[ 189.830630] amdgpu 0000:0a:00.0: amdgpu: ring gfx_0.0.0 timeout, signaled seq=52410, emitted seq=52412
[ 189.830642] amdgpu 0000:0a:00.0: amdgpu: Process information: process Stray-Win64-Shi pid 11560 thread vkd3d_queue pid 11719
[ 190.099191] amdgpu 0000:0a:00.0: amdgpu: GPU reset begin!
[ 190.457702] amdgpu 0000:0a:00.0: amdgpu: Dumping IP State
[ 190.459418] amdgpu 0000:0a:00.0: amdgpu: Dumping IP State Completed
[ 190.459420] amdgpu 0000:0a:00.0: amdgpu: MODE1 reset
[ 190.459423] amdgpu 0000:0a:00.0: amdgpu: GPU mode1 reset
[ 190.459483] amdgpu 0000:0a:00.0: amdgpu: GPU smu mode1 reset
[ 190.967464] amdgpu 0000:0a:00.0: amdgpu: GPU reset succeeded, trying to resume
[ 190.967824] [drm] PCIE GART of 512M enabled (table at 0x0000008000300000).
[ 190.967912] [drm] VRAM is lost due to GPU reset!
[ 190.967914] amdgpu 0000:0a:00.0: amdgpu: PSP is resuming...
[ 191.042264] amdgpu 0000:0a:00.0: amdgpu: reserve 0xa00000 from 0x82fd000000 for PSP TMR
[ 191.143003] amdgpu 0000:0a:00.0: amdgpu: RAS: optional ras ta ucode is not available
[ 191.156566] amdgpu 0000:0a:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
[ 191.156572] amdgpu 0000:0a:00.0: amdgpu: SMU is resuming...
[ 191.156576] amdgpu 0000:0a:00.0: amdgpu: smu driver if version = 0x0000000e, smu fw if version = 0x00000012, smu fw program = 0, version = 0x00413e00 (65.62.0)
[ 191.156580] amdgpu 0000:0a:00.0: amdgpu: SMU driver if version not matched
[ 191.156609] amdgpu 0000:0a:00.0: amdgpu: use vbios provided pptable
[ 191.215750] amdgpu 0000:0a:00.0: amdgpu: SMU is resumed successfully!
[ 191.217023] [drm] DMUB hardware initialized: version=0x02020020
[ 191.530005] [drm] kiq ring mec 2 pipe 1 q 0
[ 191.532863] amdgpu 0000:0a:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[ 191.532866] amdgpu 0000:0a:00.0: amdgpu: ring gfx_0.1.0 uses VM inv eng 1 on hub 0
[ 191.532867] amdgpu 0000:0a:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 4 on hub 0
[ 191.532869] amdgpu 0000:0a:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 5 on hub 0
[ 191.532870] amdgpu 0000:0a:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
[ 191.532871] amdgpu 0000:0a:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
[ 191.532872] amdgpu 0000:0a:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
[ 191.532874] amdgpu 0000:0a:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
[ 191.532875] amdgpu 0000:0a:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
[ 191.532876] amdgpu 0000:0a:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
[ 191.532878] amdgpu 0000:0a:00.0: amdgpu: ring kiq_0.2.1.0 uses VM inv eng 12 on hub 0
[ 191.532879] amdgpu 0000:0a:00.0: amdgpu: ring sdma0 uses VM inv eng 13 on hub 0
[ 191.532880] amdgpu 0000:0a:00.0: amdgpu: ring sdma1 uses VM inv eng 14 on hub 0
[ 191.532881] amdgpu 0000:0a:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 0 on hub 8
[ 191.532883] amdgpu 0000:0a:00.0: amdgpu: ring vcn_enc_0.0 uses VM inv eng 1 on hub 8
[ 191.532884] amdgpu 0000:0a:00.0: amdgpu: ring vcn_enc_0.1 uses VM inv eng 4 on hub 8
[ 191.532885] amdgpu 0000:0a:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on hub 8
[ 191.536522] amdgpu 0000:0a:00.0: amdgpu: recover vram bo from shadow start
[ 191.555443] amdgpu 0000:0a:00.0: amdgpu: recover vram bo from shadow done
[ 191.555471] amdgpu 0000:0a:00.0: amdgpu: GPU reset(2) succeeded!
[ 191.555663] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Yes, I can reliably hang my gfx ring if I run Stray with the -dx12 switch
applied. It only happens in-game, though, not on the title screen.
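
For anyone comparing reset behaviour across kernels, here is a rough sketch of how the timing could be pulled out of a dmesg excerpt like the one above. It is only an illustration: the function names (parse_dmesg, first_time, reset_summary) are made up for this example, and the marker strings are simply the ones that appear in this particular log, so they may need adjusting for other driver versions.

```python
import re

# Matches a dmesg line of the form "[ 189.830630] message".
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")

# A few lines copied from the excerpt above, enough to exercise the parser.
SAMPLE = """\
[ 189.830630] amdgpu 0000:0a:00.0: amdgpu: ring gfx_0.0.0 timeout, signaled seq=52410, emitted seq=52412
[ 190.099191] amdgpu 0000:0a:00.0: amdgpu: GPU reset begin!
[ 190.459420] amdgpu 0000:0a:00.0: amdgpu: MODE1 reset
[ 190.967464] amdgpu 0000:0a:00.0: amdgpu: GPU reset succeeded, trying to resume
[ 191.555471] amdgpu 0000:0a:00.0: amdgpu: GPU reset(2) succeeded!
"""


def parse_dmesg(text):
    """Return a list of (timestamp_seconds, message) tuples."""
    events = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            events.append((float(match.group(1)), match.group(2)))
    return events


def first_time(events, needle):
    """Timestamp of the first message containing needle, or None."""
    for timestamp, message in events:
        if needle in message:
            return timestamp
    return None


def reset_summary(text):
    """Summarise one reset event; marker strings are taken from this log only."""
    events = parse_dmesg(text)
    timeout = first_time(events, "timeout, signaled seq=")
    begin = first_time(events, "GPU reset begin!")
    done = first_time(events, "GPU reset(")  # e.g. "GPU reset(2) succeeded!"
    if timeout is None or begin is None or done is None:
        raise ValueError("log does not contain a complete reset event")
    return {
        "timeout_to_begin": begin - timeout,
        "begin_to_done": done - begin,
        "total": done - timeout,
        # True when the log shows a full adapter (MODE1) reset.
        "mode1": any("MODE1 reset" in message for _, message in events),
    }


if __name__ == "__main__":
    for key, value in reset_summary(SAMPLE).items():
        print(key, value)
```

Feeding it the full excerpt instead of SAMPLE should work the same way, since lines that do not match the timestamp pattern are skipped.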
> Alex
>
>>
>> Alex Deucher (19):
>> drm/amdgpu/mes: add API for legacy queue reset
>> drm/amdgpu/mes11: add API for legacy queue reset
>> drm/amdgpu/mes12: add API for legacy queue reset
>> drm/amdgpu/mes: add API for user queue reset
>> drm/amdgpu/mes11: add API for user queue reset
>> drm/amdgpu/mes12: add API for user queue reset
>> drm/amdgpu: add new ring reset callback
>> drm/amdgpu: add per ring reset support (v2)
>> drm/amdgpu/gfx11: add ring reset callbacks
>> drm/amdgpu/gfx11: rename gfx_v11_0_gfx_init_queue()
>> drm/amdgpu/gfx10: add ring reset callbacks
>> drm/amdgpu/gfx10: rework reset sequence
>> drm/amdgpu/gfx9: add ring reset callback
>> drm/amdgpu/gfx9.4.3: add ring reset callback
>> drm/amdgpu/gfx12: add ring reset callbacks
>> drm/amdgpu/gfx12: fallback to driver reset compute queue directly
>> drm/amdgpu/gfx11: enter safe mode before touching CP_INT_CNTL
>> drm/amdgpu/gfx11: add a mutex for the gfx semaphore
>> drm/amdgpu/gfx11: export gfx_v11_0_request_gfx_index_mutex()
>>
>> Jiadong Zhu (13):
>> drm/amdgpu/gfx11: wait for reset done before remap
>> drm/amdgpu/gfx10: remap queue after reset successfully
>> drm/amdgpu/gfx10: wait for reset done before remap
>> drm/amdgpu/gfx9: remap queue after reset successfully
>> drm/amdgpu/gfx9: wait for reset done before remap
>> drm/amdgpu/gfx9.4.3: remap queue after reset successfully
>> drm/amdgpu/gfx_9.4.3: wait for reset done before remap
>> drm/amdgpu/gfx: add a new kiq_pm4_funcs callback for reset_hw_queue
>> drm/amdgpu/gfx9: implement reset_hw_queue for gfx9
>> drm/amdgpu/gfx9.4.3: implement reset_hw_queue for gfx9.4.3
>> drm/amdgpu/mes: modify mes api for mmio queue reset
>> drm/amdgpu/mes: implement amdgpu_mes_reset_hw_queue_mmio
>> drm/amdgpu/mes11: implement mmio queue reset for gfx11
>>
>> Prike Liang (2):
>> drm/amdgpu: increase the reset counter for the queue reset
>> drm/amdgpu/gfx11: fallback to driver reset compute queue directly (v2)
>>
>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 1 +
>> drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h | 6 +
>> drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 18 +++
>> drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c | 88 ++++++++++++
>> drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h | 37 +++++
>> drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h | 2 +
>> drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 158 ++++++++++++++++++++-
>> drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c | 117 +++++++++++++--
>> drivers/gpu/drm/amd/amdgpu/gfx_v11_0.h | 3 +
>> drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c | 95 ++++++++++++-
>> drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 126 +++++++++++++++-
>> drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c | 125 +++++++++++++++-
>> drivers/gpu/drm/amd/amdgpu/mes_v11_0.c | 132 +++++++++++++++++
>> drivers/gpu/drm/amd/amdgpu/mes_v12_0.c | 54 +++++++
>> 14 files changed, 930 insertions(+), 32 deletions(-)
>>
>> --
>> 2.45.2
>>