[PATCH 00/34] GC per queue reset

Friedrich Vock friedrich.vock at gmx.de
Thu Jul 25 07:44:27 UTC 2024


On 24.07.24 11:20, Zhu, Jiadong wrote:
>
>> -----Original Message-----
>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of Alex Deucher
>> Sent: Friday, July 19, 2024 9:40 PM
>> To: Friedrich Vock <friedrich.vock at gmx.de>
>> Cc: Deucher, Alexander <Alexander.Deucher at amd.com>; amd-gfx at lists.freedesktop.org
>> Subject: Re: [PATCH 00/34] GC per queue reset
>>
>> On Thu, Jul 18, 2024 at 1:00 PM Friedrich Vock <friedrich.vock at gmx.de>
>> wrote:
>>>
>>> Hi,
>>>
>>> On 18.07.24 16:06, Alex Deucher wrote:
>>>> This adds preliminary support for GC per queue reset.  In this case,
>>>> only the jobs currently in the queue are lost.  If this fails, we
>>>> fall back to a full adapter reset.
>>>
>>> First of all, thank you so much for working on this! It's great to
>>> finally see progress in making GPU resets better.
>>>
>>> I've just taken this patchset (together with your other
>>> patchsets[1][2][3]) for a quick spin on my Navi21 with the GPU reset
>>> tests[4] I had written a while ago - the current patchset sadly seems
>>> to have some regressions WRT recovery there.
>>>
>>> I ran the tests under my Plasma Wayland session once - this triggered
>>> a list double-add in drm_sched_stop (calltrace follows):
>>
>> I think this should fix the double add:
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> index 7107c4d3a3b6..555d3b671bdb 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> @@ -88,6 +88,8 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
>>                                  drm_sched_start(&ring->sched, true);
>>                          goto exit;
>>                  }
>> +               if (amdgpu_ring_sched_ready(ring))
>> +                       drm_sched_start(&ring->sched, true);
>>          }
>>
>>          if (amdgpu_device_should_recover_gpu(ring->adev)) {
>>
>>
>>>
>>> ? die (arch/x86/kernel/dumpstack.c:421 arch/x86/kernel/dumpstack.c:434 arch/x86/kernel/dumpstack.c:447)
>>> ? do_trap (arch/x86/kernel/traps.c:113 arch/x86/kernel/traps.c:154)
>>> ? __list_add_valid_or_report (lib/list_debug.c:35 (discriminator 1))
>>> ? do_error_trap (./arch/x86/include/asm/traps.h:58 arch/x86/kernel/traps.c:175)
>>> ? __list_add_valid_or_report (lib/list_debug.c:35 (discriminator 1))
>>> ? exc_invalid_op (arch/x86/kernel/traps.c:266)
>>> ? __list_add_valid_or_report (lib/list_debug.c:35 (discriminator 1))
>>> ? asm_exc_invalid_op (./arch/x86/include/asm/idtentry.h:568)
>>> ? __list_add_valid_or_report (lib/list_debug.c:35 (discriminator 1))
>>> ? __list_add_valid_or_report (lib/list_debug.c:35 (discriminator 1))
>>> drm_sched_stop (./include/linux/list.h:151 ./include/linux/list.h:169 drivers/gpu/drm/scheduler/sched_main.c:617)
>>> amdgpu_device_gpu_recover (drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:5808)
>>> amdgpu_job_timedout (drivers/gpu/drm/amd/amdgpu/amdgpu_job.c:103)
>>> drm_sched_job_timedout (drivers/gpu/drm/scheduler/sched_main.c:569)
>>> process_one_work (kernel/workqueue.c:2633)
>>> worker_thread (kernel/workqueue.c:2700 (discriminator 2) kernel/workqueue.c:2787 (discriminator 2))
>>> ? __pfx_worker_thread (kernel/workqueue.c:2733)
>>> kthread (kernel/kthread.c:388)
>>> ? __pfx_kthread (kernel/kthread.c:341)
>>> ret_from_fork (arch/x86/kernel/process.c:147)
>>> ? __pfx_kthread (kernel/kthread.c:341)
>>> ret_from_fork_asm (arch/x86/entry/entry_64.S:251)
>>>
>>> When running the tests without a desktop environment active, the
>>> double-add disappeared, but the GPU reset still didn't go well - the
>>> TTY remained frozen and the kernel log contained a few messages like:
>>>
>>> [drm] *ERROR* [CRTC:90:crtc-0] flip_done timed out
>
> Hi Friedrich, we cannot reproduce the flip_done timeout on the dGPU.
> Could you check whether the hang test runs on the integrated GPU or the
> dGPU? If it runs on the iGPU, could you try disabling the iGPU in the BIOS
> to see if it works? Thanks.

Hi,

I double-checked with the iGPU disabled in BIOS and can still reproduce.
In case it matters, note that I had a typo in my original message: I'm
testing on Navi22, not 21 - sorry about that.

Also, it seems like the issue also occurs on normal
amd-staging-drm-next without the per-queue reset patches, so this is
actually an earlier, unrelated regression.

I'll try bisecting later and will open a separate GitLab issue for this.

Regards,
Friedrich

>
> Thanks,
> Jiadong
>
>> I don't think the display hardware is hung, I think it's a fence signalling issue
>> after the reset.  We are investigating some limitations we are seeing in the
>> handling of fences.
>>
>>>
>>> which I guess means at least the display subsystem is hung.
>>>
>>> Hope this info is enough to repro/investigate.
>>
>> Thanks for testing!
>>
>> Alex
>>
>>>
>>> Thanks,
>>> Friedrich
>>>
>>> [1] https://lore.kernel.org/amd-gfx/20240717203740.14059-1-alexander.deucher@amd.com/T/#t
>>> [2] https://lore.kernel.org/amd-gfx/20240717203847.14600-1-alexander.deucher@amd.com/T/#t
>>> [3] https://lore.kernel.org/amd-gfx/230ee72e-4f7f-4894-a789-2e1e5788344f@amd.com/T/#t
>>> [4] https://gitlab.steamos.cloud/holo/HangTestSuite
>>>
>>>
>>>>
>>>> Alex Deucher (19):
>>>>     drm/amdgpu/mes: add API for legacy queue reset
>>>>     drm/amdgpu/mes11: add API for legacy queue reset
>>>>     drm/amdgpu/mes12: add API for legacy queue reset
>>>>     drm/amdgpu/mes: add API for user queue reset
>>>>     drm/amdgpu/mes11: add API for user queue reset
>>>>     drm/amdgpu/mes12: add API for user queue reset
>>>>     drm/amdgpu: add new ring reset callback
>>>>     drm/amdgpu: add per ring reset support (v2)
>>>>     drm/amdgpu/gfx11: add ring reset callbacks
>>>>     drm/amdgpu/gfx11: rename gfx_v11_0_gfx_init_queue()
>>>>     drm/amdgpu/gfx10: add ring reset callbacks
>>>>     drm/amdgpu/gfx10: rework reset sequence
>>>>     drm/amdgpu/gfx9: add ring reset callback
>>>>     drm/amdgpu/gfx9.4.3: add ring reset callback
>>>>     drm/amdgpu/gfx12: add ring reset callbacks
>>>>     drm/amdgpu/gfx12: fallback to driver reset compute queue directly
>>>>     drm/amdgpu/gfx11: enter safe mode before touching CP_INT_CNTL
>>>>     drm/amdgpu/gfx11: add a mutex for the gfx semaphore
>>>>     drm/amdgpu/gfx11: export gfx_v11_0_request_gfx_index_mutex()
>>>>
>>>> Jiadong Zhu (13):
>>>>     drm/amdgpu/gfx11: wait for reset done before remap
>>>>     drm/amdgpu/gfx10: remap queue after reset successfully
>>>>     drm/amdgpu/gfx10: wait for reset done before remap
>>>>     drm/amdgpu/gfx9: remap queue after reset successfully
>>>>     drm/amdgpu/gfx9: wait for reset done before remap
>>>>     drm/amdgpu/gfx9.4.3: remap queue after reset successfully
>>>>     drm/amdgpu/gfx_9.4.3: wait for reset done before remap
>>>>     drm/amdgpu/gfx: add a new kiq_pm4_funcs callback for reset_hw_queue
>>>>     drm/amdgpu/gfx9: implement reset_hw_queue for gfx9
>>>>     drm/amdgpu/gfx9.4.3: implement reset_hw_queue for gfx9.4.3
>>>>     drm/amdgpu/mes: modify mes api for mmio queue reset
>>>>     drm/amdgpu/mes: implement amdgpu_mes_reset_hw_queue_mmio
>>>>     drm/amdgpu/mes11: implement mmio queue reset for gfx11
>>>>
>>>> Prike Liang (2):
>>>>     drm/amdgpu: increase the reset counter for the queue reset
>>>>     drm/amdgpu/gfx11: fallback to driver reset compute queue directly (v2)
>>>>
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |   1 +
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h    |   6 +
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  18 +++
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c    |  88 ++++++++++++
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h    |  37 +++++
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |   2 +
>>>>    drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c     | 158 ++++++++++++++++++++-
>>>>    drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c     | 117 +++++++++++++--
>>>>    drivers/gpu/drm/amd/amdgpu/gfx_v11_0.h     |   3 +
>>>>    drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c     |  95 ++++++++++++-
>>>>    drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c      | 126 +++++++++++++++-
>>>>    drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c    | 125 +++++++++++++++-
>>>>    drivers/gpu/drm/amd/amdgpu/mes_v11_0.c     | 132 +++++++++++++++++
>>>>    drivers/gpu/drm/amd/amdgpu/mes_v12_0.c     |  54 +++++++
>>>>    14 files changed, 930 insertions(+), 32 deletions(-)
>>>>
>>>
>>>
>>>


