[PATCH] drm/amdgpu: Fix gfx10 kiq ring_lock warning on full reset
Alex Deucher
alexdeucher at gmail.com
Mon Jul 22 19:36:31 UTC 2024
On Mon, Jul 22, 2024 at 4:16 AM Jesse Zhang <jesse.zhang at amd.com> wrote:
>
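> Fix a soft lockup on the kiq ring_lock during a full GPU reset.
> Release the kiq ring_lock right after committing the kiq ring, so the
> lock is not left held when the queue reset fails and returns early.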
>
> [ 285.999224] amdgpu 0000:03:00.0: amdgpu: GPU reset begin!
> [ 312.018425] watchdog: BUG: soft lockup - CPU#11 stuck for 26s! [kworker/u64:2:878]
> [ 312.018428] Modules linked in: amdgpu(E) amdxcp drm_exec gpu_sched drm_buddy drm_suballoc_helper drm_ttm_helper ttm drm_display_helper cec rc_core drm_kms_helper i2c_algo_bit rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace netfs xt_conntrack nft_chain_nat r8153_ecm cdc_ether usbnet xt_MASQUERADE nf_nat nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo xt_multiport xt_addrtype nft_compat nf_tables br_netfilter libcrc32c nfnetlink bridge stp llc r8152 mii joydev input_leds overlay snd_hda_codec_hdmi edac_mce_amd snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi kvm_amd snd_hda_codec snd_hda_core snd_hwdep kvm hid_generic snd_pcm crct10dif_pclmul ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 aesni_intel snd_seq_midi snd_seq_midi_event snd_rawmidi crypto_simd usbhid cryptd hid snd_seq snd_pci_acp5x snd_seq_device snd_timer snd_rn_pci_acp3x rapl snd_acp_config snd_soc_acpi snd ccp snd_pci_acp3x wmi_bmof soundcore k10temp mac_hid sunrpc binfmt_misc sch_fq_codel msr parport_pc
> [ 312.018466] ppdev lp drm parport efi_pstore ip_tables x_tables autofs4 ucsi_ccg typec_ucsi typec nvme crc32_pclmul nvme_core xhci_pci i2c_designware_pci i2c_piix4 xhci_pci_renesas i2c_ccgx_ucsi video wmi
> [ 312.018475] CPU: 11 PID: 878 Comm: kworker/u64:2 Tainted: G E 6.8.0+ #171
> [ 312.018477] Hardware name: AMD Splinter/Splinter-GNR, BIOS WS54117N_140 01/16/2024
> [ 312.018478] Workqueue: amdgpu-reset-dev drm_sched_job_timedout [gpu_sched]
> [ 312.018485] RIP: 0010:native_queued_spin_lock_slowpath+0x88/0x300
> [ 312.018490] Code: 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 41 8b 04 24 30 e4 09 d0 a9 00 01 ff ff 75 5e 85 c0 74 14 41 0f b6 04 24 84 c0 74 0b f3 90 <41> 0f b6 04 24 84 c0 75 f5 b8 01 00 00 00 66 41 89 04 24 5b 41 5c
> [ 312.018492] RSP: 0018:ffffa327c0de7b80 EFLAGS: 00000202
> [ 312.018493] RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000000
> [ 312.018494] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8ab913e16cf8
> [ 312.018495] RBP: ffffa327c0de7ba8 R08: 0000000000000000 R09: fffffa4007040000
> [ 312.018495] R10: ffffa327c0de7bb8 R11: 0000000000000040 R12: ffff8ab913e16cf8
> [ 312.018496] R13: ffff8ab913e00000 R14: ffff8ab913e00000 R15: ffff8ab913e00000
> [ 312.018497] FS: 0000000000000000(0000) GS:ffff8ab9956c0000(0000) knlGS:0000000000000000
> [ 312.018498] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 312.018498] CR2: 00007f44b24d319c CR3: 000000023b83c000 CR4: 0000000000750ef0
> [ 312.018499] PKRU: 55555554
> [ 312.018500] Call Trace:
> [ 312.018501] <IRQ>
> [ 312.018504] ? show_regs+0x6c/0x80
> [ 312.018508] ? watchdog_timer_fn+0x206/0x290
> [ 312.018511] ? __pfx_watchdog_timer_fn+0x10/0x10
> [ 312.018513] ? __hrtimer_run_queues+0xc8/0x220
> [ 312.018517] ? hrtimer_interrupt+0x10d/0x250
> [ 312.018519] ? __sysvec_apic_timer_interrupt+0x51/0x130
> [ 312.018522] ? sysvec_apic_timer_interrupt+0x7f/0x90
> [ 312.018525] </IRQ>
> [ 312.018525] <TASK>
> [ 312.018526] ? asm_sysvec_apic_timer_interrupt+0x1f/0x30
> [ 312.018529] ? native_queued_spin_lock_slowpath+0x88/0x300
> [ 312.018530] _raw_spin_lock+0x2d/0x40
> [ 312.018532] amdgpu_gfx_disable_kgq+0x6f/0x1d0 [amdgpu]
> [ 312.018646] gfx_v10_0_hw_fini+0x111/0x130 [amdgpu]
> [ 312.018742] gfx_v10_0_suspend+0x12/0x20 [amdgpu]
> [ 312.018832] amdgpu_device_ip_suspend_phase2+0x244/0x470 [amdgpu]
> [ 312.018909] amdgpu_device_ip_suspend+0x4b/0x90 [amdgpu]
> [ 312.018989] amdgpu_device_pre_asic_reset+0xda/0x4b0 [amdgpu]
> [ 312.019068] amdgpu_device_gpu_recover+0x319/0xe20 [amdgpu]
> [ 312.019147] amdgpu_job_timedout+0x177/0x280 [amdgpu]
> [ 312.019266] drm_sched_job_timedout+0x7c/0x100 [gpu_sched]
> [ 312.019269] process_scheduled_works+0x9a/0x3a0
> [ 312.019272] ? __pfx_worker_thread+0x10/0x10
> [ 312.019273] worker_thread+0x15f/0x2d0
> [ 312.019275] ? __pfx_worker_thread+0x10/0x10
> [ 312.019276] kthread+0xfb/0x130
> [ 312.019277] ? __pfx_kthread+0x10/0x10
> [ 312.019278] ret_from_fork+0x3d/0x60
> [ 312.019280] ? __pfx_kthread+0x10/0x10
> [ 312.019281] ret_from_fork_asm+0x1b/0x30
> [ 312.019284] </TASK>
>
> Signed-off-by: Vitaly Prosyak <vitaly.prosyak at amd.com>
> Signed-off-by: Jesse Zhang <Jesse.Zhang at amd.com>
Good catch. I've squashed this into the appropriate patch and pushed
an updated branch here:
https://gitlab.freedesktop.org/agd5f/linux/-/commits/amd-staging-drm-next-queue-reset
Alex
> ---
> drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> index fde11159270c..59024fbf8c22 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> @@ -9478,6 +9478,7 @@ static int gfx_v10_0_reset_compute_ring(struct amdgpu_ring *ring,
> 0, 0);
> amdgpu_ring_commit(kiq_ring);
>
> + spin_unlock_irqrestore(&kiq->ring_lock, flags);
> r = amdgpu_ring_test_ring(kiq_ring);
> if (r)
> return r;
> @@ -9530,8 +9531,6 @@ static int gfx_v10_0_reset_compute_ring(struct amdgpu_ring *ring,
> if (r)
> return r;
>
> - spin_unlock_irqrestore(&kiq->ring_lock, flags);
> -
> return amdgpu_ring_test_ring(ring);
> }
>
> --
> 2.25.1
>
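The locking pattern behind the diff above can be shown with a small user-space sketch in C. This is only a model under stated assumptions, not the driver code: `struct fake_lock`, `reset_queue_leaky` and `reset_queue_fixed` are hypothetical names, the lock merely records whether it is held, and the `test_result` parameter stands in for the result of the ring test. In the real driver the next taker of the still-held lock spins, which appears to be what the soft lockup trace shows in `amdgpu_gfx_disable_kgq`.

```c
/* User-space model of the unlock-before-early-return pattern.
 * Hypothetical names throughout; this is not amdgpu code. */
#include <assert.h>
#include <stdbool.h>

/* Stand-in for a spinlock: it only records whether it is held. */
struct fake_lock {
	bool held;
};

static void fake_lock_acquire(struct fake_lock *l)
{
	l->held = true;
}

static void fake_lock_release(struct fake_lock *l)
{
	l->held = false;
}

/* Before the fix: the only unlock sits at the end of the function,
 * so a failing test returns with the lock still held. */
static int reset_queue_leaky(struct fake_lock *l, int test_result)
{
	fake_lock_acquire(l);
	/* ... build and commit the ring commands here ... */
	if (test_result)
		return test_result;	/* early return leaks the lock */
	fake_lock_release(l);
	return 0;
}

/* After the fix: the lock is dropped as soon as the commands are
 * committed, before any check that can return early. */
static int reset_queue_fixed(struct fake_lock *l, int test_result)
{
	fake_lock_acquire(l);
	/* ... build and commit the ring commands here ... */
	fake_lock_release(l);
	if (test_result)
		return test_result;	/* lock already released */
	return 0;
}
```

Calling `reset_queue_leaky()` with a non-zero `test_result` leaves `held` set, while `reset_queue_fixed()` returns the same error with the lock released. An alternative with the same effect is a single exit label that always unlocks; the patch instead narrows the critical section to the ring submission itself.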