[PATCH 1/2] drm/amdgpu: enable GPU reset by default on Navi
Pierre-Eric Pelloux-Prayer
pierre-eric.pelloux-prayer at amd.com
Thu Jan 30 11:15:17 UTC 2020
Hi Alex,
I hit one issue while testing this patch on an RX5700.
After a gfx hang, a reset is executed.
Switching to a VT and restarting gdm works fine, but the clocks seem to be wrong afterwards:
- lots of graphical artifacts (underflows?)
- pp_dpm_sclk and pp_dpm_socclk have strange values (see attached files)
dmesg output (from the gfx hang):
[ 169.755071] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
[ 169.755173] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
[ 174.874847] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=10034, emitted seq=10036
[ 174.874925] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process gnome-shell pid 1724 thread gnome-shel:cs0 pid 1741
[ 174.874933] amdgpu 0000:0b:00.0: GPU reset begin!
[ 174.875192] ------------[ cut here ]------------
[ 174.875282] WARNING: CPU: 0 PID: 7 at drivers/gpu/drm/amd/amdgpu/../display/dc/dcn20/dcn20_resource.c:2969 dcn20_validate_bandwidth+0x87/0xe0 [amdgpu]
[ 174.875283] Modules linked in: binfmt_misc(E) nls_ascii(E) nls_cp437(E) vfat(E) fat(E) snd_hda_codec_realtek(E) snd_hda_codec_generic(E) ledtrig_audio(E) snd_hda_codec_hdmi(E) snd_hda_intel(E) snd_intel_nhlt(E) snd_hda_codec(E) snd_hwdep(E) efi_pstore(E) snd_hda_core(E) snd_pcm(E) snd_timer(E) snd(E) ccp(E) xpad(E) wmi_bmof(E) mxm_wmi(E) evdev(E) joydev(E) ff_memless(E) efivars(E) soundcore(E) pcspkr(E) k10temp(E) sp5100_tco(E) rng_core(E) sg(E) wmi(E) button(E) acpi_cpufreq(E) uinput(E) parport_pc(E) ppdev(E) lp(E) parport(E) efivarfs(E) ip_tables(E) autofs4(E) ext4(E) crc32c_generic(E) crc16(E) mbcache(E) jbd2(E) dm_crypt(E) dm_mod(E) hid_generic(E) usbhid(E) hid(E) sd_mod(E) crct10dif_pclmul(E) crc32_pclmul(E) crc32c_intel(E) ghash_clmulni_intel(E) amdgpu(E) gpu_sched(E) ttm(E) drm_kms_helper(E) aesni_intel(E) glue_helper(E) xhci_pci(E) ahci(E) crypto_simd(E) libahci(E) cryptd(E) drm(E) xhci_hcd(E) i2c_piix4(E) libata(E) igb(E) usbcore(E) scsi_mod(E) dca(E) i2c_algo_bit(E) gpio_amdpt(E)
[ 174.875310] gpio_generic(E)
[ 174.875313] CPU: 0 PID: 7 Comm: kworker/0:1 Tainted: G E 5.4.0-rc7-02679-g9d664d914f0e #10
[ 174.875314] Hardware name: Gigabyte Technology Co., Ltd. X470 AORUS ULTRA GAMING/X470 AORUS ULTRA GAMING-CF, BIOS F6 01/25/2019
[ 174.875318] Workqueue: events drm_sched_job_timedout [gpu_sched]
[ 174.875404] RIP: 0010:dcn20_validate_bandwidth+0x87/0xe0 [amdgpu]
[ 174.875406] Code: 2d 44 22 a5 e8 1d 00 00 75 26 f2 0f 11 85 a8 21 00 00 31 d2 48 89 ee 4c 89 ef e8 d4 f5 ff ff 41 89 c4 22 85 e8 1d 00 00 75 4a <0f> 0b eb 02 75 d1 f2 0f 10 14 24 f2 0f 11 95 a8 21 00 00 e8 f1 4b
[ 174.875407] RSP: 0018:ffffa87880067a90 EFLAGS: 00010246
[ 174.875408] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000001e61
[ 174.875409] RDX: 0000000000001e60 RSI: ffff981f3ec2d880 RDI: 000000000002d880
[ 174.875409] RBP: ffff981f25a60000 R08: 0000000000000006 R09: 0000000000000000
[ 174.875410] R10: 0000000100000000 R11: 0000000100000001 R12: 0000000000000001
[ 174.875411] R13: ffff981f1af40000 R14: 0000000000000000 R15: ffff981f25a60000
[ 174.875412] FS: 0000000000000000(0000) GS:ffff981f3ec00000(0000) knlGS:0000000000000000
[ 174.875413] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 174.875414] CR2: 00007f42858f3000 CR3: 00000007f02ae000 CR4: 00000000003406f0
[ 174.875414] Call Trace:
[ 174.875498] dc_validate_global_state+0x25f/0x2d0 [amdgpu]
[ 174.875583] amdgpu_dm_atomic_check+0x8ff/0xf20 [amdgpu]
[ 174.875587] ? __ww_mutex_lock.isra.0+0x3a/0x760
[ 174.875590] ? _cond_resched+0x15/0x30
[ 174.875591] ? __ww_mutex_lock.isra.0+0x3a/0x760
[ 174.875606] drm_atomic_check_only+0x554/0x7e0 [drm]
[ 174.875620] ? drm_connector_list_iter_next+0x7d/0x90 [drm]
[ 174.875632] drm_atomic_commit+0x13/0x50 [drm]
[ 174.875640] drm_atomic_helper_disable_all+0x14c/0x160 [drm_kms_helper]
[ 174.875647] drm_atomic_helper_suspend+0x60/0xf0 [drm_kms_helper]
[ 174.875730] dm_suspend+0x1c/0x60 [amdgpu]
[ 174.875782] amdgpu_device_ip_suspend_phase1+0x81/0xe0 [amdgpu]
[ 174.875836] amdgpu_device_ip_suspend+0x1c/0x60 [amdgpu]
[ 174.875923] amdgpu_device_pre_asic_reset+0x191/0x1a4 [amdgpu]
[ 174.876010] amdgpu_device_gpu_recover.cold+0x43a/0xbca [amdgpu]
[ 174.876084] amdgpu_job_timedout+0x103/0x130 [amdgpu]
[ 174.876088] drm_sched_job_timedout+0x7f/0xe0 [gpu_sched]
[ 174.876092] process_one_work+0x1b5/0x360
[ 174.876094] worker_thread+0x50/0x3c0
[ 174.876096] kthread+0xf9/0x130
[ 174.876097] ? process_one_work+0x360/0x360
[ 174.876099] ? kthread_park+0x90/0x90
[ 174.876100] ret_from_fork+0x22/0x40
[ 174.876103] ---[ end trace af4365804bf318ce ]---
[ 175.346937] [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KGQ disable failed
[ 175.600179] [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
[ 175.853418] [drm:gfx_v10_0_cp_gfx_enable [amdgpu]] *ERROR* failed to halt cp gfx
[ 175.874639] [drm] psp command (0x2) failed and response status is (0x117)
[ 178.963249] amdgpu 0000:0b:00.0: GPU reset succeeded, trying to resume
[ 178.963364] [drm] PCIE GART of 512M enabled (table at 0x0000008000300000).
[ 178.964227] [drm] PSP is resuming...
[ 179.130413] [drm:mod_hdcp_add_display_topology [amdgpu]] *ERROR* Failed to add display topology, DTM TA is not initialized.
[ 179.130492] [drm:mod_hdcp_add_display_topology [amdgpu]] *ERROR* Failed to add display topology, DTM TA is not initialized.
[ 179.138589] [drm] reserve 0xa00000 from 0x81fe400000 for PSP TMR
[ 179.330602] amdgpu 0000:0b:00.0: RAS: ras ta ucode is not available
[ 179.354974] amdgpu: [powerplay] SMU is resuming...
[ 179.357614] amdgpu: [powerplay] dpm has been disabled
[ 179.357617] amdgpu: [powerplay] SMU is resumed successfully!
[ 179.649168] [drm] kiq ring mec 2 pipe 1 q 0
[ 179.663833] [drm] VCN decode and encode initialized successfully(under DPG Mode).
[ 179.663872] [drm] JPEG decode initialized successfully.
[ 179.663877] amdgpu 0000:0b:00.0: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[ 179.663878] amdgpu 0000:0b:00.0: ring gfx_0.1.0 uses VM inv eng 1 on hub 0
[ 179.663880] amdgpu 0000:0b:00.0: ring comp_1.0.0 uses VM inv eng 4 on hub 0
[ 179.663881] amdgpu 0000:0b:00.0: ring comp_1.1.0 uses VM inv eng 5 on hub 0
[ 179.663882] amdgpu 0000:0b:00.0: ring comp_1.2.0 uses VM inv eng 6 on hub 0
[ 179.663883] amdgpu 0000:0b:00.0: ring comp_1.3.0 uses VM inv eng 7 on hub 0
[ 179.663884] amdgpu 0000:0b:00.0: ring comp_1.0.1 uses VM inv eng 8 on hub 0
[ 179.663886] amdgpu 0000:0b:00.0: ring comp_1.1.1 uses VM inv eng 9 on hub 0
[ 179.663887] amdgpu 0000:0b:00.0: ring comp_1.2.1 uses VM inv eng 10 on hub 0
[ 179.663888] amdgpu 0000:0b:00.0: ring comp_1.3.1 uses VM inv eng 11 on hub 0
[ 179.663889] amdgpu 0000:0b:00.0: ring kiq_2.1.0 uses VM inv eng 12 on hub 0
[ 179.663890] amdgpu 0000:0b:00.0: ring sdma0 uses VM inv eng 13 on hub 0
[ 179.663891] amdgpu 0000:0b:00.0: ring sdma1 uses VM inv eng 14 on hub 0
[ 179.663892] amdgpu 0000:0b:00.0: ring vcn_dec uses VM inv eng 0 on hub 1
[ 179.663893] amdgpu 0000:0b:00.0: ring vcn_enc0 uses VM inv eng 1 on hub 1
[ 179.663894] amdgpu 0000:0b:00.0: ring vcn_enc1 uses VM inv eng 4 on hub 1
[ 179.663895] amdgpu 0000:0b:00.0: ring jpeg_dec uses VM inv eng 5 on hub 1
[ 179.675804] [drm] recover vram bo from shadow start
[ 179.677342] [drm] recover vram bo from shadow done
[ 179.677344] [drm] Skip scheduling IBs!
[ 179.677345] [drm] Skip scheduling IBs!
[ 179.677357] amdgpu 0000:0b:00.0: GPU reset(1) succeeded!
[ 179.677367] [drm] Skip scheduling IBs!
[ 179.677372] [drm] Skip scheduling IBs!
[ 179.677375] [drm] Skip scheduling IBs!
[ 179.677378] [drm] Skip scheduling IBs!
[ 179.677380] [drm] Skip scheduling IBs!
[ 179.677382] [drm] Skip scheduling IBs!
[ 179.677383] [drm] Skip scheduling IBs!
Pierre-Eric
On 27/01/2020 20:37, Alex Deucher wrote:
> Has been working fine for a while.
>
> Signed-off-by: Alex Deucher <alexander.deucher at amd.com>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 1da03658891c..69248d1b2417 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3762,6 +3762,9 @@ bool amdgpu_device_should_recover_gpu(struct amdgpu_device *adev)
> case CHIP_VEGA12:
> case CHIP_RAVEN:
> case CHIP_ARCTURUS:
> + case CHIP_NAVI10:
> + case CHIP_NAVI14:
> + case CHIP_NAVI12:
> break;
> default:
> goto disabled;
>
-------------- next part --------------
BEFORE reset
------------
# head pp_dpm_*
==> pp_dpm_dcefclk <==
0: 506Mhz *
1: 886Mhz
2: 1266Mhz
==> pp_dpm_fclk <==
0: 506Mhz
1: 950Mhz *
2: 1266Mhz
==> pp_dpm_mclk <==
0: 100Mhz
1: 500Mhz
2: 625Mhz
3: 875Mhz *
==> pp_dpm_pcie <==
0: 2.5GT/s, x16 619Mhz
1: 8.0GT/s, x16 619Mhz *
==> pp_dpm_sclk <==
0: 300Mhz
1: 800Mhz *
2: 1750Mhz
==> pp_dpm_socclk <==
0: 506Mhz
1: 950Mhz *
2: 1266Mhz
AFTER reset
-----------
# head pp_dpm_*
==> pp_dpm_dcefclk <==
0: 506Mhz *
1: 886Mhz
2: 1266Mhz
==> pp_dpm_fclk <==
0: 506Mhz *
1: 886Mhz
2: 1266Mhz
==> pp_dpm_mclk <==
==> pp_dpm_pcie <==
0: 2.5GT/s, x16 619Mhz
1: 8.0GT/s, x16 619Mhz *
==> pp_dpm_sclk <==
0: 0Mhz
1: 800Mhz *
2: 0Mhz
==> pp_dpm_socclk <==
0: 0Mhz
1: 506Mhz *
2: 0Mhz
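
The before/after comparison above can be automated. What follows is a minimal Python sketch, not part of the original mail or of any amdgpu tooling: it parses `head pp_dpm_*` output and reports files that are empty or that list a 0Mhz level. The function names (`parse_pp_dpm`, `clock_levels`, `suspicious`) and the `AFTER` constant are hypothetical, and the line format is assumed to match the listings above.

```python
import re

# "==> pp_dpm_sclk <==" header lines printed by `head` for each file.
HEADER_RE = re.compile(r"^==> (\S+) <==$")
# Plain clock level lines such as "1: 800Mhz *" (assumed format, taken from
# the listings above). pp_dpm_pcie lines do not match and are left unparsed.
LEVEL_RE = re.compile(r"^(\d+):\s+(\d+)Mhz\s*(\*?)\s*$")

# Sample input: the "AFTER reset" listing from this mail.
AFTER = """\
# head pp_dpm_*
==> pp_dpm_dcefclk <==
0: 506Mhz *
1: 886Mhz
2: 1266Mhz
==> pp_dpm_fclk <==
0: 506Mhz *
1: 886Mhz
2: 1266Mhz
==> pp_dpm_mclk <==
==> pp_dpm_pcie <==
0: 2.5GT/s, x16 619Mhz
1: 8.0GT/s, x16 619Mhz *
==> pp_dpm_sclk <==
0: 0Mhz
1: 800Mhz *
2: 0Mhz
==> pp_dpm_socclk <==
0: 0Mhz
1: 506Mhz *
2: 0Mhz
"""


def parse_pp_dpm(text):
    """Split `head pp_dpm_*` output into {file name: [raw content lines]}."""
    tables = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        header = HEADER_RE.match(line)
        if header:
            current = header.group(1)
            tables[current] = []
        elif current is not None:
            tables[current].append(line)
    return tables


def clock_levels(lines):
    """Return (level, mhz, is_active) for each line that is a plain clock level."""
    levels = []
    for line in lines:
        m = LEVEL_RE.match(line)
        if m:
            levels.append((int(m.group(1)), int(m.group(2)), m.group(3) == "*"))
    return levels


def suspicious(tables):
    """List (file, reason) pairs for files that are empty or report 0 MHz."""
    bad = []
    for name, lines in sorted(tables.items()):
        if not lines:
            bad.append((name, "empty"))
        elif any(mhz == 0 for _, mhz, _ in clock_levels(lines)):
            bad.append((name, "0 MHz level"))
    return bad


if __name__ == "__main__":
    for name, reason in suspicious(parse_pp_dpm(AFTER)):
        print(f"{name}: {reason}")
```

Run against the "AFTER reset" listing, this flags pp_dpm_mclk as empty and pp_dpm_sclk and pp_dpm_socclk as containing 0Mhz levels. It does not compare values between the two listings, so the change of fclk level 1 from 950Mhz to 886Mhz would need a separate diff.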