[PATCH 1/2] drm/amdgpu: enable GPU reset by default on Navi

Thu Jan 30 11:15:17 UTC 2020

Hi Alex,

I hit one issue while testing this patch on an RX5700.

After a gfx hang, a GPU reset is executed.
Switching to a VT and restarting gdm works fine, but the clocks seem wrong afterwards:
   - lots of graphical artifacts (underflows?)
   - pp_dpm_sclk and pp_dpm_socclk report 0Mhz for levels 0 and 2, and pp_dpm_mclk is empty (see attached files)

dmesg output (from the gfx hang):
[  169.755071] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
[  169.755173] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
[  174.874847] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=10034, emitted seq=10036
[  174.874925] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process gnome-shell pid 1724 thread gnome-shel:cs0 pid 1741
[  174.874933] amdgpu 0000:0b:00.0: GPU reset begin!
[  174.875192] ------------[ cut here ]------------
[  174.875282] WARNING: CPU: 0 PID: 7 at drivers/gpu/drm/amd/amdgpu/../display/dc/dcn20/dcn20_resource.c:2969 dcn20_validate_bandwidth+0x87/0xe0 [amdgpu]
[  174.875283] Modules linked in: binfmt_misc(E) nls_ascii(E) nls_cp437(E) vfat(E) fat(E) snd_hda_codec_realtek(E) snd_hda_codec_generic(E) ledtrig_audio(E) snd_hda_codec_hdmi(E) snd_hda_intel(E) snd_intel_nhlt(E) snd_hda_codec(E) snd_hwdep(E) efi_pstore(E) snd_hda_core(E) snd_pcm(E) snd_timer(E) snd(E) ccp(E) xpad(E) wmi_bmof(E) mxm_wmi(E) evdev(E) joydev(E) ff_memless(E) efivars(E) soundcore(E) pcspkr(E) k10temp(E) sp5100_tco(E) rng_core(E) sg(E) wmi(E) button(E) acpi_cpufreq(E) uinput(E) parport_pc(E) ppdev(E) lp(E) parport(E) efivarfs(E) ip_tables(E) autofs4(E) ext4(E) crc32c_generic(E) crc16(E) mbcache(E) jbd2(E) dm_crypt(E) dm_mod(E) hid_generic(E) usbhid(E) hid(E) sd_mod(E) crct10dif_pclmul(E) crc32_pclmul(E) crc32c_intel(E) ghash_clmulni_intel(E) amdgpu(E) gpu_sched(E) ttm(E) drm_kms_helper(E) aesni_intel(E) glue_helper(E) xhci_pci(E) ahci(E) crypto_simd(E) libahci(E) cryptd(E) drm(E) xhci_hcd(E) i2c_piix4(E) libata(E) igb(E) usbcore(E) scsi_mod(E) dca(E) i2c_algo_bit(E) gpio_amdpt(E)
[  174.875310]  gpio_generic(E)
[  174.875313] CPU: 0 PID: 7 Comm: kworker/0:1 Tainted: G            E     5.4.0-rc7-02679-g9d664d914f0e #10
[  174.875314] Hardware name: Gigabyte Technology Co., Ltd. X470 AORUS ULTRA GAMING/X470 AORUS ULTRA GAMING-CF, BIOS F6 01/25/2019
[  174.875318] Workqueue: events drm_sched_job_timedout [gpu_sched]
[  174.875404] RIP: 0010:dcn20_validate_bandwidth+0x87/0xe0 [amdgpu]
[  174.875406] Code: 2d 44 22 a5 e8 1d 00 00 75 26 f2 0f 11 85 a8 21 00 00 31 d2 48 89 ee 4c 89 ef e8 d4 f5 ff ff 41 89 c4 22 85 e8 1d 00 00 75 4a <0f> 0b eb 02 75 d1 f2 0f 10 14 24 f2 0f 11 95 a8 21 00 00 e8 f1 4b
[  174.875407] RSP: 0018:ffffa87880067a90 EFLAGS: 00010246
[  174.875408] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000001e61
[  174.875409] RDX: 0000000000001e60 RSI: ffff981f3ec2d880 RDI: 000000000002d880
[  174.875409] RBP: ffff981f25a60000 R08: 0000000000000006 R09: 0000000000000000
[  174.875410] R10: 0000000100000000 R11: 0000000100000001 R12: 0000000000000001
[  174.875411] R13: ffff981f1af40000 R14: 0000000000000000 R15: ffff981f25a60000
[  174.875412] FS:  0000000000000000(0000) GS:ffff981f3ec00000(0000) knlGS:0000000000000000
[  174.875413] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  174.875414] CR2: 00007f42858f3000 CR3: 00000007f02ae000 CR4: 00000000003406f0
[  174.875414] Call Trace:
[  174.875498]  dc_validate_global_state+0x25f/0x2d0 [amdgpu]
[  174.875583]  amdgpu_dm_atomic_check+0x8ff/0xf20 [amdgpu]
[  174.875587]  ? __ww_mutex_lock.isra.0+0x3a/0x760
[  174.875590]  ? _cond_resched+0x15/0x30
[  174.875591]  ? __ww_mutex_lock.isra.0+0x3a/0x760
[  174.875606]  drm_atomic_check_only+0x554/0x7e0 [drm]
[  174.875620]  ? drm_connector_list_iter_next+0x7d/0x90 [drm]
[  174.875632]  drm_atomic_commit+0x13/0x50 [drm]
[  174.875640]  drm_atomic_helper_disable_all+0x14c/0x160 [drm_kms_helper]
[  174.875647]  drm_atomic_helper_suspend+0x60/0xf0 [drm_kms_helper]
[  174.875730]  dm_suspend+0x1c/0x60 [amdgpu]
[  174.875782]  amdgpu_device_ip_suspend_phase1+0x81/0xe0 [amdgpu]
[  174.875836]  amdgpu_device_ip_suspend+0x1c/0x60 [amdgpu]
[  174.875923]  amdgpu_device_pre_asic_reset+0x191/0x1a4 [amdgpu]
[  174.876010]  amdgpu_device_gpu_recover.cold+0x43a/0xbca [amdgpu]
[  174.876084]  amdgpu_job_timedout+0x103/0x130 [amdgpu]
[  174.876088]  drm_sched_job_timedout+0x7f/0xe0 [gpu_sched]
[  174.876092]  process_one_work+0x1b5/0x360
[  174.876094]  worker_thread+0x50/0x3c0
[  174.876096]  kthread+0xf9/0x130
[  174.876097]  ? process_one_work+0x360/0x360
[  174.876099]  ? kthread_park+0x90/0x90
[  174.876100]  ret_from_fork+0x22/0x40
[  174.876103] ---[ end trace af4365804bf318ce ]---
[  175.346937] [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KGQ disable failed
[  175.600179] [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
[  175.853418] [drm:gfx_v10_0_cp_gfx_enable [amdgpu]] *ERROR* failed to halt cp gfx
[  175.874639] [drm] psp command (0x2) failed and response status is (0x117)
[  178.963249] amdgpu 0000:0b:00.0: GPU reset succeeded, trying to resume
[  178.963364] [drm] PCIE GART of 512M enabled (table at 0x0000008000300000).
[  178.964227] [drm] PSP is resuming...
[  179.130413] [drm:mod_hdcp_add_display_topology [amdgpu]] *ERROR* Failed to add display topology, DTM TA is not initialized.
[  179.130492] [drm:mod_hdcp_add_display_topology [amdgpu]] *ERROR* Failed to add display topology, DTM TA is not initialized.
[  179.138589] [drm] reserve 0xa00000 from 0x81fe400000 for PSP TMR
[  179.330602] amdgpu 0000:0b:00.0: RAS: ras ta ucode is not available
[  179.354974] amdgpu: [powerplay] SMU is resuming...
[  179.357614] amdgpu: [powerplay] dpm has been disabled
[  179.357617] amdgpu: [powerplay] SMU is resumed successfully!
[  179.649168] [drm] kiq ring mec 2 pipe 1 q 0
[  179.663833] [drm] VCN decode and encode initialized successfully(under DPG Mode).
[  179.663872] [drm] JPEG decode initialized successfully.
[  179.663877] amdgpu 0000:0b:00.0: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[  179.663878] amdgpu 0000:0b:00.0: ring gfx_0.1.0 uses VM inv eng 1 on hub 0
[  179.663880] amdgpu 0000:0b:00.0: ring comp_1.0.0 uses VM inv eng 4 on hub 0
[  179.663881] amdgpu 0000:0b:00.0: ring comp_1.1.0 uses VM inv eng 5 on hub 0
[  179.663882] amdgpu 0000:0b:00.0: ring comp_1.2.0 uses VM inv eng 6 on hub 0
[  179.663883] amdgpu 0000:0b:00.0: ring comp_1.3.0 uses VM inv eng 7 on hub 0
[  179.663884] amdgpu 0000:0b:00.0: ring comp_1.0.1 uses VM inv eng 8 on hub 0
[  179.663886] amdgpu 0000:0b:00.0: ring comp_1.1.1 uses VM inv eng 9 on hub 0
[  179.663887] amdgpu 0000:0b:00.0: ring comp_1.2.1 uses VM inv eng 10 on hub 0
[  179.663888] amdgpu 0000:0b:00.0: ring comp_1.3.1 uses VM inv eng 11 on hub 0
[  179.663889] amdgpu 0000:0b:00.0: ring kiq_2.1.0 uses VM inv eng 12 on hub 0
[  179.663890] amdgpu 0000:0b:00.0: ring sdma0 uses VM inv eng 13 on hub 0
[  179.663891] amdgpu 0000:0b:00.0: ring sdma1 uses VM inv eng 14 on hub 0
[  179.663892] amdgpu 0000:0b:00.0: ring vcn_dec uses VM inv eng 0 on hub 1
[  179.663893] amdgpu 0000:0b:00.0: ring vcn_enc0 uses VM inv eng 1 on hub 1
[  179.663894] amdgpu 0000:0b:00.0: ring vcn_enc1 uses VM inv eng 4 on hub 1
[  179.663895] amdgpu 0000:0b:00.0: ring jpeg_dec uses VM inv eng 5 on hub 1
[  179.675804] [drm] recover vram bo from shadow start
[  179.677342] [drm] recover vram bo from shadow done
[  179.677344] [drm] Skip scheduling IBs!
[  179.677345] [drm] Skip scheduling IBs!
[  179.677357] amdgpu 0000:0b:00.0: GPU reset(1) succeeded!
[  179.677367] [drm] Skip scheduling IBs!
[  179.677372] [drm] Skip scheduling IBs!
[  179.677375] [drm] Skip scheduling IBs!
[  179.677378] [drm] Skip scheduling IBs!
[  179.677380] [drm] Skip scheduling IBs!
[  179.677382] [drm] Skip scheduling IBs!
[  179.677383] [drm] Skip scheduling IBs!

Pierre-Eric

On 27/01/2020 20:37, Alex Deucher wrote:
> Has been working fine for a while.
> 
> Signed-off-by: Alex Deucher <alexander.deucher at amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 1da03658891c..69248d1b2417 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3762,6 +3762,9 @@ bool amdgpu_device_should_recover_gpu(struct amdgpu_device *adev)
>  		case CHIP_VEGA12:
>  		case CHIP_RAVEN:
>  		case CHIP_ARCTURUS:
> +		case CHIP_NAVI10:
> +		case CHIP_NAVI14:
> +		case CHIP_NAVI12:
>  			break;
>  		default:
>  			goto disabled;
> 
-------------- next part --------------
BEFORE reset
------------

# head pp_dpm_*
==> pp_dpm_dcefclk <==
0: 506Mhz *
1: 886Mhz 
2: 1266Mhz 

==> pp_dpm_fclk <==
0: 506Mhz 
1: 950Mhz *
2: 1266Mhz 

==> pp_dpm_mclk <==
0: 100Mhz 
1: 500Mhz 
2: 625Mhz 
3: 875Mhz *

==> pp_dpm_pcie <==
0: 2.5GT/s, x16 619Mhz 
1: 8.0GT/s, x16 619Mhz *

==> pp_dpm_sclk <==
0: 300Mhz 
1: 800Mhz *
2: 1750Mhz 

==> pp_dpm_socclk <==
0: 506Mhz 
1: 950Mhz *
2: 1266Mhz 

AFTER reset
-----------

# head pp_dpm_*
==> pp_dpm_dcefclk <==
0: 506Mhz *
1: 886Mhz 
2: 1266Mhz 

==> pp_dpm_fclk <==
0: 506Mhz *
1: 886Mhz 
2: 1266Mhz 

==> pp_dpm_mclk <==

==> pp_dpm_pcie <==
0: 2.5GT/s, x16 619Mhz 
1: 8.0GT/s, x16 619Mhz *

==> pp_dpm_sclk <==
0: 0Mhz 
1: 800Mhz *
2: 0Mhz 

==> pp_dpm_socclk <==
0: 0Mhz 
1: 506Mhz *
2: 0Mhz
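
The two dumps above can also be compared mechanically. The following is a minimal sketch, assuming the text comes from `head pp_dpm_*` in the format shown above; the function names `parse_head_output` and `suspicious` are hypothetical, and the rule "an empty file or a 0Mhz level is suspicious" is an assumption based on this report, not a driver-defined criterion.

```python
import re

# "==> pp_dpm_sclk <==" style header printed by `head` for each file.
HEADER_RE = re.compile(r"^==> (\S+) <==$")
# One level line such as "1: 800Mhz *"; a trailing "*" marks the active level.
LEVEL_RE = re.compile(r"^(\d+):\s*(.*?)\s*(\*)?\s*$")


def parse_head_output(text):
    """Parse `head pp_dpm_*` output into {file: [(level, value, active), ...]}."""
    tables = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = HEADER_RE.match(line)
        if header:
            current = header.group(1)
            tables[current] = []
            continue
        level = LEVEL_RE.match(line)
        if level and current is not None:
            tables[current].append(
                (int(level.group(1)), level.group(2), level.group(3) is not None)
            )
    return tables


def suspicious(tables):
    """Return sorted names of files that are empty or list a 0Mhz level.

    Assumption: both conditions indicate a broken DPM table, as seen in
    the AFTER dump of this report.
    """
    bad = []
    for name, levels in tables.items():
        if not levels or any(value == "0Mhz" for _, value, _ in levels):
            bad.append(name)
    return sorted(bad)
```

Passing the AFTER dump through `suspicious(parse_head_output(text))` would flag the files whose levels read 0Mhz or that came back empty, which makes it easier to check whether a later fix restores the tables after a reset.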