[REGRESSION] rx7600 stopped working after "1cfb4d612127 drm/amdgpu: put MQDs in VRAM"

Alexey Klimov alexey.klimov at linaro.org
Thu Oct 26 17:33:20 UTC 2023


#regzbot introduced: 1cfb4d612127
#regzbot title: rx7600 stopped working after "1cfb4d612127 drm/amdgpu: put MQDs in VRAM"

Hi all,

I've been playing with an RX7600 and found that amdgpu stopped working somewhere between kernels 6.2 and 6.5.
I then narrowed it down to 6.4 <-> 6.5-rc1, and bisect finally pointed at 1cfb4d6121276a829aa94d0e32a7f5e1830ebc21.
I also manually checked the previous commit and the mentioned one to confirm the bisect result.

I think the log also reveals a warning in the error path. Please see below.
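
For reference when reading the log below: the -110 it reports looks like the kernel's -ETIMEDOUT, i.e. the ring test gave up waiting rather than hitting an allocation or argument error. Here is a minimal sketch of decoding such kernel-style return codes. The numeric values are the generic Linux errno numbers as I remember them (an assumption, not read from this system's headers), and the sketch_* names are hypothetical helpers, not kernel functions.

```c
#include <assert.h>
#include <stddef.h>

/* Generic Linux errno numbers (assumed; check include/uapi/asm-generic/errno*.h). */
struct sketch_errno_name {
	int code;
	const char *name;
};

static const struct sketch_errno_name sketch_errno_table[] = {
	{ 12,  "ENOMEM" },
	{ 16,  "EBUSY" },
	{ 19,  "ENODEV" },
	{ 22,  "EINVAL" },
	{ 110, "ETIMEDOUT" },
};

/* Map a kernel-style return value (0 or a negative errno) to a symbolic name. */
const char *sketch_kernel_err_name(int ret)
{
	size_t i;

	/* Kernel functions of this kind return 0 on success, -errno on failure. */
	assert(ret <= 0);
	if (ret == 0)
		return "success";

	for (i = 0; i < sizeof(sketch_errno_table) / sizeof(sketch_errno_table[0]); i++) {
		if (sketch_errno_table[i].code == -ret)
			return sketch_errno_table[i].name;
	}
	return "unknown";
}

/* True if the return value is a timeout, as in "test failed (-110)". */
int sketch_is_timeout(int ret)
{
	return ret == -110;
}
```

Calling sketch_kernel_err_name(-110) yields the string "ETIMEDOUT", which matches the "ring comp_1.0.0 test failed (-110)" line.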

I didn't check any further. This is a simple Debian testing system with the following cmdline options:
root@avadebian:~# cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-6.6-rc7+ ignore_loglevel root=/dev/nvme1n1p2 ro nr_cpus=32

So far a simple revert (patch is below) returns things back to normal-ish: there are huge graphics artifacts on Xorg/X11 with kernels from 6.1 up to upstream. Wayland-based sway works great without issues. I'm not sure where I should report this.
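
To make clear what the revert changes, here is a rough standalone sketch of the MQD placement-domain choice with and without 1cfb4d612127, based on the hunks in the patch below. The constant values and the version-packing macro are my assumptions about the amdgpu headers, renamed with a SKETCH_ prefix so they are not mistaken for the real definitions; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values standing in for AMDGPU_GEM_DOMAIN_GTT / AMDGPU_GEM_DOMAIN_VRAM. */
#define SKETCH_GEM_DOMAIN_GTT  0x2u
#define SKETCH_GEM_DOMAIN_VRAM 0x4u

/* Assumed packing of an IP version, standing in for IP_VERSION(). */
#define SKETCH_IP_VERSION(mj, mn, rv) \
	(((uint32_t)(mj) << 16) | ((uint32_t)(mn) << 8) | (uint32_t)(rv))

static_assert((SKETCH_GEM_DOMAIN_GTT & SKETCH_GEM_DOMAIN_VRAM) == 0,
	      "domain flags must be distinct bits");

/* With the commit: GTT is always allowed, VRAM is added for GC >= 10.0.0. */
uint32_t sketch_mqd_domain_with_commit(uint32_t gc_ip_version)
{
	uint32_t domain = SKETCH_GEM_DOMAIN_GTT;

	if (gc_ip_version >= SKETCH_IP_VERSION(10, 0, 0))
		domain |= SKETCH_GEM_DOMAIN_VRAM;
	return domain;
}

/* With the revert: MQDs are only ever allowed in GTT, whatever the GC version. */
uint32_t sketch_mqd_domain_with_revert(uint32_t gc_ip_version)
{
	(void)gc_ip_version;
	return SKETCH_GEM_DOMAIN_GTT;
}
```

For a gfx11 part such as this one, the first function returns GTT | VRAM and the second returns GTT only, which is the whole behavioural difference of the revert for the gfx and compute ring MQDs.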

Please let me know if I can help with debugging or testing, or provide other logs regarding 1cfb4d612127. Are there any cmdline options I should use to collect more info?

Thanks,
Alexey



From 214372d5cedcf8757dd80d5f4d058377a3d92c52 Mon Sep 17 00:00:00 2001
From: Alexey Klimov <alexey.klimov at linaro.org>
Date: Thu, 26 Oct 2023 17:01:02 +0100
Subject: [PATCH] drm/amdgpu: Revert "drm/amdgpu: put MQDs in VRAM"

This reverts commit 1cfb4d6121276a829aa94d0e32a7f5e1830ebc21.

amdgpu driver fails during initialisation with RX7600/gfx11 on
ADLINK Ampere Altra Developer Platform (AVA developer platform)
with mentioned commit:

[   12.559893] [drm] Display Core v3.2.247 initialized on DCN 3.2.1
[   12.565906] [drm] DP-HDMI FRL PCON supported
[   12.572192] [drm] DMUB hardware initialized: version=0x07000C00
[   12.582541] snd_hda_intel 000d:03:00.1: bound 000d:03:00.0 (ops amdgpu_dm_audio_component_bind_ops [amdgpu])
[   12.625357] [drm] kiq ring mec 3 pipe 1 q 0
[   12.857087] amdgpu 000d:03:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring comp_1.0.0 test failed (-110)
[   12.867930] [drm:amdgpu_device_init [amdgpu]] *ERROR* hw_init of IP block <gfx_v11_0> failed -110
[   12.877289] amdgpu 000d:03:00.0: amdgpu: amdgpu_device_ip_init failed
[   12.883723] amdgpu 000d:03:00.0: amdgpu: Fatal error during GPU init
[   12.890070] amdgpu 000d:03:00.0: amdgpu: amdgpu: finishing device.
[   12.896586] [drm] DSC precompute is not needed.
[   12.901142] ------------[ cut here ]------------
[   12.905747] WARNING: CPU: 0 PID: 212 at drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c:615 amdgpu_irq_put+0xa8/0xc8 [amdgpu]
[   12.916841] Modules linked in: hid_generic(E) usbhid(E) hid(E) qrtr(E) iptable_nat(E) amdgpu(E+) nf_nat(E) nf_conntrack(E) snd_hda_codec_hdmi(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) libcrc32c(E) iptable_mangle(E) iptable_filter(E) amdxcp(E) drm_exec(E) gpu_sched(E) snd_hda_intel(E) aes_ce_blk(E) snd_intel_dspcfg(E) drm_buddy(E) aes_ce_cipher(E) snd_hda_codec(E) xhci_pci(E) video(E) crct10dif_ce(E) polyval_ce(E) snd_hda_core(E) xhci_hcd(E) drm_suballoc_helper(E) snd_hwdep(E) polyval_generic(E) drm_ttm_helper(E) snd_pcm(E) ghash_ce(E) ast(E) ttm(E) gf128mul(E) snd_timer(E) ipmi_ssif(E) drm_display_helper(E) drm_shmem_helper(E) sha2_ce(E) sha256_arm64(E) ipmi_devintf(E) usbcore(E) snd(E) drm_kms_helper(E) igb(E) sha1_ce(E) sbsa_gwdt(E) ipmi_msghandler(E) arm_spe_pmu(E) soundcore(E) usb_common(E) i2c_algo_bit(E) cppc_cpufreq(E) i2c_designware_platform(E) arm_dsu_pmu(E) arm_cmn(E) xgene_hwmon(E) i2c_designware_core(E) evdev(E) binfmt_misc(E) loop(E) fuse(E) efi_pstore(E) drm(E) dm_mod(E) dax(E) configfs(E) efivarfs(E)
[   12.916916]  ip_tables(E) x_tables(E) autofs4(E)
[   13.011111] CPU: 0 PID: 212 Comm: kworker/0:2 Tainted: G            E      6.6.0-rc7+ #23
[   13.019277] Hardware name: ADLINK Ampere Altra Developer Platform/Ampere Altra Developer Platform, BIOS TianoCore 2.04.100.10 (SYS: 2.06.20220308) 04/18/2
[   13.033084] Workqueue: events work_for_cpu_fn
[   13.037434] pstate: 20400009 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[   13.044384] pc : amdgpu_irq_put+0xa8/0xc8 [amdgpu]
[   13.049652] lr : amdgpu_fence_driver_hw_fini+0x118/0x160 [amdgpu]
[   13.056220] sp : ffff80008012bc10
[   13.059522] x29: ffff80008012bc20 x28: 0000000000000000 x27: 0000000000000000
[   13.066647] x26: 0000000000000000 x25: ffff07ff98580010 x24: ffff07ff98580000
[   13.073772] x23: ffff07ff985a78f0 x22: ffff07ff98580010 x21: ffff07ff985904c8
[   13.080896] x20: ffff07ff985900e8 x19: ffff07ff98598580 x18: 0000000000000006
[   13.088020] x17: 0000000000000020 x16: ffffbb510d0d7140 x15: fffffffffffffefb
[   13.095145] x14: 0000000000000000 x13: 2e64656465656e20 x12: ffff07ff8c7fd9e0
[   13.102268] x11: 00000000000003e8 x10: ffff07ff8c7fd9e0 x9 : ffffbb50ac3345e0
[   13.109392] x8 : ffffbb50abf18000 x7 : 0000000000000000 x6 : 000000007a456104
[   13.116516] x5 : 0000000000000000 x4 : ffff07ff98580000 x3 : 0000000000000000
[   13.123641] x2 : 0000000000000000 x1 : ffff07ff985a78f0 x0 : ffff07ffc5fd4000
[   13.130765] Call trace:
[   13.133200]  amdgpu_irq_put+0xa8/0xc8 [amdgpu]
[   13.138121]  amdgpu_device_fini_hw+0xb8/0x380 [amdgpu]
[   13.143732]  amdgpu_driver_unload_kms+0x54/0x80 [amdgpu]
[   13.149517]  amdgpu_driver_load_kms+0x100/0x1c0 [amdgpu]
[   13.155301]  amdgpu_pci_probe+0x134/0x428 [amdgpu]
[   13.160564]  local_pci_probe+0x48/0xb8
[   13.164305]  work_for_cpu_fn+0x24/0x40
[   13.168043]  process_one_work+0x170/0x3d0
[   13.172042]  worker_thread+0x2bc/0x3e0
[   13.175781]  kthread+0x118/0x128
[   13.178999]  ret_from_fork+0x10/0x20
[   13.182564] ---[ end trace 0000000000000000 ]---
...
[   16.984679] amdgpu: probe of 000d:03:00.0 failed with error -110

Cc: Luben Tuikov <luben.tuikov at amd.com>
Cc: Alex Deucher <alexander.deucher at amd.com>
Fixes: 1cfb4d612127 ("drm/amdgpu: put MQDs in VRAM")
Signed-off-by: Alexey Klimov <alexey.klimov at linaro.org>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c | 9 ++-------
 drivers/gpu/drm/amd/amdgpu/mes_v10_1.c  | 1 -
 drivers/gpu/drm/amd/amdgpu/mes_v11_0.c  | 1 -
 3 files changed, 2 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
index 2382921710ec..1f2d8be0fc44 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
@@ -382,11 +382,6 @@ int amdgpu_gfx_mqd_sw_init(struct amdgpu_device *adev,
 	int r, i, j;
 	struct amdgpu_kiq *kiq = &adev->gfx.kiq[xcc_id];
 	struct amdgpu_ring *ring = &kiq->ring;
-	u32 domain = AMDGPU_GEM_DOMAIN_GTT;
-
-	/* Only enable on gfx10 and 11 for now to avoid changing behavior on older chips */
-	if (adev->ip_versions[GC_HWIP][0] >= IP_VERSION(10, 0, 0))
-		domain |= AMDGPU_GEM_DOMAIN_VRAM;
 
 	/* create MQD for KIQ */
 	if (!adev->enable_mes_kiq && !ring->mqd_obj) {
@@ -421,7 +416,7 @@ int amdgpu_gfx_mqd_sw_init(struct amdgpu_device *adev,
 			ring = &adev->gfx.gfx_ring[i];
 			if (!ring->mqd_obj) {
 				r = amdgpu_bo_create_kernel(adev, mqd_size, PAGE_SIZE,
-							    domain, &ring->mqd_obj,
+							    AMDGPU_GEM_DOMAIN_GTT, &ring->mqd_obj,
 							    &ring->mqd_gpu_addr, &ring->mqd_ptr);
 				if (r) {
 					dev_warn(adev->dev, "failed to create ring mqd bo (%d)", r);
@@ -445,7 +440,7 @@ int amdgpu_gfx_mqd_sw_init(struct amdgpu_device *adev,
 		ring = &adev->gfx.compute_ring[j];
 		if (!ring->mqd_obj) {
 			r = amdgpu_bo_create_kernel(adev, mqd_size, PAGE_SIZE,
-						    domain, &ring->mqd_obj,
+						    AMDGPU_GEM_DOMAIN_GTT, &ring->mqd_obj,
 						    &ring->mqd_gpu_addr, &ring->mqd_ptr);
 			if (r) {
 				dev_warn(adev->dev, "failed to create ring mqd bo (%d)", r);
diff --git a/drivers/gpu/drm/amd/amdgpu/mes_v10_1.c b/drivers/gpu/drm/amd/amdgpu/mes_v10_1.c
index eb06d749876f..080e7eb3f98d 100644
--- a/drivers/gpu/drm/amd/amdgpu/mes_v10_1.c
+++ b/drivers/gpu/drm/amd/amdgpu/mes_v10_1.c
@@ -898,7 +898,6 @@ static int mes_v10_1_mqd_sw_init(struct amdgpu_device *adev,
 		return 0;
 
 	r = amdgpu_bo_create_kernel(adev, mqd_size, PAGE_SIZE,
-				    AMDGPU_GEM_DOMAIN_VRAM |
 				    AMDGPU_GEM_DOMAIN_GTT, &ring->mqd_obj,
 				    &ring->mqd_gpu_addr, &ring->mqd_ptr);
 	if (r) {
diff --git a/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c b/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
index 6827d547042e..0608710306b8 100644
--- a/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
@@ -1004,7 +1004,6 @@ static int mes_v11_0_mqd_sw_init(struct amdgpu_device *adev,
 		return 0;
 
 	r = amdgpu_bo_create_kernel(adev, mqd_size, PAGE_SIZE,
-				    AMDGPU_GEM_DOMAIN_VRAM |
 				    AMDGPU_GEM_DOMAIN_GTT, &ring->mqd_obj,
 				    &ring->mqd_gpu_addr, &ring->mqd_ptr);
 	if (r) {
-- 
2.42.0


