[REGRESSION] rx7600 stopped working after "1cfb4d612127 drm/amdgpu: put MQDs in VRAM"

Alex Deucher alexdeucher at gmail.com
Thu Oct 26 18:53:40 UTC 2023


On Thu, Oct 26, 2023 at 1:33 PM Alexey Klimov <alexey.klimov at linaro.org> wrote:
>
> #regzbot introduced: 1cfb4d612127
> #regzbot title: rx7600 stopped working after "1cfb4d612127 drm/amdgpu: put MQDs in VRAM"
>
> Hi all,
>
> I've been playing with an RX7600 and observed that amdgpu stopped working somewhere between kernels 6.2 and 6.5.
> I then narrowed it down to 6.4 <-> 6.5-rc1, and finally the bisect pointed at 1cfb4d6121276a829aa94d0e32a7f5e1830ebc21.
> I also manually checked whether it boots/works on the previous commit and on the mentioned one.
>
> I guess the log also reveals a warning in the error path. Please see below.
>
> I didn't check any further. This is a simple Debian testing system with the following cmdline options:
> root@avadebian:~# cat /proc/cmdline
> BOOT_IMAGE=/boot/vmlinuz-6.6-rc7+ ignore_loglevel root=/dev/nvme1n1p2 ro nr_cpus=32
>
> So far a simple revert (patch is below) returns things back to normal-ish: there are huge graphics artifacts on Xorg/X11 from 6.1 through the upstream kernel. Wayland-based sway works great without issues. Not sure where I should report this.
>
> Please let me know if I can help with debugging or testing, or provide some other logs regarding 1cfb4d612127. Are there any cmdline options to collect more info?

Please make sure you have this patch as well:
e602157ec089240861cd641ee2c7c64eeaec09bf ("drm/amdgpu: fix S3 issue if
MQD in VRAM")

Please open a ticket here so we can track this:
https://gitlab.freedesktop.org/drm/amd/-/issues/

I think I see the problem. Please see if the attached patch 1 fixes the
issue. If it does, that would also explain the issues you are
seeing with Xorg. It would appear there are limitations around MMIO
access on your platform, and unfortunately most graphics APIs require
unaligned access to MMIO space with the CPU. We can fix the kernel
side pretty easily, but userspace will be a problem.
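
To illustrate the kind of kernel-side change meant here: the attached patch is not reproduced in the archive, so the following is only a minimal userspace sketch of the general idea, not the patch itself and not the kernel's memcpy_toio() implementation. The function name copy_to_io_sketch is hypothetical. It assumes the destination stands in for an MMIO-mapped buffer that tolerates only naturally aligned 32-bit stores, and that the length is a multiple of 4 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical userspace stand-in for an I/O-safe copy. The destination is
 * only ever written with naturally aligned 32-bit stores, never with the
 * wider or unaligned stores an optimised memcpy() may emit.
 */
static void copy_to_io_sketch(volatile uint32_t *dst, const void *src,
			      size_t bytes)
{
	const unsigned char *s = src;
	size_t i;

	/* Assumptions of this sketch: aligned destination, whole words only. */
	assert(((uintptr_t)dst % sizeof(uint32_t)) == 0);
	assert(bytes % sizeof(uint32_t) == 0);

	for (i = 0; i < bytes / sizeof(uint32_t); i++) {
		uint32_t word;

		/* The source is ordinary RAM, so it may be unaligned. */
		memcpy(&word, s + i * sizeof(word), sizeof(word));
		dst[i] = word;	/* one aligned 32-bit store per word */
	}
}
```

The source side can be read any way the compiler likes; only the stores to the destination are constrained, which is the property that matters when the destination is device memory.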

More below.

>
> Thanks,
> Alexey
>
>
>
> From 214372d5cedcf8757dd80d5f4d058377a3d92c52 Mon Sep 17 00:00:00 2001
> From: Alexey Klimov <alexey.klimov at linaro.org>
> Date: Thu, 26 Oct 2023 17:01:02 +0100
> Subject: [PATCH] drm/amdgpu: Revert "drm/amdgpu: put MQDs in VRAM"
>
> This reverts commit 1cfb4d6121276a829aa94d0e32a7f5e1830ebc21.
>
> The amdgpu driver fails during initialisation with RX7600/gfx11 on
> ADLINK Ampere Altra Developer Platform (AVA developer platform)
> with the mentioned commit:
>
> [   12.559893] [drm] Display Core v3.2.247 initialized on DCN 3.2.1
> [   12.565906] [drm] DP-HDMI FRL PCON supported
> [   12.572192] [drm] DMUB hardware initialized: version=0x07000C00
> [   12.582541] snd_hda_intel 000d:03:00.1: bound 000d:03:00.0 (ops amdgpu_dm_audio_component_bind_ops [amdgpu])
> [   12.625357] [drm] kiq ring mec 3 pipe 1 q 0
> [   12.857087] amdgpu 000d:03:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring comp_1.0.0 test failed (-110)
> [   12.867930] [drm:amdgpu_device_init [amdgpu]] *ERROR* hw_init of IP block <gfx_v11_0> failed -110
> [   12.877289] amdgpu 000d:03:00.0: amdgpu: amdgpu_device_ip_init failed
> [   12.883723] amdgpu 000d:03:00.0: amdgpu: Fatal error during GPU init
> [   12.890070] amdgpu 000d:03:00.0: amdgpu: amdgpu: finishing device.
> [   12.896586] [drm] DSC precompute is not needed.
> [   12.901142] ------------[ cut here ]------------
> [   12.905747] WARNING: CPU: 0 PID: 212 at drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c:615 amdgpu_irq_put+0xa8/0xc8 [amdgpu]
> [   12.916841] Modules linked in: hid_generic(E) usbhid(E) hid(E) qrtr(E) iptable_nat(E) amdgpu(E+) nf_nat(E) nf_conntrack(E) snd_hda_codec_hdmi(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) libcrc32c(E) iptable_mangle(E) iptable_filter(E) amdxcp(E) drm_exec(E) gpu_sched(E) snd_hda_intel(E) aes_ce_blk(E) snd_intel_dspcfg(E) drm_buddy(E) aes_ce_cipher(E) snd_hda_codec(E) xhci_pci(E) video(E) crct10dif_ce(E) polyval_ce(E) snd_hda_core(E) xhci_hcd(E) drm_suballoc_helper(E) snd_hwdep(E) polyval_generic(E) drm_ttm_helper(E) snd_pcm(E) ghash_ce(E) ast(E) ttm(E) gf128mul(E) snd_timer(E) ipmi_ssif(E) drm_display_helper(E) drm_shmem_helper(E) sha2_ce(E) sha256_arm64(E) ipmi_devintf(E) usbcore(E) snd(E) drm_kms_helper(E) igb(E) sha1_ce(E) sbsa_gwdt(E) ipmi_msghandler(E) arm_spe_pmu(E) soundcore(E) usb_common(E) i2c_algo_bit(E) cppc_cpufreq(E) i2c_designware_platform(E) arm_dsu_pmu(E) arm_cmn(E) xgene_hwmon(E) i2c_designware_core(E) evdev(E) binfmt_misc(E) loop(E) fuse(E) efi_pstore(E) drm(E) dm_mod(E) dax(E) configfs(E) efivarfs(E)
> [   12.916916]  ip_tables(E) x_tables(E) autofs4(E)
> [   13.011111] CPU: 0 PID: 212 Comm: kworker/0:2 Tainted: G            E      6.6.0-rc7+ #23
> [   13.019277] Hardware name: ADLINK Ampere Altra Developer Platform/Ampere Altra Developer Platform, BIOS TianoCore 2.04.100.10 (SYS: 2.06.20220308) 04/18/2
> [   13.033084] Workqueue: events work_for_cpu_fn
> [   13.037434] pstate: 20400009 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [   13.044384] pc : amdgpu_irq_put+0xa8/0xc8 [amdgpu]
> [   13.049652] lr : amdgpu_fence_driver_hw_fini+0x118/0x160 [amdgpu]
> [   13.056220] sp : ffff80008012bc10
> [   13.059522] x29: ffff80008012bc20 x28: 0000000000000000 x27: 0000000000000000
> [   13.066647] x26: 0000000000000000 x25: ffff07ff98580010 x24: ffff07ff98580000
> [   13.073772] x23: ffff07ff985a78f0 x22: ffff07ff98580010 x21: ffff07ff985904c8
> [   13.080896] x20: ffff07ff985900e8 x19: ffff07ff98598580 x18: 0000000000000006
> [   13.088020] x17: 0000000000000020 x16: ffffbb510d0d7140 x15: fffffffffffffefb
> [   13.095145] x14: 0000000000000000 x13: 2e64656465656e20 x12: ffff07ff8c7fd9e0
> [   13.102268] x11: 00000000000003e8 x10: ffff07ff8c7fd9e0 x9 : ffffbb50ac3345e0
> [   13.109392] x8 : ffffbb50abf18000 x7 : 0000000000000000 x6 : 000000007a456104
> [   13.116516] x5 : 0000000000000000 x4 : ffff07ff98580000 x3 : 0000000000000000
> [   13.123641] x2 : 0000000000000000 x1 : ffff07ff985a78f0 x0 : ffff07ffc5fd4000
> [   13.130765] Call trace:
> [   13.133200]  amdgpu_irq_put+0xa8/0xc8 [amdgpu]
> [   13.138121]  amdgpu_device_fini_hw+0xb8/0x380 [amdgpu]
> [   13.143732]  amdgpu_driver_unload_kms+0x54/0x80 [amdgpu]
> [   13.149517]  amdgpu_driver_load_kms+0x100/0x1c0 [amdgpu]
> [   13.155301]  amdgpu_pci_probe+0x134/0x428 [amdgpu]
> [   13.160564]  local_pci_probe+0x48/0xb8
> [   13.164305]  work_for_cpu_fn+0x24/0x40
> [   13.168043]  process_one_work+0x170/0x3d0
> [   13.172042]  worker_thread+0x2bc/0x3e0
> [   13.175781]  kthread+0x118/0x128
> [   13.178999]  ret_from_fork+0x10/0x20
> [   13.182564] ---[ end trace 0000000000000000 ]---
> ...
> [   16.984679] amdgpu: probe of 000d:03:00.0 failed with error -110
>
> Cc: Luben Tuikov <luben.tuikov at amd.com>
> Cc: Alex Deucher <alexander.deucher at amd.com>
> Fixes: 1cfb4d612127 ("drm/amdgpu: put MQDs in VRAM")
> Signed-off-by: Alexey Klimov <alexey.klimov at linaro.org>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c | 9 ++-------
>  drivers/gpu/drm/amd/amdgpu/mes_v10_1.c  | 1 -
>  drivers/gpu/drm/amd/amdgpu/mes_v11_0.c  | 1 -
>  3 files changed, 2 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> index 2382921710ec..1f2d8be0fc44 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> @@ -382,11 +382,6 @@ int amdgpu_gfx_mqd_sw_init(struct amdgpu_device *adev,
>         int r, i, j;
>         struct amdgpu_kiq *kiq = &adev->gfx.kiq[xcc_id];
>         struct amdgpu_ring *ring = &kiq->ring;
> -       u32 domain = AMDGPU_GEM_DOMAIN_GTT;
> -
> -       /* Only enable on gfx10 and 11 for now to avoid changing behavior on older chips */
> -       if (adev->ip_versions[GC_HWIP][0] >= IP_VERSION(10, 0, 0))
> -               domain |= AMDGPU_GEM_DOMAIN_VRAM;

Just removing the addition of the AMDGPU_GEM_DOMAIN_VRAM domain here
will revert the behavior.  Since this is an important optimization and
we aren't seeing any issues on x86, I'd prefer to just limit your arch
to GTT if we can't resolve it some other way.

Try patch 1 and if that doesn't work we can fall back to some variant
of patch 2.
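
As a rough illustration of what "limit your arch to GTT" could look like, here is a minimal standalone sketch. It is not the attached patch 2, whose contents are not reproduced in the archive. The function name mqd_domain_sketch, the arch_is_arm parameter, and the SKETCH_DOMAIN_* flags are all hypothetical stand-ins; a real kernel change would presumably use the AMDGPU_GEM_DOMAIN_* flags and a compile-time architecture check instead.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bit flags for this sketch only (stand-ins for AMDGPU_GEM_DOMAIN_*). */
#define SKETCH_DOMAIN_GTT  0x2u
#define SKETCH_DOMAIN_VRAM 0x4u

/*
 * Pick the memory domain(s) allowed for an MQD buffer.
 * gc_major:    major version of the graphics IP (10 or 11 for gfx10/gfx11).
 * arch_is_arm: hypothetical flag, true when building for ARM/ARM64.
 */
static uint32_t mqd_domain_sketch(unsigned int gc_major, bool arch_is_arm)
{
	uint32_t domain = SKETCH_DOMAIN_GTT;

	/* Keep the VRAM optimisation on gfx10+ except on the affected arch. */
	if (gc_major >= 10 && !arch_is_arm)
		domain |= SKETCH_DOMAIN_VRAM;

	return domain;
}
```

This keeps the optimisation where no issues have been seen and falls back to GTT only on the affected architecture.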

Alex

>
>         /* create MQD for KIQ */
>         if (!adev->enable_mes_kiq && !ring->mqd_obj) {
> @@ -421,7 +416,7 @@ int amdgpu_gfx_mqd_sw_init(struct amdgpu_device *adev,
>                         ring = &adev->gfx.gfx_ring[i];
>                         if (!ring->mqd_obj) {
>                                 r = amdgpu_bo_create_kernel(adev, mqd_size, PAGE_SIZE,
> -                                                           domain, &ring->mqd_obj,
> +                                                           AMDGPU_GEM_DOMAIN_GTT, &ring->mqd_obj,
>                                                             &ring->mqd_gpu_addr, &ring->mqd_ptr);
>                                 if (r) {
>                                         dev_warn(adev->dev, "failed to create ring mqd bo (%d)", r);
> @@ -445,7 +440,7 @@ int amdgpu_gfx_mqd_sw_init(struct amdgpu_device *adev,
>                 ring = &adev->gfx.compute_ring[j];
>                 if (!ring->mqd_obj) {
>                         r = amdgpu_bo_create_kernel(adev, mqd_size, PAGE_SIZE,
> -                                                   domain, &ring->mqd_obj,
> +                                                   AMDGPU_GEM_DOMAIN_GTT, &ring->mqd_obj,
>                                                     &ring->mqd_gpu_addr, &ring->mqd_ptr);
>                         if (r) {
>                                 dev_warn(adev->dev, "failed to create ring mqd bo (%d)", r);
> diff --git a/drivers/gpu/drm/amd/amdgpu/mes_v10_1.c b/drivers/gpu/drm/amd/amdgpu/mes_v10_1.c
> index eb06d749876f..080e7eb3f98d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/mes_v10_1.c
> +++ b/drivers/gpu/drm/amd/amdgpu/mes_v10_1.c
> @@ -898,7 +898,6 @@ static int mes_v10_1_mqd_sw_init(struct amdgpu_device *adev,
>                 return 0;
>
>         r = amdgpu_bo_create_kernel(adev, mqd_size, PAGE_SIZE,
> -                                   AMDGPU_GEM_DOMAIN_VRAM |
>                                     AMDGPU_GEM_DOMAIN_GTT, &ring->mqd_obj,
>                                     &ring->mqd_gpu_addr, &ring->mqd_ptr);
>         if (r) {
> diff --git a/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c b/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
> index 6827d547042e..0608710306b8 100644
> --- a/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
> @@ -1004,7 +1004,6 @@ static int mes_v11_0_mqd_sw_init(struct amdgpu_device *adev,
>                 return 0;
>
>         r = amdgpu_bo_create_kernel(adev, mqd_size, PAGE_SIZE,
> -                                   AMDGPU_GEM_DOMAIN_VRAM |
>                                     AMDGPU_GEM_DOMAIN_GTT, &ring->mqd_obj,
>                                     &ring->mqd_gpu_addr, &ring->mqd_ptr);
>         if (r) {
> --
> 2.42.0
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-drm-amdgpu-gfx10-11-use-memcpy_to-fromio-for-MQDs.patch
Type: text/x-patch
Size: 4970 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20231026/630fc17d/attachment-0002.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0002-drm-amdgpu-don-t-put-MQDs-in-VRAM-on-ARM-ARM64.patch
Type: text/x-patch
Size: 1176 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20231026/630fc17d/attachment-0003.bin>

