[REGRESSION] RX-580 (gfx803) GPU hangs since ~v6.14.1 – “scheduler comp_1.1.1 is not ready” / ROCm 5.7-6.4+ broken

Johl Brown johlbrown at gmail.com
Wed Jun 25 04:33:13 UTC 2025


Good afternoon and best wishes!
This is my first attempt at upstreaming an issue after dailying Arch for a
full year now :)
Please forgive me, a lot of this is pushing my comfort zone, but preventing
needless e-waste is important to me personally :) With that in mind, I will
save your eyeballs and let you know that I did use GPT to help compile the
report below, but I have proofread it several times (which means you can't
be mad :p ).


https://github.com/ROCm/ROCm/issues/4965
https://github.com/robertrosenbusch/gfx803_rocm/issues/35#issuecomment-2996884779


Hello Kernel, AMD GPU, & ROCm maintainers,

TL;DR: My Polaris card (RX 580, gfx803) freezes under compute load on a
number of kernels from roughly v6.14.1 onward. This was not the case on
earlier kernels running ROCm 6.4.0 on gfx803 cards.

The issue has been successfully mitigated, running an older ROCm release
under kernel v6.16-rc2, by reverting two specific commits:

   - de84484c6f8b (“drm/amdkfd: Improve signal event slow path”, 2024-12-19)
   - bac38ca057fe (“drm/amdkfd: implement per queue sdma reset for gfx 9.4+”,
     2025-03-06)

Reverting both commits on top of v6.16-rc3 restores full stability and
allows ROCm 5.7 workloads (e.g. Stable Diffusion, faster-whisper) to run.
On affected kernels the instability is usually obvious immediately: models
fail to initialise, and no error or segfault is reported back to the
application (only the host dmesg shows anything), whereas a reported
error/segfault was the usual failure mode under previous kernels.
------------------------------

Problem Description

A number of users report GPU hangs when initialising compute loads,
specifically with ROCm 5.7+ workloads. This issue appears to be a
regression, as it was not present in earlier kernel versions.

System Information:

   - OS: Arch Linux
   - CPU: Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
   - GPU: AMD Radeon RX 580 Series (gfx803)
   - ROCm Version: Runtime Version: 1.1, Runtime Ext Version: 1.7 (as per
     rocminfo --support)

------------------------------

Affected Kernels and Regression Details

The problem consistently occurs on v6.14.1-rc1 and newer kernels.

   - Last known good: v6.11
   - First known bad: v6.12

The regression has been narrowed down to the following two commits, as
reverting them resolves the issue:

   - de84484c6f8b (“drm/amdkfd: Improve signal event slow path”, 2024-12-19)
   - bac38ca057fe (“drm/amdkfd: implement per queue sdma reset for gfx 9.4+”,
     2025-03-06)

Both patches touch amdkfd queue reset paths and are first included in the
exact releases where the regression appears.
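
For reference, a quick way to sanity-check where the two commits actually
landed (a sketch; stable backports get new SHAs, so matching by subject line
is more reliable there than matching by hash):

    # Mainline: earliest tag that contains each commit
    git describe --contains de84484c6f8b
    git describe --contains bac38ca057fe

    # Stable: check whether a given point release picked up a backport,
    # matching by subject line rather than by hash
    git log --oneline v6.14..v6.14.1 -- drivers/gpu/drm/amd | \
        grep -iE 'signal event slow path|per queue sdma reset'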

Here's a summary of kernel results:

Kernel                                     | Result | Note
-------------------------------------------|--------|------------------------------------------------
6.13.y (LTS)                               | OK     |
6.14.0                                     | OK     | Baseline; my last working kernel (exact subversion not confirmed)
6.14.1-rc1                                 | BAD    | First hang
6.15-rc1                                   | BAD    | Hang
6.15.8                                     | BAD    | Hang
6.16-rc3                                   | BAD    | Hang
6.16-rc3 + reverts of de84484 and bac38ca  | OK     | Full stability restored; ROCm workloads run for hours
------------------------------

Reproduction Steps

   1. Boot the system with a kernel version exhibiting the issue (e.g.
      v6.14.1-rc1 or newer, without the reverts).
   2. Run a ROCm workload that creates several compute queues (a minimal
      sketch follows these steps), for example:
      - python stable-diffusion.py
      - faster-whisper --model medium ...
   3. Upon model initialization, an immediate driver crash occurs. This is
      visible on the host machine via dmesg logs.
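
A minimal sketch of such a reproduction run (the exact scripts and file
names below are illustrative, not copied verbatim from my setup):

    # Terminal 1: watch the host kernel log
    sudo dmesg --follow

    # Terminal 2: start any ROCm compute workload, e.g.
    python stable-diffusion.py               # hangs at model initialisation
    # or
    faster-whisper --model medium input.wav

On an affected kernel, the "scheduler comp_1.1.1 is not ready" / ring
timeout messages below appear almost as soon as the model starts
initialising.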

Observed Error Messages (dmesg):

[drm] scheduler comp_1.1.1 is not ready, skipping
[drm:sched_job_timedout] ERROR ring comp_1.1.1 timeout
[messages repeat indefinitely while the system otherwise continues to function]

This is followed by a hard GPU reset (visible in logs, no visual
artifacts), which reliably leads to a full system lockup. Python or Docker
processes become unkillable, requiring a manual reboot. Over time, the
desktop slowly loses interactivity.
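
In case it helps triage, this is how I have been capturing the failure
window (standard tooling only; nothing here is specific to my setup):

    # Live view of the timeouts as they start
    sudo dmesg --follow --human | tee gpu-hang.log

    # Kernel log from the previous (locked-up) boot, after the forced reboot
    journalctl -k -b -1 --no-pager > previous-boot-kernel.log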
------------------------------

Bisect Details

I previously attempted a git bisect (limited to drivers/gpu/drm/amd) between
v6.12 and v6.15-rc1, which identified some further potentially problematic
commits, but an undersized /boot partition made it difficult to complete. In
the interim, a user on the gfx803 compatibility repo
<https://github.com/robertrosenbusch/gfx803_rocm/issues/35#issuecomment-2996884779>
identified the following while testing ROCm 5.7:

de84484c6f8b07ad0850d6c4  bad
bac38ca057fef2c8c024fe9e  bad

Cherry-picking reverts of both commits on top of v6.16-rc3 restores normal
behavior; leaving either patch in place reproduces the hang.
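
For completeness, the revert test was of this shape (a sketch, assuming both
reverts still apply cleanly on top of v6.16-rc3 and an existing .config is
in place):

    git checkout v6.16-rc3
    git revert --no-edit bac38ca057fe de84484c6f8b
    make olddefconfig
    make -j"$(nproc)"
    # ...install modules + kernel, reboot, then re-run the ROCm workload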
------------------------------

Relevant Log Excerpts

(Full dmesg logs can be attached separately if needed)

[drm] scheduler comp_1.1.1 is not ready, skipping
[ 97.602622] amdgpu 0000:08:00.0: amdgpu: ring comp_1.1.1 timeout, signaled seq=123456 emitted seq=123459
[ 97.602630] amdgpu 0000:08:00.0: amdgpu: GPU recover succeeded, reset domain time = 2ms

------------------------------
References:


   - It's back: Log spam: [drm] scheduler comp_1.0.2 is not ready, skipping ...
     (Arch Linux forums): https://bbs.archlinux.org/viewtopic.php?id=302729
   - Observations about HSA and KFD backends in TinyGrad (GitHub gist):
     https://gist.github.com/fxkamd/ffd02d66a2863e444ec208ea4f3adc48
   - AMD RX580 system freeze on maximum VRAM speed (Fedora discussion):
     https://discussion.fedoraproject.org/t/amd-rx580-system-freeze-on-maximum-vram-speed/136639
   - LKML: Linus Torvalds: Re: [git pull] drm fixes for 6.15-rc1:
     https://lkml.org/lkml/2025/4/5/394
   - Commits · torvalds/linux (link for commit de84484):
     https://github.com/torvalds/linux/commits?before=805ba04cb7ccfc7d72e834ebd796e043142156ba+6335
   - Commits · torvalds/linux (link for commit bac38ca):
     https://github.com/torvalds/linux/commits?before=5bc1018675ec28a8a60d83b378d8c3991faa5a27+7980
   - ROCm-For-RX580/README.md (GitHub):
     https://github.com/woodrex83/ROCm-For-RX580/blob/main/README.md
   - ROCm 4.6.0 for gfx803 (GitHub issue comment):
     https://github.com/robertrosenbusch/gfx803_rocm/issues/35#issuecomment-2996884779
   - Compatibility matrices — Use ROCm on Radeon GPUs (AMD):
     https://rocm.docs.amd.com/projects/radeon/en/latest/docs/compatibility.html


------------------------------

Why this matters

Although gfx803 is End-of-Life (EOL) for official ROCm support, large user
communities (Stable-Diffusion, Whisper, Tinygrad) still depend on it.
Community builds (e.g., github.com/robertrosenbusch/gfx803_rocm/)
demonstrate that ROCm 6.4+ and RX-580 are fully functional on a number of
relatively recent kernels. This regression significantly impacts the
usability of these cards for compute workloads.
------------------------------

Proposed Next Steps

I suggest the following for further investigation:

   - Review the interaction between the new KFD signal-event slow path and
     legacy GPUs that may lack valid event IDs.
   - Confirm whether the hqd_sdma_get_doorbell() logic (added in bac38ca)
     returns stale doorbells on gfx803, potentially causing false positives.
   - Consider back-outs for 6.15-stable / 6.16-rc while a proper fix is
     developed.

Please let me know if you require any further diagnostics or testing. I can
easily rebuild kernels and provide annotated traces.

Please find my working document:
https://chatgpt.com/share/6854bef2-c69c-8002-a243-a06c67a2c066

Thanks for your time!

Best regards, big love,

Johl Brown

johlbrown at gmail.com