Re: [REGRESSION] RX-580 (gfx803) GPU hangs since ~v6.14.1 – “scheduler comp_1.1.1 is not ready” / ROCm 5.7-6.4+ broken

Tue Jul 1 09:39:12 UTC 2025

Hi all, hoping I'm still on-side... Thank you for your consideration.

Linux archb 6.14.0-rt3-arch1-1-rt #1 SMP PREEMPT_RT Wed, 21 May 2025 13:21:26 +0000 x86_64 GNU/Linux

AMDGPU sequence

Time      Message
19:29:29  *GPU fault detected* (0x00020802) for process *kdeconnect-app (pid 2285)*;
          VM fault at page 2048, write from *TC0*.
19:29:29  Second fault (0x0000880c) for same process; VM fault at page 0,
          read from *TC6*.
19:29:39  *ring gfx timeout* (signaled seq 699, emitted seq 701) → “Starting
          gfx ring reset” → *Ring gfx reset failure*.
19:29:40  Self-tests: ring comp_1.0.1 test failed (-110) and ring comp_1.2.1
          test failed (-110).
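
For anyone skimming the sequence above, here is a rough sketch of how I read the numbers, in Python. It assumes that amdgpu names compute rings as comp_<me>.<pipe>.<queue> and that -110 is the Linux -ETIMEDOUT; the helper names (decode_ring, pending_fences, errno_name) are made up for illustration and do not come from the driver or any tool.

```python
import re

# Assumption: on Linux, errno 110 is ETIMEDOUT. Hard-coded so the sketch
# gives the same answer on non-Linux hosts, where the number differs.
LINUX_ERRNO = {110: "ETIMEDOUT"}

# Assumption: compute rings are named comp_<me>.<pipe>.<queue>.
RING_RE = re.compile(r"comp_(\d+)\.(\d+)\.(\d+)")


def decode_ring(name):
    """Split a compute ring name into (me, pipe, queue)."""
    m = RING_RE.fullmatch(name)
    if m is None:
        raise ValueError(f"not a compute ring name: {name!r}")
    return tuple(int(x) for x in m.groups())


def pending_fences(signaled, emitted):
    """Number of fences that were emitted but never signaled."""
    return emitted - signaled


def errno_name(code):
    """Map a negative kernel return code to a symbolic name, if known."""
    return LINUX_ERRNO.get(abs(code), f"errno {abs(code)}")


# Values taken from the 19:29:39 and 19:29:40 entries above.
print(decode_ring("comp_1.0.1"))
print(pending_fences(699, 701))
print(errno_name(-110))
```

Read this way, the gfx ring had two fences outstanding when it timed out, and both failing self-tests are timeouts on compute queues, not some other error class.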

On Thu, 26 Jun 2025 at 10:38, Johl Brown <johlbrown at gmail.com> wrote:

> Apologies, I believe it was attached to one of the above posts. Please
> find complete dmesg attached.
>
> I had previously attempted to debug this with GDB/Ghidra (see
> https://github.com/lamikr/rocm_sdk_builder/issues/173 ) while
> experiencing segfaults on previous kernels/ROCm, around Nov 3, 2024. I
> can't see any comment I made there about the kernel version, but I
> generally run the mainline kernel and update roughly weekly, so the kernel
> should have been current for that time period. (Currently: Linux archb
> 6.14.0-rt3-arch1-1-rt #1 SMP PREEMPT_RT Wed, 21 May 2025 13:21:26 +0000
> x86_64 GNU/Linux; I'm just testing rt due to easyeffects glitches.)
> e.g.:
>
> /opt/rocm_sdk_612/bin/hipcc hello_world.o -fPIE -o hello_world
> ./hello_world
>  System minor: 0
>  System major: 8
>  Agent name: AMD Radeon RX 580 Series
> Kernel input: GdkknVnqkc
> Expecting that kernel increases each character from input string by one
> make: *** [Makefile:18: test] Segmentation fault (core dumped)
>  System minor: 0
>  System major: 8
>  Agent name: AMD Radeon RX 580 Series
> Kernel input: GdkknVnqkc
> Expecting that kernel increases each character from input string by one
> Segmentation fault (core dumped)
>
>
> [New Thread 0x7fffecaea6c0 (LWP 2980691)]
> [New Thread 0x7fffe7fff6c0 (LWP 2980692)]
> [Thread 0x7fffe7fff6c0 (LWP 2980692) exited]
>  System minor: 0
>  System major: 8
>  Agent name: AMD Radeon RX 580 Series
> Kernel input: GdkknVnqkc
> Expecting that kernel increases each character from input string by one
>
> Thread 1 "hello_world" received signal SIGSEGV, Segmentation fault.
> 0x00007ffff7db0fbd in ?? ()
>    from /opt/rocm_sdk_612/lib64/libamdhip64.so.6
> (gdb) bt
> #0  0x00007ffff7db0fbd in ?? ()
>    from /opt/rocm_sdk_612/lib64/libamdhip64.so.6
> #1  0x00007ffff7c1497f in ?? ()
>    from /opt/rocm_sdk_612/lib64/libamdhip64.so.6
> #2  0x00007ffff7c14c74 in ?? ()
>    from /opt/rocm_sdk_612/lib64/libamdhip64.so.6
> #3  0x00007ffff7c14e3e in ?? ()
>    from /opt/rocm_sdk_612/lib64/libamdhip64.so.6
> #4  0x00005555555555bf in main (argc=<optimized out>,
>     argv=<optimized out>) at hello_world.cpp:69
> (gdb)
>
> Line 69 (nice) is res = hipMemcpy(inputBuffer, input, (strlength + 1) *
> sizeof(char), hipMemcpyHostToDevice); (see attached file jb_gdb_tester)
>
>
>
> https://github.com/robertrosenbusch/gfx803_rocm/issues/35
>
>
> One love!!
>
> On Thu, 26 Jun 2025 at 10:10, Felix Kuehling <felix.kuehling at amd.com>
> wrote:
>
>> I couldn't find a dmesg attached to the linked bug reports. I was going to
>> look for a kernel oops from calling an uninitialized function pointer. Your
>> patch addresses just that.
>>
>> I'm not sure how “drm/amdkfd: Improve signal event slow path” is
>> implicated. I don't see anything in that patch that would break
>> specifically on gfx v803.
>>
>> Regards,
>>   Felix
>>
>> On 2025-06-25 18:21, Alex Deucher wrote:
>> > Adding folks from the KFD team to take a look.  Thank you for
>> > bisecting.  Does the attached patch fix it?
>> >
>> > Thanks,
>> >
>> > Alex
>> >
>> > On Wed, Jun 25, 2025 at 12:33 AM Johl Brown <johlbrown at gmail.com>
>> wrote:
>> >> Good Afternoon and best wishes!
>> >> This is my first attempt at upstreaming an issue after dailying arch
>> for a full year now :)
>> >> Please forgive me, a lot of this is pushing my comfort zone, but
>> preventing needless e-waste is important to me personally :) with this in
>> mind, I will save your eyeballs and let you know I did use gpt to help
>> compile the below, but I have proofread it several times (which means you
>> can't be mad :p ).
>> >>
>> >>
>> >> https://github.com/ROCm/ROCm/issues/4965
>> >>
>> https://github.com/robertrosenbusch/gfx803_rocm/issues/35#issuecomment-2996884779
>> >>
>> >>
>> >> Hello Kernel, AMD GPU, & ROCm maintainers,
>> >>
>> >> TL;DR: My Polaris (RX-580, gfx803) freezes under compute load on a
>> >> number of kernels from v6.14 onward. This was not the case before
>> >> 6.15 for ROCm 6.4.0 on gfx803 cards.
>> >>
>> >> The issue has been successfully mitigated with an older version of
>> >> ROCm under kernel 6.16-rc2 by reverting two specific commits:
>> >>
>> >> de84484c6f8b (“drm/amdkfd: Improve signal event slow path”, 2024-12-19)
>> >>
>> >> bac38ca057fe (“drm/amdkfd: implement per queue sdma reset for gfx
>> 9.4+”, 2025-03-06)
>> >>
>> >> Reverting both commits on top of v6.16-rc3 restores full stability and
>> >> allows ROCm 5.7 workloads (e.g., Stable-Diffusion, faster-whisper) to run.
>> >> Instability is usually obvious immediately, e.g. models fail to
>> >> initialise with no errors reported other than in the host dmesg, rather
>> >> than the segfault that was the usual failure mode under previous kernels.
>> >>
>> >> ________________________________
>> >>
>> >> Problem Description
>> >>
>> >> A number of users report GPU hangs when initialising compute loads,
>> specifically with ROCm 5.7+ workloads. This issue appears to be a
>> regression, as it was not present in earlier kernel versions.
>> >>
>> >> System Information:
>> >>
>> >> OS: Arch Linux
>> >>
>> >> CPU: Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
>> >>
>> >> GPU: AMD Radeon RX 580 Series (gfx803)
>> >>
>> >> ROCm Version: Runtime Version: 1.1, Runtime Ext Version: 1.7 (as per
>> rocminfo --support)
>> >>
>> >> ________________________________
>> >>
>> >> Affected Kernels and Regression Details
>> >>
>> >> The problem consistently occurs on v6.14.1-rc1 and newer kernels.
>> >>
>> >> Last known good: v6.11
>> >>
>> >> First known bad: v6.12
>> >>
>> >> The regression has been bisected to the following two commits, as
>> reverting them resolves the issue:
>> >>
>> >> de84484c6f8b (“drm/amdkfd: Improve signal event slow path”, 2024-12-19)
>> >>
>> >> bac38ca057fe (“drm/amdkfd: implement per queue sdma reset …”,
>> 2025-03-06)
>> >>
>> >> Both patches touch amdkfd queue reset paths and are first included in
>> the exact releases where the regression appears.
>> >>
>> >> Here's a summary of kernel results:
>> >>
>> >> Kernel                              | Result | Note
>> >> ----------------------------------- | ------ | ----
>> >> 6.13.y (LTS)                        | OK     |
>> >> 6.14.0                              | OK     | Baseline - my last working kernel, though I am not exactly sure which subver
>> >> 6.14.1-rc1                          | BAD    | First hang
>> >> 6.15-rc1                            | BAD    | Hang
>> >> 6.15.8                              | BAD    | Hang
>> >> 6.16-rc3                            | BAD    | Hang
>> >> 6.16-rc3 – revert de84484 + bac38ca | OK     | Full stability restored, ROCm workloads run for hours.
>> >>
>> >> ________________________________
>> >>
>> >> Reproduction Steps
>> >>
>> >> 1. Boot the system with a kernel version exhibiting the issue (e.g.,
>> >>    v6.14.1-rc1 or newer without the reverts).
>> >>
>> >> 2. Run a ROCm workload that creates several compute queues, for example:
>> >>
>> >>      python stable-diffusion.py
>> >>      faster-whisper --model medium ...
>> >>
>> >> 3. Upon model initialization, an immediate driver crash occurs. This is
>> >>    visible on the host machine via dmesg logs.
>> >>
>> >> Observed Error Messages (dmesg):
>> >>
>> >> [drm] scheduler comp_1.1.1 is not ready, skipping
>> >> [drm:sched_job_timedout] ERROR ring comp_1.1.1 timeout
>> >> [message continues ad-infinitum while system functions generally]
>> >>
>> >> This is followed by a hard GPU reset (visible in logs, no visual
>> artifacts), which reliably leads to a full system lockup. Python or Docker
>> processes become unkillable, requiring a manual reboot. Over time, the
>> desktop slowly loses interactivity.
>> >>
>> >> ________________________________
>> >>
>> >> Bisect Details
>> >>
>> >> I previously attempted a git bisect (limited to drivers/gpu/drm/amd)
>> >> between v6.12 and v6.15-rc1, which identified some further potentially
>> >> problematic commits; however, I ran into difficulties due to an
>> >> undersized /boot partition. In the interim, it seems a user on the
>> >> gfx803 compatibility repo discovered the below regarding ROCm 5.7:
>> >>
>> >> de84484c6f8b07ad0850d6c4  bad
>> >> bac38ca057fef2c8c024fe9e  bad
>> >>
>> >> Cherry-picking reverts of both commits on top of v6.16-rc3 restores
>> normal behavior; leaving either patch in place reproduces the hang.
>> >>
>> >> ________________________________
>> >>
>> >> Relevant Log Excerpts
>> >>
>> >> (Full dmesg logs can be attached separately if needed)
>> >>
>> >> [drm] scheduler comp_1.1.1 is not ready, skipping
>> >> [ 97.602622] amdgpu 0000:08:00.0: amdgpu: ring comp_1.1.1 timeout,
>> signaled seq=123456 emitted seq=123459
>> >> [ 97.602630] amdgpu 0000:08:00.0: amdgpu: GPU recover succeeded, reset
>> domain time = 2ms
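
In case it helps anyone compare logs against the excerpt above, here is a minimal Python sketch that counts the two signatures per ring. The regular expressions are assumptions based only on the message wording quoted in this thread (other kernel versions may word these lines differently), and summarize() is an illustrative helper, not part of any existing tool.

```python
import re
from collections import Counter

# Patterns assumed from the excerpt in this thread only.
NOT_READY_RE = re.compile(r"scheduler (\S+) is not ready")
TIMEOUT_RE = re.compile(
    r"ring (\S+) timeout, signaled seq=(\d+) emitted seq=(\d+)"
)


def summarize(log_text):
    """Return (not_ready_counts, timeouts) found in a dmesg dump.

    not_ready_counts: Counter mapping ring name -> number of
                      "is not ready" messages.
    timeouts:         list of (ring, signaled_seq, emitted_seq) tuples.
    """
    not_ready = Counter()
    timeouts = []
    for line in log_text.splitlines():
        m = NOT_READY_RE.search(line)
        if m:
            not_ready[m.group(1)] += 1
            continue
        m = TIMEOUT_RE.search(line)
        if m:
            timeouts.append((m.group(1), int(m.group(2)), int(m.group(3))))
    return not_ready, timeouts


# The excerpt from this mail, with the wrapped lines joined back together.
SAMPLE = """\
[drm] scheduler comp_1.1.1 is not ready, skipping
[ 97.602622] amdgpu 0000:08:00.0: amdgpu: ring comp_1.1.1 timeout, signaled seq=123456 emitted seq=123459
[ 97.602630] amdgpu 0000:08:00.0: amdgpu: GPU recover succeeded, reset domain time = 2ms
"""

not_ready, timeouts = summarize(SAMPLE)
for ring, count in not_ready.items():
    print(f"{ring}: {count} 'not ready' message(s)")
for ring, signaled, emitted in timeouts:
    print(f"{ring}: timeout with {emitted - signaled} fence(s) outstanding")
```

To run it on a real log, replace SAMPLE with the contents of a saved dmesg file; the mail client's line wrapping has to be undone first, as the timeout pattern expects each message on one line.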
>> >>
>> >> ________________________________
>> >> References:
>> >>
>> >> It's back: Log spam: [drm] scheduler comp_1.0.2 is not ready, skipping
>> ... (https://bbs.archlinux.org/viewtopic.php?id=302729)
>> >>
>> >> Observations about HSA and KFD backends in TinyGrad · GitHub (
>> https://gist.github.com/fxkamd/ffd02d66a2863e444ec208ea4f3adc48)
>> >>
>> >> AMD RX580 system freeze on maximum VRAM speed (
>> https://discussion.fedoraproject.org/t/amd-rx580-system-freeze-on-maximum-vram-speed/136639
>> )
>> >>
>> >> LKML: Linus Torvalds: Re: [git pull] drm fixes for 6.15-rc1 (
>> https://lkml.org/lkml/2025/4/5/394)
>> >>
>> >> Commits · torvalds/linux - GitHub (Link for commit de84484) (
>> https://github.com/torvalds/linux/commits?before=805ba04cb7ccfc7d72e834ebd796e043142156ba+6335
>> )
>> >>
>> >> Commits · torvalds/linux - GitHub (Link for commit bac38ca) (
>> https://github.com/torvalds/linux/commits?before=5bc1018675ec28a8a60d83b378d8c3991faa5a27+7980
>> )
>> >>
>> >> ROCm-For-RX580/README.md at main - GitHub (
>> https://github.com/woodrex83/ROCm-For-RX580/blob/main/README.md)
>> >>
>> >> ROCm 4.6.0 for gfx803 - GitHub (
>> https://github.com/robertrosenbusch/gfx803_rocm/issues/35#issuecomment-2996884779
>> )
>> >>
>> >> Compatibility matrices — Use ROCm on Radeon GPUs - AMD (
>> https://rocm.docs.amd.com/projects/radeon/en/latest/docs/compatibility.html
>> )
>> >>
>> >>
>> >> ________________________________
>> >>
>> >> Why this matters
>> >>
>> >> Although gfx803 is End-of-Life (EOL) for official ROCm support, large
>> user communities (Stable-Diffusion, Whisper, Tinygrad) still depend on it.
>> Community builds (e.g., github.com/robertrosenbusch/gfx803_rocm/)
>> demonstrate that ROCm 6.4+ and RX-580 are fully functional on a number of
>> relatively recent kernels. This regression significantly impacts the
>> usability of these cards for compute workloads.
>> >>
>> >> ________________________________
>> >>
>> >> Proposed Next Steps
>> >>
>> >> I suggest the following for further investigation:
>> >>
>> >> Review the interaction between the new KFD signal-event slow-path and
>> legacy GPUs that may lack valid event IDs.
>> >>
>> >> Confirm whether hqd_sdma_get_doorbell() logic (added in bac38ca)
>> returns stale doorbells on gfx803, potentially causing false positives.
>> >>
>> >> Consider back-outs for 6.15-stable / 6.16-rc while a proper fix is
>> developed.
>> >>
>> >> Please let me know if you require any further diagnostics or testing.
>> I can easily rebuild kernels and provide annotated traces.
>> >>
>> >> Please find my working document:
>> https://chatgpt.com/share/6854bef2-c69c-8002-a243-a06c67a2c066
>> >>
>> >> Thanks for your time!
>> >>
>> >> Best regards, big love,
>> >>
>> >> Johl Brown
>> >>
>> >> johlbrown at gmail.com
>>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: last_boot_errors.log
Type: text/x-log
Size: 67858 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20250701/88583a58/attachment-0001.bin>