drm/amdgpu: AMDGPU unusable since 6.12.1 and it looks like no one cares.
Pavel Nikulin
pavel at noa-labs.com
Sun Jan 19 13:53:22 UTC 2025
On Fri, Jan 17, 2025 at 6:08 PM Alex Deucher <alexdeucher at gmail.com> wrote:
>
> On Fri, Jan 17, 2025 at 7:27 AM Pavel Nikulin <pavel at noa-labs.com> wrote:
> >
> > I think it persists as of 6.12.9 and today's firmware version from git.
> >
> > Hardware Asus um560.6
> >
> > It only happens when the AC adaptor is disconnected, and the screen
> > refresh frequency is set to 120hz. It does not happen on any other
> > refresh frequency, or when the charger is connected.
> >
> > It might be happening in Windows, but at much lower rate, like once in
> > a month. The windows version might be applying some mitigations.
> >
> > Trying to catch what may be a prelude to hang never worked. It's just
> > instahang, without panic, or anything. I cannot debug it without
> > JTAGing the CPU, for which I have no equipment, nor am I sure if there
> > are even JTAG headers exposed on the laptop motherboard.
>
> Please file a bug report and attach your dmesg output.
> https://gitlab.freedesktop.org/drm/amd/-/issues
>
> Alex
Unfortunately, my dmesg would look the same as anyone else's.
However, I have made the following observations:
Disabling PSR with the debug mask makes it stable.
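For anyone wanting to confirm the PSR state on their own machine, here
is a minimal sketch that checks a kernel command line for the mask. It
assumes the mask is passed as amdgpu.dcdebugmask and that bit 0x10 is
the PSR-disable bit; both are assumptions on my part, and the function
names are made up for illustration.

```python
import re

# Assumption: bit 0x10 of amdgpu.dcdebugmask is the PSR-disable bit.
PSR_DISABLE_BIT = 0x10


def parse_dcdebugmask(cmdline):
    """Return the amdgpu.dcdebugmask value from a kernel command line, or None."""
    match = re.search(r"(?:^|\s)amdgpu\.dcdebugmask=(\S+)", cmdline)
    if match is None:
        return None
    # Base 0 accepts both hexadecimal (0x10) and decimal (16) spellings.
    return int(match.group(1), 0)


def psr_disabled(cmdline):
    """True if the command line carries a debug mask with the PSR bit set."""
    mask = parse_dcdebugmask(cmdline)
    return mask is not None and bool(mask & PSR_DISABLE_BIT)
```

Feeding it the contents of /proc/cmdline tells whether the running
kernel was booted with that bit set.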
If I set the refresh frequency to 60Hz, the LPDDR memory clocks wiggle
around 600MHz, and keep going back and forth (spread spectrum
working.)
If I switch to any other frequency, they stay stably at 937MHz (spread
spectrum stops working,) and hangs happen.
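The memory clock behaviour can be logged with something like the sketch
below. It assumes the amdgpu sysfs file pp_dpm_mclk, which lists one
clock level per line and marks the active one with a trailing "*". The
card0 path and the function names are assumptions, not something taken
from my setup notes.

```python
import re

# Assumed path; the card index may differ on other machines.
MCLK_PATH = "/sys/class/drm/card0/device/pp_dpm_mclk"


def active_mclk_mhz(text):
    """Return the active memory clock in MHz from pp_dpm_mclk text, or None."""
    for line in text.splitlines():
        # Lines look like "2: 937Mhz *"; the trailing '*' marks the active level.
        match = re.match(r"\s*\d+:\s*(\d+)\s*MHz\s*\*", line, re.IGNORECASE)
        if match:
            return int(match.group(1))
    return None


def sample_mclk(path=MCLK_PATH):
    """Read the sysfs file once and return the active clock in MHz."""
    with open(path) as f:
        return active_mclk_mhz(f.read())
```

Calling sample_mclk() in a loop while switching refresh rates should
show whether the clock moves around or stays pinned.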
If I disconnect antennas from the MT7925 WiFi module, the issues are
gone (as well as the wifi connectivity.)
If I rfkill the MT7925, both wifi and bluetooth, it may still hang.
If I nevertheless try to connect by putting the open laptop right next
to the access point, the laptop will hang.
But if I only do the same with a 2.4GHz bluetooth mouse, it will
continue to work. If I connect to 2.4GHz wifi, it will still hang
after a few minutes.
If I use the RTL8156BG-based type-C USB dongle and disconnect the
power, it is stable. If I keep the connection going on the type-C
dongle, but switch on wifi and set it as the default route, everything
stays stable, regardless of whether I connect to 5GHz or 2.4GHz wifi.
Putting grounding tape around the DP cables and around the wifi
module did not do anything conclusive.
If I manually set the GPU performance to high, the hang rate
improves marginally.
DP 2.0 and 2.1 work at 600MHz, 1.4 at 300MHz, and 1.2 at 150MHz,
depending on link speed, which I can't measure.
So, here is what I think may have happened during the transition from
6.11 to 6.12:
- Something PCIE related (ASPM, other PCIE frequency/power settings)
- Something PSR related (PSR raises memory clock rate, disables spread spectrum)
- Something power related (undervoltage happens when type-C port, or
power is not plugged in)
- Something RF related (made less likely by the fact that it keeps
working with the type-C ethernet dongle plugged in, but not active)
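To compare the PCIE side between 6.11 and 6.12, one thing that can be
recorded on both kernels is the active ASPM policy. A minimal sketch,
assuming the kernel exposes /sys/module/pcie_aspm/parameters/policy
with the active policy shown in square brackets; the function names
are hypothetical.

```python
import re

# Assumed location of the ASPM policy knob.
ASPM_POLICY_PATH = "/sys/module/pcie_aspm/parameters/policy"


def active_aspm_policy(text):
    """Return the bracketed (active) policy name, or None if none is marked."""
    match = re.search(r"\[(\w+)\]", text)
    return match.group(1) if match else None


def read_aspm_policy(path=ASPM_POLICY_PATH):
    """Read the policy file once and return the active policy name."""
    with open(path) as f:
        return active_aspm_policy(f.read())
```

If the value differs between the two kernels, that would point towards
the PCIE explanation rather than the PSR one.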
My guess is that it's an interplay between the PCIE and PSR settings.
Less likely, a hardware problem.
I do remember that someone with a similar bug bisected the breakage to
a PCIE related commit.
Do you want me to still put all of the above into a bug ticket on gitlab?