Regression in amdgpu driver / kernel v5.8.6
Jean Delvare
jdelvare at suse.de
Tue Sep 29 06:14:04 UTC 2020
Hi Alex,
On Mon, 28 Sep 2020 17:44:28 -0400, Alex Deucher wrote:
> On Mon, Sep 28, 2020 at 9:23 AM Jean Delvare <jdelvare at suse.de> wrote:
> > I have recently experienced a regression in stable kernel series 5.8.
> > The problem is that my display will randomly turn to black after just a
> > few seconds of inactivity. I have to switch to text console then back
> > to X to recover.
> >
> > I bisected it down to:
> >
> > commit b86657e328b601a5b98f8c4c21d108d356dbceee
> > Author: Navid Emamdoost <navid.emamdoost at gmail.com>
> > Date: Sun Jun 14 02:09:44 2020 -0500
> >
> > drm/amdgpu: fix ref count leak in amdgpu_display_crtc_set_config
> >
> > [ Upstream commit e008fa6fb41544b63973a529b704ef342f47cc65 ]
> >
> > Reverting this commit on top of 5.8.10 makes the problem go away. My
> > hardware setup:
> >
> > 01:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Baffin [Radeon RX 550 640SP / RX 560/560X] [1002:67ff] (rev ff)
> > 01:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Baffin HDMI/DP Audio [Radeon RX 550 640SP / RX 560/560X] [1002:aae0]
> >
> > This is a Sapphire Pulse Radeon RX550 card, with two Lenovo P27h-10
> > displays connected (one over DP, one over HDMI). I'm using option dc=0
> > otherwise the multi-screen setup doesn't work.
> >
> > Looking at the patch, and at the logic of function
> > amdgpu_display_crtc_set_config() in general, I suspect that the middle
> > chunk of the patch is incorrect. Calling pm_runtime_put_autosuspend()
> > if pm_runtime_get_sync() failed is, albeit surprising, correct due to
> > how that function is implemented. Calling it right after
> > "adev->have_disp_power_ref = true" however looks wrong. The comment
> > right before that line pretty much implies that we *want* to keep the
> > reference.
> >
> > So I think we want to apply a partial revert like the following:
>
> Nice analysis. I've applied the patch.
Unfortunately, further testing shows that this partial revert is
insufficient to fully fix the problem. Only a complete revert works.
This worries me a little though, because my analysis seemed correct. If
the first chunk needs to be reverted too, it means that this code path
is taken on a regular basis in my case. I would expect
pm_runtime_get_sync() failures to be an error path that should normally
not be taken?
At this point I am wondering if maybe we had 2 opposite reference
counting bugs covering for each other, and fixing one revealed the
other. Or maybe the prime issue is that pm_runtime_get_sync() shouldn't
fail in the first place, and that's what we need to fix.
I'm not familiar enough with the amdgpu driver to efficiently continue
debugging this issue on my own, so I'll do what I can, but some
guidance would be welcome.
--
Jean Delvare
SUSE L3 Support
More information about the amd-gfx
mailing list