Expecting to revert commit 55285e21f045 "fbdev/efifb: Release PCI device ..."
Alex Deucher
alexdeucher at gmail.com
Tue Dec 21 21:18:33 UTC 2021
On Tue, Dec 21, 2021 at 1:47 PM Deucher, Alexander
<Alexander.Deucher at amd.com> wrote:
>
> [Public]
>
> > -----Original Message-----
> > From: Deucher, Alexander
> > Sent: Tuesday, December 21, 2021 12:01 PM
> > To: Linus Torvalds <torvalds at linux-foundation.org>; Imre Deak
> > <imre.deak at intel.com>; amd-gfx at lists.freedesktop.org
> > Cc: Daniel Vetter <daniel.vetter at ffwll.ch>; Kai-Heng Feng
> > <kai.heng.feng at canonical.com>
> > Subject: RE: Expecting to revert commit 55285e21f045 "fbdev/efifb: Release
> > PCI device ..."
> >
> > [Public]
> >
> > > -----Original Message-----
> > > From: Linus Torvalds <torvalds at linux-foundation.org>
> > > Sent: Monday, December 20, 2021 5:05 PM
> > > To: Imre Deak <imre.deak at intel.com>
> > > Cc: Daniel Vetter <daniel.vetter at ffwll.ch>; Deucher, Alexander
> > > <Alexander.Deucher at amd.com>; Kai-Heng Feng
> > > <kai.heng.feng at canonical.com>
> > > Subject: Re: Expecting to revert commit 55285e21f045 "fbdev/efifb:
> > > Release PCI device ..."
> > >
> > > On Mon, Dec 20, 2021 at 1:33 PM Imre Deak <imre.deak at intel.com>
> > wrote:
> > > >
> > > > amdgpu.runpm=0
> > >
> > > Hmmm.
> > >
> > > This does seem to "work", but not very well.
> > >
> > > With this, what seems to happen is odd: I lock the screen, wait, it
> > > goes "No signal, shutting down", but then doesn't actually shut down
> > > but stays black (with the backlight on). After _another_ five seconds
> > > or so, the monitor goes "No signal, shutting down" _again_, and at that
> > point it actually does it.
> > >
> > > So it solves my immediate problem - in that yes, the backlight finally
> > > does turn off in the end - but it does seem to be still broken.
> > >
> > > I'm very surprised if no AMD drm developers can see this exact same thing.
> > > This is a very simple setup. The only possibly slightly less common
> > > thing is that I have two monitors, but while that is not necessarily
> > > the _most_ common setup in an absolute sense, I'd expect it to be very
> > > common among DRM developers..
> > >
> > > I guess I can just change the revert to just a
> > >
> > > -int amdgpu_runtime_pm = -1;
> > > +int amdgpu_runtime_pm = 0;
> > >
> > > instead. The auto-detect is apparently broken. Maybe it should only
> > > kick in for LVDS screens on actual laptops?
> > >
> > > Note: on my machine, I get that
> > >
> > > amdgpu 0000:49:00.0: amdgpu: Using BACO for runtime pm
> > >
> > > so maybe the other possible runtime pm models (ARPX and BOCO) are ok,
> > > and it's only that BACO case that is broken.
> > >
> > > I have no idea what any of those three things are - I'm just looking
> > > at the uses of that amdgpu_runtime_pm variable.
> > >
> > > amdgpu people: if you don't want that amdgpu_runtime_pm turned off by
> > > default, tell me something else to try.
> >
> > For a little background, runtime PM support was added about 10 years ago
> > originally to support laptops with multiple GPUs (integrated and discrete).
> > It's not specific to the display hardware. When the GPU is idle, it can be
> > powered down completely. In the case of these laptops, it's D3 cold
> > (managed by ACPI, we call this BOCO in AMD parlance - Bus Off, Chip Off)
> > which powers off the dGPU completely (i.e., it disappears from the bus). A
> > few years ago we extended this to support desktop dGPUs as well which
> > support their own version of runtime D3 (called BACO in AMD parlance - Bus
> > Active, Chip Off). The driver can put the chip into a low power state where
> > everything except the bus interface is powered down (to avoid the device
> > disappearing from the bus). So this has worked for almost 2 years now on
> > BACO capable parts and for a decade or more on BOCO systems.
> > Unfortunately, changing the default runpm parameter setting would cause a
> > flood of bug reports about runtime power management breaking and
> > systems suddenly using more power.
> >
> > Imre's commit (55285e21f045) fixes another commit (a6c0fd3d5a8b).
> > Runtime pm was working on amdgpu prior to that commit. Is it possible
> > there is still some race between when amdgpu takes over from efifb? Does
> > it work properly when all pm_runtime calls in efifb are removed or if efifb is
> > not enabled? Runtime pm for Polaris boards has been enabled by default
> > since 4fdda2e66de0b which predates both of those patches.
>
> Thinking about this more, I wonder if there was some change in some userspace component which was hidden by the changes in 55285e21f045 and a6c0fd3d5a8b. E.g., some desktop component started polling for display changes or GPU temperature or something like that, and when a6c0fd3d5a8b was in place the GPU never entered runtime suspend. Then when 55285e21f045 was applied, it unmasked the new behavior in the userspace component.
>
> What should happen is that when all of the displays blank, assuming the GPU is otherwise idle, the GPU will runtime suspend after a few seconds. When you move the mouse or hit the keyboard, that should trigger the GPU to runtime resume, and then the displays will be re-enabled.
>
> In the behavior you are seeing, when the displays come back on after they blank, are you seeing the device resume from runtime suspend? On resume from suspend (runtime or system) we issue a hotplug notification to userspace in case any displays changed during suspend while the GPU was powered down (and hence could not detect a hotplug event). Perhaps that hotplug event is triggering userspace to reprobe and re-enable the displays whenever some other event causes the device to runtime resume. Does something like this help by any chance?
> https://bugzilla.kernel.org/attachment.cgi?id=300103&action=diff&collapsed=&headers=1&format=raw
I'm not seeing this on my system, and another user tried the patch and
it fixed his system, so it indeed seems to be something along the
lines of what I described above. Something in userspace seems to be
accessing the GPU causing it to runtime resume. On resume the driver
then sends a hotplug event to userspace (in case anything display
related changed while the GPU was suspended). The desktop manager
then sees the hotplug event and reprobes and lights up the displays
again. @Wentland, Harry, I guess the display code may already handle
this (checking on resume whether anything changed while suspended), so
we can probably drop the call from the generic amdgpu resume code? Or
if not, it should be pretty straightforward to fix that in
dm_suspend()/dm_resume().
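For anyone trying to confirm whether the GPU is actually entering and leaving runtime suspend around the display blanking, the standard runtime PM sysfs attributes are enough to watch it from userspace. A minimal sketch (assuming the dGPU is card0; substitute the right card or PCI address for your system):

```shell
#!/bin/sh
# Assumes the dGPU is card0; check /sys/class/drm/ for your setup.
DEV=/sys/class/drm/card0/device

# Current runtime PM state: "active" or "suspended".
cat "$DEV/power/runtime_status"

# Cumulative time (in ms) spent active vs. runtime-suspended since boot.
cat "$DEV/power/runtime_active_time"
cat "$DEV/power/runtime_suspended_time"

# Poll once a second: you should see "suspended" a few seconds after the
# displays blank, and a bounce back to "active" if something in userspace
# wakes the GPU (which would then trigger the hotplug/reprobe described above).
while sleep 1; do
    echo "$(date +%T) $(cat "$DEV/power/runtime_status")"
done
```

If `runtime_status` never reaches "suspended" with the displays blanked, something is holding a runtime PM reference; if it suspends and then immediately resumes, that points at a userspace wakeup like the one described above.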
Alex