Expecting to revert commit 55285e21f045 "fbdev/efifb: Release PCI device ..."

Alex Deucher alexdeucher at gmail.com
Tue Dec 21 22:18:56 UTC 2021


On Tue, Dec 21, 2021 at 5:09 PM Harry Wentland <harry.wentland at amd.com> wrote:
>
>
>
> On 2021-12-21 16:18, Alex Deucher wrote:
> > On Tue, Dec 21, 2021 at 1:47 PM Deucher, Alexander
> > <Alexander.Deucher at amd.com> wrote:
> >>
> >> [Public]
> >>
> >>> -----Original Message-----
> >>> From: Deucher, Alexander
> >>> Sent: Tuesday, December 21, 2021 12:01 PM
> >>> To: Linus Torvalds <torvalds at linux-foundation.org>; Imre Deak
> >>> <imre.deak at intel.com>; amd-gfx at lists.freedesktop.org
> >>> Cc: Daniel Vetter <daniel.vetter at ffwll.ch>; Kai-Heng Feng
> >>> <kai.heng.feng at canonical.com>
> >>> Subject: RE: Expecting to revert commit 55285e21f045 "fbdev/efifb: Release
> >>> PCI device ..."
> >>>
> >>> [Public]
> >>>
> >>>> -----Original Message-----
> >>>> From: Linus Torvalds <torvalds at linux-foundation.org>
> >>>> Sent: Monday, December 20, 2021 5:05 PM
> >>>> To: Imre Deak <imre.deak at intel.com>
> >>>> Cc: Daniel Vetter <daniel.vetter at ffwll.ch>; Deucher, Alexander
> >>>> <Alexander.Deucher at amd.com>; Kai-Heng Feng
> >>>> <kai.heng.feng at canonical.com>
> >>>> Subject: Re: Expecting to revert commit 55285e21f045 "fbdev/efifb:
> >>>> Release PCI device ..."
> >>>>
> >>>> On Mon, Dec 20, 2021 at 1:33 PM Imre Deak <imre.deak at intel.com>
> >>> wrote:
> >>>>>
> >>>>> amdgpu.runpm=0
> >>>>
> >>>> Hmmm.
> >>>>
> >>>> This does seem to "work", but not very well.
> >>>>
> >>>> With this, what seems to happen is odd: I lock the screen, wait, it
> >>>> goes "No signal, shutting down", but then doesn't actually shut down
> >>>> but stays black (with the backlight on). After _another_ five seconds
> >>>> or so, the monitor goes "No signal, shutting down" _again_, and at that
> >>> point it actually does it.
> >>>>
> >>>> So it solves my immediate problem - in that yes, the backlight finally
> >>>> does turn off in the end - but it does seem to be still broken.
> >>>>
> >>>> I'm very surprised if no AMD drm developers can see this exact same thing.
> >>>> This is a very simple setup. The only possibly slightly less common
> >>>> thing is that I have two monitors, but while that is not necessarily
> >>>> the _most_ common setup in an absolute sense, I'd expect it to be very
> >>>> common among DRM developers..
> >>>>
> >>>> I guess I can just change the revert to just a
> >>>>
> >>>>     -int amdgpu_runtime_pm = -1;
> >>>>     +int amdgpu_runtime_pm = 0;
> >>>>
> >>>> instead. The auto-detect is apparently broken. Maybe it should only
> >>>> kick in for LVDS screens on actual laptops?
> >>>>
> >>>> Note: on my machine, I get that
> >>>>
> >>>>    amdgpu 0000:49:00.0: amdgpu: Using BACO for runtime pm
> >>>>
> >>>> so maybe the other possible runtime pm models (ARPX and BOCO) are ok,
> >>>> and it's only that BACO case that is broken.
> >>>>
> >>>> I have no idea what any of those three things are - I'm just looking
> >>>> at the uses of that amdgpu_runtime_pm variable.
> >>>>
> >>>> amdgpu people: if you don't want that amdgpu_runtime_pm turned off by
> >>>> default, tell me something else to try.
> >>>
> >>> For a little background, runtime PM support was added about 10 years ago
> >>> originally to support laptops with multiple GPUs (integrated and discrete).
> >>> It's not specific to the display hardware.  When the GPU is idle, it can be
> >>> powered down completely.  In the case of these laptops, it's D3 cold
> >>> (managed by ACPI, we call this BOCO in AMD parlance - Bus Off, Chip Off)
> >>> which powers off the dGPU completely (i.e., it disappears from the bus).  A
> >>> few years ago we extended this to support desktop dGPUs as well which
> >>> support their own version of runtime D3 (called BACO in AMD parlance - Bus
> >>> Active, Chip Off).  The driver can put the chip into a low power state where
> >>> everything except the bus interface is powered down (to avoid the device
> >>> disappearing from the bus).  So this has worked for almost 2 years now on
> >>> BACO capable parts and for a decade or more on BOCO systems.
> >>> Unfortunately, changing the default runpm parameter setting would cause a
> >>> flood of bug reports about runtime power management breaking and
> >>> systems suddenly using more power.
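> >>>
> >>> For reference, the generic shape of how a PCI GPU driver opts into this
> >>> kind of runtime PM looks roughly like the following (just a sketch with
> >>> illustrative names and delay value, not the exact amdgpu code):
> >>>
> >>>     #include <linux/pm_runtime.h>
> >>>
> >>>     static void example_enable_runtime_pm(struct device *dev)
> >>>     {
> >>>             /* Suspend automatically once the device has been idle for a
> >>>              * while; the delay value here is made up for illustration. */
> >>>             pm_runtime_set_autosuspend_delay(dev, 5000);
> >>>             pm_runtime_use_autosuspend(dev);
> >>>
> >>>             /* Allow the PM core to power the device down when idle. */
> >>>             pm_runtime_allow(dev);
> >>>
> >>>             /* Drop the initial reference so the autosuspend timer can run. */
> >>>             pm_runtime_mark_last_busy(dev);
> >>>             pm_runtime_put_autosuspend(dev);
> >>>     }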
> >>>
> >>> Imre's commit (55285e21f045) fixes another commit (a6c0fd3d5a8b).
> >>> Runtime pm was working on amdgpu prior to that commit.  Is it possible
> >>> there is still some race between when amdgpu takes over from efifb?  Does
> >>> it work properly when all pm_runtime calls in efifb are removed or if efifb is
> >>> not enabled?  Runtime pm for Polaris boards has been enabled by default
> >>> since 4fdda2e66de0b which predates both of those patches.
> >>
> >> Thinking about this more, I wonder if there was some change in some userspace component which was hidden by the changes in 55285e21f045 and a6c0fd3d5a8b.  E.g., some desktop component started polling for display changes or GPU temperature or something like that, and while a6c0fd3d5a8b was in place the GPU never entered runtime suspend.  Then when 55285e21f045 was applied, it unmasked the new behavior in the userspace component.
> >>
> >> What should happen is that when all of the displays blank, assuming the GPU is otherwise idle, the GPU will runtime suspend after a delay.  When you move the mouse or hit the keyboard, that should trigger the GPU to runtime resume, and then the displays will be re-enabled.
> >>
> >> In the behavior you are seeing, when the displays come back on after they blank, are you seeing the device resume from runtime suspend?  On resume from suspend (runtime or system) we issue a hotplug notification to userspace in case any displays changed while the GPU was powered down (and hence could not detect a hotplug event).  Perhaps some other event is causing the device to runtime resume, and the hotplug notification we send on resume is then triggering userspace to reprobe and re-enable the displays shortly after.  Does something like this help by any chance?
> >> https://bugzilla.kernel.org/attachment.cgi?id=300103&action=diff&collapsed=&headers=1&format=raw
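> >>
> >> To illustrate the pattern described above, a DRM driver's resume path
> >> typically ends with something like this (simplified sketch, illustrative
> >> function name, not the actual amdgpu code):
> >>
> >>     #include <drm/drm_drv.h>
> >>     #include <drm/drm_probe_helper.h>
> >>
> >>     static int example_runtime_resume(struct drm_device *dev)
> >>     {
> >>             /* ... power the GPU back up and restore state ... */
> >>
> >>             /*
> >>              * Displays may have been (un)plugged while the GPU was off and
> >>              * unable to see HPD interrupts, so tell userspace to reprobe.
> >>              * Calling this unconditionally makes the desktop reprobe and
> >>              * light up the displays even when nothing actually changed.
> >>              */
> >>             drm_kms_helper_hotplug_event(dev);
> >>
> >>             return 0;
> >>     }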
> >
> > I'm not seeing this on my system, and another user tried the patch and
> > it fixed his system, so it indeed seems to be something along the
> > lines of what I described above.  Something in userspace seems to be
> > accessing the GPU causing it to runtime resume.  On resume the driver
> > then sends a hotplug event to userspace (in case anything display
> > related changed while the GPU was suspended).  The desktop manager
> > then sees the hotplug event and reprobes and lights up the displays
> > again.  @Wentland, Harry, I guess the display code may already handle
> > this (checking on resume whether anything has changed since suspend), so we
> > can probably drop the call from the generic amdgpu resume code?  Or if
> > not, it should be pretty straightforward to fix that in
> > dm_suspend()/dm_resume().
> >
>
> We handle re-detection in dm_resume but only seem to call the
> drm_kms_helper_hotplug_event for MST. We might want to call it
> in dm_resume but that would just move it from amdgpu_device_resume
> and will probably cause the same issue.
>
> What I think we'd really want here is to make sure
> drm_kms_helper_hotplug_event is only called when any display
> config actually changes. Unfortunately, we're not being very
> smart in our detection and just recreate everything (even
> though dc_link_detect_helper seems to have code that wants to
> be smart enough to detect whether the sink has changed or not).
> This means this change is non-trivial. At least I can't see
> an easy fix that doesn't run the risk of breaking a bunch of
> stuff.

I think something like this patch may do the trick:
https://bugzilla.kernel.org/attachment.cgi?id=300109&action=diff&collapsed=&headers=1&format=raw
We can use the helper to check if anything actually changed.
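
Roughly, the idea is something along these lines (just a sketch, assuming the
helper in question behaves like drm_helper_hpd_irq_event(), which re-detects
the connectors and only sends the hotplug event when a connector's status
actually changed):

    #include <drm/drm_print.h>
    #include <drm/drm_probe_helper.h>

    /* Illustrative only -- not the actual patch linked above. */
    static void example_resume_notify(struct drm_device *dev)
    {
            /*
             * Re-detect all pollable connectors; the helper calls
             * drm_kms_helper_hotplug_event() only if one of them changed
             * state while the GPU was powered down, so an idle desktop no
             * longer gets a spurious hotplug on every runtime resume.
             */
            if (drm_helper_hpd_irq_event(dev))
                    drm_dbg_kms(dev, "display configuration changed during suspend\n");
    }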

Alex

>
> Or maybe someone can try to see if (non-MST) detection during
> RPM still works with your patch. If it does (for some reason)
> then we should be good.
>
> I can't seem to repro the problem on my Navi 1x card with KDE Plasma.
> I wonder if it's a Polaris issue.
>
> I also don't see my Navi 1x card going into RPM after
> 'xset dpms force off' with the Manjaro 5.15 kernel. Might need
> to try the latest mainline or amd-stg.
>
> Harry
>
>
> > Alex

