[PATCH] drm/xe/d3cold: Set power state to D3Cold during s2idle/s3

Tue Apr 1 19:35:49 UTC 2025

On Tue, Apr 1, 2025 at 7:53 PM Raag Jadav <raag.jadav at intel.com> wrote:
>
> On Mon, Mar 31, 2025 at 10:18:16PM +0200, Wysocki, Rafael J wrote:
> > On 3/31/2025 6:15 PM, Rodrigo Vivi wrote:
> > > On Sat, Mar 29, 2025 at 07:20:37AM +0200, Raag Jadav wrote:
> > > > On Fri, Mar 28, 2025 at 12:02:17PM -0400, Rodrigo Vivi wrote:
> > > > > On Thu, Mar 27, 2025 at 01:14:09PM -0400, Rodrigo Vivi wrote:
> > > > > > On Thu, Mar 27, 2025 at 10:02:29PM +0530, Nilawar, Badal wrote:
> > > > > > > On 27-03-2025 21:49, Badal Nilawar wrote:
> > > > > > > Hi Rodrigo,
> > > > > > >
> > > > > > > > According to pci core guidelines, pci_save_config is recommended when the
> > > > > > > > driver explicitly needs to set the pci power state. As of now xe kmd is
> > > > > > > > only doing pci_save_config while entering to s2idle/s3 state, which makes
> > > > > > > > pci core think that device driver has already applied required pci power
> > > > > > > > state. This leads to GPU remain in D0 state. To fix the issue setting
> > > > > > > > the pci power state to D3Cold.
> > > > > > > >
> > > > > > > > Fixes:dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
> > > > > > > > Cc: Rafael J. Wysocki <rafael.j.wysocki at intel.com>
> > > > > > > > Cc: Rodrigo Vivi <rodrigo.vivi at intel.com>
> > > > > > > > Signed-off-by: Badal Nilawar <badal.nilawar at intel.com>
> > > > > > > > Signed-off-by: Anshuman Gupta <anshuman.gupta at intel.com>
> > > > > > > > ---
> > > > > > > >    drivers/gpu/drm/xe/xe_pci.c | 1 +
> > > > > > > >    1 file changed, 1 insertion(+)
> > > > > > > >
> > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_pci.c b/drivers/gpu/drm/xe/xe_pci.c
> > > > > > > > index 7046e7e9a6c7..3317d475be79 100644
> > > > > > > > --- a/drivers/gpu/drm/xe/xe_pci.c
> > > > > > > > +++ b/drivers/gpu/drm/xe/xe_pci.c
> > > > > > > > @@ -932,6 +932,7 @@ static int xe_pci_suspend(struct device *dev)
> > > > > > > >           pci_save_state(pdev);
> > > > > > > >           pci_disable_device(pdev);
> > > > > > > > + pci_set_power_state(pdev, PCI_D3cold);
> > > > > > > Another approach to avoid calling pci_save_state and pci_set_power_state,
> > > > > > > allowing the PCI core to manage this.
> > > > > > > Currently, the above change aligns with the Xe RPM suspend flow.
> > > > > > Either way is fine it seems. Or we don't save the state and let pci subsystem
> > > > > > handle that for us or we save and set explicitly. So, let's move quickly
> > > > > > with this option here that is already fixing our current issue.
> > > > On our way to become another i915 now, are we?
> > > could you please expand on this?
> > >
> > > A simple git grep on pci_set_power_state and on pci_save_state
> > > return many more entries than i915.
> > >
> > > what am I missing?
>
> It's not about PM at all, just a reference to the infinite
>
> [1] more code -> merge -> dammit -> fix -> [1]
>
> cycle.
>
> > > > While this might be a good enough band-aid, we should probably explore
> > > > how DPM_FLAG_SMART_* work and stop mixing/matching approaches to hide
> > > > issues.
> > > perhaps this indeed...
> > > perhaps the ones doing power state themselves are the ones declaring 'smart'?
> > >
> > > Well, on a dumb script here it looks we have over 110 drivers using pci_save_state
> > > and only 6 of those using DPM_FLAG_SMART_*
> > >
> > > So, I agree that it might be a good idea to explore things here and find the
> > > optimal settings.
> >
> > See https://www.kernel.org/doc/html/latest/driver-api/pm/devices.html#the-dpm-flag-smart-suspend-driver-flag
> >
> > It's fairly accurate.
> >
> > The rule of thumb is to avoid mixing these with calling pci_save_state()
> > directly and setting device state from a driver callback.
>
> That's not what I meant here. There are multiple issues at play.
>
> 1. An AER is reported[*] on root port during system suspend even before we
>    reach any of the driver PM callback. From initial investigation it seems
>    like a case of misbehaviour by some child device, but this is a different
>    issue entirely.
>
> [*] irq/120-aerdrv-145     [002] .....  1264.981023: <stack trace>
> => xe_pm_runtime_resume
> => xe_pci_runtime_resume
> => pci_pm_runtime_resume
> => __rpm_callback
> => rpm_callback
> => rpm_resume
> => __pm_runtime_resume
> => pci_pm_runtime_get_sync
> => __pci_walk_bus
> => pci_walk_bus
> => pcie_do_recovery
> => aer_process_err_devices
> => aer_isr
>
> 2. Setting explicit pci_set_power_state(pdev, PCI_D3cold) from xe_pm_suspend().
>    Although we see many drivers do it for their case, it's quite a questionable
>    choice (atleast IMHO) to hard suspend the device from driver PM callback
>    without any regard to runtime_usage counter. It can hide potential issues
>    like AER during system suspend (regardless of whether or not it is supported
>    by the driver, since it is supposed to keep the device active on such a
>    catastrophic failure anyway), but I'll leave it to the experts to decide.

If the driver does not set DPM_FLAG_SMART_SUSPEND, and xe doesn't set
it AFAICS (at least not in the mainline), pci_pm_suspend() will resume
the device from runtime suspend before invoking its driver's callback.

This guarantees that the device is always RPM_ACTIVE when
xe_pci_suspend() runs and it cannot be runtime-suspended because the
core is holding a runtime PM reference on it (and so the runtime PM
usage counter is always greater than zero when xe_pci_suspend() runs).

This means that neither xe_pci_runtime_suspend() nor
xe_pci_runtime_resume() can run concurrently with respect to it, so
xe_pci_suspend() can change the power state of the device etc. safely.
Of course, it needs to call pci_save_state() before doing any of that,
but it does so.

This is expected to work, but resuming the device from runtime suspend
during a system suspend transition can be avoided (see below).

However, on the resume side, pci_pm_resume_noirq() runs first and
because dev_pm_skip_resume() returns 'false' (since
DPM_FLAG_SMART_SUSPEND is unset) and skip_bus_pm is 'false' (since the
device is not in D0), it will call pci_pm_default_resume_early() which
will (1) put the device into D0 unconditionally and (2) restore its
state.

This means that by the time when xe_pci_resume() runs, the power state
of the device is already D0 and its state has already been restored,
so the initial part of it is arguably redundant.

There also appears to be a problem with calling pci_disable_device()
after pci_save_state() in xe_pci_suspend() because restoring the
device's state in pci_pm_default_resume_early() will inadvertently
cause it to be enabled AFAICS.  I would call pci_save_state() after
pci_disable_device(), so the device would still be disabled after
running pci_pm_default_resume_early() and xe_pci_resume() would enable
it as appropriate.

> 2. Since we're already setting D0 and D3 states in our runtime PM callbacks,
>    utilizing DPM_FLAG_SMART_* flags will allow us to skip unnecessary
>    runtime_resume during system suspend and let PCI PM take care of all
>    the PCI stuff in a smarter way (even skip D3hot/cold if it's already in it)
>    without us needing to duplicate code and do everything manually.

That's true in principle, but care needs to be taken.

First of all, DPM_FLAG_SMART_PREPARE and DPM_FLAG_SMART_SUSPEND are
about different things.

DPM_FLAG_SMART_PREPARE is about the direct-complete optimization that
allows the suspend and resume of the device to be skipped completely
if some additional conditions are met and it allows the driver to
control the behavior related to this optimization (see "Entering
System Suspend" in Documentation/driver-api/pm/devices.rst for
details).  The bottom line here is that because xe doesn't provide a
->prepare() callback in xe_pm_ops, this flag will have no effect.  xe
may want to set DPM_FLAG_NO_DIRECT_COMPLETE to disable the
direct-complete optimization completely, though (i915 does this for
some reason unclear to me).

DPM_FLAG_SMART_SUSPEND is about leaving the device in runtime suspend
across the entire system suspend transition if it is still
runtime-suspended after running its driver's ->suspend() callback.

This means in particular that if DPM_FLAG_SMART_SUSPEND is set, the
->suspend() callback may run in parallel with xe_pci_runtime_resume(),
so it cannot be the current xe_pci_suspend() because it cannot
manipulate the device or call pci_save_state() on it or similar.  The
most straightforward way to overcome this would be to point
->suspend_late() to xe_pci_suspend(), but this would cause the device
to be suspended later unless it was runtime-suspended already.  In
turn, suspending it later may cause the total duration of the system
suspend transition to increase (because the device will now be
suspended in a different phase of the transition).

The suspend time can be distributed somewhat differently if the core
is allowed to take care of putting the device into a low-power state.
Simply removing the pci_save_state() and pci_set_power_state() calls
from xe_pci_suspend() should be sufficient to make this happen and
that would be my first step (after fixing the device disable/enable
issue described above).

Next, you can point ->suspend_late(), instead of ->suspend(), to
xe_pci_suspend() and that is still expected to work, but the total
system suspend timing will change.

If this works, you'll be ready to set DPM_FLAG_SMART_SUSPEND.