[PATCH] drm/xe/d3cold: Set power state to D3Cold during s2idle/s3

Thu Apr 3 11:16:22 UTC 2025

On Thu, Apr 3, 2025 at 9:37 AM Nilawar, Badal <badal.nilawar at intel.com> wrote:
>
>
> On 02-04-2025 01:05, Rafael J. Wysocki wrote:
> > On Tue, Apr 1, 2025 at 7:53 PM Raag Jadav <raag.jadav at intel.com> wrote:
> >> On Mon, Mar 31, 2025 at 10:18:16PM +0200, Wysocki, Rafael J wrote:
> >>> On 3/31/2025 6:15 PM, Rodrigo Vivi wrote:
> >>>> On Sat, Mar 29, 2025 at 07:20:37AM +0200, Raag Jadav wrote:
> >>>>> On Fri, Mar 28, 2025 at 12:02:17PM -0400, Rodrigo Vivi wrote:
> >>>>>> On Thu, Mar 27, 2025 at 01:14:09PM -0400, Rodrigo Vivi wrote:
> >>>>>>> On Thu, Mar 27, 2025 at 10:02:29PM +0530, Nilawar, Badal wrote:
> >>>>>>>> On 27-03-2025 21:49, Badal Nilawar wrote:
> >>>>>>>> Hi Rodrigo,
> >>>>>>>>
> >>>>>>>>> According to pci core guidelines, pci_save_config is recommended when the
> >>>>>>>>> driver explicitly needs to set the pci power state. As of now xe kmd is
> >>>>>>>>> only doing pci_save_config while entering to s2idle/s3 state, which makes
> >>>>>>>>> pci core think that device driver has already applied required pci power
> >>>>>>>>> state. This leads to GPU remain in D0 state. To fix the issue setting
> >>>>>>>>> the pci power state to D3Cold.
> >>>>>>>>>
> >>>>>>>>> Fixes:dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
> >>>>>>>>> Cc: Rafael J. Wysocki <rafael.j.wysocki at intel.com>
> >>>>>>>>> Cc: Rodrigo Vivi <rodrigo.vivi at intel.com>
> >>>>>>>>> Signed-off-by: Badal Nilawar <badal.nilawar at intel.com>
> >>>>>>>>> Signed-off-by: Anshuman Gupta <anshuman.gupta at intel.com>
> >>>>>>>>> ---
> >>>>>>>>>     drivers/gpu/drm/xe/xe_pci.c | 1 +
> >>>>>>>>>     1 file changed, 1 insertion(+)
> >>>>>>>>>
> >>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_pci.c b/drivers/gpu/drm/xe/xe_pci.c
> >>>>>>>>> index 7046e7e9a6c7..3317d475be79 100644
> >>>>>>>>> --- a/drivers/gpu/drm/xe/xe_pci.c
> >>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_pci.c
> >>>>>>>>> @@ -932,6 +932,7 @@ static int xe_pci_suspend(struct device *dev)
> >>>>>>>>>            pci_save_state(pdev);
> >>>>>>>>>            pci_disable_device(pdev);
> >>>>>>>>> + pci_set_power_state(pdev, PCI_D3cold);
> >>>>>>>> Another approach to avoid calling pci_save_state and pci_set_power_state,
> >>>>>>>> allowing the PCI core to manage this.
> >>>>>>>> Currently, the above change aligns with the Xe RPM suspend flow.
> >>>>>>> Either way is fine it seems. Or we don't save the state and let pci subsystem
> >>>>>>> handle that for us or we save and set explicitly. So, let's move quickly
> >>>>>>> with this option here that is already fixing our current issue.
> >>>>> On our way to become another i915 now, are we?
> >>>> could you please expand on this?
> >>>>
> >>>> A simple git grep on pci_set_power_state and on pci_save_state
> >>>> return many more entries than i915.
> >>>>
> >>>> what am I missing?
> >> It's not about PM at all, just a reference to the infinite
> >>
> >> [1] more code -> merge -> dammit -> fix -> [1]
> >>
> >> cycle.
> >>
> >>>>> While this might be a good enough band-aid, we should probably explore
> >>>>> how DPM_FLAG_SMART_* work and stop mixing/matching approaches to hide
> >>>>> issues.
> >>>> perhaps this indeed...
> >>>> perhaps the ones doing power state themselves are the ones declaring 'smart'?
> >>>>
> >>>> Well, on a dumb script here it looks we have over 110 drivers using pci_save_state
> >>>> and only 6 of those using DPM_FLAG_SMART_*
> >>>>
> >>>> So, I agree that it might be a good idea to explore things here and find the
> >>>> optimal settings.
> >>> See https://www.kernel.org/doc/html/latest/driver-api/pm/devices.html#the-dpm-flag-smart-suspend-driver-flag
> >>>
> >>> It's fairly accurate.
> >>>
> >>> The rule of thumb is to avoid mixing these with calling pci_save_state()
> >>> directly and setting device state from a driver callback.
> >> That's not what I meant here. There are multiple issues at play.
> >>
> >> 1. An AER is reported[*] on root port during system suspend even before we
> >>     reach any of the driver PM callback. From initial investigation it seems
> >>     like a case of misbehaviour by some child device, but this is a different
> >>     issue entirely.
> >>
> >> [*] irq/120-aerdrv-145     [002] .....  1264.981023: <stack trace>
> >> => xe_pm_runtime_resume
> >> => xe_pci_runtime_resume
> >> => pci_pm_runtime_resume
> >> => __rpm_callback
> >> => rpm_callback
> >> => rpm_resume
> >> => __pm_runtime_resume
> >> => pci_pm_runtime_get_sync
> >> => __pci_walk_bus
> >> => pci_walk_bus
> >> => pcie_do_recovery
> >> => aer_process_err_devices
> >> => aer_isr
> >>
> >> 2. Setting explicit pci_set_power_state(pdev, PCI_D3cold) from xe_pm_suspend().
> >>     Although we see many drivers do it for their case, it's quite a questionable
> >>     choice (atleast IMHO) to hard suspend the device from driver PM callback
> >>     without any regard to runtime_usage counter. It can hide potential issues
> >>     like AER during system suspend (regardless of whether or not it is supported
> >>     by the driver, since it is supposed to keep the device active on such a
> >>     catastrophic failure anyway), but I'll leave it to the experts to decide.
> > If the driver does not set DPM_FLAG_SMART_SUSPEND, and xe doesn't set
> > it AFAICS (at least not in the mainline), pci_pm_suspend() will resume
> > the device from runtime suspend before invoking its driver's callback.
> >
> > This guarantees that the device is always RPM_ACTIVE when
> > xe_pci_suspend() runs and it cannot be runtime-suspended because the
> > core is holding a runtime PM reference on it (and so the runtime PM
> > usage counter is always greater than zero when xe_pci_suspend() runs).
> >
> > This means that neither xe_pci_runtime_suspend() nor
> > xe_pci_runtime_resume() can run concurrently with respect to it, so
> > xe_pci_suspend() can change the power state of the device etc. safely.
> > Of course, it needs to call pci_save_state() before doing any of that,
> > but it does so.
> >
> > This is expected to work, but resuming the device from runtime suspend
> > during a system suspend transition can be avoided (see below).
> >
> > However, on the resume side, pci_pm_resume_noirq() runs first and
> > because dev_pm_skip_resume() returns 'false' (since
> > DPM_FLAG_SMART_SUSPEND is unset) and skip_bus_pm is 'false' (since the
> > device is not in D0), it will call pci_pm_default_resume_early() which
> > will (1) put the device into D0 unconditionally and (2) restore its
> > state.
> >
> > This means that by the time when xe_pci_resume() runs, the power state
> > of the device is already D0 and its state has already been restored,
> > so the initial part of it is arguably redundant.
> >
> > There also appears to be a problem with calling pci_disable_device()
> > after pci_save_state() in xe_pci_suspend() because restoring the
> > device's state in pci_pm_default_resume_early() will inadvertently
> > cause it to be enabled AFAICS.  I would call pci_save_state() after
> > pci_disable_device(), so the device would still be disabled after
> > running pci_pm_default_resume_early() and xe_pci_resume() would enable
> > it as appropriate.
> >
> >> 2. Since we're already setting D0 and D3 states in our runtime PM callbacks,
> >>     utilizing DPM_FLAG_SMART_* flags will allow us to skip unnecessary
> >>     runtime_resume during system suspend and let PCI PM take care of all
> >>     the PCI stuff in a smarter way (even skip D3hot/cold if it's already in it)
> >>     without us needing to duplicate code and do everything manually.
> > That's true in principle, but care needs to be taken.
> >
> > First of all, DPM_FLAG_SMART_PREPARE and DPM_FLAG_SMART_SUSPEND are
> > about different things.
> >
> > DPM_FLAG_SMART_PREPARE is about the direct-complete optimization that
> > allows the suspend and resume of the device to be skipped completely
> > if some additional conditions are met and it allows the driver to
> > control the behavior related to this optimization (see "Entering
> > System Suspend" in Documentation/driver-api/pm/devices.rst for
> > details).  The bottom line here is that because xe doesn't provide a
> > ->prepare() callback in xe_pm_ops, this flag will have no effect.  xe
> > may want to set DPM_FLAG_NO_DIRECT_COMPLETE to disable the
> > direct-complete optimization completely, though (i915 does this for
> > some reason unclear to me).
> >
> > DPM_FLAG_SMART_SUSPEND is about leaving the device in runtime suspend
> > across the entire system suspend transition if it is still
> > runtime-suspended after running its driver's ->suspend() callback.
> >
> > This means in particular that if DPM_FLAG_SMART_SUSPEND is set, the
> > ->suspend() callback may run in parallel with xe_pci_runtime_resume(),
> > so it cannot be the current xe_pci_suspend() because it cannot
> > manipulate the device or call pci_save_state() on it or similar.  The
> > most straightforward way to overcome this would be to point
> > ->suspend_late() to xe_pci_suspend(), but this would cause the device
> > to be suspended later unless it was runtime-suspended already.  In
> > turn, suspending it later may cause the total duration of the system
> > suspend transition to increase (because the device will now be
> > suspended in a different phase of the transition).
> >
> > The suspend time can be distributed somewhat differently if the core
> > is allowed to take care of putting the device into a low-power state.
> > Simply removing the pci_save_state() and pci_set_power_state() calls
> > from xe_pci_suspend() should be sufficient to make this happen and
> > that would be my first step (after fixing the device disable/enable
> > issue described above).
> >
> > Next, you can point ->suspend_late(), instead of ->suspend(), to
> > xe_pci_suspend() and that is still expected to work, but the total
> > system suspend timing will change.
> >
> > If this works, you'll be ready to set DPM_FLAG_SMART_SUSPEND.
>
>  From above what I understand, setting DPM_FLAG_SMART_SUSPEND enforces
> the requirement that the device must be
> runtime suspended in order to be suspended.

No, it doesn't.