[PATCH] drm/xe/d3cold: Set power state to D3Cold during s2idle/s3

Nilawar, Badal badal.nilawar at intel.com
Thu Apr 3 07:36:51 UTC 2025


On 02-04-2025 01:05, Rafael J. Wysocki wrote:
> On Tue, Apr 1, 2025 at 7:53 PM Raag Jadav <raag.jadav at intel.com> wrote:
>> On Mon, Mar 31, 2025 at 10:18:16PM +0200, Wysocki, Rafael J wrote:
>>> On 3/31/2025 6:15 PM, Rodrigo Vivi wrote:
>>>> On Sat, Mar 29, 2025 at 07:20:37AM +0200, Raag Jadav wrote:
>>>>> On Fri, Mar 28, 2025 at 12:02:17PM -0400, Rodrigo Vivi wrote:
>>>>>> On Thu, Mar 27, 2025 at 01:14:09PM -0400, Rodrigo Vivi wrote:
>>>>>>> On Thu, Mar 27, 2025 at 10:02:29PM +0530, Nilawar, Badal wrote:
>>>>>>>> On 27-03-2025 21:49, Badal Nilawar wrote:
>>>>>>>> Hi Rodrigo,
>>>>>>>>
>>>>>>>>> According to pci core guidelines, pci_save_config is recommended when the
>>>>>>>>> driver explicitly needs to set the pci power state. As of now xe kmd is
>>>>>>>>> only doing pci_save_config while entering to s2idle/s3 state, which makes
>>>>>>>>> pci core think that device driver has already applied required pci power
>>>>>>>>> state. This leads to GPU remain in D0 state. To fix the issue setting
>>>>>>>>> the pci power state to D3Cold.
>>>>>>>>>
>>>>>>>>> Fixes:dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
>>>>>>>>> Cc: Rafael J. Wysocki <rafael.j.wysocki at intel.com>
>>>>>>>>> Cc: Rodrigo Vivi <rodrigo.vivi at intel.com>
>>>>>>>>> Signed-off-by: Badal Nilawar <badal.nilawar at intel.com>
>>>>>>>>> Signed-off-by: Anshuman Gupta <anshuman.gupta at intel.com>
>>>>>>>>> ---
>>>>>>>>>     drivers/gpu/drm/xe/xe_pci.c | 1 +
>>>>>>>>>     1 file changed, 1 insertion(+)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_pci.c b/drivers/gpu/drm/xe/xe_pci.c
>>>>>>>>> index 7046e7e9a6c7..3317d475be79 100644
>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_pci.c
>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_pci.c
>>>>>>>>> @@ -932,6 +932,7 @@ static int xe_pci_suspend(struct device *dev)
>>>>>>>>>            pci_save_state(pdev);
>>>>>>>>>            pci_disable_device(pdev);
>>>>>>>>> + pci_set_power_state(pdev, PCI_D3cold);
>>>>>>>> Another approach to avoid calling pci_save_state and pci_set_power_state,
>>>>>>>> allowing the PCI core to manage this.
>>>>>>>> Currently, the above change aligns with the Xe RPM suspend flow.
>>>>>>> Either way is fine it seems. Or we don't save the state and let pci subsystem
>>>>>>> handle that for us or we save and set explicitly. So, let's move quickly
>>>>>>> with this option here that is already fixing our current issue.
>>>>> On our way to become another i915 now, are we?
>>>> could you please expand on this?
>>>>
>>>> A simple git grep on pci_set_power_state and on pci_save_state
>>>> return many more entries than i915.
>>>>
>>>> what am I missing?
>> It's not about PM at all, just a reference to the infinite
>>
>> [1] more code -> merge -> dammit -> fix -> [1]
>>
>> cycle.
>>
>>>>> While this might be a good enough band-aid, we should probably explore
>>>>> how DPM_FLAG_SMART_* work and stop mixing/matching approaches to hide
>>>>> issues.
>>>> perhaps this indeed...
>>>> perhaps the ones doing power state themselves are the ones declaring 'smart'?
>>>>
>>>> Well, on a dumb script here it looks we have over 110 drivers using pci_save_state
>>>> and only 6 of those using DPM_FLAG_SMART_*
>>>>
>>>> So, I agree that it might be a good idea to explore things here and find the
>>>> optimal settings.
>>> See https://www.kernel.org/doc/html/latest/driver-api/pm/devices.html#the-dpm-flag-smart-suspend-driver-flag
>>>
>>> It's fairly accurate.
>>>
>>> The rule of thumb is to avoid mixing these with calling pci_save_state()
>>> directly and setting device state from a driver callback.
>> That's not what I meant here. There are multiple issues at play.
>>
>> 1. An AER is reported[*] on root port during system suspend even before we
>>     reach any of the driver PM callback. From initial investigation it seems
>>     like a case of misbehaviour by some child device, but this is a different
>>     issue entirely.
>>
>> [*] irq/120-aerdrv-145     [002] .....  1264.981023: <stack trace>
>> => xe_pm_runtime_resume
>> => xe_pci_runtime_resume
>> => pci_pm_runtime_resume
>> => __rpm_callback
>> => rpm_callback
>> => rpm_resume
>> => __pm_runtime_resume
>> => pci_pm_runtime_get_sync
>> => __pci_walk_bus
>> => pci_walk_bus
>> => pcie_do_recovery
>> => aer_process_err_devices
>> => aer_isr
>>
>> 2. Setting explicit pci_set_power_state(pdev, PCI_D3cold) from xe_pm_suspend().
>>     Although we see many drivers do it for their case, it's quite a questionable
>>     choice (atleast IMHO) to hard suspend the device from driver PM callback
>>     without any regard to runtime_usage counter. It can hide potential issues
>>     like AER during system suspend (regardless of whether or not it is supported
>>     by the driver, since it is supposed to keep the device active on such a
>>     catastrophic failure anyway), but I'll leave it to the experts to decide.
> If the driver does not set DPM_FLAG_SMART_SUSPEND, and xe doesn't set
> it AFAICS (at least not in the mainline), pci_pm_suspend() will resume
> the device from runtime suspend before invoking its driver's callback.
>
> This guarantees that the device is always RPM_ACTIVE when
> xe_pci_suspend() runs and it cannot be runtime-suspended because the
> core is holding a runtime PM reference on it (and so the runtime PM
> usage counter is always greater than zero when xe_pci_suspend() runs).
>
> This means that neither xe_pci_runtime_suspend() nor
> xe_pci_runtime_resume() can run concurrently with respect to it, so
> xe_pci_suspend() can change the power state of the device etc. safely.
> Of course, it needs to call pci_save_state() before doing any of that,
> but it does so.
>
> This is expected to work, but resuming the device from runtime suspend
> during a system suspend transition can be avoided (see below).
>
> However, on the resume side, pci_pm_resume_noirq() runs first and
> because dev_pm_skip_resume() returns 'false' (since
> DPM_FLAG_SMART_SUSPEND is unset) and skip_bus_pm is 'false' (since the
> device is not in D0), it will call pci_pm_default_resume_early() which
> will (1) put the device into D0 unconditionally and (2) restore its
> state.
>
> This means that by the time when xe_pci_resume() runs, the power state
> of the device is already D0 and its state has already been restored,
> so the initial part of it is arguably redundant.
>
> There also appears to be a problem with calling pci_disable_device()
> after pci_save_state() in xe_pci_suspend() because restoring the
> device's state in pci_pm_default_resume_early() will inadvertently
> cause it to be enabled AFAICS.  I would call pci_save_state() after
> pci_disable_device(), so the device would still be disabled after
> running pci_pm_default_resume_early() and xe_pci_resume() would enable
> it as appropriate.
>
>> 2. Since we're already setting D0 and D3 states in our runtime PM callbacks,
>>     utilizing DPM_FLAG_SMART_* flags will allow us to skip unnecessary
>>     runtime_resume during system suspend and let PCI PM take care of all
>>     the PCI stuff in a smarter way (even skip D3hot/cold if it's already in it)
>>     without us needing to duplicate code and do everything manually.
> That's true in principle, but care needs to be taken.
>
> First of all, DPM_FLAG_SMART_PREPARE and DPM_FLAG_SMART_SUSPEND are
> about different things.
>
> DPM_FLAG_SMART_PREPARE is about the direct-complete optimization that
> allows the suspend and resume of the device to be skipped completely
> if some additional conditions are met and it allows the driver to
> control the behavior related to this optimization (see "Entering
> System Suspend" in Documentation/driver-api/pm/devices.rst for
> details).  The bottom line here is that because xe doesn't provide a
> ->prepare() callback in xe_pm_ops, this flag will have no effect.  xe
> may want to set DPM_FLAG_NO_DIRECT_COMPLETE to disable the
> direct-complete optimization completely, though (i915 does this for
> some reason unclear to me).
>
> DPM_FLAG_SMART_SUSPEND is about leaving the device in runtime suspend
> across the entire system suspend transition if it is still
> runtime-suspended after running its driver's ->suspend() callback.
>
> This means in particular that if DPM_FLAG_SMART_SUSPEND is set, the
> ->suspend() callback may run in parallel with xe_pci_runtime_resume(),
> so it cannot be the current xe_pci_suspend() because it cannot
> manipulate the device or call pci_save_state() on it or similar.  The
> most straightforward way to overcome this would be to point
> ->suspend_late() to xe_pci_suspend(), but this would cause the device
> to be suspended later unless it was runtime-suspended already.  In
> turn, suspending it later may cause the total duration of the system
> suspend transition to increase (because the device will now be
> suspended in a different phase of the transition).
>
> The suspend time can be distributed somewhat differently if the core
> is allowed to take care of putting the device into a low-power state.
> Simply removing the pci_save_state() and pci_set_power_state() calls
> from xe_pci_suspend() should be sufficient to make this happen and
> that would be my first step (after fixing the device disable/enable
> issue described above).
>
> Next, you can point ->suspend_late(), instead of ->suspend(), to
> xe_pci_suspend() and that is still expected to work, but the total
> system suspend timing will change.
>
> If this works, you'll be ready to set DPM_FLAG_SMART_SUSPEND.

 From above what I understand, setting DPM_FLAG_SMART_SUSPEND enforces 
the requirement that the device must be
runtime suspended in order to be suspended. Is it valid with respect to 
xe kmd?



More information about the Intel-xe mailing list