[PATCH 03/12] PCI/ACPI: Add aux power grant notifier

Rafael J. Wysocki rafael at kernel.org
Thu Apr 3 18:15:10 UTC 2025


On Thu, Apr 3, 2025 at 6:09 PM Gupta, Anshuman <anshuman.gupta at intel.com> wrote:
>
> > -----Original Message-----
> > From: Rafael J. Wysocki <rafael at kernel.org>
> > Sent: Thursday, April 3, 2025 7:04 PM
> > To: Gupta, Anshuman <anshuman.gupta at intel.com>
> > Cc: Rafael J. Wysocki <rafael at kernel.org>; Bjorn Helgaas
> > <helgaas at kernel.org>; intel-xe at lists.freedesktop.org; linux-
> > acpi at vger.kernel.org; linux-pci at vger.kernel.org; lenb at kernel.org;
> > bhelgaas at google.com; ilpo.jarvinen at linux.intel.com; De Marchi, Lucas
> > <lucas.demarchi at intel.com>; Vivi, Rodrigo <rodrigo.vivi at intel.com>; Nilawar,
> > Badal <badal.nilawar at intel.com>; Gupta, Varun <varun.gupta at intel.com>;
> > ville.syrjala at linux.intel.com; Shankar, Uma <uma.shankar at intel.com>
> > Subject: Re: [PATCH 03/12] PCI/ACPI: Add aux power grant notifier
> >
> > On Thu, Apr 3, 2025 at 1:30 PM Gupta, Anshuman
> > <anshuman.gupta at intel.com> wrote:
> > >
> > > > -----Original Message-----
> > > > From: Rafael J. Wysocki <rafael at kernel.org>
> > > > Sent: Wednesday, April 2, 2025 4:54 PM
> > > > To: Bjorn Helgaas <helgaas at kernel.org>; Gupta, Anshuman
> > > > <anshuman.gupta at intel.com>
> > > > Cc: intel-xe at lists.freedesktop.org; linux-acpi at vger.kernel.org;
> > > > linux- pci at vger.kernel.org; rafael at kernel.org; lenb at kernel.org;
> > > > bhelgaas at google.com; ilpo.jarvinen at linux.intel.com; De Marchi, Lucas
> > > > <lucas.demarchi at intel.com>; Vivi, Rodrigo <rodrigo.vivi at intel.com>;
> > > > Nilawar, Badal <badal.nilawar at intel.com>; Gupta, Varun
> > > > <varun.gupta at intel.com>; ville.syrjala at linux.intel.com; Shankar, Uma
> > > > <uma.shankar at intel.com>
> > > > Subject: Re: [PATCH 03/12] PCI/ACPI: Add aux power grant notifier
> > > >
> > > > On Tue, Apr 1, 2025 at 10:13 PM Bjorn Helgaas <helgaas at kernel.org>
> > wrote:
> > > > >
> > > > > On Tue, Apr 01, 2025 at 09:02:16PM +0530, Anshuman Gupta wrote:
> > > > > > Adding a notifier to notify all PCIe child devices about the
> > > > > > block main power removal. It is needed because theoretically
> > > > > > multiple PCIe Endpoint devices on same Root Port can request for
> > > > > > AUX power and _DSM method can return with 80000000h signifies
> > > > > > that the hierarchy connected via the slot cannot support main
> > > > > > power removal when in D3Cold.
> > > > >
> > > > > I wish the spec used different language here.  "D3cold" *means*
> > > > > "main power is removed" (PCIe r6.0, sec 5.3.1.4.2), so it doesn't
> > > > > make sense to say that "the slot cannot support main power removal
> > > > > when in D3cold".  If a device is in D3cold, its main power has
> > > > > been removed by definition.
> > > > >
> > > > > I suppose the spec is trying to say if the driver has called this
> > > > > _DSM with 80000000h, it means the platform must not remove main
> > > > > power from the device while the system is in S0?  Is that what you
> > > > > think it means?
> > > > >
> > > > > The 2h return value description says it "indicates that the
> > > > > platform will not remove main power from the slot while the system is in
> > S0,"
> > > > > so I guess that must be it.
> > > > >
> > > > > In this series, pci_acpi_aux_power_setup() only supplies a 16-bit
> > > > > aux_pwr_limit value, so the driver cannot *request* that the
> > > > > platform not remove main power.
> > > > >
> > > > > So I guess the only scenario where the _DSM returns 80000000h is
> > > > > when the platform itself or other devices prevent the removal of
> > > > > main power.  And the point of the notifier is to tell devices that
> > > > > their main power will never be removed while the system is in S0.
> > > > > Is that right?
> > > > >
> > > > > > This may disrupt functionality of other child device.
> > > > >
> > > > > What sort of disruption could happen?  I would think that if the
> > > > > _DSM returns 80000000h, it just means the device will not have
> > > > > main power removed, so it won't see that power state transition.
> > > > > The only "disruption" would be that we're using more power.
> > > > >
> > > > > > Let's notify all PCIe devices requested Aux power resource and
> > > > > > Let PCIe End Point driver handle it in its callback.
> > > > >
> > > > > Wrap to fill 75 columns.
> > > > >
> > > > > s/Adding/Add/
> > > > > s/Let's notify/Notify/
> > > > > s/and Let/and let/
> > > > > s/End Point/Endpoint/ (several places here and below)
> > > > >
> > > > > > Signed-off-by: Anshuman Gupta <anshuman.gupta at intel.com>
> > > > > > ---
> > > > > >  drivers/pci/pci-acpi.c   | 34 +++++++++++++++++++++++++++++++---
> > > > > >  include/linux/pci-acpi.h | 13 +++++++++++++
> > > > > >  2 files changed, 44 insertions(+), 3 deletions(-)
> > > > > >
> > > > > > diff --git a/drivers/pci/pci-acpi.c b/drivers/pci/pci-acpi.c
> > > > > > index
> > > > > > 04149f037664..d1ca1649e6e8 100644
> > > > > > --- a/drivers/pci/pci-acpi.c
> > > > > > +++ b/drivers/pci/pci-acpi.c
> > > > > > @@ -1421,6 +1421,32 @@ static void
> > > > > > pci_acpi_optimize_delay(struct
> > > > pci_dev *pdev,
> > > > > >       ACPI_FREE(obj);
> > > > > >  }
> > > > > >
> > > > > > +static BLOCKING_NOTIFIER_HEAD(pci_acpi_aux_power_notify_list);
> > > > > > +
> > > > > > +/**
> > > > > > + * pci_acpi_register_aux_power_notifier - Register driver
> > > > > > +notifier
> > > > > > + * @nb: notifier block
> > > > > > + *
> > > > > > + * This function shall be called by PCIe End Point device
> > > > > > +requested the Aux
> > > > > > + * power resource in order to handle the any scenario
> > > > > > +gracefully when other
> > > > > > + * child PCIe devices Aux power request returns with No main
> > > > > > +power
> > > > removal.
> > > > > > + * PCIe devices which register this notifier shall handle No
> > > > > > +main power
> > > > > > + * removal scenario accordingly.
> > > > >
> > > > > This would actually be called by the *driver* (not by the device).
> > > >
> > > Hi Rafael,
> > > Thanks for review.
> > > > Apart from this, there seems to be a design issue here because it
> > > > won't notify every driver that has requested the Aux power, just the
> > > > ones that have also registered notifiers.
> > > IMHO that is what intended, if any device has functional impact in our
> > > case it is INTEL GPU has functional impact if other PCIe device's (on
> > > same root port) Aux Power request returns with 0x2. We need to handle
> > such case to handle it gracefully.
> > > >
> > > > So this appears to be an opt-in from getting notifications on Aux
> > > > power request rejection responses to requests from other drivers
> > > > requesting Aux power for the same Root Port, but the changelog of
> > > > the first patch already claimed that the aggregation of requests was
> > > > not supported.  So if only one driver will be allowed to request the
> > > > Aux power for the given Root Port, why would the notifiers be necessary
> > after all?
> > > >
> > > Please guide us, if we remove the above limitation from the commit log.
> > > Then do we have any design issue with notifier approach ?
> >
> > Exactly like I said: If you only allow one driver to use the _DSM to request the
> > Aux power from a given Root Port, it will have all of the information and does
> > not need to be notified about any changes.  Since no one else is allowed to use
> > this interface for that Root Port, no one else will need a notifier either.  For this
> > to work, you need some mechanism allowing drivers to claim the interface so
> > no one else can use it (on a per Root Port basis) which is currently missing
> > AFAICS.
>
> IMHO such kind of mechanism will require to add Root Port specific data structure to claim
> the interface. But real problem is the criteria  to claim the interface.
> Is first PCIe Non-Bridge Endpoint Function 0 driver can be criteria  to claim the
> Interface. Or first come and first serve approach ?

IMV, the first PCIe Non-Bridge Endpoint Function 0 driver approach
would be sort of fragile and cumbersome to enforce.

First come, first serve is much simpler and should be sufficient for now AFAICS.

> > I think that this is the option you want to pursue.
> >
> We will prefer to stick with this option, if it can guaranty that XeKMD will the only driver
> allowed to request Aux power to enable VRSR.
>
> > OTOH, if you want to allow multiple drivers to use this interface for the same
> > Root Port, you need to aggregate the requests (as required by the spec IIUC) in
> > the first place, or else the users can override each other's request.  In that case
>
> INHO how Linux Kernel will aggregate the Aux Power request, without knowing the
> total Aux power budget. AFAIU aggregated Aux power request without knowledge of total
> power budget can also be denied by BIOS[1].
> Therefore how aggregation can be useful ?

Say two drivers request the Aux power from the same Root Port and each
of them passes a limit sufficient for its device.  Without
aggregation, what guarantees that the last limit passed will be
sufficient for both devices?

Also see the spec provision regarding retries below.

> [1]As per r3.3 Section 4.6.10
> Platform Firmware Budgeting of Aux Power Availability:
> Platform firmware must not grant more power than what is available within the system. For
> example, a board may be designed with four CEM slots (one x16 slot, one x4 slot, and two x1
> slots). The board may implement a power delivery circuit capable of supplying 2 W of power for
> the 3.3 Vaux rail supplying all 4 slots. The 3.3 Vaux pins on each CEM slot can supply 1 W each.
> Platform firmware may use the retry mechanism to prioritize requests from devices in preferred
> slots in the following manner:
>
> -> Requests from a device in the highest priority slot (e.g., x16) are granted immediately, if
> available.
>
> -> Requests from devices in lower priority slots (e.g., x2, x1) are initially rejected, with a retry
> interval inversely proportional to the slot priority. For instance, if the x2 slot is higher priority
> than the x1 slots, so the retry interval for the x2 slot may be 4 seconds, and the x1 slots may be
> 8 and 10 seconds.
>
> ->As requests are granted, the granted values are subtracted from the available budget.
>  Retried requests are granted based on the remaining power budget or denied if insufficient
> power budget is available. Another retry is not permitted.
>
> -> When there is insufficient power budget for a request, firmware may choose to keep main
> power on and return no power removal (2h).

This means that the platform firmware is supposed to do its own
aggregation, but at the system (board) level, not at the Root Port
level.

The algorithm is described above and my understanding of it is that
once a request has been granted, it cannot be retracted later.  Also
it appears to assume a 1:1 correspondence between PCIe slots and Root
Ports.

The guys who haven't been granted their requests at once may be asked
to try again later and there may not be enough Aux power for the last
guys at all, so they will not get it and they will stay on the main
power.  And again, this appears to be at the Root Port granularity.


More information about the Intel-xe mailing list