[Nouveau] [PATCH v2 4/4] pci: save the boot pcie link speed and restore it on fini

Tue May 21 13:56:24 UTC 2019

On Tue, May 21, 2019 at 3:51 PM Ilia Mirkin <imirkin at alum.mit.edu> wrote:
>
> On Tue, May 21, 2019 at 9:29 AM Karol Herbst <kherbst at redhat.com> wrote:
> >
> > On Tue, May 21, 2019 at 3:11 PM Bjorn Helgaas <helgaas at kernel.org> wrote:
> > >
> > > On Tue, May 21, 2019 at 12:30:38AM +0200, Karol Herbst wrote:
> > > > On Mon, May 20, 2019 at 11:20 PM Bjorn Helgaas <helgaas at kernel.org> wrote:
> > > > > On Tue, May 07, 2019 at 10:12:45PM +0200, Karol Herbst wrote:
> > > > > > Apparently things go south if we suspend the device with a different PCIe
> > > > > > link speed set than it got booted with. Fixes runtime suspend on my gp107.
> > > > > >
> > > > > > This all looks like some bug inside the pci subsystem and I would prefer a
> > > > > > fix there instead of nouveau, but maybe there is no real nice way of doing
> > > > > > that outside of drivers?
> > > > >
> > > > > I agree it would be nice to fix this in the PCI core if that's
> > > > > feasible.
> > > > >
> > > > > It looks like this driver changes the PCIe link speed using some
> > > > > device-specific mechanism.  When we suspend, we put the device in
> > > > > D3cold, so it loses all its state.  When we resume, the link probably
> > > > > comes up at the boot speed because nothing did that device-specific
> > > > > magic to change it, so you probably end up with the link being slow
> > > > > but the driver thinking it's configured to be fast, and maybe that
> > > > > combination doesn't work.
> > > > >
> > > > > If it requires something device-specific to change that link speed, I
> > > > > don't know how to put that in the PCI core.  But maybe I'm missing
> > > > > something?
> > > > >
> > > > > Per the PCIe spec (r4.0, sec 1.2):
> > > > >
> > > > >   Initialization – During hardware initialization, each PCI Express
> > > > >   Link is set up following a negotiation of Lane widths and frequency
> > > > >   of operation by the two agents at each end of the Link. No firmware
> > > > >   or operating system software is involved.
> > > > >
> > > > > I have been assuming that this means device-specific link speed
> > > > > management is out of spec, but it seems pretty common that devices
> > > > > don't come up by themselves at the fastest possible link speed.  So
> > > > > maybe the spec just intends that devices can operate at *some* valid
> > > > > speed.
> > > >
> > > > I would expect that devices kind of have to figure out what they can
> > > > operate on and the operating system kind of just checks what the
> > > > current state is and doesn't try to "restore" the old state or
> > > > something?
> > >
> > > The devices at each end of the link negotiate the width and speed of
> > > the link.  This is done directly by the hardware without any help from
> > > the OS.
> > >
> > > The OS can read the current link state (Current Link Speed and
> > > Negotiated Link Width, both in the Link Status register).  The OS has
> > > very little control over that state.  It can't directly restore the
> > > state because the hardware has to negotiate a width & speed that
> > > result in reliable operation.
> > >
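For anyone following along, here is a minimal sketch of pulling those two fields out of a raw Link Status value. The bit positions (Current Link Speed in bits 3:0, Negotiated Link Width in bits 9:4) and the usual speed encodings are as I understand them from the PCIe spec. The name decode_lnksta and the sample value are made up for illustration and are not part of any kernel or pciutils API.

```python
# Conventional encodings for the Current Link Speed field (Link Status bits 3:0).
LINK_SPEEDS_GT_S = {1: 2.5, 2: 5.0, 3: 8.0, 4: 16.0}


def decode_lnksta(lnksta):
    """Split a 16-bit Link Status value into (speed in GT/s, lane count).

    decode_lnksta is a hypothetical helper name used only for this sketch.
    The speed is None when the encoding is not in the table above.
    """
    speed_code = lnksta & 0xF         # bits 3:0, Current Link Speed
    width = (lnksta >> 4) & 0x3F      # bits 9:4, Negotiated Link Width
    return LINK_SPEEDS_GT_S.get(speed_code), width


# Illustrative value only: speed code 3 with a x16 link.
print(decode_lnksta(0x0103))
```

Note that this only reports what the hardware negotiated; nothing here can change the link state.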
> > > > We don't do anything in the driver after the device was suspended. And
> > > > the 0x88000 is a mirror of the PCI config space, but we also got some
> > > > PCIe stuff at 0x8c000 which is used by newer GPUs for gen3 stuff
> > > > essentially. I have no idea how much of this is part of the actual pci
> > > > standard and how much is driver specific. But the driver also wants to
> > > > have some control over the link speed as it's tied to performance
> > > > states on GPU.
> > >
> > > As far as I'm aware, there is no generic PCIe way for the OS to
> > > influence the link width or speed.  If the GPU driver needs to do
> > > that, it would be via some device-specific mechanism.
> > >
> > > > The big issue here is just that the GPU boots with 8.0, some on-gpu
> > > > init mechanism decreases it to 2.5. If we suspend, the GPU or at least
> > > > the communication with the controller is broken. But if we set it to
> > > > the boot speed, resuming the GPU just works. So my assumption was,
> > > > that _something_ (might it be the controller or the pci subsystem)
> > > > tries to force the link to operate at an invalid link speed and because the
> > > > bridge controller is actually powered down as well (as all children
> > > > are in D3cold) I could imagine that something in the pci subsystem
> > > > actually restores the state which lets the controller fail to
> > > > establish communication again?
> > >
> > >   1) At boot-time, the Port and the GPU hardware negotiate 8.0 GT/s
> > >      without OS/driver intervention.
> > >
> > >   2) Some mechanism reduces link speed to 2.5 GT/s.  This probably
> > >      requires driver intervention or at least some ACPI method.
> > >
> >
> > there is no driver intervention and Nouveau doesn't care at all. It's
> > all done on the GPU. We just upload a script and some firmware on to
> > the GPU. The script runs then on the PMU inside the GPU and this
> > script also causes changing the PCIe link settings. But from a Nouveau
> > point of view we don't care about the link before or after that script
> > was invoked. Also there is no ACPI method involved.
> >
> > But if there is something we should notify pci core about, maybe
> > that's something we have to do then?
> >
> > >   3) Suspend puts GPU into D3cold (powered off).
> > >
> > >   4) Resume restores GPU to D0, and the Port and GPU hardware again
> > >      negotiate 8.0 GT/s without OS/driver intervention, just like at
> > >      initial boot.
> > >
> >
> > No, that negotiation fails apparently as any attempt to read anything
> > from the device just fails inside pci core. Or something goes wrong
> > when resuming the bridge controller.
> >
> > >   5) Now the driver thinks the GPU is at 2.5 GT/s but it's actually at
> > >      8.0 GT/s.
> > >
> >
> > what is actually meant by "driver" here? The pci subsystem or Nouveau?
> >
> > > Without knowing more about the transition to 2.5 GT/s, I can't guess
> > > why the GPU wouldn't work after resume.  From a PCIe point of view,
> > > the link is supposed to work and the device should be reachable
> > > independent of the link speed.  But maybe there's some weird
> > > dependency between the GPU and the driver here.
> > >
> >
> > but the device isn't reachable at all, not even from the pci
> > subsystem. All reads fail/return a default error value (0xffffffff).
> >
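A small sketch of how that failure mode can be recognised from a config space dump, assuming the dump was taken from userspace (for example from the sysfs config file of the device). The helper name looks_unreachable and the IDs in the example are made up for illustration. Reads from a device that does not respond come back as all ones, so the vendor/device ID dword is 0xffffffff.

```python
import struct


def looks_unreachable(config):
    """Return True if the first dword of a config space dump is all ones.

    looks_unreachable is a hypothetical helper name. The first dword holds
    the vendor ID and device ID; no valid device reports 0xffff as its
    vendor ID, so all ones here means the read did not reach the device.
    """
    if len(config) < 4:
        return True  # nothing usable was read at all
    (first_dword,) = struct.unpack_from("<I", config, 0)
    return first_dword == 0xFFFFFFFF


# Illustrative dumps: one all-ones read, one with placeholder IDs.
print(looks_unreachable(b"\xff" * 64))
print(looks_unreachable(struct.pack("<HH", 0x10DE, 0x1234) + bytes(60)))
```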
> > > It sounds like things work if you return to 8.0 GT/s before suspend.
> > > That would make sense to me because then the driver's
> > > idea of the link state after resume would match the actual state.
> > >
> >
> > depends on what is meant by the driver here. Inside Nouveau we don't
> > care one bit about the current link speed, so I assume you mean
> > something inside the pci core code?
> >
> > > But I don't see a way to deal with this in the PCI core.  The PCI core
> > > does save and restore most of the architected config space around
> > > suspend/resume, but since this appears to be a device-specific thing,
> > > the PCI core would have no idea how to save/restore it.
> > >
> >
> > if we assume that the negotiation on a device level works as intended,
> > then I would expect this to be a pci core issue, which might actually
> > be not fixable there. But if it's not, then we would have to put
> > something like that inside the runpm documentation to tell drivers
> > they have to do something about it.
> >
> > But again, for me it just sounds like the negotiation on the device
> > level fails or something inside pci core messes it up.
>
> Bjorn -- nouveau has a way of requesting that the GPU change PCIe
> settings. It sets the PCIe version to the max version (esp older GPUs
> tended to boot as PCIe 1.0, and had to be set to 2.0/3.0 "by hand"),
> and then the link speed is adjusted based on the perf level settings
> by writing to a PCI config-ish mmio space -- however on the GPUs that
> Karol is talking about, we can't do the perf level adjustments, so
> nouveau never touches the speed. (Does it touch the PCIe version? Not
> 100% sure ... Karol?)

I think we only do it if the GPU comes up as v1, but that was mainly a
Tesla thing; I saw it on Fermi a few times, but never on newer chips.
And we also only do it if the pci->func->pcie.version callback was set
(which we don't do on Pascal, and this is the gen where we have the
runpm issue).

> In this case, it sounds like it's firmware
> running on the GPU which is doing this (probably using the exact same
> mechanism nouveau would -- those internal engines also have access to
> the mmio space).
>
> Perhaps there's a way to capture PCI config space of both the GPU and
> its link partner, to see if there's anything obviously wrong? (But
> even if there is, doesn't sound like we have too much recourse...)
> From the sounds of it, the two link partners disagree on settings
> somehow and don't establish a proper link.
>
>   -ilia
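
One way to do the capture suggested above, as a rough sketch only: dump the config space of the GPU and of its upstream port, find the PCI Express capability in each, and compare the Link Status fields. The register offsets are the standard ones (Status at 0x06, capabilities pointer at 0x34, capability ID 0x10, Link Status at +0x12 within the capability). The function names are made up for this example. The dumps are assumed to come from the sysfs config file of each device, read as root so that more than the first 64 bytes are returned.

```python
import struct

PCI_STATUS = 0x06            # Status register
PCI_STATUS_CAP_LIST = 0x10   # bit 4: capability list present
PCI_CAPABILITY_LIST = 0x34   # pointer to the first capability
PCI_CAP_ID_EXP = 0x10        # PCI Express capability ID
PCI_EXP_LNKSTA = 0x12        # Link Status, relative to the capability


def find_pcie_cap(config):
    """Return the offset of the PCI Express capability, or None."""
    if len(config) < 0x40:
        return None
    (status,) = struct.unpack_from("<H", config, PCI_STATUS)
    if status == 0xFFFF or not status & PCI_STATUS_CAP_LIST:
        return None
    pos = config[PCI_CAPABILITY_LIST] & 0xFC
    for _ in range(48):  # bound the walk in case the list is corrupt
        if pos < 0x40 or pos + 2 > len(config):
            return None
        if config[pos] == PCI_CAP_ID_EXP:
            return pos
        pos = config[pos + 1] & 0xFC
    return None


def link_status(config):
    """Return (speed code, width) from Link Status, or None if not found."""
    cap = find_pcie_cap(config)
    if cap is None or cap + PCI_EXP_LNKSTA + 2 > len(config):
        return None
    (lnksta,) = struct.unpack_from("<H", config, cap + PCI_EXP_LNKSTA)
    return lnksta & 0xF, (lnksta >> 4) & 0x3F


def compare_link(gpu_config, port_config):
    """Print both ends' Link Status fields; return True if they agree."""
    gpu, port = link_status(gpu_config), link_status(port_config)
    print("GPU :", gpu)
    print("port:", port)
    return gpu is not None and gpu == port
```

Taking the dumps once before runtime suspend and once after the failed resume would show whether the port still reports a sane link state while the GPU reads back as all ones.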