Couple of issues with amdgpu on my WX4100

Mon Jan 4 18:43:35 UTC 2021

On Mon, 4 Jan 2021 18:39:33 +0100
Christian König <christian.koenig at amd.com> wrote:

> Am 04.01.21 um 17:45 schrieb Alex Williamson:
> > On Mon, 4 Jan 2021 12:34:34 +0100
> > Christian König <christian.koenig at amd.com> wrote:
> >  
> >> Hi Maxim,
> >>
> >> I can't help with the display related stuff. Probably the best approach
> >> to get this fixed would be to open a bug for it on the FDO bug tracker.
> >>
> >> But I'm the one who implemented the resizeable BAR support and your
> >> analysis of the problem sounds about correct to me.
> >>
> >> The reason why this works on Linux is most likely because we restore the
> >> BAR size on resume (and maybe during initial boot as well).
> >>
> >> See this patch for reference:
> >>
> >> commit d3252ace0bc652a1a244455556b6a549f969bf99
> >> Author: Christian König <ckoenig.leichtzumerken at gmail.com>
> >> Date:   Fri Jun 29 19:54:55 2018 -0500
> >>
> >>       PCI: Restore resized BAR state on resume
> >>
> >>       Resize BARs after resume to the expected size again.
> >>
> >>       BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=199959
> >>       Fixes: d6895ad39f3b ("drm/amdgpu: resize VRAM BAR for CPU access v6")
> >>       Fixes: 276b738deb5b ("PCI: Add resizable BAR infrastructure")
> >>       Signed-off-by: Christian König <christian.koenig at amd.com>
> >>       Signed-off-by: Bjorn Helgaas <bhelgaas at google.com>
> >>       CC: stable at vger.kernel.org      # v4.15+
> >>
> >>
> >> It should be trivial to add this to the reset module as well. Most
> >> likely even completely vendor independent since I'm not sure what a bus
> >> reset will do to this configuration and restoring it all the time should
> >> be the most defensive approach.  
> > Hmm, this should already be used by the bus/slot reset path:
> >
> > pci_bus_restore_locked()/pci_slot_restore_locked()
> >   pci_dev_restore()
> >    pci_restore_state()
> >     pci_restore_rebar_state()
> >
> > VFIO support for resizeable BARs has been on my todo list, but I don't
> > have access to any systems that have both a capable device and >4G
> > decoding enabled in the BIOS.  If we have a consistent view of the BAR
> > size after the BARs are expanded, I'm not sure why it doesn't just
> > work.  FWIW, QEMU currently hides the REBAR capability from the guest
> > because the kernel driver doesn't support emulation through config
> > space (i.e. it's read-only, which the spec doesn't support).  
> 
> In this case the guest shouldn't be able to change the config at all and 
> I have no idea what's going wrong here.
> 
> > AIUI, resource allocation can fail when enabling REBAR support, which
> > is a problem if the failure occurs on the host but not the guest since
> > we have no means via the hardware protocol to expose such a condition.
> > Therefore the model I was considering for vfio-pci would be to simply
> > pre-enable REBAR at the max size.  
> 
> That's a rather bad idea. Our GPUs, for example, report way more than 
> they actually need.
> 
> E.g. a Polaris usually returns 4GiB even when only 2GiB are installed, 
> because 4GiB is just the maximum amount of RAM you can put together with 
> the ASIC on a board.

Would the driver fail or misbehave if the BAR is sized larger than the
amount of memory on the card or is memory size determined independently
of BAR size?

> Some devices even return a mask of all 1s even when they need only 2MiB, 
> resulting in nearly 1TiB of wasted address space with this approach.

Ugh.  I'm afraid to ask why a device with a 2MiB BAR would implement a
REBAR capability, but I guess we really can't make any assumptions
about the breadth of SKUs that ASIC might support (or sanity of the
designers).
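
To make the over-allocation concern concrete, here is a minimal user-space
sketch, not kernel code. It assumes the supported-sizes mask has already
been shifted so that bit 0 means 1MiB, bit 1 means 2MiB, and so on, and
that the field is 20 bits wide. The function names and the example masks
are hypothetical and chosen only for illustration.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumption: size index 0 corresponds to 1MiB, each step doubles it. */
static uint64_t rebar_index_to_bytes(unsigned int index)
{
	return 1ULL << (index + 20);
}

/* Largest size index advertised in the mask, or -1 if the mask is empty. */
static int rebar_max_index(uint32_t sizes)
{
	int index = -1;

	while (sizes) {
		index++;
		sizes >>= 1;
	}
	return index;
}

/*
 * Address space reserved beyond what the device needs if the BAR is
 * simply pre-enabled at the largest advertised size.
 */
static uint64_t rebar_overcommit_bytes(uint32_t sizes, uint64_t needed)
{
	int max = rebar_max_index(sizes);
	uint64_t largest;

	if (max < 0)
		return 0;

	largest = rebar_index_to_bytes((unsigned int)max);
	return largest > needed ? largest - needed : 0;
}
```

With a hypothetical mask of 0x1F00 (256MiB through 4GiB) and 2GiB of
memory installed, rebar_overcommit_bytes() reports 2GiB of unused address
space.  With an all-ones 20-bit mask the largest advertised size under
these assumptions is 512GiB, whatever the device actually needs.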

We could probe to determine the maximum size the host can support and
potentially emulate the capability to remove sizes that we can't
allocate, but since the device has no way to reject a size advertised
as supported via the capability protocol, I'm nervous about how we can
guarantee the resources are available when the user re-configures the
device.  That might mean we'd need to reserve the
resources, up to what the host can support, regardless of what the
device can actually use.  I'm not sure how else to know how much to
reserve without device specific code in vfio-pci.  Thanks,

Alex
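
The idea of emulating the capability to remove sizes the host can't
allocate could be sketched roughly as below.  This is only an illustration
under stated assumptions: the mask uses bit 0 for 1MiB, host_max_index is
the largest size index the host was able to probe (or negative if none),
and rebar_filter_sizes() is a hypothetical helper, not an existing
vfio-pci or PCI core function.

```c
#include <stdint.h>

/*
 * Return the device's supported-sizes mask with every size larger than
 * the host can allocate removed.  Assumption: bit N of the mask stands
 * for a BAR size of 1MiB << N.
 */
static uint32_t rebar_filter_sizes(uint32_t device_sizes, int host_max_index)
{
	if (host_max_index < 0)
		return 0;		/* host can't provide any size */
	if (host_max_index >= 31)
		return device_sizes;	/* nothing to mask off */

	/* Keep bits 0..host_max_index, clear everything above. */
	return device_sizes & ((1u << (host_max_index + 1)) - 1);
}
```

Filtering the mask only narrows what the user can select.  It does not by
itself guarantee that the resources are still available at re-configure
time, which is the reservation problem described above.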