Couple of issues with amdgpu on my WX4100

Maxim Levitsky mlevitsk at redhat.com
Wed Jan 6 20:21:08 UTC 2021


On Mon, 2021-01-04 at 09:45 -0700, Alex Williamson wrote:
> On Mon, 4 Jan 2021 12:34:34 +0100
> Christian König <christian.koenig at amd.com> wrote:
> 
> > Hi Maxim,
> > 
> > I can't help with the display related stuff. Probably best approach to 
> > get this fixes would be to open up a bug tracker for this on FDO.
> > 
> > But I'm the one who implemented the resizeable BAR support and your 
> > analysis of the problem sounds about correct to me.
> > 
> > The reason why this works on Linux is most likely because we restore the 
> > BAR size on resume (and maybe during initial boot as well).
> > 
> > See this patch for reference:
> > 
> > commit d3252ace0bc652a1a244455556b6a549f969bf99
> > Author: Christian König <ckoenig.leichtzumerken at gmail.com>
> > Date:   Fri Jun 29 19:54:55 2018 -0500
> > 
> >      PCI: Restore resized BAR state on resume
> > 
> >      Resize BARs after resume to the expected size again.
> > 
> >      BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=199959
> >      Fixes: d6895ad39f3b ("drm/amdgpu: resize VRAM BAR for CPU access v6")
> >      Fixes: 276b738deb5b ("PCI: Add resizable BAR infrastructure")
> >      Signed-off-by: Christian König <christian.koenig at amd.com>
> >      Signed-off-by: Bjorn Helgaas <bhelgaas at google.com>
> >      CC: stable at vger.kernel.org      # v4.15+
> > 
Hi!
Thanks for the feedback!
 
So I went over the qemu code, and qemu (as opposed to the kernel,
where I tried to hide PCI_EXT_CAP_ID_REBAR) does indeed hide this
PCI capability from the guest.
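
To illustrate what "hiding" an extended capability means, here is a
minimal userspace sketch that walks the extended capability chain in a
4K config space buffer and unlinks one entry. This is not qemu's or
the kernel's actual code; the helper names (cfg_read32, cfg_write32,
find_ext_cap, hide_ext_cap) are made up for the example, and hiding a
capability that sits at offset 0x100 is deliberately left unhandled.

```c
#include <stdint.h>

#define CFG_SPACE_SIZE    4096
#define EXT_CAP_START     0x100   /* first extended capability header */
#define EXT_CAP_ID_REBAR  0x15    /* same value as PCI_EXT_CAP_ID_REBAR */

/* Config space is little-endian; assemble the dword byte by byte. */
static uint32_t cfg_read32(const uint8_t *cfg, unsigned off)
{
    return (uint32_t)cfg[off] | ((uint32_t)cfg[off + 1] << 8) |
           ((uint32_t)cfg[off + 2] << 16) | ((uint32_t)cfg[off + 3] << 24);
}

static void cfg_write32(uint8_t *cfg, unsigned off, uint32_t val)
{
    cfg[off] = val & 0xff;
    cfg[off + 1] = (val >> 8) & 0xff;
    cfg[off + 2] = (val >> 16) & 0xff;
    cfg[off + 3] = (val >> 24) & 0xff;
}

/* Header layout: bits 15:0 cap ID, 19:16 version, 31:20 next offset.
 * Returns the offset of the capability, or 0 if it is not present. */
static unsigned find_ext_cap(const uint8_t *cfg, uint16_t id)
{
    unsigned off = EXT_CAP_START;
    int ttl = (CFG_SPACE_SIZE - EXT_CAP_START) / 8; /* guards against loops */

    while (ttl-- > 0 && off >= EXT_CAP_START) {
        uint32_t hdr = cfg_read32(cfg, off);

        if (hdr == 0)
            break;
        if ((hdr & 0xffff) == id)
            return off;
        off = (hdr >> 20) & 0xffc;
    }
    return 0;
}

/* Unlink capability 'id' by pointing its predecessor at its successor.
 * Returns 0 on success, -1 if not found or if it is the first entry. */
static int hide_ext_cap(uint8_t *cfg, uint16_t id)
{
    unsigned prev = 0, off = EXT_CAP_START;
    int ttl = (CFG_SPACE_SIZE - EXT_CAP_START) / 8;

    while (ttl-- > 0 && off >= EXT_CAP_START) {
        uint32_t hdr = cfg_read32(cfg, off);
        unsigned next = (hdr >> 20) & 0xffc;

        if (hdr == 0)
            break;
        if ((hdr & 0xffff) == id) {
            uint32_t phdr;

            if (!prev)
                return -1; /* first entry: not handled in this sketch */
            phdr = cfg_read32(cfg, prev);
            phdr = (phdr & 0x000fffff) | ((uint32_t)next << 20);
            cfg_write32(cfg, prev, phdr);
            return 0;
        }
        prev = off;
        off = next;
    }
    return -1;
}
```

After hide_ext_cap(cfg, EXT_CAP_ID_REBAR) succeeds, a walk of the
chain no longer reaches the REBAR entry, which is all the guest sees.
The hardware state itself is untouched.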
 
However, exactly as Alex mentioned, the kernel does restore
the rebar state, and even with that code patched out I found that the
rebar state persists across the reset that the vendor_reset module
does (BACO I think).
 
Therefore the Linux guest sees the full 4G bar and happily uses it, 
while the windows guest's driver apparently has a bug when the bar
is that large.
 
I patched the amdgpu to resize the bar to various other sizes, and
the windows driver apparently works up to a 2GB bar.
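
For reference, the REBAR control register encodes the BAR size as
log2(bytes) - 20, so 0 means 1MB, 11 means 2GB and 12 means 4GB.
A minimal userspace sketch of that encoding follows. The function
names are made up for the example; the kernel has its own helpers for
this, which this sketch only imitates.

```c
#include <stdint.h>

/* Hypothetical helper: convert a BAR size in bytes to the REBAR size
 * encoding (0 = 1MB, 1 = 2MB, ...). Sizes are rounded up to the next
 * power of two, and anything below 1MB maps to 0. */
static int rebar_bytes_to_size(uint64_t bytes)
{
    int order = 0;
    uint64_t v = 1;

    /* Find the smallest power of two that is >= bytes. */
    while (v < bytes && order < 63) {
        v <<= 1;
        order++;
    }
    return order < 20 ? 0 : order - 20;
}

/* Hypothetical helper: convert a REBAR size encoding back to bytes. */
static uint64_t rebar_size_to_bytes(int size)
{
    return 1ULL << (size + 20);
}
```

With this encoding, the largest size that worked for the windows
driver in my tests (2GB) is 11, and the full 4G bar is 12.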
 
So pretty much, other than a bug in the windows driver and the fact
that VFIO doesn't support resizable bars, there is nothing wrong here.
 
Since my system does support above 4G decoding and I do have a nice
vfio friendly device that does support a resizable bar, I do volunteer
to add support for this to VFIO as time and resources permit.
 
Also it would be nice if amdgpu (or the whole system) could
optionally avoid resizing bars when a kernel command line /
module param is given, or even better, if amdgpu resized the
bar back to its original size when it is unloaded, which IMHO
is the best solution for this problem.
 
I think I can prepare a patch to make amdgpu restore 
the bar size on unload if you think that
this is the right solution.

> > 
> > It should be trivial to add this to the reset module as well. Most 
> > likely even completely vendor independent since I'm not sure what a bus 
> > reset will do to this configuration and restoring it all the time should 
> > be the most defensive approach.

> 
> Hmm, this should already be used by the bus/slot reset path:
> 
> pci_bus_restore_locked()/pci_slot_restore_locked()
>  pci_dev_restore()
>   pci_restore_state()
>    pci_restore_rebar_state()
> 
> VFIO support for resizeable BARs has been on my todo list, but I don't
> have access to any systems that have both a capable device and >4G
> decoding enabled in the BIOS.  If we have a consistent view of the BAR
> size after the BARs are expanded, I'm not sure why it doesn't just
> work.  FWIW, QEMU currently hides the REBAR capability to the guest
> because the kernel driver doesn't support emulation through config
> space (ie. it's read-only, which the spec doesn't support).
> 
> AIUI, resource allocation can fail when enabling REBAR support, which
> is a problem if the failure occurs on the host but not the guest since
> we have no means via the hardware protocol to expose such a condition.
> Therefore the model I was considering for vfio-pci would be to simply
> pre-enable REBAR at the max size.  It might be sufficiently safe to
> test BAR expansion on initialization and then allow user control, but
> I'm concerned that resource availability could change while already in
> use by the user.  Thanks,

As mentioned in other replies in this thread, and as was my first
thought about this as well, this will indeed break on devices which
don't accurately report the maximum bar size that they actually need.
Even the spec itself says that determining the optimal bar size is
vendor specific.

Could we also allow the guest to resize the bar and, if that fails,
expose the error via a virtual AER message on the root port
where the device is attached?

I personally don't know if this is possible/worth it.


Best regards,
	Maxim Levitsky

> 
> Alex



