[PATCH] drm/radeon: fix asic initialization for virtualized environments

Wed Jun 15 16:45:17 UTC 2016

On Wed, 15 Jun 2016 02:23:37 -0400
Alex Deucher <alexdeucher at gmail.com> wrote:

> On Mon, Jun 13, 2016 at 4:10 PM, Alex Williamson
> <alex.williamson at redhat.com> wrote:
> > On Mon, 13 Jun 2016 15:45:20 -0400
> > Alex Deucher <alexdeucher at gmail.com> wrote:
> >  
> >> When executing in a PCI passthrough based virtualization environment, the
> >> hypervisor will usually attempt to send a PCIe bus reset signal to the
> >> ASIC when the VM reboots. In this scenario, the card is not correctly
> >> initialized, but we still consider it to be posted. Therefore, in a
> >> passthrough based environment we should always post the card to guarantee
> >> it is in a good state for driver initialization.
> >>
> >> Ported from amdgpu commit:
> >> amdgpu: fix asic initialization for virtualized environments
> >>
> >> Cc: Andres Rodriguez <andres.rodriguez at amd.com>
> >> Cc: Alex Williamson <alex.williamson at redhat.com>
> >> Signed-off-by: Alex Deucher <alexander.deucher at amd.com>
> >> Cc: stable at vger.kernel.org
> >> ---
> >>  drivers/gpu/drm/radeon/radeon_device.c | 21 +++++++++++++++++++++
> >>  1 file changed, 21 insertions(+)  
> >
> > Thanks, I expect it's an improvement, though it's always a bit
> > disappointing when a driver starts modifying its behavior based on
> > what might be a transient feature of the platform, in this case a
> > hypervisor platform.  For instance, why does our bus reset and video
> > ROM execution result in a different state than a physical BIOS doing
> > the same?  Can't this condition occur regardless of a hypervisor,  
> 
> Just doing a pci reset is not enough on newer cards.  The hw handling
> pci resets changed in CI and more of the logic moved to the driver.

Gag, please relay my disapproval to your hardware folks.

> That does a limited reset, but it does not clear the registers that the
> driver checks to determine whether or not the asic has been posted, so
> the driver skips posting and leaves the hw in a bad reset state.
> 
> > perhaps a rare hot-add of a GPU, a bare metal kexec reboot, or perhaps
> > simply a system BIOS optimized to post a limited set of devices.  
> 
> We can tell if a card has never been posted and properly post it.
> Where it's tricky is when a card has been posted and has subsequently
> been pci reset on CI and newer hw.  I'm not sure of a good way to
> detect this particular scenario.  Generally this is mainly done for
> qemu/kvm.

How do you tell if a card has never been posted?  Is it something we
could easily toggle after a bus reset?

> > Detection based on some state of the device rather than an expectation
> > based on what the device is running on seems preferable.  I suspect
> > Andres' patch for amdgpu only affects newer devices, which pretty much
> > all suffer reset issues, at least under QEMU/VFIO, but I wonder how this
> > patch affects existing working devices, like 6, 7, and some 8-series.  
> 
> Posting the asic at init time should be safe on all asics.
> 
> > Anyway, if this is the solution to the poor behavior we've seen with
> > assigned AMD cards, maybe someone could request the same for the closed
> > drivers, including Windows.  Thanks,  
> 
> The closed drivers already do this.

Hmm, that's not terribly encouraging then since the majority of users
are running Windows guests for the purpose of creating a gaming VM and
still experiencing reset issues with the closed drivers there.  Thanks,

Alex
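
For readers following the thread, the failure mode discussed above can be
sketched in plain userspace C. This is only an illustration under stated
assumptions, not the code from the patch: struct fake_gpu and every fake_*
function are hypothetical stand-ins, and the is_virtual flag is passed in
by the caller because this message does not show how the driver detects a
hypervisor. The sketch models a limited reset that leaves the "posted"
registers intact, and an init path that always posts when virtualized.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for device state; NOT the real struct radeon_device. */
struct fake_gpu {
    bool posted_regs_set; /* registers the driver reads to decide "posted" */
    bool hw_usable;       /* whether the hardware is actually in a good state */
};

/*
 * Models the limited PCI reset described for CI and newer hardware:
 * the hardware state is lost, but the registers the driver checks are
 * left untouched, so the card still looks posted.
 */
static void fake_pci_reset(struct fake_gpu *gpu)
{
    assert(gpu != NULL);
    gpu->hw_usable = false;
    /* posted_regs_set is intentionally left unchanged */
}

/* Models posting the card: hardware becomes usable and looks posted. */
static void fake_post(struct fake_gpu *gpu)
{
    assert(gpu != NULL);
    gpu->posted_regs_set = true;
    gpu->hw_usable = true;
}

/*
 * Models the "is the card posted?" check. When running virtualized
 * (assumption: the caller supplies this flag), always report "not posted"
 * so that init is forced to post the card.
 */
static bool fake_card_posted(const struct fake_gpu *gpu, bool is_virtual)
{
    assert(gpu != NULL);
    if (is_virtual)
        return false;
    return gpu->posted_regs_set;
}

/* Models driver init; returns true if the hardware is usable afterwards. */
static bool fake_driver_init(struct fake_gpu *gpu, bool is_virtual)
{
    if (!fake_card_posted(gpu, is_virtual))
        fake_post(gpu);
    return gpu->hw_usable;
}
```

In this model, a card that was posted and then hit by fake_pci_reset()
stays unusable after fake_driver_init() with is_virtual set to false,
because the stale registers cause posting to be skipped. With is_virtual
set to true, the card is posted again and ends up usable. The sketch also
shows the concern raised in the thread: the decision depends on the
platform rather than on any state of the device itself.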