[PATCH] drm/radeon: fix asic initialization for virtualized environments

Rodriguez, Andres Andres.Rodriguez at amd.com
Thu Jun 16 15:43:38 UTC 2016



> -----Original Message-----
> From: Alex Deucher [mailto:alexdeucher at gmail.com]
> Sent: June-15-16 1:00 PM
> To: Alex Williamson
> Cc: Mailing list - DRI developers; Deucher, Alexander; Rodriguez, Andres; for
> 3.8
> Subject: Re: [PATCH] drm/radeon: fix asic initialization for virtualized
> environments
> 
> On Wed, Jun 15, 2016 at 12:45 PM, Alex Williamson
> <alex.williamson at redhat.com> wrote:
> > On Wed, 15 Jun 2016 02:23:37 -0400
> > Alex Deucher <alexdeucher at gmail.com> wrote:
> >
> >> On Mon, Jun 13, 2016 at 4:10 PM, Alex Williamson
> >> <alex.williamson at redhat.com> wrote:
> >> > On Mon, 13 Jun 2016 15:45:20 -0400
> >> > Alex Deucher <alexdeucher at gmail.com> wrote:
> >> >
> >> >> When executing in a PCI passthrough based virtualization
> >> >> environment, the hypervisor will usually attempt to send a PCIe
> >> >> bus reset signal to the ASIC when the VM reboots. In this
> >> >> scenario, the card is not correctly initialized, but we still
> >> >> consider it to be posted. Therefore, in a passthrough based
> >> >> environment we should always post the card to guarantee it is in a
> >> >> good state for driver initialization.
> >> >>
> >> >> Ported from amdgpu commit:
> >> >> amdgpu: fix asic initialization for virtualized environments
> >> >>
> >> >> Cc: Andres Rodriguez <andres.rodriguez at amd.com>
> >> >> Cc: Alex Williamson <alex.williamson at redhat.com>
> >> >> Signed-off-by: Alex Deucher <alexander.deucher at amd.com>
> >> >> Cc: stable at vger.kernel.org
> >> >> ---
> >> >>  drivers/gpu/drm/radeon/radeon_device.c | 21 +++++++++++++++++++++
> >> >>  1 file changed, 21 insertions(+)
> >> >
> >> > Thanks, I expect it's an improvement, though it's always a bit
> >> > disappointing when a driver starts modifying its behavior based on
> >> > what might be a transient feature of the platform, in this case a
> >> > hypervisor platform.  For instance, why does our bus reset and
> >> > video ROM execution result in a different state than a physical
> >> > BIOS doing the same?  Can't this condition occur regardless of a
> >> > hypervisor,
> >>
> >> Just doing a pci reset is not enough on newer cards.  The hw handling of
> >> pci resets changed in CI and more of the logic moved into the driver.
> >
> > Gag, please relay my disapproval to your hardware folks.
> >
> >> That does a limited reset, but it does not touch the registers that the
> >> driver checks to determine whether or not the asic has been posted, so
> >> the driver skips posting and leaves the hw in a bad reset state.
> >>
> >> > perhaps a rare hot-add of a GPU, a bare metal kexec reboot, or
> >> > perhaps simply a system BIOS optimized to post a limited set of devices.
> >>
> >> We can tell if a card has never been posted and properly post it.
> >> Where it's tricky is when a card has been posted and has subsequently
> >> been pci reset on CI and newer hw.  I'm not sure of a good way to
> >> detect this particular scenario.  Generally this is mainly done for
> >> qemu/kvm.
> >
> > How do you tell if a card has never been posted?  Is it something we
> > could easily toggle after a bus reset?
> 
> We check CONFIG_MEMSIZE, which is a scratch register set by the asic_init
> command table to tell the driver how much vram is on the board.
> 
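
For anyone following along, that posted check boils down to reading the
scratch register back, roughly like this (a sketch, not the exact driver
code; the real implementation picks the register offset per ASIC family):

static bool card_posted(struct radeon_device *rdev)
{
        uint32_t reg;

        /* asic_init writes the vram size here; zero means never posted */
        reg = RREG32(CONFIG_MEMSIZE);
        if (reg)
                return true;

        return false;
}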

Yeah, detecting a specific virtualization environment is something that I don't
really like. Especially since there isn't a nice generic way to do this for all
environments (for instance, in this patch I used x86-specific functionality). If
there is a generic approach that works across multiple host CPU architectures,
do let me know.
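
For reference, the x86-specific bit is just the CPUID hypervisor flag,
along these lines (a sketch of the idea; the helper name is only indicative,
and KVM, Xen, VMware and Hyper-V all set this bit for their guests):

#ifdef CONFIG_X86
#include <asm/cpufeature.h>     /* boot_cpu_has(), X86_FEATURE_HYPERVISOR */
#endif

static bool radeon_device_is_virtual(void)
{
#ifdef CONFIG_X86
        /* the hypervisor advertises itself in the guest's CPUID bits */
        return boot_cpu_has(X86_FEATURE_HYPERVISOR);
#else
        /* no generic equivalent on other architectures */
        return false;
#endif
}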

Another approach I considered was adding an option, amdgpu.always_post,
that would shift the responsibility onto the user. But I don't think it's really
fair to put that expectation on users, and at some point you start falling into
config hell.
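
If we did go that route it would be nothing more than a module parameter,
something like this (purely hypothetical, since the option was dropped):

#include <linux/moduleparam.h>

/* hypothetical amdgpu.always_post knob -- never actually submitted */
static int amdgpu_always_post;  /* 0 = auto-detect (default), 1 = force post */
module_param_named(always_post, amdgpu_always_post, int, 0444);
MODULE_PARM_DESC(always_post, "Always post the ASIC at init (1 = always, 0 = auto (default))");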

As Alex mentioned, we don't really have a good mechanism for detecting
when we need to post due to how the HW handles PCIe bus reset. Hopefully
we can get this fixed for future ASICs.
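
So the net effect of the patch is simply to treat "running under a
hypervisor" the same as "never posted", roughly (a sketch of the init path,
with the helper names from above; not the actual diff):

        if (!card_posted(rdev) || radeon_device_is_virtual()) {
                if (!rdev->bios) {
                        dev_err(rdev->dev, "Card not posted and no BIOS - ignoring\n");
                        return -EINVAL;
                }
                DRM_INFO("GPU not posted. posting now...\n");
                atom_asic_init(rdev->mode_info.atom_context);
        }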

> >
> >> > Detection based on some state of the device rather than an
> >> > expectation based on what the device is running on seems
> >> > preferable.  I suspect Andres' patch for amdgpu only affects newer
> >> > devices, which pretty much all suffer reset issues, at least under
> >> > QEMU/VFIO, but I wonder how this patch affects existing working
> >> > devices, like 6, 7, and some 8-series.
> >>
> >> Posting the asic at init time should be safe on all asics.
> >>
> >> > Anyway, if this is the solution to the poor behavior we've seen
> >> > with assigned AMD cards, maybe someone could request the same for
> >> > the closed drivers, including Windows.  Thanks,
> >>
> >> The closed drivers already do this.
> >
> > Hmm, that's not terribly encouraging then since the majority of users
> > are running Windows guests for the purpose of creating a gaming VM and
> > still experiencing reset issues with the closed drivers there.
> > Thanks,
> 
> I'll have to check with the Windows team to see how much validation they do
> with the Windows driver as a qemu/kvm guest.  It could be that they don't
> properly detect qemu/kvm as a virtual environment.
> 
> Alex

