[Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
Tian, Kevin
kevin.tian at intel.com
Tue Apr 18 03:24:46 UTC 2023
> From: Alex Williamson <alex.williamson at redhat.com>
> Sent: Tuesday, April 18, 2023 4:07 AM
>
> On Mon, 17 Apr 2023 16:31:56 -0300
> Jason Gunthorpe <jgg at nvidia.com> wrote:
>
> > On Mon, Apr 17, 2023 at 01:01:40PM -0600, Alex Williamson wrote:
> > > Yes, it's not trivial, but Jason is now proposing that we consider
> > > mixing groups, cdevs, and multiple iommufd_ctxs as invalid. I think
> > > this means that regardless of which device calls INFO, there's only one
> > > answer (assuming same set of devices opened, all cdev, all within same
> > > iommufd_ctx). Based on what I explained about my understanding of INFO2
> > > and Jason agreed to, I think the output would be:
> > >
> > > flags: NOT_RESETTABLE | DEV_ID
> > > {
> > > { valid devA-id, devA-BDF },
> > > { valid devC-id, devC-BDF },
> > > { valid devD-id, devD-BDF },
> > > { invalid dev-id, devE-BDF },
> > > }
> > >
> > > Here devB gets dropped because the kernel understands that devB is
> > > unopened, affected, and owned. It's therefore not a blocker for
> > > hot-reset.
> >
> > I don't think we want to drop anything because it makes the API
> > ill suited for the debugging purpose.
> >
> > devB should be returned with an invalid dev_id if I understand your
> > example. Maybe it should return with -1 as the dev_id instead of 0, to
> > make the debugging a bit better.
> >
> > Userspace should look at only NOT_RESETTABLE to determine if it
> > proceeds or not, and it should use the valid dev_id list to iterate
> > over the devices it has open to do the config stuff.
>
> If an affected device is owned, not opened, and not interfering with
> the reset, what is it adding to the API to report it for debugging
> purposes?

Consistent output before and after devB is opened.

> I'm afraid this leads into expanding "invalid dev-id" into an
> errno or bitmap of error conditions that the user needs to parse.
>
Not exactly.

If RESETTABLE, an invalid dev_id doesn't matter. The user only uses the
valid dev_id list to iterate, as Jason pointed out.

If NOT_RESETTABLE because devE is not assigned to the VM, one can easily
figure that out by comparing the list of affected BDFs with the
configuration of the VM's assigned devices. An invalid dev_id doesn't
matter then either.

If NOT_RESETTABLE while devE is already assigned to the VM, that
indicates mixing of groups, cdevs, or multiple iommufd_ctxs. People
should then debug with other means/hints to dig out the exact culprit.
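
The three cases above can be modelled in a few lines. This is only a
sketch of the userspace decision logic under discussion, not kernel or
QEMU code: the flag values, the -1 marker (Jason's suggestion), the
AffectedDevice type, both function names, and the BDFs are hypothetical
and do not correspond to a merged UAPI.

```python
from dataclasses import dataclass
from typing import List, Optional, Set

# Hypothetical names and values: they model the thread, not a real UAPI.
FLAG_DEV_ID = 1 << 0          # entries carry dev_ids rather than group ids
FLAG_NOT_RESETTABLE = 1 << 1  # kernel judged that hot reset cannot proceed
INVALID_DEV_ID = -1           # marker for "no valid dev_id" (assumed -1)


@dataclass(frozen=True)
class AffectedDevice:
    dev_id: int  # a valid dev_id, or INVALID_DEV_ID
    bdf: str     # PCI address of the affected device


def devices_to_reset(flags: int,
                     devices: List[AffectedDevice]) -> Optional[List[int]]:
    """Return the valid dev_ids to iterate over, or None if NOT_RESETTABLE."""
    if flags & FLAG_NOT_RESETTABLE:
        return None
    return [d.dev_id for d in devices if d.dev_id != INVALID_DEV_ID]


def diagnose(flags: int, devices: List[AffectedDevice],
             assigned_bdfs: Set[str]) -> str:
    """Classify an INFO result along the three cases described above."""
    if not (flags & FLAG_NOT_RESETTABLE):
        return "resettable: invalid dev_ids can be ignored"
    # Compare the affected BDFs with the VM's configured devices.
    unassigned = sorted(d.bdf for d in devices if d.bdf not in assigned_bdfs)
    if unassigned:
        return ("not resettable: affected devices not assigned to the VM: "
                + ", ".join(unassigned))
    return ("not resettable: all affected devices are assigned; "
            "suspect mixed groups, cdevs, or iommufd_ctxs")


# Alex's example output, with made-up BDFs standing in for devA/C/D/E.
devices = [
    AffectedDevice(1, "0000:01:00.0"),               # devA
    AffectedDevice(2, "0000:01:00.1"),               # devC
    AffectedDevice(3, "0000:01:00.2"),               # devD
    AffectedDevice(INVALID_DEV_ID, "0000:01:00.3"),  # devE
]
flags = FLAG_NOT_RESETTABLE | FLAG_DEV_ID
assigned = {"0000:01:00.0", "0000:01:00.1", "0000:01:00.2"}
print(diagnose(flags, devices, assigned))
```

Note that neither function inspects why a dev_id is invalid: the
proceed/abort decision depends only on the flag, and the BDF comparison
covers the common misconfiguration without any per-device error code.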