[Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO

Alex Williamson alex.williamson at redhat.com
Wed Apr 12 16:50:45 UTC 2023


On Tue, 11 Apr 2023 21:01:06 -0300
Jason Gunthorpe <jgg at nvidia.com> wrote:

> On Tue, Apr 11, 2023 at 03:58:27PM -0600, Alex Williamson wrote:
> 
> > > Management tools already need to understand dev_set if they want to
> > > offer reliable reset support to the VMs. Same as today.  
> > 
> > I don't think that's true. Our primary hot-reset use case is GPUs and
> > subordinate functions, where the isolation and reset scope are often
> > sufficiently similar to make hot-reset possible, regardless whether
> > all the functions are assigned to a VM.  I don't think you'll find any
> > management tools that takes reset scope into account otherwise.  
> 
> When I think of "reliable reset support" I think of the management
> tool offering a checkbox that says "ensure PCI function reset
> availability" and if checked it will not launch the VM without a
> working reset.

This doesn't exist.

> If the user configures a set of VFIO devices and then hopes they get
> working reset, that is fine, but doesn't require any reporting of
> reset groups, or iommu groups to the management layer to work.

I think there's more than hope involved here, there are recipes to
create working hot-reset configurations because it is well specified
and predictable currently.  QEMU can indicate whether hot-reset is
available thanks to the information provided in the INFO ioctl and a VM
that owns the necessary set of groups may consistently and repeatedly
perform hot-resets.

> > > > As I understand the proposal, QEMU now gets to attempt to
> > > > claim ownership of the dev_set, so it opportunistically extends its
> > > > ownership and may block other users from the affected devices.    
> > > 
> > > We can decide the policy for the kernel to accept a claim. I suggested
> > > below "same as today" - it must hold all the groups within the
> > > iommufd_ctx.  
> > 
> > It must hold all the groups [that the user doesn't know about because
> > it's not a formal part of the cdev API] within the iommufd_ctx?  
> 
> You keep going back to this, but I maintain userspace doesn't
> care. qemu is given a list of VFIO devices to use, all it wants to
> know is if it is allowed to use reset or not. Why should it need to
> know groups and group_ids to get that binary signal out of the kernel?

hw/vfio/pci.c:2320
        error_report("vfio: Cannot reset device %s, "
                     "depends on group %d which is not owned.",
                     vdev->vbasedev.name, devices[i].group_id);

That creates a feedback loop where a user can take corrective action
with actual information in hand to resolve the issue.

> > > The simplest option for no-iommu is to require it to pass in every
> > > device fd to the reset ioctl.  
> > 
> > Which ironically is exactly how it ends up working today, each no-iommu
> > device has a fake IOMMU group, so every affected device (group) needs
> > to be provided.  
> 
> Sure, that is probably the way forward for no-iommu. Not that anyone
> uses it..
> 
> The kicker is we don't force the user to generate a de-duplicated list
> of devices FDs, one per group, just because.

So on one hand you're asking for simplicity, but on the other you're
criticizing a trivial simplification that we chose to allow the user to
pass number of group fds equal to number of devices affected so that
the user doesn't need to take that step to de-duplicate the list.  We
can't win.
 
> > > I want to re-focus on the basics of what cdev is supposed to be doing,
> > > because several of the idea you suggested seem against this direction:
> > > 
> > >  - cdev does not have, and cannot rely on vfio_groups. We enforce this
> > >    by compiling all the vfio_group infrastructure out. iommu_groups
> > >    continue to exist.
> > >    
> > >    So converting a cdev to a vfio_group is not an allowed operation.  
> > 
> > My only statements in this respect were towards the notion that IOMMU
> > groups continue to exist.  I'm well aware of the desire to deprecate
> > and remove vfio groups.  
> 
> Yes
> 
> > >  - no-iommu should not have iommu_groups. We enforce this by compiling
> > >    out all the no-iommu vfio_group infrastructure.  
> > 
> > This is not logically inferred from the above if IOMMU groups continue
> > to exist and continue to be a basis for describing DMA ownership as
> > well as "reset groups"  
> 
> It is not ment to flow out of the above, it is a seperate statement. I
> want the iommu_group mechanism to stop being abused outside the iommu
> core code. The only thing that should be creating groups is an
> attached iommu driver operating under ops->device_group().
> 
> VFIO needed this to support mdev and no-iommu. We already have mdev
> free of iommu_groups, I would like no-iommu to also be free of it too,
> we are very close.
> 
> That would leave POWER as the only abuser of the
> iommu_group_add_device() API, and it is only doing it because it
> hasn't got a proper iommu driver implementation yet. It turns out
> their abuse is mislocked and maybe racy to boot :(
> 
> > >  - cdev APIs should ideally not require the user to know the group_id,
> > >    we should try hard to design APIs to avoid this.  
> > 
> > This is a nuance, group_id vs group, where it's been previously
> > discussed that users will need to continue to know the boundaries of a
> > group for the purpose of DMA isolation and potentially IOAS
> > independence should cdev/iommufd choose to tackle those topics.  
> 
> Yes, group_id is a value we have no specific use for and would require
> userspace to keep seperate track of. I'd prefer to rely on dev_id as
> much as possible instead.

But dev-id only has meaning in relation to an iommufd_ctx, so it fails
to be useful in the context of implied ownership.

> > What is the actual proposal here?  
> 
> I don't know anymore, you don't seem to like this direction either...
> 
> > You've said that hot-reset works if the iommufd_ctx has
> > representation from each affected group, the INFO ioctl remains as
> > it is, which suggests that it's reporting group ID and BDF, yet only
> > sysfs tells the user the relation between a vfio cdev and a group
> > and we're trying to enable a pass-by-fd model for cdev where the
> > user has no reference to a sysfs node for the device.  Show me how
> > these pieces fit together.  
> 
> I prefer the version where INFO2 returns the dev_id, but info can work
> if we do the BDF cap like you suggested to Yi

As discussed ad nauseam, dev-id is useless if an affected device is not
already within the iommufd ctx.  BDF provides a mapping to specific
affected devices, but can't express implied ownership.  Group id
provides the implied ownership, but can't express specific devices.  As
Yi has pointed out, QEMU needs to know both if it has ownership of all
the affected devices, both direct and implied, and which specific
devices that it owns are affected.

> > OTOH, if we say IOMMU groups continue to exist [agreed], every vfio
> > device has an IOMMU group  
> 
> I don't desire every VFIO device to have an iommu_group. I want VFIO
> devices with real IOMMU drivers to have an iommu_group. mdev and
> no-iommu should not. I don't want to add them back into the design
> just so INFO has a value to return.
> 
> I'd rather give no-iommu a dummy dev_id in iommufdctx then give it an
> iommu_group...

It's not been shown to me that dev-id is a useful replacement for
anything here.

> I see this problem as a few basic requirements from a qemu-like
> application:
> 
>  1) Does the configuration I was given support reset right now?
>  2) Will the configuration I was given support reset for the duration
>     of my execution?
>  3) What groups of the devices I already have open does the reset
>     effect?
>  4) For debugging, report to the user the full list of devices in the
>     reset group, in a way that relates back to sysfs.
>  5) Away to trigger a reset on a group of devices
> 
> #1/#2 is the API I suggested here. Ask the kernel if the current
> configuration works, and ask it to keep it working.

That is super sketchy because you're also advocating for
opportunistically supporting reset if the instantaneous conditions
allow is (ex. unopened devices), and going back and forth whether "ask
it to keep working" suggests that a user is able to extend their
granted ownership themselves.  I think both needs to be based on some
form of granted, not requested, ownership and not opportunism.

> #3 is either INFO and a CAP for BDF or INFO2 reporting dev_id

Where dev-id is useful for... ?  I think there's a misuse of "groups"
in 3) above, userspace needs to know specific devices affected, thus
BDF.

> #4 is either INFO and print the BDFs or INFO2 reporting the struct
> vfio_device IDR # (eg /sys/class/vfio/vfioXXX/).

We can't assume that all the affected devices are bound to vfio,
therefore we cannot assume a vfio_device IDR exists.

> #5 is adjusting the FD list in existing RESET ioctl. Remove the need
> for userspace to specify a minimal exact list of FDs means userspace
> doesn't need the information to figure out what that list actually
> is. Pass a 0 length list and use iommufdctx.

"...doesn't need the information to figure out what the list actually
is."  That's false, userspace needs the information whether it uses it
to make a list or not, ex. pre- and post-reset processing of specific
affected devices.  Furthermore, supporting a zero length array removes
context from the existing ioctl, which has been shown to make it prone
to creating gaps in legacy group use cases, so I don't understand why
this optimization is so pervasive or important.
 
> None of these requirements suggests to me that qemu needs to know the
> group_id, or that it needs to have enough information to know how to
> fix an unavailable reset.
> 
> Did I miss a requirement here?

So what is the exact proposal?  We can't have an INFO ioctl that simply
returns error if the ownership requirements are not met as that doesn't
support 4).  So we need one or more ioctls that a) indicates whether
the ownership requirements are met and b) indicates the set of affected
devices.  Is b) only the set of affected devices within the calling
devices iommufd_ctx (ie. dev-ids), in which case we need c) a way to
report the overall set of affected devices regardless of ownership in
support of 4), BDF?

Are we back to replacing group-ids with dev-ids in the INFO structure,
where an invalid dev-id either indicates an affected device with
implied ownership (ok) or a gap in ownership (bad) and a flag somewhere
is meant to indicate the overall disposition based on the availability
of reset?  I'm not sure how that fully supports 4) since the user can't
determine if a given invalid dev-id is in fact a blocker, so do we end
up with multiple invalid IDs, perhaps one to indicate unknown but ok
and another to indicate an ownership gap?  Are devices outside of the
iommufd_ctx, but with implied ownership via group omitted entirely from
the lists?  I think we need an actual proposal here.  Thanks,

Alex



More information about the Intel-gfx mailing list