[Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO

Jason Gunthorpe jgg at nvidia.com
Tue Apr 11 18:40:07 UTC 2023


On Tue, Apr 11, 2023 at 11:11:17AM -0600, Alex Williamson wrote:
> [Appears the list got dropped, replying to my previous message to re-add]

Wowo this got mesed up alot, mutt drops the cc when replying for some
reason. I think it is fixed up now

> > Our cdev model says that opening a cdev locks out other cdevs from
> > independent use, eg because of the group sharing. Extending this to
> > include the reset group as well seems consistent.
> 
> The DMA ownership model based on the IOMMU group is consistent with
> legacy vfio, but now you're proposing a new ownership model that
> optionally allows a user to extend their ownership, opportunistically
> lock out other users, and wreaking havoc for management utilities that
> also have no insight into dev_sets or userspace driver behavior.

I suggested below that the owership require enough open devices - so
it doesn't "extend ownership opportunistically", and there is no
havoc.

Management tools already need to understand dev_set if they want to
offer reliable reset support to the VMs. Same as today.
 
> > There is some security concern here, but that goes both ways, a 3rd
> > party should not be able to break an application that needs to use
> > this RESET and had sufficient privileges to assert an ownership.
> 
> There are clearly scenarios we have now that could break.  For example,
> today if QEMU doesn't own all the IOMMU groups for a mult-function
> device, it can't do a reset, the remaining functions are available for
> other users. 

Sure, and we can keep that with this approach.

> As I understand the proposal, QEMU now gets to attempt to
> claim ownership of the dev_set, so it opportunistically extends its
> ownership and may block other users from the affected devices.

We can decide the policy for the kernel to accept a claim. I suggested
below "same as today" - it must hold all the groups within the
iommufd_ctx.

The main point is to make this claiming operation qemu needs to do
clearer and more explicit. I view this as better than trying to guess
if it successfully made the claim by inspecting the _INFO output.

> > I'd say anyone should be able to assert RESET ownership if, like
> > today, the iommufd_ctx has all the groups of the dev_set inside
> > it. Once asserted it becomes safe against all forms of hotplug, and
> > continues to be safe even if some of the devices are closed. eg hot
> > unplugging from the VM doesn't change the availability of RESET.
> > 
> > This comes from your ask that qemu know clearly if RESET works, and it
> > doesn't change while qemu is running. This seems stronger and clearer
> > than the current implicit scheme. It also doesn't require usespace to
> > do any calculations with groups or BDFs to figure out of RESET is
> > available, kernel confirms it directly.
> 
> As above, clarity and predictability seem lacking in this proposal.
> With the current scheme, the ownership of the affected devices is
> implied if they exist within an owned group, but the strength of that
> ownership is clear.  

Same logic holds here

Ownership is claimed same as today by having all groups representated
in the iommufd_ctx. This seems just as clear as today.

> > > seems this proposal essentially extends the ownership model to the
> > > greater of the dev_set or iommu group, apparently neither of which
> > > are explicitly exposed to the user in the cdev API.  
> > 
> > IIRC the group id can be learned from sysfs before opening the cdev
> > file. Something like /sys/class/vfio/XX/../../iommu_group
> 
> And in the passed cdev fd model... ?

IMHO we should try to avoid needing to expose group_id specifically to
userspace. We are missing a way to learn the "same ioas" restriction
in iommufd, and it should provide that directly based on dev_ids.

Otherwise if we really really need group_id then iommufd should
provide an ioctl to get it. Let's find a good reason first

> > We should also have an iommufd ioctl to report the "same ioas"
> > groupings of dev_ids to make it easy on userspace. I haven't checked
> > to see what the current qemu patches are doing with this..
> 
> Seems we're ignoring that no-iommu doesn't have a valid iommufd.

no-iommu doesn't and shouldn't have iommu_groups either. It also
doesn't have an IOAS so querying for same-IOAS is not necessary.

The simplest option for no-iommu is to require it to pass in every
device fd to the reset ioctl.

> > > How does a user determine when devices cannot be used independently
> > > in the cdev API?   
> > 
> > We have this problem right now. The only way to learn the reset group
> > is to call the _INFO ioctl. We could add a sysfs "pci_reset_group"
> > under /sys/class/vfio/XX/ if something needs it earlier.
> 
> For all the complaints about complexity, now we're asking management
> tools to not only take into account IOMMU groups, but also reset
> groups, and some inferred knowledge about the application and devices
> to speculate whether reset group ownership is taken by a given
> userspace??

No, we are trying to keep things pretty much the same as today without
resorting to exposing a lot of group related concepts.

The reset group is a clear concept that already exists and isn't
exposed. If we really need to know about it then it should be exposed
on its own, as a seperate discussion from this cdev stuff.

I want to re-focus on the basics of what cdev is supposed to be doing,
because several of the idea you suggested seem against this direction:

 - cdev does not have, and cannot rely on vfio_groups. We enforce this
   by compiling all the vfio_group infrastructure out. iommu_groups
   continue to exist.
   
   So converting a cdev to a vfio_group is not an allowed operation.

 - no-iommu should not have iommu_groups. We enforce this by compiling
   out all the no-iommu vfio_group infrastructure.

 - cdev APIs should ideally not require the user to know the group_id,
   we should try hard to design APIs to avoid this.

We have solved every other problem but reset like this, I would like
to get past reset without compromising the above.

Jason


More information about the Intel-gfx mailing list