[Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
Alex Williamson
alex.williamson at redhat.com
Wed Apr 5 19:49:45 UTC 2023
On Wed, 5 Apr 2023 16:21:09 -0300
Jason Gunthorpe <jgg at nvidia.com> wrote:
> On Wed, Apr 05, 2023 at 12:56:21PM -0600, Alex Williamson wrote:
> > Usability needs to be a consideration as well. An interface where the
> > result is effectively arbitrary from a user perspective because the
> > kernel is solely focused on whether the operation is allowed,
> > evaluating constraints that the user is unaware of and cannot control,
> > is unusable.
>
> Considering this API is only invoked by qemu we might be overdoing
> this usability and 'no shoot in foot' view.
Ok, I'm not sure why we're diminishing the de facto vfio userspace...
> > > This is a good point that qemu needs to make a policy decision if it
> > > is happy about the VFIO configuration - but that is a policy decision
> > > that should not become entangled with the kernel's security checks.
> > >
> > > Today qemu can make this policy choice the same way it does right now
> > > - call _INFO and check the group_ids. It gets the exact same outcome
> > > as today. We already discussed that we need to expose the group ID
> > > through an ioctl someplace.
> >
> > QEMU can make a policy decision today because the kernel provides a
> > sufficiently reliable interface, ie. based on the set of owned groups, a
> > hot-reset is all but guaranteed to work.
>
> And we don't change that with cdev. If qemu wants to make the policy
> decision it keeps using the exact same _INFO interface to make that
> decision same it has always made.
>
> We weaken the actual reset action to only consider the security side.
>
> Applications that want this exclusive reset group policy simply must
> check it on their own. It is a reasonable API design.
I disagree, as I've argued before, the info ioctl becomes so weak and
effectively arbitrary from a user perspective at being able to predict
whether the hot-reset ioctl works that it becomes useless, diminishing
the entire hot-reset info/execute API.
> > > If this is too awkward we could add a query to the kernel if the cdev
> > > is "reset exclusive" - eg the iommufd covers all the groups that span
> > > the reset set.
> >
> > That's essentially what we have if there are valid dev-ids for each
> > affected device in the info ioctl.
>
> If you have dev-ids for everything, yes. If you don't, then you can't
> make the same policy choice using a dev-id interface.
Exactly, you can't make any policy choice because the success or
failure of the hot-reset ioctl can't be known.
> > I don't think it helps the user experience to create loopholes where
> > the hot-reset ioctl can still work in spite of those missing
> > devices.
>
> I disagree. The easy straightforward design is that the reset ioctl
> works if the process has security permissions. Mixing a policy check
> into the kernel on this path is creating complexity we don't really
> need.
>
> I don't view it as a loophole, it is flexability to use the API in a
> way that is different from what qemu wants - eg an app like dpdk may
> be willing to tolerate a reset group that becomes unavailable after
> startup. Who knows, why should we force this in the kernel?
Because look at all the problems it's causing to try to introduce these
loopholes without also introducing subtle bugs. There's an argument
that we're overly strict, which is better than the alternative, which
seems to be what we're dabbling with. It is a straightforward
interface for the hot-reset ioctl to mirror the information provided
via the hot-reset info ioctl.
> > For example, we have a VFIO_DEVICE_GET_INFO ioctl that supports
> > capability chains, we could add a capability that reports the group ID
> > for the device.
>
> I was going to put that in an iommufd ioctl so it works with VDPA too,
> but sure, lets assume we can get the group ID from a cdev fd.
>
> > The hot-reset info ioctl remains as it is today, reporting group-ids
> > and bdfs.
>
> Sure, but userspace still needs to know how to map the reset sets into
> dev-ids.
No, it doesn't.
> Remember the reason we started doing this is because we don't
> have easy access to the BDF anymore.
We don't need it, the info ioctl provides the groups, the group
association can be learned from the DEVICE_GET_INFO ioctl, the
hot-reset ioctl only requires a single representative fd per affected
group. dev-ids not required.
> I like leaving this ioctl alone, lets go back to a dedicated ioctl to
> return the dev_ids.
I don't see any justification for this. We could add another PCI
specific DEVICE_GET_INFO capability to report the bdf if we really need
it, but reporting the group seems sufficient for this use case.
> > The hot-reset ioctl itself is modified to transparently
> > support either group fds or device fds. The user can now map cdevs
> > to group-ids and therefore follow the same rules as groups,
> > providing at least one representative device fd for each group.
>
> This looks like a very complex uapi compared to the empty list option,
> but it seems like it would work.
It's the same API that we have now. What's complex is trying to figure
out all the subtle side-effects from the loopholes that are being
proposed in this series. Thanks,
Alex
More information about the Intel-gfx
mailing list