[Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO

Alex Williamson alex.williamson at redhat.com
Thu Apr 6 17:53:47 UTC 2023


On Thu, 6 Apr 2023 10:02:10 +0000
"Liu, Yi L" <yi.l.liu at intel.com> wrote:

> > From: Jason Gunthorpe <jgg at nvidia.com>
> > Sent: Thursday, April 6, 2023 7:23 AM
> > 
> > On Wed, Apr 05, 2023 at 01:49:45PM -0600, Alex Williamson wrote:
> >   
> > > > > QEMU can make a policy decision today because the kernel provides a
> > > > > sufficiently reliable interface, ie. based on the set of owned groups, a
> > > > > hot-reset is all but guaranteed to work.  
> > > >
> > > > And we don't change that with cdev. If qemu wants to make the policy
> > > > decision it keeps using the exact same _INFO interface to make that
> > > > decision same it has always made.
> > > >
> > > > We weaken the actual reset action to only consider the security side.
> > > >
> > > > Applications that want this exclusive reset group policy simply must
> > > > check it on their own. It is a reasonable API design.  
> > >
> > > I disagree, as I've argued before, the info ioctl becomes so weak and
> > > effectively arbitrary from a user perspective at being able to predict
> > > whether the hot-reset ioctl works that it becomes useless, diminishing
> > > the entire hot-reset info/execute API.  
> > 
> > reset should be strictly more permissive than INFO. If INFO predicts
> > reset is permitted then reset should succeed.
> > 
> > We don't change INFO so it cannot "becomes so weak"  ??
> > 
> > We don't care about the cases where INFO says it will not succeed but
> > reset does (temporarily) succeed.
> > 
> > I don't get what argument you are trying to make or what you think is
> > diminished..
> > 
> > Again, userspace calls INFO, if info says yes then reset *always
> > works*, exactly just like today.
> >
> > Userspace will call reset with a 0 length FD list and it uses a
> > security only check that is strictly more permissive than what
> > get_info will return. So the new check is simple in the kernel and
> > always works in the cases we need it to work.
> > 
> > What is getting things into trouble is insisting that RESET have
> > additional restrictions beyond the minimum checks required for
> > security.
> >   
> > > > I don't view it as a loophole, it is flexability to use the API in a
> > > > way that is different from what qemu wants - eg an app like dpdk may
> > > > be willing to tolerate a reset group that becomes unavailable after
> > > > startup. Who knows, why should we force this in the kernel?  
> > >
> > > Because look at all the problems it's causing to try to introduce these
> > > loopholes without also introducing subtle bugs.  
> > 
> > These problems are coming from tring to do this integrated version,
> > not from my approach!
> > 
> > AFAICT there was nothing wrong with my original plan of using the
> > empty fd list for reset. What Yi has here is some mashup of what you
> > and I both suggested.  
> 
> Hi Alex, Jason,
> 
> could be this reason. So let me try to gather the changes of this series
> does and the impact as far as I know.
> 
> 1) only check the ownership of opened devices in the dev_set
>      in HOT_RESET ioctl.
>      - Impact: it changes the relationship between _INFO  and HOT_RESET.
>        As " Each group must have IOMMU protection established for the
>        ioctl to succeed." in [1], existing design actually means userspace
>        should own all the affected groups before heading to do HOT_RESET.
>        With the change here, the user does not need to ensure all affected
>        groups are opened and it can do hot-reset successfully as long as the
>        devices in the affected group are just un-opened and can be reset.
>     
>        [1] https://patchwork.kernel.org/project/linux-pci/patch/20130814200845.21923.64284.stgit@bling.home/

Where whether a device is opened is subject to change outside of the
user's control.  This essentially allows the user to perform hot-resets
of devices outside of their ownership so long as the device is not
used elsewhere, versus the current requirement that the user own all the
affected groups, which implies device ownership.  It's not been
justified why this feature needs to exist, imo.
 
> 2) Allow passing zero-length fd array to do hot reset
>     - Impact: this uses the iommufd as ownership check in the kernel side.
>       It is only supposed to be used by the users that open cdev instead of
>       users that open group. The drawback is that it cannot cover the noiommu
>       devices as noiommu does not use iommufd at all. But it works well for
>       most cases.

The "only supposed to be used" is problematic here, we're extending all
the interfaces to transparently accept group and device fds, but here
we need to make a distinction because the ioctl needs to perform one
way for groups and another way for devices, which it currently doesn't
do.  As above, I've not seen sufficient justification for this other
than references to reducing complexity, but the only userspace expected
to make use of this interface already has equivalent complexity.
 
> 3) Allow hot reset be successful when the dev_set is singleton
>      - Impact: this makes sense but it seems to mess up the boundary between
>      the group path and cdev path w.r.t. the usage of zero-length fd approach.
>      The group path can succeed to do hot reset even if it is passing an empty
>      fd array if the dev_set happens to be singleton.

Again, what is the justification for requiring this, it seems to be
only a hack towards no-iommu support with cdev, which we can achieve by
other means.  Why have we not needed this in the group model?  It
introduces subtle loopholes, so while maybe we could, I don't see why we
should, therefore I cannot agree with "this makes sense".

> 4) Allow passing device fd to do hot reset
>     - Impact: this is a new way for hot reset. should have no impact.
> 
> 5) Extend the _INFO to report devid
>     - Impact: this changes the way user to decode the info reported back.
>     devid and groupid are returned per the way the queried device is opened.
>     Since it was suggested to support the scenario in which some devices
>     are opened via cdev while some devices are opened via group. This makes
>     us to return invalid_devid for the device that is opened via group if
>     it is affected by the hot reset of a device that is opened via cdev.
>     
>     This was proposed to support the future device fd passing usage which is
>     only available in cdev path.

I think this is fundamentally flawed because of the scope of the
dev-id.  We can only provide dev-ids for devices which belong to the
same iommufd of the calling device, thus there are multiple instances
where no dev-id can be provided.  The group-id and bdf are static
properties of the devices, regardless of their ownership.  The bdf
provides the specific device level association while the group-id
indicates implied, static ownership.

> To me the major confusion is from 1) and 3). 1) changes the meaning of
> _INFO and HOT_RESET, while 3) messes up the boundary.

As above, I think 2) is also an issue.

> Here is my thought:
> 
> For 1), it was proposed due to below reason[2]. We'd like to make a scenario
> that works in the group path be workable in cdev path as well. But IMHO, we
> may just accept that cdev path cannot work for such scenario to avoid sublte
> change to uapi. Otherwise, we need to have another HOT_RESET ioctl or a
> hint in HOT_RESET ioctl to tell the kernel  whether relaxed ownership check
> is expected. Maybe this is awkward. But if we want to keep it, we'd do it
> with the awareness by user.
> 
> [2] https://lore.kernel.org/kvm/Y%2FdobS6gdSkxnPH7@nvidia.com/

The group association is that relaxed ownership test.  Yes, there are
corner cases where we have a dual function card with separate IOMMU
groups, where a user owning function 0 could do a bus reset because
function 1 is temporarily unused, but so what, what good is that, have
we ever had an issue raised because of this?  The user can't rely on
the unopened state of the other function.  It's an entirely
opportunistic optimization.

The much more typical scenario is that a multi-function device does not
provide isolation, all the functions are in the same group and because
of the association of the group the user has implied ownership of the
other devices for the purpose of a reset.

> For 3), it was proposed when discussing the hot reset for noiommu[3]. But
> it does not make hot reset always workable for noiommu in cdev, just in
> case dev_set is singleton. So it is more of a general optimization that can
> make the kernel skip the ownership check. But to make use of it, we may
> need to test it before sanitizing the group fds from user or the iommufd
> check. Maybe the dev_set singleton test in this series is not well placed.
> If so, I can further modify it.
> 
> [3] https://lore.kernel.org/kvm/ZACX+Np%2FIY7ygqL5@nvidia.com/

As above, this seems to be some optimization related to no-iommu for
cdev because we don't have an iommufd association for the device in
no-iommu mode.  Note however that the current group interface doesn't
care about the IOMMU context of the devices.  We only need proof that
the user owns the affected groups.  So why are we bringing iommufd
context anywhere into this interface, here or the null-array interface?

It seems like the minor difference with cdev is that a) we're passing
device fds rather than group fds, and b) those device fds need to be
validated as having device access to complete the proof of ownership
relative to the group.  Otherwise we add capabilities to
DEVICE_GET_INFO to support the device fd passing model where the user
doesn't know the device group or bdf and allow the reset ioctl itself
to accept device fds (extracting the group relationship for those which
the user has configured for access).  Thanks,

Alex



More information about the Intel-gfx mailing list