[PATCH 0/3] sample: vfio mdev display devices.

Wed Apr 25 15:30:39 UTC 2018

On 4/25/2018 4:29 AM, Alex Williamson wrote:
> On Wed, 25 Apr 2018 01:20:08 +0530
> Kirti Wankhede <kwankhede at nvidia.com> wrote:
> 
>> On 4/24/2018 3:10 AM, Alex Williamson wrote:
>>> On Wed, 18 Apr 2018 12:31:53 -0600
>>> Alex Williamson <alex.williamson at redhat.com> wrote:
>>>   
>>>> On Mon,  9 Apr 2018 12:35:10 +0200
>>>> Gerd Hoffmann <kraxel at redhat.com> wrote:
>>>>  
>>>>> This little series adds three drivers, for demo-ing and testing vfio
>>>>> display interface code.  There is one mdev device for each interface
>>>>> type (mdpy.ko for region and mbochs.ko for dmabuf).    
>>>>
>>>> Erik Skultety brought up a good question today regarding how libvirt is
>>>> meant to handle these different flavors of display interfaces and
>>>> knowing whether a given mdev device has display support at all.  It
>>>> seems that we cannot simply use the default display=auto because
>>>> libvirt needs to specifically configure gl support for a dmabuf type
>>>> interface versus not having such a requirement for a region interface,
>>>> perhaps even removing the emulated graphics in some cases (though I
>>>> don't think we have boot graphics through either solution yet).
>>>> Additionally, GVT-g seems to need the x-igd-opregion support
>>>> enabled(?), which is a non-starter for libvirt as it's an experimental
>>>> option!
>>>>
>>>> Currently the only way to determine display support is through the
>>>> VFIO_DEVICE_QUERY_GFX_PLANE ioctl, but for libvirt to probe that on
>>>> their own they'd need to get to the point where they could open the
>>>> vfio device and perform the ioctl.  That means opening a vfio
>>>> container, adding the group, setting the iommu type, and getting the
>>>> device.  I was initially a bit appalled at asking libvirt to do that,
>>>> but the alternative is to put this information in sysfs, but doing that
>>>> we risk that we need to describe every nuance of the mdev device
>>>> through sysfs and it becomes a dumping ground for every possible
>>>> feature an mdev device might have.
>>>>  
>>
>> One or two sysfs file for each feature shouldn't be that much of over
>> head? In kernel, other subsystem modules expose capability through
>> sysfs, like PCI subsystem adds 'boot_vga' file for VGA device which
>> returns 0/1 depending on if its boot VGA device. Similarly
>> 'd3cold_allowed', 'msi_bus'...
> 
> Obviously we could add sysfs files, but unlike properties that the PCI
> core exposes about struct pci_dev fields, the idea of a vfio_device is
> much more abstract.  Each bus driver creates its own device
> representation, so we have a top level vfio_device referencing through
> an opaque pointer a vfio_pci_device, vfio_platform_device, or
> mdev_device, and each mdev vendor driver creates its own private data
> structure below the mdev_device.  So it's not quite a simple as one new
> attribute "show" function to handle all devices of that bus_type.  We
> need a consistent implementation in each bus driver and vendor driver
> or we need to figure out how to percolate the information up to the
> vfio core.  Your idea below seems to take the percolate approach.
>  
>>>> So I was ready to return and suggest that maybe libvirt should probe
>>>> the device to know about these ancillary configuration details, but
>>>> then I remembered that both mdev vGPU vendors had external dependencies
>>>> to even allow probing the device.  KVMGT will fail to open the device
>>>> if it's not associated with an instance of KVM and NVIDIA vGPU, I
>>>> believe, will fail if the vGPU manager process cannot find the QEMU
>>>> instance to extract the VM UUID.  (Both of these were bad ideas)  
>>>
>>> Here's another proposal that's really growing on me:
>>>
>>>  * Fix the vendor drivers!  Allow devices to be opened and probed
>>>    without these external dependencies.
>>>  * Libvirt uses the existing vfio API to open the device and probe the
>>>    necessary ioctls, if it can't probe the device, the feature is
>>>    unavailable, ie. display=off, no migration.
>>>   
>>
>> I'm trying to think simpler mechanism using sysfs that could work for
>> any feature and knowing source-destination migration compatibility check
>> by libvirt before initiating migration.
>>
>> I have another proposal:
>> * Add a ioctl VFIO_DEVICE_PROBE_FEATURES
>> struct vfio_device_features {
>>     __u32 argsz;
>>     __u32 features;
>> }
>>
>> Define bit for each feature:
>> #define VFIO_DEVICE_FEATURE_DISPLAY_REGION	(1 << 0)
>> #define VFIO_DEVICE_FEATURE_DISPLAY_DMABUF	(1 << 1)
>> #define VFIO_DEVICE_FEATURE_MIGRATION		(1 << 2)
>>
>> * Vendor driver returns bitmask of supported features during
>> initialization phase.
>>
>> * In vfio core module, trap this ioctl for each device  in
>> vfio_device_fops_unl_ioctl(),
> 
> Whoops, chicken and egg problem, VFIO_GROUP_GET_DEVICE_FD is our
> blocking point with mdev drivers, we can't get a device fd, so we can't
> call an ioctl on the device fd.
> 

I'm sorry, I thought we could expose features when QEMU initialize, but
libvirt needs to know supported features before QEMU initialize.

>> check features bitmask returned by vendor
>> driver and add a sysfs file if feature is supported that device. This
>> sysfs file would return 0/1.
> 
> I don't understand why we have an ioctl interface, if the user can get
> to the device fd then we have existing interfaces to probe these
> things, it seems like you're just wanting to pass a features bitmap
> through to vfio_add_group_dev() that vfio-core would expose through
> sysfs, but a list of feature bits doesn't convey enough info except for
> the most basic uses.
>  

Yes, vfio_add_group_dev() seems to be better way to convey features to
vfio core.

>> For migration this bit will only indicate if host driver supports
>> migration feature.
>>
>> For source and destination compatibility check libvirt would need more
>> data/variables to check like,
>> * if same type of 'mdev_type' device create-able at destination,
>>    i.e. if ('mdev_type'->available_instances > 0)
>>
>> * if host_driver_version at source and destination are compatible.
>> Host driver from same release branch should be mostly compatible, but if
>> there are major changes in structures or APIs, host drivers from
>> different branches might not be compatible, for example, if source and
>> destination are from different branches and one of the structure had
>> changed, then data collected at source might not be compatible with
>> structures at destination and typecasting it to changed structures would
>> mess up migrated data during restoration.
> 
> Of course now you're asking that libvirt understand the release
> versioning scheme of every vendor driver and that it remain
> programatically consistent.  We can't even do this with in-kernel
> drivers.  And in the end, still the best we can do is guess.
>

Libvirt doesn't need to understand the version, libvirt need to do
strcmp version string from source and destination. If those are equal,
then libvirt would understand that they are compatible.

>> * if guest_driver_version is compatible with host driver at destination.
>> For mdev devices, guest driver communicates with host driver in some
>> form. If there are changes in structures/APIs of such communication,
>> guest driver at source might not be compatible with host driver at
>> destination.
> 
> And another guess plus now the guest driver is involved which libvirt
> has no visibility to.
>  

Like above libvirt need to do strcmp.

>> 'available_instances' sysfs already exist, later two should be added by
>> vendor driver which libvirt can use for migration compatibility check.
> 
> As noted previously, display and migration are not necessarily
> mdev-only features, it's possible that vfio-pci or vfio-platform could
> also implement these, so the sysfs interface cannot be restricted to
> the mdev template and lifecycle interface.
> 

I agree.
Feature bitmask passed to vfio core is not mdev specific. But here
'available_instances' for migration compatibility check is mdev
specific. If mdev device is not create-able at destination, there is no
point in initiating migration by libvirt.

> One more try... we have a vfio_group fd.  This is created by the bus
> drivers calling vfio_add_group_dev() and registers a struct device, a
> struct vfio_device_ops, and private data.  Typically we only wire the
> device_ops to the resulting file descriptor we get from
> VFIO_GROUP_GET_DEVICE_FD, but could we enable sort of a nested ioctl
> through the group fd?  The ioctl would need to take a string arg to
> match to a device name, plus an ioctl cmd and arg for the device_ops
> ioctl.  The group ioctl would need to filter cmds to known, benign
> queries.  We'd also need to verify that the allowed ioctls have no
> dependencies on setup done in device_ops.open().

So these ioctls would be called without devices open() call, doesn't
this seem to be against file operations standard?

Thanks,
Kirti

>  *_INFO and
> QUERY_GFX_PLANE ioctls would be the only candidates.  Bus drivers could
> of course keep an open count in their private data so they know how the
> ioctl is being called (if necessary) and the group fd only allows a
> single open, so there's no risk that another user could interact with
> the group in bad ways once the device is opened (and of course we use
> file level access control on the group device file anyway).  This is
> sort of a rethink of Paolo's suggestion of a read-only fd, but the fd
> is the existing group fd and any references to the device would only be
> held around the calling of the nested ioctl.  Could it work?  Thanks,
>