[PATCH v2 00/11] Connect VFIO to IOMMUFD

Matthew Rosato mjrosato at linux.ibm.com
Mon Nov 14 14:55:21 UTC 2022


On 11/14/22 9:23 AM, Jason Gunthorpe wrote:
> On Thu, Nov 10, 2022 at 10:01:13PM -0500, Matthew Rosato wrote:
>> On 11/7/22 7:52 PM, Jason Gunthorpe wrote:
>>> This series provides an alternative container layer for VFIO implemented
>>> using iommufd. This is optional; if CONFIG_IOMMUFD is not set, it will
>>> not be compiled in.
>>>
>>> At this point iommufd can be injected by passing an iommufd FD to
>>> VFIO_GROUP_SET_CONTAINER, which will use the VFIO compat layer in iommufd
>>> to obtain the compat IOAS and then connect up all the VFIO drivers as
>>> appropriate.
>>>
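For reference, connecting through the compat layer from userspace looks
roughly like the sketch below.  Error handling is elided, the group path
is just an example, and the trailing VFIO_SET_IOMMU call is my assumption
about the flow rather than something taken from the series:

#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

int connect_group_to_iommufd(void)
{
        int group = open("/dev/vfio/26", O_RDWR);  /* example group */
        int iommufd = open("/dev/iommu", O_RDWR);  /* not /dev/vfio/vfio */

        /* The compat layer accepts an iommufd here and stands in for
         * the type1 container, providing the compat IOAS. */
        ioctl(group, VFIO_GROUP_SET_CONTAINER, &iommufd);

        /* Subsequent type1 ioctls are serviced by iommufd compat code. */
        ioctl(iommufd, VFIO_SET_IOMMU, VFIO_TYPE1v2_IOMMU);
        return iommufd;
}
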
>>> This is a temporary stopping point; a following series will provide a
>>> way to open a VFIO device FD directly and connect it to IOMMUFD using
>>> native ioctls that can expose IOMMUFD features like hwpt, future
>>> vPASID, and dynamic attachment.
>>>
>>> This series, in compat mode, has passed all the qemu tests we have
>>> available, including the test suites for the Intel GVT mdev. Aside from
>>> the temporary limitation with P2P memory, this is believed to be fully
>>> compatible with VFIO.
>>
>> AFAICT there is no equivalent means to specify
>> vfio_iommu_type1.dma_entry_limit when using iommufd; looks like
>> we'll just always get the default 65535.
> 
> No, there is no arbitrary limit on iommufd.

Yeah, that's what I suspected.  But FWIW, userspace checks the advertised limit via VFIO_IOMMU_GET_INFO / VFIO_IOMMU_TYPE1_INFO_DMA_AVAIL, and this is still being advertised as 65535 when using iommufd.  I don't think there is a defined way to return 'ignore this value'. 
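
Concretely, the userspace check walks the capability chain returned by
VFIO_IOMMU_GET_INFO, something like the sketch below (single fixed-size
buffer for brevity; this is illustrative, not QEMU's actual code):

#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

unsigned int query_dma_avail(int container_fd)
{
        char buf[4096];
        struct vfio_iommu_type1_info *info = (void *)buf;
        struct vfio_info_cap_header *hdr;

        memset(buf, 0, sizeof(buf));
        info->argsz = sizeof(buf);
        if (ioctl(container_fd, VFIO_IOMMU_GET_INFO, info) ||
            !(info->flags & VFIO_IOMMU_INFO_CAPS))
                return 0;

        /* Walk the cap chain; offsets are relative to the info buffer. */
        for (hdr = (void *)(buf + info->cap_offset); ;
             hdr = (void *)(buf + hdr->next)) {
                if (hdr->id == VFIO_IOMMU_TYPE1_INFO_DMA_AVAIL)
                        return ((struct vfio_iommu_type1_info_dma_avail *)
                                hdr)->avail;
                if (!hdr->next)
                        break;
        }
        return 0;
}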

This should go away later when we bind to iommufd directly since QEMU would not be sharing the type1 codepath in userspace. 

> 
>> Was this because you envision the limit not being applicable for
>> iommufd (limits will be enforced via other means, and eventually we
>> won't want one) or was it an oversight?
> 
> The limit here is primarily about limiting userspace abuse of the
> interface.
> 
> iommufd is using GFP_KERNEL_ACCOUNT, which shifts the responsibility to
> cgroups, which is similar to how KVM works.
> 
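As I understand it, the pattern is simply to charge any allocation whose
count or size is driven by userspace to the caller's memory cgroup by
using GFP_KERNEL_ACCOUNT, e.g. (illustrative struct and function names,
not taken from the series):

#include <linux/slab.h>

struct map_record {            /* per-mapping metadata, illustrative */
        unsigned long iova;
        unsigned long length;
};

static struct map_record *alloc_map_record(void)
{
        /* Charged to the current task's memcg; a cgroup memory limit
         * then caps how much of this a hostile userspace can force
         * the kernel to allocate. */
        return kzalloc(sizeof(struct map_record), GFP_KERNEL_ACCOUNT);
}
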
> So, for a VM sandbox you'd set a cgroup limit, and if a hostile
> userspace in the sandbox decides to try to OOM the system, it will hit
> that limit, regardless of which kernel APIs it tries to abuse.
> 
> This work is not entirely complete as we also need the iommu driver to
> use GFP_KERNEL_ACCOUNT for allocations connected to the iommu_domain,
> particularly for allocations of the IO page tables themselves - which
> can be quite big.
> 
> Jason
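
For the IO page-table side, I'd expect the eventual iommu driver change
to look something like the sketch below (illustrative helper, not from
any real driver), so that the page-table pages themselves also get
charged to the attaching process's cgroup:

#include <linux/gfp.h>

static void *alloc_io_pgtable_page(void)
{
        /* Zeroed page-table page accounted to the caller's memcg. */
        return (void *)__get_free_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
}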
