[PATCH v1 2/2] vfio/pci: Allow MMIO regions to be exported through dma-buf
Kasireddy, Vivek
vivek.kasireddy at intel.com
Thu May 2 07:50:36 UTC 2024
Hi Jason,
>
> On Tue, Apr 30, 2024 at 04:24:50PM -0600, Alex Williamson wrote:
> > > +static vm_fault_t vfio_pci_dma_buf_fault(struct vm_fault *vmf)
> > > +{
> > > + struct vm_area_struct *vma = vmf->vma;
> > > + struct vfio_pci_dma_buf *priv = vma->vm_private_data;
> > > + pgoff_t pgoff = vmf->pgoff;
> > > +
> > > + if (pgoff >= priv->nr_pages)
> > > + return VM_FAULT_SIGBUS;
> > > +
> > > + return vmf_insert_pfn(vma, vmf->address,
> > > + page_to_pfn(priv->pages[pgoff]));
> > > +}
> >
> > How does this prevent the MMIO space from being mmap'd when disabled
> at
> > the device? How is the mmap revoked when the MMIO becomes disabled?
> > Is it part of the move protocol?
In this case, I think the importers that mmap'd the dmabuf need to be tracked
separately and their VMA PTEs need to be zapped when MMIO access is revoked.
>
> Yes, we should not have a mmap handler for dmabuf. vfio memory must be
> mmapped in the normal way.
Although optional, I think most dmabuf exporters (drm ones) provide a mmap
handler. Otherwise, there is no easy way to provide CPU access (backup slow path)
to the dmabuf for the importer.
>
> > > +static void vfio_pci_dma_buf_release(struct dma_buf *dmabuf)
> > > +{
> > > + struct vfio_pci_dma_buf *priv = dmabuf->priv;
> > > +
> > > + /*
> > > + * Either this or vfio_pci_dma_buf_cleanup() will remove from the
> list.
> > > + * The refcount prevents both.
> > > + */
> > > + if (priv->vdev) {
> > > + release_p2p_pages(priv, priv->nr_pages);
> > > + kfree(priv->pages);
> > > + down_write(&priv->vdev->memory_lock);
> > > + list_del_init(&priv->dmabufs_elm);
> > > + up_write(&priv->vdev->memory_lock);
> >
> > Why are we acquiring and releasing the memory_lock write lock
> > throughout when we're not modifying the device memory enable state?
> > Ugh, we're using it to implicitly lock dmabufs_elm/dmabufs aren't we...
>
> Not really implicitly, but yes the dmabufs list is locked by the
> memory_lock.
>
> > > +int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev,
> u32 flags,
> > > + struct vfio_device_feature_dma_buf __user
> *arg,
> > > + size_t argsz)
> > > +{
> > > + struct vfio_device_feature_dma_buf get_dma_buf;
> > > + struct vfio_region_p2p_area *p2p_areas;
> > > + DEFINE_DMA_BUF_EXPORT_INFO(exp_info);
> > > + struct vfio_pci_dma_buf *priv;
> > > + int i, ret;
> > > +
> > > + ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_GET,
> > > + sizeof(get_dma_buf));
> > > + if (ret != 1)
> > > + return ret;
> > > +
> > > + if (copy_from_user(&get_dma_buf, arg, sizeof(get_dma_buf)))
> > > + return -EFAULT;
> > > +
> > > + p2p_areas = memdup_array_user(&arg->p2p_areas,
> > > + get_dma_buf.nr_areas,
> > > + sizeof(*p2p_areas));
> > > + if (IS_ERR(p2p_areas))
> > > + return PTR_ERR(p2p_areas);
> > > +
> > > + priv = kzalloc(sizeof(*priv), GFP_KERNEL);
> > > + if (!priv)
> > > + return -ENOMEM;
> >
> > p2p_areas is leaked.
>
> What is this new p2p_areas thing? It wasn't in my patch..
As noted in the commit message, this is one of the things I added to
your original patch.
>
> > > + exp_info.ops = &vfio_pci_dmabuf_ops;
> > > + exp_info.size = priv->nr_pages << PAGE_SHIFT;
> > > + exp_info.flags = get_dma_buf.open_flags;
> >
> > open_flags from userspace are unchecked.
>
> Huh. That seems to be a dmabuf pattern. :\
>
> > > + exp_info.priv = priv;
> > > +
> > > + priv->dmabuf = dma_buf_export(&exp_info);
> > > + if (IS_ERR(priv->dmabuf)) {
> > > + ret = PTR_ERR(priv->dmabuf);
> > > + goto err_free_pages;
> > > + }
> > > +
> > > + /* dma_buf_put() now frees priv */
> > > + INIT_LIST_HEAD(&priv->dmabufs_elm);
> > > + down_write(&vdev->memory_lock);
> > > + dma_resv_lock(priv->dmabuf->resv, NULL);
> > > + priv->revoked = !__vfio_pci_memory_enabled(vdev);
> > > + vfio_device_try_get_registration(&vdev->vdev);
> >
> > I guess we're assuming this can't fail in the ioctl path of an open
> > device?
>
> Seems like a bug added here.. My version had this as
> vfio_device_get(). This stuff has probably changed since I wrote it.
vfio_device_try_get_registration() is essentially doing the same thing as
vfio_device_get() except that we need check the return value of
vfio_device_try_get_registration() which I plan to do in v2.
>
> > > + list_add_tail(&priv->dmabufs_elm, &vdev->dmabufs);
> > > + dma_resv_unlock(priv->dmabuf->resv);
> >
> > What was the purpose of locking this?
>
> ?
>
> > > +void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool
> revoked)
> > > +{
> > > + struct vfio_pci_dma_buf *priv;
> > > + struct vfio_pci_dma_buf *tmp;
> > > +
> > > + lockdep_assert_held_write(&vdev->memory_lock);
> > > +
> > > + list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm)
> {
> > > + if (!get_file_rcu(&priv->dmabuf->file))
> > > + continue;
> >
> > Does this indicate the file was closed?
>
> Yes.. The original patch was clearer, Christian asked to open
> code it:
>
> + * Returns true if a reference was successfully obtained. The caller must
> + * interlock with the dmabuf's release function in some way, such as RCU, to
> + * ensure that this is not called on freed memory.
>
> A description of how the locking is working should be put in a comment
> above that code.
Sure, will add it in v2.
>
> > > @@ -623,6 +625,8 @@ static int vfio_basic_config_write(struct
> vfio_pci_core_device *vdev, int pos,
> > > *virt_cmd &= cpu_to_le16(~mask);
> > > *virt_cmd |= cpu_to_le16(new_cmd & mask);
> > >
> > > + if (__vfio_pci_memory_enabled(vdev))
> > > + vfio_pci_dma_buf_move(vdev, false);
> > > up_write(&vdev->memory_lock);
> > > }
> >
> > FLR is also accessible through config space.
>
> That needs fixing up
>
> > > @@ -1246,7 +1248,10 @@ static int vfio_pci_ioctl_reset(struct
> vfio_pci_core_device *vdev,
> > > */
> > > vfio_pci_set_power_state(vdev, PCI_D0);
> > >
> > > + vfio_pci_dma_buf_move(vdev, true);
> > > ret = pci_try_reset_function(vdev->pdev);
> > > + if (__vfio_pci_memory_enabled(vdev))
> > > + vfio_pci_dma_buf_move(vdev, false);
> > > up_write(&vdev->memory_lock);
> > >
> >
> > What about runtime power management?
>
> Yes
>
> Yes, I somehow thing it was added
Ok, I'll handle runtime PM and FLR cases in v2.
>
> > > -static int vfio_pci_core_feature_token(struct vfio_device *device, u32
> flags,
> > > - uuid_t __user *arg, size_t argsz)
> > > +static int vfio_pci_core_feature_token(struct vfio_pci_core_device
> *vdev,
> > > + u32 flags, uuid_t __user *arg,
> > > + size_t argsz)
> > > {
> > > - struct vfio_pci_core_device *vdev =
> > > - container_of(device, struct vfio_pci_core_device, vdev);
> >
> > Why is only this existing function updated? If the core device deref
> > is common then apply it to all and do so in a separate patch. Thanks,
>
> Hm, I think that was som rebasing issue.
Yeah, looks like the above change may not be needed.
>
> > > +/**
> > > + * Upon VFIO_DEVICE_FEATURE_GET create a dma_buf fd for the
> > > + * region selected.
> > > + *
> > > + * open_flags are the typical flags passed to open(2), eg O_RDWR,
> O_CLOEXEC,
> > > + * etc. offset/length specify a slice of the region to create the dmabuf
> from.
> > > + * If both are 0 then the whole region is used.
> > > + *
> > > + * Return: The fd number on success, -1 and errno is set on failure.
> > > + */
> > > +#define VFIO_DEVICE_FEATURE_DMA_BUF 11
> > > +
> > > +struct vfio_region_p2p_area {
> > > + __u32 region_index;
> > > + __u32 __pad;
> > > + __u64 offset;
> > > + __u64 length;
> > > +};
> > > +
> > > +struct vfio_device_feature_dma_buf {
> > > + __u32 open_flags;
> > > + __u32 nr_areas;
> > > + struct vfio_region_p2p_area p2p_areas[];
> > > +};
>
> Still have no clue what this p2p areas is. You want to create a dmabuf
> out of a scatterlist? Why??
Because the data associated with a buffer that needs to be shared can
come from multiple ranges. I probably should have used the terms ranges
or slices or chunks to make it more clear instead of p2p areas.
In my use-case, GPU A (in a guest VM and bound to vfio-pci on Host) writes
to a buffer (framebuffer in device mem/VRAM in this case) that needs to be
shared with GPU B on the Host. Since the framebuffer can be at-least 8 MB
(assuming 1920x1080) or more in size, it is not reasonable to expect that it
would be allocated as one big contiguous chunk in device memory/VRAM.
>
> I'm also not sure of the use of the pci_p2pdma family of functions, it
> is a bold step to make struct pages, that isn't going to work in quite
I guess things may have changed since the last discussion on this topic or
maybe I misunderstood but I thought Christoph's suggestion was to use
struct pages to populate the scatterlist instead of using DMA addresses
and, I figured pci_p2pdma APIs can easily help with that.
Do you see any significant drawback in using pci_p2pdma APIs? Or, could
you please explain why this approach would not work in a lot of cases?
> alot of cases. It is really hacky/wrong to immediately call
> pci_alloc_p2pmem() to defeat the internal genalloc.
In my use-case, I need to use all the pages from the pool and I don't see any
better way to do it.
>
> I'd rather we stick with the original design. Leon is working on DMA
> API changes that should address half the issue.
Ok, I'll keep an eye out for Leon's work.
Thanks,
Vivek
>
> Jason
More information about the dri-devel
mailing list