[PATCH v2 4/4] vfio/pci: Allow MMIO regions to be exported through dma-buf
Jason Gunthorpe
jgg at nvidia.com
Wed Sep 7 16:12:52 UTC 2022
On Wed, Sep 07, 2022 at 08:32:23AM -0700, Christoph Hellwig wrote:
> On Wed, Sep 07, 2022 at 12:23:28PM -0300, Jason Gunthorpe wrote:
> > 2) DMABUF abuses dma_map_resource() for P2P and thus doesn't work in
> > certain special cases.
>
> Not just certain special cases, but one of the main use cases.
> Basically P2P can happen in two ways:
>
> a) through a PCIe switch, or
> b) through connected root ports
Yes, we tested both, both work.
> The open coded version here only supports a), and only supports it when
> there is no offset between the 'physical' address of the BAR as seen by
> the PCIe endpoint and the Linux way. x86 usually (always?) doesn't have an
> offset there, but other architectures often do.
The PCI offset is some embedded thing - I've never seen it in a server
platform.
Frankly, it is just bad SOC design and there is good reason why a
non-zero offset needs to be avoided. As soon as you create aliases between
the address spaces you invite trouble. IIRC a SOC I used once put the
memory at 0 -> 4G and then put its only PCI aperture at 4G ->
4G+N. However, this design requires 64 bit PCI support, which the
platform didn't have at the time. So they used a PCI offset to hackily
alias the aperture over the DDR. I don't remember if they threw out a
bit of DDR to resolve the alias, or if they just didn't support PCI
switches.
In any case, it is a complete mess. You either drastically limit your
BAR size, don't support PCI switches, or lose a lot of DDR.
I also seem to remember that the IOMMU and PCI offset don't play nicely
together - so for the VFIO use case, where the IOMMU is present, I'm
pretty sure we can very safely assume a 0 offset. That seems to be
confirmed by the fact that VFIO has never handled the PCI offset in its
own P2P path, and P2P works fine in VMs across a wide range of platforms.
That said, I agree we should really have APIs that support this
properly, and dma_map_resource is certainly technically wrong.
So, would you be OK with this series if I try to make a dma_map_p2p()
that resolves the offset issue?
> Last but not least I don't really see how the code would even work
> when an IOMMU is used, as dma_map_resource will return an IOVA that
> is only understood by the IOMMU itself, and not the other endpoint.
I don't understand this.
__iommu_dma_map() will put the given phys into the iommu_domain
associated with 'dev' and return the IOVA it picked.
Here 'dev' is the importing device, i.e. the device that will issue
the DMA:
+	dma_addr = dma_map_resource(
+		attachment->dev,
+		pci_resource_start(priv->vdev->pdev, priv->index) +
+			priv->offset,
+		priv->dmabuf->size, dir, DMA_ATTR_SKIP_CPU_SYNC);
e.g. attachment->dev is the PCI device of the RDMA device, not the VFIO
device.
'phys' is the CPU physical address of the PCI BAR page, which with a 0
PCI offset is the right thing to program into the IO page table.
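If it helps, the zero-offset assumption could even be checked explicitly
at export time - again only a sketch, the helper below is not part of
this series:

#include <linux/pci.h>

/*
 * Sketch only, not in this series: check that the CPU physical address
 * of a BAR matches what a peer device must emit on the bus, i.e. that
 * the PCI offset is zero.
 */
static bool vfio_pci_bar_has_zero_offset(struct pci_dev *pdev, int bar)
{
	/*
	 * pci_resource_start() is the CPU physical address that
	 * dma_map_resource() programs into the IO page table;
	 * pci_bus_address() is what the initiating endpoint actually
	 * puts on the bus. The mapping is only correct if they match.
	 */
	return pci_bus_address(pdev, bar) ==
	       (pci_bus_addr_t)pci_resource_start(pdev, bar);
}

The exporter could then refuse to create the dma-buf when this returns
false, until the DMA API grows proper support.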
> How was this code even tested?
It was tested on a few platforms, like I said above. The cases where
it doesn't work are special, largely embedded, and not anything we
have in our labs - AFAIK.
Jason