[RFC PATCH 01/12] dma-buf: Introduce dma_buf_get_pfn_unlocked() kAPI
Jason Gunthorpe
jgg at nvidia.com
Wed Jan 8 16:22:27 UTC 2025
On Wed, Jan 08, 2025 at 04:25:54PM +0100, Christian König wrote:
> Am 08.01.25 um 15:58 schrieb Jason Gunthorpe:
> > On Wed, Jan 08, 2025 at 02:44:26PM +0100, Christian König wrote:
> >
> > > > Having the importer do the mapping is the correct way to operate the
> > > > DMA API and the new API that Leon has built to fix the scatterlist
> > > > abuse in dmabuf relies on importer mapping as part of it's
> > > > construction.
> > > Exactly on that I strongly disagree on.
> > >
> > > DMA-buf works by providing DMA addresses the importer can work with and
> > > *NOT* the underlying location of the buffer.
> > The expectation is that the DMA API will be used to DMA map (most)
> > things, and the DMA API always works with a physaddr_t/pfn
> > argument. Basically, everything that is not a private address space
> > should be supported by improving the DMA API. We are on course for
> > finally getting all the common cases like P2P and MMIO solved
> > here. That alone will take care of alot.
>
> Well, from experience the DMA API has failed more often than it actually
> worked in the way required by drivers.
The DMA API has been static and very hard to change in these ways for
a long time. I think Leon's new API will break through this and we
will be able finally address these issues.
> > For P2P cases we are going toward (PFN + P2P source information) as
> > input to the DMA API. The additional "P2P source information" provides
> > a good way for co-operating drivers to represent private address
> > spaces as well. Both importer and exporter can have full understanding
> > what is being mapped and do the correct things, safely.
>
> I can say from experience that this is clearly not going to work for all use
> cases.
>
> It would mean that we have to pull a massive amount of driver specific
> functionality into the DMA API.
That isn't what I mean. There are two distinct parts, the means to
describe the source (PFN + P2P source information) that is compatible
with the DMA API, and the DMA API itself that works with a few general
P2P source information types.
Private source information would be detected by co-operating drivers
and go down driver private paths. It would be rejected by other
drivers. This broadly follows how the new API is working.
So here I mean you can use the same PFN + Source API between importer
and exporter and the importer can simply detect the special source and
do the private stuff. It is not shifting things under the DMA API, it
is building along side it using compatible design approaches. You
would match the source information, cast it to a driver structure, do
whatever driver math is needed to compute the local DMA address and
then write it to the device. Nothing is hard or "not going to work"
here.
> > So, no, we don't loose private address space support when moving to
> > importer mapping, in fact it works better because the importer gets
> > more information about what is going on.
>
> Well, sounds like I wasn't able to voice my concern. Let me try again:
>
> We should not give importers information they don't need. Especially not
> information about the backing store of buffers.
>
> So that importers get more information about what's going on is a bad thing.
I strongly disagree because we are suffering today in mlx5 because of
this viewpoint. You cannot predict in advance what importers are going
to need. I already listed many examples where it does not work today
as is.
> > I have imagined a staged approach were DMABUF gets a new API that
> > works with the new DMA API to do importer mapping with "P2P source
> > information" and a gradual conversion.
>
> To make it clear as maintainer of that subsystem I would reject such a step
> with all I have.
This is unexpected, so you want to just leave dmabuf broken? Do you
have any plan to fix it, to fix the misuse of the DMA API, and all
the problems I listed below? This is a big deal, it is causing real
problems today.
If it going to be like this I think we will stop trying to use dmabuf
and do something simpler for vfio/kvm/iommufd :(
> We have already gone down that road and it didn't worked at all and
> was a really big pain to pull people back from it.
Nobody has really seriously tried to improve the DMA API before, so I
don't think this is true at all.
> > Exporter mapping falls down in too many cases already:
> >
> > 1) Private addresses spaces don't work fully well because many devices
> > need some indication what address space is being used and scatter list
> > can't really properly convey that. If the DMABUF has a mixture of CPU
> > and private it becomes a PITA
>
> Correct, yes. That's why I said that scatterlist was a bad choice for the
> interface.
>
> But exposing the backing store to importers and then let them do whatever
> they want with it sounds like an even worse idea.
You keep saying this without real justification. To me it is a nanny
style of API design. But also I don't see how you can possibly fix the
above without telling the importer alot more information.
> > 2) Multi-path PCI can require the importer to make mapping decisions
> > unique to the device and program device specific information for the
> > multi-path. We are doing this in mlx5 today and have hacks because
> > DMABUF is destroying the information the importer needs to choose the
> > correct PCI path.
>
> That's why the exporter gets the struct device of the importer so that it
> can plan how those accesses are made. Where exactly is the problem with
> that?
A single struct device does not convey the multipath options. We have
multiple struct devices (and multiple PCI endpoints) doing DMA
concurrently under one driver.
Multipath always needs additional meta information in the importer
side to tell the device which path to select. A naked dma address is
not sufficient.
Today we guess that DMABUF will be using P2P and hack to choose a P2P
struct device to pass the exporter. We need to know what is in the
dmabuf before we can choose which of the multiple struct devices the
driver has to use for DMA mapping.
But even simple CPU centric cases we will eventually want to select
the proper NUMA local PCI channel matching struct device for CPU only
buffers.
> When you have an use case which is not covered by the existing DMA-buf
> interfaces then please voice that to me and other maintainers instead of
> implementing some hack.
Do you have any suggestion for any of this then? We have a good plan
to fix this stuff and more. Many experts in their fields have agreed
on the different parts now. We haven't got to dmabuf because I had no
idea there would be an objection like this.
> > 3) Importing devices need to know if they are working with PCI P2P
> > addresses during mapping because they need to do things like turn on
> > ATS on their DMA. As for multi-path we have the same hacks inside mlx5
> > today that assume DMABUFs are always P2P because we cannot determine
> > if things are P2P or not after being DMA mapped.
>
> Why would you need ATS on PCI P2P and not for system memory accesses?
ATS has a significant performance cost. It is mandatory for PCI P2P,
but ideally should be avoided for CPU memory.
> > 4) TPH bits needs to be programmed into the importer device but are
> > derived based on the NUMA topology of the DMA target. The importer has
> > no idea what the DMA target actually was because the exporter mapping
> > destroyed that information.
>
> Yeah, but again that is completely intentional.
>
> I assume you mean TLP processing hints when you say TPH and those should be
> part of the DMA addresses provided by the exporter.
Yes, but is not part of the DMA addresses.
> That an importer tries to look behind the curtain and determines the NUMA
> placement and topology themselves is clearly a no-go from the design
> perspective.
I strongly disagree, this is important. Drivers need this information
in a future TPH/UIO/multipath PCI world.
> > 5) iommufd and kvm are both using CPU addresses without DMA. No
> > exporter mapping is possible
>
> We have customers using both KVM and XEN with DMA-buf, so I can clearly
> confirm that this isn't true.
Today they are mmaping the dma-buf into a VMA and then using KVM's
follow_pfn() flow to extract the CPU pfn from the PTE. Any mmapable
dma-buf must have a CPU PFN.
Here Xu implements basically the same path, except without the VMA
indirection, and it suddenly not OK? Illogical.
Jason
More information about the dri-devel
mailing list