<!DOCTYPE html><html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
Am 08.01.25 um 17:22 schrieb Jason Gunthorpe:<br>
<blockquote type="cite" cite="mid:20250108162227.GT5556@nvidia.com">[SNIP]<span style="white-space: pre-wrap">
</span><span style="white-space: pre-wrap">
</span>
<blockquote type="cite">
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">For P2P cases we are going toward (PFN + P2P source information) as
input to the DMA API. The additional "P2P source information" provides
a good way for co-operating drivers to represent private address
spaces as well. Both importer and exporter can have full understanding
what is being mapped and do the correct things, safely.
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
I can say from experience that this is clearly not going to work for all use
cases.
It would mean that we have to pull a massive amount of driver specific
functionality into the DMA API.
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
That isn't what I mean. There are two distinct parts, the means to
describe the source (PFN + P2P source information) that is compatible
with the DMA API, and the DMA API itself that works with a few general
P2P source information types.
Private source information would be detected by co-operating drivers
and go down driver private paths. It would be rejected by other
drivers. This broadly follows how the new API is working.
So here I mean you can use the same PFN + Source API between importer
and exporter and the importer can simply detect the special source and
do the private stuff. It is not shifting things under the DMA API, it
is building along side it using compatible design approaches. You
would match the source information, cast it to a driver structure, do
whatever driver math is needed to compute the local DMA address and
then write it to the device. Nothing is hard or "not going to work"
here.</pre>
</blockquote>
<br>
Well, to be honest, that sounds like an absolutely horrible design.<br>
<br>
You are moving all responsibility for inter-driver handling into
the drivers themselves without any supervision by the core OS.<br>
<br>
Drivers are notoriously buggy and should absolutely not do things
like that on their own.<br>
<br>
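If I understand the proposal correctly, every importer driver would end
up open coding something like the following. All names here are
hypothetical, this is just to make the discussion concrete:<br>
<br>
<pre>
/* Hypothetical sketch only, none of these structures exist: the importer
 * recognizes the private source, casts it to an exporter-specific
 * structure and computes the device address itself, with no core code
 * in between that could check or enforce anything.
 */
enum foo_source_type { FOO_SOURCE_CPU, FOO_SOURCE_PRIVATE };

struct dma_buf_source {			/* hypothetical common header */
	enum foo_source_type type;
};

struct foo_private_source {
	struct dma_buf_source base;
	u64 exporter_bus_base;		/* exporter detail the importer must know */
};

static int foo_resolve_addr(struct dma_buf_source *src, unsigned long pfn,
			    u64 *dev_addr)
{
	struct foo_private_source *fsrc;

	if (src->type != FOO_SOURCE_PRIVATE)
		return -EOPNOTSUPP;	/* unknown source, reject */

	fsrc = container_of(src, struct foo_private_source, base);
	/* driver private math, invisible to and unchecked by the core */
	*dev_addr = fsrc->exporter_bus_base + (u64)pfn * PAGE_SIZE;
	return 0;
}
</pre>
<br>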
Do you have pointers to this new API?<br>
<br>
<span style="white-space: pre-wrap">
</span>
<blockquote type="cite" cite="mid:20250108162227.GT5556@nvidia.com">
<blockquote type="cite">
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">So, no, we don't loose private address space support when moving to
importer mapping, in fact it works better because the importer gets
more information about what is going on.
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
Well, sounds like I wasn't able to voice my concern. Let me try again:
We should not give importers information they don't need. Especially not
information about the backing store of buffers.
So that importers get more information about what's going on is a bad thing.
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
I strongly disagree because we are suffering today in mlx5 because of
this viewpoint. You cannot predict in advance what importers are going
to need. I already listed many examples where it does not work today
as is.</pre>
</blockquote>
<br>
Well there is no need to predict in advance what importers are going
to do. We just gather the requirements of the importers and see how
we can fulfill them.<br>
<br>
And yes, it is absolutely intentional that maintainers reject some
of those requirements because they find that the importer is trying
to do something totally nasty which shouldn't happen in the first
place.<br>
<br>
<span style="white-space: pre-wrap">
</span>
<blockquote type="cite" cite="mid:20250108162227.GT5556@nvidia.com">
<blockquote type="cite">
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">I have imagined a staged approach were DMABUF gets a new API that
works with the new DMA API to do importer mapping with "P2P source
information" and a gradual conversion.
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
To make it clear as maintainer of that subsystem I would reject such a step
with all I have.
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
This is unexpected, so you want to just leave dmabuf broken? Do you
have any plan to fix it, to fix the misuse of the DMA API, and all
the problems I listed below? This is a big deal, it is causing real
problems today.</pre>
</blockquote>
<br>
Well, as the maintainer of DMA-buf I don't think it is broken. It's just
missing some requirements.<br>
<br>
When you have new requirements we can certainly talk about them, but
giving drivers a free pass to do whatever nasty thing they want to
do is not going to fly at all.<br>
<br>
<span style="white-space: pre-wrap">
</span>
<blockquote type="cite" cite="mid:20250108162227.GT5556@nvidia.com">
<blockquote type="cite">
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">Exporter mapping falls down in too many cases already:
1) Private address spaces don't work fully well because many devices
need some indication what address space is being used and scatter list
can't really properly convey that. If the DMABUF has a mixture of CPU
and private it becomes a PITA
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
Correct, yes. That's why I said that scatterlist was a bad choice for the
interface.
But exposing the backing store to importers and then let them do whatever
they want with it sounds like an even worse idea.
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
You keep saying this without real justification.</pre>
</blockquote>
<br>
Well, I have really strong justification for that. Please look at the
code here and how it came to be:
<a class="moz-txt-link-freetext" href="https://elixir.bootlin.com/linux/v6.12.6/source/drivers/dma-buf/dma-buf.c#L776">https://elixir.bootlin.com/linux/v6.12.6/source/drivers/dma-buf/dma-buf.c#L776</a><br>
<br>
We intentionally mangle the struct page pointers in the scatterlist
to prevent importers from trying to access them.<br>
<br>
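Roughly, that debug path does something like this (paraphrased from the
code linked above; the exact bit pattern may differ between kernel
versions):<br>
<br>
<pre>
/* Paraphrased from drivers/dma-buf/dma-buf.c (CONFIG_DMABUF_DEBUG):
 * scramble the page_link bits of a mapped sg_table so that importers
 * which try to dereference sg_page() blow up early instead of silently
 * touching the backing store. The scrambling is undone again before
 * the sg_table is handed back to the exporter for unmapping.
 */
static void mangle_sg_table(struct sg_table *sgt)
{
	struct scatterlist *sg;
	int i;

	for_each_sgtable_sg(sgt, sg, i)
		sg->page_link ^= ~0xffUL;
}
</pre>
<br>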
<blockquote type="cite" cite="mid:20250108162227.GT5556@nvidia.com">
<pre class="moz-quote-pre" wrap="">To me it is a nanny style of API design.</pre>
</blockquote>
<br>
Yes, absolutely, and that is really necessary.<br>
<br>
Inter-driver APIs are hard work because the framework not only needs
to define rules but also has to make sure they are properly enforced.<br>
<br>
That's why we not only added tons of documentation on how to implement
things, but also annotations which give big warnings whenever
somebody tries to do something nasty.<br>
<br>
<blockquote type="cite" cite="mid:20250108162227.GT5556@nvidia.com">
<pre class="moz-quote-pre" wrap="">But also I don't see how you can possibly fix the
above without telling the importer a lot more information.</pre>
</blockquote>
<br>
Attaching additional information to the DMA addresses is perfectly ok.
It's just that this additional information should be the output of
the DMA API or DMA-buf and not something drivers try to figure out
on their own.<br>
<br>
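In other words, something like this hypothetical output structure
(nothing like it exists today, it is just to illustrate where the
information should come from):<br>
<br>
<pre>
/* Hypothetical: the mapping call of the framework hands back annotated
 * addresses, so the importer never has to guess what it is looking at.
 */
struct dma_buf_addr {
	dma_addr_t addr;
	unsigned int flags;	/* e.g. a P2P flag, set by DMA-buf/DMA API */
};
</pre>
<br>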
<span style="white-space: pre-wrap">
</span>
<blockquote type="cite" cite="mid:20250108162227.GT5556@nvidia.com">
<blockquote type="cite">
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">2) Multi-path PCI can require the importer to make mapping decisions
unique to the device and program device specific information for the
multi-path. We are doing this in mlx5 today and have hacks because
DMABUF is destroying the information the importer needs to choose the
correct PCI path.
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
That's why the exporter gets the struct device of the importer so that it
can plan how those accesses are made. Where exactly is the problem with
that?
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
A single struct device does not convey the multipath options. We have
multiple struct devices (and multiple PCI endpoints) doing DMA
concurrently under one driver.
Multipath always needs additional meta information in the importer
side to tell the device which path to select. A naked dma address is
not sufficient.
Today we guess that DMABUF will be using P2P and hack to choose a P2P
struct device to pass the exporter. We need to know what is in the
dmabuf before we can choose which of the multiple struct devices the
driver has to use for DMA mapping.
But even simple CPU centric cases we will eventually want to select
the proper NUMA local PCI channel matching struct device for CPU only
buffers.</pre>
</blockquote>
<br>
So basically you want something which tells you if device A or
device B should be used to access something?<br>
<br>
That was discussed before as well, but so far the resolution was
that userspace should decide that stuff and not the kernel.<br>
<br>
Would it be sufficient to have a query API where you can ask the
DMA-buf exporter for information about the buffer? E.g. NUMA
distance etc... <br>
<br>
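As a pure strawman, such a query interface could look roughly like this
(none of it exists, just to have something concrete to discuss):<br>
<br>
<pre>
/* Strawman only, nothing like this exists in DMA-buf today. */
enum dma_buf_location {
	DMA_BUF_LOC_SYSTEM,	/* plain system memory */
	DMA_BUF_LOC_PCI_P2P,	/* device memory reachable via PCI P2P */
	DMA_BUF_LOC_PRIVATE,	/* only reachable through exporter-private paths */
};

struct dma_buf_info {
	enum dma_buf_location location;
	int numa_node;			/* NUMA node of the backing store, -1 if unknown */
	struct pci_dev *p2p_provider;	/* providing device for the P2P case, if any */
};

/* Importers could query this before deciding which of their
 * struct devices to pass to dma_buf_attach().
 */
int dma_buf_get_info(struct dma_buf *dmabuf, struct dma_buf_info *info);
</pre>
<br>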
<span style="white-space: pre-wrap">
</span>
<blockquote type="cite" cite="mid:20250108162227.GT5556@nvidia.com">
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">When you have an use case which is not covered by the existing DMA-buf
interfaces then please voice that to me and other maintainers instead of
implementing some hack.
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
Do you have any suggestion for any of this then?</pre>
</blockquote>
<br>
Yeah, compared to other requirements we have had to handle, what you
have at hand here sounds trivial.<br>
<br>
<span style="white-space: pre-wrap">
</span>
<blockquote type="cite" cite="mid:20250108162227.GT5556@nvidia.com">
<blockquote type="cite">
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">5) iommufd and kvm are both using CPU addresses without DMA. No
exporter mapping is possible
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
We have customers using both KVM and XEN with DMA-buf, so I can clearly
confirm that this isn't true.
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
Today they are mmaping the dma-buf into a VMA and then using KVM's
follow_pfn() flow to extract the CPU pfn from the PTE. Any mmapable
dma-buf must have a CPU PFN.</pre>
</blockquote>
<br>
Yeah, and I'm the one who forced the KVM guys to do it like that.<br>
<br>
<blockquote type="cite" cite="mid:20250108162227.GT5556@nvidia.com">
<pre class="moz-quote-pre" wrap="">Here Xu implements basically the same path, except without the VMA
indirection, and it suddenly not OK? Illogical.</pre>
</blockquote>
<br>
Well, I'm the designer of this KVM solution, and it's not illogical to
reject that when you are missing quite a bunch of stuff.<br>
<br>
Sima correctly pointed out that you need an mmu notifier if you want
to work with PFNs, but that is just the tip of the iceberg here.<br>
<br>
If you want to go down the CPU access path you also need to cover
things like caching flags, lifetime, dynamically changing backing
store, etc.<br>
<br>
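Just for the mmu notifier part alone, anybody going down the PFN road
needs at least something like this minimal sketch (based on the
interval notifier API from linux/mmu_notifier.h; registration, locking
and error handling omitted). The caching, lifetime and moving backing
store questions all come on top of that:<br>
<br>
<pre>
/* Minimal sketch: a consumer working on raw PFNs has to register an
 * interval notifier and drop/refetch its cached PFNs whenever the
 * range is invalidated.
 */
struct pfn_user {
	struct mmu_interval_notifier notifier;
	/* driver private mapping state ... */
};

static bool pfn_user_invalidate(struct mmu_interval_notifier *mni,
				const struct mmu_notifier_range *range,
				unsigned long cur_seq)
{
	struct pfn_user *pu = container_of(mni, struct pfn_user, notifier);

	/* Stop device access and drop the cached PFNs of @pu here,
	 * then mark the range as invalidated. */
	mmu_interval_set_seq(mni, cur_seq);
	return true;
}

static const struct mmu_interval_notifier_ops pfn_user_ops = {
	.invalidate = pfn_user_invalidate,
};
</pre>
<br>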
Regards,<br>
Christian.<br>
<br>
<br>
<blockquote type="cite" cite="mid:20250108162227.GT5556@nvidia.com">
<pre class="moz-quote-pre" wrap="">
Jason
</pre>
</blockquote>
<br>
</body>
</html>