Cross-device and cross-driver HMM support
Thomas Hellström
thomas.hellstrom at linux.intel.com
Tue Apr 9 10:18:47 UTC 2024
Hi,
On Wed, 2024-04-03 at 12:09 -0300, Jason Gunthorpe wrote:
> On Wed, Apr 03, 2024 at 04:06:11PM +0200, Christian König wrote:
>
> [UGH html emails, try to avoid those they don't get archived!]
>
> > The problem with that isn't the software but the hardware.
> > At least on the AMD GPUs and Intel's Xe accelerators we have seen
> > so far, page faults are not fast enough to actually work with the
> > semantics the Linux kernel uses for struct pages.
> > That's why, for example, the SVM implementation really sucks with
> > fork(), the transparent huge page daemon and NUMA migrations.
> > Somebody should probably sit down and write a performance
> > measurement tool for page faults so that we can start to compare
> > vendors regarding this.
>
> Yes, all these page fault implementations I've seen are really
> slow. Even SVA/PRI is really slow. The only way it works usefully
> today is for the application/userspace environment to co-operate and
> avoid causing faults.
>
> Until someone invents a faster PRI interface this is what we have.
> It is limited but still useful.
>
> > The problem is the DMA API currently has no idea of inter-device
> > connectors like XGMI.
> > So it can create P2P mappings for PCIe, but anything which isn't
> > part of those interconnects is ignored at the moment as far as I
> > can see.
>
> Speaking broadly - a "multi-path" device is one that has multiple DMA
> initiators and thus multiple paths the DMA can travel. The different
> paths may have different properties, like avoiding the iommu or what
> not. This might be a private hidden bus (XGMI/nvlink/etc) in a GPU
> complex or just two PCI end ports on the same chip like a socket
> direct mlx5 device.
>
> The device HW itself must have a way to select which path each DMA
> goes through because the paths are going to have different address
> spaces. A multi-path PCI device will have different PCI RIDs and
> thus different iommu_domains/IO pagetables/IOVAs, for instance. A
> GPU will alias its internal memory with the PCI IOMMU IOVA.
>
> So, in the case of something like a GPU I expect the private PTE
> itself to have bit(s) indicating if the address is PCI, local memory
> or internal interconnect.
>
> When the hmm_range_fault() encounters a DEVICE_PRIVATE page the GPU
> driver must make a decision on how to set that bit.
>
> My advice would be to organize the GPU driver so that the
> "dev_private_owner" is the same value for all GPUs that share a
> private address space. IOW dev_private_owner represents the physical
> *address space* that the DEVICE_PRIVATE's hidden address lives in,
> not the owning HW. Perhaps we will want to improve on this by adding
> to the pgmap an explicit address space void * private data as well.
>
> When set up like this, hmm_range_fault() will naturally return
> DEVICE_PRIVATE pages which map to the address space for which the
> requesting GPU can trivially set the PTE bit. Easy. No DMA API
> fussing needed.
>
> Otherwise hmm_range_fault() returns the CPU/P2P page. The GPU should
> select the PCI path and the DMA API will check the PCI topology and
> generate a correct PCI address.
>
> If the device driver needs/wants to create driver core buses and
> devices to help it model and discover the dev_private_owner groups, I
> don't know. Clearly the driver must be able to do this grouping to
> make it work, and all this setup is just done when creating the
> pgmap.
>
> I don't think the DMA API should become involved here. The layering
> in a multi-path scenario should have the DMA API caller decide on
> the path, then the DMA API will map for the specific path. The
> caller needs to expressly opt into this because there is additional
> HW - the multi-path selector - that needs to be programmed and the
> DMA API cannot make that transparent.
>
> A similar approach works for going from P2P pages as well: the
> driver can inspect the pgmap owner and similarly check the pgmap
> private data to learn the address space and internal address, then
> decide to choose the non-PCI path.
>
> This scales to a world without P2P struct pages because we will still
> have some kind of pgmap-like structure that holds metadata for a
> uniform chunk of MMIO.
Thanks everyone for the suggestions and feedback. We've been discussing
something like what Jason is describing above, although I haven't had
time to digest all the details yet.
It sounds like common drm- or core code is the preferred way to go
here. I also recognize that gpuvm was successful in this respect, but I
think that gpuvm also had a couple of active reviewers and multiple
drivers that were able to spend time to implement and test the code, so
let's hope for at least some active review participation and feedback
here.
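For illustration only, the path selection described in the quoted text
could be modelled roughly as below. This is a plain userspace C sketch
and not kernel code: every type and function name in it (model_gpu,
model_pgmap, model_page, select_path, the pte_path values) is
hypothetical, and only the idea of comparing a "dev_private_owner"
token that stands for an address space comes from the discussion. The
home_gpu field is an assumption standing in for whatever private data
the pgmap would carry. PATH_NONE models the case where the page lives
in a foreign private address space, which a real driver would not be
handed as a usable DEVICE_PRIVATE page.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model of the discussion; not a kernel API. */
enum page_kind { PAGE_SYSTEM, PAGE_PCI_P2P, PAGE_DEVICE_PRIVATE };

/* Stand-in for the "bit(s)" in the GPU's private PTE that pick a path. */
enum pte_path { PATH_PCI, PATH_LOCAL, PATH_INTERCONNECT, PATH_NONE };

struct model_pgmap {
	const void *owner; /* address-space token, like dev_private_owner */
	int home_gpu;      /* assumed private data: GPU holding the memory */
};

struct model_page {
	enum page_kind kind;
	const struct model_pgmap *pgmap; /* NULL for system memory */
};

struct model_gpu {
	int id;
	const void *dev_private_owner; /* shared by GPUs on one fabric */
};

/*
 * Decide which path the requesting GPU programs into its PTE.
 * The caller makes this choice; the DMA API is only involved
 * afterwards, and only for PATH_PCI.
 */
static enum pte_path select_path(const struct model_gpu *gpu,
				 const struct model_page *page)
{
	assert(gpu != NULL && page != NULL);

	if (page->kind != PAGE_DEVICE_PRIVATE)
		return PATH_PCI; /* CPU or P2P page: map via the PCI path */

	assert(page->pgmap != NULL);

	/* Foreign private address space: not directly addressable. */
	if (page->pgmap->owner != gpu->dev_private_owner)
		return PATH_NONE;

	/* Same address space: own memory, or a peer over the fabric. */
	return page->pgmap->home_gpu == gpu->id ? PATH_LOCAL
						: PATH_INTERCONNECT;
}
```

With two GPUs sharing one owner token, a page in the peer's memory
selects PATH_INTERCONNECT, while a GPU with a different token gets
PATH_NONE for the same page and would have to go through system memory
or the PCI path instead.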
Thanks,
Thomas
>
> Jason