[RFC PATCH 13/29] drm/xe/mmap: Add mmap support for PCI memory barrier

Tue Nov 19 12:42:20 UTC 2024

"Adding Michal from the compute userspace team for sharing references to the code.

Quoting Christian König (2024-11-19 12:00:44)
> Am 19.11.24 um 00:37 schrieb Matthew Brost:
> > From: Tejas Upadhyay <tejas.upadhyay at intel.com>
> >
> > In order to avoid requiring userspace to use MI_MEM_FENCE, we are
> > adding a mechanism for userspace to generate a PCI memory barrier
> > with low overhead (avoiding an IOCTL call as well as a write to VRAM,
> > both of which add some overhead).
> >
> > This is implemented by memory-mapping a page as uncached that is
> > backed by MMIO on the dGPU, thus allowing userspace to do a memory
> > write to the page without invoking an IOCTL.
> > We are selecting the MMIO so that it is not accessible from the PCI 
> > bus so that the MMIO writes themselves are ignored, but the PCI 
> > memory barrier will still take action as the MMIO filtering will 
> > happen after the memory barrier effect.
> >
> > When we detect the specially defined offset in mmap(), we map the 4K
> > page which contains the last page of the doorbell MMIO range to
> > userspace for this purpose.
> 
> Well that is quite a hack, but don't you still need a memory barrier 
> instruction? E.g. m_fence?

I guess you are referring to the userspace usage directions? Yeah, the userspace definitely has to make sure that the write has actually propagated to the PCI bus before it can assume the serialization happens on the GPU. I think the userspace folks should be able to explain how exactly they orchestrate that. Michal, can you or somebody else share the respective lines of code in the userspace driver?

At this time, the userspace only enables this on x86, but could also support other more exotic platforms via libpciaccess.

> And why don't you expose the real doorbell instead of the last 
> (unused?) page of the MMIO region?

Doorbells are a complete red herring here. 

The chosen page just happens to be a full 4K MMIO page where any writes coming over the PCI bus get dropped (and reads return zero) by the GPU. Such a dummy (from the CPU point of view) 4K MMIO page allows doing a CPU write that generates a PCI bus transaction, where the transaction itself is essentially a NOP. But as the transaction falls into the MMIO address range, it will trigger a serialization of the incoming traffic on the GPU side before being ignored.

Regards, Joonas
"

Here is the appropriate path:
https://github.com/intel/compute-runtime/blob/f589408848128434e410b6b4c2a9107ff78a74e9/shared/source/direct_submission/direct_submission_hw.inl#L437

The flow is as follows:
1. Do updates to the memory shared between CPU and GPU using a WC memory mapping.
2. Emit an sfence instruction to make sure there is no reordering on the CPU side.
3. Emit the pciBarrier write (this patch); this ensures that all earlier transactions are properly ordered from the GPU side.

So the PCI memory barrier is submitted after the sfence instruction, and that makes sure that all earlier transactions are properly ordered.
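
To make the ordering concrete, here is a minimal C sketch of the three steps. It is not the compute-runtime code linked above: publish_with_pci_barrier() and cpu_store_fence() are hypothetical names, and the caller is assumed to have already obtained wc_dst (a WC mapping of the shared memory) and pci_barrier (the uncached mapping of the barrier page from this patch). How that page is mmap()ed is deliberately left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE__)
#include <xmmintrin.h>
/* x86: sfence orders the earlier WC stores before the barrier write. */
#define cpu_store_fence() _mm_sfence()
#else
/* Assumption: on other targets a full compiler/CPU fence stands in. */
#define cpu_store_fence() __atomic_thread_fence(__ATOMIC_SEQ_CST)
#endif

/*
 * Hypothetical helper, for illustration only.
 *
 * wc_dst      - CPU mapping (assumed write-combined) of CPU/GPU shared memory
 * src, len    - data to publish to the GPU
 * pci_barrier - uncached mapping of the dummy MMIO page (assumed to come
 *               from the mmap() path this patch adds)
 */
static void publish_with_pci_barrier(void *wc_dst, const void *src, size_t len,
                                     volatile uint32_t *pci_barrier)
{
    /* 1. Update the shared memory through the WC mapping. */
    memcpy(wc_dst, src, len);

    /* 2. Make sure the CPU does not reorder the stores above past this point. */
    cpu_store_fence();

    /*
     * 3. Write to the barrier page. The value is irrelevant: the GPU drops
     *    the write, but the transaction serializes earlier incoming traffic.
     */
    *pci_barrier = 0;
}
```

A caller would invoke this once per batch of updates, before relying on the GPU observing them in order. The volatile qualifier only keeps the compiler from eliding the dummy store; the ordering itself comes from the fence and from the GPU-side handling described above.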

Michal