[PATCH] drm/xe/mmap: Add mmap support for PCI memory barrier

Upadhyay, Tejas tejas.upadhyay at intel.com
Tue Oct 15 11:43:25 UTC 2024



> -----Original Message-----
> From: Auld, Matthew <matthew.auld at intel.com>
> Sent: Monday, October 14, 2024 7:23 PM
> To: Upadhyay, Tejas <tejas.upadhyay at intel.com>; intel-
> xe at lists.freedesktop.org
> Subject: Re: [PATCH] drm/xe/mmap: Add mmap support for PCI memory
> barrier
> 
> On 14/10/2024 14:08, Upadhyay, Tejas wrote:
> >
> >
> >> -----Original Message-----
> >> From: Auld, Matthew <matthew.auld at intel.com>
> >> Sent: Monday, October 14, 2024 4:34 PM
> >> To: Upadhyay, Tejas <tejas.upadhyay at intel.com>; intel-
> >> xe at lists.freedesktop.org
> >> Subject: Re: [PATCH] drm/xe/mmap: Add mmap support for PCI memory
> >> barrier
> >>
> >> On 14/10/2024 11:23, Upadhyay, Tejas wrote:
> >>>
> >>>
> >>>> -----Original Message-----
> >>>> From: Auld, Matthew <matthew.auld at intel.com>
> >>>> Sent: Wednesday, October 9, 2024 7:27 PM
> >>>> To: Upadhyay, Tejas <tejas.upadhyay at intel.com>; intel-
> >>>> xe at lists.freedesktop.org
> >>>> Subject: Re: [PATCH] drm/xe/mmap: Add mmap support for PCI memory
> >>>> barrier
> >>>>
> >>>> Hi,
> >>>>
> >>>> On 09/10/2024 11:10, Tejas Upadhyay wrote:
> >>>>> In order to avoid having userspace use MI_MEM_FENCE, we are
> >>>>> adding a mechanism for userspace to generate a PCI memory barrier
> >>>>> with low overhead (avoiding an IOCTL call).
> >>>>
> >>>> I guess it doesn't work if we just do VRAM write from the host side?
> >>>
> >>> Writing to VRAM adds some overhead which the UMD wants to avoid. So
> >>> the suggestion was to use a special MMIO page mapped directly which
> >>> is not accessible from the PCI bus, so that the MMIO writes themselves
> >>> are ignored, but the PCI memory barrier will still take effect as the
> >>> MMIO filtering happens after the memory barrier.
> >>
> >> Would be good to add this info to the commit message.
> >
> > Ok.
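
For context, a minimal sketch of how a UMD might consume this from the host
side is below (the offset value is taken from the patch under discussion and
is not a finalized uapi; error handling trimmed):

#include <stdint.h>
#include <sys/mman.h>

/* Taken from the patch below (0x50 << PAGE_SHIFT); assumed, not final uapi. */
#define XE_PCI_BARRIER_MMAP_OFFSET (0x50ull << 12)

static volatile uint32_t *barrier_page;

/* Map the write-only barrier page once, using the opened xe DRM fd. */
int map_pci_barrier(int xe_fd)
{
	barrier_page = mmap(NULL, 4096, PROT_WRITE, MAP_SHARED,
			    xe_fd, XE_PCI_BARRIER_MMAP_OFFSET);
	return barrier_page == MAP_FAILED ? -1 : 0;
}

/* The dword write itself is filtered out by the device, but it still pushes
 * the posted writes ahead of it across the PCI bus, i.e. the barrier. */
static inline void pci_membarrier(void)
{
	*barrier_page = 0;
}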
> >
> >>
> >>>
> >>>>
> >>>>>
> >>>>> This is implemented by memory-mapping a page as uncached that is
> >>>>> backed by MMIO on the dGPU, thus allowing userspace to do a memory
> >>>>> write to the page without invoking an IOCTL.
> >>>>> We are selecting the MMIO so that it is not accessible from the
> >>>>> PCI bus, so that the MMIO writes themselves are ignored, but the
> >>>>> PCI memory barrier will still take effect as the MMIO filtering
> >>>>> happens after the memory barrier.
> >>>>>
> >>>>> When we detect the special defined offset in mmap(), we map a 4K
> >>>>> page which contains the last page of the doorbell MMIO range to
> >>>>> userspace for the same purpose.
> >>>>>
> >>>>> Note: Test coverage for this is added by IGT
> >>>>>          https://patchwork.freedesktop.org/patch/618931/  here.
> >>>>>
> >>>>> Signed-off-by: Tejas Upadhyay <tejas.upadhyay at intel.com>
> >>>>
> >>>> Needs some UMD Cc. This was requested by some UMD?
> >>>
> >>> Yes, I will add someone in cc.
> >>>
> >>>>
> >>>>> ---
> >>>>>     drivers/gpu/drm/xe/xe_device.c | 68 +++++++++++++++++++++++++++++++++-
> >>>>>     1 file changed, 67 insertions(+), 1 deletion(-)
> >>>>>
> >>>>> diff --git a/drivers/gpu/drm/xe/xe_device.c
> >>>>> b/drivers/gpu/drm/xe/xe_device.c index cd241a8e1838..c97a4d1f0a98
> >>>>> 100644
> >>>>> --- a/drivers/gpu/drm/xe/xe_device.c
> >>>>> +++ b/drivers/gpu/drm/xe/xe_device.c
> >>>>> @@ -239,12 +239,78 @@ static long xe_drm_compat_ioctl(struct file *file, unsigned int cmd, unsigned lo
> >>>>>     #define xe_drm_compat_ioctl NULL
> >>>>>     #endif
> >>>>>
> >>>>> +static void barrier_open(struct vm_area_struct *vma)
> >>>>> +{
> >>>>> +	drm_dev_get(vma->vm_private_data);
> >>>>> +}
> >>>>> +
> >>>>> +static void barrier_close(struct vm_area_struct *vma)
> >>>>> +{
> >>>>> +	drm_dev_put(vma->vm_private_data);
> >>>>> +}
> >>>>> +
> >>>>> +static const struct vm_operations_struct vm_ops_barrier = {
> >>>>> +	.open = barrier_open,
> >>>>> +	.close = barrier_close,
> >>>>> +};
> >>>>> +
> >>>>> +static int xe_pci_barrier_mmap(struct file *filp,
> >>>>> +			       struct vm_area_struct *vma)
> >>>>> +{
> >>>>> +	struct drm_file *priv = filp->private_data;
> >>>>> +	struct drm_device *dev = priv->minor->dev;
> >>>>> +	unsigned long pfn;
> >>>>> +	pgprot_t prot;
> >>>>> +
> >>>>> +	if (vma->vm_end - vma->vm_start > PAGE_SIZE)
> >>>>> +		return -EINVAL;
> >>>>> +
> >>>>> +	if (is_cow_mapping(vma->vm_flags))
> >>>>> +		return -EINVAL;
> >>>>> +
> >>>>> +	if (vma->vm_flags & (VM_READ | VM_EXEC))
> >>>>> +		return -EINVAL;
> >>>>> +
> >>>>> +	vm_flags_clear(vma, VM_MAYREAD | VM_MAYEXEC);
> >>>>> +	vm_flags_set(vma, VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP | VM_IO);
> >>>>> +
> >>>>> +	prot = vm_get_page_prot(vma->vm_flags);
> >>>>> +#define LAST_DB_PAGE_OFFSET 0x7ff001
> >>>>> +	pfn = PHYS_PFN(pci_resource_start(to_pci_dev(dev->dev), 0) +
> >>>>> +			LAST_DB_PAGE_OFFSET);
> >>>>> +	if (vmf_insert_pfn_prot(vma, vma->vm_start, pfn,
> >>>>> +				pgprot_noncached(prot)) != VM_FAULT_NOPAGE)
> >>>>> +		return -EFAULT;
> >>>>> +
> >>>>> +	vma->vm_ops = &vm_ops_barrier;
> >>>>> +	vma->vm_private_data = dev;
> >>>>> +	drm_dev_get(vma->vm_private_data);
> >>>>
> >>>> What is going on here with pci power states? Say device goes into
> >>>> d3hot or d3cold and you touch this mmio, what happens? Do you get
> >>>> pci errors in dmesg? Does rpm need to get involved here? Might be
> >>>> good to have a testcase for this.
> >>>
> >>> I am not sure here. But we will have a follow-up IGT patch once the
> >>> basic IGT is merged, where we will try to add the test you are
> >>> suggesting here.
> >>
> >> I think we need to understand this. If we need rpm here, then the
> >> design might need to change, since grabbing rpm for the lifetime of
> >> the mmap is likely the same as just disabling rpm completely, which
> >> is maybe not acceptable. Not sure though.
> >
> > We don't need rpm here, as the membarrier is to generate a side-effect
> > on the GT to serialise any memory traffic across the PCI bus; if the
> > device is asleep, there is no memory traffic across the bus, thus the
> > requirement is satisfied.
> 
> Ok, but would be good to confirm that. At least for stuff like pci access with
> LMEMBAR, IIRC device must be in d0 or you can get pci errors, even if writes
> are ignored.
> 
> >
> >>
> >>>
> >>>>
> >>>>> +	return 0;
> >>>>> +}
> >>>>> +
> >>>>> +static int xe_mmap(struct file *filp, struct vm_area_struct *vma)
> >>>>> +{
> >>>>> +	struct drm_file *priv = filp->private_data;
> >>>>> +	struct drm_device *dev = priv->minor->dev;
> >>>>> +
> >>>>> +	if (drm_dev_is_unplugged(dev))
> >>>>> +		return -ENODEV;
> >>>>
> >>>> With hotunplug I think there is also a special dummy page that gets
> >>>> faulted in if we detect that we are unplugged in the fault handler
> >>>> to avoid sigbus killing stuff (see xe_gem_fault()). Also I guess all
> >>>> mmaps are nuked on unplug so everything gets refaulted. But this
> >>>> doesn't look to use fault handler and just pre-populates it here, so
> >>>> wondering what would happen here?
> >>>
> >>> Got your point. This might be a problem. Do you have any suggestion
> >>> here?
> >>
> >> Not really. But maybe first just try it with an igt and see what
> >> happens? We can run hotunplug igt for example.
> >
> > If unplug happens then it's a new device at that point and the client
> > needs to reload, so I think we don't need to take care of this.
> 
> At least for normal bo mmap stuff you usually don't want to crash the
> app/umd, hence why a lot of effort is spent trying to avoid returning SIGBUS
> by instead mapping a dummy page if we detect device was detached in the
> fault handler. The user will need to get a new driver fd (assuming device
> comes back), but user should be able to gracefully detect the unplug instead
> of crashing (all ioctls will return an error for example).
> 
> When you do the unplug the driver instance stays alive, even after pci remove
> callback is called, where that driver instance is detached from the pci device
> underneath. If user is accessing the barrier ptr when this happens does it
> crash? Would be good to check that.

True. If I pause execution of my test with a sleep and run the test in the background, then do an unplug-rescan in the meantime, the test crashes when it resumes:
root at DUT1130BMGFRD:/home/gta/igt-gpu-tools# Received signal SIGBUS.
Stack trace:
 #0 [fatal_sig_handler+0x175]
 #1 [__sigaction+0x50]
 #2 [__igt_unique____real_main64+0xfb]
 #3 [main+0x2d]
 #4 [__libc_init_first+0x90]
 #5 [__libc_start_main+0x80]
 #6 [_start+0x25]
Subtest pci-membarrier-basic: CRASH (180.014s)

So how do you think this should be handled, via a dummy page? Any example which can be followed would be a great help.
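
For reference, a rough sketch of what a fault-handler-based variant along the
lines of xe_gem_fault() could look like (illustrative only; xe_barrier_dummy_page()
is a hypothetical per-device dummy-page helper, not an existing function):

static vm_fault_t barrier_fault(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;
	struct drm_device *dev = vma->vm_private_data;
	pgprot_t prot = vm_get_page_prot(vma->vm_flags);
	vm_fault_t ret;
	int idx;

	if (drm_dev_enter(dev, &idx)) {
		/* Device still bound: map the real barrier MMIO page. */
		unsigned long pfn =
			PHYS_PFN(pci_resource_start(to_pci_dev(dev->dev), 0) +
				 LAST_DB_PAGE_OFFSET);

		ret = vmf_insert_pfn_prot(vma, vma->vm_start, pfn,
					  pgprot_noncached(prot));
		drm_dev_exit(idx);
	} else {
		/* Unplugged: back the VMA with a throwaway page so the
		 * userspace write is silently discarded instead of the
		 * process getting SIGBUS. */
		ret = vmf_insert_pfn(vma, vma->vm_start,
				     page_to_pfn(xe_barrier_dummy_page(dev)));
	}

	return ret;
}

xe_pci_barrier_mmap() would then only set vma->vm_ops (with .fault pointing at
the handler above) instead of pre-populating the pfn, so that after unplug nukes
the mappings the next access refaults through this path.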

Tejas
> 
> >
> >>
> >>>
> >>>>
> >>>>> +
> >>>>> +#define XE_PCI_BARRIER_MMAP_OFFSET (0x50 << PAGE_SHIFT)
> >>>>
> >>>> This needs to go in uapi header I think?
> >>>
> >>> Yes, I went through the uapi header, did not see any offset there, and
> >>> did not find a suitable place. Any suggestion where this should be
> >>> positioned in the uapi header?
> >>
> >> It might make sense to have an mmap_offset flag instead (as per below),
> >> then we don't need this.
> >>
> >>>
> >>>>
> >>>>> +	switch (vma->vm_pgoff) {
> >>>>
> >>>> I think pgoff is the fake offset to be able to lookup the bo, so for
> >>>> this to work I think you would need to reserve this page somehow (if
> >>>> we go with this approach), otherwise user can easily end up mmapping
> >>>> a normal bo and get back a fake offset that happens to be this exact
> >>>> special barrier offset and then things go boom.
> >>>>
> >>>> Another idea could be to have mmap_offset called with special flag to
> >>>> indicate this BARRIER thing, and then it returns the magic reserved
> >>>> offset. That way we don't need to hardcode XE_PCI_BARRIER_MMAP_OFFSET
> >>>> in the uapi, and be forever stuck with it.
> >>>
> >>> Are you saying that on different platforms this unused offset might
> >>> change? For the current purpose, BMG and PVC both have a correct value
> >>> for this as an unused offset.
> >>
> >> Why is it an unused offset, also why is it platform specific? The
> >> vm_pgoff can also contain the fake offset to lookup the BO, so you would
> >> now need to ensure that the special barrier offset is never used, and so
> >> doesn't clash with some random BO offset. Maybe I'm missing something
> >> here...
> >
> > Oh, now I recall: for vm_pgoff, there is some offset added which leaves
> > anything below that offset untouched. We selected vm_pgoff to be used
> > from that area.
> 
> Ah, you mean DRM_FILE_PAGE_OFFSET_START? It does feel a bit scary to
> rely on that internal behaviour and then bake it into the uapi forever.
> So far I assume drivers are not really depending on that behaviour so
> that could change in the future.
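> 
> To make the mmap_offset-flag alternative suggested above concrete, the
> userspace side could look something like this (the flag name and the
> zero-handle convention are placeholders made up for illustration, not an
> existing uapi):
> 
> #include <sys/ioctl.h>
> #include <sys/mman.h>
> #include <drm/xe_drm.h>
> 
> /* Hypothetical flag, not part of the current uapi header. */
> #define DRM_XE_MMAP_OFFSET_FLAG_PCI_BARRIER (1 << 0)
> 
> void *map_barrier_via_mmap_offset(int xe_fd)
> {
> 	struct drm_xe_gem_mmap_offset mmo = {
> 		.handle = 0,	/* no BO backs the barrier page */
> 		.flags = DRM_XE_MMAP_OFFSET_FLAG_PCI_BARRIER,
> 	};
> 
> 	/* The KMD fills mmo.offset with whatever offset it reserved
> 	 * internally, so nothing is hardcoded in the uapi forever. */
> 	if (ioctl(xe_fd, DRM_IOCTL_XE_GEM_MMAP_OFFSET, &mmo))
> 		return MAP_FAILED;
> 
> 	return mmap(NULL, 4096, PROT_WRITE, MAP_SHARED, xe_fd, mmo.offset);
> }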
> 
> >
> > Tejas
> >
> >>
> >>>
> >>> But I agree, for future compatibility, we can have this handling in the
> >>> KMD itself, to detect the platform and provide the correct unused offset
> >>> to the UMD to map. I will take care of this.
> >>>
> >>> Tejas
> >>>
> >>>>
> >>>>
> >>>>> +	case XE_PCI_BARRIER_MMAP_OFFSET >> PAGE_SHIFT:
> >>>>> +		return xe_pci_barrier_mmap(filp, vma);
> >>>>> +	}
> >>>>> +
> >>>>> +	return drm_gem_mmap(filp, vma);
> >>>>> +}
> >>>>> +
> >>>>>     static const struct file_operations xe_driver_fops = {
> >>>>>     	.owner = THIS_MODULE,
> >>>>>     	.open = drm_open,
> >>>>>     	.release = drm_release_noglobal,
> >>>>>     	.unlocked_ioctl = xe_drm_ioctl,
> >>>>> -	.mmap = drm_gem_mmap,
> >>>>> +	.mmap = xe_mmap,
> >>>>>     	.poll = drm_poll,
> >>>>>     	.read = drm_read,
> >>>>>     	.compat_ioctl = xe_drm_compat_ioctl,

