[PATCH] drm/xe/mmap: Add mmap support for PCI memory barrier

Upadhyay, Tejas tejas.upadhyay at intel.com
Wed Oct 16 16:06:15 UTC 2024



> -----Original Message-----
> From: Auld, Matthew <matthew.auld at intel.com>
> Sent: Wednesday, October 16, 2024 9:25 PM
> To: Upadhyay, Tejas <tejas.upadhyay at intel.com>; intel-
> xe at lists.freedesktop.org
> Subject: Re: [PATCH] drm/xe/mmap: Add mmap support for PCI memory
> barrier
> 
> On 16/10/2024 16:13, Upadhyay, Tejas wrote:
> >
> >
> >> -----Original Message-----
> >> From: Auld, Matthew <matthew.auld at intel.com>
> >> Sent: Monday, October 14, 2024 7:23 PM
> >> To: Upadhyay, Tejas <tejas.upadhyay at intel.com>; intel-
> >> xe at lists.freedesktop.org
> >> Subject: Re: [PATCH] drm/xe/mmap: Add mmap support for PCI memory
> >> barrier
> >>
> >> On 14/10/2024 14:08, Upadhyay, Tejas wrote:
> >>>
> >>>
> >>>> -----Original Message-----
> >>>> From: Auld, Matthew <matthew.auld at intel.com>
> >>>> Sent: Monday, October 14, 2024 4:34 PM
> >>>> To: Upadhyay, Tejas <tejas.upadhyay at intel.com>; intel-
> >>>> xe at lists.freedesktop.org
> >>>> Subject: Re: [PATCH] drm/xe/mmap: Add mmap support for PCI memory
> >>>> barrier
> >>>>
> >>>> On 14/10/2024 11:23, Upadhyay, Tejas wrote:
> >>>>>
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Auld, Matthew <matthew.auld at intel.com>
> >>>>>> Sent: Wednesday, October 9, 2024 7:27 PM
> >>>>>> To: Upadhyay, Tejas <tejas.upadhyay at intel.com>; intel-
> >>>>>> xe at lists.freedesktop.org
> >>>>>> Subject: Re: [PATCH] drm/xe/mmap: Add mmap support for PCI
> memory
> >>>>>> barrier
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> On 09/10/2024 11:10, Tejas Upadhyay wrote:
> >>>>>>> In order to avoid having userspace use MI_MEM_FENCE, we are
> >>>>>>> adding a mechanism for userspace to generate a PCI memory
> >>>>>>> barrier with low overhead (avoiding an IOCTL call).
> >>>>>>
> >>>>>> I guess it doesn't work if we just do VRAM write from the host side?
> >>>>>
> >>>>> Writing to VRAM adds some overhead which the UMD wants to avoid.
> >>>>> So the suggestion was to use a special MMIO page, mapped directly,
> >>>>> which is not accessible from the PCI bus, so that the MMIO writes
> >>>>> themselves are ignored, but the PCI memory barrier still takes
> >>>>> effect, as the MMIO filtering happens after the memory barrier.
> >>>>
> >>>> Would be good to add this info to the commit message.
> >>>
> >>> Ok.
> >>>
> >>>>
> >>>>>
> >>>>>>
> >>>>>>>
> >>>>>>> This is implemented by memory-mapping a page as uncached that is
> >>>>>>> backed by MMIO on the dGPU, thus allowing userspace to do memory
> >>>>>>> writes to the page without invoking an IOCTL.
> >>>>>>> We are selecting the MMIO so that it is not accessible from the
> >>>>>>> PCI bus so that the MMIO writes themselves are ignored, but the
> >>>>>>> PCI memory barrier will still take action as the MMIO filtering
> >>>>>>> will happen after the memory barrier effect.
> >>>>>>>
> >>>>>>> When we detect the special defined offset in mmap(), we map the
> >>>>>>> 4K page which contains the last page of the doorbell MMIO range
> >>>>>>> to userspace for this purpose.
> >>>>>>>
> >>>>>>> Note: Test coverage for this is added by the IGT at
> >>>>>>>           https://patchwork.freedesktop.org/patch/618931/
> >>>>>>>
> >>>>>>> Signed-off-by: Tejas Upadhyay <tejas.upadhyay at intel.com>
> >>>>>>
> >>>>>> Needs some UMD Cc. This was requested by some UMD?
> >>>>>
> >>>>> Yes, I will add someone in cc.
> >>>>>
> >>>>>>
> >>>>>>> ---
> >>>>>>>      drivers/gpu/drm/xe/xe_device.c | 68 +++++++++++++++++++++++++++++++++-
> >>>>>>>      1 file changed, 67 insertions(+), 1 deletion(-)
> >>>>>>>
> >>>>>>> diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
> >>>>>>> index cd241a8e1838..c97a4d1f0a98 100644
> >>>>>>> --- a/drivers/gpu/drm/xe/xe_device.c
> >>>>>>> +++ b/drivers/gpu/drm/xe/xe_device.c
> >>>>>>> @@ -239,12 +239,78 @@ static long xe_drm_compat_ioctl(struct file *file, unsigned int cmd, unsigned lo
> >>>>>>>      #define xe_drm_compat_ioctl NULL
> >>>>>>>      #endif
> >>>>>>>
> >>>>>>> +static void barrier_open(struct vm_area_struct *vma)
> >>>>>>> +{
> >>>>>>> +	drm_dev_get(vma->vm_private_data);
> >>>>>>> +}
> >>>>>>> +
> >>>>>>> +static void barrier_close(struct vm_area_struct *vma)
> >>>>>>> +{
> >>>>>>> +	drm_dev_put(vma->vm_private_data);
> >>>>>>> +}
> >>>>>>> +
> >>>>>>> +static const struct vm_operations_struct vm_ops_barrier = {
> >>>>>>> +	.open = barrier_open,
> >>>>>>> +	.close = barrier_close,
> >>>>>>> +};
> >>>>>>> +
> >>>>>>> +static int xe_pci_barrier_mmap(struct file *filp,
> >>>>>>> +			       struct vm_area_struct *vma)
> >>>>>>> +{
> >>>>>>> +	struct drm_file *priv = filp->private_data;
> >>>>>>> +	struct drm_device *dev = priv->minor->dev;
> >>>>>>> +	unsigned long pfn;
> >>>>>>> +	pgprot_t prot;
> >>>>>>> +
> >>>>>>> +	if (vma->vm_end - vma->vm_start > PAGE_SIZE)
> >>>>>>> +		return -EINVAL;
> >>>>>>> +
> >>>>>>> +	if (is_cow_mapping(vma->vm_flags))
> >>>>>>> +		return -EINVAL;
> >>>>>>> +
> >>>>>>> +	if (vma->vm_flags & (VM_READ | VM_EXEC))
> >>>>>>> +		return -EINVAL;
> >>>>>>> +
> >>>>>>> +	vm_flags_clear(vma, VM_MAYREAD | VM_MAYEXEC);
> >>>>>>> +	vm_flags_set(vma, VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP | VM_IO);
> >>>>>>> +
> >>>>>>> +	prot = vm_get_page_prot(vma->vm_flags);
> >>>>>>> +#define LAST_DB_PAGE_OFFSET 0x7ff001
> >>>>>>> +	pfn = PHYS_PFN(pci_resource_start(to_pci_dev(dev->dev), 0) +
> >>>>>>> +			LAST_DB_PAGE_OFFSET);
> >>>>>>> +	if (vmf_insert_pfn_prot(vma, vma->vm_start, pfn,
> >>>>>>> +				pgprot_noncached(prot)) != VM_FAULT_NOPAGE)
> >>>>>>> +		return -EFAULT;
> >>>>>>> +
> >>>>>>> +	vma->vm_ops = &vm_ops_barrier;
> >>>>>>> +	vma->vm_private_data = dev;
> >>>>>>> +	drm_dev_get(vma->vm_private_data);
> >>>>>>
> >>>>>> What is going on here with pci power states? Say device goes into
> >>>>>> d3hot or d3cold and you touch this mmio, what happens? Do you get
> >>>>>> pci errors in dmesg? Does rpm need to get involved here? Might be
> >>>>>> good to have a testcase for this.
> >>>>>
> >>>>> I am not sure here. But we will have a follow-up IGT patch once the
> >>>>> basic IGT is merged, where we will try to add the test you are
> >>>>> suggesting here.
> >>>>
> >>>> I think we need to understand this. If we need rpm here, then the
> >>>> design might need to change, since grabbing rpm for the lifetime of
> >>>> the mmap is likely the same as just disabling rpm completely, which
> >>>> is maybe not acceptable. Not sure though.
> >>>
> >>> We don't need rpm here, as the memory barrier is there to generate a
> >>> side-effect on the GT to serialise any memory traffic across the PCI
> >>> bus; if the device is asleep, there is no memory traffic across the
> >>> bus, thus the requirement is satisfied.
> >>
> >> Ok, but would be good to confirm that. At least for stuff like pci
> >> access with LMEMBAR, IIRC device must be in d0 or you can get pci
> >> errors, even if writes are ignored.
> >
> > Tried d3hot with this test,
> >
> >          ptr = mmap(NULL, size, prot, flags, device.fd_xe, offset);
> >          igt_assert(ptr != MAP_FAILED);
> >
> >          sleep(1);
> >          igt_assert(in_d3(device, d_state));
> >          /* Check the whole page for any errors; also check that
> >           * we do not read written values back
> >           */
> >          for (int i = 0; i < size / sizeof(*ptr); i++) {
> >                  /* It is expected unconfigured doorbell space
> >                   * will return read value 0xdeadbeef
> >                   */
> >                  igt_assert_eq_u32(READ_ONCE(ptr[i]), 0xdeadbeef);
> >
> >                  igt_gettime(&tv);
> >                  ptr[i] = i;
> >                  if (READ_ONCE(ptr[i]) == i) {
> >                          while (READ_ONCE(ptr[i]) == i)
> >                                  ;
> >                          igt_info("fd:%d value retained for %"PRId64"ns pos:%d\n",
> >                                   device.fd_xe, igt_nsec_elapsed(&tv), i);
> >                  }
> >                  igt_assert_neq(READ_ONCE(ptr[i]), i);
> >          }
> >
> > I am putting the device in D3 and then reading, and I am able to read
> > back the mapping.
> 
> Ok, cool. And dmesg is also clean? I guess would be good to try with d3cold.
> If we can turn the above into an igt, there should be something in CI that is
> d3cold capable to check this.

Dmesg is also clean; I only see some suspend/resume messages, as below:

[ 5740.084015] xe 0000:03:00.0: [drm:xe_gt_suspend [xe]] GT0: suspending
[ 5740.084497] xe 0000:03:00.0: [drm:xe_gt_suspend [xe]] GT0: suspended
[ 5740.084549] xe 0000:03:00.0: [drm:xe_gt_suspend [xe]] GT1: suspending
[ 5740.085586] xe 0000:03:00.0: [drm:xe_gt_suspend [xe]] GT1: suspended
[ 5740.195624] xe 0000:03:00.0: [drm:xe_gt_resume [xe]] GT0: resuming
[ 5740.218038] xe 0000:03:00.0: [drm:xe_gt_resume [xe]] GT0: resumed
[ 5740.218112] xe 0000:03:00.0: [drm:xe_gt_resume [xe]] GT1: resuming
[ 5741.289312] xe 0000:03:00.0: [drm:xe_gt_resume [xe]] GT1: resumed
[ 5741.289472] [IGT] xe_pm: finished subtest d3hot-pci-membarrier, SUCCESS

I see similar messages to the above before the test passes completely. I did not find a d3cold-capable device, but I will try to grab one tomorrow and test.
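
Rough sketch of the d3cold variant I have in mind (untested; it reuses the flow of the d3hot snippet above, and assumes in_d3() and IGT_ACPI_D3Cold are the helper/enum names the existing xe_pm test already provides, with the fixture doing the usual d3cold setup):

        ptr = mmap(NULL, size, prot, flags, device.fd_xe, offset);
        igt_assert(ptr != MAP_FAILED);

        sleep(1);
        igt_assert(in_d3(device, IGT_ACPI_D3Cold));

        /* Same expectations as d3hot: writes are dropped, reads keep
         * returning the unconfigured-doorbell pattern, dmesg stays clean.
         */
        for (int i = 0; i < size / sizeof(*ptr); i++) {
                igt_assert_eq_u32(READ_ONCE(ptr[i]), 0xdeadbeef);
                ptr[i] = i;
                igt_assert_neq(READ_ONCE(ptr[i]), i);
        }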

> 
> >
> >>
> >>>
> >>>>
> >>>>>
> >>>>>>
> >>>>>>> +	return 0;
> >>>>>>> +}
> >>>>>>> +
> >>>>>>> +static int xe_mmap(struct file *filp, struct vm_area_struct *vma)
> >>>>>>> +{
> >>>>>>> +	struct drm_file *priv = filp->private_data;
> >>>>>>> +	struct drm_device *dev = priv->minor->dev;
> >>>>>>> +
> >>>>>>> +	if (drm_dev_is_unplugged(dev))
> >>>>>>> +		return -ENODEV;
> >>>>>>
> >>>>>> With hotunplug I think there is also a special dummy page that gets
> >>>>>> faulted in if we detect that we are unplugged in the fault handler,
> >>>>>> to avoid sigbus killing stuff (see xe_gem_fault()). Also I guess all
> >>>>>> mmaps are nuked on unplug so everything gets refaulted. But this
> >>>>>> doesn't look to use a fault handler and just pre-populates it here,
> >>>>>> so wondering what would happen here?
> >>>>>
> >>>>> Got your point. This might be a problem. Do you have any suggestion
> >>>>> here?
> >>>>
> >>>> Not really. But maybe first just try it with an igt and see what
> >>>> happens? We can run hotunplug igt for example.
> >>>
> >>> If unplug happens then it's a new device at that point and the client
> >>> needs to reload, so I think we don't need to take care of this.
> >>
> >> At least for normal bo mmap stuff you usually don't want to crash the
> >> app/umd, hence why a lot of effort is spent trying to avoid returning
> >> SIGBUS by instead mapping a dummy page if we detect device was
> >> detached in the fault handler. The user will need to get a new driver
> >> fd (assuming device comes back), but user should be able to
> >> gracefully detect the unplug instead of crashing (all ioctls will
> >> return an error, for example).
> >>
> >> When you do the unplug the driver instance stays alive, even after
> >> pci remove callback is called, where that driver instance is detached
> >> from the pci device underneath. If user is accessing the barrier ptr
> >> when this happens does it crash? Would be good to check that.
> >
> > Created a fault handler and did the remapping, and I see the test is
> > passing post-unplug also.
> 
> You remapped it to the dummy page? Yeah, that should do the trick, I think.

I remapped the actual physical page in the fault handler, as it will only ever return 0xdeadbeef, which is what the UMD test for the PCI barrier expects.
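
Roughly along these lines (sketch only, the exact V2 code may differ): the fault handler just re-inserts the same barrier PFN the patch above uses at mmap time, so reads keep hitting the filtered doorbell page.

        static vm_fault_t barrier_fault(struct vm_fault *vmf)
        {
                struct vm_area_struct *vma = vmf->vma;
                struct drm_device *dev = vma->vm_private_data;
                pgprot_t prot = vm_get_page_prot(vma->vm_flags);
                unsigned long pfn;

                /* Re-insert the same physical page used at mmap time: the
                 * last doorbell page, which is filtered on the device and
                 * reads back as 0xdeadbeef.
                 */
                pfn = PHYS_PFN(pci_resource_start(to_pci_dev(dev->dev), 0) +
                               LAST_DB_PAGE_OFFSET);

                return vmf_insert_pfn_prot(vma, vma->vm_start, pfn,
                                           pgprot_noncached(prot));
        }

        static const struct vm_operations_struct vm_ops_barrier = {
                .open = barrier_open,
                .close = barrier_close,
                .fault = barrier_fault,
        };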

> 
> >
> >>
> >>>
> >>>>
> >>>>>
> >>>>>>
> >>>>>>> +
> >>>>>>> +#define XE_PCI_BARRIER_MMAP_OFFSET (0x50 << PAGE_SHIFT)
> >>>>>>
> >>>>>> This needs to go in uapi header I think?
> >>>>>
> >>>>> Yes, I went through the uapi header, did not see any offset there,
> >>>>> and did not find a suitable place. Any suggestion where this should
> >>>>> be positioned in the uapi header?
> >>>>
> >>>> It might make sense to have an mmap_offset flag instead (as per
> >>>> below), then we don't need this.
> >>>>
> >>>>>
> >>>>>>
> >>>>>>> +	switch (vma->vm_pgoff) {
> >>>>>>
> >>>>>> I think pgoff is the fake offset to be able to lookup the bo, so
> >>>>>> for this to work I think you would need to reserve this page
> >>>>>> somehow (if we go with this approach), otherwise user can easily
> >>>>>> end up mmapping a normal bo and get back a fake offset that happens
> >>>>>> to be this exact special barrier offset and then things go boom.
> >>>>>>
> >>>>>> Another idea could be to have mmap_offset called with a special
> >>>>>> flag to indicate this BARRIER thing, and then it returns the magic
> >>>>>> reserved offset. That way we don't need to hardcode
> >>>>>> XE_PCI_BARRIER_MMAP_OFFSET in the uapi, and be forever stuck with it.
> >>>>>
> >>>>> Are you saying that on different platforms this unused offset might
> >>>>> change? For the current purpose, BMG and PVC both have the correct
> >>>>> value for this unused offset.
> >>>>
> >>>> Why is it an unused offset, also why is it platform specific? The
> >>>> vm_pgoff can also contain the fake offset to lookup the BO, so you
> >>>> would now need to ensure that the special barrier offset is never
> >>>> used, and so doesn't clash with some random BO offset. Maybe I'm
> >>>> missing something here...
> >>>
> >>> Oh, now I recall: for vm_pgoff, there is some offset added, which
> >>> leaves anything below that offset untouched. We selected vm_pgoff to
> >>> be used from that area.
> >>
> >> Ah, you mean DRM_FILE_PAGE_OFFSET_START? It does feel a bit scary to
> >> rely on that internal behaviour and then bake it into the uapi forever.
> >> So far I assume drivers are not really depending on that behaviour so
> >> that could change in the future.
> >
> > Are you insisting on this change now, or should we leave it for the future?
> 
> IMO we should at least consider not hard coding the offset in the uapi, but
> maybe just return it from mmap_offset when passing a special flag, with no
> hard guarantee that the offset can't change if kmd so chooses?
> What do you think?
> 
> Also if we want to keep the magic 0x50 instead of allocating an actual offset it
> might be good to add a build check against DRM_FILE_PAGE_OFFSET_START
> somewhere?

The build check looks good to me, roughly along the lines of the sketch below. Still, let me go through both options tomorrow and get back with V2 on this. I will do this in a separate patch.
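
Rough sketch of the build check (untested; it assumes DRM_FILE_PAGE_OFFSET_START from drm_vma_manager.h marks where the DRM vma manager starts handing out BO fake offsets, and the exact placement may change in V2):

        /* In xe_device.c, using the existing special offset: */
        #define XE_PCI_BARRIER_MMAP_OFFSET (0x50 << PAGE_SHIFT)

        /* In xe_mmap(), before dispatching on vma->vm_pgoff: the magic
         * barrier offset must stay below the BO fake-offset range so the
         * two can never clash.
         */
        BUILD_BUG_ON((XE_PCI_BARRIER_MMAP_OFFSET >> PAGE_SHIFT) >=
                     DRM_FILE_PAGE_OFFSET_START);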

Tejas
> 
> >
> > Tejas
> >>
> >>>
> >>> Tejas
> >>>
> >>>>
> >>>>>
> >>>>> But I agree; for future compatibility, we can have this handling in
> >>>>> the KMD itself to detect the platform and provide the correct unused
> >>>>> defined offset for the UMD to map. I will take care of this.
> >>>>>
> >>>>> Tejas
> >>>>>
> >>>>>>
> >>>>>>
> >>>>>>> +	case XE_PCI_BARRIER_MMAP_OFFSET >> PAGE_SHIFT:
> >>>>>>> +		return xe_pci_barrier_mmap(filp, vma);
> >>>>>>> +	}
> >>>>>>> +
> >>>>>>> +	return drm_gem_mmap(filp, vma);
> >>>>>>> +}
> >>>>>>> +
> >>>>>>>      static const struct file_operations xe_driver_fops = {
> >>>>>>>      	.owner = THIS_MODULE,
> >>>>>>>      	.open = drm_open,
> >>>>>>>      	.release = drm_release_noglobal,
> >>>>>>>      	.unlocked_ioctl = xe_drm_ioctl,
> >>>>>>> -	.mmap = drm_gem_mmap,
> >>>>>>> +	.mmap = xe_mmap,
> >>>>>>>      	.poll = drm_poll,
> >>>>>>>      	.read = drm_read,
> >>>>>>>      	.compat_ioctl = xe_drm_compat_ioctl,

