[PATCH] drm/xe/mmap: Add mmap support for PCI memory barrier
Matthew Auld
matthew.auld at intel.com
Wed Oct 16 16:23:30 UTC 2024
On 16/10/2024 17:06, Upadhyay, Tejas wrote:
>
>
>> -----Original Message-----
>> From: Auld, Matthew <matthew.auld at intel.com>
>> Sent: Wednesday, October 16, 2024 9:25 PM
>> To: Upadhyay, Tejas <tejas.upadhyay at intel.com>; intel-
>> xe at lists.freedesktop.org
>> Subject: Re: [PATCH] drm/xe/mmap: Add mmap support for PCI memory
>> barrier
>>
>> On 16/10/2024 16:13, Upadhyay, Tejas wrote:
>>>
>>>
>>>> -----Original Message-----
>>>> From: Auld, Matthew <matthew.auld at intel.com>
>>>> Sent: Monday, October 14, 2024 7:23 PM
>>>> To: Upadhyay, Tejas <tejas.upadhyay at intel.com>; intel-
>>>> xe at lists.freedesktop.org
>>>> Subject: Re: [PATCH] drm/xe/mmap: Add mmap support for PCI memory
>>>> barrier
>>>>
>>>> On 14/10/2024 14:08, Upadhyay, Tejas wrote:
>>>>>
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Auld, Matthew <matthew.auld at intel.com>
>>>>>> Sent: Monday, October 14, 2024 4:34 PM
>>>>>> To: Upadhyay, Tejas <tejas.upadhyay at intel.com>; intel-
>>>>>> xe at lists.freedesktop.org
>>>>>> Subject: Re: [PATCH] drm/xe/mmap: Add mmap support for PCI memory
>>>>>> barrier
>>>>>>
>>>>>> On 14/10/2024 11:23, Upadhyay, Tejas wrote:
>>>>>>>
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Auld, Matthew <matthew.auld at intel.com>
>>>>>>>> Sent: Wednesday, October 9, 2024 7:27 PM
>>>>>>>> To: Upadhyay, Tejas <tejas.upadhyay at intel.com>; intel-
>>>>>>>> xe at lists.freedesktop.org
>>>>>>>> Subject: Re: [PATCH] drm/xe/mmap: Add mmap support for PCI
>> memory
>>>>>>>> barrier
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> On 09/10/2024 11:10, Tejas Upadhyay wrote:
>>>>>>>>> In order to avoid having userspace use MI_MEM_FENCE, we are
>>>>>>>>> adding a mechanism for userspace to generate a PCI memory
>>>>>>>>> barrier with low overhead (avoiding an IOCTL call).
>>>>>>>>
>>>>>>>> I guess it doesn't work if we just do a VRAM write from the host side?
>>>>>>>
>>>>>>> Writing to VRAM adds some overhead which the UMD wants to avoid.
>>>>>>> So the suggestion was to use a special MMIO page, mapped directly,
>>>>>>> which is not accessible from the PCI bus, so that the MMIO writes
>>>>>>> themselves are ignored, but the PCI memory barrier will still take
>>>>>>> effect, as the MMIO filtering happens after the memory barrier.
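>>>>>>>
>>>>>>> Roughly, the expected usage from the UMD side would be something
>>>>>>> like this (just a sketch; the mmap offset is the value hardcoded
>>>>>>> further down in this patch):
>>>>>>>
>>>>>>> 	/* map the write-only barrier page; reads/exec are rejected */
>>>>>>> 	uint32_t *barrier = mmap(NULL, 4096, PROT_WRITE, MAP_SHARED,
>>>>>>> 				 xe_fd, 0x50 << 12);
>>>>>>> 	assert(barrier != MAP_FAILED);
>>>>>>>
>>>>>>> 	/* any write acts as the PCI memory barrier; the written value
>>>>>>> 	 * itself is discarded by the MMIO filtering
>>>>>>> 	 */
>>>>>>> 	*barrier = 0;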
>>>>>>
>>>>>> Would be good to add this info to the commit message.
>>>>>
>>>>> Ok.
>>>>>
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> This is implemented by memory-mapping a page as uncached that is
>>>>>>>>> backed by MMIO on the dGPU, thus allowing userspace to do memory
>>>>>>>>> writes to the page without invoking an IOCTL.
>>>>>>>>> We select the MMIO so that it is not accessible from the PCI bus,
>>>>>>>>> so that the MMIO writes themselves are ignored, but the PCI
>>>>>>>>> memory barrier will still take effect, as the MMIO filtering
>>>>>>>>> happens after the memory barrier.
>>>>>>>>>
>>>>>>>>> When we detect the specially defined offset in mmap(), we map the
>>>>>>>>> 4K page which contains the last page of the doorbell MMIO range
>>>>>>>>> to userspace for this purpose.
>>>>>>>>>
>>>>>>>>> Note: Test coverage for this is added by the IGT here:
>>>>>>>>> https://patchwork.freedesktop.org/patch/618931/
>>>>>>>>>
>>>>>>>>> Signed-off-by: Tejas Upadhyay <tejas.upadhyay at intel.com>
>>>>>>>>
>>>>>>>> Needs some UMD Cc. This was requested by some UMD?
>>>>>>>
>>>>>>> Yes, I will add someone in cc.
>>>>>>>
>>>>>>>>
>>>>>>>>> ---
>>>>>>>>>  drivers/gpu/drm/xe/xe_device.c | 68 +++++++++++++++++++++++++++++++++-
>>>>>>>>>  1 file changed, 67 insertions(+), 1 deletion(-)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
>>>>>>>>> index cd241a8e1838..c97a4d1f0a98 100644
>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_device.c
>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_device.c
>>>>>>>>> @@ -239,12 +239,78 @@ static long xe_drm_compat_ioctl(struct file *file, unsigned int cmd, unsigned lo
>>>>>>>>>  #define xe_drm_compat_ioctl NULL
>>>>>>>>>  #endif
>>>>>>>>>
>>>>>>>>> +static void barrier_open(struct vm_area_struct *vma)
>>>>>>>>> +{
>>>>>>>>> +	drm_dev_get(vma->vm_private_data);
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>> +static void barrier_close(struct vm_area_struct *vma)
>>>>>>>>> +{
>>>>>>>>> +	drm_dev_put(vma->vm_private_data);
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>> +static const struct vm_operations_struct vm_ops_barrier = {
>>>>>>>>> +	.open = barrier_open,
>>>>>>>>> +	.close = barrier_close,
>>>>>>>>> +};
>>>>>>>>> +
>>>>>>>>> +static int xe_pci_barrier_mmap(struct file *filp,
>>>>>>>>> +			       struct vm_area_struct *vma)
>>>>>>>>> +{
>>>>>>>>> +	struct drm_file *priv = filp->private_data;
>>>>>>>>> +	struct drm_device *dev = priv->minor->dev;
>>>>>>>>> +	unsigned long pfn;
>>>>>>>>> +	pgprot_t prot;
>>>>>>>>> +
>>>>>>>>> +	if (vma->vm_end - vma->vm_start > PAGE_SIZE)
>>>>>>>>> +		return -EINVAL;
>>>>>>>>> +
>>>>>>>>> +	if (is_cow_mapping(vma->vm_flags))
>>>>>>>>> +		return -EINVAL;
>>>>>>>>> +
>>>>>>>>> +	if (vma->vm_flags & (VM_READ | VM_EXEC))
>>>>>>>>> +		return -EINVAL;
>>>>>>>>> +
>>>>>>>>> +	vm_flags_clear(vma, VM_MAYREAD | VM_MAYEXEC);
>>>>>>>>> +	vm_flags_set(vma, VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP | VM_IO);
>>>>>>>>> +
>>>>>>>>> +	prot = vm_get_page_prot(vma->vm_flags);
>>>>>>>>> +#define LAST_DB_PAGE_OFFSET 0x7ff001
>>>>>>>>> +	pfn = PHYS_PFN(pci_resource_start(to_pci_dev(dev->dev), 0) +
>>>>>>>>> +		       LAST_DB_PAGE_OFFSET);
>>>>>>>>> +	if (vmf_insert_pfn_prot(vma, vma->vm_start, pfn,
>>>>>>>>> +				pgprot_noncached(prot)) != VM_FAULT_NOPAGE)
>>>>>>>>> +		return -EFAULT;
>>>>>>>>> +
>>>>>>>>> +	vma->vm_ops = &vm_ops_barrier;
>>>>>>>>> +	vma->vm_private_data = dev;
>>>>>>>>> +	drm_dev_get(vma->vm_private_data);
>>>>>>>>
>>>>>>>> What is going on here with pci power states? Say device goes into
>>>>>>>> d3hot or d3cold and you touch this mmio, what happens? Do you get
>>>>>>>> pci errors in dmesg? Does rpm need to get involved here? Might be
>>>>>>>> good to have a testcase for this.
>>>>>>>
>>>>>>> I am not sure here. But we will have a follow-up IGT patch once the
>>>>>>> basic IGT is merged, where we will try to add the test you are
>>>>>>> suggesting here.
>>>>>>
>>>>>> I think we need to understand this. If we need rpm here, then the
>>>>>> design might need to change, since grabbing rpm for the lifetime of
>>>>>> the mmap is likely the same as just disabling rpm completely, which
>>>>>> is maybe not acceptable. Not sure though.
>>>>>
>>>>> We don't need rpm here, as the membarrier is there to generate a
>>>>> side-effect on the GT to serialise any memory traffic across the PCI
>>>>> bus; if the device is asleep, there is no memory traffic across the
>>>>> bus, thus the requirement is satisfied.
>>>>
>>>> Ok, but would be good to confirm that. At least for stuff like pci
>>>> access with LMEMBAR, IIRC device must be in d0 or you can get pci
>>>> errors, even if writes are ignored.
>>>
>>> Tried d3hot with this test,
>>>
>>> 	ptr = mmap(NULL, size, prot, flags, device.fd_xe, offset);
>>> 	igt_assert(ptr != MAP_FAILED);
>>>
>>> 	sleep(1);
>>> 	igt_assert(in_d3(device, d_state));
>>> 	/* Check whole page for any errors, also check as
>>> 	 * we should not read written values back
>>> 	 */
>>> 	for (int i = 0; i < size / sizeof(*ptr); i++) {
>>> 		/* It is expected unconfigured doorbell space
>>> 		 * will return read value 0xdeadbeef
>>> 		 */
>>> 		igt_assert_eq_u32(READ_ONCE(ptr[i]), 0xdeadbeef);
>>>
>>> 		igt_gettime(&tv);
>>> 		ptr[i] = i;
>>> 		if (READ_ONCE(ptr[i]) == i) {
>>> 			while (READ_ONCE(ptr[i]) == i)
>>> 				;
>>> 			igt_info("fd:%d value retained for %"PRId64"ns pos:%d\n",
>>> 				 device.fd_xe, igt_nsec_elapsed(&tv), i);
>>> 		}
>>> 		igt_assert_neq(READ_ONCE(ptr[i]), i);
>>> 	}
>>>
>>> I am putting the device in D3 and then reading; I am able to read back
>>> the mappings.
>>
>> Ok, cool. And dmesg is also clean? I guess would be good to try with d3cold.
>> If we can turn the above into an igt, there should be something in CI that is
>> d3cold capable to check this.
>
> Dmesg is also clean; I see some suspend/resume messages as below:
>
> [ 5740.084015] xe 0000:03:00.0: [drm:xe_gt_suspend [xe]] GT0: suspending
> [ 5740.084497] xe 0000:03:00.0: [drm:xe_gt_suspend [xe]] GT0: suspended
> [ 5740.084549] xe 0000:03:00.0: [drm:xe_gt_suspend [xe]] GT1: suspending
> [ 5740.085586] xe 0000:03:00.0: [drm:xe_gt_suspend [xe]] GT1: suspended
> [ 5740.195624] xe 0000:03:00.0: [drm:xe_gt_resume [xe]] GT0: resuming
> [ 5740.218038] xe 0000:03:00.0: [drm:xe_gt_resume [xe]] GT0: resumed
> [ 5740.218112] xe 0000:03:00.0: [drm:xe_gt_resume [xe]] GT1: resuming
> [ 5741.289312] xe 0000:03:00.0: [drm:xe_gt_resume [xe]] GT1: resumed
> [ 5741.289472] [IGT] xe_pm: finished subtest d3hot-pci-membarrier, SUCCESS
>
> I see messages similar to the above before the test passes completely. I did not find a d3cold-capable device, but I will try to grab one tomorrow and test.
>
>>
>>>
>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>> +	return 0;
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>> +static int xe_mmap(struct file *filp, struct vm_area_struct *vma)
>>>>>>>>> +{
>>>>>>>>> +	struct drm_file *priv = filp->private_data;
>>>>>>>>> +	struct drm_device *dev = priv->minor->dev;
>>>>>>>>> +
>>>>>>>>> +	if (drm_dev_is_unplugged(dev))
>>>>>>>>> +		return -ENODEV;
>>>>>>>>
>>>>>>>> With hotunplug I think there is also a special dummy page that gets
>>>>>>>> faulted in if we detect that we are unplugged in the fault handler,
>>>>>>>> to avoid sigbus killing stuff (see xe_gem_fault()). Also I guess
>>>>>>>> all mmaps are nuked on unplug so everything gets refaulted. But
>>>>>>>> this doesn't look to use a fault handler and just pre-populates it
>>>>>>>> here, so wondering what would happen here?
>>>>>>>
>>>>>>> Got your point. This might be a problem. Do you have any suggestion
>>>>>>> here?
>>>>>>
>>>>>> Not really. But maybe first just try it with an igt and see what
>>>>>> happens? We can run hotunplug igt for example.
>>>>>
>>>>> If unplug happens then it's a new device at that point and the client
>>>>> needs to reload, so I think we don't need to take care of this.
>>>>
>>>> At least for normal bo mmap stuff you usually don't want to crash the
>>>> app/umd, hence why a lot of effort is spent trying to avoid returning
>>>> SIGBUS, by instead mapping a dummy page if we detect the device was
>>>> detached in the fault handler. The user will need to get a new driver
>>>> fd (assuming the device comes back), but the user should be able to
>>>> gracefully detect the unplug instead of crashing (all ioctls will
>>>> return an error, for example).
>>>>
>>>> When you do the unplug the driver instance stays alive, even after the
>>>> pci remove callback is called, where that driver instance is detached
>>>> from the pci device underneath. If the user is accessing the barrier
>>>> ptr when this happens, does it crash? Would be good to check that.
>>>
>>> Created a fault handler and did remapping, and I see the test is
>>> passing post unplug also.
>>
>> You remapped it to the dummy page? Yeah, that should do the trick, I think.
>
> I remapped the actual physical page in the fault handler, as it will only return "deadbeef", which the UMD test with the pci barrier is expecting.
I don't think it's normally valid to touch physical device stuff like
mmio post unplug. The physical device underneath might be completely
gone, or attached to a new driver instance, hence the dummy page thing.
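
Something along these lines is what I have in mind (untested sketch;
barrier_fault is a made-up name, and the dummy page lifetime handling
is elided here, e.g. it could be freed via a drmm release action). The
mmap would then just set vma->vm_ops with .fault = barrier_fault
instead of pre-populating:

static vm_fault_t barrier_fault(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;
	struct drm_device *dev = vma->vm_private_data;
	pgprot_t prot = vm_get_page_prot(vma->vm_flags);
	vm_fault_t ret;
	int idx;

	if (drm_dev_enter(dev, &idx)) {
		/* device still attached: map the real doorbell MMIO page */
		unsigned long pfn;

		pfn = PHYS_PFN(pci_resource_start(to_pci_dev(dev->dev), 0) +
			       LAST_DB_PAGE_OFFSET);
		ret = vmf_insert_pfn_prot(vma, vma->vm_start, pfn,
					  pgprot_noncached(prot));
		drm_dev_exit(idx);
	} else {
		/* unplugged: fault in a zeroed dummy page instead of
		 * SIGBUS-ing the app; the writes just go nowhere
		 */
		struct page *page = alloc_page(GFP_KERNEL | __GFP_ZERO);

		if (!page)
			return VM_FAULT_OOM;

		ret = vmf_insert_pfn_prot(vma, vma->vm_start,
					  page_to_pfn(page), prot);
	}

	return ret;
}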
>
>>
>>>
>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>> +
>>>>>>>>> +#define XE_PCI_BARRIER_MMAP_OFFSET (0x50 << PAGE_SHIFT)
>>>>>>>>
>>>>>>>> This needs to go in uapi header I think?
>>>>>>>
>>>>>>> Yes, I went through the uapi header, did not see any offset there,
>>>>>>> and did not find a suitable place. Any suggestion where this should
>>>>>>> be positioned in the uapi header?
>>>>>>
>>>>>> It might make sense to have an mmap_offset flag instead (as per
>>>>>> below), then we don't need this.
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>> +	switch (vma->vm_pgoff) {
>>>>>>>>
>>>>>>>> I think pgoff is the fake offset to be able to look up the bo, so
>>>>>>>> for this to work I think you would need to reserve this page
>>>>>>>> somehow (if we go with this approach), otherwise the user can
>>>>>>>> easily end up mmapping a normal bo and get back a fake offset that
>>>>>>>> happens to be this exact special barrier offset, and then things
>>>>>>>> go boom.
>>>>>>>>
>>>>>>>> Another idea could be to have mmap_offset called with a special
>>>>>>>> flag to indicate this BARRIER thing, and then it returns the magic
>>>>>>>> reserved offset. That way we don't need to hardcode
>>>>>>>> XE_PCI_BARRIER_MMAP_OFFSET in the uapi, and be forever stuck with
>>>>>>>> it.
>>>>>>>
>>>>>>> Are you saying that on different platforms this unused offset might
>>>>>>> change? For the current purpose, BMG and PVC both have the correct
>>>>>>> value for this as an unused offset.
>>>>>>
>>>>>> Why is it an unused offset, and why is it platform specific? The
>>>>>> vm_pgoff can also contain the fake offset to look up the BO, so you
>>>>>> would now need to ensure that the special barrier offset is never
>>>>>> used, and so doesn't clash with some random BO offset. Maybe I'm
>>>>>> missing something here...
>>>>>
>>>>> Oh, now I recall: for vm_pgoff there is some offset added, which
>>>>> leaves anything below that offset untouched. We selected the vm_pgoff
>>>>> to be used from that area.
>>>>
>>>> Ah, you mean DRM_FILE_PAGE_OFFSET_START? It does feel a bit scary to
>>>> rely on that internal behaviour and then bake it into the uapi forever.
>>>> So far I assume drivers are not really depending on that behaviour so
>>>> that could change in the future.
>>>
>>> Are you insisting on this change now, or do we leave it for the future?
>>
>> IMO we should at least consider not hardcoding the offset in the uapi,
>> but maybe just return it from mmap_offset when passing a special flag,
>> with no hard guarantee that the offset can't change if the kmd so
>> chooses? What do you think?
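>>
>> Roughly like this from the userspace side (sketch only;
>> DRM_XE_MMAP_OFFSET_FLAG_PCI_BARRIER is a made-up name, not existing
>> uapi):
>>
>> 	struct drm_xe_gem_mmap_offset mmo = {
>> 		.handle = 0, /* no BO involved */
>> 		.flags = DRM_XE_MMAP_OFFSET_FLAG_PCI_BARRIER,
>> 	};
>>
>> 	ioctl(fd, DRM_IOCTL_XE_GEM_MMAP_OFFSET, &mmo);
>> 	ptr = mmap(NULL, 4096, PROT_WRITE, MAP_SHARED, fd, mmo.offset);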
>>
>> Also, if we want to keep the magic 0x50 instead of allocating an actual
>> offset, it might be good to add a build check against
>> DRM_FILE_PAGE_OFFSET_START somewhere?
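>>
>> Something like this, say (sketch; DRM_FILE_PAGE_OFFSET_START is
>> currently private to drm_gem.c, so it would first need to move
>> somewhere shareable):
>>
>> 	/* the barrier page must sit below the GEM mmap offset range,
>> 	 * otherwise it could clash with a BO's fake offset
>> 	 */
>> 	BUILD_BUG_ON(XE_PCI_BARRIER_MMAP_OFFSET >> PAGE_SHIFT >=
>> 		     DRM_FILE_PAGE_OFFSET_START);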
>
> The build check looks good to me; still, let me go through both options tomorrow and get back with V2 on this. I will get this in a separate patch.
>
> Tejas
>>
>>>
>>> Tejas
>>>>
>>>>>
>>>>> Tejas
>>>>>
>>>>>>
>>>>>>>
>>>>>>> But I agree that for future compatibility we can have this handling
>>>>>>> in the KMD itself, to detect the platform and provide the correct
>>>>>>> unused defined offset for the UMD to map. I will take care of this.
>>>>>>>
>>>>>>> Tejas
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> +	case XE_PCI_BARRIER_MMAP_OFFSET >> PAGE_SHIFT:
>>>>>>>>> +		return xe_pci_barrier_mmap(filp, vma);
>>>>>>>>> +	}
>>>>>>>>> +
>>>>>>>>> +	return drm_gem_mmap(filp, vma);
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>>  static const struct file_operations xe_driver_fops = {
>>>>>>>>>  	.owner = THIS_MODULE,
>>>>>>>>>  	.open = drm_open,
>>>>>>>>>  	.release = drm_release_noglobal,
>>>>>>>>>  	.unlocked_ioctl = xe_drm_ioctl,
>>>>>>>>> -	.mmap = drm_gem_mmap,
>>>>>>>>> +	.mmap = xe_mmap,
>>>>>>>>>  	.poll = drm_poll,
>>>>>>>>>  	.read = drm_read,
>>>>>>>>>  	.compat_ioctl = xe_drm_compat_ioctl,