[PATCH] drm/xe/mmap: Add mmap support for PCI memory barrier

Matthew Auld matthew.auld at intel.com
Mon Oct 14 11:03:54 UTC 2024


On 14/10/2024 11:23, Upadhyay, Tejas wrote:
> 
> 
>> -----Original Message-----
>> From: Auld, Matthew <matthew.auld at intel.com>
>> Sent: Wednesday, October 9, 2024 7:27 PM
>> To: Upadhyay, Tejas <tejas.upadhyay at intel.com>; intel-
>> xe at lists.freedesktop.org
>> Subject: Re: [PATCH] drm/xe/mmap: Add mmap support for PCI memory
>> barrier
>>
>> Hi,
>>
>> On 09/10/2024 11:10, Tejas Upadhyay wrote:
>>> In order to avoid having userspace use MI_MEM_FENCE, we are adding
>>> a mechanism for userspace to generate a PCI memory barrier with low
>>> overhead (avoiding an IOCTL call).
>>
>> I guess it doesn't work if we just do a VRAM write from the host side?
> 
> Writing to VRAM adds some overhead which the UMD wants to avoid. So the suggestion was to use a special MMIO page, mapped directly, which is not accessible from the PCI bus, so that the MMIO writes themselves are ignored while the PCI memory barrier still takes effect, since the MMIO filtering happens after the memory barrier.

Would be good to add this info to the commit message.
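
For reference, the expected userspace side would then be roughly 
something like the below (untested sketch; assumes 4K pages and the 
special offset as hardcoded in this version of the patch):

/* Untested sketch of the expected UMD usage. The offset value is the
 * one hardcoded in this patch; the final uapi may well differ. */
#include <stdint.h>
#include <sys/mman.h>

#define XE_PCI_BARRIER_MMAP_OFFSET (0x50ul << 12) /* 0x50 << PAGE_SHIFT, 4K pages */

static volatile uint32_t *barrier_page;

int xe_barrier_map(int xe_fd)
{
	/* Write-only, shared mapping; the kernel side rejects
	 * PROT_READ/PROT_EXEC for this page. */
	barrier_page = mmap(NULL, 4096, PROT_WRITE, MAP_SHARED, xe_fd,
			    XE_PCI_BARRIER_MMAP_OFFSET);
	return barrier_page == MAP_FAILED ? -1 : 0;
}

void xe_pci_barrier(void)
{
	/* The store itself is discarded by the MMIO filtering, but it
	 * still acts as a PCI memory barrier. */
	*barrier_page = 0;
}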

> 
>>
>>>
>>> This is implemented by memory-mapping a page as uncached that is
>>> backed by MMIO on the dGPU, thus allowing userspace to write to the
>>> page without invoking an IOCTL.
>>> We select the MMIO range so that it is not accessible from the PCI
>>> bus: the MMIO writes themselves are ignored, but the PCI memory
>>> barrier will still take effect, as the MMIO filtering happens after
>>> the memory barrier.
>>>
>>> When we detect the specially defined offset in mmap(), we map the 4K
>>> page which contains the last page of the doorbell MMIO range to
>>> userspace for this purpose.
>>>
>>> Note: Test coverage for this is added by the IGT here:
>>>         https://patchwork.freedesktop.org/patch/618931/
>>>
>>> Signed-off-by: Tejas Upadhyay <tejas.upadhyay at intel.com>
>>
>> Needs some UMD Cc. This was requested by some UMD?
> 
> Yes, I will add someone in cc.
> 
>>
>>> ---
>>>    drivers/gpu/drm/xe/xe_device.c | 68 +++++++++++++++++++++++++++++++++-
>>>    1 file changed, 67 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
>>> index cd241a8e1838..c97a4d1f0a98 100644
>>> --- a/drivers/gpu/drm/xe/xe_device.c
>>> +++ b/drivers/gpu/drm/xe/xe_device.c
>>> @@ -239,12 +239,78 @@ static long xe_drm_compat_ioctl(struct file *file, unsigned int cmd, unsigned lo
>>>    #define xe_drm_compat_ioctl NULL
>>>    #endif
>>>
>>> +static void barrier_open(struct vm_area_struct *vma)
>>> +{
>>> +	drm_dev_get(vma->vm_private_data);
>>> +}
>>> +
>>> +static void barrier_close(struct vm_area_struct *vma)
>>> +{
>>> +	drm_dev_put(vma->vm_private_data);
>>> +}
>>> +
>>> +static const struct vm_operations_struct vm_ops_barrier = {
>>> +	.open = barrier_open,
>>> +	.close = barrier_close,
>>> +};
>>> +
>>> +static int xe_pci_barrier_mmap(struct file *filp,
>>> +			       struct vm_area_struct *vma)
>>> +{
>>> +	struct drm_file *priv = filp->private_data;
>>> +	struct drm_device *dev = priv->minor->dev;
>>> +	unsigned long pfn;
>>> +	pgprot_t prot;
>>> +
>>> +	if (vma->vm_end - vma->vm_start > PAGE_SIZE)
>>> +		return -EINVAL;
>>> +
>>> +	if (is_cow_mapping(vma->vm_flags))
>>> +		return -EINVAL;
>>> +
>>> +	if (vma->vm_flags & (VM_READ | VM_EXEC))
>>> +		return -EINVAL;
>>> +
>>> +	vm_flags_clear(vma, VM_MAYREAD | VM_MAYEXEC);
>>> +	vm_flags_set(vma, VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP | VM_IO);
>>> +
>>> +	prot = vm_get_page_prot(vma->vm_flags);
>>> +#define LAST_DB_PAGE_OFFSET 0x7ff001
>>> +	pfn = PHYS_PFN(pci_resource_start(to_pci_dev(dev->dev), 0) +
>>> +			LAST_DB_PAGE_OFFSET);
>>> +	if (vmf_insert_pfn_prot(vma, vma->vm_start, pfn,
>>> +				pgprot_noncached(prot)) != VM_FAULT_NOPAGE)
>>> +		return -EFAULT;
>>> +
>>> +	vma->vm_ops = &vm_ops_barrier;
>>> +	vma->vm_private_data = dev;
>>> +	drm_dev_get(vma->vm_private_data);
>>
>> What is going on here with pci power states? Say device goes into d3hot or
>> d3cold and you touch this mmio, what happens? Do you get pci errors in
>> dmesg? Does rpm need to get involved here? Might be good to have a
>> testcase for this.
> 
> I am not sure here. But we will have a follow-up IGT patch once the basic IGT is merged, where we will try to add the test you are suggesting.

I think we need to understand this. If we need rpm here, then the design 
might need to change, since grabbing rpm for the lifetime of the mmap is 
likely the same as just disabling rpm completely, which is maybe not 
acceptable. Not sure though.
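
To make the rpm angle concrete, the naive version would be something 
like the below (sketch only; xe_pm_runtime_get/put on the open/close 
hooks, plus a matching get in the initial mmap), which is exactly the 
"rpm disabled for the lifetime of the mmap" situation:

/* Sketch: pin rpm while the mapping exists. This effectively disables
 * runtime pm for as long as userspace keeps the barrier page mapped,
 * which is probably not what we want. */
static void barrier_open(struct vm_area_struct *vma)
{
	struct drm_device *dev = vma->vm_private_data;

	drm_dev_get(dev);
	xe_pm_runtime_get(to_xe_device(dev));
}

static void barrier_close(struct vm_area_struct *vma)
{
	struct drm_device *dev = vma->vm_private_data;

	xe_pm_runtime_put(to_xe_device(dev));
	drm_dev_put(dev);
}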

> 
>>
>>> +	return 0;
>>> +}
>>> +
>>> +static int xe_mmap(struct file *filp, struct vm_area_struct *vma)
>>> +{
>>> +	struct drm_file *priv = filp->private_data;
>>> +	struct drm_device *dev = priv->minor->dev;
>>> +
>>> +	if (drm_dev_is_unplugged(dev))
>>> +		return -ENODEV;
>>
>> With hotunplug I think there is also a special dummy page that gets faulted in
>> if we detect that we are unplugged in the fault handler, to avoid sigbus killing
>> stuff (see xe_gem_fault()). Also I guess all mmaps are nuked on unplug so
>> everything gets refaulted. But this doesn't look to use a fault handler and just
>> pre-populates the page here, so I'm wondering what would happen in that case?
> 
> Got your point. This might be a problem. Do you have any suggestions here?

Not really. But maybe first just try it with an IGT and see what 
happens? We can run the hotunplug IGT, for example.
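
If the pre-populated pfn does turn out to be a problem on unplug, one 
option might be a fault-handler version, roughly mirroring the 
drm_dev_enter() dance in xe_gem_fault(). Rough sketch below, reusing 
the LAST_DB_PAGE_OFFSET define from the patch; the barrier_dummy_pfn 
field is made up and would need to be a preallocated per-device dummy 
page:

/* Sketch: fault handler that inserts the real MMIO pfn while the
 * device is alive, and a dummy page once it is unplugged, so that
 * userspace doesn't get killed with SIGBUS. */
static vm_fault_t barrier_fault(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;
	struct drm_device *dev = vma->vm_private_data;
	struct xe_device *xe = to_xe_device(dev);
	pgprot_t prot = vm_get_page_prot(vma->vm_flags);
	vm_fault_t ret = VM_FAULT_SIGBUS;
	int idx;

	if (drm_dev_enter(dev, &idx)) {
		unsigned long pfn;

		pfn = PHYS_PFN(pci_resource_start(to_pci_dev(dev->dev), 0) +
			       LAST_DB_PAGE_OFFSET);
		ret = vmf_insert_pfn_prot(vma, vma->vm_start, pfn,
					  pgprot_noncached(prot));
		drm_dev_exit(idx);
	} else {
		/* Device is gone: hand back a harmless dummy page.
		 * xe->barrier_dummy_pfn is hypothetical here. */
		ret = vmf_insert_pfn_prot(vma, vma->vm_start,
					  xe->barrier_dummy_pfn, prot);
	}

	return ret;
}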

>   
>>
>>> +
>>> +#define XE_PCI_BARRIER_MMAP_OFFSET (0x50 << PAGE_SHIFT)
>>
>> This needs to go in the uapi header I think?
> 
> Yes, I went through the uapi header; I did not see any offset there and did not find a suitable place. Any suggestion on where this should be positioned in the uapi header?

It might make sense to have an mmap_offset flag instead (as per below); 
then we don't need this.
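
Roughly what I mean (sketch only; the flag name is made up): let the 
existing mmap_offset ioctl hand back the reserved offset when a special 
flag is set, so nothing gets hardcoded in the uapi header:

/* Sketch of the kernel side; XE_MMAP_OFFSET_FLAG_PCI_BARRIER is a
 * made-up name. Userspace passes the flag with handle == 0 and gets
 * back whatever offset the KMD has reserved for the barrier page. */
#define XE_MMAP_OFFSET_FLAG_PCI_BARRIER (1 << 0)

int xe_gem_mmap_offset_ioctl(struct drm_device *dev, void *data,
			     struct drm_file *file)
{
	struct drm_xe_gem_mmap_offset *args = data;

	if (args->flags & XE_MMAP_OFFSET_FLAG_PCI_BARRIER) {
		if (args->handle)
			return -EINVAL;
		args->offset = XE_PCI_BARRIER_MMAP_OFFSET;
		return 0;
	}

	/* ... existing BO lookup path ... */
	return 0;
}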

> 
>>
>>> +	switch (vma->vm_pgoff) {
>>
>> I think pgoff is the fake offset used to look up the BO, so for this to work I
>> think you would need to reserve this page somehow (if we go with this
>> approach), otherwise the user can easily end up mmapping a normal BO, getting
>> back a fake offset that happens to be this exact special barrier offset, and then
>> things go boom.
>>
>> Another idea could be to have mmap_offset called with special flag to
>> indicate this BARRIER thing, and then it returns the magic reserved offset.
>> That way we don't need to hardcode XE_PCI_BARRIER_MMAP_OFFSET in the
>> uapi, and be forever stuck with it.
> 
> Are you saying that on different platforms this unused offset might change? For the current purpose, BMG and PVC both have the correct value for this as an unused offset.

Why is it an unused offset, and why is it platform specific? The 
vm_pgoff can also contain the fake offset used to look up the BO, so you 
would now need to ensure that the special barrier offset is never used, 
so that it doesn't clash with some random BO offset. Maybe I'm missing 
something here...

> 
> But I agree that, for future compatibility, we can have this handling in the KMD itself: detect the platform and provide the correct unused offset for the UMD to map. I will take care of this.
> 
> Tejas
>   
>>
>>
>>> +	case XE_PCI_BARRIER_MMAP_OFFSET >> PAGE_SHIFT:
>>> +		return xe_pci_barrier_mmap(filp, vma);
>>> +	}
>>> +
>>> +	return drm_gem_mmap(filp, vma);
>>> +}
>>> +
>>>    static const struct file_operations xe_driver_fops = {
>>>    	.owner = THIS_MODULE,
>>>    	.open = drm_open,
>>>    	.release = drm_release_noglobal,
>>>    	.unlocked_ioctl = xe_drm_ioctl,
>>> -	.mmap = drm_gem_mmap,
>>> +	.mmap = xe_mmap,
>>>    	.poll = drm_poll,
>>>    	.read = drm_read,
>>>    	.compat_ioctl = xe_drm_compat_ioctl,
