[PATCH] drm/xe/mmap: Add mmap support for PCI memory barrier

Matthew Auld matthew.auld at intel.com
Mon Oct 14 13:53:01 UTC 2024


On 14/10/2024 14:08, Upadhyay, Tejas wrote:
> 
> 
>> -----Original Message-----
>> From: Auld, Matthew <matthew.auld at intel.com>
>> Sent: Monday, October 14, 2024 4:34 PM
>> To: Upadhyay, Tejas <tejas.upadhyay at intel.com>; intel-
>> xe at lists.freedesktop.org
>> Subject: Re: [PATCH] drm/xe/mmap: Add mmap support for PCI memory
>> barrier
>>
>> On 14/10/2024 11:23, Upadhyay, Tejas wrote:
>>>
>>>
>>>> -----Original Message-----
>>>> From: Auld, Matthew <matthew.auld at intel.com>
>>>> Sent: Wednesday, October 9, 2024 7:27 PM
>>>> To: Upadhyay, Tejas <tejas.upadhyay at intel.com>; intel-
>>>> xe at lists.freedesktop.org
>>>> Subject: Re: [PATCH] drm/xe/mmap: Add mmap support for PCI memory
>>>> barrier
>>>>
>>>> Hi,
>>>>
>>>> On 09/10/2024 11:10, Tejas Upadhyay wrote:
>>>>> In order to avoid having userspace use MI_MEM_FENCE, we are
>>>>> adding a mechanism for userspace to generate a PCI memory barrier
>>>>> with low overhead (avoiding an IOCTL call).
>>>>
>>>> I guess it doesn't work if we just do a VRAM write from the host side?
>>>
>>> Writing to VRAM adds some overhead which the UMD wants to avoid. So
>>> the suggestion was to use a special MMIO page, mapped directly, which
>>> is not accessible from the PCI bus, so that the MMIO writes themselves
>>> are ignored, but the PCI memory barrier still takes effect since the
>>> MMIO filtering happens after the memory barrier.
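
For the commit message it might also help to show the intended usage. A
rough userspace sketch (untested; the offset value mirrors this patch,
and the device node path is just an assumption for illustration):

```c
/* Hypothetical UMD-side sketch of the proposed uapi: mmap the special
 * offset write-only, then write to the page to trigger a PCI memory
 * barrier. XE_PCI_BARRIER_MMAP_OFFSET mirrors the value in this patch;
 * the render node path is an assumption. */
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define XE_PCI_BARRIER_MMAP_OFFSET (0x50ull << 12) /* pgoff 0x50, 4K pages */

int main(void)
{
	int fd = open("/dev/dri/renderD128", O_RDWR);
	if (fd < 0)
		return 0; /* no device present; nothing to demonstrate */

	/* PROT_WRITE only: the kernel side rejects VM_READ | VM_EXEC. */
	volatile uint32_t *barrier =
		mmap(NULL, 4096, PROT_WRITE, MAP_SHARED, fd,
		     XE_PCI_BARRIER_MMAP_OFFSET);
	if (barrier == MAP_FAILED) {
		close(fd);
		return 0;
	}

	/* The write itself is discarded by MMIO filtering, but the posted
	 * write still serialises prior memory traffic across the PCI bus. */
	*barrier = 0;

	munmap((void *)barrier, 4096);
	close(fd);
	return 0;
}
```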
>>
>> Would be good to add this info to the commit message.
> 
> Ok.
> 
>>
>>>
>>>>
>>>>>
>>>>> This is implemented by memory-mapping a page as uncached that is
>>>>> backed by MMIO on the dGPU, thus allowing userspace to write to the
>>>>> page without invoking an IOCTL.
>>>>> We select the MMIO range so that it is not accessible from the PCI
>>>>> bus; the MMIO writes themselves are ignored, but the PCI memory
>>>>> barrier still takes effect, as the MMIO filtering happens after the
>>>>> memory barrier.
>>>>>
>>>>> When we detect the specially defined offset in mmap(), we map the
>>>>> 4K page containing the last page of the doorbell MMIO range to
>>>>> userspace for this purpose.
>>>>>
>>>>> Note: Test coverage for this is added by the IGT at
>>>>>          https://patchwork.freedesktop.org/patch/618931/
>>>>>
>>>>> Signed-off-by: Tejas Upadhyay <tejas.upadhyay at intel.com>
>>>>
>>>> Needs some UMD Cc. This was requested by some UMD?
>>>
>>> Yes, I will add someone in cc.
>>>
>>>>
>>>>> ---
>>>>>     drivers/gpu/drm/xe/xe_device.c | 68 +++++++++++++++++++++++++++++++++-
>>>>>     1 file changed, 67 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/xe/xe_device.c
>>>>> b/drivers/gpu/drm/xe/xe_device.c
>>>>> index cd241a8e1838..c97a4d1f0a98 100644
>>>>> --- a/drivers/gpu/drm/xe/xe_device.c
>>>>> +++ b/drivers/gpu/drm/xe/xe_device.c
>>>>> @@ -239,12 +239,78 @@ static long xe_drm_compat_ioctl(struct file *file, unsigned int cmd, unsigned lo
>>>>>     #define xe_drm_compat_ioctl NULL
>>>>>     #endif
>>>>>
>>>>> +static void barrier_open(struct vm_area_struct *vma)
>>>>> +{
>>>>> +	drm_dev_get(vma->vm_private_data);
>>>>> +}
>>>>> +
>>>>> +static void barrier_close(struct vm_area_struct *vma)
>>>>> +{
>>>>> +	drm_dev_put(vma->vm_private_data);
>>>>> +}
>>>>> +
>>>>> +static const struct vm_operations_struct vm_ops_barrier = {
>>>>> +	.open = barrier_open,
>>>>> +	.close = barrier_close,
>>>>> +};
>>>>> +
>>>>> +static int xe_pci_barrier_mmap(struct file *filp,
>>>>> +			       struct vm_area_struct *vma)
>>>>> +{
>>>>> +	struct drm_file *priv = filp->private_data;
>>>>> +	struct drm_device *dev = priv->minor->dev;
>>>>> +	unsigned long pfn;
>>>>> +	pgprot_t prot;
>>>>> +
>>>>> +	if (vma->vm_end - vma->vm_start > PAGE_SIZE)
>>>>> +		return -EINVAL;
>>>>> +
>>>>> +	if (is_cow_mapping(vma->vm_flags))
>>>>> +		return -EINVAL;
>>>>> +
>>>>> +	if (vma->vm_flags & (VM_READ | VM_EXEC))
>>>>> +		return -EINVAL;
>>>>> +
>>>>> +	vm_flags_clear(vma, VM_MAYREAD | VM_MAYEXEC);
>>>>> +	vm_flags_set(vma, VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP | VM_IO);
>>>>> +
>>>>> +	prot = vm_get_page_prot(vma->vm_flags);
>>>>> +#define LAST_DB_PAGE_OFFSET 0x7ff001
>>>>> +	pfn = PHYS_PFN(pci_resource_start(to_pci_dev(dev->dev), 0) +
>>>>> +			LAST_DB_PAGE_OFFSET);
>>>>> +	if (vmf_insert_pfn_prot(vma, vma->vm_start, pfn,
>>>>> +				pgprot_noncached(prot)) != VM_FAULT_NOPAGE)
>>>>> +		return -EFAULT;
>>>>> +
>>>>> +	vma->vm_ops = &vm_ops_barrier;
>>>>> +	vma->vm_private_data = dev;
>>>>> +	drm_dev_get(vma->vm_private_data);
>>>>
>>>> What is going on here with pci power states? Say device goes into
>>>> d3hot or d3cold and you touch this mmio, what happens? Do you get pci
>>>> errors in dmesg? Does rpm need to get involved here? Might be good to
>>>> have a testcase for this.
>>>
>>> I am not sure here. But we will have a follow-up IGT patch once the
>>> basic IGT is merged, where we will try to add the test you are
>>> suggesting here.
>>
>> I think we need to understand this. If we need rpm here, then the design
>> might need to change, since grabbing rpm for the lifetime of the mmap is
>> likely the same as just disabling rpm completely, which is maybe not
>> acceptable. Not sure though.
> 
> We don't need rpm here, as the memory barrier exists to generate a
> side effect on the GT that serialises any memory traffic across the
> PCI bus; if the device is asleep, there is no memory traffic across
> the bus, so the requirement is satisfied.

Ok, but it would be good to confirm that. At least for stuff like pci 
access with LMEMBAR, IIRC the device must be in d0 or you can get pci 
errors, even if writes are ignored.

> 
>>
>>>
>>>>
>>>>> +	return 0;
>>>>> +}
>>>>> +
>>>>> +static int xe_mmap(struct file *filp, struct vm_area_struct *vma)
>>>>> +{
>>>>> +	struct drm_file *priv = filp->private_data;
>>>>> +	struct drm_device *dev = priv->minor->dev;
>>>>> +
>>>>> +	if (drm_dev_is_unplugged(dev))
>>>>> +		return -ENODEV;
>>>>
>>>> With hotunplug I think there is also a special dummy page that gets
>>>> faulted in if we detect that we are unplugged in the fault handler
>>>> to avoid sigbus killing stuff (see xe_gem_fault()). Also I guess all
>>>> mmaps are nuked on unplug so everything gets refaulted. But this
>>>> doesn't look to use a fault handler and just pre-populates it here,
>>>> so wondering what would happen here?
>>>
>>> Got your point. This might be a problem. Do you have any suggestions
>>> here?
>>
>> Not really. But maybe first just try it with an igt and see what
>> happens? We can run hotunplug igt for example.
> 
> If unplug happens then it's a new device at that point and the client
> needs to reload, so I think we don't need to take care of this.

At least for normal bo mmap stuff you usually don't want to crash the 
app/umd, hence why a lot of effort is spent trying to avoid returning 
SIGBUS by instead mapping a dummy page if we detect the device was 
detached in the fault handler. The user will need to get a new driver 
fd (assuming the device comes back), but the user should be able to 
gracefully detect the unplug instead of crashing (all ioctls will 
return an error, for example).

When you do the unplug the driver instance stays alive, even after the 
pci remove callback is called, where that driver instance is detached 
from the pci device underneath. If the user is accessing the barrier 
ptr when this happens, does it crash? Would be good to check that.
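
Something along these lines is what I have in mind (untested sketch,
modelled loosely on how xe_gem_fault() handles unplug; the helper usage
and the choice of dummy page are illustrative, not a worked-out patch):

```c
/* Sketch only: a fault-handler based mapping so the unplugged case can
 * be handled gracefully instead of the pre-populated PFN going stale.
 * LAST_DB_PAGE_OFFSET is the value from this patch; the dummy-page
 * fallback here is illustrative. */
static vm_fault_t barrier_fault(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;
	struct drm_device *dev = vma->vm_private_data;
	pgprot_t prot = pgprot_noncached(vm_get_page_prot(vma->vm_flags));
	vm_fault_t ret;
	int idx;

	if (drm_dev_enter(dev, &idx)) {
		/* Device present: map the real barrier MMIO page. */
		unsigned long pfn =
			PHYS_PFN(pci_resource_start(to_pci_dev(dev->dev), 0) +
				 LAST_DB_PAGE_OFFSET);

		ret = vmf_insert_pfn_prot(vma, vma->vm_start, pfn, prot);
		drm_dev_exit(idx);
	} else {
		/* Unplugged: back the VMA with a harmless dummy page so
		 * the app keeps running instead of taking SIGBUS. */
		ret = vmf_insert_pfn(vma, vma->vm_start,
				     page_to_pfn(ZERO_PAGE(0)));
	}

	return ret;
}

static const struct vm_operations_struct vm_ops_barrier = {
	.open = barrier_open,
	.close = barrier_close,
	.fault = barrier_fault,
};
```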

> 
>>
>>>
>>>>
>>>>> +
>>>>> +#define XE_PCI_BARRIER_MMAP_OFFSET (0x50 << PAGE_SHIFT)
>>>>
>>>> This needs to go in uapi header I think?
>>>
>>> Yes, I went through the uapi header, did not see any offset there,
>>> and did not find a suitable place. Any suggestion where this should
>>> be positioned in the uapi header?
>>
>> It might make sense to have an mmap_offset flag instead (as per below),
>> then we don't need this.
>>
>>>
>>>>
>>>>> +	switch (vma->vm_pgoff) {
>>>>
>>>> I think pgoff is the fake offset to be able to lookup the bo, so for
>>>> this to work I think you would need to reserve this page somehow (if
>>>> we go with this approach), otherwise user can easily end up mmapping
>>>> a normal bo and get back a fake offset that happens to be this exact
>>>> special barrier offset and then things go boom.
>>>>
>>>> Another idea could be to have mmap_offset called with special flag to
>>>> indicate this BARRIER thing, and then it returns the magic reserved offset.
>>>> That way we don't need to hardcode XE_PCI_BARRIER_MMAP_OFFSET in
>> the
>>>> uapi, and be forever stuck with it.
>>>
>>> Are you saying that on different platforms this unused offset might
>>> change? For the current purpose, BMG and PVC both have the correct
>>> value for this as an unused offset.
>>
>> Why is it an unused offset, also why is it platform specific? The
>> vm_pgoff can also contain the fake offset to lookup the BO, so you would
>> now need to ensure that the special barrier offset is never used, and so
>> doesn't clash with some random BO offset. Maybe I'm missing something
>> here...
> 
> Oh, now I recall: for vm_pgoff there is some offset added, which
> leaves anything below that offset untouched. We selected the vm_pgoff
> to be used from that area.

Ah, you mean DRM_FILE_PAGE_OFFSET_START? It does feel a bit scary to 
rely on that internal behaviour and then bake it into the uapi forever. 
So far I assume drivers are not really depending on that behaviour so 
that could change in the future.

>   
> Tejas
>   
>>
>>>
>>> But I agree that for future compatibility we can have this handling
>>> in the KMD itself, to detect the platform and provide the correct
>>> unused offset for the UMD to map. I will take care of this.
>>>
>>> Tejas
>>>
>>>>
>>>>
>>>>> +	case XE_PCI_BARRIER_MMAP_OFFSET >> PAGE_SHIFT:
>>>>> +		return xe_pci_barrier_mmap(filp, vma);
>>>>> +	}
>>>>> +
>>>>> +	return drm_gem_mmap(filp, vma);
>>>>> +}
>>>>> +
>>>>>     static const struct file_operations xe_driver_fops = {
>>>>>     	.owner = THIS_MODULE,
>>>>>     	.open = drm_open,
>>>>>     	.release = drm_release_noglobal,
>>>>>     	.unlocked_ioctl = xe_drm_ioctl,
>>>>> -	.mmap = drm_gem_mmap,
>>>>> +	.mmap = xe_mmap,
>>>>>     	.poll = drm_poll,
>>>>>     	.read = drm_read,
>>>>>     	.compat_ioctl = xe_drm_compat_ioctl,

