[PATCH] drm/amdgpu: fix an underflow on non-4KB-page systems

Tue Mar 30 12:55:01 UTC 2021

Am 30.03.21 um 14:04 schrieb Xi Ruoyao:
> On 2021-03-30 03:40 +0800, Xi Ruoyao wrote:
>> On 2021-03-29 21:36 +0200, Christian König wrote:
>>> Am 29.03.21 um 21:27 schrieb Xi Ruoyao:
>>>> Hi Christian,
>>>>
>>>> I don't think there is any constraint implemented to ensure `num_entries %
>>>> AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0`.  For example, in `amdgpu_vm_bo_map()`:
>>>>
>>>>           /* validate the parameters */
>>>>           if (saddr & AMDGPU_GPU_PAGE_MASK || offset & AMDGPU_GPU_PAGE_MASK
>>>>               size == 0 || size & AMDGPU_GPU_PAGE_MASK)
>>>>                   return -EINVAL;
>>>>
>>>> /* snip */
>>>>
>>>>           saddr /= AMDGPU_GPU_PAGE_SIZE;
>>>>           eaddr /= AMDGPU_GPU_PAGE_SIZE;
>>>>
>>>> /* snip */
>>>>
>>>>           mapping->start = saddr;
>>>>           mapping->last = eaddr;
>>>>
>>>>
>>>> If we really want to ensure (mapping->last - mapping->start + 1) %
>>>> AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0, then we should replace
>>>> "AMDGPU_GPU_PAGE_MASK"
>>>> in "validate the parameters" with "PAGE_MASK".
> It should be "~PAGE_MASK", "PAGE_MASK" has an opposite convention of
> "AMDGPU_GPU_PAGE_MASK" :(.
>
>>> Yeah, good point.
>>>
>>>> I tried it and it broke userspace: Xorg startup fails with EINVAL with
>>>> this
>>>> change.
>>> Well in theory it is possible that we always fill the GPUVM on a 4k
>>> basis while the native page size of the CPU is larger. Let me double
>>> check the code.
> On my platform, there are two issues both causing the VM corruption.  One is
> fixed by agd5f/linux at fe001e7.

What is fe001e7? A commit id? If yes then that is to short and I can't 
find it.

>    Another is in Mesa from userspace:  it uses
> `dev_info->gart_page_size` as the alignment, but the kernel AMDGPU driver
> expects it to use `dev_info->virtual_address_alignment`.

Mhm, looking at the kernel code I would rather say Mesa is correct and 
the kernel driver is broken.

The gart_page_size is limited by the CPU page size, but the 
virtual_address_alignment isn't.

> If we can make the change to fill the GPUVM on a 4k basis, we can fix this issue
> and make virtual_address_alignment = 4K.  Otherwise, we should fortify the
> parameter validation, changing "AMDGPU_GPU_PAGE_MASK" to "~PAGE_MASK".  Then the
> userspace will just get an EINVAL, instead of a slient VM corruption.  And
> someone should tell Mesa developers to fix the code in this case.

I rather see this as a kernel bug. Can you test if this code fragment 
fixes your issue:

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
index 64beb3399604..e1260b517e1b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
@@ -780,7 +780,7 @@ int amdgpu_info_ioctl(struct drm_device *dev, void 
*data, struct drm_file *filp)
                 }
                 dev_info->virtual_address_alignment = 
max((int)PAGE_SIZE, AMDGPU_GPU_PAGE_SIZE);
                 dev_info->pte_fragment_size = (1 << 
adev->vm_manager.fragment_size) * AMDGPU_GPU_PAGE_SIZE;
-               dev_info->gart_page_size = AMDGPU_GPU_PAGE_SIZE;
+               dev_info->gart_page_size = 
dev_info->virtual_address_alignment;
                 dev_info->cu_active_number = adev->gfx.cu_info.number;
                 dev_info->cu_ao_mask = adev->gfx.cu_info.ao_cu_mask;
                 dev_info->ce_ram_size = adev->gfx.ce_ram_size;


Thanks,
Christian.

> --
> Xi Ruoyao <xry111 at mengyan1223.wang>
> School of Aerospace Science and Technology, Xidian University
>