[PATCH] drm/amdgpu: Fix vm free pts race when process exiting
Christian König
ckoenig.leichtzumerken at gmail.com
Wed Aug 4 15:32:26 UTC 2021
On 04.08.21 at 17:11, Philip Yang wrote:
>
> On 2021-08-04 5:01 a.m., Christian König wrote:
>
>> On 03.08.21 at 00:33, Philip Yang wrote:
>>> Take the vm->invalidated_lock spinlock when removing vm pd and pt bos,
>>> to avoid linked list corruption, with crash backtrace:
>>>
>>> [ 2290.505111] list_del corruption. next->prev should be
>>> ffff9b2514ec0018, but was 4e03280211010f04
>>> [ 2290.505154] kernel BUG at lib/list_debug.c:56!
>>> [ 2290.505176] invalid opcode: 0000 [#1] SMP NOPTI
>>> [ 2290.505254] Workqueue: events delayed_fput
>>> [ 2290.505271] RIP: 0010:__list_del_entry_valid.cold.1+0x20/0x4c
>>> [ 2290.505513] Call Trace:
>>> [ 2290.505623] amdgpu_vm_free_table+0x26/0x80 [amdgpu]
>>> [ 2290.505705] amdgpu_vm_free_pts+0x7a/0xf0 [amdgpu]
>>> [ 2290.505786] amdgpu_vm_fini+0x1f0/0x440 [amdgpu]
>>> [ 2290.505864] amdgpu_driver_postclose_kms+0x172/0x290 [amdgpu]
>>> [ 2290.505893] drm_file_free.part.10+0x1d4/0x270 [drm]
>>> [ 2290.505916] drm_release+0xa9/0xe0 [drm]
>>> [ 2290.505930] __fput+0xb7/0x230
>>> [ 2290.505942] delayed_fput+0x1c/0x30
>>> [ 2290.505957] process_one_work+0x1a7/0x360
>>> [ 2290.505971] worker_thread+0x30/0x390
>>> [ 2290.505985] ? create_worker+0x1a0/0x1a0
>>> [ 2290.505999] kthread+0x112/0x130
>>> [ 2290.506011] ? kthread_flush_work_fn+0x10/0x10
>>> [ 2290.506027] ret_from_fork+0x22/0x40
>>
>> Wow, well this is a big NAK.
>>
>> The page tables should never, ever be on the invalidated list;
>> otherwise we would try to point PTEs at them, which is a huge
>> security issue.
>>
>> Taking the lock just works around that. Can you investigate how it
>> happens that a page table ends up on that list?
>
> I will be off for 3 days; I can investigate this further next Monday.
> From debugging, amdgpu_vm_bo_invalidate is called many times. The
> background: the app has 8 processes running on 8 GPUs with 64 VMs,
> and VRAM usage is around 85% according to rocm-info. There is a 1/5
> chance this happens (kernel BUG, server dies) after the application
> crashes with a segmentation fault (a separate app issue). It is a new
> issue on ROCm release 4.3; it never happened on ROCm 4.1 (there was
> no app crash on 4.1 either). The kernel version is 4.18 on CentOS.
>
It could be a bug in the VM code. I'm on vacation until the 16th of
August, but I will try to take a look when I have time.
Thanks for the notice,
Christian.
> Regards,
>
> Philip
>
>>
>> Thanks in advance,
>> Christian.
>>
>>>
>>> Signed-off-by: Philip Yang <Philip.Yang at amd.com>
>>> ---
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 ++
>>> 1 file changed, 2 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> index 2a88ed5d983b..5c4c355e7d6b 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> @@ -1045,7 +1045,9 @@ static void amdgpu_vm_free_table(struct amdgpu_vm_bo_base *entry)
>>>  		return;
>>>  	shadow = amdgpu_bo_shadowed(entry->bo);
>>>  	entry->bo->vm_bo = NULL;
>>> +	spin_lock(&entry->vm->invalidated_lock);
>>>  	list_del(&entry->vm_status);
>>> +	spin_unlock(&entry->vm->invalidated_lock);
>>>  	amdgpu_bo_unref(&shadow);
>>>  	amdgpu_bo_unref(&entry->bo);
>>>  }
>>
More information about the amd-gfx mailing list