[PATCH] drm/amdkfd: reserve the BO before validating it

Yu, Lang Lang.Yu at amd.com
Fri Jan 12 07:39:47 UTC 2024


[Public]

>-----Original Message-----
>From: Kuehling, Felix <Felix.Kuehling at amd.com>
>Sent: Friday, January 12, 2024 12:19 AM
>To: Yu, Lang <Lang.Yu at amd.com>; amd-gfx at lists.freedesktop.org; Francis, David
><David.Francis at amd.com>
>Cc: Yang, Philip <Philip.Yang at amd.com>
>Subject: Re: [PATCH] drm/amdkfd: reserve the BO before validating it
>
>On 2024-01-11 02:22, Lang Yu wrote:
>> Fixes: 410f08516e0f ("drm/amdkfd: Move dma unmapping after TLB flush")
>>
>> [   41.708711] WARNING: CPU: 0 PID: 1463 at
>drivers/gpu/drm/ttm/ttm_bo.c:846 ttm_bo_validate+0x146/0x1b0 [ttm]
>> [   41.708989] Call Trace:
>> [   41.708992]  <TASK>
>> [   41.708996]  ? show_regs+0x6c/0x80
>> [   41.709000]  ? ttm_bo_validate+0x146/0x1b0 [ttm]
>> [   41.709008]  ? __warn+0x93/0x190
>> [   41.709014]  ? ttm_bo_validate+0x146/0x1b0 [ttm]
>> [   41.709024]  ? report_bug+0x1f9/0x210
>> [   41.709035]  ? handle_bug+0x46/0x80
>> [   41.709041]  ? exc_invalid_op+0x1d/0x80
>> [   41.709048]  ? asm_exc_invalid_op+0x1f/0x30
>> [   41.709057]  ? amdgpu_amdkfd_gpuvm_dmaunmap_mem+0x2c/0x80
>[amdgpu]
>> [   41.709185]  ? ttm_bo_validate+0x146/0x1b0 [ttm]
>> [   41.709197]  ? amdgpu_amdkfd_gpuvm_dmaunmap_mem+0x2c/0x80
>[amdgpu]
>> [   41.709337]  ? srso_alias_return_thunk+0x5/0x7f
>> [   41.709346]  kfd_mem_dmaunmap_attachment+0x9e/0x1e0 [amdgpu]
>> [   41.709467]  amdgpu_amdkfd_gpuvm_dmaunmap_mem+0x56/0x80
>[amdgpu]
>> [   41.709586]  kfd_ioctl_unmap_memory_from_gpu+0x1b7/0x300 [amdgpu]
>> [   41.709710]  kfd_ioctl+0x1ec/0x650 [amdgpu]
>> [   41.709822]  ? __pfx_kfd_ioctl_unmap_memory_from_gpu+0x10/0x10
>[amdgpu]
>> [   41.709945]  ? srso_alias_return_thunk+0x5/0x7f
>> [   41.709949]  ? tomoyo_file_ioctl+0x20/0x30
>> [   41.709959]  __x64_sys_ioctl+0x9c/0xd0
>> [   41.709967]  do_syscall_64+0x3f/0x90
>> [   41.709973]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
>>
>> Signed-off-by: Lang Yu <Lang.Yu at amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 7 ++++++-
>>   1 file changed, 6 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> index 48697b789342..f5542a4ab8ed 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> @@ -2095,8 +2095,13 @@ void
>amdgpu_amdkfd_gpuvm_dmaunmap_mem(struct kgd_mem *mem, void
>*drm_priv)
>>      mutex_lock(&mem->lock);
>>
>>      list_for_each_entry(entry, &mem->attachments, list) {
>> -            if (entry->bo_va->base.vm == vm)
>> +            if (entry->bo_va->base.vm != vm)
>> +                    continue;
>> +
>> +            if (!WARN_ON(amdgpu_bo_reserve(entry->bo_va->base.bo,
>true))) {
>>                      kfd_mem_dmaunmap_attachment(mem, entry);
>> +                    amdgpu_bo_unreserve(entry->bo_va->base.bo);
>> +            }
>
>I'm pretty sure someone else worked on a fix for this before. This is not a good
>solution. We need to handle failed reservations (due to
>ERESTARTSYS) and make sure that the unmap ioctl can be restarted correctly in
>that case.
>
>See
>https://lore.kernel.org/amd-gfx/530aac57-5561-4d1d-879a-
>93b108e5c8c2 at gmail.com/

Got it. Thanks.

Christian's concern is
"Kernel operations should either completely fail, fully complete or
explicitly return how much they completed (e.g. how many bytes
transferred for example). That we only partially complete and track that
state inside the kernel is usually a no-go."

ERESTART_RESTARTBLOCK is for partially restart and may help.

Regards,
Lang

>David, do you have any update on this work?
>
>Regards,
>   Felix
>
>
>>      }
>>
>>      mutex_unlock(&mem->lock);


More information about the amd-gfx mailing list