[PATCH] drm/kfd: fix ttm_bo_release warning

Yu, Lang Lang.Yu at amd.com
Fri Sep 24 10:37:44 UTC 2021


[AMD Official Use Only]



>-----Original Message-----
>From: Christian König <ckoenig.leichtzumerken at gmail.com>
>Sent: Friday, September 24, 2021 2:37 PM
>To: Yu, Lang <Lang.Yu at amd.com>; Koenig, Christian
><Christian.Koenig at amd.com>; amd-gfx at lists.freedesktop.org
>Cc: Kuehling, Felix <Felix.Kuehling at amd.com>; Huang, Ray
><Ray.Huang at amd.com>
>Subject: Re: [PATCH] drm/kfd: fix ttm_bo_release warning
>
>Am 24.09.21 um 08:34 schrieb Yu, Lang:
>> [AMD Official Use Only]
>>
>>
>>
>>> -----Original Message-----
>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>> Sent: Friday, September 24, 2021 1:54 PM
>>> To: Yu, Lang <Lang.Yu at amd.com>; amd-gfx at lists.freedesktop.org
>>> Cc: Kuehling, Felix <Felix.Kuehling at amd.com>; Huang, Ray
>>> <Ray.Huang at amd.com>
>>> Subject: Re: [PATCH] drm/kfd: fix ttm_bo_release warning
>>>
>>>
>>> Am 24.09.21 um 07:50 schrieb Yu, Lang:
>>>> [AMD Official Use Only]
>>>>> [SNIP]
>>>>>>>> Hi Christian,
>>>>>>>>
>>>>>>>> Thanks for your explanation and advice. I got your point.
>>>>>>>> Actually, these BOs are allocated and pinned during a kfd process
>lifecycle.
>>>>>>>> I will try to add a flag into struct kgd_mem to indicate which
>>>>>>>> BO is pined and should be unpinned. Which will make
>>>>>>>> amdgpu_bo_pin/amdgpu_bo_unpin calls balanced. Thanks!
>>>>>>> That isn't to much better. The real solution would be to unpin
>>>>>>> them when the kfd process is destroyed.
>>>>>> Yes, will unpin them when the kfd process is destroyed.
>>>>>> But we should indicate which BO is pinned and should be unpinned. Right?
>>>>> Well not with a flag or something like that.
>>>>>
>>>>> The knowledge which BO is pinned and needs to be unpinned should
>>>>> come from the control logic and not be papered over by some general
>handling.
>>>>> That's the background why we have removed the general handling for
>>>>> this from TTM in the first place.
>>>>>
>>>>> In other words when you need to pin a BO because it is kmapped it
>>>>> should be unpinned when it is kunmapped and if you don't kunmap at
>>>>> all then there is something wrong with the code structure from a
>>>>> higher level
>>> point of view.
>>>> Yes, this function "amdgpu_amdkfd_gpuvm_map_gtt_bo_to_kernel" did a
>>>> kmap, but without a kunmap when the kfd process is destroyed. The
>>>> flag
>>> actually indicates kmap/kunmap.
>>>
>>> Well that's the wrong approach then. I mean you need to have the BO
>>> reference and the mapped pointer somewhere, don't you?
>>>
>>> How do you clean those up?
>> They are respectively cleaned by " kfd_process_device_free_bos " and "
>kfd_process_destroy_pdds".
>> Let me put the code here. Thanks!
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>> index ec028cf963f5..d65b3bf13fd8 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>> @@ -81,6 +81,7 @@ struct kgd_mem {
>>
>>          bool aql_queue;
>>          bool is_imported;
>> +       bool is_mapped_to_kernel;
>
>Yeah, that is exactly what you absolutely should NOT do.
>
>>   };
>>
>>   /* KFD Memory Eviction */
>> @@ -278,6 +279,8 @@ int amdgpu_amdkfd_gpuvm_sync_memory(
>>                  struct kgd_dev *kgd, struct kgd_mem *mem, bool intr);
>>   int amdgpu_amdkfd_gpuvm_map_gtt_bo_to_kernel(struct kgd_dev *kgd,
>>                  struct kgd_mem *mem, void **kptr, uint64_t *size);
>
>The real question is who is calling this function here?

Currently  there are 3 places called function "amdgpu_amdkfd_gpuvm_map_gtt_bo_to_kernel" to kmap a BO. 

1, kmap a ptr for pdd->qpd->cwsr_kaddr
Call stack:
amdgpu_amdkfd_gpuvm_map_gtt_bo_to_kernel
kfd_process_alloc_gpuvm
kfd_process_device_init_cwsr_dgpu
kfd_process_device_init_vm
kfd_ioctl_acquire_vm

2, kmap a ptr for pdd->qpd->ib_kaddr
Call stack:
amdgpu_amdkfd_gpuvm_map_gtt_bo_to_kernel
kfd_process_alloc_gpuvm
kfd_process_device_reserve_ib_mem
kfd_process_device_init_vm
kfd_ioctl_acquire_vm

3, kmap a ptr for p->signal_page->kernel_address
Call stack:
amdgpu_amdkfd_gpuvm_map_gtt_bo_to_kernel
kfd_ioctl_create_event

The problem is these kmaped BOs were not kunmaped properly 
when the kfd process is destroyed.

Regards,
Lang

>> +void amdgpu_amdkfd_gpuvm_unmap_gtt_bo_from_kernel(struct kgd_dev
>*kgd,
>> +               struct kgd_mem *mem);
>>   int amdgpu_amdkfd_gpuvm_restore_process_bos(void *process_info,
>>                                              struct dma_fence **ef);
>>   int amdgpu_amdkfd_gpuvm_get_vm_fault_info(struct kgd_dev *kgd, diff
>> --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> index 2d6b2d77b738..45ccbe9f63ee 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> @@ -1857,6 +1857,8 @@ int
>> amdgpu_amdkfd_gpuvm_map_gtt_bo_to_kernel(struct kgd_dev *kgd,
>>
>>          amdgpu_bo_unreserve(bo);
>>
>> +       mem->is_mapped_to_kernel = true;
>> +
>>          mutex_unlock(&mem->process_info->lock);
>>          return 0;
>>
>> @@ -1870,6 +1872,20 @@ int
>amdgpu_amdkfd_gpuvm_map_gtt_bo_to_kernel(struct kgd_dev *kgd,
>>          return ret;
>>   }
>>
>> +void amdgpu_amdkfd_gpuvm_unmap_gtt_bo_from_kernel(struct kgd_dev
>> +*kgd, struct kgd_mem *mem) {
>> +       struct amdgpu_bo *bo = mem->bo;
>> +
>> +       if (!mem->is_mapped_to_kernel)
>> +               return;
>> +
>> +       amdgpu_bo_reserve(bo, true);
>> +       amdgpu_bo_kunmap(bo);
>> +       amdgpu_bo_unpin(bo);
>> +       amdgpu_bo_unreserve(bo);
>> +       mem->is_mapped_to_kernel = false; }
>> +
>>   int amdgpu_amdkfd_gpuvm_get_vm_fault_info(struct kgd_dev *kgd,
>>                                                struct kfd_vm_fault_info *mem)
>>   {
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>> b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>> index 21ec8a18cad2..f5506b153aed 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>> @@ -941,6 +941,8 @@ static void kfd_process_device_free_bos(struct
>kfd_process_device *pdd)
>>                                  peer_pdd->dev->kgd, mem, peer_pdd->drm_priv);
>>                  }
>>
>> +
>> + amdgpu_amdkfd_gpuvm_unmap_gtt_bo_from_kernel(pdd->dev->kgd, mem);
>> +
>
>That's a general cleanup function for user space allocations and should not be
>abused for stuff like that.
>
>Regards,
>Christian.
>
>>                  amdgpu_amdkfd_gpuvm_free_memory_of_gpu(pdd->dev->kgd, mem,
>>                                                         pdd->drm_priv, NULL);
>>                  kfd_process_device_remove_obj_handle(pdd, id);
>>
>> Regards,
>> Lang
>>
>>> Regards,
>>> Christian.


More information about the amd-gfx mailing list