[PATCH] drm/amdgpu: Unpin MMIO and DOORBELL BOs only after map count goes to zero

Errabolu, Ramesh Ramesh.Errabolu at amd.com
Thu Jun 9 14:44:20 UTC 2022


[AMD Official Use Only - General]

My response is inline.

Regards,
Ramesh

-----Original Message-----
From: Kuehling, Felix <Felix.Kuehling at amd.com> 
Sent: Thursday, June 9, 2022 2:14 AM
To: Errabolu, Ramesh <Ramesh.Errabolu at amd.com>; amd-gfx at lists.freedesktop.org
Subject: Re: [PATCH] drm/amdgpu: Unpin MMIO and DOORBELL BOs only after map count goes to zero

On 2022-06-08 16:03, Errabolu, Ramesh wrote:
> [AMD Official Use Only - General]
>
> My response is inline.
>
> Regards,
> Ramesh
>
> -----Original Message-----
> From: Kuehling, Felix <Felix.Kuehling at amd.com>
> Sent: Thursday, June 9, 2022 1:10 AM
> To: amd-gfx at lists.freedesktop.org; Errabolu, Ramesh 
> <Ramesh.Errabolu at amd.com>
> Subject: Re: [PATCH] drm/amdgpu: Unpin MMIO and DOORBELL BOs only 
> after map count goes to zero
>
>
> On 2022-06-08 07:51, Ramesh Errabolu wrote:
>> In existing code MMIO and DOORBELL BOs are unpinned without ensuring 
>> the condition that their map count has reached zero. Unpinning 
>> without checking this constraint could lead to an error while BO is 
>> being freed. The patch fixes this issue.
>>
>> Signed-off-by: Ramesh Errabolu <Ramesh.Errabolu at amd.com>
>> ---
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 15 +++++++--------
>>    1 file changed, 7 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> index a1de900ba677..e5dc94b745b1 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> @@ -1832,13 +1832,6 @@ int amdgpu_amdkfd_gpuvm_free_memory_of_gpu(
>>    
>>    	mutex_lock(&mem->lock);
>>    
>> -	/* Unpin MMIO/DOORBELL BO's that were pinned during allocation */
>> -	if (mem->alloc_flags &
>> -	    (KFD_IOC_ALLOC_MEM_FLAGS_DOORBELL |
>> -	     KFD_IOC_ALLOC_MEM_FLAGS_MMIO_REMAP)) {
>> -		amdgpu_amdkfd_gpuvm_unpin_bo(mem->bo);
>> -	}
>> -
>>    	mapped_to_gpu_memory = mem->mapped_to_gpu_memory;
>>    	is_imported = mem->is_imported;
>>    	mutex_unlock(&mem->lock);
>> @@ -1855,7 +1848,7 @@ int amdgpu_amdkfd_gpuvm_free_memory_of_gpu(
>>    	/* Make sure restore workers don't access the BO any more */
>>    	bo_list_entry = &mem->validate_list;
>>    	mutex_lock(&process_info->lock);
>> -	list_del(&bo_list_entry->head);
>> +	list_del_init(&bo_list_entry->head);
> Is this an unrelated fix? What is this needed for? I vaguely remember discussing this before, but can't remember the reason.
>
> Ramesh: This fix is unrelated to the P2P work. I brought this issue up while working on IOMMU support on the DKMS branch. Basically, a user could call free() before the map count goes to zero. The patch is trying to fix that.

I get that, but I couldn't remember why I suggested list_del_init here. 
It has nothing to do with unpinning of BOs.

Now I recall that it had something to do with restarting the ioctl after it was interrupted by a signal. reserve_bo_and_cond_vms can fail with -ERESTARTSYS. In that case the ioctl is reentered. We need to make sure it doesn't crash the second time around. list_del will remove bo_list_entry from the list but leave its prev and next pointers unusable (the kernel poisons them). The second time around it will probably cause corruption or an oops. Using list_del_init avoids that by re-initializing the entry so that its prev and next pointers point back at the entry itself, which makes a repeated delete harmless.
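
To make the restart case concrete, here is a minimal userspace sketch of why a second delete is harmless once the entry has been re-initialized. The function names mirror <linux/list.h>, but this is a simplified stand-in written for illustration, not the kernel implementation, and delete_twice_is_safe() is a hypothetical model of the ioctl being entered twice.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace model of the kernel's circular doubly linked list. */
struct list_head {
	struct list_head *next, *prev;
};

static void INIT_LIST_HEAD(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

/* Insert entry right after head. */
static void list_add(struct list_head *entry, struct list_head *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

static int list_empty(const struct list_head *h)
{
	return h->next == h;
}

/* Unlink the entry, then make it an empty list that points at itself. */
static void list_del_init(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	INIT_LIST_HEAD(entry);
}

/* Hypothetical model of a free path that runs twice after -ERESTARTSYS. */
static int delete_twice_is_safe(void)
{
	struct list_head head, a, b;

	INIT_LIST_HEAD(&head);
	list_add(&a, &head);
	list_add(&b, &head);

	list_del_init(&a);	/* first pass through the ioctl */
	assert(list_empty(&a));
	list_del_init(&a);	/* restarted ioctl: only touches a itself */

	/* b must still be the only element on the list */
	return head.next == &b && head.prev == &b &&
	       b.next == &head && b.prev == &head && list_empty(&a);
}
```

With a plain unlink that leaves the entry's pointers untouched or poisoned, the second call would write through those pointers instead of through the entry itself.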

Ramesh: I see the same idiom in remove_kgd_mem_from_kfd_bo_list(). Should we call that function rather than rewrite the same code block? Also, the name remove_xyz_kfd_bo_list() is misleading. Should it be changed?

See one more little fix below.


>
> Regards,
>     Felix
>
>
>>    	mutex_unlock(&process_info->lock);
>>    
>>    	/* No more MMU notifiers */
>> @@ -1880,6 +1873,12 @@ int amdgpu_amdkfd_gpuvm_free_memory_of_gpu(
>>    
>>    	ret = unreserve_bo_and_vms(&ctx, false, false);

This unreserve_bo_and_vms call cannot fail because the wait parameter is false. If it did fail, the error handling would be broken. I'd add a WARN_ONCE to make that assumption explicit, and change the return at the end of this function to return 0. Basically, if we got this far, we are not turning back, and we should return success.
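
A rough userspace sketch of that shape, for illustration only. warn_once() is a hypothetical stand-in for the kernel's WARN_ONCE (it fires once globally rather than once per call site), unreserve_sketch() stands in for unreserve_bo_and_vms, and free_memory_tail_sketch() models only the tail of the free function.

```c
#include <stdio.h>

/* Counts how many warnings were actually printed (for inspection only). */
static int warn_count;

/* Hypothetical stand-in for WARN_ONCE(): prints at most once, returns cond. */
static int warn_once(int cond, const char *msg)
{
	static int warned;

	if (cond && !warned) {
		warned = 1;
		warn_count++;
		fprintf(stderr, "WARNING: %s\n", msg);
	}
	return cond;
}

/* Hypothetical stand-in for unreserve_bo_and_vms(); 'fail' forces an error. */
static int unreserve_sketch(int fail)
{
	return fail ? -1 : 0;
}

/* Tail of the free path: past this point there is no turning back. */
static int free_memory_tail_sketch(int force_failure)
{
	int ret = unreserve_sketch(force_failure);

	/* Make the "cannot fail" assumption explicit instead of propagating ret. */
	warn_once(ret != 0, "unreserve failed after the point of no return");

	/* ... unpin MMIO/DOORBELL BOs, free the sync object, drop references ... */

	return 0;	/* cleanup always completes, so report success */
}
```

The point of the design is that the caller never sees an error for a free that has already torn down state, while a broken assumption still leaves a trace in the log.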

You could update the commit headline to be more general. Something like: 
Fix error handling in amdgpu_amdkfd_gpuvm_free_memory_of_gpu.

Regards,
   Felix


>>    
>> +	/* Unpin MMIO/DOORBELL BO's that were pinned during allocation */
>> +	if (mem->alloc_flags &
>> +	    (KFD_IOC_ALLOC_MEM_FLAGS_DOORBELL |
>> +	     KFD_IOC_ALLOC_MEM_FLAGS_MMIO_REMAP))
>> +		amdgpu_amdkfd_gpuvm_unpin_bo(mem->bo);
>> +
>>    	/* Free the sync object */
>>    	amdgpu_sync_free(&mem->sync);
>>    

