[PATCH] drm/amdgpu: Fix a NULL pointer of fence

Pan, Xinhui Xinhui.Pan at amd.com
Fri Jul 8 01:08:39 UTC 2022


[AMD Official Use Only - General]

Felix,
Shared fences depend on exclusive fence, so add a new exclusive fence, say NULL would also remove all shared fences. That works before 5.18 . 😉
From 5.18, adding a new exclusive fence(the write usage fence) did not remove any shared fences(the read usage fence). So that is broken.

And I also try the debug_evictions parameter. No unexpected eviction shows anyway.
I did a quick check and found amdgpu implement BO release notify and it will remove kfd ef on pt/pd BO. So we don’t need this duplicated ef removal. The interesting thing is that is done by patch f4a3c42b5c52("drm/amdgpu: Remove kfd eviction fence before release bo (v2)") which is from me 😉 I totally forgot it.

So I would make a new patch to remove this duplicated ef removal.

-----Original Message-----
From: Kuehling, Felix <Felix.Kuehling at amd.com>
Sent: Thursday, July 7, 2022 11:47 PM
To: Christian König <ckoenig.leichtzumerken at gmail.com>; Pan, Xinhui <Xinhui.Pan at amd.com>; amd-gfx at lists.freedesktop.org
Cc: Deucher, Alexander <Alexander.Deucher at amd.com>; Koenig, Christian <Christian.Koenig at amd.com>
Subject: Re: [PATCH] drm/amdgpu: Fix a NULL pointer of fence

Am 2022-07-07 um 05:54 schrieb Christian König:
> Am 07.07.22 um 11:50 schrieb xinhui pan:
>> Fence is accessed by dma_resv_add_fence() now.
>> Use amdgpu_amdkfd_remove_eviction_fence instead.
>>
>> Signed-off-by: xinhui pan <xinhui.pan at amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 4 ++--
>>   1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> index 0036c9e405af..1e25c400ce4f 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> @@ -1558,10 +1558,10 @@ void amdgpu_amdkfd_gpuvm_destroy_cb(struct
>> amdgpu_device *adev,
>>         if (!process_info)
>>           return;
>> -
>>       /* Release eviction fence from PD */
>>       amdgpu_bo_reserve(pd, false);
>> -    amdgpu_bo_fence(pd, NULL, false);
>> +    amdgpu_amdkfd_remove_eviction_fence(pd,
>> +                    process_info->eviction_fence);
>
> Good catch as well, but Felix needs to take a look at this.

This is weird. We used amdgpu_bo_fence(pd, NULL, false) here, which would have removed an exclusive fence. But as far as I can tell we added the fence as a shared fence in init_kfd_vm and amdgpu_amdkfd_gpuvm_restore_process_bos. So this probably never worked as intended.

You could try if this is really needed. Just remove the eviction fence removal. Then enable eviction debugging with

     echo Y > /sys/module/amdgpu/parameters/debug_evictions

Run some simple tests and check the kernel log to see if process termination is causing any unexpected evictions.

Regards,
   Felix


>
> Regards,
> Christian.
>
>>       amdgpu_bo_unreserve(pd);
>>         /* Update process info */
>


More information about the amd-gfx mailing list