[PATCH] drm/amdgpu: Fix a NULL pointer of fence

Christian König ckoenig.leichtzumerken at gmail.com
Fri Jul 8 09:03:59 UTC 2022


Hi guys,

well the practice to remove all fences by adding a NULL exclusive fence 
was pretty much illegal in the first place because this also removes 
kernel internal fences which can lead to freeing up memory which is 
still accessed.

I've just didn't noticed that this was used by the KFD code as well 
otherwise I would have pushed to clean that up much earlier.

Regards,
Christian.

Am 08.07.22 um 03:08 schrieb Pan, Xinhui:
> [AMD Official Use Only - General]
>
> Felix,
> Shared fences depend on exclusive fence, so add a new exclusive fence, say NULL would also remove all shared fences. That works before 5.18 . 😉
>  From 5.18, adding a new exclusive fence(the write usage fence) did not remove any shared fences(the read usage fence). So that is broken.
>
> And I also try the debug_evictions parameter. No unexpected eviction shows anyway.
> I did a quick check and found amdgpu implement BO release notify and it will remove kfd ef on pt/pd BO. So we don’t need this duplicated ef removal. The interesting thing is that is done by patch f4a3c42b5c52("drm/amdgpu: Remove kfd eviction fence before release bo (v2)") which is from me 😉 I totally forgot it.
>
> So I would make a new patch to remove this duplicated ef removal.
>
> -----Original Message-----
> From: Kuehling, Felix <Felix.Kuehling at amd.com>
> Sent: Thursday, July 7, 2022 11:47 PM
> To: Christian König <ckoenig.leichtzumerken at gmail.com>; Pan, Xinhui <Xinhui.Pan at amd.com>; amd-gfx at lists.freedesktop.org
> Cc: Deucher, Alexander <Alexander.Deucher at amd.com>; Koenig, Christian <Christian.Koenig at amd.com>
> Subject: Re: [PATCH] drm/amdgpu: Fix a NULL pointer of fence
>
> Am 2022-07-07 um 05:54 schrieb Christian König:
>> Am 07.07.22 um 11:50 schrieb xinhui pan:
>>> Fence is accessed by dma_resv_add_fence() now.
>>> Use amdgpu_amdkfd_remove_eviction_fence instead.
>>>
>>> Signed-off-by: xinhui pan <xinhui.pan at amd.com>
>>> ---
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 4 ++--
>>>    1 file changed, 2 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>> index 0036c9e405af..1e25c400ce4f 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>> @@ -1558,10 +1558,10 @@ void amdgpu_amdkfd_gpuvm_destroy_cb(struct
>>> amdgpu_device *adev,
>>>          if (!process_info)
>>>            return;
>>> -
>>>        /* Release eviction fence from PD */
>>>        amdgpu_bo_reserve(pd, false);
>>> -    amdgpu_bo_fence(pd, NULL, false);
>>> +    amdgpu_amdkfd_remove_eviction_fence(pd,
>>> +                    process_info->eviction_fence);
>> Good catch as well, but Felix needs to take a look at this.
> This is weird. We used amdgpu_bo_fence(pd, NULL, false) here, which would have removed an exclusive fence. But as far as I can tell we added the fence as a shared fence in init_kfd_vm and amdgpu_amdkfd_gpuvm_restore_process_bos. So this probably never worked as intended.
>
> You could try if this is really needed. Just remove the eviction fence removal. Then enable eviction debugging with
>
>       echo Y > /sys/module/amdgpu/parameters/debug_evictions
>
> Run some simple tests and check the kernel log to see if process termination is causing any unexpected evictions.
>
> Regards,
>     Felix
>
>
>> Regards,
>> Christian.
>>
>>>        amdgpu_bo_unreserve(pd);
>>>          /* Update process info */



More information about the amd-gfx mailing list