[PATCH] drm/amdgpu: Fix a NULL pointer of fence
Mike Lothian
mike at fireburn.co.uk
Mon Jul 18 14:58:38 UTC 2022
Is this likely to land before 5.19 final? It's been nearly 2 weeks
since I said if fixed an issue I was seeing
https://gitlab.freedesktop.org/drm/amd/-/issues/2074
On Fri, 8 Jul 2022 at 10:05, Christian König
<ckoenig.leichtzumerken at gmail.com> wrote:
>
> Hi guys,
>
> well the practice to remove all fences by adding a NULL exclusive fence
> was pretty much illegal in the first place because this also removes
> kernel internal fences which can lead to freeing up memory which is
> still accessed.
>
> I've just didn't noticed that this was used by the KFD code as well
> otherwise I would have pushed to clean that up much earlier.
>
> Regards,
> Christian.
>
> Am 08.07.22 um 03:08 schrieb Pan, Xinhui:
> > [AMD Official Use Only - General]
> >
> > Felix,
> > Shared fences depend on exclusive fence, so add a new exclusive fence, say NULL would also remove all shared fences. That works before 5.18 . 😉
> > From 5.18, adding a new exclusive fence(the write usage fence) did not remove any shared fences(the read usage fence). So that is broken.
> >
> > And I also try the debug_evictions parameter. No unexpected eviction shows anyway.
> > I did a quick check and found amdgpu implement BO release notify and it will remove kfd ef on pt/pd BO. So we don’t need this duplicated ef removal. The interesting thing is that is done by patch f4a3c42b5c52("drm/amdgpu: Remove kfd eviction fence before release bo (v2)") which is from me 😉 I totally forgot it.
> >
> > So I would make a new patch to remove this duplicated ef removal.
> >
> > -----Original Message-----
> > From: Kuehling, Felix <Felix.Kuehling at amd.com>
> > Sent: Thursday, July 7, 2022 11:47 PM
> > To: Christian König <ckoenig.leichtzumerken at gmail.com>; Pan, Xinhui <Xinhui.Pan at amd.com>; amd-gfx at lists.freedesktop.org
> > Cc: Deucher, Alexander <Alexander.Deucher at amd.com>; Koenig, Christian <Christian.Koenig at amd.com>
> > Subject: Re: [PATCH] drm/amdgpu: Fix a NULL pointer of fence
> >
> > Am 2022-07-07 um 05:54 schrieb Christian König:
> >> Am 07.07.22 um 11:50 schrieb xinhui pan:
> >>> Fence is accessed by dma_resv_add_fence() now.
> >>> Use amdgpu_amdkfd_remove_eviction_fence instead.
> >>>
> >>> Signed-off-by: xinhui pan <xinhui.pan at amd.com>
> >>> ---
> >>> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 4 ++--
> >>> 1 file changed, 2 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> >>> index 0036c9e405af..1e25c400ce4f 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> >>> @@ -1558,10 +1558,10 @@ void amdgpu_amdkfd_gpuvm_destroy_cb(struct
> >>> amdgpu_device *adev,
> >>> if (!process_info)
> >>> return;
> >>> -
> >>> /* Release eviction fence from PD */
> >>> amdgpu_bo_reserve(pd, false);
> >>> - amdgpu_bo_fence(pd, NULL, false);
> >>> + amdgpu_amdkfd_remove_eviction_fence(pd,
> >>> + process_info->eviction_fence);
> >> Good catch as well, but Felix needs to take a look at this.
> > This is weird. We used amdgpu_bo_fence(pd, NULL, false) here, which would have removed an exclusive fence. But as far as I can tell we added the fence as a shared fence in init_kfd_vm and amdgpu_amdkfd_gpuvm_restore_process_bos. So this probably never worked as intended.
> >
> > You could try if this is really needed. Just remove the eviction fence removal. Then enable eviction debugging with
> >
> > echo Y > /sys/module/amdgpu/parameters/debug_evictions
> >
> > Run some simple tests and check the kernel log to see if process termination is causing any unexpected evictions.
> >
> > Regards,
> > Felix
> >
> >
> >> Regards,
> >> Christian.
> >>
> >>> amdgpu_bo_unreserve(pd);
> >>> /* Update process info */
>
More information about the amd-gfx
mailing list