[PATCH 2/2] dma-buf: cleanup shared fence removal

Daniel Vetter daniel at ffwll.ch
Thu Jun 27 19:57:52 UTC 2019


On Thu, Jun 27, 2019 at 7:20 PM Koenig, Christian
<Christian.Koenig at amd.com> wrote:
>
> Am 27.06.19 um 19:10 schrieb Daniel Vetter:
> > On Thu, Jun 27, 2019 at 03:48:06PM +0000, Koenig, Christian wrote:
> >> Am 27.06.19 um 17:34 schrieb Daniel Vetter:
> >>> On Thu, Jun 27, 2019 at 3:19 PM Christian König
> >>> <ckoenig.leichtzumerken at gmail.com> wrote:
> >>>> Am 27.06.19 um 12:43 schrieb Daniel Vetter:
> >>>>> On Thu, Jun 27, 2019 at 12:18 PM Christian König
> >>>>> <ckoenig.leichtzumerken at gmail.com> wrote:
> >>>>>> While freeing up memory it is easier to remove a fence from a reservation
> >>>>>> object instead of signaling it and installing a new one in all other objects.
> >>>>>>
> >>>>>> Clean this up by adding the removal function to the core reservation object
> >>>>>> code instead of messing with the internal in amdgpu.
> >>>>>>
> >>>>>> No functional change.
> >>>>>>
> >>>>>> Signed-off-by: Christian König <christian.koenig at amd.com>
> >>>>>> ---
> >>>>>>     drivers/dma-buf/reservation.c                 | 62 +++++++++++++++++++
> >>>>>>     .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  | 45 +-------------
> >>>>>>     include/linux/reservation.h                   |  3 +-
> >>>>>>     3 files changed, 65 insertions(+), 45 deletions(-)
> >>>>>>
> >>>>>> diff --git a/drivers/dma-buf/reservation.c b/drivers/dma-buf/reservation.c
> >>>>>> index ef710effedfa..e43a316a005d 100644
> >>>>>> --- a/drivers/dma-buf/reservation.c
> >>>>>> +++ b/drivers/dma-buf/reservation.c
> >>>>>> @@ -231,6 +231,68 @@ void reservation_object_add_shared_fence(struct reservation_object *obj,
> >>>>>>     }
> >>>>>>     EXPORT_SYMBOL(reservation_object_add_shared_fence);
> >>>>>>
> >>>>>> +/**
> >>>>>> + * reservation_object_remove_shared_fence - remove shared fences
> >>>>>> + * @obj: the reservation object
> >>>>>> + * @context: the context of the fences to remove
> >>>>>> + *
> >>>>>> + * Remove shared fences based on their fence context.
> >>>>>> + */
> >>>>> This needs a serious explanation of "why?". Removing fences without
> >>>>> guaranteeing they're all signaled is a good way to create havoc. Your
> >>>>> commit message has a few words about what you're doing here, but it
> >>>>> still doesn't explain why this is safe and when exactly it should be
> >>>>> used.
> >>>> Yeah, I'm not very keen about this either.
> >>>>
> >>>> The key point is the memory is not accessible by the hardware any more
> >>>> because it is freed and removed from the page tables.
> >>>>
> >>>> So further access is prevented and in this special case it is actually
> >>>> valid to do this even if the operation represented by the fence is still
> >>>> ongoing.
> >>> Hm, why do you have to remove these fences then? Can't you just let
> >>> them signal and get collected as usual? As soon as you share buffers
> >>> these fences can get anywhere, so you need to correctly unwind them no
> >>> matter what.
> >>>
> >>> You're kinda still just describing what you're doing, not why.
> >> It is simply more efficient to remove the fence from one reservation
> >> object than to add a new fence to all other reservation objects in the
> >> same process.
> > Ok, you still talk in riddles and don't explain what the goal of this
> > entire dance is, so I went and read the code. Assuming I didn't misread
> > too much:
> >
> > 1. you create a fake fence on a per-process timeline.
> > 2. you attach this liberally to all the bo you're creating on that
> > process
> > 3. the fence never signals on its own, but it has a very magic
> > ->enable_signaling callback which is the only thing that makes this fence
> > switch to signalled in a finite time. Before that it's stuck forever. So
> > quite a bit a schrödinger fence: It's not a real fence (because it fails
> > to signal in bounded time) except when you start to look at it.
> > 4. Looking at the fence triggers eviction, at that point we replace this
> > magic eviction fence with the next set, reacquire buffers and then unblock
> > the kfd process once everything is in shape again.
> >
> > This is soooooooooooooooooo magic that I really don't think we should
> > encourage people without clue to maybe use this and totally break all
> > fences guarantees.
>
> Yeah, that is correct. But this is completely unrelated to why we want
> to remove the fence.
>
> > If you do want to make sure an optimized version within
> > reservation_object.c, then it should be code which replaces fences iff:
> > - they're the same context
> > - later in the ordering within that context
> > - of the same type (i.e. safe vs shared)
> >
> > That would actually be safe thing to do.
>
> No, that won't work because there is no replacement for the fence in
> question.
>
> See we want to remove the fence because the memory is freed up.
>
> > Also, the above is what I expected when asking "why do you need this", not
> > "we replace fences, its more efficient" I kinda got that from the code :-)
>
> Well I explained the short version why we do this. What you dug up here
> is correct as well, but completely unrelated to removing the fence.
>
> Again, the reason to remove the fence from one reservation object is
> simply that it is faster to remove it from one object than to attach a
> new fence to all other objects.

Hm I guess I was lead astray by the eviction_fence invalidation thing
in enable_signaling, and a few other places that freed the bo right
afters (like amdgpu_amdkfd_gpuvm_free_memory_of_gpu), where removing
the fences first and then freeing the bo is kinda pointless.

Now with your insistence that I'm getting something wrong I guess the
you're talking about the unbind case, where the bo survives, but it's
mapping disappears, and hence that specific eviction_fence needs to be
removed.

And yeah there I guess just removing the magic eviction fence is
cheaper than replacing all the others.

Now I guess I understand the mechanics of this somewhat, and what
you're doing, and lit ooks even somewhat safe. But I have no idea what
this is supposed to achieve. It feels a bit like ->notify_move, but
implemented in the most horrible way possible. Or maybe something
else.

Really no idea.

And given that we've wasted I few pages full of paragraphs already on
trying to explain what your new little helper is for, when it's safe
to use, when it's maybe not a good idea, and we still haven't even
bottomed out on what this is for ... well I really don't think it's a
good idea to inflict this into core code. Because just blindly
removing normal fences is not safe.

Especially with like half a sentence of kerneldoc that explains
nothing of all this complexity.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


More information about the amd-gfx mailing list