[PATCH 2/2] dma-buf: try to replace a signaled fence in reservation_object_add_shared_inplace

Wed Nov 15 17:43:12 UTC 2017

Quoting Christian König (2017-11-15 17:34:07)
> Am 15.11.2017 um 17:55 schrieb Chris Wilson:
> > Quoting Chris Wilson (2017-11-14 14:34:05)
> >> Quoting Christian König (2017-11-14 14:24:44)
> >>> Am 06.11.2017 um 17:22 schrieb Chris Wilson:
> >>>> Quoting Christian König (2017-10-30 14:59:04)
> >>>>> @@ -126,17 +127,28 @@ reservation_object_add_shared_inplace(struct reservation_object *obj,
> >>>>>                           dma_fence_put(old_fence);
> >>>>>                           return;
> >>>>>                   }
> >>>>> +
> >>>>> +               if (!signaled && dma_fence_is_signaled(old_fence)) {
> >>>>> +                       signaled = old_fence;
> >>>>> +                       signaled_idx = i;
> >>>>> +               }
> >>>> How much do we care about only keeping one fence per-ctx here? You could
> >>>> rearrange this to break on old_fence->context == fence->context ||
> >>>> dma_fence_is_signaled(old_fence) then everyone can use the final block.
> >>> Yeah, that is what David Zhou suggested as well.
> >>>
> >>> I've rejected this approach for now cause I think we still have cases
> >>> where we rely on one fence per ctx (but I'm not 100% sure).
> >>>
> >>> I changed patch #1 in this series as you suggest and going to send that
> >>> out once more in a minute.
> >>>
> >>> Can we get this upstream as is for now? I won't have much more time
> >>> working on this.
> >> Sure, we are only discussing how we might make it look tidier, pure
> >> micro-optimisation with the caveat of losing the one-fence-per-ctx
> >> guarantee.
> > Ah, one thing to note is that extra checking pushed one of our corner
> > case tests over its time limit.
> >
> > If we can completely forgo the one-fence-per-ctx here, what works really
> > well in testing is
> >
> > diff --git a/drivers/dma-buf/reservation.c b/drivers/dma-buf/reservation.c
> > index 5319ac478918..5755e95fab1b 100644
> > --- a/drivers/dma-buf/reservation.c
> > +++ b/drivers/dma-buf/reservation.c
> > @@ -104,39 +104,19 @@ reservation_object_add_shared_inplace(struct reservation_object *obj,
> >                                        struct reservation_object_list *fobj,
> >                                        struct dma_fence *fence)
> >   {
> > -       struct dma_fence *replace = NULL;
> > -       u32 ctx = fence->context;
> > -       u32 i;
> > -
> >          dma_fence_get(fence);
> >   
> >          preempt_disable();
> >          write_seqcount_begin(&obj->seq);
> >   
> > -       for (i = 0; i < fobj->shared_count; ++i) {
> > -               struct dma_fence *check;
> > -
> > -               check = rcu_dereference_protected(fobj->shared[i],
> > -                                                 reservation_object_held(obj));
> > -
> > -               if (check->context == ctx || dma_fence_is_signaled(check)) {
> > -                       replace = old_fence;
> > -                       break;
> > -               }
> > -       }
> > -
> >          /*
> >           * memory barrier is added by write_seqcount_begin,
> >           * fobj->shared_count is protected by this lock too
> >           */
> > -       RCU_INIT_POINTER(fobj->shared[i], fence);
> > -       if (!replace)
> > -               fobj->shared_count++;
> > +       RCU_INIT_POINTER(fobj->shared[fobj->shared_count++], fence);
> >   
> >          write_seqcount_end(&obj->seq);
> >          preempt_enable();
> > -
> > -       dma_fence_put(replace);
> >   }
> >   
> >   static void
> >
> >   i.e. don't check when not replacing the shared[], on creating the new
> >   buffer we then discard all the old fences.
> >
> > It should work for amdgpu as well since you do a ht to coalesce
> > redundant fences before queuing.
> 
> That won't work for all cases. This way the reservation object would 
> keep growing without a chance to ever shrink.

We only keep the active fences when it grows, which is effective enough
to keep it in check on the workloads I can find in the hour since
noticing the failure in CI ;) And on the workloads where it is being
flooded with live fences from many contexts, the order of magnitude
throughput improvement is not easy to ignore.
-Chris