[PATCH 1/3] drm/nouveau: Prevent signaled fences in pending list
Philipp Stanner
phasta at mailbox.org
Mon Apr 14 08:54:25 UTC 2025
On Fri, 2025-04-11 at 16:10 +0200, Philipp Stanner wrote:
> On Fri, 2025-04-11 at 15:06 +0200, Christian König wrote:
> > Am 11.04.25 um 14:44 schrieb Philipp Stanner:
> > > On Fri, 2025-04-11 at 13:05 +0200, Christian König wrote:
> > > > Am 11.04.25 um 11:29 schrieb Philipp Stanner:
> > > >
> > > > > [SNIP]
> > > > >
> > > > > It could be, however, that at the same moment
> > > > > nouveau_fence_signal() is
> > > > > removing that entry, holding the appropriate lock.
> > > > >
> > > > > So we have a race. Again.
> > > > >
> > > >
> > > > Ah, yes of course. If signaled is called with or without the
> > > > lock is
> > > > actually undetermined.
> > > >
> > > >
> > > > >
> > > > > You see, fixing things in Nouveau is difficult :)
> > > > > It gets more difficult if you want to clean it up "properly",
> > > > > so it
> > > > > conforms to rules such as those from dma_fence.
> > > > >
> > > > > I have now provided two fixes that both work, but you are not
> > > > > satisfied
> > > > > with from the dma_fence-maintainer's perspective. I
> > > > > understand
> > > > > that,
> > > > > but please also understand that it's actually not my primary
> > > > > task
> > > > > to
> > > > > work on Nouveau. I just have to fix this bug to move on with
> > > > > my
> > > > > scheduler work.
> > > > >
> > > >
> > > > Well I'm happy with whatever solution as long as it works, but
> > > > as
> > > > far as I can see the approach with the callback simply doesn't.
> > > >
> > > > You just can't drop the fence reference for the list from the
> > > > callback.
> > > >
> > > >
> > > > >
> > > > > So if you have another idea, feel free to share it. But I'd
> > > > > like to
> > > > > know how we can go on here.
> > > > >
> > > >
> > > > Well the fence code actually works, doesn't it? The problem is
> > > > rather that setting the error throws a warning because it
> > > > doesn't
> > > > expect signaled fences on the pending list.
> > > >
> > > > Maybe we should fix that instead.
> > > The fence code works as the author intended, but I would be happy
> > > if it
> > > were more explicitly documented.
> > >
> > > Regarding the WARN_ON: It occurs in dma_fence_set_error() because
> > > there
> > > is an attempt to set an error code on a signaled fence. I don't
> > > think
> > > that should be "fixed", it works as intended: You must not set an
> > > error
> > > code of a fence that was already signaled.
> > >
> > > The reason seems to be that once a fence is signaled, a third
> > > party
> > > might evaluate the error code.
> >
> > Yeah, more or less correct. The idea is you can't declare an
> > operation as having an error after the operation has already
> > completed.
> >
> > Because everyone will just wait for the completion and nobody
> > checks
> > the status again after that.
> >
> > >
> > > But I think this wasn't wat you meant with "fix".
> >
> > The idea was to avoid calling dma_fence_set_error() on already
> > signaled fences. Something like this:
> >
> > @@ -90,7 +90,7 @@ nouveau_fence_context_kill(struct
> > nouveau_fence_chan *fctx, int error)
> > while (!list_empty(&fctx->pending)) {
> > fence = list_entry(fctx->pending.next,
> > typeof(*fence), head);
> >
> > - if (error)
> > + if (error & !dma_fence_is_signaled_locked(&fence-
> > > base))
> > dma_fence_set_error(&fence->base, error);
> >
> > if (nouveau_fence_signal(fence))
> >
> > That would also improve the handling quite a bit since we now don't
> > set errors on fences which are already completed even if we haven't
> > realized that they are already completed yet.
> >
> > > In any case, there must not be signaled fences in nouveau's
> > > pending-
> > > list. They must be removed immediately once they signal, and this
> > > must
> > > not race.
> >
> > Why actually? As far as I can see the pending list is not for the
> > unsignaled fences, but rather the pending interrupt processing.
>
> That's a list of fences that are "in the air", i.e., whose jobs are
> currently being processed by the hardware. Once a job is done, its
> fence must be removed.
>
> >
> > So having signaled fences on the pending list is perfectly
> > possible.
>
> It is possible, and that is a bug. The list is used by
> nouveau_fence_context_kill() to kill still pending jobs. It shall not
> try to kill and set error codes for fences that are already signaled.
@Danilo:
We have now 2 possible solutions for the firing WARN_ON floating.
Version A (Christian)
Check in nouveau_fence_context_kill() whether a fence is already
signaled before setting an error.
Version B (Me)
This patch series here. Make sure that in Nouveau, only
nouveau_fence_signal() signals fences.
Both should do the trick. Please share a maintainer-preference so I can
move on here.
Thx
P.
>
>
>
> Anyways, forget about the "remove callbacks solution" it actually
> causes a MASSIVE performance regression. No idea why, AFAICS the fast
> path is only ever evaluated in nouveau_fence_done(), but maybe I
> missed
> something.
>
> Will re-iterate next week…
>
>
> P.
>
>
> >
> > Regards,
> > Christian.
> >
> > >
> > > >
> > > >
> > > > >
> > > > > I'm running out of ideas. What I'm wondering if we couldn't
> > > > > just
> > > > > remove
> > > > > performance hacky fastpath functions such as
> > > > > nouveau_fence_is_signaled() completely. It seems redundant to
> > > > > me.
> > > > >
> > > >
> > > > That would work for me as well.
> > > I'll test this approach. Seems a bit like the nuclear approach,
> > > but
> > > if
> > > it works we'd at least clean up a lot of this mess.
> > >
> > >
> > > P.
> > >
> > >
> > > >
> > > >
> > > > >
> > > > >
> > > > > Or we might add locking to it, but IDK what was achieved with
> > > > > RCU
> > > > > here.
> > > > > In any case it's definitely bad that Nouveau has so many
> > > > > redundant
> > > > > and
> > > > > half-redundant mechanisms.
> > > > >
> > > >
> > > > Yeah, agree messing with the locks even more won't help us
> > > > here.
> > > >
> > > > Regards,
> > > > Christian.
> > > >
> > > >
> > > > >
> > > > >
> > > > >
> > > > > P.
> > > > >
> > > > >
> > > > > >
> > > > > >
> > > > > > P.
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > Regards,
> > > > > > > Christian.
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > P.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > > Christian.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Replace the call to dma_fence_is_signaled() with
> > > > > > > > > > nouveau_fence_base_is_signaled().
> > > > > > > > > >
> > > > > > > > > > Cc: <stable at vger.kernel.org> # 4.10+, precise
> > > > > > > > > > commit
> > > > > > > > > > not
> > > > > > > > > > to
> > > > > > > > > > be
> > > > > > > > > > determined
> > > > > > > > > > Signed-off-by: Philipp Stanner <phasta at kernel.org>
> > > > > > > > > > ---
> > > > > > > > > > drivers/gpu/drm/nouveau/nouveau_fence.c | 2 +-
> > > > > > > > > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > > > > > >
> > > > > > > > > > diff --git
> > > > > > > > > > a/drivers/gpu/drm/nouveau/nouveau_fence.c
> > > > > > > > > > b/drivers/gpu/drm/nouveau/nouveau_fence.c
> > > > > > > > > > index 7cc84472cece..33535987d8ed 100644
> > > > > > > > > > --- a/drivers/gpu/drm/nouveau/nouveau_fence.c
> > > > > > > > > > +++ b/drivers/gpu/drm/nouveau/nouveau_fence.c
> > > > > > > > > > @@ -274,7 +274,7 @@ nouveau_fence_done(struct
> > > > > > > > > > nouveau_fence
> > > > > > > > > > *fence)
> > > > > > > > > > nvif_event_block(&fctx-
> > > > > > > > > > > event);
> > > > > > > > > > spin_unlock_irqrestore(&fctx-
> > > > > > > > > > >lock,
> > > > > > > > > > flags);
> > > > > > > > > > }
> > > > > > > > > > - return dma_fence_is_signaled(&fence-
> > > > > > > > > > >base);
> > > > > > > > > > + return
> > > > > > > > > > test_bit(DMA_FENCE_FLAG_SIGNALED_BIT,
> > > > > > > > > > &fence-
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > base.flags);
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > }
> > > > > > > > > >
> > > > > > > > > > static long
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> >
>
More information about the dri-devel
mailing list