[PATCH 1/3] drm/nouveau: Prevent signaled fences in pending list
Philipp Stanner
phasta at mailbox.org
Fri Apr 11 14:10:29 UTC 2025
On Fri, 2025-04-11 at 15:06 +0200, Christian König wrote:
> Am 11.04.25 um 14:44 schrieb Philipp Stanner:
> > On Fri, 2025-04-11 at 13:05 +0200, Christian König wrote:
> > > Am 11.04.25 um 11:29 schrieb Philipp Stanner:
> > >
> > > > [SNIP]
> > > >
> > > > It could be, however, that at the same moment
> > > > nouveau_fence_signal() is
> > > > removing that entry, holding the appropriate lock.
> > > >
> > > > So we have a race. Again.
> > > >
> > >
> > > Ah, yes of course. If signaled is called with or without the
> > > lock is
> > > actually undetermined.
> > >
> > >
> > > >
> > > > You see, fixing things in Nouveau is difficult :)
> > > > It gets more difficult if you want to clean it up "properly",
> > > > so it
> > > > conforms to rules such as those from dma_fence.
> > > >
> > > > I have now provided two fixes that both work, but you are not
> > > > satisfied
> > > > with from the dma_fence-maintainer's perspective. I understand
> > > > that,
> > > > but please also understand that it's actually not my primary
> > > > task
> > > > to
> > > > work on Nouveau. I just have to fix this bug to move on with my
> > > > scheduler work.
> > > >
> > >
> > > Well I'm happy with whatever solution as long as it works, but
> > > as
> > > far as I can see the approach with the callback simply doesn't.
> > >
> > > You just can't drop the fence reference for the list from the
> > > callback.
> > >
> > >
> > > >
> > > > So if you have another idea, feel free to share it. But I'd
> > > > like to
> > > > know how we can go on here.
> > > >
> > >
> > > Well the fence code actually works, doesn't it? The problem is
> > > rather that setting the error throws a warning because it doesn't
> > > expect signaled fences on the pending list.
> > >
> > > Maybe we should fix that instead.
> > The fence code works as the author intended, but I would be happy
> > if it
> > were more explicitly documented.
> >
> > Regarding the WARN_ON: It occurs in dma_fence_set_error() because
> > there
> > is an attempt to set an error code on a signaled fence. I don't
> > think
> > that should be "fixed", it works as intended: You must not set an
> > error
> > code of a fence that was already signaled.
> >
> > The reason seems to be that once a fence is signaled, a third party
> > might evaluate the error code.
>
> Yeah, more or less correct. The idea is you can't declare an
> operation as having an error after the operation has already
> completed.
>
> Because everyone will just wait for the completion and nobody checks
> the status again after that.
>
> >
> > But I think this wasn't wat you meant with "fix".
>
> The idea was to avoid calling dma_fence_set_error() on already
> signaled fences. Something like this:
>
> @@ -90,7 +90,7 @@ nouveau_fence_context_kill(struct
> nouveau_fence_chan *fctx, int error)
> while (!list_empty(&fctx->pending)) {
> fence = list_entry(fctx->pending.next,
> typeof(*fence), head);
>
> - if (error)
> + if (error & !dma_fence_is_signaled_locked(&fence-
> >base))
> dma_fence_set_error(&fence->base, error);
>
> if (nouveau_fence_signal(fence))
>
> That would also improve the handling quite a bit since we now don't
> set errors on fences which are already completed even if we haven't
> realized that they are already completed yet.
>
> > In any case, there must not be signaled fences in nouveau's
> > pending-
> > list. They must be removed immediately once they signal, and this
> > must
> > not race.
>
> Why actually? As far as I can see the pending list is not for the
> unsignaled fences, but rather the pending interrupt processing.
That's a list of fences that are "in the air", i.e., whose jobs are
currently being processed by the hardware. Once a job is done, its
fence must be removed.
>
> So having signaled fences on the pending list is perfectly possible.
It is possible, and that is a bug. The list is used by
nouveau_fence_context_kill() to kill still pending jobs. It shall not
try to kill and set error codes for fences that are already signaled.
Anyways, forget about the "remove callbacks solution" it actually
causes a MASSIVE performance regression. No idea why, AFAICS the fast
path is only ever evaluated in nouveau_fence_done(), but maybe I missed
something.
Will re-iterate next week…
P.
>
> Regards,
> Christian.
>
> >
> > >
> > >
> > > >
> > > > I'm running out of ideas. What I'm wondering if we couldn't
> > > > just
> > > > remove
> > > > performance hacky fastpath functions such as
> > > > nouveau_fence_is_signaled() completely. It seems redundant to
> > > > me.
> > > >
> > >
> > > That would work for me as well.
> > I'll test this approach. Seems a bit like the nuclear approach, but
> > if
> > it works we'd at least clean up a lot of this mess.
> >
> >
> > P.
> >
> >
> > >
> > >
> > > >
> > > >
> > > > Or we might add locking to it, but IDK what was achieved with
> > > > RCU
> > > > here.
> > > > In any case it's definitely bad that Nouveau has so many
> > > > redundant
> > > > and
> > > > half-redundant mechanisms.
> > > >
> > >
> > > Yeah, agree messing with the locks even more won't help us here.
> > >
> > > Regards,
> > > Christian.
> > >
> > >
> > > >
> > > >
> > > >
> > > > P.
> > > >
> > > >
> > > > >
> > > > >
> > > > > P.
> > > > >
> > > > >
> > > > > >
> > > > > > Regards,
> > > > > > Christian.
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > P.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > Christian.
> > > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > Replace the call to dma_fence_is_signaled() with
> > > > > > > > > nouveau_fence_base_is_signaled().
> > > > > > > > >
> > > > > > > > > Cc: <stable at vger.kernel.org> # 4.10+, precise commit
> > > > > > > > > not
> > > > > > > > > to
> > > > > > > > > be
> > > > > > > > > determined
> > > > > > > > > Signed-off-by: Philipp Stanner <phasta at kernel.org>
> > > > > > > > > ---
> > > > > > > > > drivers/gpu/drm/nouveau/nouveau_fence.c | 2 +-
> > > > > > > > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > > > > >
> > > > > > > > > diff --git a/drivers/gpu/drm/nouveau/nouveau_fence.c
> > > > > > > > > b/drivers/gpu/drm/nouveau/nouveau_fence.c
> > > > > > > > > index 7cc84472cece..33535987d8ed 100644
> > > > > > > > > --- a/drivers/gpu/drm/nouveau/nouveau_fence.c
> > > > > > > > > +++ b/drivers/gpu/drm/nouveau/nouveau_fence.c
> > > > > > > > > @@ -274,7 +274,7 @@ nouveau_fence_done(struct
> > > > > > > > > nouveau_fence
> > > > > > > > > *fence)
> > > > > > > > > nvif_event_block(&fctx-
> > > > > > > > > >event);
> > > > > > > > > spin_unlock_irqrestore(&fctx->lock,
> > > > > > > > > flags);
> > > > > > > > > }
> > > > > > > > > - return dma_fence_is_signaled(&fence->base);
> > > > > > > > > + return test_bit(DMA_FENCE_FLAG_SIGNALED_BIT,
> > > > > > > > > &fence-
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > base.flags);
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > }
> > > > > > > > >
> > > > > > > > > static long
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > >
>
More information about the dri-devel
mailing list