[RFC 0/4] Some (drm_sched_|dma_)fence lifetime issues
Matthew Brost
matthew.brost at intel.com
Thu Apr 24 06:11:48 UTC 2025
On Wed, Apr 23, 2025 at 03:12:27PM +0200, Christian König wrote:
> On 4/18/25 18:42, Tvrtko Ursulin wrote:
> > Hi all,
> >
> > Recently I mentioned to Danilo about some fence lifetime issues so here is a
> > rough series, more than anything intended to start the discussion.
> >
> > Most of the problem statement can be found in the first patch but to briefly
> > summarise - because sched fence can outlive the scheduler, we can trivially
> > engineer an use after free with xe and possibly other drivers. All that is
> > needed is to convert a syncobj into a sync file behind drivers back, and I don't
> > see what the driver can do about it.
>
>
> Yeah that topic again :) The problem here is that this is not a bug, it is a feature!
>
Really more unforeseen design flaw IMO - it happens.
> IIRC it was Alex who pointed that issue out on the very first fence patch set, and we already discussed what to do back then.
>
> The problem with grabbing module references for fences is that you get trivially into circle references and so basically always preventing the module from unloading.
>
Yea, the Xe patch holding module ref is a no-go.
> The decision was made to postpone this and live with the potential use after free on module unload until somebody has time to fix it. Well that was +10 years ago :)
>
> I discussed this with Sima again last year and we came to the conclusion that the easiest way forward would be to decouple the dma_fence implementation from the driver or component issuing the fence.
>
> I then came up with the following steps to allow this:
> 1. Decouple the lock used for protecting the dma_fence callback list from the caller.
> 2. Stop calling enable_signaling with the lock held.
> 3. Nuke all those kmem_cache implementations and force drivers to always allocate fences using kvmalloc().
Let's document this too.
Xe is an offender, I'll post a fix tomorrow.
> 4. Nuke the release callback (or maybe move it directly after signaling) and set fence->ops to NULL after signaling the fence.
>
> I already send patches out for #1 and #2, but don't have enough time to actually finish the work.
>
Link? This has been lingering for a while perhaps the community can pick this up.
Matt
> If you want take a look at nuking all those kmem_cache implementations for allocating the fence memory. I think that can be completed completely separate to everything else.
>
> Regards,
> Christian.
>
>
> >
> > IGT that exploits the problem:
> > https://patchwork.freedesktop.org/patch/642709/?series=146211&rev=2
> >
> > Different flavour of the problem space is if we had a close(drm_fd) in that test
> > before the sleep. In that case we can even unload xe.ko and gpu-sched.ko for
> > even more fun. Last two patches in the series close that gap.
> >
> > But first two patches are just shrinking the race window. They are not proper
> > fixes. This is what I want to discuss since I understand reference counting all
> > the involved objects has been rejected in the past. And since the problem
> > probably expands to all dma fences it certainly isn't easy.
> >
> > To be clear once more - lets not focus on how this does not fix it fully - I am
> > primarily trying to start the conversation.
> >
> > Cc: Christian König <christian.koenig at amd.com>
> > Cc: Danilo Krummrich <dakr at kernel.org>
> > Cc: Lucas De Marchi <lucas.demarchi at intel.com>
> > Cc: Matthew Brost <matthew.brost at intel.com>
> > Cc: Philipp Stanner <phasta at kernel.org>
> > Cc: Rodrigo Vivi <rodrigo.vivi at intel.com>
> >
> > Tvrtko Ursulin (4):
> > sync_file: Weakly paper over one use-after-free resulting race
> > dma-fence: Slightly safer dma_fence_set_deadline
> > drm/sched: Keep module reference while there are active fences
> > drm/xe: Keep module reference while there are active fences
> >
> > drivers/dma-buf/dma-fence.c | 2 +-
> > drivers/dma-buf/sync_file.c | 29 ++++++++++++++++++++-----
> > drivers/gpu/drm/scheduler/sched_fence.c | 12 ++++++++--
> > drivers/gpu/drm/xe/xe_hw_fence.c | 13 ++++++++++-
> > 4 files changed, 47 insertions(+), 9 deletions(-)
> >
>
More information about the dri-devel
mailing list