[RFC PATCH] drm/sched: Fix a UAF on drm_sched_fence::sched

Boris Brezillon boris.brezillon at collabora.com
Fri Aug 30 09:37:21 UTC 2024


Hi Christian,

On Fri, 30 Aug 2024 10:14:18 +0200
Christian König <christian.koenig at amd.com> wrote:

> Am 29.08.24 um 19:12 schrieb Boris Brezillon:
> > dma_fence objects created by an entity might outlive the
> > drm_gpu_scheduler this entity was bound to if those fences are retained
> > by other other objects, like a dma_buf resv. This means that
> > drm_sched_fence::sched might be invalid when the resv is walked, which
> > in turn leads to a UAF when dma_fence_ops::get_timeline_name() is called.
> >
> > This probably went unnoticed so far, because the drm_gpu_scheduler had
> > the lifetime of the drm_device, so, unless you were removing the device,
> > there were no reasons for the scheduler to be gone before its fences.  
> 
> Nope, that is intentional design. get_timeline_name() is not safe to be 
> called after the fence signaled because that would causes circular 
> dependency problems.

Do you mean the dma_fence layer should not call get_timeline_name()
after it's been signalled (looking at the code/doc, it doesn't seem to
be the case), or do you mean the drm_sched implementation of the fence
interface is wrong and should assume the fence can live longer than its
creator?

> 
> E.g. when you have hardware fences it can happen that fences reference a 
> driver module (for the function printing the name) and the module in 
> turn keeps fences around.
> 
> So you easily end up with a module you can never unload.

On the other hand, I think preventing the module from being unloaded is
the right thing to do, because otherwise the dma_fence_ops might be
gone when they get dereferenced in the release path. That's also a
problem I noticed when I started working on the initial panthor driver
without drm_sched. To solve that I ended up retaining a module ref for
each fence created, and releasing this ref in the
dma_fence_ops::release() function.

drm_sched adds an indirection that allows drivers to not care, but
that's still a problem if you end up unloading drm_sched while some of
its drm_sched_fence fences are owned by external components.

> 
> 
> > With the introduction of a new model where each entity has its own
> > drm_gpu_scheduler instance, this situation is likely to happen every time
> > a GPU context is destroyed and some of its fences remain attached to
> > dma_buf objects still owned by other drivers/processes.
> >
> > In order to make drm_sched_fence_get_timeline_name() safe, we need to
> > copy the scheduler name into our own refcounted object that's only
> > destroyed when both the scheduler and all its fences are gone.
> >
> > The fact drm_sched_fence might have a reference to the drm_gpu_scheduler
> > even after it's been released is worrisome though, but I'd rather
> > discuss that with everyone than come up with a solution that's likely
> > to end up being rejected.
> >
> > Note that the bug was found while repeatedly reading dma_buf's debugfs
> > file, which, at some point, calls dma_resv_describe() on a resv that
> > contains signalled fences coming from a destroyed GPU context.
> > AFAIK, there's nothing invalid there.  
> 
> Yeah but reading debugfs is not guaranteed to crash the kernel.
> 
> On the other hand the approach with a kref'ed string looks rather sane 
> to me. One comment on this below.

There's still the problem I mentioned above (unloading drm_sched can
make things crash). Are there any plans to fix that? The simple option
would be to prevent compiling drm_sched as a module, but that's not an
option because it depends on DRM which is a tristate too. Maybe we
could have drm_sched_fence.o linked statically, just like dma-fence.c
is linked statically to prevent the stub ops from disappearing.
Not sure if drm_sched_fence.c depends on symbols defined in
sched_{main,entity}.c or other parts of the DRM subsystem though.

> > +/**
> > + * struct drm_sched_fence_timeline - Wrapped around the timeline name
> > + *
> > + * This is needed to cope with the fact dma_fence objects created by
> > + * an entity might outlive the drm_gpu_scheduler this entity was bound
> > + * to, making drm_sched_fence::sched invalid and leading to a UAF when
> > + * dma_fence_ops::get_timeline_name() is called.
> > + */
> > +struct drm_sched_fence_timeline {
> > +	/** @kref: Reference count of this timeline object. */
> > +	struct kref			kref;
> > +
> > +	/**
> > +	 * @name: Name of the timeline.
> > +	 *
> > +	 * This is currently a copy of drm_gpu_scheduler::name.
> > +	 */
> > +	const char			*name;  
> 
> Make that a char name[] and embed the name into the structure. The macro 
> struct_size() can be used to calculate the size.

Sure I can do that.

Regards,

Boris


More information about the dri-devel mailing list