[RFC PATCH] drm/sched: Fix a UAF on drm_sched_fence::sched

Christian König christian.koenig at amd.com
Mon Sep 2 14:18:33 UTC 2024


Am 02.09.24 um 15:23 schrieb Daniel Vetter:
> On Mon, Sep 02, 2024 at 12:43:45PM +0200, Christian König wrote:
>> Am 30.08.24 um 23:43 schrieb Matthew Brost:
>>> On Fri, Aug 30, 2024 at 10:14:18AM +0200, Christian König wrote:
>>>> Am 29.08.24 um 19:12 schrieb Boris Brezillon:
>>>>> dma_fence objects created by an entity might outlive the
>>>>> drm_gpu_scheduler this entity was bound to if those fences are retained
>>>>> by other objects, like a dma_buf resv. This means that
>>>>> drm_sched_fence::sched might be invalid when the resv is walked, which
>>>>> in turn leads to a UAF when dma_fence_ops::get_timeline_name() is called.
>>>>>
>>>>> This probably went unnoticed so far, because the drm_gpu_scheduler had
>>>>> the lifetime of the drm_device, so, unless you were removing the device,
>>>>> there was no reason for the scheduler to be gone before its fences.
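
For reference, the callback in question is roughly the current
drivers/gpu/drm/scheduler/sched_fence.c implementation, which blindly
dereferences fence->sched (sketch, comment added):

static const char *drm_sched_fence_get_timeline_name(struct dma_fence *f)
{
        struct drm_sched_fence *fence = to_drm_sched_fence(f);

        /*
         * UAF if the drm_gpu_scheduler backing fence->sched was torn
         * down while the fence is still referenced from a dma_resv,
         * a drm_syncobj, a sync_file, ...
         */
        return (const char *)fence->sched->name;
}
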
>>>> Nope, that is intentional design. get_timeline_name() is not safe to be
>>>> called after the fence has signaled because that would cause circular
>>>> dependency problems.
> So I don't think knowingly crashing in debugfs is ok. debugfs can break
> stuff like secure boot, and if you go about things very wrongly it can
> upset the kernel (like touching pci mappings from userspace can). But just
> going boom due to a race essentially means debugfs is unusable. Because
> there's no way to avoid the boom with dma_fence:
>
> - they're guaranteed to signal in finite time (barring driver bugs)
>
> - the moment they've signalled, looking too closely at them is undefined
>    behaviour.
>
>>> I'm quite sure this happens: ftrace, for example, can and will call
>>> get_timeline_name() in trace_dma_fence_destroy, which is certainly after
>>> the fence is signaled. There are likely other cases too - this one just
>>> quickly came to mind.
>> Good point, completely forgotten about ftrace.
>>
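
For reference, the dma_fence tracepoint class records the timeline name for
every event, including the destroy one fired from dma_fence_release(); an
abridged sketch of include/trace/events/dma_fence.h:

DECLARE_EVENT_CLASS(dma_fence,
        TP_PROTO(struct dma_fence *fence),
        TP_ARGS(fence),
        TP_STRUCT__entry(
                __string(driver, fence->ops->get_driver_name(fence))
                /* evaluated again from trace_dma_fence_destroy(), i.e.
                 * from dma_fence_release(), long after signaling */
                __string(timeline, fence->ops->get_timeline_name(fence))
                /* context/seqno fields omitted */
        ),
        /* TP_fast_assign()/TP_printk() omitted */
);
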
>>>> E.g. when you have hardware fences it can happen that fences reference a
>>>> driver module (for the function printing the name) and the module in turn
>>>> keeps fences around.
>>>>
>>> I am almost positive that without this patch this is problematic in Xe or
>>> any driver in which schedulers are tied to IOCTLs rather than to the
>>> kernel module.
>>>
>>> In Xe, 'fence->sched' maps to an xe_exec_queue which can be freed once
>>> the destroy exec queue IOCTL is called and all jobs are freed (i.e.
>>> 'fence' signals). The fence could live on after that in dma-resv objects,
>>> drm syncobjs, etc...
>>>
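
A rough sketch of that lifetime mismatch (illustrative only, names and
layout are not the actual Xe code):

/* Illustrative only, not the actual Xe structures. */
struct example_exec_queue {                     /* lifetime: IOCTL */
        struct drm_gpu_scheduler sched;         /* fence->sched points in here */
        /* ... */
};

/*
 * 1. The job's finished fence gets installed in a dma_resv / drm_syncobj.
 * 2. The fence signals, the destroy-queue IOCTL frees example_exec_queue.
 * 3. Someone walks the container and calls
 *    fence->ops->get_timeline_name() -> fence->sched->name -> UAF.
 */
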
>>> I know this issue has been raised before and basically NACK'd but I have
>>> a strong opinion this is valid and in fact required.
>> I've NACK'd automatically signaling pending fences on destruction of the
>> scheduler (that reminds me that I wanted to add a warning for that) and
>> copying the name into every scheduler fence.
>>
>> As long as we don't do any of that I'm perfectly fine fixing this issue. The
>> approach of creating a reference counted object for the name looks rather
>> valid to me.
>>
>> Especially since we then pretty much get the module references correct for
>> free as well.
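
A minimal sketch of such a refcounted name object (hypothetical names,
nothing like this exists in the tree today):

/* Hypothetical sketch, not an existing API. */
struct sched_fence_timeline {
        struct kref refcount;
        char name[32];
};

static void sched_fence_timeline_release(struct kref *kref)
{
        kfree(container_of(kref, struct sched_fence_timeline, refcount));
}

/*
 * Every drm_sched_fence would hold a reference (dropped with kref_put()
 * from the fence's release path), so the name stays valid even after
 * drm_sched_fini() tore down the drm_gpu_scheduler itself.
 */
static const char *example_get_timeline_name(struct sched_fence_timeline *tl)
{
        return tl->name;
}
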
> So I think the issue is much, much bigger, and there's more. And the
> issue is I think a fundamental design issue of dma_fence itself, not
> individual users.

IIRC both Alex and I pointed out this issue on the very first dma_fence 
code and nobody really cared.

>   I think at the core it's two constraints:
>
> - dma_fence can stick around practically forever in various container
>    objects. We only garbage collect when someone looks, and not even then
>    consistently.
>
> - fences are meant to be cheap, so they do not have the heavyweight
>    refcounting that other shared objects like dma_buf have
>
> Specifically there's also no refcounting on the module itself with the
> ->owner and try_module_get stuff. So even if we fix all these issues on
> the data structure lifetime side of things, you might still oops calling
> into dma_fence->ops->release.
>
> Oops.

Yes, exactly that. I'm a bit surprised that you realize that only now :)

We have had this issue for at least 10 years or so, and it pops up on my 
desk every now and then because people complain that unloading amdgpu crashes.
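
For context, struct dma_fence_ops has no ->owner field and
dma_fence_get()/dma_fence_put() never touch module reference counts, so the
final callback can land in already-unloaded code. A hypothetical fix could
look roughly like this (not an existing API):

/* Hypothetical sketch only, dma_fence_ops has no ->owner field today. */
struct pinned_fence_ops {
        struct module *owner;
        const struct dma_fence_ops *ops;
};

static bool example_fence_pin_module(const struct pinned_fence_ops *pops)
{
        /* Pin the implementing module for the fence's lifetime ... */
        return try_module_get(pops->owner);
        /* ... and module_put(pops->owner) after the final ops->release(). */
}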

> I think the complete solution is to change all this code so that the core
> dma-fence.c code guarantees to never again call into any driver code
> after dma_fence_signal() has been called, and takes over the final kfree_rcu
> itself. That's a gigantic change, but I think it's the only way to
> really fix this mess:
>
> - drivers will clean up any of their own references in a timely fashion,
>    so no more accidentally lingering gpu contexts or vms and the bos they
>    have mapped lying around.
>
> - there's no lifetime or other use-after-free issues anywhere for fences
>    anymore
>
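
A very rough sketch of what that could mean on the release path, assuming
the core already dropped all driver state when the fence signaled
(hypothetical, not the current dma_fence_release()):

/* Hypothetical sketch, not the current dma-fence.c code. */
static void example_fence_release(struct kref *kref)
{
        struct dma_fence *fence =
                container_of(kref, struct dma_fence, refcount);

        if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags)) {
                /*
                 * Driver references were already dropped when the fence
                 * signaled, so the core can free it without ever touching
                 * fence->ops again.
                 */
                kfree_rcu(fence, rcu);
                return;
        }

        /* Tearing down an unsignaled fence still goes through the driver. */
        fence->ops->release(fence);
}
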
> The downside is that some of the debugging stuff becomes a bit less useful.
> But e.g. tracepoints could just dump the timeline once at creation or when
> signalling, so you don't need to dump it anymore when freeing. And a
> signalled fence is generally not a problem anymore, so in a compositor
> that's also all fine (iirc you can get at some of this stuff through the
> sync_file interfaces too).
>
> The other downside is that it's a huge pile of work, but I don't think we
> can get to an actually solid design with fewer headaches and less pain ...
>
> Thoughts?

The alternative is to use the scheduler fence(s) to decouple hardware 
fences from the containers. That would be rather cheap to implement.

The only downside would be that the scheduler module would probably stay 
loaded forever once used. But at least I can live with that.
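
Roughly: only the scheduler's finished fence, whose dma_fence_ops live in
the gpu-sched module, would ever end up in long-lived containers, while the
hardware fence stays private to the driver (sketch, assuming the usual
job/resv setup and a reserved fence slot):

/*
 * Sketch of the decoupling: containers only ever see the scheduler
 * fence; the hardware fence can be dropped as soon as it signals, so
 * nothing long-lived keeps the driver module pinned.
 */
dma_resv_add_fence(resv, &job->s_fence->finished, DMA_RESV_USAGE_WRITE);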

Regards,
Christian.

>
> Cheers, Sima
