[PATCH 1/3] drm/scheduler: Add more documentation

Fri Jul 14 09:51:32 UTC 2023

On 14/07/2023 18.47, Christian König wrote:
> Am 14.07.23 um 11:39 schrieb Asahi Lina:
>> On 14/07/2023 17.40, Christian König wrote:
>>> Am 14.07.23 um 10:21 schrieb Asahi Lina:
>>>> Document the implied lifetime rules of the scheduler (or at least the
>>>> intended ones), as well as the expectations of how resource acquisition
>>>> should be handled.
>>>>
>>>> Signed-off-by: Asahi Lina <lina at asahilina.net>
>>>> ---
>>>>     drivers/gpu/drm/scheduler/sched_main.c | 58
>>>> ++++++++++++++++++++++++++++++++--
>>>>     1 file changed, 55 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>>> index 7b2bfc10c1a5..1f3bc3606239 100644
>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>> @@ -43,9 +43,61 @@
>>>>      *
>>>>      * The jobs in a entity are always scheduled in the order that
>>>> they were pushed.
>>>>      *
>>>> - * Note that once a job was taken from the entities queue and
>>>> pushed to the
>>>> - * hardware, i.e. the pending queue, the entity must not be
>>>> referenced anymore
>>>> - * through the jobs entity pointer.
>>>> + * Lifetime rules
>>>> + * --------------
>>>> + *
>>>> + * Getting object lifetimes right across the stack is critical to
>>>> avoid UAF
>>>> + * issues. The DRM scheduler has the following lifetime rules:
>>>> + *
>>>> + * - The scheduler must outlive all of its entities.
>>>> + * - Jobs pushed to the scheduler are owned by it, and must only be
>>>> freed
>>>> + *   after the free_job() callback is called.
>>>> + * - Scheduler fences are reference-counted and may outlive the
>>>> scheduler.
>>>
>>>> + * - The scheduler *may* be destroyed while jobs are still in flight.
>>>
>>> That's not correct. The scheduler can only be destroyed after all the
>>> entities serving it have been destroyed as well as all the jobs already
>>> pushed to the hw finished.
>>
>> The point of this series is to change this behavior so I can actually
>> use the scheduler in my use case, and that begins with formally
>> documenting it as Daniel suggested. That is, I need it to be safe for
>> jobs to not be yet complete before the scheduler is destroyed (the
>> entities do get destroyed first, that's the first bullet point).
> 
> Yeah, but you need to document the current situation not how you like it
> to be.

Daniel told me to document how I think it should be, then fix the bugs 
that make it not so. That's what this series does.

>> We already had this discussion. Without this guarantee, I cannot build
>> a reasonable safe Rust abstraction. Unless you have another
>> suggestion, as far as I can tell it's either this or I give up on
>> using the DRM scheduler entirely and reimplement something else on my
>> own.
>>
>>> What might be possible to add is that the hw is still working on the
>>> already pushed jobs, but so far that was rejected as undesirable.
>>
>> Where was this rejected?
> 
> Years ago. Our initial driver suspend/resume design relied on that.
> Turned out not to be a good idea

Times change, maybe it's time to revisit that decision?

~~ Lina