[PATCH RFC 11/18] drm/scheduler: Clean up jobs when the scheduler is torn down

Christian König christian.koenig at amd.com
Wed Mar 8 18:12:00 UTC 2023


Am 08.03.23 um 18:32 schrieb Asahi Lina:
> [SNIP]
> Yes but... none of this cleans up jobs that are already submitted by the
> scheduler and in its pending list, with registered completion callbacks,
> which were already popped off of the entities.
>
> *That* is the problem this patch fixes!

Ah! Yes that makes more sense now.

>> We could add a warning when users of this API doesn't do this
>> correctly, but cleaning up incorrect API use is clearly something we
>> don't want here.
> It is the job of the Rust abstractions to make incorrect API use that
> leads to memory unsafety impossible. So even if you don't want that in
> C, it's my job to do that for Rust... and right now, I just can't
> because drm_sched doesn't provide an API that can be safely wrapped
> without weird bits of babysitting functionality on top (like tracking
> jobs outside or awkwardly making jobs hold a reference to the scheduler
> and defer dropping it to another thread).

Yeah, that was discussed before but rejected.

The argument was that upper layer needs to wait for the hw to become 
idle before the scheduler can be destroyed anyway.

>>> Right now, it is not possible to create a safe Rust abstraction for
>>> drm_sched without doing something like duplicating all job tracking in
>>> the abstraction, or the above backreference + deferred cleanup mess, or
>>> something equally silly. So let's just fix the C side please ^^
>> Nope, as far as I can see this is just not correctly tearing down the
>> objects in the right order.
> There's no API to clean up in-flight jobs in a drm_sched at all.
> Destroying an entity won't do it. So there is no reasonable way to do
> this at all...

Yes, this was removed.

>> So you are trying to do something which is not supposed to work in the
>> first place.
> I need to make things that aren't supposed to work impossible to do in
> the first place, or at least fail gracefully instead of just oopsing
> like drm_sched does today...
>
> If you're convinced there's a way to do this, can you tell me exactly
> what code sequence I need to run to safely shut down a scheduler
> assuming all entities are already destroyed? You can't ask me for a list
> of pending jobs (the scheduler knows this, it doesn't make any sense to
> duplicate that outside), and you can't ask me to just not do this until
> all jobs complete execution (because then we either end up with the
> messy deadlock situation I described if I take a reference, or more
> duplicative in-flight job count tracking and blocking in the free path
> of the Rust abstraction, which doesn't make any sense either).

Good question. We don't have anybody upstream which uses the scheduler 
lifetime like this.

Essentially the job list in the scheduler is something we wanted to 
remove because it causes tons of race conditions during hw recovery.

When you tear down the firmware queue how do you handle already 
submitted jobs there?

Regards,
Christian.

>
> ~~ Lina



More information about the dri-devel mailing list