[Regression] drm/scheduler: track GPU active time per entity

Wed Apr 5 20:44:33 UTC 2023

On 2023-04-05 13:44, Lucas Stach wrote:
> Hi Luben,
> 
> Am Dienstag, dem 04.04.2023 um 00:31 -0400 schrieb Luben Tuikov:
>> On 2023-03-28 04:54, Lucas Stach wrote:
>>> Hi Danilo,
>>>
>>> Am Dienstag, dem 28.03.2023 um 02:57 +0200 schrieb Danilo Krummrich:
>>>> Hi all,
>>>>
>>>> Commit df622729ddbf ("drm/scheduler: track GPU active time per entity") 
>>>> tries to track the accumulated time that a job was active on the GPU 
>>>> writing it to the entity through which the job was deployed to the 
>>>> scheduler originally. This is done within drm_sched_get_cleanup_job() 
>>>> which fetches a job from the schedulers pending_list.
>>>>
>>>> Doing this can result in a race condition where the entity is already 
>>>> freed, but the entity's newly added elapsed_ns field is still accessed 
>>>> once the job is fetched from the pending_list.
>>>>
>>>> After drm_sched_entity_destroy() being called it should be safe to free 
>>>> the structure that embeds the entity. However, a job originally handed 
>>>> over to the scheduler by this entity might still reside in the 
>>>> schedulers pending_list for cleanup after drm_sched_entity_destroy() 
>>>> already being called and the entity being freed. Hence, we can run into 
>>>> a UAF.
>>>>
>>> Sorry about that, I clearly didn't properly consider this case.
>>>
>>>> In my case it happened that a job, as explained above, was just picked 
>>>> from the schedulers pending_list after the entity was freed due to the 
>>>> client application exiting. Meanwhile this freed up memory was already 
>>>> allocated for a subsequent client applications job structure again. 
>>>> Hence, the new jobs memory got corrupted. Luckily, I was able to 
>>>> reproduce the same corruption over and over again by just using 
>>>> deqp-runner to run a specific set of VK test cases in parallel.
>>>>
>>>> Fixing this issue doesn't seem to be very straightforward though (unless 
>>>> I miss something), which is why I'm writing this mail instead of sending 
>>>> a fix directly.
>>>>
>>>> Spontaneously, I see three options to fix it:
>>>>
>>>> 1. Rather than embedding the entity into driver specific structures 
>>>> (e.g. tied to file_priv) we could allocate the entity separately and 
>>>> reference count it, such that it's only freed up once all jobs that were 
>>>> deployed through this entity are fetched from the schedulers pending list.
>>>>
>>> My vote is on this or something in similar vain for the long term. I
>>> have some hope to be able to add a GPU scheduling algorithm with a bit
>>> more fairness than the current one sometime in the future, which
>>> requires execution time tracking on the entities.
>>
>> Danilo,
>>
>> Using kref is preferable, i.e. option 1 above.
>>
>> Lucas, can you shed some light on,
>>
>> 1. In what way the current FIFO scheduling is unfair, and
>> 2. shed some details on this "scheduling algorithm with a bit
>> more fairness than the current one"? 
> 
> I don't have a specific implementation in mind yet. However the current
> FIFO algorithm can be very unfair if you have a sparse workload compete
> with one that generates a lot of jobs without any throttling aside from
> the entity queue length.

Ah, that's interesting, let's see, a "sparse workload compete with one that
generates a lot of jobs", so basically we have a sparse workload compete
with a dense workload. So we can represent this with two entities, A, B,
whose jobs we're going to represent by the entities, names A and B.
So if we let A be the sparse workload and B the dense workload,
we have this, wlog,

      First/oldest job, .........................., latter/new jobs.
Subm: A, B, B, B, B, B, A, B, B, B, B, B, A, B, B, B, B, B, A, ...
Time: t0,t1,t2,t3,t4,t5,t6,t7,t8,t9, .....

The current FIFO algorithm, would prefer to execute those jobs
in order of submission, i.e. oldest-ready-first job. Assume
that all jobs are ready. Then we'll execute them in order.
This is desirable and fair. We want to execute the jobs
in the order they were submitted, given also that they are
ready to be executed. So perhaps we want to execute them like this:

      First/oldest job, .........................., latter/new jobs.
Subm: A, B, B, B, B, B, A, B, B, B, B, B, A, B, B, B, B, B, A, ...
Time: t0,t1,t2,t3,t4,t5,t6,t7,t8,t9, ....
Exec:          A, B, B, B, B, B, A, B, B, B, B, B, A, B, B, B, B, B, A, ...   

Any other ordering would starve either A, or B. If we executed the 2nd A
job at t6 or t7, then that would starve the 3rd/4th job in B, since the 2nd A job
arrives at the same time as that of the 3rd B job, at time t6.
The time t3-t0 is some delta > 0, some initial scheduler-start up time.

IOW, we don't want to delay a job any more than it should wait--the oldest
job, which is also ready, should execute next, so that we're fair how
it executes in real time. We cannot boot B at t6, so that we execute A,
just because it is sparse, but just arrived.

>From A's point of view, it shouldn't expect its job execution time distribution
to be any different than its submission time distribution.

Do you think there's a job permutation which offers a fairer scheduling
than the Exec line above for the Submission line above?

> By tracking the actual GPU time consumed by
> the entities we could implement something with a bit more fairness like
> deficit round robin (don't pin me on the specific algorithm, as I
> haven't given it much thought yet).

Since there's no preemption, this would be hard to achieve--you're at the mercy
of the execution time of job A_i for an entity A job i. (Assuming there's no
preemption as it is the current state of the GPU scheduler.)

The only thing you can do, is punish the next job from this entity, A_i+1,
to execute much later. However, you don't know how long A_i+1 would take. If A_i+1
takes very little time, then you're better off executing it at the next opportune
time, i.e. when it would normally execute. But such an algorithm, which doesn't
know a priori the execution time of a job, would punish A_i+1 to execute much later.

But if A_i+1 takes time as long or longer than A_i, then punishing it to execute much
later, would simply delay it, from an observer's point a view, it wouldn't teach
the context to submit smaller jobs, so that GPU sharing is more fair.

(Note that if we know the job's time a priori, we could do something like bin packing
to accommodate fair scheduling over the long run.)

One way to partially remedy the situation is parallelism. The more parallel execution
units (schedulers) you have, the more the alleviation. We'd get in trouble iff we get
all jobs executing in all schedulers, each taking a long time, with probability going
to 0 as you increase the parallel execution units (not considering a pathological
execution time distribution, where all jobs take a long time universally.)

Thinking about the FIFO discussion above: even if one job in a sparse-with-dense load
distributions from a number of context, takes a long long time to execute, the very next
job you'd want to execute is the oldest (the one who's been waiting to most) ready
job--why delay it any further--that'll starve it.
-- 
Regards,
Luben