[Regression] drm/scheduler: track GPU active time per entity

Tue Apr 4 04:31:35 UTC 2023

On 2023-03-28 04:54, Lucas Stach wrote:
> Hi Danilo,
> 
> Am Dienstag, dem 28.03.2023 um 02:57 +0200 schrieb Danilo Krummrich:
>> Hi all,
>>
>> Commit df622729ddbf ("drm/scheduler: track GPU active time per entity") 
>> tries to track the accumulated time that a job was active on the GPU 
>> writing it to the entity through which the job was deployed to the 
>> scheduler originally. This is done within drm_sched_get_cleanup_job() 
>> which fetches a job from the schedulers pending_list.
>>
>> Doing this can result in a race condition where the entity is already 
>> freed, but the entity's newly added elapsed_ns field is still accessed 
>> once the job is fetched from the pending_list.
>>
>> After drm_sched_entity_destroy() being called it should be safe to free 
>> the structure that embeds the entity. However, a job originally handed 
>> over to the scheduler by this entity might still reside in the 
>> schedulers pending_list for cleanup after drm_sched_entity_destroy() 
>> already being called and the entity being freed. Hence, we can run into 
>> a UAF.
>>
> Sorry about that, I clearly didn't properly consider this case.
> 
>> In my case it happened that a job, as explained above, was just picked 
>> from the schedulers pending_list after the entity was freed due to the 
>> client application exiting. Meanwhile this freed up memory was already 
>> allocated for a subsequent client applications job structure again. 
>> Hence, the new jobs memory got corrupted. Luckily, I was able to 
>> reproduce the same corruption over and over again by just using 
>> deqp-runner to run a specific set of VK test cases in parallel.
>>
>> Fixing this issue doesn't seem to be very straightforward though (unless 
>> I miss something), which is why I'm writing this mail instead of sending 
>> a fix directly.
>>
>> Spontaneously, I see three options to fix it:
>>
>> 1. Rather than embedding the entity into driver specific structures 
>> (e.g. tied to file_priv) we could allocate the entity separately and 
>> reference count it, such that it's only freed up once all jobs that were 
>> deployed through this entity are fetched from the schedulers pending list.
>>
> My vote is on this or something in similar vain for the long term. I
> have some hope to be able to add a GPU scheduling algorithm with a bit
> more fairness than the current one sometime in the future, which
> requires execution time tracking on the entities.

Danilo,

Using kref is preferable, i.e. option 1 above.

Lucas, can you shed some light on,

1. In what way the current FIFO scheduling is unfair, and
2. shed some details on this "scheduling algorithm with a bit
more fairness than the current one"? 

Regards,
Luben

> 
>> 2. Somehow make sure drm_sched_entity_destroy() does block until all 
>> jobs deployed through this entity were fetched from the schedulers 
>> pending list. Though, I'm pretty sure that this is not really desirable.
>>
>> 3. Just revert the change and let drivers implement tracking of GPU 
>> active times themselves.
>>
> Given that we are already pretty late in the release cycle and etnaviv
> being the only driver so far making use of the scheduler elapsed time
> tracking I think the right short term solution is to either move the
> tracking into etnaviv or just revert the change for now. I'll have a
> look at this.
> 
> Regards,
> Lucas
> 
>> In the case of just reverting the change I'd propose to also set a jobs 
>> entity pointer to NULL  once the job was taken from the entity, such 
>> that in case of a future issue we fail where the actual issue resides 
>> and to make it more obvious that the field shouldn't be used anymore 
>> after the job was taken from the entity.
>>
>> I'm happy to implement the solution we agree on. However, it might also 
>> make sense to revert the change until we have a solution in place. I'm 
>> also happy to send a revert with a proper description of the problem. 
>> Please let me know what you think.
>>
>> - Danilo
>>
>