[Regression] drm/scheduler: track GPU active time per entity

Wed Apr 5 14:05:53 UTC 2023

On 4/4/23 06:31, Luben Tuikov wrote:
> On 2023-03-28 04:54, Lucas Stach wrote:
>> Hi Danilo,
>>
>> Am Dienstag, dem 28.03.2023 um 02:57 +0200 schrieb Danilo Krummrich:
>>> Hi all,
>>>
>>> Commit df622729ddbf ("drm/scheduler: track GPU active time per entity")
>>> tries to track the accumulated time that a job was active on the GPU
>>> writing it to the entity through which the job was deployed to the
>>> scheduler originally. This is done within drm_sched_get_cleanup_job()
>>> which fetches a job from the schedulers pending_list.
>>>
>>> Doing this can result in a race condition where the entity is already
>>> freed, but the entity's newly added elapsed_ns field is still accessed
>>> once the job is fetched from the pending_list.
>>>
>>> After drm_sched_entity_destroy() being called it should be safe to free
>>> the structure that embeds the entity. However, a job originally handed
>>> over to the scheduler by this entity might still reside in the
>>> schedulers pending_list for cleanup after drm_sched_entity_destroy()
>>> already being called and the entity being freed. Hence, we can run into
>>> a UAF.
>>>
>> Sorry about that, I clearly didn't properly consider this case.
>>
>>> In my case it happened that a job, as explained above, was just picked
>>> from the schedulers pending_list after the entity was freed due to the
>>> client application exiting. Meanwhile this freed up memory was already
>>> allocated for a subsequent client applications job structure again.
>>> Hence, the new jobs memory got corrupted. Luckily, I was able to
>>> reproduce the same corruption over and over again by just using
>>> deqp-runner to run a specific set of VK test cases in parallel.
>>>
>>> Fixing this issue doesn't seem to be very straightforward though (unless
>>> I miss something), which is why I'm writing this mail instead of sending
>>> a fix directly.
>>>
>>> Spontaneously, I see three options to fix it:
>>>
>>> 1. Rather than embedding the entity into driver specific structures
>>> (e.g. tied to file_priv) we could allocate the entity separately and
>>> reference count it, such that it's only freed up once all jobs that were
>>> deployed through this entity are fetched from the schedulers pending list.
>>>
>> My vote is on this or something in similar vain for the long term. I
>> have some hope to be able to add a GPU scheduling algorithm with a bit
>> more fairness than the current one sometime in the future, which
>> requires execution time tracking on the entities.
> 
> Danilo,
> 
> Using kref is preferable, i.e. option 1 above.

I think the only real motivation for doing that would be for generically 
tracking job statistics within the entity a job was deployed through. If 
we all agree on tracking job statistics this way I am happy to prepare a 
patch for this option and drop this one: 
https://lore.kernel.org/all/20230331000622.4156-1-dakr@redhat.com/T/#u

Christian mentioned amdgpu tried something similar to what Lucas tried 
running into similar trouble, backed off and implemented it in another 
way - a driver specific way I guess?

> 
> Lucas, can you shed some light on,
> 
> 1. In what way the current FIFO scheduling is unfair, and
> 2. shed some details on this "scheduling algorithm with a bit
> more fairness than the current one"?
> 
> Regards,
> Luben
> 
>>
>>> 2. Somehow make sure drm_sched_entity_destroy() does block until all
>>> jobs deployed through this entity were fetched from the schedulers
>>> pending list. Though, I'm pretty sure that this is not really desirable.
>>>
>>> 3. Just revert the change and let drivers implement tracking of GPU
>>> active times themselves.
>>>
>> Given that we are already pretty late in the release cycle and etnaviv
>> being the only driver so far making use of the scheduler elapsed time
>> tracking I think the right short term solution is to either move the
>> tracking into etnaviv or just revert the change for now. I'll have a
>> look at this.
>>
>> Regards,
>> Lucas
>>
>>> In the case of just reverting the change I'd propose to also set a jobs
>>> entity pointer to NULL  once the job was taken from the entity, such
>>> that in case of a future issue we fail where the actual issue resides
>>> and to make it more obvious that the field shouldn't be used anymore
>>> after the job was taken from the entity.
>>>
>>> I'm happy to implement the solution we agree on. However, it might also
>>> make sense to revert the change until we have a solution in place. I'm
>>> also happy to send a revert with a proper description of the problem.
>>> Please let me know what you think.
>>>
>>> - Danilo
>>>
>>
>