[RFC v7 10/12] drm/sched: Break submission patterns with some randomness
Pierre-Eric Pelloux-Prayer
pierre-eric at damsy.net
Mon Jul 28 09:28:11 UTC 2025
On 24/07/2025 16:19, Tvrtko Ursulin wrote:
> GPUs generally don't implement preemption and the DRM scheduler
> definitely does not support it at the front-end scheduling level. This
> means execution quanta can be quite long and are controlled by
> userspace, the consequence of which is that picking the "wrong" entity
> to run can have a larger negative effect than it would with a virtual
> runtime based CPU scheduler.
>
> Another important consideration is that rendering clients often have
> shallow submission queues, meaning they will be entering and exiting
> the scheduler's runnable queue frequently.
>
> The relevant scenario here is what happens when an entity re-joins the
> runnable queue with other entities already present. One cornerstone of
> the virtual runtime algorithm is to let it re-join at the head and
> depend on the virtual runtime accounting to sort out the order after
> an execution quantum or two.
>
> However, as explained above, this may not work fully reliably in the
> GPU world. The entity could always end up overtaking the existing
> entities, or never, depending on the submission order and the rbtree's
> equal-key insertion behaviour.
>
> We can break this latching by adding some randomness for this specific
> corner case.
>
> If an entity is re-joining the runnable queue, was the head of the
> queue the last time it got picked, and a different entity of equal
> scheduling priority is already queued, we can break the tie by
> randomly choosing the execution order between the two.
>
> For randomness we implement a simple driver-global boolean which
> selects whether the new entity will go first or not. Because the
> boolean is global and shared between all the run queues and entities,
> its actual effect can loosely be called random, under the assumption
> that it will not always be the same entity re-joining the queue under
> these circumstances.
>
> Another way to look at this is that it is adding a little bit of limited
> random round-robin behaviour to the fair scheduling algorithm.
>
> The net effect is a significant improvement in the scheduling unit
> tests which check the scheduling quality for an interactive client
> running in parallel with GPU hogs.
>
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin at igalia.com>
> Cc: Christian König <christian.koenig at amd.com>
> Cc: Danilo Krummrich <dakr at kernel.org>
> Cc: Matthew Brost <matthew.brost at intel.com>
> Cc: Philipp Stanner <phasta at kernel.org>
> Cc: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer at amd.com>
> ---
> drivers/gpu/drm/scheduler/sched_rq.c | 10 ++++++++++
> 1 file changed, 10 insertions(+)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_rq.c b/drivers/gpu/drm/scheduler/sched_rq.c
> index d16ee3ee3653..087a6bdbb824 100644
> --- a/drivers/gpu/drm/scheduler/sched_rq.c
> +++ b/drivers/gpu/drm/scheduler/sched_rq.c
> @@ -147,6 +147,16 @@ drm_sched_entity_restore_vruntime(struct drm_sched_entity *entity,
> * Higher priority can go first.
> */
> vruntime = -us_to_ktime(rq_prio - prio);
> + } else {
> + static const int shuffle[2] = { -100, 100 };
> + static bool r = 0;
> +
> + /*
> + * For equal priority apply some randomness to break
> + * latching caused by submission patterns.
> + */
> + vruntime = shuffle[r];
> + r ^= 1;
I don't understand why this is needed at all.
I suppose this is related to how drm_sched_entity_save_vruntime saves a relative vruntime (an entity
re-joining with a vruntime of 0 would otherwise be impossible), but I don't understand that part either.
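
To make my question concrete, here is a toy userspace sketch of what I assume
"saving a relative vruntime" means. toy_rq, toy_entity and the two helpers are
made up purely for illustration; they are not the actual
drm_sched_entity_save_vruntime / drm_sched_entity_restore_vruntime code:

#include <stdio.h>

typedef long long s64;

/* Toy model: rq->min_vruntime tracks the head of the queue and only grows. */
struct toy_rq { s64 min_vruntime; };
struct toy_entity { s64 vruntime; };

/* On leaving the queue, store the vruntime relative to the rq minimum. */
static void toy_save_vruntime(struct toy_entity *e, const struct toy_rq *rq)
{
	e->vruntime -= rq->min_vruntime;
}

/*
 * On re-joining, rebase onto the current minimum: a saved value of 0 then
 * means "re-join at the head", not "re-join with zero accumulated runtime".
 */
static void toy_restore_vruntime(struct toy_entity *e, const struct toy_rq *rq)
{
	e->vruntime += rq->min_vruntime;
}

int main(void)
{
	struct toy_rq rq = { .min_vruntime = 1000 };
	struct toy_entity e = { .vruntime = 1000 };  /* currently at the head */

	toy_save_vruntime(&e, &rq);     /* stored as 0 (relative) */
	rq.min_vruntime = 5000;         /* other entities ran in the meantime */
	toy_restore_vruntime(&e, &rq);  /* re-joins at 5000, i.e. at the head */

	printf("restored vruntime: %lld\n", e.vruntime);
	return 0;
}

If that reading is correct, a re-joining entity always ties with whatever is
currently at the head of the queue, which I guess is the latching the shuffle
above is meant to break.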
Pierre-Eric
> }
> }
>