[PATCH] drm/scheduler: set current_entity to next when remove from rq

Luben Tuikov luben.tuikov at amd.com
Thu Oct 27 15:46:35 UTC 2022


So, I started fixing this, including the bug taking the next element as an entity, but it could be actually the list_head... a la your patch being fixed, and then went down the rabbit whole of also fixing drm_sched_rq_select_entity(), but the problem is that at that point we don't know if we should start from the _next_ entity (as it is currently the case) or from the current entity (a la list_for_each_entry_from()) as it would be the case with this patch (if it were fixed for the list_head bug).

But the problem is that elsewhere in the GPU scheduler (sched_entity.c), we do want to start from rq->current_entity->next, and picking "next" in drm_sched_rq_remove_entity(), would then skip an entity, or be biased for an entity twice. This is why this function is called drm_sched_rq_remove_entity() and not drm_sched_rq_next_entity_or_null().

So all this work seemed moot, given that we've already switched to FIFO-based scheduling in drm-misc-next, and so I didn't see a point in developing this further at this point (it's been working alright)--we're going with FIFO-based scheduling.

Regards,
Luben


On 2022-10-27 05:08, Christian König wrote:
> Am 27.10.22 um 11:00 schrieb broler Liew:
>> It's very nice of you-all to finger it out that it may crash when it is the last entity in the list.   Absolutely I made a mistake about that.
>> But I still cannot understand why we need to restart the selection from the list head when the current entity is removed from rq.
>> In drm_sched_rq_select_entity, starting from head may cause the first entity to be selected more often than others, which breaks the equal rule the scheduler wants to achieve.
>> Maybe the previous one is the better choice when current_entity == entity?
> 
> That's a good argument, but we want to get rid of the round robin algorithm anyway and switch over to the fifo.
> 
> So this is some code which is already not used by default any more and improving it doesn't make much sense.
> 
> Regards,
> Christian.
> 
>>
>> Luben Tuikov <luben.tuikov at amd.com> 于2022年10月27日周四 16:24写道:
>>
>>     On 2022-10-27 04:19, Christian König wrote:
>>     > Am 27.10.22 um 10:07 schrieb Luben Tuikov:
>>     >> On 2022-10-27 03:01, Luben Tuikov wrote:
>>     >>> On 2022-10-25 13:50, Luben Tuikov wrote:
>>     >>>> Looking...
>>     >>>>
>>     >>>> Regards,
>>     >>>> Luben
>>     >>>>
>>     >>>> On 2022-10-25 09:35, Alex Deucher wrote:
>>     >>>>> + Luben
>>     >>>>>
>>     >>>>> On Tue, Oct 25, 2022 at 2:55 AM brolerliew <brolerliew at gmail.com> wrote:
>>     >>>>>> When entity move from one rq to another, current_entity will be set to NULL
>>     >>>>>> if it is the moving entity. This make entities close to rq head got
>>     >>>>>> selected more frequently, especially when doing load balance between
>>     >>>>>> multiple drm_gpu_scheduler.
>>     >>>>>>
>>     >>>>>> Make current_entity to next when removing from rq.
>>     >>>>>>
>>     >>>>>> Signed-off-by: brolerliew <brolerliew at gmail.com>
>>     >>>>>> ---
>>     >>>>>>   drivers/gpu/drm/scheduler/sched_main.c | 5 +++--
>>     >>>>>>   1 file changed, 3 insertions(+), 2 deletions(-)
>>     >>>>>>
>>     >>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>     >>>>>> index 2fab218d7082..00b22cc50f08 100644
>>     >>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>     >>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>     >>>>>> @@ -168,10 +168,11 @@ void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
>>     >>>>>>          spin_lock(&rq->lock);
>>     >>>>>>
>>     >>>>>>          atomic_dec(rq->sched->score);
>>     >>>>>> -       list_del_init(&entity->list);
>>     >>>>>>
>>     >>>>>>          if (rq->current_entity == entity)
>>     >>>>>> -               rq->current_entity = NULL;
>>     >>>>>> +               rq->current_entity = list_next_entry(entity, list);
>>     >>>>>> +
>>     >>>>>> +       list_del_init(&entity->list);
>>     >>>>>>
>>     >>>>>>          if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
>>     >>>>>>                  drm_sched_rq_remove_fifo_locked(entity);
>>     >>>>>> --
>>     >>>>>> 2.34.1
>>     >>>>>>
>>     >>> Looks good. I'll pick it up into some other changes I've in tow, and repost
>>     >>> along with my changes, as they're somewhat related.
>>     >> Actually, the more I look at it, the more I think that we do want to set
>>     >> rq->current_entity to NULL in that function, in order to pick the next best entity
>>     >> (or scheduler for that matter), the next time around. See sched_entity.c,
>>     >> and drm_sched_rq_select_entity() where we start evaluating from the _next_
>>     >> entity.
>>     >>
>>     >> So, it is best to leave it to set it to NULL, for now.
>>     >
>>     > Apart from that this patch here could cause a crash when the entity is
>>     > the last one in the list.
>>     >
>>     > In this case current current_entity would be set to an incorrect upcast
>>     > of the head of the list.
>>
>>     Absolutely. I saw that, but in rejecting the patch, I didn't feel the need to mention it.
>>
>>     Thanks for looking into this.
>>
>>     Regards,
>>     Luben
>>
> 



More information about the dri-devel mailing list