[PATCH 8/8] drm/sched: Further optimise drm_sched_entity_push_job

Fri Sep 13 15:01:14 UTC 2024

On 10/09/2024 16:03, Christian König wrote:
> Am 10.09.24 um 11:46 schrieb Tvrtko Ursulin:
>>
>> On 10/09/2024 10:08, Christian König wrote:
>>> Am 09.09.24 um 19:19 schrieb Tvrtko Ursulin:
>>>> From: Tvrtko Ursulin <tvrtko.ursulin at igalia.com>
>>>>
>>>> Having removed one re-lock cycle on the entity->lock in a patch titled
>>>> "drm/sched: Optimise drm_sched_entity_push_job", with only a tiny bit
>>>> larger refactoring we can do the same optimisation on the rq->lock.
>>>> (Currently both drm_sched_rq_add_entity() and
>>>> drm_sched_rq_update_fifo_locked() take and release the same lock.)
>>>
>>> I think that goes into the wrong direction.
>>>
>>> Probably better to move this here into drm_sched_rq_add_entity():
>>>
>>>            if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
>>>               drm_sched_rq_update_fifo_locked(entity, submit_ts);
>>>
>>> We can then also drop adding the entity to the rr list when FIFO is 
>>> in use.
>>
>> Unfortuntely there is a few other places which appear to rely on the 
>> list. Like drm_sched_fini,
> 
> That should be only a warning.

Warning as in?

>> drm_sched_increase_karma and 
> 
> The karma handling was another bad idea from AMD how to populate back 
> errors to userspace and I've just recently documented together with Sima 
> that we should use dma-fence errors instead.
> 
> Just didn't had time to tackle cleaning that up yet.
> 
>> even amdgpu_job_stop_all_jobs_on_sched.
> 
> Uff, seeing that for the first time just now. Another bad idea how to 
> handle things which doesn't take the appropriate locks and looks racy to 
> me.
> 
> 
>> Latter could perhaps be solved by adding an iterator helper to the 
>> scheduler, which would perhaps be a good move for component isolation. 
>> And first two could be handled by implementing a complete and mutually 
>> exclusive duality of how entities are walked depending on scheduling 
>> mode. Plus making the scheduling mode only be configurable at boot. It 
>> feels doable but significant work and in the meantime removing the 
>> double re-lock maybe acceptable?
> 
> I don't think we should optimize for something we want to remove in the 
> long term.

I knew using the term optimise would just making things more difficult 
for myself. :) Lets view this as cleaning up the API to avoid the 
inelegance of taking the same lock twice right next to each other.

If we can achieve this while not making the API worse then there is 
nothing to lose either short, med or long term.

> If possible I would rather say that we should completely drop the RR 
> approach and only use FIFO or even something more sophisticated.

No complaints from me, but I don't know how that would work other than 
putting a depreciation warning if someone selected RR. And keeping that 
for a good number of kernel releases. Any other ideas?

Regards,

Tvrtko