[RFC v3 06/14] drm/sched: Implement RR via FIFO

Fri Apr 4 09:27:17 UTC 2025

On 02/04/2025 14:37, Christian König wrote:
> Am 02.04.25 um 15:22 schrieb Michel Dänzer:
>> On 2025-04-02 14:00, Philipp Stanner wrote:
>>> On Wed, 2025-04-02 at 12:58 +0200, Michel Dänzer wrote:
>>>> On 2025-04-02 12:46, Philipp Stanner wrote:
>>>>> On Mon, 2025-03-31 at 21:16 +0100, Tvrtko Ursulin wrote:
>>>>>> Round-robin being the non-default policy and unclear how much it
>>>>>> is
>>>>>> used,
>>>>>> we can notice that it can be implemented using the FIFO data
>>>>>> structures if
>>>>>> we only invent a fake submit timestamp which is monotonically
>>>>>> increasing
>>>>>> inside drm_sched_rq instances.
>>>>>>
>>>>>> So instead of remembering which was the last entity the scheduler
>>>>>> worker
>>>>>> picked, we can bump the picked one to the bottom of the tree,
>>>>>> achieving
>>>>>> the same round-robin behaviour.
>>>>>>
>>>>>> Advantage is that we can consolidate to a single code path and
>>>>>> remove
>>>>>> a
>>>>>> bunch of code. Downside is round-robin mode now needs to lock on
>>>>>> the
>>>>>> job
>>>>>> pop path but that should not be visible.
>>>>> Why did you decide to do it that way and then later remove RR &
>>>>> FIFO
>>>>> alltogether in patch 10, basically?
>>>>>
>>>>> I think the far cleaner way for our development-process would be a
>>>>> separate patch(-series) that *removes* RR completely. Advantages
>>>>> are:
>>>>>
>>>>>     1. It should be relatively easy to do
>>>>>     2. It would simplify the existing code base independently of
>>>>> what
>>>>>        happens with your RFC series here
>>>>>     3. Before changing everyone's scheduling policy to a completely
>>>>> new,
>>>>>        deadline-based one, we could first be sure for a few release
>>>>>        cycles that everyone is now on FIFO, establishing common
>>>>> ground.
>>>>>     4. We could CC every- and anyone who might use RR or might know
>>>>>        someone who does
>>>>>     5. If it turns out we screwed up and someone really relies on
>>>>> RR, it
>>>>>        would be easy to revert.
>>>>>
>>>>> I am not aware of any RR users and have, in past discussions, never
>>>>> heard of any. So removing it is more tempting for the above
>>>>> reasons.
>>>> https://gitlab.freedesktop.org/drm/amd/-/issues/2516 has a bunch of
>>>> RR users...
>>> Right, there's a number of people complaining about the regression. But
>>> what I'm interested in is: how did it evolve since then. Are there
>>> distributions who set the module parameter? Does Steam do it? Or is it
>>> individual users who work around the problem that way?
>> I know only of the latter.
>>> https://gitlab.freedesktop.org/drm/amd/-/issues/2516#note_2679509
> 
> Tvrtko, you should probably prepare a branch on fdo and let the people 
> on that bug report test it.

If people are interested to test I have two branches. This qddl approach:

https://gitlab.freedesktop.org/tursulin/kernel/-/commits/drm-sched-deadline?ref_type=heads

And a very early CFS-like experiment:

https://gitlab.freedesktop.org/tursulin/kernel/-/commits/drm-sched-cfs?ref_type=heads

I would certainly be very interested to get feedback.

>>> ^ this comment for example seems to indicate that on newer Wayland
>>> versions part of the problem has vanished?
>> That's about using the Wine wayland driver (which uses the Wayland protocol directly) instead of the x11 driver (which uses the X11 protocol via Xwayland). Xwayland not being involved can avoid at least some of the issues (in particular, the scenario I described inhttps://gitlab.freedesktop.org/drm/amd/-/issues/2516#note_2119750 can't happen then). That doesn't solve the issues when Xwayland is involved though, just avoids them.
>>
> 
>  From what I remember the bug condition is that the display server 
> submits works after the client has started rendering the next frame. 
> With RR that resulted in the display servers work to be picked first 
> while with FIFO the clients works would be picked first.
> 
> Arguable that RR picked the display server was just coincident while 
> with Tvrtko's work the fair queuing should make that perfectly intentional.
> 
> So this patch set here should (in theory) help a lot with that bug 
> report and finally make RR superfluous.

Disclaimer that it is way too generous to call what I made fair. It may 
help with the above described case *if* the client has queue depth > 1, 
which would then push out its deadline a bit further out so display 
server could beat it even if submitted later. But it also may need 
refining to account for currently executing jobs as the queue-depth too 
(currently it does not). And also to perhaps base the deadline buckets 
on some dynamically adjusting runtime measure rather than hardcoding them.

Regards,

Tvrtko