[Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread

Tvrtko Ursulin tvrtko.ursulin at linux.intel.com
Wed Jan 11 09:09:45 UTC 2023


On 11/01/2023 01:13, Matthew Brost wrote:
> On Tue, Jan 10, 2023 at 04:39:00PM +0000, Matthew Brost wrote:
>> On Tue, Jan 10, 2023 at 11:28:08AM +0000, Tvrtko Ursulin wrote:
>>>
>>>
>>> On 09/01/2023 17:27, Jason Ekstrand wrote:
>>>
>>> [snip]
>>>
>>>>       >>> AFAICT it proposes to have 1:1 between *userspace* created
>>>>      contexts (per
>>>>       >>> context _and_ engine) and drm_sched. I am not sure avoiding
>>>>      invasive changes
>>>>       >>> to the shared code is in the spirit of the overall idea and instead
>>>>       >>> opportunity should be used to look at way to refactor/improve
>>>>      drm_sched.
>>>>
>>>>
>>>> Maybe?  I'm not convinced that what Xe is doing is an abuse at all or
>>>> really needs to drive a re-factor.  (More on that later.)  There's only
>>>> one real issue which is that it fires off potentially a lot of kthreads.
>>>> Even that's not that bad given that kthreads are pretty light and you're
>>>> not likely to have more kthreads than userspace threads which are much
>>>> heavier.  Not ideal, but not the end of the world either.  Definitely
>>>> something we can/should optimize but if we went through with Xe without
>>>> this patch, it would probably be mostly ok.
>>>>
>>>>       >> Yes, it is 1:1 *userspace* engines and drm_sched.
>>>>       >>
>>>>       >> I'm not really prepared to make large changes to DRM scheduler
>>>>      at the
>>>>       >> moment for Xe as they are not really required nor does Boris
>>>>      seem they
>>>>       >> will be required for his work either. I am interested to see
>>>>      what Boris
>>>>       >> comes up with.
>>>>       >>
>>>>       >>> Even on the low level, the idea to replace drm_sched threads
>>>>      with workers
>>>>       >>> has a few problems.
>>>>       >>>
>>>>       >>> To start with, the pattern of:
>>>>       >>>
>>>>       >>>    while (not_stopped) {
>>>>       >>>     keep picking jobs
>>>>       >>>    }
>>>>       >>>
>>>>       >>> Feels fundamentally in disagreement with workers (while it
>>>>      obviously fits
>>>>       >>> perfectly with the current kthread design).
>>>>       >>
>>>>       >> The while loop breaks and the worker exits if no jobs are ready.
>>>>
>>>>
>>>> I'm not very familiar with workqueues. What are you saying would fit
>>>> better? One scheduling job per work item rather than one big work item
>>>> which handles all available jobs?
>>>
>>> Yes and no. IMO it indeed does not fit to have a work item which is
>>> potentially unbounded in runtime. But the conceptual mismatch is a bit
>>> moot because it is a worst case / theoretical concern, and I think there
>>> are more fundamental issues.
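To make the distinction concrete, a rough sketch of the two shapes (not
drm_sched code - submit_ctx, struct job, pick_next_job, push_to_hw and
more_jobs_pending below are all made up for illustration):

#include <linux/workqueue.h>

/* Hypothetical submission context, not an actual drm_sched structure. */
struct submit_ctx {
        struct workqueue_struct *wq;
        struct work_struct work;
        /* job list, locks, etc. elided */
};

/* Hypothetical helpers, declared only so the sketch hangs together. */
struct job;
struct job *pick_next_job(struct submit_ctx *ctx);
void push_to_hw(struct submit_ctx *ctx, struct job *job);
bool more_jobs_pending(struct submit_ctx *ctx);

/*
 * Shape A (roughly what the RFC does): a single work item drains every
 * ready job, so its runtime is bounded only by how much work is queued.
 */
static void submit_all_fn(struct work_struct *w)
{
        struct submit_ctx *ctx = container_of(w, struct submit_ctx, work);
        struct job *job;

        while ((job = pick_next_job(ctx)))      /* potentially unbounded */
                push_to_hw(ctx, job);
}

/*
 * Shape B (closer to what workqueue.rst expects): each invocation handles
 * one job and re-queues itself if more work remains, keeping individual
 * work items short.
 */
static void submit_one_fn(struct work_struct *w)
{
        struct submit_ctx *ctx = container_of(w, struct submit_ctx, work);
        struct job *job = pick_next_job(ctx);

        if (!job)
                return;

        push_to_hw(ctx, job);

        if (more_jobs_pending(ctx))
                queue_work(ctx->wq, &ctx->work);
}

As said, the conceptual side is not my main concern, but that is the
difference in shape I was referring to.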
>>>
>>> If we have to go back to the low level side of things, I've picked this
>>> random spot to consolidate what I have already mentioned and perhaps expand.
>>>
>>> To start with, let me pull out some thoughts from workqueue.rst:
>>>
>>> """
>>> Generally, work items are not expected to hog a CPU and consume many cycles.
>>> That means maintaining just enough concurrency to prevent work processing
>>> from stalling should be optimal.
>>> """
>>>
>>> For unbound queues:
>>> """
>>> The responsibility of regulating concurrency level is on the users.
>>> """
>>>
>>> Given that unbound workers are spawned on demand to service all queued
>>> work items (more interesting when mixed with the system_unbound_wq), in
>>> the proposed design the number of instantiated worker threads does not
>>> correspond to the number of user threads (as you have stated elsewhere),
>>> but pessimistically to the number of active user contexts. That is the
>>> number which drives the maximum number of not-runnable jobs that can
>>> become runnable at once, hence spawn that many work items, and in turn
>>> that many unbound worker threads.
>>>
>>> Several problems there.
>>>
>>> It is fundamentally pointless to have potentially that many more threads
>>> than the number of CPU cores - it simply creates a scheduling storm.
>>>
>>
>> We can use a different work queue if this is an issue; there is a FIXME
>> which indicates we should allow the user to pass in the work queue.
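For what it is worth, if that FIXME gets implemented I would imagine the
driver side looking roughly like the below. Sketch only - whether the
scheduler grows a workqueue parameter is exactly the open question, and
xe_submit_wq / xe_submit_wq_init are names made up for the example:

#include <linux/errno.h>
#include <linux/workqueue.h>

static struct workqueue_struct *xe_submit_wq;

static int xe_submit_wq_init(void)
{
        /*
         * A dedicated queue with max_active capped at a small number,
         * so a burst of contexts becoming runnable at once cannot spawn
         * an unbounded number of kworkers. The cap of 4 is arbitrary,
         * purely for the sake of the example.
         */
        xe_submit_wq = alloc_workqueue("xe-submit", WQ_UNBOUND, 4);
        return xe_submit_wq ? 0 : -ENOMEM;
}

Each scheduler instance would then queue its submission work onto that
queue instead of onto system_wq / system_unbound_wq, which at least puts
the concurrency level back under the driver's control, per the
workqueue.rst quote above.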
>>
>>> Unbound workers have no CPU / cache locality either, and no connection
>>> with the CPU scheduler to optimize scheduling patterns. This may matter
>>> on both large and small systems. Whereas the current design allows the
>>> CPU scheduler to notice that a userspace thread keeps waking up the same
>>> drm scheduler kernel thread, and so keep them on the same CPU, unbound
>>> workers lose that ability, and so a second CPU might get woken up from
>>> deep sleep for every submission.
>>>
>>
>> I guess I don't understand kthread vs. workqueue scheduling internals.
>>   
> 
> Looked into this and we are not using unbound workers; rather we are just
> using the system_wq, which is indeed bound. Again, we can change this so a
> user can just pass in a workqueue too. After doing a bit of research, bound
> workers allow the scheduler to use locality to avoid the exact problem
> you're describing.
> 
> TL;DR I'm not buying any of these arguments although it is possible I am
> missing something.

Well, you told me it's using unbound - see message id
Y7dEjcuc1arHBTGu at DUT025-TGLU.fm.intel.com:

"""
Right now the system_unbound_wq is used which does have a limit on the
number of threads, right? I do have a FIXME to allow a worker to be
passed in similar to TDR.
"""

With bound workers you will indeed get CPU locality. I am not sure what
it will do in terms of concurrency. If it serializes work items onto
fewer spawned workers, that will be good for the CT contention issue,
but it may negatively affect latency. It may also affect preemption /
time slicing decisions, since the order of submission to the backend
will not follow context priority - a high priority job may be submitted
right after a low priority one and immediately trigger preemption.
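To make that trade-off concrete, the two extremes would look roughly like
this (sketch only, the function and queue names are made up):

#include <linux/errno.h>
#include <linux/workqueue.h>

static struct workqueue_struct *submit_wq_bound;
static struct workqueue_struct *submit_wq_ordered;

static int example_submit_wqs_init(void)
{
        /*
         * Bound queue, default max_active (0 means default): work items
         * get CPU locality and the workqueue core manages per-CPU
         * concurrency, but items from many contexts can still run
         * concurrently and hit the CT lock together.
         */
        submit_wq_bound = alloc_workqueue("xe-submit", 0, 0);

        /*
         * Ordered queue: at most one work item runs at a time, in
         * queueing order. That serialisation helps the CT contention
         * issue, but hurts latency under load and means a high priority
         * submission queued behind a low priority one is also handed to
         * the backend behind it.
         */
        submit_wq_ordered = alloc_ordered_workqueue("xe-submit-ordered", 0);

        return (submit_wq_bound && submit_wq_ordered) ? 0 : -ENOMEM;
}

Which of those behaviours, or something in between via max_active, the
design actually wants is I think the open question.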

Anyway, since you are not buying any arguments on paper, perhaps you are
more open to testing. If you adapted gem_wsim for Xe you would be able
to spawn N simulated transcode sessions on any Gen11+ machine and try it
out.

For example:

gem_wsim -w benchmarks/wsim/media_load_balance_fhd26u7.wsim -c 36 -r 600

That will run 36 parallel transcoding sessions for 600 frames each. No
client setup is needed whatsoever apart from compiling IGT.

In the past that was quite a handy tool to identify scheduling issues,
or to validate changes against. All workloads with the media prefix have
actually been hand crafted by looking at what real media pipelines do
with real data - a few years back, at least.

It could show you the real world behaviour of the kworker approach, and
it could also enable you to cross reference any power and performance
changes relative to i915. The background story there is that media
servers like to fit N streams onto a server, and if a change comes along
which suddenly makes only N-1 streams fit before dropping out of
realtime, that's a big problem.

If you believe there is value in that kind of testing, I am happy to
help you add Xe support to the tool - time permitting, so possibly
guidance only at the moment.

Regards,

Tvrtko

