[Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread
Tvrtko Ursulin
tvrtko.ursulin at linux.intel.com
Tue Jan 10 16:50:55 UTC 2023
On 10/01/2023 15:55, Matthew Brost wrote:
> On Tue, Jan 10, 2023 at 12:19:35PM +0000, Tvrtko Ursulin wrote:
>>
>> On 10/01/2023 11:28, Tvrtko Ursulin wrote:
>>>
>>>
>>> On 09/01/2023 17:27, Jason Ekstrand wrote:
>>>
>>> [snip]
>>>
>>>> >>> AFAICT it proposes to have 1:1 between *userspace* created
>>>> >>> contexts (per context _and_ engine) and drm_sched. I am not sure
>>>> >>> avoiding invasive changes to the shared code is in the spirit of
>>>> >>> the overall idea and instead the opportunity should be used to
>>>> >>> look at ways to refactor/improve drm_sched.
>>>>
>>>>
>>>> Maybe? I'm not convinced that what Xe is doing is an abuse at all
>>>> or really needs to drive a re-factor. (More on that later.)
>>>> There's only one real issue, which is that it potentially fires off
>>>> a lot of kthreads. Even that's not so bad given that kthreads are
>>>> pretty light and you're not likely to have more kthreads than
>>>> userspace threads, which are much heavier. Not ideal, but not the
>>>> end of the world either. Definitely something we can/should
>>>> optimize, but if we went through with Xe without this patch, it
>>>> would probably be mostly ok.
>>>>
>>>> >> Yes, it is 1:1 *userspace* engines and drm_sched.
>>>> >>
>>>> >> I'm not really prepared to make large changes to the DRM
>>>> >> scheduler at the moment for Xe as they are not really required,
>>>> >> nor does Boris seem to think they will be required for his work
>>>> >> either. I am interested to see what Boris comes up with.
>>>> >>
>>>> >>> Even on the low level, the idea to replace drm_sched threads
>>>> >>> with workers has a few problems.
>>>> >>>
>>>> >>> To start with, the pattern of:
>>>> >>>
>>>> >>>    while (not_stopped) {
>>>> >>>        keep picking jobs
>>>> >>>    }
>>>> >>>
>>>> >>> feels fundamentally in disagreement with workers (while it
>>>> >>> obviously fits perfectly with the current kthread design).
>>>> >>
>>>> >> The while loop breaks and the worker exits if no jobs are ready.
>>>>
>>>>
>>>> I'm not very familiar with workqueues. What are you saying would fit
>>>> better? One scheduling job per work item rather than one big work
>>>> item which handles all available jobs?
>>>
>>> Yes and no. It indeed IMO does not fit to have a work item which is
>>> potentially unbounded in runtime. But that conceptual mismatch is a
>>> bit moot, because it is a worst case / theoretical concern, and I
>>> think the real problems are more fundamental.
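(To illustrate what I mean by one job per work item - a minimal sketch
with invented names, not Xe or drm_sched code: each run of the work
item picks a single job and re-queues itself while more are pending,
instead of looping until the queue drains.)

   #include <linux/workqueue.h>

   struct example_sched {
           struct workqueue_struct *wq;
           struct work_struct work;
           /* job queue elided */
   };

   static void example_run_job_work(struct work_struct *w)
   {
           struct example_sched *s =
                   container_of(w, struct example_sched, work);
           struct example_job *job;

           job = example_pick_job(s);      /* assumed helper */
           if (!job)
                   return;

           example_run_job(s, job);        /* assumed helper */

           /* Yield the worker between jobs rather than hogging it. */
           if (example_jobs_pending(s))    /* assumed helper */
                   queue_work(s->wq, &s->work);
   }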
>>>
>>> If we have to go back to the low level side of things, I've picked this
>>> random spot to consolidate what I have already mentioned and perhaps
>>> expand.
>>>
>>> To start with, let me pull out some thoughts from workqueue.rst:
>>>
>>> """
>>> Generally, work items are not expected to hog a CPU and consume many
>>> cycles. That means maintaining just enough concurrency to prevent work
>>> processing from stalling should be optimal.
>>> """
>>>
>>> For unbound queues:
>>> """
>>> The responsibility of regulating concurrency level is on the users.
>>> """
>>>
>>> Given that unbound worker threads are spawned on demand to service
>>> all queued work items (more interesting when mixed in with the
>>> system_unbound_wq), in the proposed design the number of
>>> instantiated worker threads does not correspond to the number of
>>> user threads (as you have stated elsewhere), but pessimistically to
>>> the number of active user contexts. That is the number which drives
>>> the maximum number of not-runnable jobs that can become runnable at
>>> once, and hence spawn that many work items, and in turn unbound
>>> worker threads.
>>>
>>> Several problems there.
>>>
>>> It is fundamentally pointless to have potentially that many more threads
>>> than the number of CPU cores - it simply creates a scheduling storm.
>>
>> To make matters worse, if I follow the code correctly, all these
>> per-user-context worker threads / work items end up contending on the
>> same lock or circular buffer, both of which are one instance per GPU:
>>
>> guc_engine_run_job
>> -> submit_engine
>> a) wq_item_append
>> -> wq_wait_for_space
>> -> msleep
>
> a) is dedicated per xe_engine
Hah, true. What is it for then? I thought throttling the LRCA ring is
done via:

   drm_sched_init(&ge->sched, &drm_sched_ops,
                  e->lrc[0].ring.size / MAX_JOB_SIZE_BYTES,

Is there something other than the ring to throttle? It is throttling
something using msleeps...
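(For reference, that third argument is drm_sched's hw_submission
limit. As I read the upstream scheduler - paraphrased below, not
quoted verbatim - job picking is gated on it like so:)

   /* Paraphrase of the scheduler's readiness check: jobs are only
    * picked while the in-flight count is below the limit passed to
    * drm_sched_init().
    */
   static bool example_sched_ready(struct drm_gpu_scheduler *sched)
   {
           return atomic_read(&sched->hw_rq_count) <
                  sched->hw_submission_limit;
   }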
> Also you missed the step of programming the ring, which is dedicated per xe_engine
I was trying to quickly find places which serialize on something in the
backend; ringbuffer emission did not seem to do that, but maybe I missed
something.
>
>> b) xe_guc_ct_send
>> -> guc_ct_send
>> -> mutex_lock(&ct->lock);
>> -> later a potential msleep in h2g_has_room
>
> Technically there is 1 instance per GT, not GPU. Yes, this is shared, but
> in practice there will always be space in the CT channel so contention
> on the lock should be rare.
Yeah, I used the term GPU to be more understandable to an outside audience.
I am somewhat disappointed that the Xe opportunity hasn't been used to
improve upon the CT communication bottlenecks - I mean those backoff
sleeps and the lock contention. I wish there were a single thread in
charge of the CT channel, with internal users (other parts of the
driver) able to send their requests to it in a more efficient manner,
with less lock contention and centralized backoff.
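Concretely, I am thinking of something along these lines - a rough
sketch with all names invented, not Xe code: producers append to a
lockless list from any context, and a single sender thread owns the
channel, so the has-room backoff lives in exactly one place:

   #include <linux/kthread.h>
   #include <linux/llist.h>
   #include <linux/sched.h>

   struct example_ct {
           struct llist_head pending;      /* requests from producers */
           struct task_struct *sender;     /* sole owner of the channel */
   };

   struct example_ct_request {
           struct llist_node node;
           /* message payload would live here */
   };

   static void example_ct_submit(struct example_ct *ct,
                                 struct example_ct_request *req)
   {
           /* Contention-free from any context: no mutex, no msleep. */
           llist_add(&req->node, &ct->pending);
           wake_up_process(ct->sender);
   }

   static int example_ct_sender_fn(void *data)
   {
           struct example_ct *ct = data;

           while (!kthread_should_stop()) {
                   struct llist_node *reqs = llist_del_all(&ct->pending);

                   /* Sole writer: waiting for space and emitting needs
                    * no lock, and backoff is centralized here.
                    */
                   if (reqs)
                           example_ct_send_all(ct, reqs);  /* assumed */

                   set_current_state(TASK_INTERRUPTIBLE);
                   if (llist_empty(&ct->pending))
                           schedule();
                   __set_current_state(TASK_RUNNING);
           }
           return 0;
   }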
> I haven't read your rather long reply yet, but also FWIW using a
> workqueue was suggested by AMD (original authors of the DRM scheduler)
> when we ran this design by them.
Commit message says nothing about that. ;)
Regards,
Tvrtko