[Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread

Wed Jan 11 09:17:01 UTC 2023

On 10/01/2023 19:01, Matthew Brost wrote:
> On Tue, Jan 10, 2023 at 04:50:55PM +0000, Tvrtko Ursulin wrote:
>>
>> On 10/01/2023 15:55, Matthew Brost wrote:
>>> On Tue, Jan 10, 2023 at 12:19:35PM +0000, Tvrtko Ursulin wrote:
>>>>
>>>> On 10/01/2023 11:28, Tvrtko Ursulin wrote:
>>>>>
>>>>>
>>>>> On 09/01/2023 17:27, Jason Ekstrand wrote:
>>>>>
>>>>> [snip]
>>>>>
>>>>>>        >>> AFAICT it proposes to have 1:1 between *userspace* created
>>>>>>       contexts (per
>>>>>>        >>> context _and_ engine) and drm_sched. I am not sure avoiding
>>>>>>       invasive changes
>>>>>>        >>> to the shared code is in the spirit of the overall idea and
>>>>>> instead
>>>>>>        >>> opportunity should be used to look at way to refactor/improve
>>>>>>       drm_sched.
>>>>>>
>>>>>>
>>>>>> Maybe?  I'm not convinced that what Xe is doing is an abuse at all
>>>>>> or really needs to drive a re-factor.  (More on that later.)
>>>>>> There's only one real issue which is that it fires off potentially a
>>>>>> lot of kthreads. Even that's not that bad given that kthreads are
>>>>>> pretty light and you're not likely to have more kthreads than
>>>>>> userspace threads which are much heavier.  Not ideal, but not the
>>>>>> end of the world either.  Definitely something we can/should
>>>>>> optimize but if we went through with Xe without this patch, it would
>>>>>> probably be mostly ok.
>>>>>>
>>>>>>        >> Yes, it is 1:1 *userspace* engines and drm_sched.
>>>>>>        >>
>>>>>>        >> I'm not really prepared to make large changes to DRM scheduler
>>>>>>       at the
>>>>>>        >> moment for Xe as they are not really required nor does Boris
>>>>>>       seem they
>>>>>>        >> will be required for his work either. I am interested to see
>>>>>>       what Boris
>>>>>>        >> comes up with.
>>>>>>        >>
>>>>>>        >>> Even on the low level, the idea to replace drm_sched threads
>>>>>>       with workers
>>>>>>        >>> has a few problems.
>>>>>>        >>>
>>>>>>        >>> To start with, the pattern of:
>>>>>>        >>>
>>>>>>        >>>    while (not_stopped) {
>>>>>>        >>>     keep picking jobs
>>>>>>        >>>    }
>>>>>>        >>>
>>>>>>        >>> Feels fundamentally in disagreement with workers (while
>>>>>>       obviously fits
>>>>>>        >>> perfectly with the current kthread design).
>>>>>>        >>
>>>>>>        >> The while loop breaks and worker exists if no jobs are ready.
>>>>>>
>>>>>>
>>>>>> I'm not very familiar with workqueues. What are you saying would fit
>>>>>> better? One scheduling job per work item rather than one big work
>>>>>> item which handles all available jobs?
>>>>>
>>>>> Yes and no, it indeed IMO does not fit to have a work item which is
>>>>> potentially unbound in runtime. But it is a bit moot conceptual mismatch
>>>>> because it is a worst case / theoretical, and I think due more
>>>>> fundamental concerns.
>>>>>
>>>>> If we have to go back to the low level side of things, I've picked this
>>>>> random spot to consolidate what I have already mentioned and perhaps
>>>>> expand.
>>>>>
>>>>> To start with, let me pull out some thoughts from workqueue.rst:
>>>>>
>>>>> """
>>>>> Generally, work items are not expected to hog a CPU and consume many
>>>>> cycles. That means maintaining just enough concurrency to prevent work
>>>>> processing from stalling should be optimal.
>>>>> """
>>>>>
>>>>> For unbound queues:
>>>>> """
>>>>> The responsibility of regulating concurrency level is on the users.
>>>>> """
>>>>>
>>>>> Given the unbound queues will be spawned on demand to service all queued
>>>>> work items (more interesting when mixing up with the system_unbound_wq),
>>>>> in the proposed design the number of instantiated worker threads does
>>>>> not correspond to the number of user threads (as you have elsewhere
>>>>> stated), but pessimistically to the number of active user contexts. That
>>>>> is the number which drives the maximum number of not-runnable jobs that
>>>>> can become runnable at once, and hence spawn that many work items, and
>>>>> in turn unbound worker threads.
>>>>>
>>>>> Several problems there.
>>>>>
>>>>> It is fundamentally pointless to have potentially that many more threads
>>>>> than the number of CPU cores - it simply creates a scheduling storm.
>>>>
>>>> To make matters worse, if I follow the code correctly, all these per user
>>>> context worker thread / work items end up contending on the same lock or
>>>> circular buffer, both are one instance per GPU:
>>>>
>>>> guc_engine_run_job
>>>>    -> submit_engine
>>>>       a) wq_item_append
>>>>           -> wq_wait_for_space
>>>>             -> msleep
>>>
>>> a) is dedicated per xe_engine
>>
>> Hah true, what its for then? I thought throttling the LRCA ring is done via:
>>
> 
> This is a per guc_id 'work queue' which is used for parallel submission
> (e.g. multiple LRC tail values need to written atomically by the GuC).
> Again in practice there should always be space.

Speaking of guc id, where does blocking when none are available happen 
in the non parallel case?

>>    drm_sched_init(&ge->sched, &drm_sched_ops,
>> 		 e->lrc[0].ring.size / MAX_JOB_SIZE_BYTES,
>>
>> Is there something more to throttle other than the ring? It is throttling
>> something using msleeps..
>>
>>> Also you missed the step of programming the ring which is dedicated per xe_engine
>>
>> I was trying to quickly find places which serialize on something in the
>> backend, ringbuffer emission did not seem to do that but maybe I missed
>> something.
>>
> 
> xe_ring_ops vfunc emit_job is called to write the ring.

Right but does it serialize between different contexts, I didn't spot 
that it does in which case it wasn't relevant to the sub story.

>>>
>>>>       b) xe_guc_ct_send
>>>>           -> guc_ct_send
>>>>             -> mutex_lock(&ct->lock);
>>>>             -> later a potential msleep in h2g_has_room
>>>
>>> Techincally there is 1 instance per GT not GPU, yes this is shared but
>>> in practice there will always be space in the CT channel so contention
>>> on the lock should be rare.
>>
>> Yeah I used the term GPU to be more understandable to outside audience.
>>
>> I am somewhat disappointed that the Xe opportunity hasn't been used to
>> improve upon the CT communication bottlenecks. I mean those backoff sleeps
>> and lock contention. I wish there would be a single thread in charge of the
>> CT channel and internal users (other parts of the driver) would be able to
>> send their requests to it in a more efficient manner, with less lock
>> contention and centralized backoff.
>>
> 
> Well the CT backend was more or less a complete rewrite. Mutexes
> actually work rather well to ensure fairness compared to the spin locks
> used in the i915. This code was pretty heavily reviewed by Daniel and
> both of us landed a big mutex for all of the CT code compared to the 3
> or 4 spin locks used in the i915.

Are the "nb" sends gone? But that aside, I wasn't meaning just the 
locking but the high level approach. Never  mind.

>>> I haven't read your rather long reply yet, but also FWIW using a
>>> workqueue has suggested by AMD (original authors of the DRM scheduler)
>>> when we ran this design by them.
>>
>> Commit message says nothing about that. ;)
>>
> 
> Yea I missed that, will fix in the next rev. Just dug through my emails
> and Christian suggested a work queue and Andrey also gave some input on
> the DRM scheduler design.
> 
> Also in the next will likely update the run_wq to be passed in by the
> user.

Yes, and IMO that may need to be non-optional.

Regards,

Tvrtko