[Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread

John Harrison john.c.harrison at intel.com
Wed Jan 11 18:52:54 UTC 2023


On 1/11/2023 10:07, Matthew Brost wrote:
> On Wed, Jan 11, 2023 at 09:17:01AM +0000, Tvrtko Ursulin wrote:
>> On 10/01/2023 19:01, Matthew Brost wrote:
>>> On Tue, Jan 10, 2023 at 04:50:55PM +0000, Tvrtko Ursulin wrote:
>>>> On 10/01/2023 15:55, Matthew Brost wrote:
>>>>> On Tue, Jan 10, 2023 at 12:19:35PM +0000, Tvrtko Ursulin wrote:
>>>>>> On 10/01/2023 11:28, Tvrtko Ursulin wrote:
>>>>>>> On 09/01/2023 17:27, Jason Ekstrand wrote:
>>>>>>>
>>>>>>> [snip]
>>>>>>>
>>>>>>>>         >>> AFAICT it proposes to have 1:1 between *userspace* created
>>>>>>>>         >>> contexts (per context _and_ engine) and drm_sched. I am not
>>>>>>>>         >>> sure avoiding invasive changes to the shared code is in the
>>>>>>>>         >>> spirit of the overall idea; instead, the opportunity should
>>>>>>>>         >>> be used to look at ways to refactor/improve drm_sched.
>>>>>>>>
>>>>>>>>
>>>>>>>> Maybe?  I'm not convinced that what Xe is doing is an abuse at all
>>>>>>>> or really needs to drive a re-factor.  (More on that later.)
>>>>>>>> There's only one real issue which is that it fires off potentially a
>>>>>>>> lot of kthreads. Even that's not that bad given that kthreads are
>>>>>>>> pretty light and you're not likely to have more kthreads than
>>>>>>>> userspace threads which are much heavier.  Not ideal, but not the
>>>>>>>> end of the world either.  Definitely something we can/should
>>>>>>>> optimize but if we went through with Xe without this patch, it would
>>>>>>>> probably be mostly ok.
>>>>>>>>
>>>>>>>>         >> Yes, it is 1:1 *userspace* engines and drm_sched.
>>>>>>>>         >>
>>>>>>>>         >> I'm not really prepared to make large changes to DRM scheduler
>>>>>>>>         >> at the moment for Xe as they are not really required nor does
>>>>>>>>         >> Boris seem to think they will be required for his work either.
>>>>>>>>         >> I am interested to see what Boris comes up with.
>>>>>>>>         >>
>>>>>>>>         >>> Even on the low level, the idea to replace drm_sched threads
>>>>>>>>         >>> with workers has a few problems.
>>>>>>>>         >>>
>>>>>>>>         >>> To start with, the pattern of:
>>>>>>>>         >>>
>>>>>>>>         >>>    while (not_stopped) {
>>>>>>>>         >>>     keep picking jobs
>>>>>>>>         >>>    }
>>>>>>>>         >>>
>>>>>>>>         >>> Feels fundamentally in disagreement with workers (while
>>>>>>>>         >>> obviously fits perfectly with the current kthread design).
>>>>>>>>         >>
>>>>>>>>         >> The while loop breaks and worker exits if no jobs are ready.
>>>>>>>>
>>>>>>>>
>>>>>>>> I'm not very familiar with workqueues. What are you saying would fit
>>>>>>>> better? One scheduling job per work item rather than one big work
>>>>>>>> item which handles all available jobs?
>>>>>>> Yes and no, it indeed IMO does not fit to have a work item which is
>>>>>>> potentially unbounded in runtime. But that conceptual mismatch is a bit
>>>>>>> moot because it is a worst case / theoretical, and I think there are
>>>>>>> more fundamental concerns.
>>>>>>>
>>>>>>> If we have to go back to the low level side of things, I've picked this
>>>>>>> random spot to consolidate what I have already mentioned and perhaps
>>>>>>> expand.
>>>>>>>
>>>>>>> To start with, let me pull out some thoughts from workqueue.rst:
>>>>>>>
>>>>>>> """
>>>>>>> Generally, work items are not expected to hog a CPU and consume many
>>>>>>> cycles. That means maintaining just enough concurrency to prevent work
>>>>>>> processing from stalling should be optimal.
>>>>>>> """
>>>>>>>
>>>>>>> For unbound queues:
>>>>>>> """
>>>>>>> The responsibility of regulating concurrency level is on the users.
>>>>>>> """
>>>>>>>
>>>>>>> Given the unbound queues will be spawned on demand to service all queued
>>>>>>> work items (more interesting when mixing up with the system_unbound_wq),
>>>>>>> in the proposed design the number of instantiated worker threads does
>>>>>>> not correspond to the number of user threads (as you have elsewhere
>>>>>>> stated), but pessimistically to the number of active user contexts. That
>>>>>>> is the number which drives the maximum number of not-runnable jobs that
>>>>>>> can become runnable at once, and hence spawn that many work items, and
>>>>>>> in turn unbound worker threads.
>>>>>>>
>>>>>>> Several problems there.
>>>>>>>
>>>>>>> It is fundamentally pointless to have potentially that many more threads
>>>>>>> than the number of CPU cores - it simply creates a scheduling storm.
>>>>>> To make matters worse, if I follow the code correctly, all these per user
>>>>>> context worker thread / work items end up contending on the same lock or
>>>>>> circular buffer, both are one instance per GPU:
>>>>>>
>>>>>> guc_engine_run_job
>>>>>>     -> submit_engine
>>>>>>        a) wq_item_append
>>>>>>            -> wq_wait_for_space
>>>>>>              -> msleep
>>>>> a) is dedicated per xe_engine
>>>> Hah true, what is it for then? I thought throttling the LRCA ring is done via:
>>>>
>>> This is a per guc_id 'work queue' which is used for parallel submission
>>> (e.g. multiple LRC tail values need to be written atomically by the GuC).
>>> Again in practice there should always be space.
>> Speaking of guc id, where does blocking when none are available happen in
>> the non parallel case?
>>
> We have 64k guc_ids on native, 1k guc_ids with 64k VFs. Either way we
> think that is more than enough and can just reject xe_engine creation if
> we run out of guc_ids. If this proves to be false, we can fix this but the
> guc_id stealing in the i915 is rather complicated and hopefully not needed.
>
> We will limit the number of guc_ids allowed per user pid to a reasonable
> number to prevent a DoS. Elevated pids (e.g. IGTs) will be able to do
> whatever they want.
What about doorbells? At some point, we will have to start using those
and they are a much more limited resource - 256 total and way less with VFs.

John.
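
To make the resource-limit policy discussed above concrete, here is a
minimal sketch in C of a fixed-size ID pool that rejects allocation when
it is exhausted and caps unprivileged clients. It is illustrative only:
the names (id_pool, id_pool_alloc, id_pool_free), the per-client cap of
4 and the client table size are assumptions, not taken from the Xe or
i915 code. The pool size of 256 merely echoes the doorbell count
mentioned above.

```c
#include <assert.h>
#include <stdbool.h>

#define POOL_SIZE      256 /* illustrative: total IDs in the pool */
#define PER_CLIENT_MAX 4   /* hypothetical cap for unprivileged clients */
#define MAX_CLIENTS    16  /* hypothetical size of the client table */

struct id_pool {
	bool used[POOL_SIZE];        /* true if the ID is handed out */
	int  owner[POOL_SIZE];       /* client that holds each ID */
	int  per_client[MAX_CLIENTS]; /* IDs currently held per client */
};

/*
 * Hand out the lowest free ID, or return -1 if the pool is exhausted
 * or an unprivileged client has reached its cap. There is no blocking
 * and no stealing: the caller is expected to fail the creation request.
 */
static int id_pool_alloc(struct id_pool *p, int client, bool privileged)
{
	if (client < 0 || client >= MAX_CLIENTS)
		return -1;
	if (!privileged && p->per_client[client] >= PER_CLIENT_MAX)
		return -1;

	for (int i = 0; i < POOL_SIZE; i++) {
		if (!p->used[i]) {
			p->used[i] = true;
			p->owner[i] = client;
			p->per_client[client]++;
			return i;
		}
	}
	return -1; /* pool exhausted */
}

/* Return an ID to the pool and credit it back to its owner. */
static void id_pool_free(struct id_pool *p, int id)
{
	assert(id >= 0 && id < POOL_SIZE && p->used[id]);
	p->used[id] = false;
	p->per_client[p->owner[id]]--;
}
```

With a zero-initialised pool, an unprivileged client gets IDs until it
hits the cap and is then refused, while a privileged caller is limited
only by the pool size. The smaller the pool, the sooner the "just
reject" path is reached, which is the concern with a 256-entry resource.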


