[Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread

Tvrtko Ursulin tvrtko.ursulin at linux.intel.com
Tue Jan 10 12:19:35 UTC 2023


On 10/01/2023 11:28, Tvrtko Ursulin wrote:
> 
> 
> On 09/01/2023 17:27, Jason Ekstrand wrote:
> 
> [snip]
> 
>>      >>> AFAICT it proposes to have 1:1 between *userspace* created
>>      >>> contexts (per context _and_ engine) and drm_sched. I am not
>>      >>> sure avoiding invasive changes to the shared code is in the
>>      >>> spirit of the overall idea and instead the opportunity should
>>      >>> be used to look at ways to refactor/improve drm_sched.
>>
>>
>> Maybe?  I'm not convinced that what Xe is doing is an abuse at all or 
>> really needs to drive a refactor.  (More on that later.)  There's 
>> only one real issue, which is that it fires off potentially a lot of 
>> kthreads.  Even that's not that bad given that kthreads are pretty 
>> light and you're not likely to have more kthreads than userspace 
>> threads, which are much heavier.  Not ideal, but not the end of the 
>> world either.  Definitely something we can/should optimize, but if we 
>> went through with Xe without this patch it would probably be mostly OK.
>>
>>      >> Yes, it is 1:1 between *userspace* engines and drm_sched.
>>      >>
>>      >> I'm not really prepared to make large changes to the DRM
>>      >> scheduler at the moment for Xe as they are not really required,
>>      >> nor does it seem they will be required for Boris's work either.
>>      >> I am interested to see what Boris comes up with.
>>      >>
>>      >>> Even on the low level, the idea to replace drm_sched threads
>>      >>> with workers has a few problems.
>>      >>>
>>      >>> To start with, the pattern of:
>>      >>>
>>      >>>    while (not_stopped) {
>>      >>>     keep picking jobs
>>      >>>    }
>>      >>>
>>      >>> feels fundamentally in disagreement with workers (while it
>>      >>> obviously fits perfectly with the current kthread design).
>>      >>
>>      >> The while loop breaks and the worker exits if no jobs are ready.
>>
>>
>> I'm not very familiar with workqueues. What are you saying would fit 
>> better? One scheduling job per work item rather than one big work item 
>> which handles all available jobs?
> 
> Yes and no. It indeed, IMO, does not fit to have a work item which is 
> potentially unbounded in runtime. But that conceptual mismatch is a bit 
> moot, both because it is a worst case / theoretical issue and because, 
> I think, there are more fundamental concerns.
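> 
> For illustration only, a one-job-per-work-item scheme could look 
> roughly like the sketch below. The helper names and the struct layout 
> (pick_one_job(), run_one_job(), jobs_pending(), struct sched) are made 
> up for the example; the point is only that each work invocation stays 
> short and re-queues itself instead of draining everything in a loop:
> 
>   static void sched_submit_work(struct work_struct *w)
>   {
>           struct sched *s = container_of(w, struct sched, submit_work);
>           struct job *job;
> 
>           job = pick_one_job(s);          /* made-up helper */
>           if (!job)
>                   return;
> 
>           run_one_job(s, job);            /* made-up helper */
> 
>           /* Re-queue instead of looping, so each work item stays short. */
>           if (jobs_pending(s))            /* made-up helper */
>                   queue_work(s->submit_wq, &s->submit_work);
>   }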
> 
> If we have to go back to the low-level side of things, I've picked this 
> random spot to consolidate what I have already mentioned and perhaps 
> expand on it.
> 
> To start with, let me pull out some thoughts from workqueue.rst:
> 
> """
> Generally, work items are not expected to hog a CPU and consume many 
> cycles. That means maintaining just enough concurrency to prevent work 
> processing from stalling should be optimal.
> """
> 
> For unbound queues:
> """
> The responsibility of regulating concurrency level is on the users.
> """
> 
> Given that unbound worker threads will be spawned on demand to service 
> all queued work items (it gets more interesting when mixing in the 
> system_unbound_wq), in the proposed design the number of instantiated 
> worker threads does not correspond to the number of userspace threads 
> (as you have stated elsewhere), but pessimistically to the number of 
> active user contexts. That is the number which drives the maximum 
> number of not-runnable jobs that can become runnable at once, and hence 
> spawn that many work items, and in turn unbound worker threads.
> 
> Several problems there.
> 
> It is fundamentally pointless to have potentially that many more threads 
> than the number of CPU cores - it simply creates a scheduling storm.
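> 
> Not to prescribe the fix, but for scale: the knob workqueue.rst is 
> talking about would be something like a single driver-owned queue with 
> a bounded max_active, instead of effectively unlimited work items on 
> system_unbound_wq. The queue name and the init function below are made 
> up for the sketch:
> 
>   static struct workqueue_struct *submit_wq;
> 
>   static int sched_wq_init(void)
>   {
>           /* Cap concurrency at roughly the CPU count instead of letting
>            * every active context spawn its own unbound worker. */
>           submit_wq = alloc_workqueue("drm-sched-submit", WQ_UNBOUND,
>                                       num_online_cpus());
>           return submit_wq ? 0 : -ENOMEM;
>   }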

To make matters worse, if I follow the code correctly, all these 
per-user-context worker threads / work items end up contending on the 
same lock or circular buffer, both of which are one instance per GPU:

guc_engine_run_job
  -> submit_engine
     a) wq_item_append
         -> wq_wait_for_space
           -> msleep
     b) xe_guc_ct_send
         -> guc_ct_send
           -> mutex_lock(&ct->lock);
           -> later a potential msleep in h2g_has_room
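
In other words the pattern is roughly the below (the function name, the 
ct_from_work() accessor and the body are illustrative only, paraphrasing 
the trace above rather than quoting the actual Xe code):

  static void submit_worker(struct work_struct *w)
  {
          struct xe_guc_ct *ct = ct_from_work(w); /* made-up accessor */

          /* Every per-context work item funnels into one lock and one
           * circular buffer per GPU, so the extra unbound workers mostly
           * just queue up behind each other here (and may even msleep
           * waiting for ring space). */
          mutex_lock(&ct->lock);
          /* ... write the H2G message into the shared ring ... */
          mutex_unlock(&ct->lock);
  }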

Regards,

Tvrtko

