[Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread
Matthew Brost
matthew.brost at intel.com
Tue Jan 10 15:55:34 UTC 2023
On Tue, Jan 10, 2023 at 12:19:35PM +0000, Tvrtko Ursulin wrote:
>
> On 10/01/2023 11:28, Tvrtko Ursulin wrote:
> >
> >
> > On 09/01/2023 17:27, Jason Ekstrand wrote:
> >
> > [snip]
> >
> > > >>> AFAICT it proposes to have 1:1 between *userspace* created
> > > >>> contexts (per context _and_ engine) and drm_sched. I am not sure
> > > >>> avoiding invasive changes to the shared code is in the spirit of
> > > >>> the overall idea, and instead the opportunity should be used to
> > > >>> look at ways to refactor/improve drm_sched.
> > >
> > >
> > > Maybe? I'm not convinced that what Xe is doing is an abuse at all
> > > or really needs to drive a re-factor. (More on that later.)
> > > There's only one real issue which is that it fires off potentially a
> > > lot of kthreads. Even that's not that bad given that kthreads are
> > > pretty light and you're not likely to have more kthreads than
> > > userspace threads which are much heavier. Not ideal, but not the
> > > end of the world either. Definitely something we can/should
> > > optimize but if we went through with Xe without this patch, it would
> > > probably be mostly ok.
> > >
> > > >> Yes, it is 1:1 *userspace* engines and drm_sched.
> > > >>
> > > >> I'm not really prepared to make large changes to the DRM scheduler
> > > >> at the moment for Xe, as they are not really required, nor does
> > > >> Boris seem to think they will be required for his work either. I am
> > > >> interested to see what Boris comes up with.
> > > >>
> > > >>> Even on the low level, the idea to replace drm_sched threads with
> > > >>> workers has a few problems.
> > > >>>
> > > >>> To start with, the pattern of:
> > > >>>
> > > >>>    while (not_stopped) {
> > > >>>        keep picking jobs
> > > >>>    }
> > > >>>
> > > >>> feels fundamentally in disagreement with workers (while it
> > > >>> obviously fits perfectly with the current kthread design).
> > > >>
> > > >> The while loop breaks and the worker exits if no jobs are ready.
> > >
> > >
> > > I'm not very familiar with workqueues. What are you saying would fit
> > > better? One scheduling job per work item rather than one big work
> > > item which handles all available jobs?
> >
> > Yes and no. It indeed, IMO, does not fit to have a work item whose
> > runtime is potentially unbounded. But that conceptual mismatch is a bit
> > moot, since it is a worst-case / theoretical concern; I think the more
> > fundamental problems lie elsewhere.
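To make the two models being contrasted concrete, they look roughly like this (kernel-style pseudocode only; the function names are illustrative and not taken from the patch):

```
/* kthread model: one long-lived thread per scheduler, looping forever */
while (!kthread_should_stop()) {
	/* sleep until a job becomes ready */
	job = pick_job();
	run(job);
}

/* work item model: the handler drains whatever is ready and returns;
 * it is re-queued only when new jobs become ready */
static void sched_work_fn(struct work_struct *w)
{
	while ((job = pick_job()))
		run(job);
}
```

The second shape is what the patch effectively moves to: the "loop forever" becomes "drain and return", with re-queueing replacing the wait.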
> >
> > If we have to go back to the low level side of things, I've picked this
> > random spot to consolidate what I have already mentioned and perhaps
> > expand.
> >
> > To start with, let me pull out some thoughts from workqueue.rst:
> >
> > """
> > Generally, work items are not expected to hog a CPU and consume many
> > cycles. That means maintaining just enough concurrency to prevent work
> > processing from stalling should be optimal.
> > """
> >
> > For unbound queues:
> > """
> > The responsibility of regulating concurrency level is on the users.
> > """
> >
> > Given that unbound queues spawn worker threads on demand to service all
> > queued work items (more interesting when mixed with the
> > system_unbound_wq), in the proposed design the number of instantiated
> > worker threads corresponds not to the number of user threads (as you
> > have stated elsewhere), but pessimistically to the number of active user
> > contexts. That is the number which drives the maximum number of
> > not-runnable jobs that can become runnable at once, and hence how many
> > work items, and in turn unbound worker threads, can be spawned.
> >
> > Several problems there.
> >
> > It is fundamentally pointless to have potentially that many more threads
> > than the number of CPU cores - it simply creates a scheduling storm.
>
> To make matters worse, if I follow the code correctly, all these per user
> context worker thread / work items end up contending on the same lock or
> circular buffer, both are one instance per GPU:
>
> guc_engine_run_job
> -> submit_engine
> a) wq_item_append
> -> wq_wait_for_space
> -> msleep
a) is dedicated per xe_engine.
Also, you missed the step of programming the ring, which is likewise
dedicated per xe_engine.
> b) xe_guc_ct_send
> -> guc_ct_send
> -> mutex_lock(&ct->lock);
> -> later a potential msleep in h2g_has_room
Technically there is 1 instance per GT, not GPU. Yes, this is shared, but
in practice there will always be space in the CT channel, so contention
on the lock should be rare.
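To summarize which structures in the path above are shared (illustrative pseudocode, not the actual Xe code):

```
guc_engine_run_job(job)
    wq_item_append(engine->wq)    /* wq + ring: one per xe_engine */
    /* program the ring */        /* also one per xe_engine */
    xe_guc_ct_send(gt->ct, msg)
        mutex_lock(&ct->lock)     /* CT channel: one per GT */
```

So the only cross-context serialization point is the per-GT CT lock.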
I haven't read your rather long reply yet, but also FWIW using a
workqueue was suggested by AMD (the original authors of the DRM
scheduler) when we ran this design by them.
Matt
>
> Regards,
>
> Tvrtko