[Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread

Thu Jan 12 10:11:03 UTC 2023

On Thu, 12 Jan 2023 10:32:18 +0100
Daniel Vetter <daniel at ffwll.ch> wrote:

> On Thu, Jan 12, 2023 at 10:10:53AM +0100, Boris Brezillon wrote:
> > Hi Daniel,
> > 
> > On Wed, 11 Jan 2023 22:47:02 +0100
> > Daniel Vetter <daniel at ffwll.ch> wrote:
> >   
> > > On Tue, 10 Jan 2023 at 09:46, Boris Brezillon
> > > <boris.brezillon at collabora.com> wrote:  
> > > >
> > > > Hi Daniel,
> > > >
> > > > On Mon, 9 Jan 2023 21:40:21 +0100
> > > > Daniel Vetter <daniel at ffwll.ch> wrote:
> > > >    
> > > > > On Mon, Jan 09, 2023 at 06:17:48PM +0100, Boris Brezillon wrote:    
> > > > > > Hi Jason,
> > > > > >
> > > > > > On Mon, 9 Jan 2023 09:45:09 -0600
> > > > > > Jason Ekstrand <jason at jlekstrand.net> wrote:
> > > > > >    
> > > > > > > On Thu, Jan 5, 2023 at 1:40 PM Matthew Brost <matthew.brost at intel.com>
> > > > > > > wrote:
> > > > > > >    
> > > > > > > > On Mon, Jan 02, 2023 at 08:30:19AM +0100, Boris Brezillon wrote:    
> > > > > > > > > On Fri, 30 Dec 2022 12:55:08 +0100
> > > > > > > > > Boris Brezillon <boris.brezillon at collabora.com> wrote:
> > > > > > > > >    
> > > > > > > > > > On Fri, 30 Dec 2022 11:20:42 +0100
> > > > > > > > > > Boris Brezillon <boris.brezillon at collabora.com> wrote:
> > > > > > > > > >    
> > > > > > > > > > > Hello Matthew,
> > > > > > > > > > >
> > > > > > > > > > > On Thu, 22 Dec 2022 14:21:11 -0800
> > > > > > > > > > > Matthew Brost <matthew.brost at intel.com> wrote:
> > > > > > > > > > >    
> > > > > > > > > > > > In XE, the new Intel GPU driver, a choice has made to have a 1 to 1
> > > > > > > > > > > > mapping between a drm_gpu_scheduler and drm_sched_entity. At first    
> > > > > > > > this    
> > > > > > > > > > > > seems a bit odd but let us explain the reasoning below.
> > > > > > > > > > > >
> > > > > > > > > > > > 1. In XE the submission order from multiple drm_sched_entity is not
> > > > > > > > > > > > guaranteed to be the same completion even if targeting the same    
> > > > > > > > hardware    
> > > > > > > > > > > > engine. This is because in XE we have a firmware scheduler, the    
> > > > > > > > GuC,    
> > > > > > > > > > > > which allowed to reorder, timeslice, and preempt submissions. If a    
> > > > > > > > using    
> > > > > > > > > > > > shared drm_gpu_scheduler across multiple drm_sched_entity, the TDR    
> > > > > > > > falls    
> > > > > > > > > > > > apart as the TDR expects submission order == completion order.    
> > > > > > > > Using a    
> > > > > > > > > > > > dedicated drm_gpu_scheduler per drm_sched_entity solve this    
> > > > > > > > problem.    
> > > > > > > > > > >
> > > > > > > > > > > Oh, that's interesting. I've been trying to solve the same sort of
> > > > > > > > > > > issues to support Arm's new Mali GPU which is relying on a    
> > > > > > > > FW-assisted    
> > > > > > > > > > > scheduling scheme (you give the FW N streams to execute, and it does
> > > > > > > > > > > the scheduling between those N command streams, the kernel driver
> > > > > > > > > > > does timeslice scheduling to update the command streams passed to the
> > > > > > > > > > > FW). I must admit I gave up on using drm_sched at some point, mostly
> > > > > > > > > > > because the integration with drm_sched was painful, but also because    
> > > > > > > > I    
> > > > > > > > > > > felt trying to bend drm_sched to make it interact with a
> > > > > > > > > > > timeslice-oriented scheduling model wasn't really future proof.    
> > > > > > > > Giving    
> > > > > > > > > > > drm_sched_entity exlusive access to a drm_gpu_scheduler probably    
> > > > > > > > might    
> > > > > > > > > > > help for a few things (didn't think it through yet), but I feel it's
> > > > > > > > > > > coming short on other aspects we have to deal with on Arm GPUs.    
> > > > > > > > > >
> > > > > > > > > > Ok, so I just had a quick look at the Xe driver and how it
> > > > > > > > > > instantiates the drm_sched_entity and drm_gpu_scheduler, and I think I
> > > > > > > > > > have a better understanding of how you get away with using drm_sched
> > > > > > > > > > while still controlling how scheduling is really done. Here
> > > > > > > > > > drm_gpu_scheduler is just a dummy abstract that let's you use the
> > > > > > > > > > drm_sched job queuing/dep/tracking mechanism. The whole run-queue    
> > > > > > > >
> > > > > > > > You nailed it here, we use the DRM scheduler for queuing jobs,
> > > > > > > > dependency tracking and releasing jobs to be scheduled when dependencies
> > > > > > > > are met, and lastly a tracking mechanism of inflights jobs that need to
> > > > > > > > be cleaned up if an error occurs. It doesn't actually do any scheduling
> > > > > > > > aside from the most basic level of not overflowing the submission ring
> > > > > > > > buffer. In this sense, a 1 to 1 relationship between entity and
> > > > > > > > scheduler fits quite well.
> > > > > > > >    
> > > > > > >
> > > > > > > Yeah, I think there's an annoying difference between what AMD/NVIDIA/Intel
> > > > > > > want here and what you need for Arm thanks to the number of FW queues
> > > > > > > available. I don't remember the exact number of GuC queues but it's at
> > > > > > > least 1k. This puts it in an entirely different class from what you have on
> > > > > > > Mali. Roughly, there's about three categories here:
> > > > > > >
> > > > > > >  1. Hardware where the kernel is placing jobs on actual HW rings. This is
> > > > > > > old Mali, Intel Haswell and earlier, and probably a bunch of others.
> > > > > > > (Intel BDW+ with execlists is a weird case that doesn't fit in this
> > > > > > > categorization.)
> > > > > > >
> > > > > > >  2. Hardware (or firmware) with a very limited number of queues where
> > > > > > > you're going to have to juggle in the kernel in order to run desktop Linux.
> > > > > > >
> > > > > > >  3. Firmware scheduling with a high queue count. In this case, you don't
> > > > > > > want the kernel scheduling anything. Just throw it at the firmware and let
> > > > > > > it go brrrrr.  If we ever run out of queues (unlikely), the kernel can
> > > > > > > temporarily pause some low-priority contexts and do some juggling or,
> > > > > > > frankly, just fail userspace queue creation and tell the user to close some
> > > > > > > windows.
> > > > > > >
> > > > > > > The existence of this 2nd class is a bit annoying but it's where we are. I
> > > > > > > think it's worth recognizing that Xe and panfrost are in different places
> > > > > > > here and will require different designs. For Xe, we really are just using
> > > > > > > drm/scheduler as a front-end and the firmware does all the real scheduling.
> > > > > > >
> > > > > > > How do we deal with class 2? That's an interesting question.  We may
> > > > > > > eventually want to break that off into a separate discussion and not litter
> > > > > > > the Xe thread but let's keep going here for a bit.  I think there are some
> > > > > > > pretty reasonable solutions but they're going to look a bit different.
> > > > > > >
> > > > > > > The way I did this for Xe with execlists was to keep the 1:1:1 mapping
> > > > > > > between drm_gpu_scheduler, drm_sched_entity, and userspace xe_engine.
> > > > > > > Instead of feeding a GuC ring, though, it would feed a fixed-size execlist
> > > > > > > ring and then there was a tiny kernel which operated entirely in IRQ
> > > > > > > handlers which juggled those execlists by smashing HW registers.  For
> > > > > > > Panfrost, I think we want something slightly different but can borrow some
> > > > > > > ideas here.  In particular, have the schedulers feed kernel-side SW queues
> > > > > > > (they can even be fixed-size if that helps) and then have a kthread which
> > > > > > > juggles those feeds the limited FW queues.  In the case where you have few
> > > > > > > enough active contexts to fit them all in FW, I do think it's best to have
> > > > > > > them all active in FW and let it schedule. But with only 31, you need to be
> > > > > > > able to juggle if you run out.    
> > > > > >
> > > > > > That's more or less what I do right now, except I don't use the
> > > > > > drm_sched front-end to handle deps or queue jobs (at least not yet). The
> > > > > > kernel-side timeslice-based scheduler juggling with runnable queues
> > > > > > (queues with pending jobs that are not yet resident on a FW slot)
> > > > > > uses a dedicated ordered-workqueue instead of a thread, with scheduler
> > > > > > ticks being handled with a delayed-work (tick happening every X
> > > > > > milliseconds when queues are waiting for a slot). It all seems very
> > > > > > HW/FW-specific though, and I think it's a bit premature to try to
> > > > > > generalize that part, but the dep-tracking logic implemented by
> > > > > > drm_sched looked like something I could easily re-use, hence my
> > > > > > interest in Xe's approach.    
> > > > >
> > > > > So another option for these few fw queue slots schedulers would be to
> > > > > treat them as vram and enlist ttm.
> > > > >
> > > > > Well maybe more enlist ttm and less treat them like vram, but ttm can
> > > > > handle idr (or xarray or whatever you want) and then help you with all the
> > > > > pipelining (and the drm_sched then with sorting out dependencies). If you
> > > > > then also preferentially "evict" low-priority queus you pretty much have
> > > > > the perfect thing.
> > > > >
> > > > > Note that GuC with sriov splits up the id space and together with some
> > > > > restrictions due to multi-engine contexts media needs might also need this
> > > > > all.
> > > > >
> > > > > If you're balking at the idea of enlisting ttm just for fw queue
> > > > > management, amdgpu has a shoddy version of id allocation for their vm/tlb
> > > > > index allocation. Might be worth it to instead lift that into some sched
> > > > > helper code.    
> > > >
> > > > Would you mind pointing me to the amdgpu code you're mentioning here?
> > > > Still have a hard time seeing what TTM has to do with scheduling, but I
> > > > also don't know much about TTM, so I'll keep digging.    
> > > 
> > > ttm is about moving stuff in&out of a limited space and gives you some
> > > nice tooling for pipelining it all. It doesn't care whether that space
> > > is vram or some limited id space. vmwgfx used ttm as an id manager
> > > iirc.  
> > 
> > Ok.
> >   
> > >   
> > > > > Either way there's two imo rather solid approaches available to sort this
> > > > > out. And once you have that, then there shouldn't be any big difference in
> > > > > driver design between fw with defacto unlimited queue ids, and those with
> > > > > severe restrictions in number of queues.    
> > > >
> > > > Honestly, I don't think there's much difference between those two cases
> > > > already. There's just a bunch of additional code to schedule queues on
> > > > FW slots for the limited-number-of-FW-slots case, which, right now, is
> > > > driver specific. The job queuing front-end pretty much achieves what
> > > > drm_sched does already: queuing job to entities, checking deps,
> > > > submitting job to HW (in our case, writing to the command stream ring
> > > > buffer). Things start to differ after that point: once a scheduling
> > > > entity has pending jobs, we add it to one of the runnable queues (one
> > > > queue per prio) and kick the kernel-side timeslice-based scheduler to
> > > > re-evaluate, if needed.
> > > >
> > > > I'm all for using generic code when it makes sense, even if that means
> > > > adding this common code when it doesn't exists, but I don't want to be
> > > > dragged into some major refactoring that might take years to land.
> > > > Especially if pancsf is the first
> > > > FW-assisted-scheduler-with-few-FW-slot driver.    
> > > 
> > > I don't see where there's a major refactoring that you're getting dragged into?  
> > 
> > Oh, no, I'm not saying this is the case just yet, just wanted to make
> > sure we're on the same page :-).
> >   
> > > 
> > > Yes there's a huge sprawling discussion right now, but I think that's
> > > just largely people getting confused.  
> > 
> > I definitely am :-).
> >   
> > > 
> > > Wrt the actual id assignment stuff, in amdgpu at least it's few lines
> > > of code. See the amdgpu_vmid_grab stuff for the simplest starting
> > > point.  
> > 
> > Ok, thanks for the pointers. I'll have a look and see how I could use
> > that. I guess that's about getting access to the FW slots with some
> > sort of priority+FIFO ordering guarantees given by TTM. If that's the
> > case, I'll have to think about it, because that's a major shift from
> > what we're doing now, and I'm afraid this could lead to starving
> > non-resident entities if all resident entities keep receiving new jobs
> > to execute. Unless we put some sort of barrier when giving access to a
> > slot, so we evict the entity when it's done executing the stuff it had
> > when it was given access to this slot. But then, again, there are other
> > constraints to take into account for the Arm Mali CSF case:
> > 
> > - it's more efficient to update all FW slots at once, because each
> >   update of a slot might require updating priorities of the other slots
> >   (FW mandates unique slot priorities, and those priorities depend on
> >   the entity priority/queue-ordering)
> > - context/FW slot switches have a non-negligible cost (FW needs to
> >   suspend the context and save the state every time there such a
> >   switch), so, limiting the number of FW slot updates might prove
> >   important  
> 
> I frankly think you're overworrying. When you have 31+ contexts running at
> the same time, you have bigger problems. At that point there's two
> use-cases:
> 1. system is overloaded, the user will reach for reset button anyway
> 2. temporary situation, all you have to do is be roughly fair enough to get
>    through it before case 1 happens.
>  
> Trying to write a perfect scheduler for this before we have actual
> benchmarks that justify the effort seems like pretty serious overkill.
> That's why I think the simplest solution is the one we should have:
> 
> - drm/sched frontend. If you get into slot exhaustion that alone will
>   ensure enough fairness

We're talking about the CS ring buffer slots here, right?

> 
> - LRU list of slots, with dma_fence so you can pipeline/batch up changes
>   as needed (but I honestly wouldn't worry about the batching before
>   you've shown an actual need for this in some benchmark/workload, even
>   piglit shouldn't have this many things running concurrently I think, you
>   don't have that many cpu cores). Between drm/sched and the lru you will
>   have an emergent scheduler that cycles through all runnable gpu jobs.
> 
> - If you want to go fancy, have eviction tricks like skipping currently
>   still active gpu context with higher priority than the one that you need
>   to find a slot for.
> 
> - You don't need time slicing in this, not even for compute. compute is
>   done with preempt context fences, if you give them a minimum scheduling
>   quanta you'll have a very basic round robin scheduler as an emergent
>   thing.
> 
> Any workload were it matters will be scheduled by the fw directly, with
> drm/sched only being the dma_fence dependcy sorter. My take is that if you
> spend more than a hundred or so lines with slot allocation logic
> (excluding the hw code to load/unload a slot) you're probably doing some
> serious overengineering.

Let me see if I got this right:

- we still keep a 1:1 drm_gpu_scheduler:drm_sched_entity approach,
  where hw_submission_limit == available_slots_in_ring_buf
- when ->run_job() is called, we write the RUN_JOB() instruction
  sequence to the next available ringbuf slot and queue the entity to
  the FW-slot queue
  * if a slot is directly available, we program the slot directly
  * if no slots are available, but some slots are done with the jobs
    they were given (last job fence signaled), we evict the LRU entity
    (possibly taking priority into account) and use this slot for the
    new entity
  * if no slots are available and all currently assigned slots
    contain busy entities, we queue the entity to a pending list
    (possibly one list per prio)

I'll need to make sure this still works with the concept of group (it's
not a single queue we schedule, it's a group of queues, meaning that we
have N fences to watch to determine if the slot is busy or not, but
that should be okay).