[PATCH v2 1/9] drm/sched: Convert drm scheduler to use a work queue rather than kthread
Christian König
christian.koenig at amd.com
Tue Aug 22 09:51:13 UTC 2023
On 21.08.23 at 21:46, Faith Ekstrand wrote:
> On Mon, Aug 21, 2023 at 1:13 PM Christian König
> <christian.koenig at amd.com> wrote:
>
> [SNIP]
> So as long as nobody from userspace comes and says we absolutely need to
> optimize this use case, I would rather not do it.
>
>
> This is a place where nouveau's needs are legitimately different from
> AMD or Intel, I think. NVIDIA's command streamer model is very
> different from AMD and Intel. On AMD and Intel, each EXEC turns into
> a single small packet (on the order of 16B) which kicks off a command
> buffer. There may be a bit of cache management or something around it
> but that's it. From there, it's userspace's job to make one command
> buffer chain to another until it's finally done and then do a
> "return", whatever that looks like.
>
> NVIDIA's model is much more static. Each packet in the HW/FW ring is
> an address and a size and that much data is processed and then it
> grabs the next packet and processes it. The result is that, if we use
> multiple buffers of commands, there's no way to chain them together.
> We just have to pass the whole list of buffers to the kernel.
So far that is actually completely identical to what AMD has.
> A single EXEC ioctl / job may have 500 such addr+size packets
> depending on how big the command buffer is.
And that is what I don't understand. Why would you need hundreds of such
addr+size packets?
This is basically identical to what AMD has (well on newer hw there is
an extension in the CP packets to JUMP/CALL subsequent IBs, but this
isn't widely used as far as I know).
Previously the limit was something like 4, which we extended because Bas
came up with similar requirements for the AMD side from RADV.
But essentially, an approach with hundreds of IBs doesn't sound like a
good idea to me.
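Just to make sure we are talking about the same thing, this is roughly the
kind of entry both models boil down to. The layout below is made up purely
for illustration; it matches neither the real NVIDIA pushbuffer entry nor
the AMD IB/CP packet format:

#include <linux/types.h>

/* Illustrative only: one "address + size" submission entry. */
struct ring_entry {
	u64 gpu_addr;	/* GPU VA of the command buffer to execute */
	u32 length;	/* size of that buffer, e.g. in dwords */
};

/*
 * A single EXEC ioctl then carries an array of these, potentially
 * hundreds of them per Faith's description.
 */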
> It gets worse on pre-Turing hardware where we have to split the batch
> for every single DrawIndirect or DispatchIndirect.
>
> Lest you think NVIDIA is just crazy here, it's a perfectly reasonable
> model if you assume that userspace is feeding the firmware. When
> that's happening, you just have a userspace thread that sits there and
> feeds the ringbuffer with whatever is next and you can marshal as much
> data through as you want. Sure, it'd be nice to have a 2nd level batch
> thing that gets launched from the FW ring and has all the individual
> launch commands but it's not at all necessary.
>
> What does that mean from a gpu_scheduler PoV? Basically, it means a
> variable packet size.
>
> What does this mean for implementation? IDK. One option would be to
> teach the scheduler about actual job sizes. Another would be to
> virtualize it and have another layer underneath the scheduler that
> does the actual feeding of the ring. Another would be to decrease the
> job size somewhat and then have the front-end submit as many jobs as
> it needs to service userspace and only put the out-fences on the last
> job. All the options kinda suck.
Yeah, agree. The job-size approach Danilo suggested is still the least painful.
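Something like the following is what I have in mind. This is only a sketch
to illustrate the idea of teaching the scheduler about job sizes; none of
these names exist in the scheduler today:

#include <linux/atomic.h>
#include <linux/types.h>

/*
 * Hypothetical sketch: every job declares how many ring slots (credits)
 * it needs, and the scheduler only pushes it to the hardware when that
 * many slots are free.  All names here are made up for illustration.
 */
struct sized_sched {
	atomic_t credits_in_flight;	/* slots currently occupied on the ring */
	u32 credit_limit;		/* total slots the ring provides */
};

struct sized_job {
	u32 credits;	/* e.g. number of addr+size entries this job needs */
};

static bool can_run_job(struct sized_sched *sched, struct sized_job *job)
{
	/* Only pop the job from the entity when enough ring space is free. */
	return atomic_read(&sched->credits_in_flight) + job->credits <=
	       sched->credit_limit;
}

That way a submission with hundreds of addr+size entries simply eats more
of the ring budget than a small one, without the frontend having to split
it into multiple jobs.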
Christian.
>
> ~Faith