[PATCH v2 1/9] drm/sched: Convert drm scheduler to use a work queue rather than kthread
Christian König
christian.koenig at amd.com
Tue Aug 22 09:51:13 UTC 2023
On 21.08.23 at 21:46, Faith Ekstrand wrote:
> On Mon, Aug 21, 2023 at 1:13 PM Christian König
> <christian.koenig at amd.com> wrote:
>
> [SNIP]
> So as long as nobody from userspace comes and says we absolutely need to
> optimize this use case, I would rather not do it.
>
>
> This is a place where nouveau's needs are legitimately different from
> AMD or Intel, I think. NVIDIA's command streamer model is very
> different from AMD and Intel. On AMD and Intel, each EXEC turns into
> a single small packet (on the order of 16B) which kicks off a command
> buffer. There may be a bit of cache management or something around it
> but that's it. From there, it's userspace's job to make one command
> buffer chain to another until it's finally done and then do a
> "return", whatever that looks like.
>
> NVIDIA's model is much more static. Each packet in the HW/FW ring is
> an address and a size and that much data is processed and then it
> grabs the next packet and processes it. The result is that, if we use
> multiple buffers of commands, there's no way to chain them together.
> We just have to pass the whole list of buffers to the kernel.
So far that is actually completely identical to what AMD has.
> A single EXEC ioctl / job may have 500 such addr+size packets
> depending on how big the command buffer is.
And that is what I don't understand. Why would you need hundreds of such
addr+size packets?
This is basically identical to what AMD has (well on newer hw there is
an extension in the CP packets to JUMP/CALL subsequent IBs, but this
isn't widely used as far as I know).
Previously the limit was something like 4, which we extended because Bas
came up with similar requirements for the AMD side from RADV.
But essentially, an approach with hundreds of IBs doesn't sound like a
good idea to me.
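Just to make sure we are talking about the same thing, this is roughly the
kind of entry both models boil down to. The layout below is made up purely
for illustration; it matches neither the real NVIDIA pushbuffer entry nor
the AMD IB/CP packet format:

#include <linux/types.h>

/* Illustrative only: one "address + size" submission entry. */
struct ring_entry {
	u64 gpu_addr;	/* GPU VA of the command buffer to execute */
	u32 length;	/* size of that buffer, e.g. in dwords */
};

/*
 * A single EXEC ioctl then carries an array of these, potentially
 * hundreds of them per Faith's description.
 */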
> It gets worse on pre-Turing hardware where we have to split the batch
> for every single DrawIndirect or DispatchIndirect.
>
> Lest you think NVIDIA is just crazy here, it's a perfectly reasonable
> model if you assume that userspace is feeding the firmware. When
> that's happening, you just have a userspace thread that sits there and
> feeds the ringbuffer with whatever is next and you can marshal as much
> data through as you want. Sure, it'd be nice to have a 2nd level batch
> thing that gets launched from the FW ring and has all the individual
> launch commands but it's not at all necessary.
>
> What does that mean from a gpu_scheduler PoV? Basically, it means a
> variable packet size.
>
> What does this mean for implementation? IDK. One option would be to
> teach the scheduler about actual job sizes. Another would be to
> virtualize it and have another layer underneath the scheduler that
> does the actual feeding of the ring. Another would be to decrease the
> job size somewhat and then have the front-end submit as many jobs as
> it needs to service userspace and only put the out-fences on the last
> job. All the options kinda suck.
Yeah, agree. The job-size approach Danilo suggested is still the least painful.
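Something like the following is what I have in mind. This is only a sketch
to illustrate the idea of teaching the scheduler about job sizes; none of
these names exist in the scheduler today:

#include <linux/atomic.h>
#include <linux/types.h>

/*
 * Hypothetical sketch: every job declares how many ring slots (credits)
 * it needs, and the scheduler only pushes it to the hardware when that
 * many slots are free.  All names here are made up for illustration.
 */
struct sized_sched {
	atomic_t credits_in_flight;	/* slots currently occupied on the ring */
	u32 credit_limit;		/* total slots the ring provides */
};

struct sized_job {
	u32 credits;	/* e.g. number of addr+size entries this job needs */
};

static bool can_run_job(struct sized_sched *sched, struct sized_job *job)
{
	/* Only pop the job from the entity when enough ring space is free. */
	return atomic_read(&sched->credits_in_flight) + job->credits <=
	       sched->credit_limit;
}

That way a submission with hundreds of addr+size entries simply eats more
of the ring budget than a small one, without the frontend having to split
it into multiple jobs.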
Christian.
>
> ~Faith