[Intel-xe] [PATCH v2 1/9] drm/sched: Convert drm scheduler to use a work queue rather than kthread
Bas Nieuwenhuizen
bas at basnieuwenhuizen.nl
Thu Aug 24 11:50:00 UTC 2023
On Tue, Aug 22, 2023 at 6:55 PM Faith Ekstrand <faith at gfxstrand.net> wrote:
> On Tue, Aug 22, 2023 at 4:51 AM Christian König <christian.koenig at amd.com>
> wrote:
>
>> Am 21.08.23 um 21:46 schrieb Faith Ekstrand:
>>
>> On Mon, Aug 21, 2023 at 1:13 PM Christian König <christian.koenig at amd.com>
>> wrote:
>>
>>> [SNIP]
>>> So as long as nobody from userspace comes and says we absolutely need to
>>> optimize this use case I would rather not do it.
>>>
>>
>> This is a place where nouveau's needs are legitimately different from AMD
>> or Intel, I think. NVIDIA's command streamer model is very different from
>> AMD and Intel. On AMD and Intel, each EXEC turns into a single small
>> packet (on the order of 16B) which kicks off a command buffer. There may
>> be a bit of cache management or something around it but that's it. From
>> there, it's userspace's job to make one command buffer chain to another
>> until it's finally done and then do a "return", whatever that looks like.
>>
>> NVIDIA's model is much more static. Each packet in the HW/FW ring is an
>> address and a size; the hardware processes that much data, then grabs the
>> next packet and processes it. The result is that, if we use multiple
>> buffers of commands, there's no way to chain them together. We just have
>> to pass the whole list of buffers to the kernel.
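
For readers less familiar with that model, here is a rough sketch of what
such a ring ends up holding (names and layout are purely illustrative, not
the real hardware encoding):

#include <stddef.h>
#include <stdint.h>

/* Illustrative only: each ring entry just points at one buffer of
 * commands.  There is no CALL/JUMP between buffers, so N command
 * buffers always consume N ring entries. */
struct fw_ring_entry {
        uint64_t gpu_addr;      /* start of the command buffer */
        uint32_t size_dw;       /* length of the buffer in dwords */
};

/* The whole list has to be handed over at submit time; the firmware
 * then walks it entry by entry. */
static size_t build_submit(struct fw_ring_entry *ring,
                           const uint64_t *addrs, const uint32_t *sizes,
                           size_t count)
{
        for (size_t i = 0; i < count; i++) {
                ring[i].gpu_addr = addrs[i];
                ring[i].size_dw  = sizes[i];
        }
        return count;   /* one ring entry consumed per command buffer */
}

Every extra command buffer therefore costs another ring entry, which is why
the submission size varies so much.
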
>>
>>
>> So far that is actually completely identical to what AMD has.
>>
>> A single EXEC ioctl / job may have 500 such addr+size packets depending
>> on how big the command buffer is.
>>
>>
>> And that is what I don't understand. Why would you need hundreds of such
>> addr+size packets?
>>
>
> Well, we're not really in control of it. We can control our base pushbuf
> size, and that's something we can tune, but we're still limited by the
> client. We have to submit another pushbuf whenever:
>
> 1. We run out of space (power-of-two growth is also possible, but the size
> is limited to a maximum of about 4MiB due to hardware limitations).
> 2. The client calls a secondary command buffer.
> 3. Any usage of indirect draw or dispatch on pre-Turing hardware.
>
> At some point we need to tune our BO size a bit to avoid (1) while also
> avoiding piles of tiny BOs. However, (2) and (3) are out of our control.
>
>> This is basically identical to what AMD has (well, on newer hw there is an
>> extension in the CP packets to JUMP/CALL subsequent IBs, but this isn't
>> widely used as far as I know).
>>
>
> According to Bas, RADV chains on recent hardware.
>
Well:
1) on GFX6 and older we can't chain at all,
2) on compute/DMA queues we can't chain at all,
3) with VK_COMMAND_BUFFER_USAGE_SIMULTANEOUS_USE_BIT we can't chain between
   command buffers, and
4) for some secondary command buffer use cases we can't chain either.
So we still have to do the "submit multiple" dance in many cases, roughly as
sketched below.
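
To make those constraints concrete, the per-command-buffer decision ends up
looking roughly like this (simplified and with made-up names, not the actual
RADV code):

#include <stdbool.h>

/* Simplified, hypothetical view of the chaining decision. */
struct cmdbuf_info {
        bool simultaneous_use;   /* VK_COMMAND_BUFFER_USAGE_SIMULTANEOUS_USE_BIT */
        bool secondary_no_chain; /* secondary cases where chaining isn't possible */
};

static bool can_chain(const struct cmdbuf_info *cs,
                      bool chaining_supported, /* false on GFX6 and older */
                      bool is_gfx_queue)       /* false on compute/DMA queues */
{
        if (!chaining_supported)
                return false;
        if (!is_gfx_queue)
                return false;
        if (cs->simultaneous_use)
                return false;
        if (cs->secondary_no_chain)
                return false;
        return true;
}

Everything that can't be chained has to become another IB in the same
submission, which is where the need for large IB counts comes from.
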
>
>
>> Previously the limit was something like 4, which we extended because Bas
>> came up with similar requirements for the AMD side from RADV.
>>
>> But essentially those approaches with hundreds of IBs don't sound like
>> a good idea to me.
>>
>
> No one's arguing that they like it. Again, the hardware isn't designed to
> have a kernel in the way. It's designed to be fed by userspace. But we're
> going to have the kernel in the middle for a while so we need to make it
> not suck too bad.
>
> ~Faith
>
>> It gets worse on pre-Turing hardware, where we have to split the batch for
>> every single DrawIndirect or DispatchIndirect.
>>
>> Lest you think NVIDIA is just crazy here, it's a perfectly reasonable
>> model if you assume that userspace is feeding the firmware. When that's
>> happening, you just have a userspace thread that sits there and feeds the
>> ring buffer with whatever is next, and you can marshal as much data
>> through as you want. Sure, it'd be nice to have a second-level batch that
>> gets launched from the FW ring and has all the individual launch commands,
>> but it's not at all necessary.
>>
>> What does that mean from a gpu_scheduler PoV? Basically, it means a
>> variable packet size.
>>
>> What does this mean for implementation? IDK. One option would be to
>> teach the scheduler about actual job sizes. Another would be to virtualize
>> it and have another layer underneath the scheduler that does the actual
>> feeding of the ring. Another would be to decrease the job size somewhat and
>> then have the front-end submit as many jobs as it needs to service
>> userspace and only put the out-fences on the last job. All the options
>> kinda suck.
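
For the "teach the scheduler about actual job sizes" option, the rough idea
(a sketch with hypothetical names, not an existing drm_sched interface) would
be to account ring space in variable-size credits instead of counting every
job as a single slot:

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-ring accounting: jobs consume a variable number of
 * credits (ring slots) instead of always counting as one. */
struct ring_budget {
        uint32_t total;         /* slots the HW/FW ring can hold */
        uint32_t in_flight;     /* slots consumed by submitted jobs */
};

struct sized_job {
        uint32_t credits;       /* e.g. number of addr+size packets */
};

static bool can_submit(const struct ring_budget *rb, const struct sized_job *job)
{
        return rb->total - rb->in_flight >= job->credits;
}

static void job_submitted(struct ring_budget *rb, const struct sized_job *job)
{
        rb->in_flight += job->credits;
}

static void job_completed(struct ring_budget *rb, const struct sized_job *job)
{
        rb->in_flight -= job->credits;
}

The scheduler would then only pop a job from an entity once can_submit() says
there is room, which is roughly what the "job size" idea amounts to as I
understand it.
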
>>
>>
>> Yeah, agree. The job size Danilo suggested is still the least painful.
>>
>> Christian.
>>
>>
>> ~Faith
>>
>>
>>