<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Aug 22, 2023 at 6:55 PM Faith Ekstrand <<a href="mailto:faith@gfxstrand.net">faith@gfxstrand.net</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Aug 22, 2023 at 4:51 AM Christian König <<a href="mailto:christian.koenig@amd.com" target="_blank">christian.koenig@amd.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
Am 21.08.23 um 21:46 schrieb Faith Ekstrand:<br>
<blockquote type="cite">
<div dir="ltr">
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Mon, Aug 21, 2023 at
1:13 PM Christian König <<a href="mailto:christian.koenig@amd.com" target="_blank">christian.koenig@amd.com</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">[SNIP]<br>
So as long as nobody from userspace comes and says we
absolutely need to <br>
optimize this use case I would rather not do it.<br>
</blockquote>
<div><br>
</div>
<div>This is a place where nouveau's needs are legitimately
different from AMD or Intel, I think. NVIDIA's command
streamer model is very different from AMD and Intel. On AMD
and Intel, each EXEC turns into a single small packet (on
the order of 16B) which kicks off a command buffer. There
may be a bit of cache management or something around it but
that's it. From there, it's userspace's job to make one
command buffer chain to another until it's finally done and
then do a "return", whatever that looks like. </div>
<div><br>
</div>
<div>NVIDIA's model is much more static. Each packet in the
HW/FW ring is an address and a size; that much data is
processed, then the hardware grabs the next packet and
processes it. The result is that, if we use multiple buffers
of commands, there's no way to chain them together. We just
have to pass the whole list of buffers to the kernel.</div>
</div>
</div>
</blockquote>
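To make the contrast concrete, here is a rough sketch with entirely made-up packet encodings and names (this is not the real AMD, Intel, or NVIDIA layout): in the chaining model, userspace patches a small jump into the tail of one command buffer, while in the addr+size model every buffer must become its own ring entry.

```c
#include <stdint.h>

/* AMD/Intel-style chaining, sketched with a hypothetical opcode:
 * userspace writes a jump packet into the tail of the previous
 * command buffer so the HW continues into the next one.  The kernel
 * only ever sees the first buffer of the chain. */
#define OP_JUMP 0x01u

static void chain_cmdbufs(uint32_t *prev_tail, uint64_t next_gpu_addr)
{
    prev_tail[0] = OP_JUMP;                         /* hypothetical opcode */
    prev_tail[1] = (uint32_t)next_gpu_addr;         /* address, low bits   */
    prev_tail[2] = (uint32_t)(next_gpu_addr >> 32); /* address, high bits  */
}

/* NVIDIA-style model: there is no jump.  Each ring packet is just an
 * (address, size) pair, so every command buffer has to be handed to
 * whoever feeds the ring -- here, the kernel. */
struct ring_entry {
    uint64_t gpu_addr; /* start of the command buffer */
    uint32_t length;   /* bytes of commands to process */
};
```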
<br>
So far that is actually completely identical to what AMD has.<br>
<br>
<blockquote type="cite">
<div dir="ltr">
<div class="gmail_quote">
<div>A single EXEC ioctl / job may have 500 such addr+size
packets depending on how big the command buffer is.</div>
</div>
</div>
</blockquote>
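A hypothetical sketch of what such a submission looks like from the uAPI side; struct and field names are invented for illustration and are not the real nouveau interface:

```c
#include <stdint.h>

/* One addr+size entry; a single EXEC may carry hundreds of these. */
struct exec_push {
    uint64_t gpu_addr;
    uint32_t size;
};

/* The ioctl argument just points at the array of entries. */
struct exec_args {
    uint64_t pushes;     /* userspace pointer to struct exec_push[] */
    uint32_t push_count; /* e.g. ~500 for a large command buffer    */
};

/* Total bytes of commands carried by one submission. */
static uint64_t exec_total_bytes(const struct exec_push *pushes,
                                 uint32_t count)
{
    uint64_t total = 0;
    for (uint32_t i = 0; i < count; i++)
        total += pushes[i].size;
    return total;
}
```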
<br>
And that is what I don't understand. Why would you need hundreds of
such addr+size packets?<br></div></blockquote><div><br></div><div>Well, we're not really in control of it. We can control our base pushbuf size and that's something we can tune but we're still limited by the client. We have to submit another pushbuf whenever:</div><div><br></div><div> 1. We run out of space (power-of-two growth is also possible but the size is limited to a maximum of about 4MiB due to hardware limitations.)<br></div><div> 2. The client calls a secondary command buffer.</div><div> 3. Any usage of indirect draw or dispatch on pre-Turing hardware.</div><div><br></div><div>At some point we need to tune our BO size a bit to avoid (1) while also avoiding piles of tiny BOs. However, (2) and (3) are out of our control.<br></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div>
This is basically identical to what AMD has (well on newer hw there
is an extension in the CP packets to JUMP/CALL subsequent IBs, but
this isn't widely used as far as I know).<br></div></blockquote><div><br></div><div>According to Bas, RADV chains on recent hardware.<br></div></div></div></blockquote><div> </div><div>well:</div><div><br></div><div>1) on GFX6 and older we can't chain at all<br></div><div>2) on Compute/DMA we can't chain at all<br></div><div>3) with <code>VK_COMMAND_BUFFER_USAGE_SIMULTANEOUS_USE_BIT</code> we can't chain between cmdbuffers<br></div><div>4) for some secondary use cases we can't chain.</div><div><br></div><div>so we have to do the "submit multiple" dance in many cases.<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div class="gmail_quote"><div></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div>
Previously the limit was something like 4, which we extended because
Bas came up with similar requirements for the AMD side from RADV.<br>
<br>
But essentially an approach with hundreds of IBs doesn't sound
like a good idea to me.<br></div></blockquote><div><br></div><div>No one's arguing that they like it. Again, the hardware isn't designed to have a kernel in the way. It's designed to be fed by userspace. But we're going to have the kernel in the middle for a while so we need to make it not suck too bad.<br></div><div><br></div><div>~Faith<br></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div>
<blockquote type="cite">
<div dir="ltr">
<div class="gmail_quote">
<div>It gets worse on pre-Turing hardware where we have to
split the batch for every single DrawIndirect or
DispatchIndirect.</div>
<div><br>
</div>
<div>Lest you think NVIDIA is just crazy here, it's a
perfectly reasonable model if you assume that userspace is
feeding the firmware. When that's happening, you just have
a userspace thread that sits there and feeds the ringbuffer
with whatever is next and you can marshal as much data
through as you want. Sure, it'd be nice to have a 2nd level
batch thing that gets launched from the FW ring and has all
the individual launch commands but it's not at all
necessary.</div>
<div><br>
</div>
<div>What does that mean from a gpu_scheduler PoV? Basically,
it means a variable packet size.<br>
</div>
<div><br>
</div>
<div>What does this mean for implementation? IDK. One option
would be to teach the scheduler about actual job sizes.
Another would be to virtualize it and have another layer
underneath the scheduler that does the actual feeding of the
ring. Another would be to decrease the job size somewhat and
then have the front-end submit as many jobs as it needs to
service userspace and only put the out-fences on the last
job. All the options kinda suck.</div>
</div>
</div>
</blockquote>
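The first of those options, teaching the scheduler about actual job sizes, could be sketched roughly as below. All names are invented for illustration; this is not drm_sched code. Instead of assuming every job occupies one ring slot, each job declares how many slots it needs and submission is gated on actual free space:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical ring accounting: total slots vs. slots held by
 * in-flight jobs. */
struct ring_state {
    uint32_t size; /* total slots in the HW/FW ring */
    uint32_t used; /* slots occupied by in-flight jobs */
};

/* A job whose EXEC carried N addr+size packets needs N slots. */
struct job {
    uint32_t num_packets;
};

/* Only push a job when the ring really has room for all its packets. */
static bool can_submit(const struct ring_state *ring, const struct job *job)
{
    return ring->size - ring->used >= job->num_packets;
}

static void submit(struct ring_state *ring, const struct job *job)
{
    ring->used += job->num_packets; /* released again on job completion */
}
```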
<br>
Yeah, agree. The job size Danilo suggested is still the least
painful.<br>
<br>
Christian.<br>
<br>
<blockquote type="cite">
<div dir="ltr">
<div class="gmail_quote">
<div><br>
</div>
<div>~Faith<br>
</div>
</div>
</div>
</blockquote>
<br>
</div>
</blockquote></div></div>
</blockquote></div></div>