[PATCH v4 04/40] drm/sched: Add enqueue credit limit

Rob Clark robdclark at gmail.com
Tue May 20 16:07:05 UTC 2025


On Tue, May 20, 2025 at 12:06 AM Danilo Krummrich <dakr at kernel.org> wrote:
>
> On Thu, May 15, 2025 at 12:56:38PM -0700, Rob Clark wrote:
> > On Thu, May 15, 2025 at 11:56 AM Danilo Krummrich <dakr at kernel.org> wrote:
> > >
> > > On Thu, May 15, 2025 at 10:40:15AM -0700, Rob Clark wrote:
> > > > On Thu, May 15, 2025 at 10:30 AM Danilo Krummrich <dakr at kernel.org> wrote:
> > > > >
> > > > > (Cc: Boris)
> > > > >
> > > > > On Thu, May 15, 2025 at 12:22:18PM -0400, Connor Abbott wrote:
> > > > > > For some context, other drivers have the concept of a "synchronous"
> > > > > > VM_BIND ioctl which completes immediately, and drivers implement it by
> > > > > > waiting for the whole thing to finish before returning.
> > > > >
> > > > > Nouveau implements sync by issuing a normal async VM_BIND and subsequently
> > > > > waits for the out-fence synchronously.
> > > >
> > > > As Connor mentioned, we'd prefer it to be async rather than blocking,
> > > > in normal cases, otherwise with drm native context for using native
> > > > UMD in guest VM, you'd be blocking the single host/VMM virglrender
> > > > thread.
> > > >
> > > > The key is we want to keep it async in the normal cases, and not have
> > > > weird edge case CTS tests blow up from being _too_ async ;-)
> > >
> > > I really wonder why they don't blow up in Nouveau, which also supports full
> > > asynchronous VM_BIND. Mind sharing which tests blow up? :)
> >
> > Maybe it was dEQP-VK.sparse_resources.buffer.ssbo.sparse_residency.buffer_size_2_24,
>
> The test above is part of the smoke testing I do for nouveau, but I haven't seen
> such issues yet for nouveau.

nouveau is probably not using async binds for everything?  Or maybe
I'm just pointing to the wrong test.

> > but I might be mixing that up, I'd have to back out this patch and see
> > where things blow up, which would take many hours.
>
> Well, you said that you never had this issue with "real" workloads, but only
> with VK CTS, so I really think we should know what we are trying to fix here.
>
> We can't just add new generic infrastructure without reasonable and *well
> understood* justification.

What is not well understood about this?  We need to pre-allocate
memory for pagetables that we likely won't end up needing.

In the worst case (a large number of async PAGE_SIZE binds), you end
up needing to pre-allocate three pgtable pages (with a 4-level
pagetable) per single page of mapping.  Queue up enough of those and
your memory usage explodes.
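
To put rough numbers on that (illustrative arithmetic only, assuming
4 KiB pages and that each single-page bind has to pre-allocate all
three non-top pagetable levels):

/* Illustrative arithmetic only, not driver code: worst-case pagetable
 * pre-allocation for N queued single-page async binds, when each bind
 * may need to allocate the three non-top levels of a 4-level pgtable.
 */
#include <stdio.h>

int main(void)
{
        const unsigned long page_size = 4096;
        const unsigned long prealloc_levels = 3; /* 4-lvl walk, top level already exists */
        const unsigned long queued_binds = 5000; /* ballpark of the >5k jobs mentioned below */

        unsigned long mapped = queued_binds * page_size;
        unsigned long prealloc = queued_binds * prealloc_levels * page_size;

        printf("mapped:     %lu KiB\n", mapped / 1024);   /* ~19.5 MiB actually mapped */
        printf("prealloc'd: %lu KiB\n", prealloc / 1024); /* ~58.6 MiB tied up in pgtables */

        return 0;
}

So roughly 3x the mapped size ends up pinned in pre-allocated pgtable
memory before the jobs even get a chance to run.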

> > There definitely was one where I was seeing >5k VM_BIND jobs pile up,
> > so absolutely throttling like this is needed.
>
> I still don't understand why the kernel must throttle this? If userspace uses
> async VM_BIND, it obviously can't spam the kernel infinitely without running
> into an OOM case.

It is a valid question whether the kernel or userspace should be the
one to do the throttling.

I went for doing it in the kernel because the kernel has better
knowledge of how much it needs to pre-allocate.

(There is also the side point that this pre-allocated memory is not
charged to the calling process from a memory-accounting PoV.  With
that in mind, it seems like a good idea for the kernel to throttle
memory usage.)

> But let's assume we agree that we want to avoid that userspace can ever OOM itself
> through async VM_BIND, then the proposed solution seems wrong:
>
> Do we really want the driver developer to set an arbitrary boundary of a number
> of jobs that can be submitted before *async* VM_BIND blocks and becomes
> semi-sync?
>
> How do we choose this number of jobs? A very small number to be safe, which
> scales badly on powerful machines? A large number that scales well on powerful
> machines, but OOMs on weaker ones?

The way I am using it in msm, the credit amount and limit are in units
of pre-allocated pages in flight.  I set the enqueue_credit_limit to
1024 pages; once the queued-up jobs exceed that limit, further submits
start blocking.

The number of _jobs_ is irrelevant; what matters is the number of
pre-alloc'd pages in flight.
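
Roughly, the accounting looks like this (just a sketch, not the actual
msm or scheduler code; every name other than enqueue_credit_limit is
an illustrative assumption):

/* Sketch: throttle async VM_BIND on pre-allocated pgtable pages in
 * flight.  Names other than enqueue_credit_limit are made up for the
 * example.
 */
#include <linux/atomic.h>
#include <linux/wait.h>

struct bind_queue {
        unsigned long enqueue_credit_limit; /* e.g. 1024 pages */
        atomic_long_t enqueue_credits;      /* pre-alloc'd pages in flight */
        wait_queue_head_t credit_wait;
};

/* Called on the VM_BIND submit path; submitters on a given queue are
 * assumed to be serialized, so the check-then-add below does not race.
 */
static void bind_queue_get_credits(struct bind_queue *q,
                                   unsigned long prealloc_pages)
{
        /* Block (ie. go semi-sync) once too many pre-allocated pages
         * are tied up in queued jobs:
         */
        wait_event(q->credit_wait,
                   atomic_long_read(&q->enqueue_credits) + prealloc_pages <=
                   q->enqueue_credit_limit);
        atomic_long_add(prealloc_pages, &q->enqueue_credits);
}

/* Called once the job has run and any unused pre-allocated pages have
 * been returned.
 */
static void bind_queue_put_credits(struct bind_queue *q,
                                   unsigned long prealloc_pages)
{
        atomic_long_sub(prealloc_pages, &q->enqueue_credits);
        wake_up(&q->credit_wait);
}

In the actual series the throttling lives in drm/sched and msm just
supplies the numbers.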

> I really think, this isn't the correct solution, but more a workaround.
>
> > Part of the VM_BIND for msm series adds some tracepoints for amount of
> > memory preallocated vs used for each job.  That plus scheduler
> > tracepoints should let you see how much memory is tied up in
> > prealloc'd pgtables.  You might not be noticing only because you are
> > running on a big desktop with lots of RAM ;-)
> >
> > > > > > But this
> > > > > > doesn't work for native context, where everything has to be
> > > > > > asynchronous, so we're trying a new approach where we instead submit
> > > > > > an asynchronous bind for "normal" (non-sparse/driver internal)
> > > > > > allocations and only attach its out-fence to the in-fence of
> > > > > > subsequent submits to other queues.
> > > > >
> > > > > This is what nouveau does and I think other drivers like Xe and panthor do this
> > > > > as well.
> > > >
> > > > No one has added native context support for these drivers yet
> > >
> > > Huh? What exactly do you mean with "native context" then?
> >
> > It is a way to use native usermode driver in a guest VM, by remoting
> > at the UAPI level, as opposed to the vk or gl API level.  You can
> > generally get equal to native performance, but the guest/host boundary
> > strongly encourages asynchronous to hide the guest->host latency.
>
> For the context we're discussing this isn't different to other drivers supporting
> async VM_BIND utilizing it from the host, rather than from a guest.
>
> So, my original statement about nouveau, Xe, panthor doing the same thing
> without running into trouble should be valid.

Probably the difference is that we don't do any _synchronous_ binds.
And that is partially motivated by the virtual machine case.

BR,
-R

