[PATCH v4 04/40] drm/sched: Add enqueue credit limit

Rob Clark robdclark at gmail.com
Tue May 20 17:22:54 UTC 2025


On Tue, May 20, 2025 at 9:54 AM Danilo Krummrich <dakr at kernel.org> wrote:
>
> On Tue, May 20, 2025 at 09:07:05AM -0700, Rob Clark wrote:
> > On Tue, May 20, 2025 at 12:06 AM Danilo Krummrich <dakr at kernel.org> wrote:
> > >
> > > On Thu, May 15, 2025 at 12:56:38PM -0700, Rob Clark wrote:
> > > > On Thu, May 15, 2025 at 11:56 AM Danilo Krummrich <dakr at kernel.org> wrote:
> > > > >
> > > > > On Thu, May 15, 2025 at 10:40:15AM -0700, Rob Clark wrote:
> > > > > > On Thu, May 15, 2025 at 10:30 AM Danilo Krummrich <dakr at kernel.org> wrote:
> > > > > > >
> > > > > > > (Cc: Boris)
> > > > > > >
> > > > > > > On Thu, May 15, 2025 at 12:22:18PM -0400, Connor Abbott wrote:
> > > > > > > > For some context, other drivers have the concept of a "synchronous"
> > > > > > > > VM_BIND ioctl where the bind has completed by the time the ioctl
> > > > > > > > returns, and drivers implement it by waiting for the whole thing to
> > > > > > > > finish before returning.
> > > > > > >
> > > > > > > Nouveau implements sync by issuing a normal async VM_BIND and then
> > > > > > > waiting for the out-fence synchronously.
> > > > > >
> > > > > > As Connor mentioned, we'd prefer it to be async rather than blocking in
> > > > > > the normal cases, otherwise with drm native context (using the native UMD
> > > > > > in the guest VM) you'd be blocking the single host/VMM virglrenderer
> > > > > > thread.
> > > > > >
> > > > > > The key is we want to keep it async in the normal cases, and not have
> > > > > > weird edge case CTS tests blow up from being _too_ async ;-)
> > > > >
> > > > > I really wonder why they don't blow up in Nouveau, which also supports fully
> > > > > asynchronous VM_BIND. Mind sharing which tests blow up? :)
> > > >
> > > > Maybe it was dEQP-VK.sparse_resources.buffer.ssbo.sparse_residency.buffer_size_2_24,
> > >
> > > The test above is part of the smoke testing I do for nouveau, but I haven't
> > > seen such issues there yet.
> >
> > nouveau is probably not using async binds for everything?  Or maybe
> > I'm just pointing to the wrong test.
>
> Let me double check later on.
>
> > > > but I might be mixing that up; I'd have to back out this patch and see
> > > > where things blow up, which would take many hours.
> > >
> > > Well, you said that you never had this issue with "real" workloads, but only
> > > with VK CTS, so I really think we should know what we are trying to fix here.
> > >
> > > We can't just add new generic infrastructure without reasonable and *well
> > > understood* justification.
> >
> > What is not well understood about this?  We need to pre-allocate memory
> > for pagetables that we will likely never need.
> >
> > In the worst case, with a large number of async PAGE_SIZE binds, you end up
> > needing to pre-allocate 3 pgtable pages (for a 4-level pgtable) per single
> > page of mapping.  Queue up enough of those and you can explode your memory
> > usage.
>
> Well, the general principle of how this can OOM is well understood, sure. What's
> not well understood is how we run into this case. I think we should also
> understand which test causes the issue and why other drivers are not affected
> (yet).
>
> > > > There definitely was one where I was seeing >5k VM_BIND jobs pile up,
> > > > so throttling like this is absolutely needed.
> > >
> > > I still don't understand why the kernel must throttle this. If userspace uses
> > > async VM_BIND, it obviously can't spam the kernel infinitely without running
> > > into an OOM case.
> >
> > It is a valid question whether the kernel or userspace should be the one
> > doing the throttling.
> >
> > I went for doing it in the kernel because the kernel has better
> > knowledge of how much it needs to pre-allocate.
> >
> > (There is also the side point that this pre-allocated memory is not
> > charged to the calling process from a memory-accounting PoV.  So with
> > that in mind it seems like a good idea for the kernel to throttle
> > memory usage.)
>
> That's a very valid point, maybe we should investigate in the direction of
> addressing this, rather than trying to work around it in the scheduler, where we
> can only set an arbitrary credit limit.

Perhaps... but that seems like a bigger can of worms.
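
To put rough numbers on how much un-accounted pre-allocation can pile up in
the worst case I described above, here's a back-of-the-envelope sketch.  The
only assumptions are 4K pages, a 4-level pgtable where each of the three
non-leaf levels may need a freshly pre-allocated table per single-page bind,
and the >5k queued jobs I mentioned earlier; it's purely illustrative, not
driver code:

/* Back-of-the-envelope model of the worst case described above: every
 * async PAGE_SIZE bind may have to pre-allocate one table for each
 * non-leaf level of a 4-level pagetable (3 pages), even though most of
 * them will never be used.  Assumes 4K pages; numbers are illustrative.
 */
#include <stdio.h>

#define PAGE_SZ          4096ULL
#define PGTABLE_LEVELS   4
#define PREALLOC_PER_MAP (PGTABLE_LEVELS - 1)   /* 3 tables per 4K bind */

int main(void)
{
	unsigned long long nbinds = 5000;   /* roughly the >5k queued jobs */
	unsigned long long prealloc = nbinds * PREALLOC_PER_MAP;

	printf("%llu single-page binds -> %llu pre-allocated pgtable pages (~%llu MiB)\n",
	       nbinds, prealloc, prealloc * PAGE_SZ / (1024 * 1024));
	return 0;
}

So ~5k single-page binds can already tie up nearly 60MB of pgtable
pre-allocations that nothing charges back to the submitter.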

> > > But let's assume we agree that we want to prevent userspace from ever OOMing
> > > itself through async VM_BIND; then the proposed solution seems wrong:
> > >
> > > Do we really want the driver developer to set an arbitrary limit on the number
> > > of jobs that can be submitted before *async* VM_BIND blocks and becomes
> > > semi-sync?
> > >
> > > How do we choose this number of jobs? A very small number to be safe, which
> > > scales badly on powerful machines? A large number that scales well on powerful
> > > machines, but OOMs on weaker ones?
> >
> > The way I am using it in msm, the credit amount and limit are in units of
> > pre-allocated pages in flight.  I set the enqueue_credit_limit to 1024 pages;
> > once the jobs queued up exceed that limit, further submissions start blocking.
> >
> > The number of _jobs_ is irrelevant; what matters is the number of
> > pre-alloc'd pages in flight.
>
> That doesn't make a difference for my question. How do you know 1024 pages is a
> good value? How do we scale for different machines with different capabilities?
>
> If you have a powerful machine with lots of memory, we might throttle userspace
> for no reason, no?
>
> If the machine has very limited resources, it might already be too much?

It may be a bit arbitrary, but then again I'm not sure that userspace
is in any better position to pick an appropriate limit.

4MB of in-flight pre-allocated pages isn't going to be too much for anything
that is capable enough to run vk, but it still allows for a lot of in-flight
maps.  As I mentioned before, I don't expect anyone to hit this case
normally, unless they are just trying to poke the driver in weird ways.
Having the kernel guard against that doesn't seem unreasonable.
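
In case it helps to see the accounting spelled out, below is a toy userspace
model of the throttling behaviour.  The only bits taken from what msm does
with this series are the 1024-page limit and the idea of charging each job
the number of pages it pre-allocated; everything else (the names, and the
single-threaded retry loop standing in for however the real code waits) is
made up for illustration:

/* Toy model of the enqueue-credit throttling: each queued job is charged
 * the number of pgtable pages it pre-allocated; once the total in flight
 * would exceed the limit, further submission has to wait for completed
 * jobs to return their credit.  Illustrative userspace model only, not
 * drm_sched code.
 */
#include <assert.h>
#include <stdio.h>

#define ENQUEUE_CREDIT_LIMIT 1024U   /* pages, i.e. 4MB of 4K pages in flight */

static unsigned int inflight_credits;

/* Charge the job and return 1 if it fits under the limit; return 0 if the
 * submitter would have to wait for earlier jobs to complete first. */
static int try_enqueue(unsigned int credits)
{
	if (inflight_credits + credits > ENQUEUE_CREDIT_LIMIT)
		return 0;
	inflight_credits += credits;
	return 1;
}

/* A job completed: its pre-allocated (and mostly unused) pages are freed,
 * so its credit goes back to the pool. */
static void job_done(unsigned int credits)
{
	assert(inflight_credits >= credits);
	inflight_credits -= credits;
}

int main(void)
{
	/* 400 async single-page binds, each pre-allocating 3 pgtable pages */
	for (int i = 0; i < 400; i++) {
		while (!try_enqueue(3)) {
			printf("bind %d blocks at %u pages in flight\n",
			       i, inflight_credits);
			job_done(3);   /* stand-in for waiting on a completion */
		}
	}
	printf("done, %u pages still in flight\n", inflight_credits);
	return 0;
}

The point being that submission stays fully async until the in-flight
pre-allocations approach the cap, and only then degrades to blocking.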

BR,
-R

