[PATCH v4 04/40] drm/sched: Add enqueue credit limit

Rob Clark robdclark at gmail.com
Fri May 23 02:31:28 UTC 2025


On Thu, May 22, 2025 at 8:53 AM Danilo Krummrich <dakr at kernel.org> wrote:
>
> On Thu, May 22, 2025 at 07:47:17AM -0700, Rob Clark wrote:
> > On Thu, May 22, 2025 at 4:00 AM Danilo Krummrich <dakr at kernel.org> wrote:
> > > On Tue, May 20, 2025 at 10:22:54AM -0700, Rob Clark wrote:
> > > > On Tue, May 20, 2025 at 9:54 AM Danilo Krummrich <dakr at kernel.org> wrote:
> > > > > On Tue, May 20, 2025 at 09:07:05AM -0700, Rob Clark wrote:
> > > > > > On Tue, May 20, 2025 at 12:06 AM Danilo Krummrich <dakr at kernel.org> wrote:
> > > > > > > But let's assume we agree that we want to avoid that userspace can ever OOM itself
> > > > > > > through async VM_BIND, then the proposed solution seems wrong:
> > > > > > >
> > > > > > > Do we really want the driver developer to set an arbitrary boundary of a number
> > > > > > > of jobs that can be submitted before *async* VM_BIND blocks and becomes
> > > > > > > semi-sync?
> > > > > > >
> > > > > > > How do we choose this number of jobs? A very small number to be safe, which
> > > > > > > scales badly on powerful machines? A large number that scales well on powerful
> > > > > > > machines, but OOMs on weaker ones?
> > > > > >
> > > > > > The way I am using it in msm, the credit amount and limit are in units
> > > > > > of pre-allocated pages in-flight.  I set the enqueue_credit_limit to
> > > > > > 1024 pages; once there are jobs queued up exceeding that limit, they
> > > > > > start blocking.
> > > > > >
> > > > > > The number of _jobs_ is irrelevant, it is # of pre-alloc'd pages in flight.
> > > > >
> > > > > That doesn't make a difference for my question. How do you know 1024 pages is a
> > > > > good value? How do we scale for different machines with different capabilities?
> > > > >
> > > > > If you have a powerful machine with lots of memory, we might throttle userspace
> > > > > for no reason, no?
> > > > >
> > > > > If the machine has very limited resources, it might already be too much?
> > > >
> > > > It may be a bit arbitrary, but then again I'm not sure that userspace
> > > > is in any better position to pick an appropriate limit.
> > > >
> > > > 4MB of in-flight pages isn't going to be too much for anything that is
> > > > capable enough to run vk, but still allows for a lot of in-flight
> > > > maps.
> > >
> > > Ok, but what about the other way around? What's the performance impact if the
> > > limit is chosen rather small, but we're running on a very powerful machine?
> > >
> > > Since you already have the implementation for hardware you have access to, can
> > > you please check if and how performance degrades when you use a very small
> > > threshold?
> >
> > I mean, considering that some drivers (asahi, at least) _only_
> > implement synchronous VM_BIND, I guess blocking in extreme cases isn't
> > so bad.
>
> Which is not even upstream yet and eventually will support async VM_BIND too,
> AFAIK.

The uapi is upstream.

> > But I think you are overthinking this.  4MB of pagetables is
> > enough to map ~8GB of buffers.
> >
> > Perhaps drivers would want to set their limit based on the amount of
> > memory the GPU could map, which might land them on a # larger than
> > 1024, but still not an order of magnitude more.
>
> Nouveau currently supports an address space width of 128TiB.
>
> In general, we have to cover the range of some small laptop or handheld devices
> to huge datacenter machines.

Sure.. and?  It is still up to the user of sched to set their own
limits; I'm not proposing that sched takes charge of that policy.

Maybe msm doesn't have to scale up quite as much (yet).. but it has to
scale quite a bit further down (like watches).  In the end it is the
same.  And also not really the point here.
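
For reference, the way this ends up looking on the driver side is
roughly the following.  This is a simplified sketch: the
enqueue_credit_limit knob and the per-job credit value are roughly what
this series adds, but the surrounding struct/field names here are only
approximate:

        /*
         * Illustrative only -- roughly how msm uses it; units are
         * pre-allocated pagetable pages in flight.
         */
        struct drm_sched_init_args args = {
                .ops = &msm_vm_bind_ops,          /* driver's sched ops */
                .credit_limit = ring_credits,     /* existing run-job credits */
                .enqueue_credit_limit = 1024,     /* pre-alloc'd pages in flight */
                /* ... */
        };

        /* at submit time, each job is charged for what it pre-allocates: */
        job->enqueue_credits = pt_pages_preallocated(job);

So the number that matters is pages of pre-allocations in flight, not
the number of jobs.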

> > I don't really have a good setup for testing games that use this atm;
> > fex-emu isn't working for me.  But I think Connor has a setup with
> > proton working?
>
> I just want to be sure that an arbitrarily small limit, chosen so that a small
> device doesn't fail VK CTS, can't regress the performance on large machines.

Why are we debating the limit I set outside of sched?  Even that might
be subject to some tuning for devices that have more memory, but that
is really outside the scope of this patch.

> So, kindly try to prove that we're not prone to extreme performance regression
> with a static value as you propose.
>
> > > Also, I think we should probably put this throttle mechanism in a separate
> > > component that just wraps a counter of bytes, or rather pages, which can be
> > > increased and decreased through an API, where the increase blocks at a
> > > certain threshold.
> >
> > Maybe?  I don't see why we need to explicitly define the units for the
> > credit.  This wasn't done for the existing credit mechanism.. which, it
> > seems, could also have been implemented externally if you used some
> > extra fences.
>
> If you are referring to the credit mechanism in the scheduler for ring buffers,
> that's a different case. Drivers know the size of their ring buffers exactly, and
> the scheduler is responsible for deciding when to submit tasks to the ring buffer.
> So the scheduler kind of owns the resource.
>
> However, the throttle mechanism you propose is independent of the scheduler;
> it depends on the available system memory, a resource the scheduler doesn't own.

It is a distinction that is perhaps a matter of opinion.  I don't see
such a big difference; it is all just a matter of managing physical
resource usage in different stages of a scheduled job's lifetime.

> I'm fine with making the unit credits as well, but in this case we really care
> about the consumption of system memory, so we could just use an applicable unit.
>
> > > This component can then be called by a driver from the job submit IOCTL and the
> > > corresponding place where the pre-allocated memory is actually used / freed.
> > >
> > > Depending on the driver, this might not necessarily be in the scheduler's
> > > run_job() callback.
> > >
> > > We could call the component something like drm_throttle or drm_submit_throttle.
> >
> > Maybe?  This still runs into the same complaint I had about just
> > implementing this in msm.. it would have to reach in and use the
> > scheduler's job_scheduled wait-queue.  Which, to me at least, seems
> > like more of an internal detail about how the scheduler works.
>
> Why? The component should use its own waitqueue. Subsequently, from your code
> that releases the pre-allocated memory, you can decrement the counter through
> the drm_throttle API, which automatically kicks its waitqueue.
>
> For instance from your VM_BIND IOCTL you can call
>
>         drm_throttle_inc(value)
>
> which blocks if the increment goes above the threshold. And when you release the
> pre-allocated memory you call
>
>         drm_throttle_dec(value)
>
> which wakes the waitqueue and unblocks the drm_throttle_inc() call from your
> VM_BIND IOCTL.

ok, sure, we could introduce another waitqueue, but with my proposal
that is not needed.  And like I said, the existing throttling could
also be implemented externally to the scheduler..  so I'm not seeing
any fundamental difference.
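
Just to make sure we are talking about the same thing, I read the
drm_throttle idea as roughly the following.. hypothetical sketch, none
of this exists today:

        #include <linux/spinlock.h>
        #include <linux/wait.h>

        struct drm_throttle {
                wait_queue_head_t wq;
                spinlock_t lock;
                unsigned int value;
                unsigned int limit;
        };

        static bool drm_throttle_try_charge(struct drm_throttle *t, unsigned int v)
        {
                bool ok;

                spin_lock(&t->lock);
                ok = t->value + v <= t->limit;
                if (ok)
                        t->value += v;
                spin_unlock(&t->lock);

                return ok;
        }

        /* blocks until the increment fits under the threshold */
        void drm_throttle_inc(struct drm_throttle *t, unsigned int v)
        {
                wait_event(t->wq, drm_throttle_try_charge(t, v));
        }

        /* releases v and wakes anyone blocked in drm_throttle_inc() */
        void drm_throttle_dec(struct drm_throttle *t, unsigned int v)
        {
                spin_lock(&t->lock);
                t->value -= v;
                spin_unlock(&t->lock);
                wake_up_all(&t->wq);
        }

Which is the same counter-plus-waitqueue we'd otherwise keep inside the
scheduler, just moved out of drm_sched.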

> Another advantage is that, if necessary, we can make drm_throttle
> (automatically) scale with the machine's resources, which otherwise we'd need to
> pollute the scheduler with.

How is this different from drivers being more sophisticated about
picking the limit we configure the scheduler with?

Sure, maybe just setting a hard-coded limit of 1024 isn't the final
solution.. maybe we should take the size of the device into
consideration.  But this is also entirely outside of the scheduler, and
I fail to understand why we are discussing it here.
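
And if we did want to scale the limit with the device, that is a
handful of lines on the driver side, nothing the scheduler needs to
know about.  Something like the following, with made-up ratio and
clamp bounds, purely illustrative:

        /* ~1 pre-alloc'd pgtable page per 8MB mappable, clamped to something sane */
        u64 mappable = min_t(u64, (u64)totalram_pages() << PAGE_SHIFT,
                             gpu_va_size);   /* gpu_va_size: whatever the GPU can map */

        args.enqueue_credit_limit = clamp_t(u32, mappable >> 23, 256, 8192);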

BR,
-R

