[PATCH v2 1/9] drm/sched: Convert drm scheduler to use a work queue rather than kthread

Danilo Krummrich dakr at redhat.com
Tue Sep 12 16:58:23 UTC 2023


On 9/12/23 17:13, Boris Brezillon wrote:
> On Tue, 12 Sep 2023 16:49:09 +0200
> Boris Brezillon <boris.brezillon at collabora.com> wrote:
> 
>> On Tue, 12 Sep 2023 16:33:01 +0200
>> Danilo Krummrich <dakr at redhat.com> wrote:
>>
>>> On 9/12/23 16:28, Boris Brezillon wrote:
>>>> On Thu, 17 Aug 2023 13:13:31 +0200
>>>> Danilo Krummrich <dakr at redhat.com> wrote:
>>>>      
>>>>> I think that's a misunderstanding. I'm not trying to say that it is
>>>>> *always* beneficial to fill up the ring as much as possible. But I think
>>>>> it is under certain circumstances, exactly those circumstances I
>>>>> described for Nouveau.
>>>>>
>>>>> As mentioned, in Nouveau the size of a job is only really limited by the
>>>>> ring size, which means that one job can (but does not necessarily) fill
>>>>> up the whole ring. We both agree that this is inefficient, because it
>>>>> potentially results in the HW running dry due to hw_submission_limit == 1.
>>>>>
>>>>> I recognize you said that one should define hw_submission_limit and
>>>>> adjust the other parts of the equation accordingly; the options I see are:
>>>>>
>>>>> (1) Increase the ring size while keeping the maximum job size.
>>>>> (2) Decrease the maximum job size while keeping the ring size.
>>>>> (3) Let the scheduler track the actual job size rather than the maximum
>>>>> job size.
>>>>>
>>>>> (1) results in potentially wasted ring memory, because we're not
>>>>> always reaching the maximum job size, but the scheduler assumes so.
>>>>>
>>>>> (2) results in more IOCTLs from userspace for the same amount of IBs;
>>>>> more jobs also mean more memory allocations and more work being
>>>>> submitted to the workqueue (with Matt's patches).
>>>>>
>>>>> (3) doesn't seem to have any of those drawbacks.
>>>>>
>>>>> What would be your take on that?
>>>>>
>>>>> Actually, if none of the other drivers is interested in a more precise
>>>>> way of keeping track of ring utilization, I'd be totally fine doing
>>>>> it in a driver-specific way. However, unfortunately I don't see how that
>>>>> would be possible.
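
To make option (3) above a bit more concrete, here is a purely illustrative
sketch of what a driver-facing interface could look like if the scheduler
tracked per-job ring units; drm_sched_job_init_with_units() and the
ring_units field are made up for this example and are not the actual
interface from the implementation referenced later in this thread:

  #include <drm/gpu_scheduler.h>

  struct my_job {
          struct drm_sched_job base;
          u32 ring_units; /* actual size of this job in ring units */
  };

  /*
   * Hypothetical: instead of the scheduler assuming every job occupies the
   * maximum job size, the driver reports the actual size, and the scheduler
   * only dequeues a job once that many units of the ring are free again.
   */
  static int my_push_job(struct my_job *job, struct drm_sched_entity *entity)
  {
          int ret;

          /* Made-up variant of drm_sched_job_init() taking a unit count. */
          ret = drm_sched_job_init_with_units(&job->base, entity,
                                              job->ring_units, NULL);
          if (ret)
                  return ret;

          drm_sched_job_arm(&job->base);
          drm_sched_entity_push_job(&job->base);
          return 0;
  }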
>>>>
>>>> I'm not entirely sure, but I think PowerVR is pretty close to your
>>>> description: job size is dynamic, and the ring buffer size is
>>>> picked by the driver at queue initialization time. What we did was to
>>>> set hw_submission_limit to an arbitrarily high value of 64k (we could
>>>> have used something like ringbuf_size/min_job_size instead), and then
>>>> have the flow control implemented with ->prepare_job() [1] (CCCB is the
>>>> PowerVR ring buffer). This allows us to maximize ring buffer utilization
>>>> while still allowing dynamic-size jobs.
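
For readers not familiar with that pattern, here is a simplified,
driver-agnostic sketch of such ->prepare_job() based throttling; the my_*()
types and helpers are placeholders, not the actual PowerVR code:

  #include <linux/dma-fence.h>
  #include <drm/gpu_scheduler.h>

  static struct dma_fence *
  my_sched_prepare_job(struct drm_sched_job *sched_job,
                       struct drm_sched_entity *s_entity)
  {
          struct my_job *job = to_my_job(sched_job);

          /* The job already fits into the ring buffer: run it right away. */
          if (my_ring_space(job->queue) >= my_job_size(job))
                  return NULL;

          /*
           * Otherwise return a fence the scheduler waits on before calling
           * run_job(); the driver signals it from its job-done path once
           * enough ring space has been reclaimed.
           */
          return my_ring_space_fence(job->queue, my_job_size(job));
  }

  static const struct drm_sched_ops my_sched_ops = {
          .prepare_job  = my_sched_prepare_job,
          .run_job      = my_run_job,
          .timedout_job = my_timedout_job,
          .free_job     = my_free_job,
  };

With hw_submission_limit set high enough, the ring space check rather than
the job count then becomes the effective throttle.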
>>>
>>> I guess this would work, but I think it would be better to bake this in,
>>> especially if more drivers do have this need. I already have an
>>> implementation [1] for doing that in the scheduler. My plan was to push
>>> that as soon as Matt sends out V3.
>>>
>>> [1] https://gitlab.freedesktop.org/nouvelles/kernel/-/commit/269f05d6a2255384badff8b008b3c32d640d2d95
>>
>> PowerVR's ->can_fit_in_ringbuf() logic is a bit more involved in that
>> native fence waits are passed to the FW, and those add to the job size.
>> When we know our job is ready for execution (all non-native deps are
>> signaled), we evict already-signaled native deps (or native fences) to
>> shrink the job size further, but that's something we need to
>> calculate late if we want the job size to be minimal. Of course, we can
>> always over-estimate the job size, but if we go for a full-blown
>> drm_sched integration, I wonder if it wouldn't be preferable to have a
>> ->get_job_size() callback returning the number of units needed by a job,
>> and have the core pick 1 when the hook is not implemented.
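
Spelled out, such a hook (purely hypothetical, it does not exist today)
could be consumed by the core roughly like this:

  /*
   * Hypothetical addition to struct drm_sched_ops:
   *
   *     u32 (*get_job_size)(struct drm_sched_job *sched_job);
   */
  static u32 drm_sched_job_units(struct drm_gpu_scheduler *sched,
                                 struct drm_sched_job *sched_job)
  {
          /* Fall back to today's behaviour: one job == one unit. */
          if (!sched->ops->get_job_size)
                  return 1;

          /*
           * Calling the hook late (right before submission) would let
           * drivers drop already-signaled native fences from the estimate.
           */
          return sched->ops->get_job_size(sched_job);
  }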
> 
> FWIW, I think last time I asked how to do that, I was pointed to
> ->prepare_job() by someone (don't remember if it was Daniel or
> Christian), hence the PowerVR implementation. If that's still the
> preferred solution, there's some opportunity to have a generic layer to
> automate ringbuf utilization tracking and some helpers to prepare
> wait_for_ringbuf dma_fences that drivers could return from
> ->prepare_job() (those fences would then be signaled when the driver
> calls drm_ringbuf_job_done() and the next job waiting for ringbuf space
> now fits in the ringbuf).
> 
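
Purely to illustrate that helper idea (none of the drm_ringbuf_*() functions
below exist; drm_ringbuf_reserve() is made up here to complement the
drm_ringbuf_job_done() mentioned above), a driver consuming such a generic
layer might end up with little more than:

  static struct dma_fence *
  my_prepare_job(struct drm_sched_job *sched_job,
                 struct drm_sched_entity *s_entity)
  {
          struct my_job *job = to_my_job(sched_job);

          /* NULL if the job fits, otherwise a wait_for_ringbuf fence. */
          return drm_ringbuf_reserve(&job->queue->ringbuf, my_job_size(job));
  }

  static void my_job_done(struct my_job *job)
  {
          /* Releases ring space and signals the next fitting waiter. */
          drm_ringbuf_job_done(&job->queue->ringbuf, my_job_size(job));
  }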

Not sure I like that; it's basically a different implementation to work
around the limitations of an implementation that is supposed to cover this
case in general.


