[PATCH 1/2] drm/scheduler: improve GPU scheduler documentation v2

Wed Feb 14 08:23:00 UTC 2024

Am 13.02.24 um 18:37 schrieb Danilo Krummrich:
> Hi Christian,
>
> What's the status of this effort? Was there ever a follow-up?

It's unfortunately on hold for the moment since I have to look into some 
internal things with highest priority. No idea when this will calm down 
again.

Christian.

>
> - Danilo
>
> On 11/16/23 23:23, Danilo Krummrich wrote:
>> Hi Christian,
>>
>> Thanks for sending an update of this patch!
>>
>> On Thu, Nov 16, 2023 at 03:15:46PM +0100, Christian König wrote:
>>> Start to improve the scheduler document. Especially document the
>>> lifetime of each of the objects as well as the restrictions around
>>> DMA-fence handling and userspace compatibility.
>>>
>>> v2: Some improvements suggested by Danilo, add section about error
>>>      handling.
>>>
>>> Signed-off-by: Christian König <christian.koenig at amd.com>
>>> ---
>>>   Documentation/gpu/drm-mm.rst           |  36 +++++
>>>   drivers/gpu/drm/scheduler/sched_main.c | 174 
>>> +++++++++++++++++++++----
>>>   2 files changed, 188 insertions(+), 22 deletions(-)
>>>
>>> diff --git a/Documentation/gpu/drm-mm.rst 
>>> b/Documentation/gpu/drm-mm.rst
>>> index acc5901ac840..112463fa9f3a 100644
>>> --- a/Documentation/gpu/drm-mm.rst
>>> +++ b/Documentation/gpu/drm-mm.rst
>>> @@ -552,12 +552,48 @@ Overview
>>>   .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>>>      :doc: Overview
>>>   +Job Object
>>> +----------
>>> +
>>> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>>> +   :doc: Job Object
>>> +
>>> +Entity Object
>>> +-------------
>>> +
>>> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>>> +   :doc: Entity Object
>>> +
>>> +Hardware Fence Object
>>> +---------------------
>>> +
>>> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>>> +   :doc: Hardware Fence Object
>>> +
>>> +Scheduler Fence Object
>>> +----------------------
>>> +
>>> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>>> +   :doc: Scheduler Fence Object
>>> +
>>> +Scheduler and Run Queue Objects
>>> +-------------------------------
>>> +
>>> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>>> +   :doc: Scheduler and Run Queue Objects
>>> +
>>>   Flow Control
>>>   ------------
>>>     .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>>>      :doc: Flow Control
>>>   +Error and Timeout handling
>>> +--------------------------
>>> +
>>> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>>> +   :doc: Error and Timeout handling
>>> +
>>>   Scheduler Function References
>>>   -----------------------------
>>>   diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>> index 044a8c4875ba..026123497b0e 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> @@ -24,28 +24,122 @@
>>>   /**
>>>    * DOC: Overview
>>>    *
>>> - * The GPU scheduler provides entities which allow userspace to 
>>> push jobs
>>> - * into software queues which are then scheduled on a hardware run 
>>> queue.
>>> - * The software queues have a priority among them. The scheduler 
>>> selects the entities
>>> - * from the run queue using a FIFO. The scheduler provides 
>>> dependency handling
>>> - * features among jobs. The driver is supposed to provide callback 
>>> functions for
>>> - * backend operations to the scheduler like submitting a job to 
>>> hardware run queue,
>>> - * returning the dependencies of a job etc.
>>> - *
>>> - * The organisation of the scheduler is the following:
>>> - *
>>> - * 1. Each hw run queue has one scheduler
>>> - * 2. Each scheduler has multiple run queues with different priorities
>>> - *    (e.g., HIGH_HW,HIGH_SW, KERNEL, NORMAL)
>>> - * 3. Each scheduler run queue has a queue of entities to schedule
>>> - * 4. Entities themselves maintain a queue of jobs that will be 
>>> scheduled on
>>> - *    the hardware.
>>> - *
>>> - * The jobs in a entity are always scheduled in the order that they 
>>> were pushed.
>>> - *
>>> - * Note that once a job was taken from the entities queue and 
>>> pushed to the
>>> - * hardware, i.e. the pending queue, the entity must not be 
>>> referenced anymore
>>> - * through the jobs entity pointer.
>>> + * The GPU scheduler implements some logic to decide which command 
>>> submission
>>> + * to push next to the hardware. Another major use case of the GPU 
>>> scheduler
>>> + * is to enforce correct driver behavior around those command 
>>> submissions.
>>> + * Because of this it's also used by drivers which don't need the 
>>> actual
>>> + * scheduling functionality.
>>
>> This reads a bit like we're already right in the middle of the 
>> documentation.
>> I'd propose to start with something like "The DRM GPU scheduler 
>> serves as a
>> common component intended to help drivers to manage command 
>> submissions." And
>> then mention the different solutions the scheduler provides, e.g. 
>> ordering of
>> command submissions, dma-fence handling in the context of command 
>> submissions,
>> etc.
>>
>> Also, I think it would be good to give a rough overview of the 
>> topology of the
>> scheduler's components. And since you already mention that the 
>> component is
>> "also used by drivers which don't need the actual scheduling 
>> functionality", I'd
>> also mention the special case of a single entity and a single 
>> run-queue per
>> scheduler.
>>
>>> + *
>>> + * All callbacks the driver needs to implement are restricted by 
>>> DMA-fence
>>> + * signaling rules to guarantee deadlock free forward progress. 
>>> This especially
>>> + * means that for normal operation no memory can be allocated in a 
>>> callback.
>>> + * All memory which is needed for pushing the job to the hardware 
>>> must be
>>> + * allocated before arming a job. It also means that no locks can 
>>> be taken
>>> + * under which memory might be allocated as well.
>>
>> I think that's good. Even though, with the recently merged workqueue 
>> patches,
>> drivers can actually create a setup where the free_job callback isn't 
>> part of
>> the fence signalling critical path anymore. But I agree with Sima 
>> that this is
>> probably too error prone to give drivers ideas. So, this paragraph is 
>> probably
>> good as it is. :-)
>>
>>> + *
>>> + * Memory which is optional to allocate, for example for device 
>>> core dumping or
>>> + * debugging, *must* be allocated with GFP_NOWAIT and appropriate 
>>> error
>>> + * handling taking if that allocation fails. GFP_ATOMIC should only 
>>> be used if
>>> + * absolutely necessary since dipping into the special atomic 
>>> reserves is
>>> + * usually not justified for a GPU driver.
>>> + */
>>> +
>>> +/**
>>> + * DOC: Job Object
>>> + *
>>> + * The base job object contains submission dependencies in the form 
>>> of DMA-fence
>>> + * objects. Drivers can also implement an optional prepare_job 
>>> callback which
>>> + * returns additional dependencies as DMA-fence objects. It's 
>>> important to note
>>> + * that this callback can't allocate memory or grab locks under 
>>> which memory is
>>> + * allocated.
>>> + *
>>> + * Drivers should use this as base class for an object which 
>>> contains the
>>> + * necessary state to push the command submission to the hardware.
>>> + *
>>> + * The lifetime of the job object should at least be from pushing 
>>> it into the
>>
>> "should at least last from"
>>
>>> + * scheduler until the scheduler notes through the free callback 
>>> that a job
>>
>> What about "until the free_job callback has been called and hence the 
>> scheduler
>> does not require the job object anymore."?
>>
>>> + * isn't needed any more. Drivers can of course keep their job 
>>> object alive
>>> + * longer than that, but that's outside of the scope of the scheduler
>>> + * component. Job initialization is split into two parts, 
>>> drm_sched_job_init()
>>> + * and drm_sched_job_arm(). It's important to note that after 
>>> arming a job
>>
>> I suggest to add a brief comment on why job initialization is split up.
>>
>>> + * drivers must follow the DMA-fence rules and can't easily 
>>> allocate memory
>>> + * or takes locks under which memory is allocated.
>>> + */
>>> +
>>> +/**
>>> + * DOC: Entity Object
>>> + *
>>> + * The entity object which is a container for jobs which should 
>>> execute
>>> + * sequentially. Drivers should create an entity for each 
>>> individual context
>>> + * they maintain for command submissions which can run in parallel.
>>> + *
>>> + * The lifetime of the entity should *not* exceed the lifetime of the
>>> + * userspace process it was created for and drivers should call the
>>> + * drm_sched_entity_flush() function from their file_operations.flush
>>> + * callback. So it's possible that an entity object is not alive any
>>
>> "Note that it is possible..."
>>
>>> + * more while jobs from it are still running on the hardware.
>>
>> "while jobs previously fetched from this entity are still..."
>>
>>> + *
>>> + * Background is that for compatibility reasons with existing
>>> + * userspace all results of a command submission should become visible
>>> + * externally even after after a process exits. This is normal 
>>> POSIX behavior
>>> + * for I/O operations.
>>> + *
>>> + * The problem with this approach is that GPU submissions contain 
>>> executable
>>> + * shaders enabling processes to evade their termination by 
>>> offloading work to
>>> + * the GPU. So when a process is terminated with a SIGKILL the 
>>> entity object
>>> + * makes sure that jobs are freed without running them while still 
>>> maintaining
>>> + * correct sequential order for signaling fences.
>>> + */
>>> +
>>> +/**
>>> + * DOC: Hardware Fence Object
>>> + *
>>> + * The hardware fence object is a DMA-fence provided by the driver 
>>> as result of
>>> + * running jobs. Drivers need to make sure that the normal 
>>> DMA-fence semantics
>>> + * are followed for this object. It's important to note that the 
>>> memory for
>>> + * this object can *not* be allocated in the run_job callback since 
>>> that would
>>> + * violate the requirements for the DMA-fence implementation. The 
>>> scheduler
>>> + * maintains a timeout handler which triggers if this fence doesn't 
>>> signal in
>>> + * a configurable time frame.
>>> + *
>>> + * The lifetime of this object follows DMA-fence ref-counting 
>>> rules, the
>>> + * scheduler takes ownership of the reference returned by the 
>>> driver and drops
>>> + * it when it's not needed any more.
>>> + */
>>> +
>>> +/**
>>> + * DOC: Scheduler Fence Object
>>> + *
>>> + * The scheduler fence object which encapsulates the whole time 
>>> from pushing
>>> + * the job into the scheduler until the hardware has finished 
>>> processing it.
>>> + * This is internally managed by the scheduler, but drivers can 
>>> grab additional
>>> + * reference to it after arming a job. The implementation provides 
>>> DMA-fence
>>> + * interfaces for signaling both scheduling of a command submission 
>>> as well as
>>> + * finishing of processing.
>>> + *
>>> + * The lifetime of this object also follows normal DMA-fence 
>>> ref-counting
>>> + * rules. The finished fence is the one normally exposed outside of 
>>> the
>>> + * scheduler, but the driver can grab references to both the 
>>> scheduled as well
>>> + * as the finished fence when needed for pipe-lining optimizations.
>>> + */
>>> +
>>> +/**
>>> + * DOC: Scheduler and Run Queue Objects
>>> + *
>>> + * The scheduler object itself does the actual work of selecting a 
>>> job and
>>> + * pushing it to the hardware. Both FIFO and RR selection algorithm 
>>> are
>>> + * supported, but FIFO is preferred for many use cases.
>>
>> I suggest to name the use cases FIFO scheduling is preferred for and 
>> why.
>>
>> If, instead, it's just a general recommendation, I also suggest to 
>> explain why.
>>
>>> + *
>>> + * The lifetime of the scheduler is managed by the driver using it. 
>>> Before
>>> + * destroying the scheduler the driver must ensure that all 
>>> hardware processing
>>> + * involving this scheduler object has finished by calling for example
>>> + * disable_irq(). It is *not* sufficient to wait for the hardware 
>>> fence here
>>> + * since this doesn't guarantee that all callback processing has 
>>> finished.
>>
>> This is the part I'm most concerned about, since I feel like we leave 
>> drivers
>> "up in the air" entirely. Hence, I think here we need to be more 
>> verbose and
>> detailed about the options drivers have to ensure that.
>>
>> For instance, let's assume we have the single-entity-per-scheduler 
>> topology
>> because the driver only uses the GPU scheduler to feed a firmware 
>> scheduler with
>> dynamically allocated ring buffers.
>>
>> In this case the entity, scheduler and ring buffer are bound to the 
>> lifetime of
>> a userspace process.
>>
>> What do we expect the driver to do if the userspace process is 
>> killed? As you
>> mentioned, only waiting for the ring to be idle (which implies all HW 
>> fences
>> are signalled) is not enough. This doesn't guarantee all the free_job()
>> callbacks have been called yet and hence stopping the scheduler 
>> before the
>> pending_list is actually empty would leak the memory of the jobs on the
>> pending_list waiting to be freed.
>>
>> I already brought this up when we were discussing Matt's Xe inspired 
>> scheduler
>> patch series and it seems there was no interest to provide drivers 
>> with some
>> common mechanism that gurantees that the pending_list is empty. 
>> Hence, I really
>> think we should at least give recommendations how drivers should deal 
>> with that.
>>
>>> + *
>>> + * The run queue object is a container of entities for a certain 
>>> priority
>>> + * level. This object is internally managed by the scheduler and 
>>> drivers
>>> + * shouldn't touch them directly. The lifetime of run queues are 
>>> bound to the
>>> + * schedulers lifetime.
>>
>> I think we should also mention that we support a variable number of 
>> run-queues
>> up to DRM_SCHED_PRIORITY_COUNT. Also there is this weird restriction 
>> on which
>> priorities a driver can use when choosing less than 
>> DRM_SCHED_PRIORITY_COUNT
>> run-queues.
>>
>> For instance, initializing the scheduler with a single run-queue, 
>> requires the
>> corresponding entities to pick DRM_SCHED_PRIORITY_MIN, otherwise 
>> we'll just
>> fault since the priority is also used as an array index into 
>> sched->sched_rq[].
>>
>>>    */
>>>     /**
>>> @@ -72,6 +166,42 @@
>>>    * limit.
>>>    */
>>>   +/**
>>> + * DOC: Error and Timeout handling
>>> + *
>>> + * Errors schould be signaled by using dma_fence_set_error() on the 
>>> hardware
>>> + * fence object before signaling it. Errors are then bubbled up 
>>> from the
>>> + * hardware fence to the scheduler fence.
>>> + *
>>> + * The entity allows querying errors on the last run submission 
>>> using the
>>> + * drm_sched_entity_error() function which can be used to cancel 
>>> queued
>>> + * submissions in the run_job callback as well as preventing 
>>> pushing further
>>> + * ones into the entity in the drivers submission function.
>>> + *
>>> + * When the hardware fence fails to signal in a configurable amount 
>>> of time the
>>> + * timedout_job callback is issued. The driver should then follow 
>>> the procedure
>>> + * described on the &struct drm_sched_backend_ops.timedout_job 
>>> callback (TODO:
>>> + * The timeout handler should probably switch to using the hardware 
>>> fence as
>>> + * parameter instead of the job. Otherwise the handling will always 
>>> race
>>> + * between timing out and signaling the fence).
>>> + *
>>> + * The scheduler also used to provided functionality for 
>>> re-submitting jobs
>>> + * with replacing the hardware fence during reset handling. This 
>>> functionality
>>> + * is now marked as deprecated. This has proven to be fundamentally 
>>> racy and
>>> + * not compatible with DMA-fence rules and shouldn't be used in any 
>>> new code.
>>> + *
>>> + * Additional there is the drm_sched_increase_karma() function 
>>> which tries to
>>   "Additionally"
>>
>>> + * find the entity which submitted a job and increases it's 'karma'
>>> + * atomic variable to prevent re-submitting jobs from this entity. 
>>> This has
>>> + * quite some overhead and re-submitting jobs is now marked as 
>>> deprecated. So
>>> + * using this function is rather discouraged.
>>> + *
>>> + * Drivers can still re-create the GPU state should it be lost 
>>> during timeout
>>> + * handling when they can guarantee that forward progress is made 
>>> and this
>>> + * doesn't cause another timeout. But this is strongly hardware 
>>> specific and
>>> + * out of the scope of the general GPU scheduler.
>>> + */
>>> +
>>>   #include <linux/wait.h>
>>>   #include <linux/sched.h>
>>>   #include <linux/completion.h>
>>> -- 
>>> 2.34.1
>>>
>