[PATCH 1/2] drm/scheduler: improve GPU scheduler documentation v2

Danilo Krummrich dakr at redhat.com
Tue Feb 13 17:37:44 UTC 2024


Hi Christian,

What's the status of this effort? Was there ever a follow-up?

- Danilo

On 11/16/23 23:23, Danilo Krummrich wrote:
> Hi Christian,
> 
> Thanks for sending an update of this patch!
> 
> On Thu, Nov 16, 2023 at 03:15:46PM +0100, Christian König wrote:
>> Start to improve the scheduler document. Especially document the
>> lifetime of each of the objects as well as the restrictions around
>> DMA-fence handling and userspace compatibility.
>>
>> v2: Some improvements suggested by Danilo, add section about error
>>      handling.
>>
>> Signed-off-by: Christian König <christian.koenig at amd.com>
>> ---
>>   Documentation/gpu/drm-mm.rst           |  36 +++++
>>   drivers/gpu/drm/scheduler/sched_main.c | 174 +++++++++++++++++++++----
>>   2 files changed, 188 insertions(+), 22 deletions(-)
>>
>> diff --git a/Documentation/gpu/drm-mm.rst b/Documentation/gpu/drm-mm.rst
>> index acc5901ac840..112463fa9f3a 100644
>> --- a/Documentation/gpu/drm-mm.rst
>> +++ b/Documentation/gpu/drm-mm.rst
>> @@ -552,12 +552,48 @@ Overview
>>   .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>>      :doc: Overview
>>   
>> +Job Object
>> +----------
>> +
>> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>> +   :doc: Job Object
>> +
>> +Entity Object
>> +-------------
>> +
>> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>> +   :doc: Entity Object
>> +
>> +Hardware Fence Object
>> +---------------------
>> +
>> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>> +   :doc: Hardware Fence Object
>> +
>> +Scheduler Fence Object
>> +----------------------
>> +
>> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>> +   :doc: Scheduler Fence Object
>> +
>> +Scheduler and Run Queue Objects
>> +-------------------------------
>> +
>> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>> +   :doc: Scheduler and Run Queue Objects
>> +
>>   Flow Control
>>   ------------
>>   
>>   .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>>      :doc: Flow Control
>>   
>> +Error and Timeout handling
>> +--------------------------
>> +
>> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>> +   :doc: Error and Timeout handling
>> +
>>   Scheduler Function References
>>   -----------------------------
>>   
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>> index 044a8c4875ba..026123497b0e 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -24,28 +24,122 @@
>>   /**
>>    * DOC: Overview
>>    *
>> - * The GPU scheduler provides entities which allow userspace to push jobs
>> - * into software queues which are then scheduled on a hardware run queue.
>> - * The software queues have a priority among them. The scheduler selects the entities
>> - * from the run queue using a FIFO. The scheduler provides dependency handling
>> - * features among jobs. The driver is supposed to provide callback functions for
>> - * backend operations to the scheduler like submitting a job to hardware run queue,
>> - * returning the dependencies of a job etc.
>> - *
>> - * The organisation of the scheduler is the following:
>> - *
>> - * 1. Each hw run queue has one scheduler
>> - * 2. Each scheduler has multiple run queues with different priorities
>> - *    (e.g., HIGH_HW,HIGH_SW, KERNEL, NORMAL)
>> - * 3. Each scheduler run queue has a queue of entities to schedule
>> - * 4. Entities themselves maintain a queue of jobs that will be scheduled on
>> - *    the hardware.
>> - *
>> - * The jobs in a entity are always scheduled in the order that they were pushed.
>> - *
>> - * Note that once a job was taken from the entities queue and pushed to the
>> - * hardware, i.e. the pending queue, the entity must not be referenced anymore
>> - * through the jobs entity pointer.
>> + * The GPU scheduler implements some logic to decide which command submission
>> + * to push next to the hardware. Another major use case of the GPU scheduler
>> + * is to enforce correct driver behavior around those command submissions.
>> + * Because of this it's also used by drivers which don't need the actual
>> + * scheduling functionality.
> 
> This reads a bit like we're already right in the middle of the documentation.
> I'd propose to start with something like "The DRM GPU scheduler serves as a
> common component intended to help drivers to manage command submissions." And
> then mention the different solutions the scheduler provides, e.g. ordering of
> command submissions, dma-fence handling in the context of command submissions,
> etc.
> 
> Also, I think it would be good to give a rough overview of the topology of the
> scheduler's components. And since you already mention that the component is
> "also used by drivers which don't need the actual scheduling functionality", I'd
> also mention the special case of a single entity and a single run-queue per
> scheduler.
> 
>> + *
>> + * All callbacks the driver needs to implement are restricted by DMA-fence
>> + * signaling rules to guarantee deadlock free forward progress. This especially
>> + * means that for normal operation no memory can be allocated in a callback.
>> + * All memory which is needed for pushing the job to the hardware must be
>> + * allocated before arming a job. It also means that no locks can be taken
>> + * under which memory might be allocated as well.
> 
> I think that's good. That said, with the recently merged workqueue patches,
> drivers can actually create a setup where the free_job callback isn't part of
> the fence signalling critical path anymore. But I agree with Sima that this is
> probably too error prone to advertise to drivers. So, this paragraph is
> probably good as it is. :-)
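> 
> For readers coming from the driver side it might also help to list the
> callbacks this paragraph talks about in one place, e.g. something like
> this (rough sketch, the foo_* functions are made-up driver code):
> 
>     static const struct drm_sched_backend_ops foo_sched_ops = {
>             .prepare_job  = foo_prepare_job,  /* optional, returns extra dependencies */
>             .run_job      = foo_run_job,      /* pushes to the HW, returns the HW fence */
>             .timedout_job = foo_timedout_job, /* entry point for reset/recovery */
>             .free_job     = foo_free_job,     /* releases the driver's job object */
>     };
> 
> i.e. these are the functions the allocation rules above apply to.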
> 
>> + *
>> + * Memory which is optional to allocate, for example for device core dumping or
>> + * debugging, *must* be allocated with GFP_NOWAIT, with appropriate error
>> + * handling if that allocation fails. GFP_ATOMIC should only be used if
>> + * absolutely necessary since dipping into the special atomic reserves is
>> + * usually not justified for a GPU driver.
>> + */
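> 
> Maybe a tiny example makes the GFP_NOWAIT rule more concrete, e.g. for an
> optional device core dump captured on the reset path (sketch, everything
> around foo_devcoredump_fill() is made up):
> 
>     /* Optional allocation: must not block, and failure must only mean
>      * skipping the optional feature, never failing the reset itself. */
>     dump = kzalloc(sizeof(*dump), GFP_NOWAIT);
>     if (dump)
>             foo_devcoredump_fill(dump, job);
> 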
>> +
>> +/**
>> + * DOC: Job Object
>> + *
>> + * The base job object contains submission dependencies in the form of DMA-fence
>> + * objects. Drivers can also implement an optional prepare_job callback which
>> + * returns additional dependencies as DMA-fence objects. It's important to note
>> + * that this callback can't allocate memory or grab locks under which memory is
>> + * allocated.
>> + *
>> + * Drivers should use this as the base class for an object which contains the
>> + * necessary state to push the command submission to the hardware.
>> + *
>> + * The lifetime of the job object should at least be from pushing it into the
> 
> "should at least last from"
> 
>> + * scheduler until the scheduler notes through the free callback that a job
> 
> What about "until the free_job callback has been called and hence the scheduler
> does not require the job object anymore."?
> 
>> + * isn't needed any more. Drivers can of course keep their job object alive
>> + * longer than that, but that's outside of the scope of the scheduler
>> + * component. Job initialization is split into two parts, drm_sched_job_init()
>> + * and drm_sched_job_arm(). It's important to note that after arming a job
> 
> I suggest adding a brief comment on why job initialization is split up.
> 
>> + * drivers must follow the DMA-fence rules and can't easily allocate memory
>> + * or take locks under which memory is allocated.
>> + */
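> 
> Since the init/arm split is central here, a short example of the intended
> calling sequence might be worth adding as well (rough sketch; foo_job
> embeds a struct drm_sched_job as "base", and the exact drm_sched_job_init()
> arguments differ between kernel versions):
> 
>     ret = drm_sched_job_init(&job->base, entity, owner);
>     if (ret)
>             return ret;
> 
>     /* Add dependencies and allocate everything needed to reach the HW
>      * here, while bailing out is still easy. */
>     ret = drm_sched_job_add_dependency(&job->base, dma_fence_get(in_fence));
>     if (ret) {
>             drm_sched_job_cleanup(&job->base);
>             return ret;
>     }
> 
>     /* From arming on the DMA-fence rules apply: no memory allocations and
>      * no locks under which memory is allocated. */
>     drm_sched_job_arm(&job->base);
>     drm_sched_entity_push_job(&job->base);
> 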
>> +
>> +/**
>> + * DOC: Entity Object
>> + *
>> + * The entity object is a container for jobs which should execute
>> + * sequentially. Drivers should create an entity for each individual context
>> + * they maintain for command submissions which can run in parallel.
>> + *
>> + * The lifetime of the entity should *not* exceed the lifetime of the
>> + * userspace process it was created for and drivers should call the
>> + * drm_sched_entity_flush() function from their file_operations.flush
>> + * callback. So it's possible that an entity object is not alive any
> 
> "Note that it is possible..."
> 
>> + * more while jobs from it are still running on the hardware.
> 
> "while jobs previously fetched from this entity are still..."
> 
>> + *
>> + * Background is that for compatibility reasons with existing
>> + * userspace all results of a command submission should become visible
>> + * externally even after a process exits. This is normal POSIX behavior
>> + * for I/O operations.
>> + *
>> + * The problem with this approach is that GPU submissions contain executable
>> + * shaders enabling processes to evade their termination by offloading work to
>> + * the GPU. So when a process is terminated with a SIGKILL the entity object
>> + * makes sure that jobs are freed without running them while still maintaining
>> + * correct sequential order for signaling fences.
>> + */
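> 
> It might also be worth showing where this typically ends up in a driver,
> e.g. (rough sketch, the foo_* structures and the timeout value are made
> up):
> 
>     static int foo_flush(struct file *file, fl_owner_t id)
>     {
>             struct drm_file *file_priv = file->private_data;
>             struct foo_file *ffile = file_priv->driver_priv;
> 
>             /* On process exit: give queued jobs a chance to be pushed, or
>              * drop them without running them when the process was killed. */
>             drm_sched_entity_flush(&ffile->entity, msecs_to_jiffies(1000));
>             return 0;
>     }
> 
>     static void foo_postclose(struct drm_device *dev, struct drm_file *file_priv)
>     {
>             struct foo_file *ffile = file_priv->driver_priv;
> 
>             drm_sched_entity_destroy(&ffile->entity);   /* flush + fini */
>             kfree(ffile);
>     }
> 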
>> +
>> +/**
>> + * DOC: Hardware Fence Object
>> + *
>> + * The hardware fence object is a DMA-fence provided by the driver as result of
>> + * running jobs. Drivers need to make sure that the normal DMA-fence semantics
>> + * are followed for this object. It's important to note that the memory for
>> + * this object can *not* be allocated in the run_job callback since that would
>> + * violate the requirements for the DMA-fence implementation. The scheduler
>> + * maintains a timeout handler which triggers if this fence doesn't signal in
>> + * a configurable time frame.
>> + *
>> + * The lifetime of this object follows DMA-fence ref-counting rules, the
>> + * scheduler takes ownership of the reference returned by the driver and drops
>> + * it when it's not needed any more.
>> + */
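> 
> The "no allocation in run_job" part could use an example as well; the usual
> pattern seems to be to embed or pre-allocate the fence at submission time
> and only initialize it in run_job (rough sketch, everything foo_* is made
> up):
> 
>     static struct dma_fence *foo_run_job(struct drm_sched_job *sched_job)
>     {
>             struct foo_job *job = to_foo_job(sched_job);
>             struct dma_fence *fence = &job->hw_fence; /* embedded in the pre-allocated job */
> 
>             dma_fence_init(fence, &foo_fence_ops, &job->fence_lock,
>                            job->fence_context, job->seqno);
>             foo_ring_emit(job);
> 
>             /* The scheduler takes over this reference and puts it itself. */
>             return dma_fence_get(fence);
>     }
> 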
>> +
>> +/**
>> + * DOC: Scheduler Fence Object
>> + *
>> + * The scheduler fence object encapsulates the whole time from pushing
>> + * the job into the scheduler until the hardware has finished processing it.
>> + * This is internally managed by the scheduler, but drivers can grab an
>> + * additional reference to it after arming a job. The implementation provides
>> + * DMA-fence interfaces for signaling both the scheduling of a command
>> + * submission and the finishing of its processing.
>> + *
>> + * The lifetime of this object also follows normal DMA-fence ref-counting
>> + * rules. The finished fence is the one normally exposed outside of the
>> + * scheduler, but the driver can grab references to both the scheduled as well
>> + * as the finished fence when needed for pipelining optimizations.
>> + */
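> 
> A small example for the pipelining case might help here too (sketch,
> job->base is the embedded struct drm_sched_job):
> 
>     /* Both fences are valid once drm_sched_job_arm() has been called. */
>     out_fence       = dma_fence_get(&job->base.s_fence->finished);
>     scheduled_fence = dma_fence_get(&job->base.s_fence->scheduled);
> 
> with the finished fence being what usually gets exposed to userspace and
> the scheduled fence being useful for work that only needs the job to have
> reached the hardware ring.
> 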
>> +
>> +/**
>> + * DOC: Scheduler and Run Queue Objects
>> + *
>> + * The scheduler object itself does the actual work of selecting a job and
>> + * pushing it to the hardware. Both FIFO and RR selection algorithms are
>> + * supported, but FIFO is preferred for many use cases.
> 
> I suggest naming the use cases FIFO scheduling is preferred for, and why.
> 
> If, instead, it's just a general recommendation, I also suggest explaining why.
> 
>> + *
>> + * The lifetime of the scheduler is managed by the driver using it. Before
>> + * destroying the scheduler the driver must ensure that all hardware processing
>> + * involving this scheduler object has finished by calling for example
>> + * disable_irq(). It is *not* sufficient to wait for the hardware fence here
>> + * since this doesn't guarantee that all callback processing has finished.
> 
> This is the part I'm most concerned about, since I feel like we leave drivers
> "up in the air" entirely. Hence, I think here we need to be more verbose and
> detailed about the options drivers have to ensure that.
> 
> For instance, let's assume we have the single-entity-per-scheduler topology
> because the driver only uses the GPU scheduler to feed a firmware scheduler with
> dynamically allocated ring buffers.
> 
> In this case the entity, scheduler and ring buffer are bound to the lifetime of
> a userspace process.
> 
> What do we expect the driver to do if the userspace process is killed? As you
> mentioned, only waiting for the ring to be idle (which implies all HW fences
> are signalled) is not enough. This doesn't guarantee all the free_job()
> callbacks have been called yet and hence stopping the scheduler before the
> pending_list is actually empty would leak the memory of the jobs on the
> pending_list waiting to be freed.
> 
> I already brought this up when we were discussing Matt's Xe inspired scheduler
> patch series, and it seems there was no interest in providing drivers with some
> common mechanism that guarantees that the pending_list is empty. Hence, I really
> think we should at least give recommendations on how drivers should deal with that.
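> 
> For instance, without a common helper every driver basically has to open
> code something like the following today (rough sketch, all the foo_* state
> is made up; job_count would be incremented at push time and decremented at
> the end of the driver's free_job() callback):
> 
>     static void foo_sched_teardown(struct foo_sched *fsched)
>     {
>             /* Nothing new can be queued from now on. */
>             drm_sched_entity_destroy(&fsched->entity);
> 
>             /* Wait until free_job() ran for everything on the pending_list. */
>             wait_event(fsched->job_done_wq,
>                        atomic_read(&fsched->job_count) == 0);
> 
>             /* Only now it is safe to tear the scheduler down. */
>             drm_sched_fini(&fsched->sched);
>     }
> 
> Having something like that documented, or better provided as a helper,
> would already help a lot.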
> 
>> + *
>> + * The run queue object is a container of entities for a certain priority
>> + * level. This object is internally managed by the scheduler and drivers
>> + * shouldn't touch it directly. The lifetime of run queues is bound to the
>> + * scheduler's lifetime.
> 
> I think we should also mention that we support a variable number of run-queues
> up to DRM_SCHED_PRIORITY_COUNT. Also there is this weird restriction on which
> priorities a driver can use when choosing fewer than DRM_SCHED_PRIORITY_COUNT
> run-queues.
> 
> For instance, initializing the scheduler with a single run-queue requires the
> corresponding entities to pick DRM_SCHED_PRIORITY_MIN; otherwise we'll just
> fault, since the priority is also used as an array index into sched->sched_rq[].
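> 
> Something like the following is what I mean (sketch, fdev and ffile are
> made up):
> 
>     struct drm_gpu_scheduler *sched_list[] = { &fdev->sched };
> 
>     /* The scheduler was initialized with a single run-queue, so the
>      * entity has to use DRM_SCHED_PRIORITY_MIN. */
>     ret = drm_sched_entity_init(&ffile->entity, DRM_SCHED_PRIORITY_MIN,
>                                 sched_list, ARRAY_SIZE(sched_list), NULL);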
> 
>>    */
>>   
>>   /**
>> @@ -72,6 +166,42 @@
>>    * limit.
>>    */
>>   
>> +/**
>> + * DOC: Error and Timeout handling
>> + *
>> + * Errors should be signaled by using dma_fence_set_error() on the hardware
>> + * fence object before signaling it. Errors are then bubbled up from the
>> + * hardware fence to the scheduler fence.
>> + *
>> + * The entity allows querying the error of the last run submission using the
>> + * drm_sched_entity_error() function, which can be used to cancel queued
>> + * submissions in the run_job callback as well as to prevent pushing further
>> + * ones into the entity in the driver's submission function.
>> + *
>> + * When the hardware fence fails to signal in a configurable amount of time the
>> + * timedout_job callback is issued. The driver should then follow the procedure
>> + * described on the &struct drm_sched_backend_ops.timedout_job callback (TODO:
>> + * The timeout handler should probably switch to using the hardware fence as
>> + * parameter instead of the job. Otherwise the handling will always race
>> + * between timing out and signaling the fence).
>> + *
>> + * The scheduler also used to provide functionality for re-submitting jobs
>> + * and replacing the hardware fence during reset handling. This functionality
>> + * is now marked as deprecated. It has proven to be fundamentally racy and
>> + * not compatible with DMA-fence rules and shouldn't be used in any new code.
>> + *
>> + * Additional there is the drm_sched_increase_karma() function which tries to
>   
> "Additionally"
> 
>> + * find the entity which submitted a job and increases its 'karma'
>> + * atomic variable to prevent re-submitting jobs from this entity. This has
>> + * quite some overhead and re-submitting jobs is now marked as deprecated. So
>> + * using this function is rather discouraged.
>> + *
>> + * Drivers can still re-create the GPU state should it be lost during timeout
>> + * handling, as long as they can guarantee that forward progress is made and
>> + * this doesn't cause another timeout. But this is strongly hardware specific
>> + * and outside the scope of the general GPU scheduler.
>> + */
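> 
> A short example for the error plumbing might round this section off
> (sketch, job->hw_fence and the surrounding code are made up):
> 
>     /* Driver's completion/reset path: set the error before signaling, it
>      * then bubbles up into the scheduler fence. */
>     dma_fence_set_error(&job->hw_fence, -ETIMEDOUT);
>     dma_fence_signal(&job->hw_fence);
> 
>     /* Driver's submission path: refuse new work on a broken context. */
>     ret = drm_sched_entity_error(entity);
>     if (ret)
>             return ret;
> 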
>> +
>>   #include <linux/wait.h>
>>   #include <linux/sched.h>
>>   #include <linux/completion.h>
>> -- 
>> 2.34.1
>>


