[PATCH 1/2] drm/scheduler: improve GPU scheduler documentation v2
Christian König
christian.koenig at amd.com
Wed Feb 14 08:23:00 UTC 2024
On 13.02.24 18:37, Danilo Krummrich wrote:
> Hi Christian,
>
> What's the status of this effort? Was there ever a follow-up?
It's unfortunately on hold for the moment since I have to look into some
internal things with highest priority. No idea when this will calm down
again.
Christian.
>
> - Danilo
>
> On 11/16/23 23:23, Danilo Krummrich wrote:
>> Hi Christian,
>>
>> Thanks for sending an update of this patch!
>>
>> On Thu, Nov 16, 2023 at 03:15:46PM +0100, Christian König wrote:
>>> Start to improve the scheduler document. Especially document the
>>> lifetime of each of the objects as well as the restrictions around
>>> DMA-fence handling and userspace compatibility.
>>>
>>> v2: Some improvements suggested by Danilo, add section about error
>>> handling.
>>>
>>> Signed-off-by: Christian König <christian.koenig at amd.com>
>>> ---
>>> Documentation/gpu/drm-mm.rst | 36 +++++
>>> drivers/gpu/drm/scheduler/sched_main.c | 174 +++++++++++++++++++++----
>>> 2 files changed, 188 insertions(+), 22 deletions(-)
>>>
>>> diff --git a/Documentation/gpu/drm-mm.rst b/Documentation/gpu/drm-mm.rst
>>> index acc5901ac840..112463fa9f3a 100644
>>> --- a/Documentation/gpu/drm-mm.rst
>>> +++ b/Documentation/gpu/drm-mm.rst
>>> @@ -552,12 +552,48 @@ Overview
>>> .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>>> :doc: Overview
>>> +Job Object
>>> +----------
>>> +
>>> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>>> + :doc: Job Object
>>> +
>>> +Entity Object
>>> +-------------
>>> +
>>> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>>> + :doc: Entity Object
>>> +
>>> +Hardware Fence Object
>>> +---------------------
>>> +
>>> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>>> + :doc: Hardware Fence Object
>>> +
>>> +Scheduler Fence Object
>>> +----------------------
>>> +
>>> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>>> + :doc: Scheduler Fence Object
>>> +
>>> +Scheduler and Run Queue Objects
>>> +-------------------------------
>>> +
>>> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>>> + :doc: Scheduler and Run Queue Objects
>>> +
>>> Flow Control
>>> ------------
>>> .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>>> :doc: Flow Control
>>> +Error and Timeout handling
>>> +--------------------------
>>> +
>>> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>>> + :doc: Error and Timeout handling
>>> +
>>> Scheduler Function References
>>> -----------------------------
>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>> index 044a8c4875ba..026123497b0e 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> @@ -24,28 +24,122 @@
>>> /**
>>> * DOC: Overview
>>> *
>>> - * The GPU scheduler provides entities which allow userspace to push jobs
>>> - * into software queues which are then scheduled on a hardware run queue.
>>> - * The software queues have a priority among them. The scheduler selects the entities
>>> - * from the run queue using a FIFO. The scheduler provides dependency handling
>>> - * features among jobs. The driver is supposed to provide callback functions for
>>> - * backend operations to the scheduler like submitting a job to hardware run queue,
>>> - * returning the dependencies of a job etc.
>>> - *
>>> - * The organisation of the scheduler is the following:
>>> - *
>>> - * 1. Each hw run queue has one scheduler
>>> - * 2. Each scheduler has multiple run queues with different priorities
>>> - *    (e.g., HIGH_HW,HIGH_SW, KERNEL, NORMAL)
>>> - * 3. Each scheduler run queue has a queue of entities to schedule
>>> - * 4. Entities themselves maintain a queue of jobs that will be scheduled on
>>> - *    the hardware.
>>> - *
>>> - * The jobs in a entity are always scheduled in the order that they were pushed.
>>> - *
>>> - * Note that once a job was taken from the entities queue and pushed to the
>>> - * hardware, i.e. the pending queue, the entity must not be referenced anymore
>>> - * through the jobs entity pointer.
>>> + * The GPU scheduler implements some logic to decide which command submission
>>> + * to push next to the hardware. Another major use case of the GPU scheduler
>>> + * is to enforce correct driver behavior around those command submissions.
>>> + * Because of this it's also used by drivers which don't need the actual
>>> + * scheduling functionality.
>>
>> This reads a bit like we're already right in the middle of the
>> documentation.
>> I'd propose to start with something like "The DRM GPU scheduler
>> serves as a
>> common component intended to help drivers to manage command
>> submissions." And
>> then mention the different solutions the scheduler provides, e.g.
>> ordering of
>> command submissions, dma-fence handling in the context of command
>> submissions,
>> etc.
>>
>> Also, I think it would be good to give a rough overview of the
>> topology of the
>> scheduler's components. And since you already mention that the
>> component is
>> "also used by drivers which don't need the actual scheduling
>> functionality", I'd
>> also mention the special case of a single entity and a single
>> run-queue per
>> scheduler.
>>
>>> + *
>>> + * All callbacks the driver needs to implement are restricted by DMA-fence
>>> + * signaling rules to guarantee deadlock free forward progress. This especially
>>> + * means that for normal operation no memory can be allocated in a callback.
>>> + * All memory which is needed for pushing the job to the hardware must be
>>> + * allocated before arming a job. It also means that no locks can be taken
>>> + * under which memory might be allocated as well.
>>
>> I think that's good. Even though, with the recently merged workqueue
>> patches,
>> drivers can actually create a setup where the free_job callback isn't
>> part of
>> the fence signalling critical path anymore. But I agree with Sima
>> that this is
>> probably too error prone to give drivers ideas. So, this paragraph is
>> probably
>> good as it is. :-)
>>
>>> + *
>>> + * Memory which is optional to allocate, for example for device core dumping or
>>> + * debugging, *must* be allocated with GFP_NOWAIT and appropriate error
>>> + * handling taken if that allocation fails. GFP_ATOMIC should only be used if
>>> + * absolutely necessary since dipping into the special atomic reserves is
>>> + * usually not justified for a GPU driver.
>>> + */
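
As an illustration of that rule, a minimal sketch of an optional allocation in a
timeout/debug path could look like this (the my_driver_* names are made up;
kzalloc() and GFP_NOWAIT are the real interfaces):

#include <linux/slab.h>

struct my_driver_coredump {
	/* hypothetical: captured hardware state for debugging */
	u32 ring_head;
	u32 ring_tail;
};

struct my_driver_job {
	struct my_driver_coredump *coredump;	/* optional, may stay NULL */
};

static void my_driver_capture_coredump(struct my_driver_job *job)
{
	struct my_driver_coredump *dump;

	/* Optional data: never block or reclaim, just skip it on failure. */
	dump = kzalloc(sizeof(*dump), GFP_NOWAIT);
	if (!dump)
		return;

	/* ... fill in the captured state ... */
	job->coredump = dump;
}
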
>>> +
>>> +/**
>>> + * DOC: Job Object
>>> + *
>>> + * The base job object contains submission dependencies in the form of DMA-fence
>>> + * objects. Drivers can also implement an optional prepare_job callback which
>>> + * returns additional dependencies as DMA-fence objects. It's important to note
>>> + * that this callback can't allocate memory or grab locks under which memory is
>>> + * allocated.
>>> + *
>>> + * Drivers should use this as base class for an object which contains the
>>> + * necessary state to push the command submission to the hardware.
>>> + *
>>> + * The lifetime of the job object should at least be from pushing it into the
>>
>> "should at least last from"
>>
>>> + * scheduler until the scheduler notes through the free callback that a job
>>
>> What about "until the free_job callback has been called and hence the
>> scheduler
>> does not require the job object anymore."?
>>
>>> + * isn't needed any more. Drivers can of course keep their job object alive
>>> + * longer than that, but that's outside of the scope of the scheduler
>>> + * component. Job initialization is split into two parts, drm_sched_job_init()
>>> + * and drm_sched_job_arm(). It's important to note that after arming a job
>>
>> I suggest to add a brief comment on why job initialization is split up.
>>
>>> + * drivers must follow the DMA-fence rules and can't easily allocate memory
>>> + * or take locks under which memory is allocated.
>>> + */
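
To illustrate the split, a rough sketch of a driver submission path could look
like the following (my_driver_job is hypothetical and the exact
drm_sched_job_init() parameter list differs between kernel versions, so treat
this as a sketch rather than a reference):

#include <drm/gpu_scheduler.h>
#include <linux/dma-fence.h>
#include <linux/err.h>

struct my_driver_job {
	struct drm_sched_job base;
	/* ... command stream, ring space reservation, ... */
};

static struct dma_fence *my_driver_submit(struct my_driver_job *job,
					   struct drm_sched_entity *entity,
					   void *owner)
{
	struct dma_fence *finished;
	int ret;

	/* Step 1: may fail and may allocate memory. */
	ret = drm_sched_job_init(&job->base, entity, owner);
	if (ret)
		return ERR_PTR(ret);

	/* Everything else which can fail or allocate must happen here. */

	/*
	 * Step 2: arming publishes the scheduler fences, from now on the
	 * DMA-fence signaling rules apply and the job has to be pushed.
	 */
	drm_sched_job_arm(&job->base);

	finished = dma_fence_get(&job->base.s_fence->finished);
	drm_sched_entity_push_job(&job->base);

	return finished;
}
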
>>> +
>>> +/**
>>> + * DOC: Entity Object
>>> + *
>>> + * The entity object is a container for jobs which should execute
>>> + * sequentially. Drivers should create an entity for each individual context
>>> + * they maintain for command submissions which can run in parallel.
>>> + *
>>> + * The lifetime of the entity should *not* exceed the lifetime of the
>>> + * userspace process it was created for and drivers should call the
>>> + * drm_sched_entity_flush() function from their file_operations.flush
>>> + * callback. So it's possible that an entity object is not alive any
>>
>> "Note that it is possible..."
>>
>>> + * more while jobs from it are still running on the hardware.
>>
>> "while jobs previously fetched from this entity are still..."
>>
>>> + *
>>> + * Background is that for compatibility reasons with existing
>>> + * userspace all results of a command submission should become visible
>>> + * externally even after a process exits. This is normal POSIX behavior
>>> + * for I/O operations.
>>> + *
>>> + * The problem with this approach is that GPU submissions contain executable
>>> + * shaders enabling processes to evade their termination by offloading work to
>>> + * the GPU. So when a process is terminated with a SIGKILL the entity object
>>> + * makes sure that jobs are freed without running them while still maintaining
>>> + * correct sequential order for signaling fences.
>>> + */
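
A hypothetical sketch of the flush wiring mentioned above (my_driver_* and the
per-file lookup are made up; drm_sched_entity_flush() is the real interface):

#include <drm/drm_file.h>
#include <drm/gpu_scheduler.h>
#include <linux/fs.h>
#include <linux/jiffies.h>

struct my_driver_file {
	struct drm_sched_entity entity;	/* hypothetical per-file context */
};

static int my_driver_flush(struct file *f, fl_owner_t id)
{
	struct drm_file *file_priv = f->private_data;
	struct my_driver_file *mfile = file_priv->driver_priv;

	/*
	 * Let queued jobs drain, or clean them up without running them if
	 * the process was killed.
	 */
	drm_sched_entity_flush(&mfile->entity, msecs_to_jiffies(1000));
	return 0;
}
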
>>> +
>>> +/**
>>> + * DOC: Hardware Fence Object
>>> + *
>>> + * The hardware fence object is a DMA-fence provided by the driver as a result
>>> + * of running jobs. Drivers need to make sure that the normal DMA-fence semantics
>>> + * are followed for this object. It's important to note that the memory for
>>> + * this object can *not* be allocated in the run_job callback since that would
>>> + * violate the requirements for the DMA-fence implementation. The scheduler
>>> + * maintains a timeout handler which triggers if this fence doesn't signal in
>>> + * a configurable time frame.
>>> + *
>>> + * The lifetime of this object follows DMA-fence ref-counting rules, the
>>> + * scheduler takes ownership of the reference returned by the driver and drops
>>> + * it when it's not needed any more.
>>> + */
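
A sketch of one way a driver could satisfy that (my_driver_* names and
my_driver_ring_submit() are hypothetical; the point is only that the fence
memory exists before run_job runs):

#include <drm/gpu_scheduler.h>
#include <linux/container_of.h>
#include <linux/dma-fence.h>
#include <linux/slab.h>

struct my_driver_fence {
	struct dma_fence base;	/* hypothetical driver hardware fence */
};

struct my_driver_job {
	struct drm_sched_job base;
	struct my_driver_fence *hw_fence;	/* pre-allocated, see below */
};

/* Hypothetical helper: rings the doorbell and returns a reference to the
 * initialized hardware fence embedded in job->hw_fence. */
struct dma_fence *my_driver_ring_submit(struct my_driver_job *job);

/* Job creation runs in process context, allocating is still fine here. */
static int my_driver_job_create(struct my_driver_job *job)
{
	job->hw_fence = kzalloc(sizeof(*job->hw_fence), GFP_KERNEL);
	return job->hw_fence ? 0 : -ENOMEM;
}

/* run_job() must not allocate, it only hands out the pre-allocated fence. */
static struct dma_fence *my_driver_run_job(struct drm_sched_job *sched_job)
{
	struct my_driver_job *job =
		container_of(sched_job, struct my_driver_job, base);

	return my_driver_ring_submit(job);
}
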
>>> +
>>> +/**
>>> + * DOC: Scheduler Fence Object
>>> + *
>>> + * The scheduler fence object encapsulates the whole time from pushing
>>> + * the job into the scheduler until the hardware has finished processing it.
>>> + * This is internally managed by the scheduler, but drivers can grab an
>>> + * additional reference to it after arming a job. The implementation provides
>>> + * DMA-fence interfaces for signaling both scheduling of a command submission
>>> + * as well as finishing of processing.
>>> + *
>>> + * The lifetime of this object also follows normal DMA-fence ref-counting
>>> + * rules. The finished fence is the one normally exposed outside of the
>>> + * scheduler, but the driver can grab references to both the scheduled as well
>>> + * as the finished fence when needed for pipe-lining optimizations.
>>> + */
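
For example, a pipelining optimization could take a reference to the scheduled
fence of one job and feed it into the next one as dependency (sketch;
my_driver_job is hypothetical, drm_sched_job_add_dependency() consumes the
reference):

#include <drm/gpu_scheduler.h>
#include <linux/dma-fence.h>

struct my_driver_job {
	struct drm_sched_job base;
};

/* Make the second job wait only until the first one has been *scheduled*,
 * not finished, e.g. because the hardware synchronizes the rings itself. */
static int my_driver_pipeline_jobs(struct my_driver_job *first,
				   struct my_driver_job *second)
{
	struct dma_fence *scheduled;

	/* Only valid after drm_sched_job_arm(&first->base). */
	scheduled = dma_fence_get(&first->base.s_fence->scheduled);

	return drm_sched_job_add_dependency(&second->base, scheduled);
}
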
>>> +
>>> +/**
>>> + * DOC: Scheduler and Run Queue Objects
>>> + *
>>> + * The scheduler object itself does the actual work of selecting a job and
>>> + * pushing it to the hardware. Both FIFO and RR selection algorithms are
>>> + * supported, but FIFO is preferred for many use cases.
>>
>> I suggest to name the use cases FIFO scheduling is preferred for and
>> why.
>>
>> If, instead, it's just a general recommendation, I also suggest to
>> explain why.
>>
>>> + *
>>> + * The lifetime of the scheduler is managed by the driver using it. Before
>>> + * destroying the scheduler the driver must ensure that all hardware processing
>>> + * involving this scheduler object has finished by calling for example
>>> + * disable_irq(). It is *not* sufficient to wait for the hardware fence here
>>> + * since this doesn't guarantee that all callback processing has finished.
>>
>> This is the part I'm most concerned about, since I feel like we leave
>> drivers
>> "up in the air" entirely. Hence, I think here we need to be more
>> verbose and
>> detailed about the options drivers have to ensure that.
>>
>> For instance, let's assume we have the single-entity-per-scheduler
>> topology
>> because the driver only uses the GPU scheduler to feed a firmware
>> scheduler with
>> dynamically allocated ring buffers.
>>
>> In this case the entity, scheduler and ring buffer are bound to the
>> lifetime of
>> a userspace process.
>>
>> What do we expect the driver to do if the userspace process is
>> killed? As you
>> mentioned, only waiting for the ring to be idle (which implies all HW
>> fences
>> are signalled) is not enough. This doesn't guarantee all the free_job()
>> callbacks have been called yet and hence stopping the scheduler
>> before the
>> pending_list is actually empty would leak the memory of the jobs on the
>> pending_list waiting to be freed.
>>
>> I already brought this up when we were discussing Matt's Xe inspired
>> scheduler
>> patch series and it seems there was no interest to provide drivers
>> with some
>> common mechanism that guarantees that the pending_list is empty.
>> Hence, I really
>> think we should at least give recommendations how drivers should deal
>> with that.
>>
>>> + *
>>> + * The run queue object is a container of entities for a certain priority
>>> + * level. This object is internally managed by the scheduler and drivers
>>> + * shouldn't touch it directly. The lifetime of run queues is bound to the
>>> + * scheduler's lifetime.
>>
>> I think we should also mention that we support a variable number of
>> run-queues
>> up to DRM_SCHED_PRIORITY_COUNT. Also there is this weird restriction
>> on which
>> priorities a driver can use when choosing less than
>> DRM_SCHED_PRIORITY_COUNT
>> run-queues.
>>
>> For instance, initializing the scheduler with a single run-queue,
>> requires the
>> corresponding entities to pick DRM_SCHED_PRIORITY_MIN, otherwise
>> we'll just
>> fault since the priority is also used as an array index into
>> sched->sched_rq[].
>>
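
For example, a single run-queue setup as described above would roughly look
like this (sketch; the my_driver_* objects are hypothetical):

#include <drm/gpu_scheduler.h>
#include <linux/kernel.h>

struct my_driver_device {
	struct drm_gpu_scheduler sched;
};

struct my_driver_file {
	struct drm_sched_entity entity;
};

static int my_driver_entity_setup(struct my_driver_device *my_drv,
				  struct my_driver_file *my_file)
{
	struct drm_gpu_scheduler *sched_list[] = { &my_drv->sched };

	/*
	 * With a single run-queue the entity has to use DRM_SCHED_PRIORITY_MIN,
	 * since the priority doubles as index into sched->sched_rq[].
	 */
	return drm_sched_entity_init(&my_file->entity, DRM_SCHED_PRIORITY_MIN,
				     sched_list, ARRAY_SIZE(sched_list), NULL);
}
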
>>> */
>>> /**
>>> @@ -72,6 +166,42 @@
>>> * limit.
>>> */
>>> +/**
>>> + * DOC: Error and Timeout handling
>>> + *
>>> + * Errors should be signaled by using dma_fence_set_error() on the hardware
>>> + * fence object before signaling it. Errors are then bubbled up from the
>>> + * hardware fence to the scheduler fence.
>>> + *
>>> + * The entity allows querying errors on the last run submission using the
>>> + * drm_sched_entity_error() function which can be used to cancel queued
>>> + * submissions in the run_job callback as well as preventing pushing further
>>> + * ones into the entity in the driver's submission function.
>>> + *
>>> + * When the hardware fence fails to signal in a configurable amount of time the
>>> + * timedout_job callback is issued. The driver should then follow the procedure
>>> + * described on the &struct drm_sched_backend_ops.timedout_job callback (TODO:
>>> + * The timeout handler should probably switch to using the hardware fence as
>>> + * parameter instead of the job. Otherwise the handling will always race
>>> + * between timing out and signaling the fence).
>>> + *
>>> + * The scheduler also used to provide functionality for re-submitting jobs
>>> + * while replacing the hardware fence during reset handling. This functionality
>>> + * is now marked as deprecated. This has proven to be fundamentally racy and
>>> + * not compatible with DMA-fence rules and shouldn't be used in any new code.
>>> + *
>>> + * Additional there is the drm_sched_increase_karma() function which tries to
>> "Additionally"
>>
>>> + * find the entity which submitted a job and increases its 'karma'
>>> + * atomic variable to prevent re-submitting jobs from this entity. This has
>>> + * quite some overhead and re-submitting jobs is now marked as deprecated. So
>>> + * using this function is rather discouraged.
>>> + *
>>> + * Drivers can still re-create the GPU state should it be lost during timeout
>>> + * handling when they can guarantee that forward progress is made and this
>>> + * doesn't cause another timeout. But this is strongly hardware specific and
>>> + * out of the scope of the general GPU scheduler.
>>> + */
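
To illustrate the error flow described above, a hedged sketch (my_driver_* is
hypothetical; dma_fence_set_error(), dma_fence_signal() and
drm_sched_entity_error() are the real interfaces):

#include <drm/gpu_scheduler.h>
#include <linux/dma-fence.h>

/* Completion path: record the error before signaling the hardware fence,
 * it then bubbles up to the scheduler fence. */
static void my_driver_fence_complete(struct dma_fence *hw_fence, int error)
{
	if (error)
		dma_fence_set_error(hw_fence, error);
	dma_fence_signal(hw_fence);
}

/* Submission path: stop accepting new jobs once the entity has seen an error. */
static int my_driver_check_entity(struct drm_sched_entity *entity)
{
	int error = drm_sched_entity_error(entity);

	if (error)
		return error;	/* report the broken context back to userspace */

	return 0;
}
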
>>> +
>>> #include <linux/wait.h>
>>> #include <linux/sched.h>
>>> #include <linux/completion.h>
>>> --
>>> 2.34.1
>>>
>