[PATCH] drm/scheduler: improve GPU scheduler documentation

Christian König christian.koenig at amd.com
Fri Nov 17 09:01:34 UTC 2023


On 17.11.23 at 04:18, Luben Tuikov wrote:
> On 2023-11-13 07:38, Christian König wrote:
>> Start to improve the scheduler document. Especially document the
>> lifetime of each of the objects as well as the restrictions around
>> DMA-fence handling and userspace compatibility.
> Thanks Christian for doing this--much needed.
>
>> Signed-off-by: Christian König <christian.koenig at amd.com>
>> ---
>>   drivers/gpu/drm/scheduler/sched_main.c | 126 ++++++++++++++++++++-----
>>   1 file changed, 104 insertions(+), 22 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>> index 506371c42745..36a7c5dc852d 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -24,28 +24,110 @@
>>   /**
>>    * DOC: Overview
>>    *
>> - * The GPU scheduler provides entities which allow userspace to push jobs
>> - * into software queues which are then scheduled on a hardware run queue.
>> - * The software queues have a priority among them. The scheduler selects the entities
>> - * from the run queue using a FIFO. The scheduler provides dependency handling
>> - * features among jobs. The driver is supposed to provide callback functions for
>> - * backend operations to the scheduler like submitting a job to hardware run queue,
>> - * returning the dependencies of a job etc.
> So, I don't mind this paragraph, as it provides an overview of the relationship between
> a DRM GPU scheduler, entities, run-queues, and jobs.
>
>> - *
>> - * The organisation of the scheduler is the following:
>> - *
>> - * 1. Each hw run queue has one scheduler
>> - * 2. Each scheduler has multiple run queues with different priorities
>> - *    (e.g., HIGH_HW,HIGH_SW, KERNEL, NORMAL)
>> - * 3. Each scheduler run queue has a queue of entities to schedule
>> - * 4. Entities themselves maintain a queue of jobs that will be scheduled on
>> - *    the hardware.
> This is also good, and shouldn't have been deleted.
>
>> - *
>> - * The jobs in a entity are always scheduled in the order that they were pushed.
> I'd have said here "jobs within an entity". Again shouldn't have been deleted.
> This is good overall/overview information.

I was hoping that I had incorporated all of this into the per-object 
description text. Did I miss anything?

>
>> - *
>> - * Note that once a job was taken from the entities queue and pushed to the
>> - * hardware, i.e. the pending queue, the entity must not be referenced anymore
>> - * through the jobs entity pointer.
>> + * The GPU scheduler implements some logic to decide which command submission
>> + * to push next to the hardware. Another major use case for the GPU scheduler
>> + * is to enforce correct driver behavior around those command submission.
> The GPU scheduler also enforces correct driver behaviour around those command submissions.
>
>> + * Because of this it's also used by drivers which don't need the actual
>> + * scheduling functionality.
> ... but need to push jobs into their firmware/hardware and maintain/keep correct
> DRM dependencies in the form of "fences".

Good point, going to add that.

>
>> + *
>> + * To fulfill this task the GPU scheduler uses of the following objects:
>> + *
>> + * 1. The job object which contains a bunch of dependencies in the form of
> Drop "which".
>
> Instead of listing what it contains, how it is used, what it does, explain
> what it is: work/task to be executed by the GPU. _Then_ you can start listing
> what it contains, how it is used, what it does.
>
>> + *    DMA-fence objects. Drivers can also implement an optional prepare_job
>> + *    callback which returns additional dependencies as DMA-fence objects.
> "can also"? This would usually follow if the other callbacks/etc., have been described
> and they haven't, so I'd say "Drivers implement an optional prepare_job callback,..."
>
>> + *    It's important to note that this callback must follow the DMA-fence rules,
>> + *    so it can't easily allocate memory or grab locks under which memory is
>> + *    allocated. Drivers should use this as base class for an object which
>> + *    contains the necessary state to push the command submission to the
>> + *    hardware.
>> + *
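
To make the prepare_job rules more concrete, a minimal sketch. The 
helper returning the next dependency is made up, the callback signature 
is the one from drm_sched_backend_ops:

    static struct dma_fence *
    my_prepare_job(struct drm_sched_job *sched_job,
                   struct drm_sched_entity *s_entity)
    {
            /* Return one additional dependency per call and NULL when
             * there are none left. No memory allocation and no locks
             * under which memory is allocated are allowed here.
             */
            return my_job_next_dependency(to_my_job(sched_job));
    }
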
>> + *    The lifetime of the job object should at least be from pushing it into the
>> + *    scheduler until the scheduler notes through the free callback that a job
>> + *    isn't needed any more. Drivers can of course keep their job object alive
>> + *    longer than that, but that's outside of the scope of the scheduler
>> + *    component.
> [New paragraph starts describing the job initialization.]
>
> Add:  Job initialization is split into two parts,
>> + *    drm_sched_job_init() and drm_sched_job_arm().
> Perhaps we should mention briefly what each one does..?
>
> Add:  It's important to note that
>> + *    after arming a job drivers must follow the DMA-fence rules and can't
>> + *    easily allocate memory or takes locks under which memory is allocated.
>> + *
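
To illustrate the intended flow, here is a rough sketch of a driver side 
submission path. struct my_job and the my_* helpers are made up, the 
drm_sched_* calls are the actual interface, but the exact signatures can 
differ between kernel versions:

    struct my_job {
            struct drm_sched_job base;      /* the job as base class */
            struct dma_fence hw_fence;      /* pre-allocated HW fence */
            /* ... state needed to push the submission to the HW ... */
    };

    static int my_submit(struct my_context *ctx)
    {
            struct my_job *job = kzalloc(sizeof(*job), GFP_KERNEL);
            int r;

            if (!job)
                    return -ENOMEM;

            r = drm_sched_job_init(&job->base, &ctx->entity, ctx);
            if (r)
                    goto err_free;

            /* Allocate *all* memory needed for the submission here,
             * while failing is still allowed, and add explicit
             * dependencies with drm_sched_job_add_dependency().
             */

            drm_sched_job_arm(&job->base);

            /* From here on the DMA-fence rules apply, no memory
             * allocation, no locks under which memory is allocated.
             */
            drm_sched_entity_push_job(&job->base);
            return 0;

    err_free:
            kfree(job);
            return r;
    }
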
>> + * 2. The entity object which is a container for jobs which should execute
> Drop "which". "The entity object is a container of ..."
>
>> + *    sequentially. Drivers should create an entity for each individual context
>> + *    they maintain for command submissions which can run in parallel.
> This isn't very clear, but can be made so by: "they maintain for command submissions."
> "Entities' jobs can run in parallel as facilitated by the GPU."
>
>> + *
>> + *    The lifetime of the entity should *not* exceed the lifetime of the
>> + *    userspace process it was created for and drivers should call the
>> + *    drm_sched_entity_flush() function from their file_operations.flush
>> + *    callback.
> Okay... they should... WHEN? When we use "should do something" we always clarify with "when xyz happens."
>
> Add:  Background is that for compatibility reasons with existing
>> + *    userspace all results of a command submission should become visible
> "Background" --> "The reason for this is for compatibility with existing ..."
>
>> + *    externally even after after a process exits. The only exception to that
> Remove one of the two "after".
>
>> + *    is when the process is actively killed by a SIGKILL. In this case the
>> + *    entity object makes sure that jobs are freed without running them while
>> + *    still maintaining correct sequential order for signaling fences. So it's
>> + *    possible that an entity object is not alive any more while jobs from it
> "to not be alive"
> (This paragraph here, including SIGKILL, could be made clearer, by splitting it in two
> parts. One describing normal behaviour, i.e. SIGTERM, and the other describing SIGKILL.)
>
>> + *    are still running on the hardware.
>> + *
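
In practice that means one entity per context plus a flush callback, 
something like this (my_file and the timeout are made up / driver 
specific):

    /* at context creation */
    struct drm_gpu_scheduler *sched = &mydev->ring_sched;

    r = drm_sched_entity_init(&ctx->entity, DRM_SCHED_PRIORITY_NORMAL,
                              &sched, 1, NULL);

    /* in file_operations.flush */
    static int my_flush(struct file *f, fl_owner_t id)
    {
            struct my_file *file = f->private_data;

            /* wait for outstanding jobs so that the results of the
             * command submissions stay visible after the process exits
             */
            drm_sched_entity_flush(&file->ctx.entity,
                                   msecs_to_jiffies(5000));
            return 0;
    }
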
>> + * 3. The hardware fence object which is a DMA-fence provided by the driver as
> Drop "which". "The hardware fence object is a DMA-fence which is provided by the driver
> as _a_ result of running a job/jobs."
>
>> + *    result of running jobs. Drivers need to make sure that the normal
>> + *    DMA-fence semantics are followed for this object. It's important to note
> "DMA-fence semantics" are mentioned several times so far, and a link to a description
> of said semantics (or NULL if none is in the kernel)--would be nice to put.
>
>> + *    that the memory for this object can *not* be allocated in the run_job
>> + *    callback since that would violate the requirements for the DMA-fence
>> + *    implementation.
> Is it "no allocation" or just "allocation for this object" in run_job?
> (Or maybe no kernel allocation..., we should probably clarify this.)
>
> Add:  The scheduler maintains a timeout handler which triggers
>> + *    if this fence doesn't signal in a configurable time frame.
>> + *
>> + *    The lifetime of this object follows DMA-fence ref-counting rules, the
>> + *    scheduler takes ownership of the reference returned by the driver and
>> + *    drops it when it's not needed any more. Errors should also be signaled
>> + *    through the hardware fence and are bubbled up back to the scheduler fence
> By whom?
> "through the hardware fence by the driver, and are ..."
>
>> + *    and entity.
>> + *
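
For run_job that boils down to something like this, building on the made 
up my_job from above:

    static struct dma_fence *my_run_job(struct drm_sched_job *sched_job)
    {
            struct my_job *job = to_my_job(sched_job);

            my_ring_emit(job);      /* actually kick the hardware */

            /* The HW fence was initialized before the job was armed,
             * *not* here. The scheduler takes over this reference.
             */
            return dma_fence_get(&job->hw_fence);
    }
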
>> + * 4. The scheduler fence object which encapsulates the whole time from pushing
>> + *    the job into the scheduler until the hardware has finished processing it.
> Perhaps drop "time encapsulation."
>
> It's not very clear what you want to say here. Perhaps use "exists" and drop "which", as in:
> 4. The scheduler fence object exists/is ref-ed by DRM, etc., from the time when the job is
> pushed into the scheduler until the hardware has finished with it.
>
> If fence signaling is involved in those two steps, it should be noted here.
>
> If this is about ownership, it should be simply stated.
>
>> + *    This is internally managed by the scheduler, but drivers can grab
>> + *    additional reference to it after arming a job. The implementation
> Instead of "implementation" perhaps use "DRM scheduler code?"
>
>> + *    provides DMA-fence interfaces for signaling both scheduling of a command
>> + *    submission as well as finishing of processing.
> I'd clarify with
> 	"provides DMA-fence interfaces for drivers, for the scheduling of a command
> 	 submission, akin to the start of a command, and finishing command processing."
>
> Perhaps we can also mention when drivers are supposed to call these...?
>
>> + *
>> + *    The lifetime of this object also follows normal DMA-fence ref-counting
>> + *    rules. The finished fence is the one normally exposed outside of the
>> + *    scheduler, but the driver can grab references to both the scheduled as
>> + *    well as the finished fence when needed for pipe-lining optimizations.
>> + *
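
For reference, after drm_sched_job_arm() both fences can be found in 
job->s_fence, e.g. for pipe-lining:

    struct drm_sched_fence *s_fence = job->base.s_fence;

    /* only valid after drm_sched_job_arm() */
    struct dma_fence *scheduled = dma_fence_get(&s_fence->scheduled);
    struct dma_fence *finished = dma_fence_get(&s_fence->finished);

    /* ... use them as dependencies of a pipelined submission, then
     * drop the references again with dma_fence_put() ...
     */
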
>> + * 5. The run queue object which is a container of entities for a certain
>> + *    priority level. The lifetime of those objects are bound to the scheduler
>> + *    lifetime.
> "which is" --> "is"
> "entities for a certain" --> "entities of a certain"
>
>> + *
>> + *    This is internally managed by the scheduler and drivers shouldn't touch
>> + *    them directly.
>> + *
>> + * 6. The scheduler object itself which does the actual work of selecting a job
>> + *    and pushing it to the hardware. Both FIFO and RR selection algorithm are
>> + *    supported, but FIFO is preferred for many use cases.
> Let's drop "which" and just say "The scheduler object does the actual work of ..."
>
>> + *
>> + *    The lifetime of this object is managed by the driver using it. Before
>> + *    destroying the scheduler the driver must ensure that all hardware
>> + *    processing involving this scheduler object has finished by calling for
>> + *    example disable_irq(). It is *not* sufficient to wait for the hardware
>> + *    fence here since this doesn't guarantee that all callback processing has
>> + *    finished.
> Okay, and perhaps we could mention drm_sched_fini() here.
>
>> + *
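
Good point, going to mention drm_sched_fini() as well. The teardown 
ordering is then roughly this, with my_disable_irqs() standing in for 
whatever stops completion processing in the driver:

    /* no more HW interrupts -> no more callback processing */
    my_disable_irqs(mydev);

    /* only now it is safe to tear down the scheduler */
    drm_sched_fini(&mydev->sched);
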
>> + * All callbacks the driver needs to implement are restricted by DMA-fence
>> + * signaling rules to guarantee deadlock free forward progress. This especially
>> + * means that for normal operation no memory can be allocated. All memory which
> Need a pointer to the "DMA-fence signaling rules" and also need to define "normal operation",
> or clarify it in the sentence.
>
>> + * is needed for pushing the job to the hardware must be allocated before
>> + * arming a job. It also means that no locks can be taken under which memory
>> + * might be allocated as well.
> Yes, that makes sense. I think this is generally the case for driver and firmware
> writers, and I'd think it's common sense, but it is good to have it in writing.
>
>> + *
>> + * Memory which is optional to allocate for device core dumping or debugging
>> + * *must* be allocated with GFP_NOWAIT and appropriate error handling taking if
>> + * that allocation fails. GFP_ATOMIC should only be used if absolutely
>> + * necessary since dipping into the special atomic reserves is usually not
>> + * justified for a GPU driver.
> Yes, that's well said. Good to have it here.
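
In code that is roughly this, with size and the dump helper made up:

    /* optional debug/coredump data, failing the allocation is fine */
    buf = kzalloc(size, GFP_NOWAIT);
    if (!buf)
            return;         /* just skip the dump */

    my_fill_coredump(buf, size);

GFP_ATOMIC would work here as well, but dipping into the atomic 
reserves just for a debug dump isn't justified.
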
>
>> + *
>> + * The scheduler also used to provided functionality for re-submitting jobs
> "The scheduler also used ..." --> "The scheduler is also used ..."
>
>> + * with replacing the hardware fence during reset handling. This functionality
>> + * is now marked as deprecated since this has proven to be fundamentally racy
>> + * and not compatible with DMA-fence rules and shouldn't be used in any new
>> + * code.
> I wouldn't use a contraction here. In order to emphasize that job re-submission
> is deprecated, I'd use:
> 	"and should not be used in any new code."

I only skimmed over the rest of the comments. Danilo already made some 
similar suggestions, especially on the style, which resulted in a v2 of 
the documentation.

I'll go over this later today and will probably send out a v3 on Monday.

Thanks,
Christian.

>
>>    */
>>   
>>   #include <linux/kthread.h>


