[PATCH v4 10/14] drm/panthor: Add the scheduler logical block
Steven Price
steven.price at arm.com
Wed Feb 14 16:48:53 UTC 2024
On 22/01/2024 16:30, Boris Brezillon wrote:
> This is the piece of software interacting with the FW scheduler, and
> taking care of some scheduling aspects when the FW comes up short of
> scheduling slots. Indeed, the FW only exposes a few slots, and the kernel
> has to give all submission contexts a chance to execute their jobs.
>
> The kernel-side scheduler is timeslice-based, with a round-robin queue
> per priority level.
>
> Job submission is handled with a 1:1 drm_sched_entity:drm_gpu_scheduler,
> allowing us to delegate the dependency tracking to the core.
>
> All the gory details should be documented inline.
>
> v4:
> - Check drmm_mutex_init() return code
> - s/drm_gem_vmap_unlocked/drm_gem_vunmap_unlocked/ in
> panthor_queue_put_syncwait_obj()
> - Drop unneeded WARN_ON() in cs_slot_sync_queue_state_locked()
> - Use atomic_xchg() instead of atomic_fetch_and(0)
> - Fix typos
> - Let panthor_kernel_bo_destroy() check for IS_ERR_OR_NULL() BOs
> - Defer TILER_OOM event handling to a separate workqueue to prevent
> deadlocks when the heap chunk allocation is blocked on mem-reclaim.
> This is just a temporary solution, until we add support for
> non-blocking/failable allocations
> - Pass the scheduler workqueue to drm_sched instead of instantiating
> a separate one (no longer needed now that heap chunk allocation
> happens on a dedicated wq)
> - Set WQ_MEM_RECLAIM on the scheduler workqueue, so we can handle
> job timeouts when the system is under mem pressure, and hopefully
> free up some memory retained by these jobs
>
> v3:
> - Rework the FW event handling logic to avoid races
> - Make sure MMU faults kill the group immediately
> - Use the panthor_kernel_bo abstraction for group/queue buffers
> - Make in_progress an atomic_t, so we can check it without the reset lock
> held
> - Don't limit the number of groups per context to the FW scheduler
> capacity. Fix the limit to 128 for now.
> - Add a panthor_job_vm() helper
> - Account for panthor_vm changes
> - Add our job fence as DMA_RESV_USAGE_WRITE to all external objects
> (was previously DMA_RESV_USAGE_BOOKKEEP). I don't get why, given
> we're supposed to be fully-explicit, but other drivers do that, so
> there must be a good reason
> - Account for drm_sched changes
> - Provide a panthor_queue_put_syncwait_obj()
> - Unconditionally return groups to their idle list in
> panthor_sched_suspend()
> - Condition of sched_queue_{,delayed_}work fixed to be only when a reset
> isn't pending or in progress.
> - Several typos in comments fixed.
>
> Co-developed-by: Steven Price <steven.price at arm.com>
> Signed-off-by: Steven Price <steven.price at arm.com>
> Signed-off-by: Boris Brezillon <boris.brezillon at collabora.com>
Looks good, a few minor NITs below, but otherwise.
Reviewed-by: Steven Price <steven.price at arm.com>
> ---
> drivers/gpu/drm/panthor/panthor_sched.c | 3500 +++++++++++++++++++++++
> drivers/gpu/drm/panthor/panthor_sched.h | 48 +
> 2 files changed, 3548 insertions(+)
> create mode 100644 drivers/gpu/drm/panthor/panthor_sched.c
> create mode 100644 drivers/gpu/drm/panthor/panthor_sched.h
>
> diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
> new file mode 100644
> index 000000000000..b57f4a5f5176
> --- /dev/null
> +++ b/drivers/gpu/drm/panthor/panthor_sched.c
> @@ -0,0 +1,3500 @@
> +// SPDX-License-Identifier: GPL-2.0 or MIT
> +/* Copyright 2023 Collabora ltd. */
> +
> +#include <drm/panthor_drm.h>
> +#include <drm/drm_drv.h>
> +#include <drm/drm_exec.h>
> +#include <drm/drm_gem_shmem_helper.h>
> +#include <drm/drm_managed.h>
> +#include <drm/gpu_scheduler.h>
> +
> +#include <linux/build_bug.h>
> +#include <linux/clk.h>
> +#include <linux/delay.h>
> +#include <linux/dma-mapping.h>
> +#include <linux/firmware.h>
> +#include <linux/interrupt.h>
> +#include <linux/io.h>
> +#include <linux/iopoll.h>
> +#include <linux/iosys-map.h>
> +#include <linux/module.h>
> +#include <linux/platform_device.h>
> +#include <linux/pm_runtime.h>
> +#include <linux/dma-resv.h>
> +
> +#include "panthor_sched.h"
> +#include "panthor_devfreq.h"
> +#include "panthor_device.h"
> +#include "panthor_gem.h"
> +#include "panthor_heap.h"
> +#include "panthor_regs.h"
> +#include "panthor_gpu.h"
> +#include "panthor_fw.h"
> +#include "panthor_mmu.h"
> +
> +/**
> + * DOC: Scheduler
> + *
> + * Mali CSF hardware adopts a firmware-assisted scheduling model, where
> + * the firmware takes care of scheduling aspects, to some extend.
NIT: s/extend/extent/
> + *
> + * The scheduling happens at the scheduling group level; each group
> + * contains 1 to N queues (N is FW/hardware dependent, and exposed
> + * through the firmware interface). Each queue is assigned a command
> + * stream ring buffer, which serves as a way to get jobs submitted to
> + * the GPU, among other things.
> + *
> + * The firmware can schedule a maximum of M groups (M is FW/hardware
> + * dependent, and exposed through the firmware interface). Past
> + * this maximum number of groups, the kernel must take care of
> + * rotating the groups passed to the firmware so every group gets
> + * a chance to have its queues scheduled for execution.
> + *
> + * The current implementation only supports kernel-mode queues.
> + * In other words, userspace doesn't have access to the ring-buffer.
> + * Instead, userspace passes indirect command stream buffers that are
> + * called from the queue ring-buffer by the kernel using a pre-defined
> + * sequence of command stream instructions to ensure the userspace driver
> + * always gets consistent results (cache maintenance,
> + * synchronization, ...).
> + *
> + * We rely on the drm_gpu_scheduler framework to deal with job
> + * dependencies and submission. Like any other driver dealing with a
> + * FW scheduler, we use the 1:1 entity:scheduler mode, such that each
> + * entity has its own job scheduler. When a job is ready to be executed
> + * (all its dependencies are met), it is pushed to the appropriate
> + * queue ring-buffer, and the group is scheduled for execution if it
> + * wasn't already active.
> + *
> + * Kernel-side group scheduling is timeslice-based. When we have fewer
> + * groups than there are slots, the periodic tick is disabled and we
> + * just let the FW schedule the active groups. When there are more
> + * groups than slots, we give each group a chance to execute stuff for
> + * a given amount of time, and then re-evaluate and pick new groups
> + * to schedule. The group selection algorithm is based on
> + * priority+round-robin.
> + *
> + * Even though user-mode queues are out of scope right now, the
> + * current design takes them into account by avoiding any guess on the
> + * group/queue state that would be based on information we wouldn't have
> + * if userspace was in charge of the ring-buffer. That's also one of the
> + * reasons we don't do 'cooperative' scheduling (encoding FW group slot
> + * reservation as dma_fence that would be returned from the
> + * drm_gpu_scheduler::prepare_job() hook, and treating group rotation as
> + * a queue of waiters, ordered by job submission order). This approach
> + * would work for kernel-mode queues, but would make user-mode queues a
> + * lot more complicated to retrofit.
> + */
> +
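Side note for readers: the priority+round-robin selection described above
boils down to something like the sketch below. The names are made up for
illustration (the real tick work in this patch also deals with idle/blocked
groups, slot eviction, etc.):

        /*
         * Illustrative only: pick up to csg_slot_count groups, highest
         * priority first, in list (round-robin) order.
         */
        static void pick_groups(struct list_head *runnable, /* one list per priority */
                                u32 prio_count, u32 csg_slot_count,
                                struct list_head *picked)
        {
                u32 picked_count = 0;
                int prio;

                for (prio = prio_count - 1; prio >= 0; prio--) {
                        struct panthor_group *group, *tmp;

                        list_for_each_entry_safe(group, tmp, &runnable[prio], run_node) {
                                if (picked_count == csg_slot_count)
                                        return;

                                list_move_tail(&group->run_node, picked);
                                picked_count++;
                        }
                }
        }

Rotation then just means re-queuing expired groups at the tail of their
priority list before the next pick.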
<snip>
> +
> +/**
> + * struct panthor_group - Scheduling group object
> + */
> +struct panthor_group {
> + /** @refcount: Reference count */
> + struct kref refcount;
> +
> + /** @ptdev: Device. */
> + struct panthor_device *ptdev;
> +
> + /** @vm: VM bound to the group. */
> + struct panthor_vm *vm;
> +
> + /** @compute_core_mask: Mask of shader cores that can be used for compute jobs. */
> + u64 compute_core_mask;
> +
> + /** @fragment_core_mask: Mask of shader cores that can be used for fragment jobs. */
> + u64 fragment_core_mask;
> +
> + /** @tiler_core_mask: Mask of tiler cores that can be used for tiler jobs. */
> + u64 tiler_core_mask;
> +
> + /** @max_compute_cores: Maximum number of shader cores used for compute jobs. */
> + u8 max_compute_cores;
> +
> + /** @max_fragment_cores: Maximum number of shader cores used for fragment jobs. */
> + u8 max_fragment_cores;
> +
> + /** @max_tiler_cores: Maximum number of tiler cores used for tiler jobs. */
> + u8 max_tiler_cores;
> +
> + /** @priority: Group priority (check panthor_csg_priority). */
> + u8 priority;
> +
> + /** @blocked_queues: Bitmask reflecting the blocked queues. */
> + u32 blocked_queues;
> +
> + /** @idle_queues: Bitmask reflecting the blocked queues. */
s/blocked/idle/
> + u32 idle_queues;
> +
> + /** @fatal_lock: Lock used to protect access to fatal fields. */
> + spinlock_t fatal_lock;
> +
> + /** @fatal_queues: Bitmask reflecting the queues that hit a fatal exception. */
> + u32 fatal_queues;
> +
> + /** @tiler_oom: Mask of queues that have a tiler OOM event to process. */
> + atomic_t tiler_oom;
> +
> + /** @queue_count: Number of queues in this group. */
> + u32 queue_count;
> +
> + /** @queues: Queues owned by this group. */
> + struct panthor_queue *queues[MAX_CS_PER_CSG];
> +
> + /**
> + * @csg_id: ID of the FW group slot.
> + *
> + * -1 when the group is not scheduled/active.
> + */
> + int csg_id;
> +
> + /**
> + * @destroyed: True when the group has been destroyed.
> + *
> + * If a group is destroyed it becomes useless: no further jobs can be submitted
> + * to its queues. We simply wait for all references to be dropped so we can
> + * release the group object.
> + */
> + bool destroyed;
> +
> + /**
> + * @timedout: True when a timeout occurred on any of the queues owned by
> + * this group.
> + *
> + * Timeouts can be reported by drm_sched or by the FW. In any case, any
> + * timeout situation is unrecoverable, and the group becomes useless.
> + * We simply wait for all references to be dropped so we can release the
> + * group object.
> + */
> + bool timedout;
> +
> + /**
> + * @syncobjs: Pool of per-queue synchronization objects.
> + *
> + * One sync object per queue. The position of the sync object is
> + * determined by the queue index.
> + */
> + struct panthor_kernel_bo *syncobjs;
> +
> + /** @state: Group state. */
> + enum panthor_group_state state;
> +
> + /**
> + * @suspend_buf: Suspend buffer.
> + *
> + * Stores the state of the group and its queues when a group is suspended.
> + * Used at resume time to restore the group in its previous state.
> + *
> + * The size of the suspend buffer is exposed through the FW interface.
> + */
> + struct panthor_kernel_bo *suspend_buf;
> +
> + /**
> + * @protm_suspend_buf: Protection mode suspend buffer.
> + *
> + * Stores the state of the group and its queues when a group that's in
> + * protection mode is suspended.
> + *
> + * Used at resume time to restore the group in its previous state.
> + *
> + * The size of the protection mode suspend buffer is exposed through the
> + * FW interface.
> + */
> + struct panthor_kernel_bo *protm_suspend_buf;
> +
> + /** @sync_upd_work: Work used to check/signal job fences. */
> + struct work_struct sync_upd_work;
> +
> + /** @tiler_oom_work: Work used to process tiler OOM events happening on this group. */
> + struct work_struct tiler_oom_work;
> +
> + /** @term_work: Work used to finish the group termination procedure. */
> + struct work_struct term_work;
> +
> + /**
> + * @release_work: Work used to release group resources.
> + *
> + * We need to postpone the group release to avoid a deadlock when
> + * the last ref is released in the tick work.
> + */
> + struct work_struct release_work;
> +
> + /**
> + * @run_node: Node used to insert the group in the
> + * panthor_scheduler::groups::{runnable,idle} and
> + * panthor_scheduler::reset.stopped_groups lists.
> + */
> + struct list_head run_node;
> +
> + /**
> + * @wait_node: Node used to insert the group in the
> + * panthor_scheduler::groups::waiting list.
> + */
> + struct list_head wait_node;
> +};
> +
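Regarding the deferred TILER_OOM handling mentioned in the v4 changelog: the
@tiler_oom mask + @tiler_oom_work pair above gets consumed roughly like this
(simplified sketch, the heap chunk allocation and CS resume are elided):

        static void group_tiler_oom_work(struct work_struct *work)
        {
                struct panthor_group *group =
                        container_of(work, struct panthor_group, tiler_oom_work);
                u32 tiler_oom = atomic_xchg(&group->tiler_oom, 0);

                while (tiler_oom) {
                        u32 cs_id = ffs(tiler_oom) - 1;

                        /* Grow the tiler heap used by group->queues[cs_id] and
                         * resume the command stream (elided).
                         */
                        tiler_oom &= ~BIT(cs_id);
                }

                /* Balance the reference taken by group_queue_work(). */
                group_put(group);
        }

Doing the allocation from a worker rather than from the FW event handling path
is what prevents the deadlock when the heap chunk allocation blocks on
mem-reclaim.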
> +/**
> + * group_queue_work() - Queue a group work
> + * @group: Group to queue the work for.
> + * @wname: Work name.
> + *
> + * Grabs a ref and queues a work item to the scheduler workqueue. If
> + * the work was already queued, we release the reference we grabbed.
> + *
> + * Work callbacks must release the reference we grabbed here.
> + */
> +#define group_queue_work(group, wname) \
> + do { \
> + group_get(group); \
> + if (!queue_work((group)->ptdev->scheduler->wq, &(group)->wname ## _work)) \
> + group_put(group); \
> + } while (0)
> +
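For readers skimming the patch: with the sync update work declared in struct
panthor_group above, usage ends up looking like

        group_queue_work(group, sync_upd);

and, as the comment says, the matching work handler must end with a
group_put() to balance the reference taken here.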
> +/**
> + * sched_queue_work() - Queue a scheduler work.
> + * @sched: Scheduler object.
> + * @wname: Work name.
> + *
> + * Conditionally queues a scheduler work if no reset is pending/in-progress.
> + */
> +#define sched_queue_work(sched, wname) \
> + do { \
> + if (!atomic_read(&sched->reset.in_progress) && \
> + !panthor_device_reset_is_pending((sched)->ptdev)) \
> + queue_work((sched)->wq, &(sched)->wname ## _work); \
> + } while (0)
> +
> +/**
> + * sched_queue_delayed_work() - Queue a scheduler delayed work.
> + * @sched: Scheduler object.
> + * @wname: Work name.
> + * @delay: Work delay in jiffies.
> + *
> + * Conditionally queues a scheduler delayed work if no reset is
> + * pending/in-progress.
> + */
> +#define sched_queue_delayed_work(sched, wname, delay) \
> + do { \
> + if (!atomic_read(&sched->reset.in_progress) && \
> + !panthor_device_reset_is_pending((sched)->ptdev)) \
> + mod_delayed_work((sched)->wq, &(sched)->wname ## _work, delay); \
> + } while (0)
> +
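Same idea for the scheduler-level works, e.g. re-arming the scheduling tick
(assuming the tick is a delayed work living on the scheduler object, which is
snipped from this hunk, and treating the timeslice as a placeholder value):

        sched_queue_delayed_work(sched, tick, msecs_to_jiffies(timeslice_ms));

Both macros silently skip the queueing while a reset is pending or in
progress, which is what the v3 changelog entry about
sched_queue_{,delayed_}work refers to.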
> +/*
> + * We currently set the maximum number of groups per file to an
> + * arbitrarily low value. But this can be updated if we need more.
> + */
> +#define MAX_GROUPS_PER_POOL 128
> +
> +/**
> + * struct panthor_group_pool - Group pool
> + *
> + * Each file gets assigned a group pool.
> + */
> +struct panthor_group_pool {
> + /** @xa: Xarray used to manage group handles. */
> + struct xarray xa;
> +};
> +
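For context, allocating a handle out of this pool is then just an xa_alloc()
against the limit above, something along these lines (sketch only, and
pfile->groups is a guess at where the pool hangs off the file object):

        struct panthor_group_pool *gpool = pfile->groups;
        u32 gid;
        int ret;

        /* Handles start at 1, keeping 0 as an invalid handle. */
        ret = xa_alloc(&gpool->xa, &gid, group,
                       XA_LIMIT(1, MAX_GROUPS_PER_POOL), GFP_KERNEL);
        if (ret)
                return ret;

with the xarray initialized with XA_FLAGS_ALLOC1 at pool creation time.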
> +/**
> + * struct panthor_job - Used to manage a GPU job
> + */
> +struct panthor_job {
> + /** @base: Inherit from drm_sched_job. */
> + struct drm_sched_job base;
> +
> + /** @refcount: Reference count. */
> + struct kref refcount;
> +
> + /** @group: Group of the queue this job will be pushed to. */
> + struct panthor_group *group;
> +
> + /** @queue_idx: Index of the queue inside @group. */
> + u32 queue_idx;
> +
> + /** @call_info: Information about the userspace command stream call. */
> + struct {
> + /** @start: GPU address of the userspace command stream. */
> + u64 start;
> +
> + /** @size: Size of the userspace command stream. */
> + u32 size;
> +
> + /**
> + * @latest_flush: Flush ID at the time the userspace command
> + * stream was built.
> + *
> + * Needed for the flush reduction mechanism.
> + */
> + u32 latest_flush;
> + } call_info;
> +
> + /** @ringbuf: Position of this job in the ring buffer. */
> + struct {
> + /** @start: Start offset. */
> + u64 start;
> +
> + /** @end: End offset. */
> + u64 end;
> + } ringbuf;
> +
> + /**
> + * @node: Used to insert the job in the panthor_queue::fence_ctx::in_flight_jobs
> + * list.
> + */
> + struct list_head node;
> +
> + /** @done_fence: Fence signaled when the job is finished or cancelled. */
> + struct dma_fence *done_fence;
> +};
> +
> +static void
> +panthor_queue_put_syncwait_obj(struct panthor_queue *queue)
> +{
> + if (queue->syncwait.kmap) {
> + struct iosys_map map = IOSYS_MAP_INIT_VADDR(queue->syncwait.kmap);
> +
> + drm_gem_vunmap_unlocked(queue->syncwait.obj, &map);
> + queue->syncwait.kmap = NULL;
> + }
> +
> + drm_gem_object_put(queue->syncwait.obj);
> + queue->syncwait.obj = NULL;
> +}
> +
> +static void *
> +panthor_queue_get_syncwait_obj(struct panthor_group *group, struct panthor_queue *queue)
> +{
> + struct panthor_device *ptdev = group->ptdev;
> + struct panthor_gem_object *bo;
> + struct iosys_map map;
> + int ret;
> +
> + if (queue->syncwait.kmap)
> + return queue->syncwait.kmap + queue->syncwait.offset;
> +
> + bo = panthor_vm_get_bo_for_va(group->vm,
> + queue->syncwait.gpu_va,
> + &queue->syncwait.offset);
> + if (drm_WARN_ON(&ptdev->base, IS_ERR_OR_NULL(bo)))
> + goto err_put_syncwait_obj;
> +
> + queue->syncwait.obj = &bo->base.base;
> + ret = drm_gem_vmap_unlocked(queue->syncwait.obj, &map);
> + if (drm_WARN_ON(&ptdev->base, ret))
> + goto err_put_syncwait_obj;
> +
> + queue->syncwait.kmap = map.vaddr;
> + if (drm_WARN_ON(&ptdev->base, !queue->syncwait.kmap))
> + goto err_put_syncwait_obj;
> +
> + return queue->syncwait.kmap + queue->syncwait.offset;
> +
> +err_put_syncwait_obj:
> + panthor_queue_put_syncwait_obj(queue);
> + return NULL;
> +}
> +
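The kmap returned here is what lets the kernel-side scheduler peek at the sync
object a blocked queue is waiting on without asking the FW, i.e. roughly (the
reference value and comparison details live in the parts of struct
panthor_queue snipped above, so treat 'ref' as a placeholder):

        u64 *sync_pt = panthor_queue_get_syncwait_obj(group, queue);

        if (!sync_pt)
                return false;   /* mapping failed, consider the wait unsatisfied */

        /* The queue becomes unblocked once the sync point reaches the reference. */
        return *sync_pt >= ref;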
> +static void group_free_queue(struct panthor_group *group, struct panthor_queue *queue)
> +{
> + if (IS_ERR_OR_NULL(queue))
> + return;
> +
> + if (queue->entity.fence_context)
> + drm_sched_entity_destroy(&queue->entity);
> +
> + if (queue->scheduler.ops)
> + drm_sched_fini(&queue->scheduler);
> +
> + panthor_queue_put_syncwait_obj(queue);
> +
> + panthor_kernel_bo_destroy(group->vm, queue->ringbuf);
> + panthor_kernel_bo_destroy(panthor_fw_vm(group->ptdev), queue->iface.mem);
> +
> + kfree(queue);
> +}
> +
> +static void group_release_work(struct work_struct *work)
> +{
> + struct panthor_group *group = container_of(work,
> + struct panthor_group,
> + release_work);
> + struct panthor_device *ptdev = group->ptdev;
> + u32 i;
> +
> + for (i = 0; i < group->queue_count; i++)
> + group_free_queue(group, group->queues[i]);
> +
> + panthor_kernel_bo_destroy(panthor_fw_vm(ptdev), group->suspend_buf);
> + panthor_kernel_bo_destroy(panthor_fw_vm(ptdev), group->protm_suspend_buf);
> +
> + if (!IS_ERR_OR_NULL(group->syncobjs))
No need for this check - panthor_kernel_bo_destroy() does it.
> + panthor_kernel_bo_destroy(group->vm, group->syncobjs);
> +
> + panthor_vm_put(group->vm);
> + kfree(group);
> +}
Thanks,
Steve