[Intel-xe] [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans

Tue Apr 4 19:07:20 UTC 2023

On Tue, 4 Apr 2023 at 19:29, Tvrtko Ursulin
<tvrtko.ursulin at linux.intel.com> wrote:
>
>
> On 04/04/2023 14:52, Matthew Brost wrote:
> > On Tue, Apr 04, 2023 at 10:43:03AM +0100, Tvrtko Ursulin wrote:
> >>
> >> On 04/04/2023 01:22, Matthew Brost wrote:
> >>> Hello,
> >>>
> >>> As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> >>> have been asked to merge our common DRM scheduler patches first as well
> >>> as develop a common solution for long running workloads with the DRM
> >>> scheduler. This RFC series is our first attempt at doing this. We
> >>> welcome any and all feedback.
> >>>
> >>> This can we thought of as 4 parts detailed below.
> >>>
> >>> - DRM scheduler changes for 1 to 1 relationship between scheduler and
> >>> entity (patches 1-3)
> >>>
> >>> In Xe all of the scheduling of jobs is done by a firmware scheduler (the
> >>> GuC) which is a new paradigm WRT to the DRM scheduler and presents
> >>> severals problems as the DRM was originally designed to schedule jobs on
> >>> hardware queues. The main problem being that DRM scheduler expects the
> >>> submission order of jobs to be the completion order of jobs even across
> >>> multiple entities. This assumption falls apart with a firmware scheduler
> >>> as a firmware scheduler has no concept of jobs and jobs can complete out
> >>> of order. A novel solution for was originally thought of by Faith during
> >>> the initial prototype of Xe, create a 1 to 1 relationship between scheduler
> >>> and entity. I believe the AGX driver [3] is using this approach and
> >>> Boris may use approach as well for the Mali driver [4].
> >>>
> >>> To support a 1 to 1 relationship we move the main execution function
> >>> from a kthread to a work queue and add a new scheduling mode which
> >>> bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
> >>> The new scheduling mode should unify all drivers usage with a 1 to 1
> >>> relationship and can be thought of as using scheduler as a dependency /
> >>> infligt job tracker rather than a true scheduler.
> >>
> >> Once you add capability for a more proper 1:1 via
> >> DRM_SCHED_POLICY_SINGLE_ENTITY, do you still have further need to replace
> >> kthreads with a wq?
> >>
> >> Or in other words, what purpose does the offloading of a job picking code to
> >> a separate execution context serve? Could it be done directly in the 1:1
> >> mode and leave kthread setup for N:M?
> >>
> >
> > Addressed the other two on my reply to Christian...
> >
> > For this one basically the concept of a single entity point IMO is a
> > very good concept which I'd like to keep. But most important reason
> > being the main execution thread (now worker) is kicked when a dependency
> > for a job is resolved, dependencies are dma-fences signaled via a
> > callback, and these call backs can be signaled in IRQ contexts. We
> > absolutely do not want to enter the backend in an IRQ context for a
> > variety of reasons.
>
> Sounds like a fair enough requirement but if drivers will not be
> comfortable with the wq conversion, it is probably possible to introduce
> some vfuncs for the 1:1 case which would allow scheduler users override
> the scheduler wakeup and select a special "pick one job" path. That
> could allow 1:1 users do their thing, leaving rest as is. I mean you
> already have the special single entity scheduler, you'd just need to add
> some more specialization on the init, wake up, etc paths.
>
> And I will mention once more that I find a wq item with a loop such as:
>
>         while (!READ_ONCE(sched->pause_run_wq)) {
>         ...
>
> A bit dodgy. If you piggy back on any system_wq it smells of system wide
> starvation so for me any proposal with an option to use a system shared
> wq is a no go.

Yeah I think the argument for wq based scheduler would need a
per-drm_scheduler wq, like we currently have a per-scheduler kthread.
It might still need some serious work to replace the
kthread_stop/start() with a wq-native equivalent (because really this
is the tricky stuff we shouldn't hand-roll unless someone is willing
to write a few papers on the lockless design that's done), but would
look a bunch more reasonable. Having a per-sched workqueue might also
help with the big sched_stop/start/fini state transitions, which I
really think should still go over all the per-entity schedulers even
in the 1:1 case (because otherwise you get some funky code in drivers
that do the iterations needed, which probably tosses the fairly nice
design the current scheduler has by relying on the kthread_stop/start
primitives for this.
-Daniel

>
> Regards,
>
> Tvrtko
>
>
> >> Apart from those design level questions, low level open IMO still is that
> >> default fallback of using the system_wq has the potential to affect latency
> >> for other drivers. But that's for those driver owners to approve.
> >>
> >> Regards,
> >>
> >> Tvrtko
> >>
> >>> - Generic messaging interface for DRM scheduler
> >>>
> >>> Idea is to be able to communicate to the submission backend with in band
> >>> (relative to main execution function) messages. Messages are backend
> >>> defined and flexable enough for any use case. In Xe we use these
> >>> messages to clean up entites, set properties for entites, and suspend /
> >>> resume execution of an entity [5]. I suspect other driver can leverage
> >>> this messaging concept too as it a convenient way to avoid races in the
> >>> backend.
> >>>
> >>> - Support for using TDR for all error paths of a scheduler / entity
> >>>
> >>> Fix a few races / bugs, add function to dynamically set the TDR timeout.
> >>>
> >>> - Annotate dma-fences for long running workloads.
> >>>
> >>> The idea here is to use dma-fences only as sync points within the
> >>> scheduler and never export them for long running workloads. By
> >>> annotating these fences as long running we ensure that these dma-fences
> >>> are never used in a way that breaks the dma-fence rules. A benefit of
> >>> thus approach is the scheduler can still safely flow control the
> >>> execution ring buffer via the job limit without breaking the dma-fence
> >>> rules.
> >>>
> >>> Again this a first draft and looking forward to feedback.
> >>>
> >>> Enjoy - Matt
> >>>
> >>> [1] https://gitlab.freedesktop.org/drm/xe/kernel
> >>> [2] https://patchwork.freedesktop.org/series/112188/
> >>> [3] https://patchwork.freedesktop.org/series/114772/
> >>> [4] https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1
> >>> [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031
> >>>
> >>> Matthew Brost (8):
> >>>     drm/sched: Convert drm scheduler to use a work queue rather than
> >>>       kthread
> >>>     drm/sched: Move schedule policy to scheduler / entity
> >>>     drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy
> >>>     drm/sched: Add generic scheduler message interface
> >>>     drm/sched: Start run wq before TDR in drm_sched_start
> >>>     drm/sched: Submit job before starting TDR
> >>>     drm/sched: Add helper to set TDR timeout
> >>>     drm/syncobj: Warn on long running dma-fences
> >>>
> >>> Thomas Hellström (2):
> >>>     dma-buf/dma-fence: Introduce long-running completion fences
> >>>     drm/sched: Support long-running sched entities
> >>>
> >>>    drivers/dma-buf/dma-fence.c                 | 142 +++++++---
> >>>    drivers/dma-buf/dma-resv.c                  |   5 +
> >>>    drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
> >>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  15 +-
> >>>    drivers/gpu/drm/drm_syncobj.c               |   5 +-
> >>>    drivers/gpu/drm/etnaviv/etnaviv_sched.c     |   5 +-
> >>>    drivers/gpu/drm/lima/lima_sched.c           |   5 +-
> >>>    drivers/gpu/drm/msm/adreno/adreno_device.c  |   6 +-
> >>>    drivers/gpu/drm/msm/msm_ringbuffer.c        |   5 +-
> >>>    drivers/gpu/drm/panfrost/panfrost_job.c     |   5 +-
> >>>    drivers/gpu/drm/scheduler/sched_entity.c    | 127 +++++++--
> >>>    drivers/gpu/drm/scheduler/sched_fence.c     |   6 +-
> >>>    drivers/gpu/drm/scheduler/sched_main.c      | 278 +++++++++++++++-----
> >>>    drivers/gpu/drm/v3d/v3d_sched.c             |  25 +-
> >>>    include/drm/gpu_scheduler.h                 | 130 +++++++--
> >>>    include/linux/dma-fence.h                   |  60 ++++-
> >>>    16 files changed, 649 insertions(+), 184 deletions(-)
> >>>


-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch