[Intel-xe] [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans
Daniel Vetter
daniel at ffwll.ch
Tue Apr 11 14:13:14 UTC 2023
On Tue, Apr 11, 2023 at 11:02:55AM +0200, Christian König wrote:
> The point is that this not only requires some work in the drm_scheduler,
> but rather that it then makes little sense to use the drm_scheduler in the
> first place.
>
> The whole point of the drm_scheduler is to provide dma_fence implementation
> for the submitted jobs.
>
> We also have dependency handling, but as Daniel and I said this can be
> easily extracted into a separate object/component.
Uh that's not what I meant. My take is that minimally patching drm/sched
to make the out-fence either optional, or complete it right away, is the
simplest way to get at the dependency handling. For me at least the major
part of drm/sched is the dep handling and timeout stuff. And the latter can
be reused with some glue to handle preempt timeouts too and other things:
since the tdr is a work struct you can just issue any other gpu timeouts on
the same workqueue, using roughly the same pattern as the ->timed_out hook,
and it'll just work.
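
To make that concrete, here is a minimal driver-side sketch of the pattern
(the exec-queue struct, its fields and the kill/resubmit helper are made-up
illustrative names; only the "reuse the TDR's ordered workqueue for other gpu
timeouts" idea is the point):

#include <linux/workqueue.h>
#include <drm/gpu_scheduler.h>

struct my_exec_queue {
	struct drm_gpu_scheduler sched;		/* job TDR handled by drm_sched */
	struct delayed_work preempt_work;	/* extra, driver-defined timeout */
	struct workqueue_struct *timeout_wq;	/* ordered wq shared with the TDR */
};

static void my_queue_kill_or_resubmit(struct my_exec_queue *q);

/* INIT_DELAYED_WORK(&q->preempt_work, my_preempt_timed_out) at queue init. */
static void my_preempt_timed_out(struct work_struct *w)
{
	struct delayed_work *dw = to_delayed_work(w);
	struct my_exec_queue *q =
		container_of(dw, struct my_exec_queue, preempt_work);

	/* Same shape as drm_sched_backend_ops.timed_out: stop the queue,
	 * poke the firmware, then resubmit or ban the context. */
	my_queue_kill_or_resubmit(q);
}

static void my_queue_arm_preempt_timeout(struct my_exec_queue *q,
					 unsigned long timeout_jiffies)
{
	/* Queued on the same ordered workqueue as the scheduler's TDR, so
	 * the two timeout handlers never run concurrently for this queue. */
	queue_delayed_work(q->timeout_wq, &q->preempt_work, timeout_jiffies);
}
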
The entire "oh we also make sure your hw fence doesn't leak into public
fences and cause lifetime mayhem" seems pretty minor. And maybe also
something we want to replicate for the preempt-ctx dma_fence that some
long-running contexts need (but more as part of drm_sched_entity I guess).
We can of course bikeshed how much flexibility really should be in the
different parts of drm/sched, but imo that's a bikeshed.
-Daniel
>
> Regards,
> Christian.
>
> > On 07.04.23 at 02:20, Zeng, Oak wrote:
> > So this series basically goes with option 2. The part of option 2 that makes me uncomfortable is: if dma-fence doesn't work for long running workloads, why do we generate it in the first place? As long as a dma-fence is generated, it will become a source of confusion in the future, no matter how much you annotate or document it. So if we decide to go with option 2, the bottom line is: don't generate a dma-fence for long running workloads during job submission. This requires some rework in the drm scheduler.
> >
> > The cleanest solution to me is option 3. Dma-fence is a very old technology. When it was created, no gpu supported page faults. Obviously this is not a good technology for modern gpus with page fault support. I think the best way is to create a new scheduler and dependency tracking mechanism that works for both page-fault-enabled and page-fault-disabled contexts. I think this matches what Christian said below. Maybe nobody thinks this is easy?
> >
> > Thanks,
> > Oak
> >
> > > -----Original Message-----
> > > From: Brost, Matthew <matthew.brost at intel.com>
> > > Sent: April 5, 2023 2:53 PM
> > > To: Zeng, Oak <oak.zeng at intel.com>
> > > Cc: Christian König <christian.koenig at amd.com>; Vetter, Daniel
> > > <daniel.vetter at intel.com>; Thomas Hellström
> > > <thomas.hellstrom at linux.intel.com>; dri-devel at lists.freedesktop.org; intel-
> > > xe at lists.freedesktop.org; robdclark at chromium.org; airlied at linux.ie;
> > > lina at asahilina.net; boris.brezillon at collabora.com; faith.ekstrand at collabora.com
> > > Subject: Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload
> > > plans
> > >
> > > On Wed, Apr 05, 2023 at 12:06:53PM -0600, Zeng, Oak wrote:
> > > > Hi,
> > > >
> > > > Using dma-fence for completion/dependency tracking for long-run
> > > > workloads (more precisely, on-demand paging / page fault enabled
> > > > workloads) can cause deadlock. That seems to be the significant issue
> > > > here. Other issues, such as the drm scheduler completion order
> > > > implication etc., are minor and can be solved inside the framework of
> > > > the drm scheduler. We need to evaluate the paths below:
> > > > 1) Still use the drm scheduler for job submission, and use dma-fence
> > > > for job completion waiting / dependency tracking. This is the solution
> > > > proposed in this series. Annotate dma-fences for long-run workloads:
> > > > user can still wait on a dma-fence for job completion but can't wait
> > > > on a dma-fence while holding any memory management locks. We still use
> > > > dma-fence for dependency tracking, but it very easily runs into
> > > > deadlock when on-demand paging is in the picture. The annotation helps
> > > > us detect deadlocks but does not solve them. Seems *not* a complete
> > > > solution: it is almost impossible to completely avoid dependency
> > > > deadlocks in a complex runtime environment.
> > > No one can wait on an LR fence, so it is impossible to deadlock. The
> > > annotations enforce this. Literally this is only for flow controlling the
> > > ring / holding pending jobs in the DRM scheduler list.
> > >
> > > > 2) Still use the drm scheduler but don't use dma-fence for completion
> > > > signaling and dependency tracking. This way we still get some functions
> > > > for free (reset, error handling, ring flow control, as Matt said) from
> > > > the drm scheduler, but push the dependency/completion tracking
> > > > completely to user space using techniques such as user space fences.
> > > > User space doesn't have a chance to wait on a fence while holding a
> > > > kernel memory management lock, thus the dma-fence deadlock issue is
> > > > solved.
> > > We use user space fence for syncs.
> > >
> > > > 3) Completely discard the drm scheduler and dma-fence for long-run
> > > > workloads. Use user queues/doorbells for super fast submission,
> > > > directly interacting with the fw scheduler. Use user fences for
> > > > completion/dependency tracking.
> > > This is a hard no from me, I want 1 submission path in Xe. Either we use
> > > the DRM scheduler or we don't.
> > >
> > > Matt
> > >
> > > > Thanks,
> > > > Oak
> > > >
> > > > > -----Original Message-----
> > > > > From: Christian König <christian.koenig at amd.com>
> > > > > Sent: April 5, 2023 3:30 AM
> > > > > To: Brost, Matthew <matthew.brost at intel.com>; Zeng, Oak
> > > > > <oak.zeng at intel.com>
> > > > > Cc: dri-devel at lists.freedesktop.org; intel-xe at lists.freedesktop.org;
> > > > > robdclark at chromium.org; thomas.hellstrom at linux.intel.com;
> > > airlied at linux.ie;
> > > > > lina at asahilina.net; boris.brezillon at collabora.com;
> > > faith.ekstrand at collabora.com
> > > > > Subject: Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload
> > > > > plans
> > > > >
> > > > > On 04.04.23 at 20:08, Matthew Brost wrote:
> > > > > > On Tue, Apr 04, 2023 at 12:02:03PM -0600, Zeng, Oak wrote:
> > > > > > > Hi Matt, Thomas,
> > > > > > >
> > > > > > > Some very bold out-of-the-box thinking in this area:
> > > > > > >
> > > > > > > 1. So you want to use the drm scheduler and dma-fence for long
> > > > > > > running workloads. Why do you want to do this in the first place?
> > > > > > > What is the benefit? The drm scheduler is pretty much a software
> > > > > > > scheduler. Modern gpus have a scheduler built at the fw/hw level;
> > > > > > > as you said below, for intel this is the GuC. Can the xe driver
> > > > > > > just directly submit jobs to the GuC, bypassing the drm scheduler?
> > > > > > If we did that, now we have 2 paths for dependency tracking, flow
> > > > > > controlling the ring, resets / error handling / backend submission
> > > > > > implementations. We don't want this.
> > > > > Well exactly that's the point: Why?
> > > > >
> > > > > As far as I can see those are two completely distinct use cases, so you
> > > > > absolutely do want two completely distinct implementations for this.
> > > > >
> > > > > > > 2. Using dma-fence for long run workloads: I am well aware that a
> > > > > > > page fault (and the consequent memory allocation / lock acquiring
> > > > > > > to fix the fault) can cause deadlock for a dma-fence wait. But I
> > > > > > > am not convinced that dma-fence can't be used purely because of
> > > > > > > the nature of the workload, i.e. that it runs very long
> > > > > > > (indefinitely). I did the math: the dma_fence_wait_timeout
> > > > > > > function's third param is the timeout, which is a signed long
> > > > > > > type. If HZ is 1000, this is about 23 days. If 23 days is not long
> > > > > > > enough, can we just change the timeout parameter to signed 64 bits
> > > > > > > so it is much longer than our lifetime...
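
(For reference, a rough version of that arithmetic plus the existing "wait
forever" spelling; an illustrative aside, not part of the patches. The ~23
days figure corresponds to a 32-bit signed jiffies count; on 64-bit kernels
the signed long timeout is already far larger.)

#include <linux/dma-fence.h>
#include <linux/sched.h>	/* MAX_SCHEDULE_TIMEOUT == LONG_MAX */

/*
 * (2^31 - 1) jiffies / 1000 HZ  ~= 2.15e6 s  ~= 24.8 days
 * (2^63 - 1) jiffies / 1000 HZ  ~= 9.2e15 s  ~= 292 million years
 *
 * So an effectively unbounded wait already exists today:
 */
static long wait_forever(struct dma_fence *fence)
{
	return dma_fence_wait_timeout(fence, true /* interruptible */,
				      MAX_SCHEDULE_TIMEOUT);
}
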
> > > > > > > So my main argument that we can't use dma-fence for long-run
> > > > > > > workloads is not that the workload runs very long, but rather
> > > > > > > that we use page faults for long-run workloads. If we enabled
> > > > > > > page faults for short-run workloads, we couldn't use dma-fence
> > > > > > > either. Page faults are the key thing here.
> > > > > > > Now, since we use page faults, which are *fundamentally* at odds
> > > > > > > with the dma-fence design, why not just introduce an independent
> > > > > > > concept such as a user-fence instead of extending the existing
> > > > > > > dma-fence?
> > > > > > > I like a unified design. If the drm scheduler and dma-fence can
> > > > > > > be extended to work for everything, it is beautiful. But it seems
> > > > > > > we have some fundamental problems here.
> > > > > > Thomas's patches turn a dma-fence into a KMD sync point (e.g. we just
> > > > > > use the signal / CB infrastructure) and enforce that we don't use these
> > > > > > dma-fences from the scheduler in memory reclaim paths or export these to
> > > > > > user space or other drivers. Think of this mode as a SW-only fence.
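
(A minimal sketch of what "use the signal / CB infrastructure as a KMD-only
sync point" can look like; the lr_job struct and handler names are
illustrative, not the actual patches.)

#include <linux/dma-fence.h>

struct lr_job {
	struct dma_fence_cb cb;
	/* driver-private bookkeeping, e.g. ring space accounting */
};

static void lr_job_done(struct dma_fence *f, struct dma_fence_cb *cb)
{
	struct lr_job *job = container_of(cb, struct lr_job, cb);

	/* Purely KMD-internal: release the ring space this job consumed,
	 * wake the submission worker, etc.  The fence itself is never added
	 * to a dma_resv, syncobj or sync_file. */
	(void)job;
}

static void lr_job_arm(struct lr_job *job, struct dma_fence *hw_fence)
{
	/* A non-zero return (-ENOENT) means the fence already signaled and
	 * the callback was not installed, so run the completion path now. */
	if (dma_fence_add_callback(hw_fence, &job->cb, lr_job_done))
		lr_job_done(hw_fence, &job->cb);
}
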
> > > > > Yeah and I truly think this is a really bad idea.
> > > > >
> > > > > The signal/CB infrastructure in the dma_fence turned out to be the
> > > > > absolute nightmare I initially predicted. Sorry to say that, but in
> > > > > this case the "I've told you so" is appropriate in my opinion.
> > > > >
> > > > > If we need infrastructure for long running dependency tracking we should
> > > > > encapsulate that in a new framework and not try to mangle the existing
> > > > > code for something it was never intended for.
> > > > >
> > > > > Christian.
> > > > >
> > > > > > Matt
> > > > > >
> > > > > > > Thanks,
> > > > > > > Oak
> > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: dri-devel <dri-devel-bounces at lists.freedesktop.org> On Behalf Of
> > > > > > > > Matthew Brost
> > > > > > > > Sent: April 3, 2023 8:22 PM
> > > > > > > > To: dri-devel at lists.freedesktop.org; intel-xe at lists.freedesktop.org
> > > > > > > > Cc: robdclark at chromium.org; thomas.hellstrom at linux.intel.com;
> > > > > airlied at linux.ie;
> > > > > > > > lina at asahilina.net; boris.brezillon at collabora.com; Brost, Matthew
> > > > > > > > <matthew.brost at intel.com>; christian.koenig at amd.com;
> > > > > > > > faith.ekstrand at collabora.com
> > > > > > > > Subject: [RFC PATCH 00/10] Xe DRM scheduler and long running workload
> > > > > plans
> > > > > > > > Hello,
> > > > > > > >
> > > > > > > > As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> > > > > > > > have been asked to merge our common DRM scheduler patches first as well
> > > > > > > > as develop a common solution for long running workloads with the DRM
> > > > > > > > scheduler. This RFC series is our first attempt at doing this. We
> > > > > > > > welcome any and all feedback.
> > > > > > > >
> > > > > > > > This can be thought of as 4 parts, detailed below.
> > > > > > > >
> > > > > > > > - DRM scheduler changes for 1 to 1 relationship between scheduler and
> > > > > > > >   entity (patches 1-3)
> > > > > > > >
> > > > > > > > In Xe all of the scheduling of jobs is done by a firmware scheduler (the
> > > > > > > > GuC) which is a new paradigm WRT the DRM scheduler and presents several
> > > > > > > > problems, as the DRM scheduler was originally designed to schedule jobs
> > > > > > > > on hardware queues. The main problem is that the DRM scheduler expects
> > > > > > > > the submission order of jobs to be the completion order of jobs, even
> > > > > > > > across multiple entities. This assumption falls apart with a firmware
> > > > > > > > scheduler, as a firmware scheduler has no concept of jobs and jobs can
> > > > > > > > complete out of order. A novel solution was originally thought of by
> > > > > > > > Faith during the initial prototype of Xe: create a 1 to 1 relationship
> > > > > > > > between scheduler and entity. I believe the AGX driver [3] is using this
> > > > > > > > approach and Boris may use this approach as well for the Mali driver [4].
> > > > > > > >
> > > > > > > > To support a 1 to 1 relationship we move the main execution function
> > > > > > > > from a kthread to a work queue and add a new scheduling mode which
> > > > > > > > bypasses code in the DRM scheduler which isn't needed in a 1 to 1
> > > > > > > > relationship. The new scheduling mode should unify all drivers' usage
> > > > > > > > with a 1 to 1 relationship and can be thought of as using the scheduler
> > > > > > > > as a dependency / inflight job tracker rather than a true scheduler.
> > > > > > > >
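As a rough illustration of what the 1 to 1 model means in practice, written
against today's drm_sched_init() / drm_sched_entity_init() signatures (the
queue struct and ops are placeholder names, and the new scheduling mode from
patches 1-3 changes the details):

#include <drm/gpu_scheduler.h>

struct my_queue {
	struct drm_gpu_scheduler sched;
	struct drm_sched_entity entity;
	struct workqueue_struct *timeout_wq;
	u32 ring_job_limit;
	long job_timeout_jiffies;
	const char *name;
};

extern const struct drm_sched_backend_ops my_sched_ops;

/* One scheduler instance per submission queue and exactly one entity
 * feeding it: the scheduler degenerates into a dependency / inflight
 * job tracker, with hw_submission acting as the ring's job limit. */
static int my_queue_init(struct my_queue *q, struct device *dev)
{
	struct drm_gpu_scheduler *sched_list[] = { &q->sched };
	int err;

	err = drm_sched_init(&q->sched, &my_sched_ops,
			     q->ring_job_limit,		/* hw_submission */
			     0,				/* hang_limit */
			     q->job_timeout_jiffies,
			     q->timeout_wq, NULL, q->name, dev);
	if (err)
		return err;

	return drm_sched_entity_init(&q->entity, DRM_SCHED_PRIORITY_NORMAL,
				     sched_list, 1, NULL);
}
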
> > > > > > > > - Generic messaging interface for DRM scheduler
> > > > > > > >
> > > > > > > > The idea is to be able to communicate with the submission backend via
> > > > > > > > in-band (relative to the main execution function) messages. Messages are
> > > > > > > > backend defined and flexible enough for any use case. In Xe we use these
> > > > > > > > messages to clean up entities, set properties for entities, and suspend /
> > > > > > > > resume execution of an entity [5]. I suspect other drivers can leverage
> > > > > > > > this messaging concept too as it is a convenient way to avoid races in
> > > > > > > > the backend.
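
(A hypothetical sketch of that in-band message idea; none of these names
match the actual patches. The point is just that messages ride the same
ordered submission work as jobs, which is what avoids races against the
backend state.)

#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/workqueue.h>

struct my_msg_queue {
	spinlock_t msg_lock;
	struct list_head msgs;
	struct workqueue_struct *submit_wq;	/* ordered */
	struct work_struct submit_work;		/* also runs jobs */
};

struct backend_msg {
	struct list_head link;
	unsigned int opcode;	/* backend defined: CLEANUP, SET_PROP, SUSPEND, ... */
	void *data;
};

static void backend_post_msg(struct my_msg_queue *q, struct backend_msg *msg)
{
	spin_lock(&q->msg_lock);
	list_add_tail(&msg->link, &q->msgs);
	spin_unlock(&q->msg_lock);

	/* Kick the same ordered work that runs jobs, so messages are
	 * processed in-band with submissions and in order. */
	queue_work(q->submit_wq, &q->submit_work);
}
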
> > > > > > > >
> > > > > > > > - Support for using TDR for all error paths of a scheduler / entity
> > > > > > > >
> > > > > > > > Fix a few races / bugs, and add a function to dynamically set the TDR timeout.
> > > > > > > >
> > > > > > > > - Annotate dma-fences for long running workloads.
> > > > > > > >
> > > > > > > > The idea here is to use dma-fences only as sync points within the
> > > > > > > > scheduler and never export them for long running workloads. By
> > > > > > > > annotating these fences as long running we ensure that these dma-fences
> > > > > > > > are never used in a way that breaks the dma-fence rules. A benefit of
> > > > > > > > this approach is that the scheduler can still safely flow control the
> > > > > > > > execution ring buffer via the job limit without breaking the dma-fence
> > > > > > > > rules.
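
(Again purely illustrative, not the actual patches: the driver-side guard
that follows from this rule is roughly "a long-running queue's scheduler
fence is never installed anywhere user-visible; it only back-pressures the
ring via the job limit".)

#include <linux/dma-fence.h>
#include <linux/dma-resv.h>
#include <drm/drm_syncobj.h>

static void my_attach_out_fence(struct dma_fence *fence, bool long_running,
				struct dma_resv *resv,
				struct drm_syncobj *syncobj)
{
	if (long_running)
		return;	/* user space synchronizes with user fences instead */

	/* Normal (dma-fence) queues export as usual; assumes
	 * dma_resv_reserve_fences() was already called on resv. */
	dma_resv_add_fence(resv, fence, DMA_RESV_USAGE_WRITE);
	if (syncobj)
		drm_syncobj_replace_fence(syncobj, fence);
}
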
> > > > > > > >
> > > > > > > > Again, this is a first draft and we are looking forward to feedback.
> > > > > > > >
> > > > > > > > Enjoy - Matt
> > > > > > > >
> > > > > > > > [1] https://gitlab.freedesktop.org/drm/xe/kernel
> > > > > > > > [2] https://patchwork.freedesktop.org/series/112188/
> > > > > > > > [3] https://patchwork.freedesktop.org/series/114772/
> > > > > > > > [4] https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1
> > > > > > > > [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031
> > > > > > > >
> > > > > > > > Matthew Brost (8):
> > > > > > > > drm/sched: Convert drm scheduler to use a work queue rather than
> > > > > > > > kthread
> > > > > > > > drm/sched: Move schedule policy to scheduler / entity
> > > > > > > > drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy
> > > > > > > > drm/sched: Add generic scheduler message interface
> > > > > > > > drm/sched: Start run wq before TDR in drm_sched_start
> > > > > > > > drm/sched: Submit job before starting TDR
> > > > > > > > drm/sched: Add helper to set TDR timeout
> > > > > > > > drm/syncobj: Warn on long running dma-fences
> > > > > > > >
> > > > > > > > Thomas Hellström (2):
> > > > > > > > dma-buf/dma-fence: Introduce long-running completion fences
> > > > > > > > drm/sched: Support long-running sched entities
> > > > > > > >
> > > > > > > > drivers/dma-buf/dma-fence.c | 142 +++++++---
> > > > > > > > drivers/dma-buf/dma-resv.c | 5 +
> > > > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 14 +-
> > > > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 15 +-
> > > > > > > > drivers/gpu/drm/drm_syncobj.c | 5 +-
> > > > > > > > drivers/gpu/drm/etnaviv/etnaviv_sched.c | 5 +-
> > > > > > > > drivers/gpu/drm/lima/lima_sched.c | 5 +-
> > > > > > > > drivers/gpu/drm/msm/adreno/adreno_device.c | 6 +-
> > > > > > > > drivers/gpu/drm/msm/msm_ringbuffer.c | 5 +-
> > > > > > > > drivers/gpu/drm/panfrost/panfrost_job.c | 5 +-
> > > > > > > > drivers/gpu/drm/scheduler/sched_entity.c | 127 +++++++--
> > > > > > > > drivers/gpu/drm/scheduler/sched_fence.c | 6 +-
> > > > > > > > drivers/gpu/drm/scheduler/sched_main.c | 278 +++++++++++++++-----
> > > > > > > > drivers/gpu/drm/v3d/v3d_sched.c | 25 +-
> > > > > > > > include/drm/gpu_scheduler.h | 130 +++++++--
> > > > > > > > include/linux/dma-fence.h | 60 ++++-
> > > > > > > > 16 files changed, 649 insertions(+), 184 deletions(-)
> > > > > > > >
> > > > > > > > --
> > > > > > > > 2.34.1
>
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch