[RFC PATCH 00/10] Xe DRM scheduler and long running workload plans

Mon Apr 17 06:47:19 UTC 2023

Am 11.04.23 um 16:13 schrieb Daniel Vetter:
> On Tue, Apr 11, 2023 at 11:02:55AM +0200, Christian König wrote:
>> The point is that this not only requires some work in the drm_scheduler, but
>> rather it then makes only little sense to use the drm_scheduler in the first
>> place.
>>
>> The whole point of the drm_scheduler is to provide dma_fence implementation
>> for the submitted jobs.
>>
>> We also have dependency handling, but as Daniel and I said this can be
>> easily extracted into a separate object/component.
> Uh that's not what I meant. My take is that minimally patching drm/sched
> to make the out-fence either optional, or complete it right away, is the
> simplest way to get at the dependency handling. For me at least the major
> part of drm/sched is the dep handling and timeout stuff. And the later can
> be reused with some glue to handle preempt timeouts too and other things,
> since tdr is a work struct you can just issue any other gpu timeouts on
> the same workqueue and using the roughly same pattern as the ->timed_out
> hook and it'll just work.

Well that strongly sounds like what I had in mind as well.

If we move the dependency/timeout functionality into a new component or 
if we move the scheduler fence into a new component doesn't seem to 
matter, the high level goal is that we have separated the two 
functionalities and both approach will work for that.

> The entire "oh we also make sure your hw fence doesn't leak into public
> fences and causes lifetime mayhem" seems pretty minor. And maybe also
> something we want to replicate for the preempt-ctx dma_fence that some
> long-running context need (but more as part of drm_sched_entity I guess).
>
> We can of course bikeshed how much flexibility really should be in the
> different parts of drm/sched, but imo that's a bikeshed.

Well the dependency handling in a separate component would still be 
interesting to have since we need something similar for user queues as well.

Christian.

> -Daniel
>
>
>> Regards,
>> Christian.
>>
>> Am 07.04.23 um 02:20 schrieb Zeng, Oak:
>>> So this series basically go with option 2. The part that option2 makes me uncomfortable is, dma-fence doesn't work for long running workload, why we generate it in the first place? As long as dma-fence is generated, it will become a source of confusion in the future. It doesn't matter how much you annotate it/document it. So if we decide to go with option2, the bottom line is, don't generate dma-fence for long running workload during job submission. This requires some rework in drm scheduler.
>>>
>>> The cleanest solution to me is option3. Dma-fence is a very old technology. When it was created, no gpu support page fault. Obviously this is not a good technology for modern gpu with page fault support. I think the best way is to create a new scheduler and dependency tracking mechanism works for both page fault enabled and page fault disabled context. I think this matches what Christian said below. Maybe nobody think this is easy?
>>>
>>> Thanks,
>>> Oak
>>>
>>>> -----Original Message-----
>>>> From: Brost, Matthew <matthew.brost at intel.com>
>>>> Sent: April 5, 2023 2:53 PM
>>>> To: Zeng, Oak <oak.zeng at intel.com>
>>>> Cc: Christian König <christian.koenig at amd.com>; Vetter, Daniel
>>>> <daniel.vetter at intel.com>; Thomas Hellström
>>>> <thomas.hellstrom at linux.intel.com>; dri-devel at lists.freedesktop.org; intel-
>>>> xe at lists.freedesktop.org; robdclark at chromium.org; airlied at linux.ie;
>>>> lina at asahilina.net; boris.brezillon at collabora.com; faith.ekstrand at collabora.com
>>>> Subject: Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload
>>>> plans
>>>>
>>>> On Wed, Apr 05, 2023 at 12:06:53PM -0600, Zeng, Oak wrote:
>>>>> Hi,
>>>>>
>>>>> Using dma-fence for completion/dependency tracking for long-run
>>>> workload(more precisely on-demand paging/page fault enabled workload) can
>>>> cause deadlock. This seems the significant issue here. Other issues such as the
>>>> drm scheduler completion order implication etc are minors which can be solve
>>>> inside the framework of drm scheduler. We need to evaluate below paths:
>>>>> 	1) still use drm scheduler for job submission, and use dma-fence for job
>>>> completion waiting/dependency tracking. This is solution proposed in this series.
>>>> Annotate dma-fence for long-run workload: user can still wait dma-fence for job
>>>> completion but can't wait dma-fence while holding any memory management
>>>> locks.  We still use dma-fence for dependency tracking. But it is just very easily
>>>> run into deadlock when on-demand paging is in the picture. The annotation helps
>>>> us to detect deadlock but not solve deadlock problems. Seems *not* a complete
>>>> solution: It is almost impossible to completely avoid dependency deadlock in
>>>> complex runtime environment
>>>> No one can wait on LR fence, so it is impossible to deadlock. The
>>>> annotations enforce this. Literally this is only for flow controling the
>>>> ring / hold pending jobs in in the DRM schedule list.
>>>>
>>>>> 	2) Still use drm scheduler but not use dma-fence for completion signaling
>>>> and dependency tracking. This way we still get some free functions (reset, err
>>>> handling ring flow control as Matt said)from drm scheduler, but push the
>>>> dependency/completion tracking completely to user space using techniques such
>>>> as user space fence. User space doesn't have chance to wait fence while holding
>>>> a kernel memory management lock, thus the dma-fence deadlock issue is solved.
>>>> We use user space fence for syncs.
>>>>
>>>>> 	3) Completely discard drm scheduler and dma-fence for long-run
>>>> workload. Use user queue/doorbell for super fast submission, directly interact
>>>> with fw scheduler. Use user fence for completion/dependency tracking.
>>>> This is a hard no from me, I want 1 submission path in Xe. Either we use
>>>> the DRM scheduler or we don't.
>>>>
>>>> Matt
>>>>
>>>>> Thanks,
>>>>> Oak
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Christian König <christian.koenig at amd.com>
>>>>>> Sent: April 5, 2023 3:30 AM
>>>>>> To: Brost, Matthew <matthew.brost at intel.com>; Zeng, Oak
>>>>>> <oak.zeng at intel.com>
>>>>>> Cc: dri-devel at lists.freedesktop.org; intel-xe at lists.freedesktop.org;
>>>>>> robdclark at chromium.org; thomas.hellstrom at linux.intel.com;
>>>> airlied at linux.ie;
>>>>>> lina at asahilina.net; boris.brezillon at collabora.com;
>>>> faith.ekstrand at collabora.com
>>>>>> Subject: Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload
>>>>>> plans
>>>>>>
>>>>>> Am 04.04.23 um 20:08 schrieb Matthew Brost:
>>>>>>> On Tue, Apr 04, 2023 at 12:02:03PM -0600, Zeng, Oak wrote:
>>>>>>>> Hi Matt, Thomas,
>>>>>>>>
>>>>>>>> Some very bold out of box thinking in this area:
>>>>>>>>
>>>>>>>> 1. so you want to use drm scheduler and dma-fence for long running
>>>> workload.
>>>>>> Why you want to do this in the first place? What is the benefit? Drm scheduler
>>>> is
>>>>>> pretty much a software scheduler. Modern gpu has scheduler built at fw/hw
>>>>>> level, as you said below for intel this is Guc. Can xe driver just directly submit
>>>> job
>>>>>> to Guc, bypassing drm scheduler?
>>>>>>> If we did that now we have 2 paths for dependency track, flow controling
>>>>>>> the ring, resets / error handling / backend submission implementations.
>>>>>>> We don't want this.
>>>>>> Well exactly that's the point: Why?
>>>>>>
>>>>>> As far as I can see that are two completely distinct use cases, so you
>>>>>> absolutely do want two completely distinct implementations for this.
>>>>>>
>>>>>>>> 2. using dma-fence for long run workload: I am well aware that page fault
>>>> (and
>>>>>> the consequent memory allocation/lock acquiring to fix the fault) can cause
>>>>>> deadlock for a dma-fence wait. But I am not convinced that dma-fence can't
>>>> be
>>>>>> used purely because the nature of the workload that it runs very long
>>>> (indefinite).
>>>>>> I did a math: the dma_fence_wait_timeout function's third param is the
>>>> timeout
>>>>>> which is a signed long type. If HZ is 1000, this is about 23 days. If 23 days is not
>>>> long
>>>>>> enough, can we just change the timeout parameter to signed 64 bits so it is
>>>> much
>>>>>> longer than our life time...
>>>>>>>> So I mainly argue we can't use dma-fence for long-run workload is not
>>>>>> because the workload runs very long, rather because of the fact that we use
>>>>>> page fault for long-run workload. If we enable page fault for short-run
>>>> workload,
>>>>>> we can't use dma-fence either. Page fault is the key thing here.
>>>>>>>> Now since we use page fault which is *fundamentally* controversial with
>>>>>> dma-fence design, why now just introduce a independent concept such as
>>>> user-
>>>>>> fence instead of extending existing dma-fence?
>>>>>>>> I like unified design. If drm scheduler, dma-fence can be extended to work
>>>> for
>>>>>> everything, it is beautiful. But seems we have some fundamental problem
>>>> here.
>>>>>>> Thomas's patches turn a dma-fence into KMD sync point (e.g. we just use
>>>>>>> the signal / CB infrastructure) and enforce we don't use use these
>>>>>>> dma-fences from the scheduler in memory reclaim paths or export these to
>>>>>>> user space or other drivers. Think of this mode as SW only fence.
>>>>>> Yeah and I truly think this is an really bad idea.
>>>>>>
>>>>>> The signal/CB infrastructure in the dma_fence turned out to be the
>>>>>> absolutely nightmare I initially predicted. Sorry to say that, but in
>>>>>> this case the "I've told you so" is appropriate in my opinion.
>>>>>>
>>>>>> If we need infrastructure for long running dependency tracking we should
>>>>>> encapsulate that in a new framework and not try to mangle the existing
>>>>>> code for something it was never intended for.
>>>>>>
>>>>>> Christian.
>>>>>>
>>>>>>> Matt
>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Oak
>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: dri-devel <dri-devel-bounces at lists.freedesktop.org> On Behalf Of
>>>>>>>>> Matthew Brost
>>>>>>>>> Sent: April 3, 2023 8:22 PM
>>>>>>>>> To: dri-devel at lists.freedesktop.org; intel-xe at lists.freedesktop.org
>>>>>>>>> Cc: robdclark at chromium.org; thomas.hellstrom at linux.intel.com;
>>>>>> airlied at linux.ie;
>>>>>>>>> lina at asahilina.net; boris.brezillon at collabora.com; Brost, Matthew
>>>>>>>>> <matthew.brost at intel.com>; christian.koenig at amd.com;
>>>>>>>>> faith.ekstrand at collabora.com
>>>>>>>>> Subject: [RFC PATCH 00/10] Xe DRM scheduler and long running workload
>>>>>> plans
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
>>>>>>>>> have been asked to merge our common DRM scheduler patches first as
>>>> well
>>>>>>>>> as develop a common solution for long running workloads with the DRM
>>>>>>>>> scheduler. This RFC series is our first attempt at doing this. We
>>>>>>>>> welcome any and all feedback.
>>>>>>>>>
>>>>>>>>> This can we thought of as 4 parts detailed below.
>>>>>>>>>
>>>>>>>>> - DRM scheduler changes for 1 to 1 relationship between scheduler and
>>>>>>>>> entity (patches 1-3)
>>>>>>>>>
>>>>>>>>> In Xe all of the scheduling of jobs is done by a firmware scheduler (the
>>>>>>>>> GuC) which is a new paradigm WRT to the DRM scheduler and presents
>>>>>>>>> severals problems as the DRM was originally designed to schedule jobs
>>>> on
>>>>>>>>> hardware queues. The main problem being that DRM scheduler expects
>>>> the
>>>>>>>>> submission order of jobs to be the completion order of jobs even across
>>>>>>>>> multiple entities. This assumption falls apart with a firmware scheduler
>>>>>>>>> as a firmware scheduler has no concept of jobs and jobs can complete
>>>> out
>>>>>>>>> of order. A novel solution for was originally thought of by Faith during
>>>>>>>>> the initial prototype of Xe, create a 1 to 1 relationship between scheduler
>>>>>>>>> and entity. I believe the AGX driver [3] is using this approach and
>>>>>>>>> Boris may use approach as well for the Mali driver [4].
>>>>>>>>>
>>>>>>>>> To support a 1 to 1 relationship we move the main execution function
>>>>>>>>> from a kthread to a work queue and add a new scheduling mode which
>>>>>>>>> bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
>>>>>>>>> The new scheduling mode should unify all drivers usage with a 1 to 1
>>>>>>>>> relationship and can be thought of as using scheduler as a dependency /
>>>>>>>>> infligt job tracker rather than a true scheduler.
>>>>>>>>>
>>>>>>>>> - Generic messaging interface for DRM scheduler
>>>>>>>>>
>>>>>>>>> Idea is to be able to communicate to the submission backend with in
>>>> band
>>>>>>>>> (relative to main execution function) messages. Messages are backend
>>>>>>>>> defined and flexable enough for any use case. In Xe we use these
>>>>>>>>> messages to clean up entites, set properties for entites, and suspend /
>>>>>>>>> resume execution of an entity [5]. I suspect other driver can leverage
>>>>>>>>> this messaging concept too as it a convenient way to avoid races in the
>>>>>>>>> backend.
>>>>>>>>>
>>>>>>>>> - Support for using TDR for all error paths of a scheduler / entity
>>>>>>>>>
>>>>>>>>> Fix a few races / bugs, add function to dynamically set the TDR timeout.
>>>>>>>>>
>>>>>>>>> - Annotate dma-fences for long running workloads.
>>>>>>>>>
>>>>>>>>> The idea here is to use dma-fences only as sync points within the
>>>>>>>>> scheduler and never export them for long running workloads. By
>>>>>>>>> annotating these fences as long running we ensure that these dma-
>>>> fences
>>>>>>>>> are never used in a way that breaks the dma-fence rules. A benefit of
>>>>>>>>> thus approach is the scheduler can still safely flow control the
>>>>>>>>> execution ring buffer via the job limit without breaking the dma-fence
>>>>>>>>> rules.
>>>>>>>>>
>>>>>>>>> Again this a first draft and looking forward to feedback.
>>>>>>>>>
>>>>>>>>> Enjoy - Matt
>>>>>>>>>
>>>>>>>>> [1] https://gitlab.freedesktop.org/drm/xe/kernel
>>>>>>>>> [2] https://patchwork.freedesktop.org/series/112188/
>>>>>>>>> [3] https://patchwork.freedesktop.org/series/114772/
>>>>>>>>> [4]
>>>> https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1
>>>>>>>>> [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-
>>>>>>>>> next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031
>>>>>>>>>
>>>>>>>>> Matthew Brost (8):
>>>>>>>>>      drm/sched: Convert drm scheduler to use a work queue rather than
>>>>>>>>>        kthread
>>>>>>>>>      drm/sched: Move schedule policy to scheduler / entity
>>>>>>>>>      drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling
>>>> policy
>>>>>>>>>      drm/sched: Add generic scheduler message interface
>>>>>>>>>      drm/sched: Start run wq before TDR in drm_sched_start
>>>>>>>>>      drm/sched: Submit job before starting TDR
>>>>>>>>>      drm/sched: Add helper to set TDR timeout
>>>>>>>>>      drm/syncobj: Warn on long running dma-fences
>>>>>>>>>
>>>>>>>>> Thomas Hellström (2):
>>>>>>>>>      dma-buf/dma-fence: Introduce long-running completion fences
>>>>>>>>>      drm/sched: Support long-running sched entities
>>>>>>>>>
>>>>>>>>>     drivers/dma-buf/dma-fence.c                 | 142 +++++++---
>>>>>>>>>     drivers/dma-buf/dma-resv.c                  |   5 +
>>>>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
>>>>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  15 +-
>>>>>>>>>     drivers/gpu/drm/drm_syncobj.c               |   5 +-
>>>>>>>>>     drivers/gpu/drm/etnaviv/etnaviv_sched.c     |   5 +-
>>>>>>>>>     drivers/gpu/drm/lima/lima_sched.c           |   5 +-
>>>>>>>>>     drivers/gpu/drm/msm/adreno/adreno_device.c  |   6 +-
>>>>>>>>>     drivers/gpu/drm/msm/msm_ringbuffer.c        |   5 +-
>>>>>>>>>     drivers/gpu/drm/panfrost/panfrost_job.c     |   5 +-
>>>>>>>>>     drivers/gpu/drm/scheduler/sched_entity.c    | 127 +++++++--
>>>>>>>>>     drivers/gpu/drm/scheduler/sched_fence.c     |   6 +-
>>>>>>>>>     drivers/gpu/drm/scheduler/sched_main.c      | 278 +++++++++++++++--
>>>> ---
>>>>>>>>>     drivers/gpu/drm/v3d/v3d_sched.c             |  25 +-
>>>>>>>>>     include/drm/gpu_scheduler.h                 | 130 +++++++--
>>>>>>>>>     include/linux/dma-fence.h                   |  60 ++++-
>>>>>>>>>     16 files changed, 649 insertions(+), 184 deletions(-)
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> 2.34.1