[RFC] amdgpu: Add a context flag to disable implicit sync

Faith Ekstrand faith at gfxstrand.net
Mon Aug 19 16:00:04 UTC 2024


On Mon, Aug 19, 2024 at 10:00 AM Bas Nieuwenhuizen <bas at basnieuwenhuizen.nl>
wrote:

>
>
> On Mon, Aug 19, 2024 at 4:51 PM Christian König <
> ckoenig.leichtzumerken at gmail.com> wrote:
>
>> On 07.08.24 at 22:33, Bas Nieuwenhuizen wrote:
>>
>> On Wed, Aug 7, 2024 at 10:25 PM Faith Ekstrand <faith at gfxstrand.net>
>> wrote:
>>
>>> On Wed, Aug 7, 2024 at 2:23 PM Joshua Ashton <joshua at froggi.es> wrote:
>>>
>>>> I was thinking about this more recently. I was initially considering
>>>> "maybe this should be a per-BO import," but I couldn't think of anything in
>>>> the GL model that would actually benefit, given it's not "true" bindless and
>>>> there's no update-after-bind there.
>>>>
>>>
>>> That's also an option and it's the way it works on i915. However, then
>>> you have to pass lists of things to the kernel and that's kinda gross. If
>>> we need it, we can do that. Otherwise, I think a more global thing is
>>> fine.  I think Bas is currently investigating a per-submit approach which
>>> is a tad different from either but should also work okay.
>>>
>>>
>>
>> Yeah, I'm working on a per-submit thing (also using BOOKKEEP fences
>> instead of the EXPLICIT wait mode, to ensure we also don't add
>> implicit fences).
>>
>>
>> Yeah agree. Your implementation with the per submission flag and using
>> BOOKKEEP actually sounds like the right thing to do to me as well.
>>
>> We need to keep in mind that synchronization goes both ways, i.e.
>> explicit->implicit as well as implicit->explicit.
>>
>> I would rather like to keep the implicit->explicit handling (which this
>> patch here completely disables) and only drop the explicit->implicit
>> handling (which is what using BOOKKEEP fences achieves).
>>
>> This way we may still over-synchronize, for example when a
>> double-buffered BO is re-used by an explicit client and an implicit
>> display server, but that's probably not something we should optimize
>> for in the first place.
>>
>
> This oversynchronization actually happens easily as in bindless Vulkan we
> have to mark all buffers as "used". We have some hacks to avoid the worst
> of it but it can be pretty meh.
>

Yeah, this case is actually really important. When I initially did the
dma-buf fence import/export work on Intel, it was a massive speed-up in
DOOM 2016, precisely from removing this bit of over-sync.

~Faith
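
For readers unfamiliar with the conversion mentioned above, here is a
minimal userspace sketch of the two dma-buf sync-file ioctls (the helper
names are made up for illustration, and error handling is trimmed):

#include <sys/ioctl.h>
#include <linux/dma-buf.h>

/* Pull the implicit fences off a dma-buf as a sync_file FD that an
 * explicit-sync driver can wait on.  DMA_BUF_SYNC_READ asks for the
 * fences you would have to wait on before reading (i.e. prior writes).
 */
int export_implicit_fence(int dmabuf_fd)
{
	struct dma_buf_export_sync_file args = {
		.flags = DMA_BUF_SYNC_READ,
		.fd = -1,
	};

	if (ioctl(dmabuf_fd, DMA_BUF_IOCTL_EXPORT_SYNC_FILE, &args))
		return -1;
	return args.fd;
}

/* Attach an explicit sync_file back onto the dma-buf as an implicit
 * write fence, so implicit-sync consumers (e.g. an old compositor)
 * will wait on it.
 */
int import_explicit_fence(int dmabuf_fd, int sync_file_fd)
{
	struct dma_buf_import_sync_file args = {
		.flags = DMA_BUF_SYNC_WRITE,
		.fd = sync_file_fd,
	};

	return ioctl(dmabuf_fd, DMA_BUF_IOCTL_IMPORT_SYNC_FILE, &args);
}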



> In my series on the ML[1] I think I actually got both sides by waiting on
> KERNEL fences only and setting BOOKKEEP fences. (Yeah, it actually ends up
> kinda orthogonal to the sync mode, but it is what it is ...)
>
> - Bas
>
> [1]https://patchwork.freedesktop.org/series/137014/
>
>
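
To make the KERNEL-wait/BOOKKEEP-signal idea concrete, here is a rough
kernel-side sketch of the scheme Bas describes (not his actual patch;
the helper is hypothetical, and it assumes the caller already holds the
dma_resv lock and has a scheduled job fence in hand):

#include <linux/dma-resv.h>

static int explicit_submit_sync(struct dma_resv *resv,
				struct dma_fence *job_fence)
{
	long r;

	/* Implicit->explicit direction: wait only on fences the kernel
	 * itself mandates (e.g. TTM buffer moves), skipping the implicit
	 * read/write fences added by other clients.
	 */
	r = dma_resv_wait_timeout(resv, DMA_RESV_USAGE_KERNEL, true,
				  MAX_SCHEDULE_TIMEOUT);
	if (r < 0)
		return r;

	/* Explicit->implicit direction: publish our fence as BOOKKEEP so
	 * memory management still tracks it, but implicit-sync
	 * submissions never wait on it.
	 */
	r = dma_resv_reserve_fences(resv, 1);
	if (r)
		return r;
	dma_resv_add_fence(resv, job_fence, DMA_RESV_USAGE_BOOKKEEP);
	return 0;
}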
>> Regards,
>> Christian.
>>
>>
>> We do have a per-BO list on submission already, so we could add things
>> there; it is just very annoying to implement because, at the point where we
>> do the fence wait/signal, we have already lost the association with the BO
>> list. Given that I don't see a use case anytime soon (there are some
>> theoretical ones, like radeonsi starting to use READ usage instead of RW
>> usage with extra tracking), I don't feel it's worth the added
>> implementation complexity.
>>
>>
>>>> It's worth others more familiar with GL asking that question of
>>>> themselves too. I am definitely not totally up on what's possible there.
>>>>
>>>> Overall, I think I am OK with this approach, even though I think mixing
>>>> implicit and explicit sync is gross, and I want the pain that is implicit
>>>> sync to just go away forever. :-)
>>>>
>>>
>>> So say we all...
>>>
>>> ~Faith
>>>
>>>
>>>
>>>> - Joshie 🐸✨
>>>>
>>>>
>>>> On August 7, 2024 4:39:32 PM GMT+01:00, Faith Ekstrand <
>>>> faith at gfxstrand.net> wrote:
>>>> >Previously, AMDGPU_GEM_CREATE_EXPLICIT_SYNC was used to disable
>>>> >implicit synchronization on BOs when explicit synchronization can
>>>> >be used.  The problem is that this flag is per-BO and affects all
>>>> >amdgpu users in the system, not just the usermode driver which sets
>>>> >it.  This can lead to some unintended consequences for userspace if
>>>> >not used carefully.
>>>> >
>>>> >Since the introduction of DMA_BUF_IOCTL_EXPORT_SYNC_FILE and
>>>> >DMA_BUF_IOCTL_IMPORT_SYNC_FILE, many userspace window system
>>>> >components have grown the ability to convert between the Vulkan
>>>> >explicit sync model and the legacy implicit sync model used by X11
>>>> >and Wayland in the past.  This allows both old and new components
>>>> >to exist simultaneously and talk to each other.  In particular,
>>>> >XWayland is able to convert between the two to let Vulkan apps work
>>>> >seamlessly with older X11 compositors that aren't aware of explicit
>>>> >synchronization.  This is rapidly becoming the backbone of
>>>> >synchronization in the Linux window system space.
>>>> >
>>>> >Unfortunately, AMDGPU_GEM_CREATE_EXPLICIT_SYNC breaks this because it
>>>> >disables all implicit synchronization on the given BO, regardless of
>>>> >which userspace driver is rendering with it and regardless of the
>>>> >assumptions made by the client application.  In particular, this is
>>>> >causing issues with RADV and radeonsi.  RADV sets the flag to disable
>>>> >implicit sync on window system buffers so that it can safely have
>>>> >them resident at all times without causing internal
>>>> >over-synchronization.  The BO is then handed off to a compositor
>>>> >which composites using radeonsi.  If the compositor uses the
>>>> >EGL_ANDROID_native_fence_sync extension to pass explicit sync files
>>>> >through to radeonsi, then everything is fine.  However, if that
>>>> >buffer ever gets handed to an instance of radeonsi with any
>>>> >assumption of implicit synchronization, radeonsi won't be able to
>>>> >sync on the BO because RADV disabled implicit sync on that BO
>>>> >system-wide.  It doesn't matter whether some window system component
>>>> >used DMA_BUF_IOCTL_IMPORT_SYNC_FILE to set the appropriate fence on
>>>> >the BO; amdgpu will ignore it.
>>>> >
>>>> >This new flag disables implicit sync on the context rather than the
>>>> >BO.  This way RADV can disable implicit sync (which is what RADV has
>>>> >always wanted) without affecting other components in the system.  If
>>>> >RADV (or some other driver) wants implicit sync on some BO, it can
>>>> >use DMA_BUF_IOCTL_EXPORT_SYNC_FILE and DMA_BUF_IOCTL_IMPORT_SYNC_FILE
>>>> >to manually synchronize with other implicit-sync components.  This
>>>> >is the approach we've taken with NVK/nouveau and ANV/xe, it's
>>>> >similar to the approach taken by ANV/i915, and it works well for
>>>> >those drivers.
>>>> >
>>>> >Ideally, I would like to see something like this back-ported to at
>>>> >least the kernel where DMA_BUF_IOCTL_IMPORT/EXPORT_SYNC_FILE were
>>>> >introduced so that we don't have to wait another year for the fix to
>>>> >reach users.  However, I understand that back-porting UAPI is
>>>> >problematic and I'll leave that decision up to the amdgpu
>>>> >maintainers.  Michel suggested that a new CTX_OP would make more
>>>> >sense if we want to back-port it, but the create flag made more
>>>> >sense to me from an API design PoV.
>>>> >
>>>> >Signed-off-by: Faith Ekstrand <faith.ekstrand at collabora.com>
>>>> >Cc: Alex Deucher <alexander.deucher at amd.com>
>>>> >Cc: Christian König <christian.koenig at amd.com>
>>>> >Cc: David Airlie <airlied at gmail.com>
>>>> >Cc: Michel Dänzer <mdaenzer at redhat.com>
>>>> >Cc: Bas Nieuwenhuizen <bas at basnieuwenhuizen.nl>
>>>> >---
>>>> > drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c  |  3 ++-
>>>> > drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 12 ++++++++----
>>>> > drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h |  7 +++++++
>>>> > include/uapi/drm/amdgpu_drm.h           | 12 +++++++++++-
>>>> > 4 files changed, 28 insertions(+), 6 deletions(-)
>>>> >
>>>> >diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>>> >index ec888fc6ead8..8410b4426541 100644
>>>> >--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>>> >+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>>> >@@ -1196,7 +1196,8 @@ static int amdgpu_cs_sync_rings(struct amdgpu_cs_parser *p)
>>>> >               struct dma_resv *resv = bo->tbo.base.resv;
>>>> >               enum amdgpu_sync_mode sync_mode;
>>>> >
>>>> >-              sync_mode = amdgpu_bo_explicit_sync(bo) ?
>>>> >+              sync_mode = (amdgpu_ctx_explicit_sync(p->ctx) ||
>>>> >+                           amdgpu_bo_explicit_sync(bo)) ?
>>>> >                       AMDGPU_SYNC_EXPLICIT : AMDGPU_SYNC_NE_OWNER;
>>>> >               r = amdgpu_sync_resv(p->adev, &p->sync, resv, sync_mode,
>>>> >                                    &fpriv->vm);
>>>> >diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>>> >index 5cb33ac99f70..a304740ccedf 100644
>>>> >--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>>> >+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>>> >@@ -318,7 +318,8 @@ static int amdgpu_ctx_get_stable_pstate(struct amdgpu_ctx *ctx,
>>>> > }
>>>> >
>>>> > static int amdgpu_ctx_init(struct amdgpu_ctx_mgr *mgr, int32_t priority,
>>>> >-                         struct drm_file *filp, struct amdgpu_ctx *ctx)
>>>> >+                         uint32_t flags, struct drm_file *filp,
>>>> >+                         struct amdgpu_ctx *ctx)
>>>> > {
>>>> >       struct amdgpu_fpriv *fpriv = filp->driver_priv;
>>>> >       u32 current_stable_pstate;
>>>> >@@ -334,6 +335,7 @@ static int amdgpu_ctx_init(struct amdgpu_ctx_mgr *mgr, int32_t priority,
>>>> >       ctx->mgr = mgr;
>>>> >       spin_lock_init(&ctx->ring_lock);
>>>> >
>>>> >+      ctx->flags = flags;
>>>> >       ctx->reset_counter = atomic_read(&mgr->adev->gpu_reset_counter);
>>>> >       ctx->reset_counter_query = ctx->reset_counter;
>>>> >       ctx->generation = amdgpu_vm_generation(mgr->adev, &fpriv->vm);
>>>> >@@ -474,6 +476,7 @@ static int amdgpu_ctx_alloc(struct amdgpu_device *adev,
>>>> >                           struct amdgpu_fpriv *fpriv,
>>>> >                           struct drm_file *filp,
>>>> >                           int32_t priority,
>>>> >+                          uint32_t flags,
>>>> >                           uint32_t *id)
>>>> > {
>>>> >       struct amdgpu_ctx_mgr *mgr = &fpriv->ctx_mgr;
>>>> >@@ -493,7 +496,7 @@ static int amdgpu_ctx_alloc(struct amdgpu_device *adev,
>>>> >       }
>>>> >
>>>> >       *id = (uint32_t)r;
>>>> >-      r = amdgpu_ctx_init(mgr, priority, filp, ctx);
>>>> >+      r = amdgpu_ctx_init(mgr, priority, flags, filp, ctx);
>>>> >       if (r) {
>>>> >               idr_remove(&mgr->ctx_handles, *id);
>>>> >               *id = 0;
>>>> >@@ -666,7 +669,7 @@ int amdgpu_ctx_ioctl(struct drm_device *dev, void *data,
>>>> >                    struct drm_file *filp)
>>>> > {
>>>> >       int r;
>>>> >-      uint32_t id, stable_pstate;
>>>> >+      uint32_t id, stable_pstate, flags;
>>>> >       int32_t priority;
>>>> >
>>>> >       union drm_amdgpu_ctx *args = data;
>>>> >@@ -675,6 +678,7 @@ int amdgpu_ctx_ioctl(struct drm_device *dev, void *data,
>>>> >
>>>> >       id = args->in.ctx_id;
>>>> >       priority = args->in.priority;
>>>> >+      flags = args->in.flags;
>>>> >
>>>> >       /* For backwards compatibility, we need to accept ioctls with garbage
>>>> >        * in the priority field. Garbage values in the priority field, result
>>>> >@@ -685,7 +689,7 @@ int amdgpu_ctx_ioctl(struct drm_device *dev, void *data,
>>>> >
>>>> >       switch (args->in.op) {
>>>> >       case AMDGPU_CTX_OP_ALLOC_CTX:
>>>> >-              r = amdgpu_ctx_alloc(adev, fpriv, filp, priority, &id);
>>>> >+              r = amdgpu_ctx_alloc(adev, fpriv, filp, priority, flags, &id);
>>>> >               args->out.alloc.ctx_id = id;
>>>> >               break;
>>>> >       case AMDGPU_CTX_OP_FREE_CTX:
>>>> >diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h
>>>> >index 85376baaa92f..9431c8d2ea59 100644
>>>> >--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h
>>>> >+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h
>>>> >@@ -45,6 +45,7 @@ struct amdgpu_ctx_entity {
>>>> > struct amdgpu_ctx {
>>>> >       struct kref                     refcount;
>>>> >       struct amdgpu_ctx_mgr           *mgr;
>>>> >+      uint32_t                        flags;
>>>> >       unsigned                        reset_counter;
>>>> >       unsigned                        reset_counter_query;
>>>> >       uint64_t                        generation;
>>>> >@@ -84,6 +85,12 @@ struct dma_fence *amdgpu_ctx_get_fence(struct amdgpu_ctx *ctx,
>>>> > bool amdgpu_ctx_priority_is_valid(int32_t ctx_prio);
>>>> > void amdgpu_ctx_priority_override(struct amdgpu_ctx *ctx, int32_t ctx_prio);
>>>> >
>>>> >+static inline bool
>>>> >+amdgpu_ctx_explicit_sync(struct amdgpu_ctx *ctx)
>>>> >+{
>>>> >+      return ctx->flags & AMDGPU_CTX_ALLOC_FLAGS_EXPLICIT_SYNC;
>>>> >+}
>>>> >+
>>>> > int amdgpu_ctx_ioctl(struct drm_device *dev, void *data,
>>>> >                    struct drm_file *filp);
>>>> >
>>>> >diff --git a/include/uapi/drm/amdgpu_drm.h b/include/uapi/drm/amdgpu_drm.h
>>>> >index 96e32dafd4f0..e9d87a6e3d86 100644
>>>> >--- a/include/uapi/drm/amdgpu_drm.h
>>>> >+++ b/include/uapi/drm/amdgpu_drm.h
>>>> >@@ -125,7 +125,14 @@ extern "C" {
>>>> > #define AMDGPU_GEM_CREATE_VRAM_CONTIGUOUS     (1 << 5)
>>>> > /* Flag that BO is always valid in this VM */
>>>> > #define AMDGPU_GEM_CREATE_VM_ALWAYS_VALID     (1 << 6)
>>>> >-/* Flag that BO sharing will be explicitly synchronized */
>>>> >+/* Flag that BO sharing will be explicitly synchronized
>>>> >+ *
>>>> >+ * This flag should not be used unless the client can guarantee that no
>>>> >+ * other driver which ever touches this BO will ever want to use implicit
>>>> >+ * synchronization as it disables implicit sync on this BO system-wide.
>>>> >+ * Instead, drivers which use an explicit synchronization model should
>>>> >+ * prefer AMDGPU_CTX_ALLOC_FLAGS_EXPLICIT_SYNC.
>>>> >+ */
>>>> > #define AMDGPU_GEM_CREATE_EXPLICIT_SYNC               (1 << 7)
>>>> > /* Flag that indicates allocating MQD gart on GFX9, where the mtype
>>>> >  * for the second page onward should be set to NC. It should never
>>>> >@@ -240,6 +247,9 @@ union drm_amdgpu_bo_list {
>>>> > #define AMDGPU_CTX_OP_GET_STABLE_PSTATE       5
>>>> > #define AMDGPU_CTX_OP_SET_STABLE_PSTATE       6
>>>> >
>>>> >+/* indicate that all synchronization will be explicit */
>>>> >+#define AMDGPU_CTX_ALLOC_FLAGS_EXPLICIT_SYNC (1<<0)
>>>> >+
>>>> > /* GPU reset status */
>>>> > #define AMDGPU_CTX_NO_RESET           0
>>>> > /* this the context caused it */
>>>>
>>>
>>
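
For completeness, a minimal userspace sketch of allocating a context
with the proposed flag, assuming the UAPI lands as posted in this RFC
(the helper name is made up; error handling is trimmed):

#include <stdint.h>
#include <sys/ioctl.h>
#include <drm/amdgpu_drm.h>	/* needs this patch applied for the new flag */

static int alloc_explicit_sync_ctx(int drm_fd, uint32_t *ctx_id)
{
	union drm_amdgpu_ctx args = {
		.in.op = AMDGPU_CTX_OP_ALLOC_CTX,
		.in.flags = AMDGPU_CTX_ALLOC_FLAGS_EXPLICIT_SYNC,
		.in.priority = AMDGPU_CTX_PRIORITY_NORMAL,
	};

	/* Submissions on this context skip implicit sync entirely; the
	 * driver is expected to use sync files / syncobjs (or the dma-buf
	 * import/export ioctls above) for cross-process synchronization.
	 */
	if (ioctl(drm_fd, DRM_IOCTL_AMDGPU_CTX, &args))
		return -1;
	*ctx_id = args.out.alloc.ctx_id;
	return 0;
}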