[RFC] drm/amdgpu/sdma5.2: Avoid latencies caused by the powergating workaround

Fri Jul 11 15:27:56 UTC 2025

On 11.07.25 15:58, Tvrtko Ursulin wrote:
> 
> On 11/07/2025 14:39, Alex Deucher wrote:
>> On Fri, Jul 11, 2025 at 9:22 AM Tvrtko Ursulin
>> <tvrtko.ursulin at igalia.com> wrote:
>>>
>>>
>>> On 11/07/2025 13:45, Christian König wrote:
>>>> On 11.07.25 14:23, Tvrtko Ursulin wrote:
>>>>> Commit
>>>>> 94b1e028e15c ("drm/amdgpu/sdma5.2: add begin/end_use ring callbacks")
>>>>> added a workaround which disables GFXOFF for the duration of the job
>>>>> submit stage (with a 100ms trailing hysteresis).
>>>>>
>>>>> Empirically the GFXOFF disable/enable request can suffer from significant
>>>>> latencies (2ms is easily seen) which are then inserted onto the
>>>>> amdgpu_job_run() path, which slows down the CPU submission of ready jobs.
>>>>>
>>>>> 1)
>>>>> If the premise of the GFXOFF workaround is to keep it disabled while the
>>>>> SDMA engine is active, the current workaround achieves that only
>>>>> partially, for submissions and jobs which take less than 100ms (the GFXOFF
>>>>> re-enable hysteresis), counting from the ring write phase, up to
>>>>> completion.
>>>>>
>>>>> 2)
>>>>> If disabling GFXOFF affects the GFX engine too, basing the workaround
>>>>> solely on the SDMA activity creates, at minimum, a needless "chatter" on
>>>>> the SMU communication channel.
>>>>
>>>> IIRC that is intentional. This "needless" chatter is what the workaround was all about.
>>>
>>> I tried to gather knowledge to how the hardware works from the comment
>>> in sdma_v5_2_ring_begin_use(). Maybe I got it wrong so bear with me please.
>>>
>>> To try and explain my questions better. If the GFX ring/engine is busy
>>> is there a point for SDMA to be requesting GFXOFF enable/disable? Or
>>> maybe with diagrams...
>>>
>>> 1)
>>>
>>> SDMA:
>>>
>>>      ring-write     ring-commit       job-execute       job-done
>>>    gfxoff-off-req  gfxoff-on-req  >100ms -> gfxoff-on
>>>
>>> Was the workaround prematurely dropped in this case (aka is
>>> ring->funcs->end_use() the right place to drop it from)? Probably
>>> theoretical that a SDMA job takes more than 100ms but I am trying to
>>> understand it all.
>>>
>>
>> The firmware controls the power to a subset of the chip which contains
>> both gfx and sdma.  Normally the firmware dynamically powers up and
>> down gfx transparently when doorbells come in or the engines go idle
>> for either engine.  amdgpu_gfx_off_ctrl() tells the firmware to allow
>> or disallow gfxoff entry.  So what this workaround does is disallow
>> gfxoff (which results in gfx being powered up) before we touch SDMA.
>> Once SDMA is active, we can allow gfxoff again as it will dynamically
> 
> Hmm so it is "once" and not "while", as the comment says:
> 
>     /* SDMA 5.2.3 (RMB) FW doesn't seem to properly
>      * disallow GFXOFF in some cases leading to
>      * hangs in SDMA.  Disallow GFXOFF while SDMA is active.
> 
> ?
> 
> And for "once active" amdgpu_ring_commit() is what it counts?

More or less, yes. It's the end_use callback which counts and that is called by amdgpu_ring_commit() IIRC.

> 
>> be disabled once GFX/SDMA is no longer active.  In this particular
>> case there was a race condition somewhere in the internal handshaking
>> with SDMA which led to SDMA missing doorbells sometimes and not
>> executing the job even if there was work in the ring.
> 
> Thank you, more or less than what I assumed.
> 
> But in this case there should be no harm in holding GFXOFF disabled until the job completes (like this patch)? Only a win to avoid the SMU communication latencies while unit is powered on anyway.

No, that is something we should try to avoid. GFXOFF can be active again as soon as we have ringed the doorbell.

The work item was there basically to just not turn GFXOFF off/on all the time.

Regards,
Christian.

> 
> Regards,
> 
> Tvrtko
> 
>>> 2)
>>>
>>>
>>> GFX:
>>>
>>>      +-----  job executing --------------------------------------+
>>>
>>> SDMA:
>>>
>>>      ring-write     ring-commit       job-execute       job-done
>>>    gfxoff-off-req  gfxoff-on-req  >100ms -> gfxoff-on
>>>
>>>
>>> Is it required for the SDMA activity to cause SMU message traffic in
>>> this case, or is the powerdomain implied to be on (GFXOFF cannot turn it
>>> off while GFX is active)?
>>>
>>> This is the case I measured latency spikes. While the GFX load was
>>> running I was seeing SDMA->run_job() spike and traced it to the SMU
>>> communication.
>>>
>>> Hence the idea from the patch - prevent adev->gfx.gfx_off_req_count
>>> dropping to zero until both GFX and SDMA are idle.
>>>
>>> https://imgshare.cc/rdxu2bjl
>>>
>>> Above is visual representation of these latencies. Y is SDMA run_job()
>>> duration in micro-seconds, X is seconds wall time. Blue is stock kernel,
>>> orange is with this patch. X goes for ~60 seconds, which is how long
>>> Cyberpunk 2077 benchmark is.
>>>
>>> Regards,
>>>
>>> Tvrtko
>>>
>>>>> If 1) and 2) hold true, we can improve on the workaround by; a) only
>>>>> re-enabling GFXOFF once the job had actually completed*, and b) apply the
>>>>> same workaround on other rings which share the same GFXOFF powergating
>>>>> domain.
>>>>
>>>> The point of GFXOFF is to turn GFX on/off *without* kernel driver interaction. Otherwise we don't need it in the first place.
>>>>
>>>> We just have a hack for the SDMA because that moved into the GFXOFF domain with Navi and is broken on some HW generations IIRC.
>>>>
>>>>>
>>>>> With these two applied, the GFXOFF re-enable requests are avoided
>>>>> altogether during persistent activity on the GFX ring and simultaneous
>>>>> sporadic activity on the SDMA ring.
>>>>>
>>>>> This has a positive effect of drastically reducing SDMA submission
>>>>> latencies. For example during the Cyberpunk 2077 benchmark, they are
>>>>> reduced from an average of 64us (stdev 60) to 9us (stdev 6). Or more
>>>>> importantly the worst case latency, averaged to a one second window, is
>>>>> reduced from 305us to 30us**.
>>>>>
>>>>> *) For ease of implementation we put the re-enable at the job free stage,
>>>>> since doing it on actual completion is problematic in terms of locking.
>>>>
>>>> Absolutely clear NAK to this. Never ever base anything on the job livetime!
>>>>
>>>> We already had enough trouble with that.
>>>>
>>>>>
>>>>> **) Submission latency ewma averaged (DECLARE_EWMA(latency, 6, 4)) -
>>>>> Approximately 30 SDMA submissions per second, ewma average logged once
>>>>> per second therefore significantly hides the worst case latency. Eg.
>>>>> the real improvement in max submission latency is severely understated by
>>>>> these numbers.
>>>>
>>>> Well that would indeed be quite nice to have.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>
>>>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin at igalia.com>
>>>>> References: 94b1e028e15c ("drm/amdgpu/sdma5.2: add begin/end_use ring callbacks")
>>>>> Cc: Mario Limonciello <mario.limonciello at amd.com>
>>>>> Cc: Christian König <christian.koenig at amd.com>
>>>>> Cc: Alex Deucher <alexander.deucher at amd.com>
>>>>> ---
>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h  | 1 +
>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c   | 8 ++++++++
>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_job.c  | 7 +++++++
>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_job.h  | 2 ++
>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 3 +++
>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h | 1 +
>>>>>    drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c   | 1 +
>>>>>    7 files changed, 23 insertions(+)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
>>>>> index 08f268dab8f5..eee40f385793 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
>>>>> @@ -475,6 +475,7 @@ struct amdgpu_gfx {
>>>>>       uint32_t                        compute_supported_reset;
>>>>>
>>>>>       /* gfx off */
>>>>> +    bool                            gfx_off_held;       /* true: rings hold gfx_off */
>>>>>       bool                            gfx_off_state;      /* true: enabled, false: disabled */
>>>>>       struct mutex                    gfx_off_mutex;      /* mutex to change gfxoff state */
>>>>>       uint32_t                        gfx_off_req_count;  /* default 1, enable gfx off: dec 1, disable gfx off: add 1 */
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
>>>>> index 206b70acb29a..bf9bffe40235 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
>>>>> @@ -191,6 +191,14 @@ int amdgpu_ib_schedule(struct amdgpu_ring *ring, unsigned int num_ibs,
>>>>>               return r;
>>>>>       }
>>>>>
>>>>> +    if (job && adev->gfx.gfx_off_held &&
>>>>> +        (ring->funcs->type == AMDGPU_RING_TYPE_GFX ||
>>>>> +         ring->funcs->type == AMDGPU_RING_TYPE_COMPUTE ||
>>>>> +         ring->funcs->type == AMDGPU_RING_TYPE_SDMA)) {
>>>>> +            amdgpu_gfx_off_ctrl(adev, false);
>>>>> +            job->gfx_off_held = true;
>>>>> +    }
>>>>> +
>>>>>       need_ctx_switch = ring->current_ctx != fence_ctx;
>>>>>       if (ring->funcs->emit_pipeline_sync && job &&
>>>>>           ((tmp = amdgpu_sync_get_fence(&job->explicit_sync)) ||
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>> index 2b58e353cca1..4cfd175ac6df 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>> @@ -191,6 +191,7 @@ int amdgpu_job_alloc(struct amdgpu_device *adev, struct amdgpu_vm *vm,
>>>>>       if (!*job)
>>>>>               return -ENOMEM;
>>>>>
>>>>> +    (*job)->adev = adev;
>>>>>       (*job)->vm = vm;
>>>>>
>>>>>       amdgpu_sync_create(&(*job)->explicit_sync);
>>>>> @@ -268,6 +269,9 @@ static void amdgpu_job_free_cb(struct drm_sched_job *s_job)
>>>>>
>>>>>       amdgpu_sync_free(&job->explicit_sync);
>>>>>
>>>>> +    if (job->gfx_off_held)
>>>>> +            amdgpu_gfx_off_ctrl(job->adev, true);
>>>>> +
>>>>
>>>>
>>>>
>>>>
>>>>>       /* only put the hw fence if has embedded fence */
>>>>>       if (!job->hw_fence.base.ops)
>>>>>               kfree(job);
>>>>> @@ -301,6 +305,9 @@ void amdgpu_job_free(struct amdgpu_job *job)
>>>>>       if (job->gang_submit != &job->base.s_fence->scheduled)
>>>>>               dma_fence_put(job->gang_submit);
>>>>>
>>>>> +    if (job->gfx_off_held)
>>>>> +            amdgpu_gfx_off_ctrl(job->adev, true);
>>>>> +
>>>>>       if (!job->hw_fence.base.ops)
>>>>>               kfree(job);
>>>>>       else
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
>>>>> index 2f302266662b..d4ab832ac193 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
>>>>> @@ -46,6 +46,7 @@ enum amdgpu_ib_pool_type;
>>>>>
>>>>>    struct amdgpu_job {
>>>>>       struct drm_sched_job    base;
>>>>> +    struct amdgpu_device    *adev;
>>>>>       struct amdgpu_vm        *vm;
>>>>>       struct amdgpu_sync      explicit_sync;
>>>>>       struct amdgpu_fence     hw_fence;
>>>>> @@ -55,6 +56,7 @@ struct amdgpu_job {
>>>>>       bool                    vm_needs_flush;
>>>>>       bool                    gds_switch_needed;
>>>>>       bool                    spm_update_needed;
>>>>> +    bool                    gfx_off_held;
>>>>>       uint64_t                vm_pd_addr;
>>>>>       unsigned                vmid;
>>>>>       unsigned                pasid;
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>>>>> index 426834806fbf..22cac94e2f2a 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>>>>> @@ -350,6 +350,9 @@ int amdgpu_ring_init(struct amdgpu_device *adev, struct amdgpu_ring *ring,
>>>>>       ring->max_dw = max_dw;
>>>>>       ring->hw_prio = hw_prio;
>>>>>
>>>>> +    if (ring->funcs->gfx_off_held)
>>>>> +            adev->gfx.gfx_off_held = true;
>>>>> +
>>>>>       if (!ring->no_scheduler && ring->funcs->type < AMDGPU_HW_IP_NUM) {
>>>>>               hw_ip = ring->funcs->type;
>>>>>               num_sched = &adev->gpu_sched[hw_ip][hw_prio].num_scheds;
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>>>> index 784ba2ec354c..afaf951b0b78 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>>>> @@ -202,6 +202,7 @@ struct amdgpu_ring_funcs {
>>>>>       bool                    support_64bit_ptrs;
>>>>>       bool                    no_user_fence;
>>>>>       bool                    secure_submission_supported;
>>>>> +    bool                    gfx_off_held;
>>>>>       unsigned                extra_dw;
>>>>>
>>>>>       /* ring read/write ptr handling */
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
>>>>> index 42a25150f83a..c88de65e82bc 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
>>>>> @@ -1944,6 +1944,7 @@ static const struct amdgpu_ring_funcs sdma_v5_2_ring_funcs = {
>>>>>       .nop = SDMA_PKT_NOP_HEADER_OP(SDMA_OP_NOP),
>>>>>       .support_64bit_ptrs = true,
>>>>>       .secure_submission_supported = true,
>>>>> +    .gfx_off_held = true,
>>>>>       .get_rptr = sdma_v5_2_ring_get_rptr,
>>>>>       .get_wptr = sdma_v5_2_ring_get_wptr,
>>>>>       .set_wptr = sdma_v5_2_ring_set_wptr,
>>>>
>>>
>