[RFC] drm/amdgpu/sdma5.2: Avoid latencies caused by the powergating workaround

Fri Jul 11 13:58:36 UTC 2025

On 11/07/2025 14:39, Alex Deucher wrote:
> On Fri, Jul 11, 2025 at 9:22 AM Tvrtko Ursulin
> <tvrtko.ursulin at igalia.com> wrote:
>>
>>
>> On 11/07/2025 13:45, Christian König wrote:
>>> On 11.07.25 14:23, Tvrtko Ursulin wrote:
>>>> Commit
>>>> 94b1e028e15c ("drm/amdgpu/sdma5.2: add begin/end_use ring callbacks")
>>>> added a workaround which disables GFXOFF for the duration of the job
>>>> submit stage (with a 100ms trailing hysteresis).
>>>>
>>>> Empirically the GFXOFF disable/enable request can suffer from significant
>>>> latencies (2ms is easily seen) which are then inserted onto the
>>>> amdgpu_job_run() path, which slows down the CPU submission of ready jobs.
>>>>
>>>> 1)
>>>> If the premise of the GFXOFF workaround is to keep it disabled while the
>>>> SDMA engine is active, the current workaround achieves that only
>>>> partially, for submissions and jobs which take less than 100ms (the GFXOFF
>>>> re-enable hysteresis), counting from the ring write phase, up to
>>>> completion.
>>>>
>>>> 2)
>>>> If disabling GFXOFF affects the GFX engine too, basing the workaround
>>>> solely on the SDMA activity creates, at minimum, a needless "chatter" on
>>>> the SMU communication channel.
>>>
>>> IIRC that is intentional. This "needless" chatter is what the workaround was all about.
>>
>> I tried to gather knowledge to how the hardware works from the comment
>> in sdma_v5_2_ring_begin_use(). Maybe I got it wrong so bear with me please.
>>
>> To try and explain my questions better. If the GFX ring/engine is busy
>> is there a point for SDMA to be requesting GFXOFF enable/disable? Or
>> maybe with diagrams...
>>
>> 1)
>>
>> SDMA:
>>
>>      ring-write     ring-commit       job-execute       job-done
>>    gfxoff-off-req  gfxoff-on-req  >100ms -> gfxoff-on
>>
>> Was the workaround prematurely dropped in this case (aka is
>> ring->funcs->end_use() the right place to drop it from)? Probably
>> theoretical that a SDMA job takes more than 100ms but I am trying to
>> understand it all.
>>
> 
> The firmware controls the power to a subset of the chip which contains
> both gfx and sdma.  Normally the firmware dynamically powers up and
> down gfx transparently when doorbells come in or the engines go idle
> for either engine.  amdgpu_gfx_off_ctrl() tells the firmware to allow
> or disallow gfxoff entry.  So what this workaround does is disallow
> gfxoff (which results in gfx being powered up) before we touch SDMA.
> Once SDMA is active, we can allow gfxoff again as it will dynamically

Hmm so it is "once" and not "while", as the comment says:

	/* SDMA 5.2.3 (RMB) FW doesn't seem to properly
	 * disallow GFXOFF in some cases leading to
	 * hangs in SDMA.  Disallow GFXOFF while SDMA is active.

?

And for "once active" amdgpu_ring_commit() is what it counts?

> be disabled once GFX/SDMA is no longer active.  In this particular
> case there was a race condition somewhere in the internal handshaking
> with SDMA which led to SDMA missing doorbells sometimes and not
> executing the job even if there was work in the ring.

Thank you, more or less than what I assumed.

But in this case there should be no harm in holding GFXOFF disabled 
until the job completes (like this patch)? Only a win to avoid the SMU 
communication latencies while unit is powered on anyway.

Regards,

Tvrtko

>> 2)
>>
>>
>> GFX:
>>
>>      +-----  job executing --------------------------------------+
>>
>> SDMA:
>>
>>      ring-write     ring-commit       job-execute       job-done
>>    gfxoff-off-req  gfxoff-on-req  >100ms -> gfxoff-on
>>
>>
>> Is it required for the SDMA activity to cause SMU message traffic in
>> this case, or is the powerdomain implied to be on (GFXOFF cannot turn it
>> off while GFX is active)?
>>
>> This is the case I measured latency spikes. While the GFX load was
>> running I was seeing SDMA->run_job() spike and traced it to the SMU
>> communication.
>>
>> Hence the idea from the patch - prevent adev->gfx.gfx_off_req_count
>> dropping to zero until both GFX and SDMA are idle.
>>
>> https://imgshare.cc/rdxu2bjl
>>
>> Above is visual representation of these latencies. Y is SDMA run_job()
>> duration in micro-seconds, X is seconds wall time. Blue is stock kernel,
>> orange is with this patch. X goes for ~60 seconds, which is how long
>> Cyberpunk 2077 benchmark is.
>>
>> Regards,
>>
>> Tvrtko
>>
>>>> If 1) and 2) hold true, we can improve on the workaround by; a) only
>>>> re-enabling GFXOFF once the job had actually completed*, and b) apply the
>>>> same workaround on other rings which share the same GFXOFF powergating
>>>> domain.
>>>
>>> The point of GFXOFF is to turn GFX on/off *without* kernel driver interaction. Otherwise we don't need it in the first place.
>>>
>>> We just have a hack for the SDMA because that moved into the GFXOFF domain with Navi and is broken on some HW generations IIRC.
>>>
>>>>
>>>> With these two applied, the GFXOFF re-enable requests are avoided
>>>> altogether during persistent activity on the GFX ring and simultaneous
>>>> sporadic activity on the SDMA ring.
>>>>
>>>> This has a positive effect of drastically reducing SDMA submission
>>>> latencies. For example during the Cyberpunk 2077 benchmark, they are
>>>> reduced from an average of 64us (stdev 60) to 9us (stdev 6). Or more
>>>> importantly the worst case latency, averaged to a one second window, is
>>>> reduced from 305us to 30us**.
>>>>
>>>> *) For ease of implementation we put the re-enable at the job free stage,
>>>> since doing it on actual completion is problematic in terms of locking.
>>>
>>> Absolutely clear NAK to this. Never ever base anything on the job livetime!
>>>
>>> We already had enough trouble with that.
>>>
>>>>
>>>> **) Submission latency ewma averaged (DECLARE_EWMA(latency, 6, 4)) -
>>>> Approximately 30 SDMA submissions per second, ewma average logged once
>>>> per second therefore significantly hides the worst case latency. Eg.
>>>> the real improvement in max submission latency is severely understated by
>>>> these numbers.
>>>
>>> Well that would indeed be quite nice to have.
>>>
>>> Regards,
>>> Christian.
>>>
>>>>
>>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin at igalia.com>
>>>> References: 94b1e028e15c ("drm/amdgpu/sdma5.2: add begin/end_use ring callbacks")
>>>> Cc: Mario Limonciello <mario.limonciello at amd.com>
>>>> Cc: Christian König <christian.koenig at amd.com>
>>>> Cc: Alex Deucher <alexander.deucher at amd.com>
>>>> ---
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h  | 1 +
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c   | 8 ++++++++
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_job.c  | 7 +++++++
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_job.h  | 2 ++
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 3 +++
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h | 1 +
>>>>    drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c   | 1 +
>>>>    7 files changed, 23 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
>>>> index 08f268dab8f5..eee40f385793 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
>>>> @@ -475,6 +475,7 @@ struct amdgpu_gfx {
>>>>       uint32_t                        compute_supported_reset;
>>>>
>>>>       /* gfx off */
>>>> +    bool                            gfx_off_held;       /* true: rings hold gfx_off */
>>>>       bool                            gfx_off_state;      /* true: enabled, false: disabled */
>>>>       struct mutex                    gfx_off_mutex;      /* mutex to change gfxoff state */
>>>>       uint32_t                        gfx_off_req_count;  /* default 1, enable gfx off: dec 1, disable gfx off: add 1 */
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
>>>> index 206b70acb29a..bf9bffe40235 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
>>>> @@ -191,6 +191,14 @@ int amdgpu_ib_schedule(struct amdgpu_ring *ring, unsigned int num_ibs,
>>>>               return r;
>>>>       }
>>>>
>>>> +    if (job && adev->gfx.gfx_off_held &&
>>>> +        (ring->funcs->type == AMDGPU_RING_TYPE_GFX ||
>>>> +         ring->funcs->type == AMDGPU_RING_TYPE_COMPUTE ||
>>>> +         ring->funcs->type == AMDGPU_RING_TYPE_SDMA)) {
>>>> +            amdgpu_gfx_off_ctrl(adev, false);
>>>> +            job->gfx_off_held = true;
>>>> +    }
>>>> +
>>>>       need_ctx_switch = ring->current_ctx != fence_ctx;
>>>>       if (ring->funcs->emit_pipeline_sync && job &&
>>>>           ((tmp = amdgpu_sync_get_fence(&job->explicit_sync)) ||
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>> index 2b58e353cca1..4cfd175ac6df 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>> @@ -191,6 +191,7 @@ int amdgpu_job_alloc(struct amdgpu_device *adev, struct amdgpu_vm *vm,
>>>>       if (!*job)
>>>>               return -ENOMEM;
>>>>
>>>> +    (*job)->adev = adev;
>>>>       (*job)->vm = vm;
>>>>
>>>>       amdgpu_sync_create(&(*job)->explicit_sync);
>>>> @@ -268,6 +269,9 @@ static void amdgpu_job_free_cb(struct drm_sched_job *s_job)
>>>>
>>>>       amdgpu_sync_free(&job->explicit_sync);
>>>>
>>>> +    if (job->gfx_off_held)
>>>> +            amdgpu_gfx_off_ctrl(job->adev, true);
>>>> +
>>>
>>>
>>>
>>>
>>>>       /* only put the hw fence if has embedded fence */
>>>>       if (!job->hw_fence.base.ops)
>>>>               kfree(job);
>>>> @@ -301,6 +305,9 @@ void amdgpu_job_free(struct amdgpu_job *job)
>>>>       if (job->gang_submit != &job->base.s_fence->scheduled)
>>>>               dma_fence_put(job->gang_submit);
>>>>
>>>> +    if (job->gfx_off_held)
>>>> +            amdgpu_gfx_off_ctrl(job->adev, true);
>>>> +
>>>>       if (!job->hw_fence.base.ops)
>>>>               kfree(job);
>>>>       else
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
>>>> index 2f302266662b..d4ab832ac193 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
>>>> @@ -46,6 +46,7 @@ enum amdgpu_ib_pool_type;
>>>>
>>>>    struct amdgpu_job {
>>>>       struct drm_sched_job    base;
>>>> +    struct amdgpu_device    *adev;
>>>>       struct amdgpu_vm        *vm;
>>>>       struct amdgpu_sync      explicit_sync;
>>>>       struct amdgpu_fence     hw_fence;
>>>> @@ -55,6 +56,7 @@ struct amdgpu_job {
>>>>       bool                    vm_needs_flush;
>>>>       bool                    gds_switch_needed;
>>>>       bool                    spm_update_needed;
>>>> +    bool                    gfx_off_held;
>>>>       uint64_t                vm_pd_addr;
>>>>       unsigned                vmid;
>>>>       unsigned                pasid;
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>>>> index 426834806fbf..22cac94e2f2a 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>>>> @@ -350,6 +350,9 @@ int amdgpu_ring_init(struct amdgpu_device *adev, struct amdgpu_ring *ring,
>>>>       ring->max_dw = max_dw;
>>>>       ring->hw_prio = hw_prio;
>>>>
>>>> +    if (ring->funcs->gfx_off_held)
>>>> +            adev->gfx.gfx_off_held = true;
>>>> +
>>>>       if (!ring->no_scheduler && ring->funcs->type < AMDGPU_HW_IP_NUM) {
>>>>               hw_ip = ring->funcs->type;
>>>>               num_sched = &adev->gpu_sched[hw_ip][hw_prio].num_scheds;
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>>> index 784ba2ec354c..afaf951b0b78 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>>> @@ -202,6 +202,7 @@ struct amdgpu_ring_funcs {
>>>>       bool                    support_64bit_ptrs;
>>>>       bool                    no_user_fence;
>>>>       bool                    secure_submission_supported;
>>>> +    bool                    gfx_off_held;
>>>>       unsigned                extra_dw;
>>>>
>>>>       /* ring read/write ptr handling */
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
>>>> index 42a25150f83a..c88de65e82bc 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
>>>> @@ -1944,6 +1944,7 @@ static const struct amdgpu_ring_funcs sdma_v5_2_ring_funcs = {
>>>>       .nop = SDMA_PKT_NOP_HEADER_OP(SDMA_OP_NOP),
>>>>       .support_64bit_ptrs = true,
>>>>       .secure_submission_supported = true,
>>>> +    .gfx_off_held = true,
>>>>       .get_rptr = sdma_v5_2_ring_get_rptr,
>>>>       .get_wptr = sdma_v5_2_ring_get_wptr,
>>>>       .set_wptr = sdma_v5_2_ring_set_wptr,
>>>
>>