[RFC] drm/amdgpu/sdma5.2: Avoid latencies caused by the powergating workaround
Tvrtko Ursulin
tvrtko.ursulin at igalia.com
Fri Jul 11 13:58:36 UTC 2025
On 11/07/2025 14:39, Alex Deucher wrote:
> On Fri, Jul 11, 2025 at 9:22 AM Tvrtko Ursulin
> <tvrtko.ursulin at igalia.com> wrote:
>>
>>
>> On 11/07/2025 13:45, Christian König wrote:
>>> On 11.07.25 14:23, Tvrtko Ursulin wrote:
>>>> Commit
>>>> 94b1e028e15c ("drm/amdgpu/sdma5.2: add begin/end_use ring callbacks")
>>>> added a workaround which disables GFXOFF for the duration of the job
>>>> submit stage (with a 100ms trailing hysteresis).
>>>>
>>>> Empirically the GFXOFF disable/enable request can suffer from significant
>>>> latencies (2ms is easily seen) which are then inserted onto the
>>>> amdgpu_job_run() path, which slows down the CPU submission of ready jobs.
>>>>
>>>> 1)
>>>> If the premise of the GFXOFF workaround is to keep it disabled while the
>>>> SDMA engine is active, the current workaround achieves that only
>>>> partially, for submissions and jobs which take less than 100ms (the GFXOFF
>>>> re-enable hysteresis), counting from the ring write phase, up to
>>>> completion.
>>>>
>>>> 2)
>>>> If disabling GFXOFF affects the GFX engine too, basing the workaround
>>>> solely on the SDMA activity creates, at minimum, a needless "chatter" on
>>>> the SMU communication channel.
>>>
>>> IIRC that is intentional. This "needless" chatter is what the workaround was all about.
>>
>> I tried to gather knowledge to how the hardware works from the comment
>> in sdma_v5_2_ring_begin_use(). Maybe I got it wrong so bear with me please.
>>
>> To try and explain my questions better. If the GFX ring/engine is busy
>> is there a point for SDMA to be requesting GFXOFF enable/disable? Or
>> maybe with diagrams...
>>
>> 1)
>>
>> SDMA:
>>
>> ring-write ring-commit job-execute job-done
>> gfxoff-off-req gfxoff-on-req >100ms -> gfxoff-on
>>
>> Was the workaround prematurely dropped in this case (aka is
>> ring->funcs->end_use() the right place to drop it from)? Probably
>> theoretical that a SDMA job takes more than 100ms but I am trying to
>> understand it all.
>>
>
> The firmware controls the power to a subset of the chip which contains
> both gfx and sdma. Normally the firmware dynamically powers up and
> down gfx transparently when doorbells come in or the engines go idle
> for either engine. amdgpu_gfx_off_ctrl() tells the firmware to allow
> or disallow gfxoff entry. So what this workaround does is disallow
> gfxoff (which results in gfx being powered up) before we touch SDMA.
> Once SDMA is active, we can allow gfxoff again as it will dynamically
Hmm so it is "once" and not "while", as the comment says:
/* SDMA 5.2.3 (RMB) FW doesn't seem to properly
* disallow GFXOFF in some cases leading to
* hangs in SDMA. Disallow GFXOFF while SDMA is active.
?
And for "once active" amdgpu_ring_commit() is what it counts?
> be disabled once GFX/SDMA is no longer active. In this particular
> case there was a race condition somewhere in the internal handshaking
> with SDMA which led to SDMA missing doorbells sometimes and not
> executing the job even if there was work in the ring.
Thank you, more or less than what I assumed.
But in this case there should be no harm in holding GFXOFF disabled
until the job completes (like this patch)? Only a win to avoid the SMU
communication latencies while unit is powered on anyway.
Regards,
Tvrtko
>> 2)
>>
>>
>> GFX:
>>
>> +----- job executing --------------------------------------+
>>
>> SDMA:
>>
>> ring-write ring-commit job-execute job-done
>> gfxoff-off-req gfxoff-on-req >100ms -> gfxoff-on
>>
>>
>> Is it required for the SDMA activity to cause SMU message traffic in
>> this case, or is the powerdomain implied to be on (GFXOFF cannot turn it
>> off while GFX is active)?
>>
>> This is the case I measured latency spikes. While the GFX load was
>> running I was seeing SDMA->run_job() spike and traced it to the SMU
>> communication.
>>
>> Hence the idea from the patch - prevent adev->gfx.gfx_off_req_count
>> dropping to zero until both GFX and SDMA are idle.
>>
>> https://imgshare.cc/rdxu2bjl
>>
>> Above is visual representation of these latencies. Y is SDMA run_job()
>> duration in micro-seconds, X is seconds wall time. Blue is stock kernel,
>> orange is with this patch. X goes for ~60 seconds, which is how long
>> Cyberpunk 2077 benchmark is.
>>
>> Regards,
>>
>> Tvrtko
>>
>>>> If 1) and 2) hold true, we can improve on the workaround by; a) only
>>>> re-enabling GFXOFF once the job had actually completed*, and b) apply the
>>>> same workaround on other rings which share the same GFXOFF powergating
>>>> domain.
>>>
>>> The point of GFXOFF is to turn GFX on/off *without* kernel driver interaction. Otherwise we don't need it in the first place.
>>>
>>> We just have a hack for the SDMA because that moved into the GFXOFF domain with Navi and is broken on some HW generations IIRC.
>>>
>>>>
>>>> With these two applied, the GFXOFF re-enable requests are avoided
>>>> altogether during persistent activity on the GFX ring and simultaneous
>>>> sporadic activity on the SDMA ring.
>>>>
>>>> This has a positive effect of drastically reducing SDMA submission
>>>> latencies. For example during the Cyberpunk 2077 benchmark, they are
>>>> reduced from an average of 64us (stdev 60) to 9us (stdev 6). Or more
>>>> importantly the worst case latency, averaged to a one second window, is
>>>> reduced from 305us to 30us**.
>>>>
>>>> *) For ease of implementation we put the re-enable at the job free stage,
>>>> since doing it on actual completion is problematic in terms of locking.
>>>
>>> Absolutely clear NAK to this. Never ever base anything on the job livetime!
>>>
>>> We already had enough trouble with that.
>>>
>>>>
>>>> **) Submission latency ewma averaged (DECLARE_EWMA(latency, 6, 4)) -
>>>> Approximately 30 SDMA submissions per second, ewma average logged once
>>>> per second therefore significantly hides the worst case latency. Eg.
>>>> the real improvement in max submission latency is severely understated by
>>>> these numbers.
>>>
>>> Well that would indeed be quite nice to have.
>>>
>>> Regards,
>>> Christian.
>>>
>>>>
>>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin at igalia.com>
>>>> References: 94b1e028e15c ("drm/amdgpu/sdma5.2: add begin/end_use ring callbacks")
>>>> Cc: Mario Limonciello <mario.limonciello at amd.com>
>>>> Cc: Christian König <christian.koenig at amd.com>
>>>> Cc: Alex Deucher <alexander.deucher at amd.com>
>>>> ---
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h | 1 +
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c | 8 ++++++++
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 7 +++++++
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_job.h | 2 ++
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 3 +++
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h | 1 +
>>>> drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c | 1 +
>>>> 7 files changed, 23 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
>>>> index 08f268dab8f5..eee40f385793 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
>>>> @@ -475,6 +475,7 @@ struct amdgpu_gfx {
>>>> uint32_t compute_supported_reset;
>>>>
>>>> /* gfx off */
>>>> + bool gfx_off_held; /* true: rings hold gfx_off */
>>>> bool gfx_off_state; /* true: enabled, false: disabled */
>>>> struct mutex gfx_off_mutex; /* mutex to change gfxoff state */
>>>> uint32_t gfx_off_req_count; /* default 1, enable gfx off: dec 1, disable gfx off: add 1 */
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
>>>> index 206b70acb29a..bf9bffe40235 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
>>>> @@ -191,6 +191,14 @@ int amdgpu_ib_schedule(struct amdgpu_ring *ring, unsigned int num_ibs,
>>>> return r;
>>>> }
>>>>
>>>> + if (job && adev->gfx.gfx_off_held &&
>>>> + (ring->funcs->type == AMDGPU_RING_TYPE_GFX ||
>>>> + ring->funcs->type == AMDGPU_RING_TYPE_COMPUTE ||
>>>> + ring->funcs->type == AMDGPU_RING_TYPE_SDMA)) {
>>>> + amdgpu_gfx_off_ctrl(adev, false);
>>>> + job->gfx_off_held = true;
>>>> + }
>>>> +
>>>> need_ctx_switch = ring->current_ctx != fence_ctx;
>>>> if (ring->funcs->emit_pipeline_sync && job &&
>>>> ((tmp = amdgpu_sync_get_fence(&job->explicit_sync)) ||
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>> index 2b58e353cca1..4cfd175ac6df 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>> @@ -191,6 +191,7 @@ int amdgpu_job_alloc(struct amdgpu_device *adev, struct amdgpu_vm *vm,
>>>> if (!*job)
>>>> return -ENOMEM;
>>>>
>>>> + (*job)->adev = adev;
>>>> (*job)->vm = vm;
>>>>
>>>> amdgpu_sync_create(&(*job)->explicit_sync);
>>>> @@ -268,6 +269,9 @@ static void amdgpu_job_free_cb(struct drm_sched_job *s_job)
>>>>
>>>> amdgpu_sync_free(&job->explicit_sync);
>>>>
>>>> + if (job->gfx_off_held)
>>>> + amdgpu_gfx_off_ctrl(job->adev, true);
>>>> +
>>>
>>>
>>>
>>>
>>>> /* only put the hw fence if has embedded fence */
>>>> if (!job->hw_fence.base.ops)
>>>> kfree(job);
>>>> @@ -301,6 +305,9 @@ void amdgpu_job_free(struct amdgpu_job *job)
>>>> if (job->gang_submit != &job->base.s_fence->scheduled)
>>>> dma_fence_put(job->gang_submit);
>>>>
>>>> + if (job->gfx_off_held)
>>>> + amdgpu_gfx_off_ctrl(job->adev, true);
>>>> +
>>>> if (!job->hw_fence.base.ops)
>>>> kfree(job);
>>>> else
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
>>>> index 2f302266662b..d4ab832ac193 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
>>>> @@ -46,6 +46,7 @@ enum amdgpu_ib_pool_type;
>>>>
>>>> struct amdgpu_job {
>>>> struct drm_sched_job base;
>>>> + struct amdgpu_device *adev;
>>>> struct amdgpu_vm *vm;
>>>> struct amdgpu_sync explicit_sync;
>>>> struct amdgpu_fence hw_fence;
>>>> @@ -55,6 +56,7 @@ struct amdgpu_job {
>>>> bool vm_needs_flush;
>>>> bool gds_switch_needed;
>>>> bool spm_update_needed;
>>>> + bool gfx_off_held;
>>>> uint64_t vm_pd_addr;
>>>> unsigned vmid;
>>>> unsigned pasid;
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>>>> index 426834806fbf..22cac94e2f2a 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>>>> @@ -350,6 +350,9 @@ int amdgpu_ring_init(struct amdgpu_device *adev, struct amdgpu_ring *ring,
>>>> ring->max_dw = max_dw;
>>>> ring->hw_prio = hw_prio;
>>>>
>>>> + if (ring->funcs->gfx_off_held)
>>>> + adev->gfx.gfx_off_held = true;
>>>> +
>>>> if (!ring->no_scheduler && ring->funcs->type < AMDGPU_HW_IP_NUM) {
>>>> hw_ip = ring->funcs->type;
>>>> num_sched = &adev->gpu_sched[hw_ip][hw_prio].num_scheds;
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>>> index 784ba2ec354c..afaf951b0b78 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>>> @@ -202,6 +202,7 @@ struct amdgpu_ring_funcs {
>>>> bool support_64bit_ptrs;
>>>> bool no_user_fence;
>>>> bool secure_submission_supported;
>>>> + bool gfx_off_held;
>>>> unsigned extra_dw;
>>>>
>>>> /* ring read/write ptr handling */
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
>>>> index 42a25150f83a..c88de65e82bc 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
>>>> @@ -1944,6 +1944,7 @@ static const struct amdgpu_ring_funcs sdma_v5_2_ring_funcs = {
>>>> .nop = SDMA_PKT_NOP_HEADER_OP(SDMA_OP_NOP),
>>>> .support_64bit_ptrs = true,
>>>> .secure_submission_supported = true,
>>>> + .gfx_off_held = true,
>>>> .get_rptr = sdma_v5_2_ring_get_rptr,
>>>> .get_wptr = sdma_v5_2_ring_get_wptr,
>>>> .set_wptr = sdma_v5_2_ring_set_wptr,
>>>
>>
More information about the amd-gfx
mailing list