[RFC] drm/amdgpu/sdma5.2: Avoid latencies caused by the powergating workaround

Tvrtko Ursulin tvrtko.ursulin at igalia.com
Wed Jul 16 14:06:29 UTC 2025


On 16/07/2025 14:00, Christian König wrote:
> On 16.07.25 14:51, Tvrtko Ursulin wrote:
>>>>>>> be disabled once GFX/SDMA is no longer active.  In this particular
>>>>>>> case there was a race condition somewhere in the internal handshaking
>>>>>>> with SDMA which led to SDMA missing doorbells sometimes and not
>>>>>>> executing the job even if there was work in the ring.
>>>>>>
>>>>>> Thank you, that is more or less what I assumed.
>>>>>>
>>>>>> But in this case there should be no harm in holding GFXOFF disabled
>>>>>> until the job completes (like this patch)? Only a win to avoid the SMU
>>>>>> communication latencies while unit is powered on anyway.
>>>>>
>>>>> The extra latency is only on the CPU side, once the
>>>>> amdgpu_ring_commit() is called the SDMA engine is already working.
>>>>
>>>> It is on the CPU side but can create bubbles in the pipeline, no? Is
>>>> there no scope with AMD to have GFX and SDMA jobs depend on each other?
>>>> Because, as said, I've seen some high latencies from the GFXOFF disable
>>>> calls.
>>>
>>> The SDMA job is already executing at that point.  The allow gfxoff
>>> message to the firmware shouldn't come until later because it's
>>> handled by a delayed work thread from end_use().  If you have multiple
>>> submissions to SDMA within the delay window, the begin_use() and
>>> end_use() will just be ref count handling and won't actually talk to
>>> the firmware.
>>
>> I followed up with testing a bunch more games, and as it turns out, Cyberpunk 2077 is the only one which has this submission pattern where the default GFX_OFF_DELAY_ENABLE is regularly defeated.
>>
>> There, around 1.2 times per second the SDMA submissions miss that 100ms hysteresis and cause a CPU latency over 100us (I only measured when >100us and ignored the rest). Average latency is ~400us and max is ~2ms. So IMHO quite bad.
> 
> What exactly does Cyberpunk do to hit that? Are those SDMA page table updates, clears or userspace submissions?

I will have to look into that to provide an answer.

>> And the vast majority of those latencies come from the SMU request. Only very rarely does someone hit the mutex contention path.
>>
>> So that was the motivation for the RFC. I suppose I could have also proposed to increase the hysteresis, but holding GFXOFF disabled for the duration of the job sounded preferable for power consumption.
>>
>> Anyway, given I only found that Cyberpunk 2077 suffers from this, I guess it maybe isn't too interesting to upstream for you guys. Then again it is limited to a specific old SKU, so maybe it should not be that controversial either? Only that Christian NAKed tying it to job lifetime. So I don't know, AMD's call.
> 
> Well what you could do is to take a look if we couldn't simplify the SMU and/or adjust the GFX_OFF_DELAY_ENABLED.

SMU stuff, as far as I can follow it, ends up simply sending some
messages to the firmware. So I am not sure what could be optimised
there, or how.

Increasing GFX_OFF_DELAY_ENABLE would work, if large enough, but I
think it could be bad for power usage, depending on the workload.
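
To illustrate the trade-off, here is a toy user-space model (not driver
code; the function name and the simplifications are my assumptions). It
counts how many submissions would miss the hysteresis window and have to
ask the firmware to disable GFXOFF again, for a given delay:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Toy model only. submit_ms holds submission times in milliseconds,
 * sorted ascending. delay_ms plays the role of the re-enable delay
 * (hysteresis).
 *
 * Simplification: every job is treated as instantaneous, so the delay
 * timer is assumed to start at the submission time itself.
 */
static unsigned int count_smu_wakeups(const uint64_t *submit_ms, size_t n,
				      uint64_t delay_ms)
{
	unsigned int wakeups = 0;
	size_t i;

	assert(n == 0 || submit_ms != NULL);

	for (i = 0; i < n; i++) {
		/*
		 * The first submission always pays the cost. Later ones
		 * pay it only if the gap since the previous submission
		 * was long enough for the delayed re-enable to fire.
		 */
		if (i == 0 || submit_ms[i] - submit_ms[i - 1] >= delay_ms)
			wakeups++;
	}

	return wakeups;
}
```

With submissions at 0, 50, 120, 400 and 450 ms, a 100 ms delay gives 2
firmware round trips and a 300 ms delay gives 1, at the cost of keeping
the unit powered for longer after each burst.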

> On the other hand why does it help to keep GFXOFF disabled while running the SDMA job?

Only because I tied it to both GFX and SDMA.

RFC does this:

1) Marks SDMA as "needs GFXOFF workaround".
2) Propagates "needs GFXOFF workaround" to adev if any active ring has 
it set.
3) If adev has it set, it grabs an extra GFXOFF disable for GFX,
COMPUTE and SDMA submissions, and marks those jobs as "hold GFXOFF".
4) Releases the GFXOFF disable when marked jobs are "completed" (well,
freed, since completion happens in IRQ context, which makes it hard).
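
As a rough sketch of the four steps above, in plain user-space C rather
than the actual patch (all struct and function names here are
hypothetical and only model the reference counting):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model; none of these names exist in amdgpu. */
struct toy_ring {
	bool needs_gfxoff_wa;	/* step 1: set for the SDMA ring only */
	bool active;
};

struct toy_dev {
	bool needs_gfxoff_wa;			/* step 2 */
	unsigned int gfxoff_disable_count;	/* > 0 keeps GFXOFF disabled */
};

struct toy_job {
	bool holds_gfxoff;	/* step 3: job took an extra disable */
};

/* Step 2: the device needs the workaround if any active ring does. */
static void toy_dev_init(struct toy_dev *dev, const struct toy_ring *rings,
			 int num_rings)
{
	int i;

	dev->needs_gfxoff_wa = false;
	dev->gfxoff_disable_count = 0;

	for (i = 0; i < num_rings; i++) {
		if (rings[i].active && rings[i].needs_gfxoff_wa)
			dev->needs_gfxoff_wa = true;
	}
}

/* Step 3: take an extra GFXOFF disable and mark the job. */
static void toy_job_submit(struct toy_dev *dev, struct toy_job *job)
{
	job->holds_gfxoff = dev->needs_gfxoff_wa;
	if (job->holds_gfxoff)
		dev->gfxoff_disable_count++;
}

/* Step 4: drop the disable when the job is freed. */
static void toy_job_free(struct toy_dev *dev, struct toy_job *job)
{
	if (!job->holds_gfxoff)
		return;

	assert(dev->gfxoff_disable_count > 0);
	dev->gfxoff_disable_count--;
	job->holds_gfxoff = false;
}
```

The point of the counter is that overlapping jobs keep GFXOFF disabled
until the last marked job is freed, so only the first submission in a
busy period would need to talk to the SMU.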

AFAIU from what Alex said, the parts of the chip handling GFX and SDMA
(not sure about compute) are under the same "power gating domain"
(right name?).

What would you suggest for logging power use during the game? Sampling
something like once per second or so?

Regards,

Tvrtko


