[PATCH] drm/amdgpu: Cancel delayed work when GFXOFF is disabled
Michel Dänzer
michel at daenzer.net
Fri Aug 13 16:00:44 UTC 2021
On 2021-08-13 5:07 p.m., Lazar, Lijo wrote:
>
>
> On 8/13/2021 8:10 PM, Michel Dänzer wrote:
>> On 2021-08-13 4:14 p.m., Lazar, Lijo wrote:
>>> On 8/13/2021 7:04 PM, Michel Dänzer wrote:
>>>> On 2021-08-13 1:50 p.m., Lazar, Lijo wrote:
>>>>> On 8/13/2021 3:59 PM, Michel Dänzer wrote:
>>>>>> From: Michel Dänzer <mdaenzer at redhat.com>
>>>>>>
>>>>>> schedule_delayed_work does not push back the work if it was already
>>>>>> scheduled before, so amdgpu_device_delay_enable_gfx_off ran ~100 ms
>>>>>> after the first time GFXOFF was disabled and re-enabled, even if GFXOFF
>>>>>> was disabled and re-enabled again during those 100 ms.
>>>>>>
>>>>>> This resulted in frame drops / stutter with the upcoming mutter 41
>>>>>> release on Navi 14, due to constantly enabling GFXOFF in the HW and
>>>>>> disabling it again (for getting the GPU clock counter).
>>>>>>
>>>>>> To fix this, call cancel_delayed_work_sync when GFXOFF transitions from
>>>>>> enabled to disabled. This makes sure the delayed work will be scheduled
>>>>>> as intended in the reverse case.
>>>>>>
>>>>>> In order to avoid a deadlock, amdgpu_device_delay_enable_gfx_off needs
>>>>>> to use mutex_trylock instead of mutex_lock.
>>>>>>
>>>>>> v2:
>>>>>> * Use cancel_delayed_work_sync & mutex_trylock instead of
>>>>>> mod_delayed_work.
>>>>>>
>>>>>> Signed-off-by: Michel Dänzer <mdaenzer at redhat.com>
>>>>>> ---
>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 11 ++++++++++-
>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c | 13 +++++++------
>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h | 3 +++
>>>>>> 3 files changed, 20 insertions(+), 7 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> index f3fd5ec710b6..8b025f70706c 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> @@ -2777,7 +2777,16 @@ static void amdgpu_device_delay_enable_gfx_off(struct work_struct *work)
>>>>>> struct amdgpu_device *adev =
>>>>>> container_of(work, struct amdgpu_device, gfx.gfx_off_delay_work.work);
>>>>>> - mutex_lock(&adev->gfx.gfx_off_mutex);
>>>>>> + /* mutex_lock could deadlock with cancel_delayed_work_sync in amdgpu_gfx_off_ctrl. */
>>>>>> + if (!mutex_trylock(&adev->gfx.gfx_off_mutex)) {
>>>>>> + /* If there's a bug which causes amdgpu_gfx_off_ctrl to be called with enable=true
>>>>>> + * when adev->gfx.gfx_off_req_count is already 0, we might race with that.
>>>>>> + * Re-schedule to make sure gfx off will be re-enabled in the HW eventually.
>>>>>> + */
>>>>>> + schedule_delayed_work(&adev->gfx.gfx_off_delay_work, AMDGPU_GFX_OFF_DELAY_ENABLE);
>>>>>> + return;
>>>>>
>>>>> This is not needed and is just creating another thread to contend for mutex.
>>>>
>>>> Still not sure what you mean by that. What other thread?
>>>
>>> Sorry, I meant it schedules another workitem and delays GFXOFF enablement further. For ex: if it was another function like gfx_off_status holding the lock at the time of check.
>>>
>>>>
>>>>> The checks below take care of enabling gfxoff correctly. If it's already in gfx_off state, it doesn't do anything. So I don't see why this change is needed.
>>>>
>>>> mutex_trylock is needed to prevent the deadlock discussed before and below.
>>>>
>>>> schedule_delayed_work is needed due to this scenario hinted at by the comment:
>>>>
>>>> 1. amdgpu_gfx_off_ctrl locks mutex, calls schedule_delayed_work
>>>> 2. amdgpu_device_delay_enable_gfx_off runs, calls mutex_trylock, which fails
>>>>
>>>> GFXOFF would never get re-enabled in HW in this case (until amdgpu_gfx_off_ctrl calls schedule_delayed_work again).
>>>>
>>>> (cancel_delayed_work_sync guarantees there's no pending delayed work when it returns, even if amdgpu_device_delay_enable_gfx_off calls schedule_delayed_work)
>>>>
>>>
>>> I think we need to explain based on the original code before. There is an asssumption here that the only other contention of this mutex is with the gfx_off_ctrl function.
>>
>> Not really.
>>
>>
>>> As far as I understand if the work has already started running when schedule_delayed_work is called, it will insert another in the work queue after delay. Based on that understanding I didn't find a problem with the original code.
>>
>> Original code as in without this patch or the mod_delayed_work patch? If so, the problem is not when the work has already started running. It's that when it hasn't started running yet, schedule_delayed_work doesn't change the timeout for the already scheduled work, so it ends up enabling GFXOFF earlier than intended (and thus at all in scenarios when it's not supposed to).
>>
>
> I meant the original implementation of amdgpu_device_delay_enable_gfx_off().
>
>
> If you indeed want to use _sync, there is a small problem with this implementation also which is roughly equivalent to the original problem you faced.
>
> amdgpu_gfx_off_ctrl(disable) locks mutex
> calls cancel_delayed_work_sync
> amdgpu_device_delay_enable_gfx_off already started running
> mutex_trylock fails and schedules another one
> amdgpu_gfx_off_ctrl(enable)
> schedules_delayed_work() - Delay is not extended, it's the same as when it's rearmed from work item.
This cannot happen. When cancel_delayed_work_sync returns, it guarantees that the delayed work is not scheduled
, even if amdgpu_device_delay_enable_gfx_off called schedule_delayed_work. In other words, it cancels that as well.
> Probably, overthinking about the solution. Looking back, mod_ version is simpler :). May be just delay it further everytime there is a call with enable instead of doing it only for req_cnt==0?
That has some issues as well:
* Still prone to the "amdgpu_device_delay_enable_gfx_off re-enables GFXOFF immediately after amdgpu_gfx_off_ctrl dropped req_count to 0" race if the former starts running between when the latter locks the mutex and calls mod_delayed_work.
* If the work is not scheduled yet, mod_delayed_work would schedule it, even if req_count > 0, in which case it couldn't actually enable GFXOFF.
Conceptually, making sure the work is never scheduled while req_count > 0 seems cleaner to me. It's the same principle as in the JPEG/UVD/VCE/VCN ring functions (which are presumably hotter paths than these amdgpu_gfx_off functions) I needlessly modified in patch 2.
(It also means amdgpu_device_delay_enable_gfx_off technically no longer needs to test req_count or gfx_off_state; I can spin a v3 for that if desired)
--
Earthling Michel Dänzer | https://redhat.com
Libre software enthusiast | Mesa and X developer
More information about the dri-devel
mailing list