[PATCH] drm/amdgpu: Cancel delayed work when GFXOFF is disabled

Fri Aug 13 13:34:08 UTC 2021

On 2021-08-13 1:50 p.m., Lazar, Lijo wrote:
> 
> 
> On 8/13/2021 3:59 PM, Michel Dänzer wrote:
>> From: Michel Dänzer <mdaenzer at redhat.com>
>>
>> schedule_delayed_work does not push back the work if it was already
>> scheduled before, so amdgpu_device_delay_enable_gfx_off ran ~100 ms
>> after the first time GFXOFF was disabled and re-enabled, even if GFXOFF
>> was disabled and re-enabled again during those 100 ms.
>>
>> This resulted in frame drops / stutter with the upcoming mutter 41
>> release on Navi 14, due to constantly enabling GFXOFF in the HW and
>> disabling it again (for getting the GPU clock counter).
>>
>> To fix this, call cancel_delayed_work_sync when GFXOFF transitions from
>> enabled to disabled. This makes sure the delayed work will be scheduled
>> as intended in the reverse case.
>>
>> In order to avoid a deadlock, amdgpu_device_delay_enable_gfx_off needs
>> to use mutex_trylock instead of mutex_lock.
>>
>> v2:
>> * Use cancel_delayed_work_sync & mutex_trylock instead of
>>    mod_delayed_work.
>>
>> Signed-off-by: Michel Dänzer <mdaenzer at redhat.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 11 ++++++++++-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c    | 13 +++++++------
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h    |  3 +++
>>   3 files changed, 20 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index f3fd5ec710b6..8b025f70706c 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -2777,7 +2777,16 @@ static void amdgpu_device_delay_enable_gfx_off(struct work_struct *work)
>>       struct amdgpu_device *adev =
>>           container_of(work, struct amdgpu_device, gfx.gfx_off_delay_work.work);
>>   -    mutex_lock(&adev->gfx.gfx_off_mutex);
>> +    /* mutex_lock could deadlock with cancel_delayed_work_sync in amdgpu_gfx_off_ctrl. */
>> +    if (!mutex_trylock(&adev->gfx.gfx_off_mutex)) {
>> +        /* If there's a bug which causes amdgpu_gfx_off_ctrl to be called with enable=true
>> +         * when adev->gfx.gfx_off_req_count is already 0, we might race with that.
>> +         * Re-schedule to make sure gfx off will be re-enabled in the HW eventually.
>> +         */
>> +        schedule_delayed_work(&adev->gfx.gfx_off_delay_work, AMDGPU_GFX_OFF_DELAY_ENABLE);
>> +        return;
> 
> This is not needed and is just creating another thread to contend for mutex.

Still not sure what you mean by that. What other thread?

> The checks below take care of enabling gfxoff correctly. If it's already in gfx_off state, it doesn't do anything. So I don't see why this change is needed.

mutex_trylock is needed to prevent the deadlock discussed before and below.

schedule_delayed_work is needed due to this scenario hinted at by the comment:

1. amdgpu_gfx_off_ctrl locks mutex, calls schedule_delayed_work
2. amdgpu_device_delay_enable_gfx_off runs, calls mutex_trylock, which fails

GFXOFF would never get re-enabled in HW in this case (until amdgpu_gfx_off_ctrl calls schedule_delayed_work again).

(cancel_delayed_work_sync guarantees there's no pending delayed work when it returns, even if amdgpu_device_delay_enable_gfx_off calls schedule_delayed_work)

> The other problem is amdgpu_get_gfx_off_status() also uses the same mutex.

Not sure what for TBH. AFAICT there's only one implementation of this for Renoir, which just reads a register. (It's only called from debugfs)

> So it won't be knowing which thread it would be contending against and blindly creates more work items. 

There is only ever at most one instance of the delayed work at any time. amdgpu_device_delay_enable_gfx_off doesn't care whether amdgpu_gfx_off_ctrl or amdgpu_get_gfx_off_status is holding the mutex, it just keeps re-scheduling itself 100 ms later until it succeeds.

>> @@ -569,9 +566,13 @@ void amdgpu_gfx_off_ctrl(struct amdgpu_device *adev, bool enable)
>>           adev->gfx.gfx_off_req_count--;
>>         if (enable && !adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) {
>> -        schedule_delayed_work(&adev->gfx.gfx_off_delay_work, GFX_OFF_DELAY_ENABLE);
>> -    } else if (!enable && adev->gfx.gfx_off_state) {
>> -        if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, false)) {
>> +        schedule_delayed_work(&adev->gfx.gfx_off_delay_work, AMDGPU_GFX_OFF_DELAY_ENABLE);
>> +    } else if (!enable) {
>> +        if (adev->gfx.gfx_off_req_count == 1 && !adev->gfx.gfx_off_state)
>> +            cancel_delayed_work_sync(&adev->gfx.gfx_off_delay_work);
> 
> This has the deadlock problem as discussed in the other thread.

It does not. If amdgpu_device_delay_enable_gfx_off runs while amdgpu_gfx_off_ctrl holds the mutex, 
mutex_trylock fails and the former bails.

-- 
Earthling Michel Dänzer               |               https://redhat.com
Libre software enthusiast             |             Mesa and X developer