[PATCH v3] drm/amdgpu: Cancel delayed work when GFXOFF is disabled

Tue Aug 17 08:35:35 UTC 2021

On 2021-08-17 10:17 a.m., Lazar, Lijo wrote:
> On 8/17/2021 1:21 PM, Quan, Evan wrote:
>>> -----Original Message-----
>>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of
>>> Michel Dänzer
>>> Sent: Monday, August 16, 2021 6:35 PM
>>> To: Deucher, Alexander <Alexander.Deucher at amd.com>; Koenig, Christian
>>> <Christian.Koenig at amd.com>
>>> Cc: Liu, Leo <Leo.Liu at amd.com>; Zhu, James <James.Zhu at amd.com>; amd-
>>> gfx at lists.freedesktop.org; dri-devel at lists.freedesktop.org
>>> Subject: [PATCH v3] drm/amdgpu: Cancel delayed work when GFXOFF is
>>> disabled
>>>
>>> From: Michel Dänzer <mdaenzer at redhat.com>
>>>
>>> schedule_delayed_work does not push back the work if it was already
>>> scheduled before, so amdgpu_device_delay_enable_gfx_off ran ~100 ms
>>> after the first time GFXOFF was disabled and re-enabled, even if GFXOFF
>>> was disabled and re-enabled again during those 100 ms.
>>>
>>> This resulted in frame drops / stutter with the upcoming mutter 41
>>> release on Navi 14, due to constantly enabling GFXOFF in the HW and
>>> disabling it again (for getting the GPU clock counter).
>>>
>>> To fix this, call cancel_delayed_work_sync when the disable count
>>> transitions from 0 to 1, and only schedule the delayed work on the
>>> reverse transition, not if the disable count was already 0. This makes
>>> sure the delayed work doesn't run at unexpected times, and allows it to
>>> be lock-free.
>>>
>>> v2:
>>> * Use cancel_delayed_work_sync & mutex_trylock instead of
>>>    mod_delayed_work.
>>> v3:
>>> * Make amdgpu_device_delay_enable_gfx_off lock-free (Christian König)
>>>
>>> Cc: stable at vger.kernel.org
>>> Signed-off-by: Michel Dänzer <mdaenzer at redhat.com>
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 11 +++++------
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c    | 22 +++++++++++++++++-
>>> ----
>>>   2 files changed, 22 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index f3fd5ec710b6..f944ed858f3e 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -2777,12 +2777,11 @@ static void
>>> amdgpu_device_delay_enable_gfx_off(struct work_struct *work)
>>>       struct amdgpu_device *adev =
>>>           container_of(work, struct amdgpu_device,
>>> gfx.gfx_off_delay_work.work);
>>>
>>> -    mutex_lock(&adev->gfx.gfx_off_mutex);
>>> -    if (!adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) {
>>> -        if (!amdgpu_dpm_set_powergating_by_smu(adev,
>>> AMD_IP_BLOCK_TYPE_GFX, true))
>>> -            adev->gfx.gfx_off_state = true;
>>> -    }
>>> -    mutex_unlock(&adev->gfx.gfx_off_mutex);
>>> +    WARN_ON_ONCE(adev->gfx.gfx_off_state);
>>> +    WARN_ON_ONCE(adev->gfx.gfx_off_req_count);
>>> +
>>> +    if (!amdgpu_dpm_set_powergating_by_smu(adev,
>>> AMD_IP_BLOCK_TYPE_GFX, true))
>>> +        adev->gfx.gfx_off_state = true;
>>>   }
>>>
>>>   /**
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>>> index a0be0772c8b3..ca91aafcb32b 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>>> @@ -563,15 +563,26 @@ void amdgpu_gfx_off_ctrl(struct amdgpu_device
>>> *adev, bool enable)
>>>
>>>       mutex_lock(&adev->gfx.gfx_off_mutex);
>>>
>>> -    if (!enable)
>>> -        adev->gfx.gfx_off_req_count++;
>>> -    else if (adev->gfx.gfx_off_req_count > 0)
>>> +    if (enable) {
>>> +        /* If the count is already 0, it means there's an imbalance bug
>>> somewhere.
>>> +         * Note that the bug may be in a different caller than the one
>>> which triggers the
>>> +         * WARN_ON_ONCE.
>>> +         */
>>> +        if (WARN_ON_ONCE(adev->gfx.gfx_off_req_count == 0))
>>> +            goto unlock;
>>> +
>>>           adev->gfx.gfx_off_req_count--;
>>> +    } else {
>>> +        adev->gfx.gfx_off_req_count++;
>>> +    }
>>>
>>>       if (enable && !adev->gfx.gfx_off_state && !adev-
>>>> gfx.gfx_off_req_count) {
>>>           schedule_delayed_work(&adev->gfx.gfx_off_delay_work,
>>> GFX_OFF_DELAY_ENABLE);
>>> -    } else if (!enable && adev->gfx.gfx_off_state) {
>>> -        if (!amdgpu_dpm_set_powergating_by_smu(adev,
>>> AMD_IP_BLOCK_TYPE_GFX, false)) {
>>> +    } else if (!enable && adev->gfx.gfx_off_req_count == 1) {
>> [Quan, Evan] It seems here will leave a small time window for race condition. If amdgpu_device_delay_enable_gfx_off() happens to occur here, it will "WARN_ON_ONCE(adev->gfx.gfx_off_req_count);". How about something as below?
>> @@ -573,13 +573,11 @@ void amdgpu_gfx_off_ctrl(struct amdgpu_device *adev, bool enable)
>>                          goto unlock;
>>
>>                  adev->gfx.gfx_off_req_count--;
>> -       } else {
>> -               adev->gfx.gfx_off_req_count++;
>>          }
>>
>>          if (enable && !adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) {
>>                  schedule_delayed_work(&adev->gfx.gfx_off_delay_work, GFX_OFF_DELAY_ENABLE);
>> -       } else if (!enable && adev->gfx.gfx_off_req_count == 1) {
>> +       } else if (!enable && adev->gfx.gfx_off_req_count == 0) {
>>                  cancel_delayed_work_sync(&adev->gfx.gfx_off_delay_work);
>>
>>                  if (adev->gfx.gfx_off_state &&
>> @@ -593,6 +591,9 @@ void amdgpu_gfx_off_ctrl(struct amdgpu_device *adev, bool enable)
>>                  }
>>          }
>>
>> +       if (!enable)
>> +               adev->gfx.gfx_off_req_count++;
>> +
>>   unlock:
>>
> 
> Hi Evan,
> 
> It's not a race per se, it is just an undesirable condition of Enable Gfxoff immediately followed by a Disable GfxOff. The purpose of the WARN is to intimate the user about it.

What Evan pointed out (good catch, thanks!) is technically a race condition WRT adev->gfx.gfx_off_req_count, even though in this case it would have only triggered the sanity checks in place to catch bugs like it, it wouldn't otherwise have affected the correctness of the code.

Fixed in v4.

> There are other cases - for ex: if amdgpu_device_delay_enable_gfx_off() called amdgpu_dpm_set_powergating_by_smu() already at the same place you pointed out.

That OTOH is indeed not a race condition, just unlucky timing.

-- 
Earthling Michel Dänzer               |               https://redhat.com
Libre software enthusiast             |             Mesa and X developer