[PATCH] drm/amdgpu: Cancel delayed work when GFXOFF is disabled

Michel Dänzer michel at daenzer.net
Tue Aug 17 11:06:10 UTC 2021


On 2021-08-17 12:37 p.m., Lazar, Lijo wrote:
> 
> 
> On 8/17/2021 3:29 PM, Michel Dänzer wrote:
>> On 2021-08-17 11:37 a.m., Lazar, Lijo wrote:
>>>
>>>
>>> On 8/17/2021 2:56 PM, Michel Dänzer wrote:
>>>> On 2021-08-17 11:12 a.m., Lazar, Lijo wrote:
>>>>>
>>>>>
>>>>> On 8/17/2021 1:53 PM, Michel Dänzer wrote:
>>>>>> From: Michel Dänzer <mdaenzer at redhat.com>
>>>>>>
>>>>>> schedule_delayed_work does not push back the work if it was already
>>>>>> scheduled before, so amdgpu_device_delay_enable_gfx_off ran ~100 ms
>>>>>> after the first time GFXOFF was disabled and re-enabled, even if GFXOFF
>>>>>> was disabled and re-enabled again during those 100 ms.
>>>>>>
>>>>>> This resulted in frame drops / stutter with the upcoming mutter 41
>>>>>> release on Navi 14, due to constantly enabling GFXOFF in the HW and
>>>>>> disabling it again (for getting the GPU clock counter).
>>>>>>
>>>>>> To fix this, call cancel_delayed_work_sync when the disable count
>>>>>> transitions from 0 to 1, and only schedule the delayed work on the
>>>>>> reverse transition, not if the disable count was already 0. This makes
>>>>>> sure the delayed work doesn't run at unexpected times, and allows it to
>>>>>> be lock-free.
>>>>>>
>>>>>> v2:
>>>>>> * Use cancel_delayed_work_sync & mutex_trylock instead of
>>>>>>      mod_delayed_work.
>>>>>> v3:
>>>>>> * Make amdgpu_device_delay_enable_gfx_off lock-free (Christian König)
>>>>>> v4:
>>>>>> * Fix race condition between amdgpu_gfx_off_ctrl incrementing
>>>>>>      adev->gfx.gfx_off_req_count and amdgpu_device_delay_enable_gfx_off
>>>>>>      checking for it to be 0 (Evan Quan)
>>>>>>
>>>>>> Cc: stable at vger.kernel.org
>>>>>> Reviewed-by: Lijo Lazar <lijo.lazar at amd.com> # v3
>>>>>> Acked-by: Christian König <christian.koenig at amd.com> # v3
>>>>>> Signed-off-by: Michel Dänzer <mdaenzer at redhat.com>
>>>>>> ---
>>>>>>
>>>>>> Alex, probably best to wait a bit longer before picking this up. :)
>>>>>>
>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 11 +++----
>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c    | 36 +++++++++++++++-------
>>>>>>     2 files changed, 30 insertions(+), 17 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> index f3fd5ec710b6..f944ed858f3e 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> @@ -2777,12 +2777,11 @@ static void amdgpu_device_delay_enable_gfx_off(struct work_struct *work)
>>>>>>         struct amdgpu_device *adev =
>>>>>>             container_of(work, struct amdgpu_device, gfx.gfx_off_delay_work.work);
>>>>>>     -    mutex_lock(&adev->gfx.gfx_off_mutex);
>>>>>> -    if (!adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) {
>>>>>> -        if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, true))
>>>>>> -            adev->gfx.gfx_off_state = true;
>>>>>> -    }
>>>>>> -    mutex_unlock(&adev->gfx.gfx_off_mutex);
>>>>>> +    WARN_ON_ONCE(adev->gfx.gfx_off_state);
>>>>>> +    WARN_ON_ONCE(adev->gfx.gfx_off_req_count);
>>>>>> +
>>>>>> +    if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, true))
>>>>>> +        adev->gfx.gfx_off_state = true;
>>>>>>     }
>>>>>>       /**
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>>>>>> index a0be0772c8b3..b4ced45301be 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>>>>>> @@ -563,24 +563,38 @@ void amdgpu_gfx_off_ctrl(struct amdgpu_device *adev, bool enable)
>>>>>>           mutex_lock(&adev->gfx.gfx_off_mutex);
>>>>>>     -    if (!enable)
>>>>>> -        adev->gfx.gfx_off_req_count++;
>>>>>> -    else if (adev->gfx.gfx_off_req_count > 0)
>>>>>> +    if (enable) {
>>>>>> +        /* If the count is already 0, it means there's an imbalance bug somewhere.
>>>>>> +         * Note that the bug may be in a different caller than the one which triggers the
>>>>>> +         * WARN_ON_ONCE.
>>>>>> +         */
>>>>>> +        if (WARN_ON_ONCE(adev->gfx.gfx_off_req_count == 0))
>>>>>> +            goto unlock;
>>>>>> +
>>>>>>             adev->gfx.gfx_off_req_count--;
>>>>>>     -    if (enable && !adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) {
>>>>>> -        schedule_delayed_work(&adev->gfx.gfx_off_delay_work, GFX_OFF_DELAY_ENABLE);
>>>>>> -    } else if (!enable && adev->gfx.gfx_off_state) {
>>>>>> -        if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, false)) {
>>>>>> -            adev->gfx.gfx_off_state = false;
>>>>>> +        if (adev->gfx.gfx_off_req_count == 0 && !adev->gfx.gfx_off_state)
>>>>>> +            schedule_delayed_work(&adev->gfx.gfx_off_delay_work, GFX_OFF_DELAY_ENABLE);
>>>>>> +    } else {
>>>>>> +        if (adev->gfx.gfx_off_req_count == 0) {
>>>>>> +            cancel_delayed_work_sync(&adev->gfx.gfx_off_delay_work);
>>>>>> +
>>>>>> +            if (adev->gfx.gfx_off_state &&
>>>>>
>>>>> More of a question which I didn't check last time - Is this expected to be true when the disable call comes in first?
>>>>
>>>> My assumption is that cancel_delayed_work_sync guarantees amdgpu_device_delay_enable_gfx_off's assignment is visible here.
>>>>
>>>
>>> To clarify - when nothing is scheduled. If enable() is called when the count is 0, it goes to unlock. Now the expectation is someone to call Disable first.
>>
>> Yes, the very first amdgpu_gfx_off_ctrl call must pass enable=false, or it's a bug, which
>>
>>          if (WARN_ON_ONCE(adev->gfx.gfx_off_req_count == 0))
>>
>> will catch.
>>
>>
>>> Let's say  Disable() is called first, then the variable will be false, right?
>>
>> Ohh, I see what you mean. The first time amdgpu_gfx_off_ctrl is called with enable=false, adev->gfx.gfx_off_state == false (what it was initialized to), so it doesn't actually disable GFXOFF in HW.
> 
> Exactly.

Turns out that's not the end of that rabbit (side-)hole yet. :)

amdgpu_device_init initializes adev->gfx.gfx_off_req_count = 1. amdgpu_gfx_off_ctrl is then called with enable=true from amdgpu_device_init → amdgpu_device_ip_late_init → amdgpu_device_set_pg_state. This schedules amdgpu_device_delay_enable_gfx_off, which runs ~100ms later, enables GFXOFF in the HW and sets adev->gfx.gfx_off_state = true.

So it looks fine as is actually, if a bit convoluted. (I wonder if GFXOFF shouldn't rather be enabled synchronously during initialization though)


-- 
Earthling Michel Dänzer               |               https://redhat.com
Libre software enthusiast             |             Mesa and X developer


More information about the amd-gfx mailing list