[PATCH v2 0/3] Fix DCN 3.1.4 hangs on s2idle entry

Wed May 17 05:26:09 UTC 2023

On 5/17/2023 10:46 AM, Limonciello, Mario wrote:
> 
> On 5/17/2023 12:07 AM, Lazar, Lijo wrote:
>>
>>
>> On 5/17/2023 10:25 AM, Limonciello, Mario wrote:
>>>
>>> On 5/16/2023 11:43 PM, Lazar, Lijo wrote:
>>>>
>>>>
>>>> On 5/17/2023 5:04 AM, Mario Limonciello wrote:
>>>>> DCN 3.1.4 s2idle entry will hang
>>>>> occasionally on s2idle entry, but only if running Wayland and only
>>>>> when using `systemctl suspend`, not `echo mem | tee /sys/power/state`.
>>>>>
>>>>> This happens because using `systemctl suspend` will cause the screen
>>>>> to lock right before writing mem into /sys/power/state.
>>>>>
>>>>
>>>> A couple of things on this since this mentions systemctl suspend -
>>>>
>>>> 1) If in s2idle, it's supposed to immediately signal and not 
>>>> schedule delayed work
>>>>
>>>> 3964b0c2e843334858da99db881859faa4df241d drm/amdgpu: complete gfxoff 
>>>> allow signal during suspend without delay
>>>
>>> It looks like dead code to me now actually.
>>>
>>> amdgpu_device_set_pg_state() skips GFX, so gfxoff control never gets 
>>> called as part of suspend path.
>>>
>>
>> Ok, that means schedule happened sometime before. 
> To come up with these patches I had a test kernel with extra prints that 
> showed the function call orders.
> 
> With systemctl suspend there is a call to 
> gfx_v11_0_get_gpu_clock_counter() from userspace IOCTL that triggers all 
> this behavior. 

I think we replaced this with golden timestamp value which doesn't 
require GFX register access.

  Here is the function calls with the patched kernel:
> 
> [   32.720456] amdgpu 0000:c2:00.0: amdgpu: set GFX off state to 
> enabled, count:1
> [   32.720457] amdgpu 0000:c2:00.0: amdgpu: broke gfx_off_mutex for 
> gfx_v11_0_get_gpu_clock_counter+0xa8/0xf0 [amdgpu], 
> adev->gfx.gfx_off_state is 0
> [   32.760475] PM: suspend entry (s2idle)
> [   32.768996] Filesystems sync: 0.008 seconds
> [   32.769310] Freezing user space processes
> [   32.776527] Freezing user space processes completed (elapsed 0.007 
> seconds)
> [   32.776530] OOM killer disabled.
> [   32.776531] Freezing remaining freezable tasks
> [   32.777528] Freezing remaining freezable tasks completed (elapsed 
> 0.000 seconds)
> [   32.777531] printk: Suspending console(s) (use no_console_suspend to 
> debug)
> [   32.817853] amdgpu 0000:c2:00.0: amdgpu: Delayed work to enable gfxoff
> [   32.817857] amdgpu 0000:c2:00.0: amdgpu: 
> amdgpu_dpm_set_powergating_by_smu by 
> amdgpu_device_delay_enable_gfx_off.cold+0x29/0x46 [amdgpu]
> [   32.818142] amdgpu 0000:c2:00.0: amdgpu: broke pm.mutex for 
> amdgpu_device_delay_enable_gfx_off.cold+0x29/0x46 [amdgpu]
> [   32.852099] amdgpu 0000:c2:00.0: amdgpu: smu_suspend: suspend called
> [   32.852101] amdgpu 0000:c2:00.0: amdgpu: smu_disable_dpms: called
> 
> Without patch 1 the delayed work doesn't get called on entry ever.
> 
>> Can we remove this code also as there is a flush anyway with patch 1?
> 
> Sure.  Do you think it should go into patch 1 or on it's own?
> 

Preferably in patch 1 itself as it explains why it was removed.

>> Also, is there a need to call GFXOFF forcefully on S0ix suspend (any 
>> chance that gfxoff is not scheduled)?
>>
> If using "echo mem | sudo tee /sys/power/state" I've confirmed that it's 
> already in GFXOFF.  I don't think this case should happen.
>>>> 2) RLC is never stopped on GFX 10 or greater.
>>>>
>>> System was hanging before this series.
>>>
>>> Patch 3 "alone" matches this behavior as described above to skip RLC 
>>> suspend but two problems happen:
>>>
>>> 1) GFXOFF workqueue doesn't get flushed and so driver's request for 
>>> GFXOFF can happen at wrong time.
>>>
>>> 2) If suspend entry happens before GFXOFF is really asserted lots of 
>>> errors on resume. IE:
>>>
>>
>> Is patch 3 really required?  Does it make any difference?
>>
> No; patch 3 isn't really required with patches 1 and 2.
> 

My preference is to drop patch 3 and not to have an additional place of 
in_s0ix check.

Thanks,
Lijo

>> Thanks,
>> Lijo
>>
>>> [   63.095227] [drm] Fence fallback timer expired on ring sdma0
>>> [   63.098360] [drm] ring gfx_32772.1.1 was added
>>> [   63.099439] [drm] ring compute_32772.2.2 was added
>>> [   63.100460] [drm] ring sdma_32772.3.3 was added
>>> [   63.100504] [drm] ring gfx_32772.1.1 test pass
>>> [   63.607166] [drm] Fence fallback timer expired on ring gfx_32772.1.1
>>> [   63.607234] [drm] ring gfx_32772.1.1 ib test pass
>>> [   63.608964] [drm] ring compute_32772.2.2 test pass
>>> [   64.119173] [drm] Fence fallback timer expired on ring 
>>> compute_32772.2.2
>>> [   64.119219] [drm] ring compute_32772.2.2 ib test pass
>>> [   64.121364] [drm] ring sdma_32772.3.3 test pass
>>> [   64.631422] [drm] Fence fallback timer expired on ring sdma_32772.3.3
>>> [   64.631465] [drm] ring sdma_32772.3.3 ib test pass
>>> [   65.143184] [drm] Fence fallback timer expired on ring sdma0
>>>
>>>> Wondering if the code hides something else because of the timing.
>>>> Thanks,
>>>> Lijo
>>>>
>>>>> This causes a delayed GFXOFF entry to be scheduled right before s2idle
>>>>> entry.  If the workqueue doesn't get processed before the RLC is 
>>>>> turned
>>>>> off the system is hung. Even if the workqueue *does* get processed, 
>>>>> there
>>>>> is a race between the APU microcontrollers and driver for whether GFX
>>>>> is actually powered off when RLC is turned off.
>>>>>
>>>>> To avoid this issue, flush the workqueue on s2idle entry and ensure 
>>>>> that
>>>>> GFX is really in GFXOFF before any sensitive register accesses occur.
>>>>>
>>>>> Mario Limonciello (3):
>>>>>    drm/amd: Flush any delayed gfxoff on suspend entry
>>>>>    drm/amd: Poll for GFX core to be off
>>>>>    drm/amd: Skip RLC suspend for s0ix on PSP 13.0.4 and 13.0.11
>>>>>
>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 25 
>>>>> ++++++++++++++++++++++
>>>>>   drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c     | 18 ++++++++++++++++
>>>>>   drivers/gpu/drm/amd/include/amd_shared.h   |  1 +
>>>>>   drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c  |  4 ++--
>>>>>   4 files changed, 46 insertions(+), 2 deletions(-)
>>>>>