[PATCH v2 0/3] Fix DCN 3.1.4 hangs on s2idle entry

Limonciello, Mario mlimonci at amd.com
Wed May 17 05:35:51 UTC 2023


On 5/17/2023 12:26 AM, Lazar, Lijo wrote:
>
>
> On 5/17/2023 10:46 AM, Limonciello, Mario wrote:
>>
>> On 5/17/2023 12:07 AM, Lazar, Lijo wrote:
>>>
>>>
>>> On 5/17/2023 10:25 AM, Limonciello, Mario wrote:
>>>>
>>>> On 5/16/2023 11:43 PM, Lazar, Lijo wrote:
>>>>>
>>>>>
>>>>> On 5/17/2023 5:04 AM, Mario Limonciello wrote:
>>>>>> DCN 3.1.4 will hang
>>>>>> occasionally on s2idle entry, but only if running Wayland and only
>>>>>> when using `systemctl suspend`, not `echo mem | tee 
>>>>>> /sys/power/state`.
>>>>>>
>>>>>> This happens because using `systemctl suspend` will cause the screen
>>>>>> to lock right before writing mem into /sys/power/state.
>>>>>>
>>>>>
>>>>> A couple of things on this since this mentions systemctl suspend -
>>>>>
>>>>> 1) If in s2idle, it's supposed to immediately signal and not 
>>>>> schedule delayed work
>>>>>
>>>>> 3964b0c2e843334858da99db881859faa4df241d drm/amdgpu: complete 
>>>>> gfxoff allow signal during suspend without delay
>>>>
>>>> It looks like dead code to me now actually.
>>>>
>>>> amdgpu_device_set_pg_state() skips GFX, so gfxoff control never 
>>>> gets called as part of suspend path.
>>>>
>>>
>>> Ok, that means schedule happened sometime before. 
>> To come up with these patches I had a test kernel with extra prints 
>> that showed the function call orders.
>>
>> With systemctl suspend there is a call to 
>> gfx_v11_0_get_gpu_clock_counter() from userspace IOCTL that triggers 
>> all this behavior. 
>
> I think we replaced this with golden timestamp value which doesn't 
> require GFX register access.

Ah yes; through

5591a051b86b ("drm/amdgpu: refine get gpu clock counter method")

This wasn't part of the kernel this was originally reported on.

I suspect this would significantly decrease the likelihood of it 
occurring.  I'll confirm it.
I do think that patches 1/2 still make sense though because gfxoff can 
be triggered other ways too.
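
To illustrate the ordering problem that the flush in patch 1 addresses, here
is a minimal userspace sketch. It is not the amdgpu code: the struct, the
function names and the simplified request counting are all hypothetical, and
it only models "delayed work still pending when suspend starts".

```c
#include <stdbool.h>

/* Hypothetical userspace model of the GFXOFF delayed work; not driver code. */
struct gfxoff_model {
	int  req_count;      /* outstanding "disallow GFXOFF" requests */
	bool work_pending;   /* delayed "enable GFXOFF" work is queued */
	bool gfxoff_enabled; /* GFXOFF was requested before suspend */
	bool suspended;      /* suspend path has already run */
	bool late_request;   /* work ran after suspend: the problem case */
};

/* Disallow GFXOFF, e.g. around a GFX register access from an IOCTL. */
static void model_gfxoff_disallow(struct gfxoff_model *m)
{
	m->req_count++;
	m->work_pending = false;   /* cancel any queued enable */
	m->gfxoff_enabled = false;
}

/* Allow GFXOFF again; the last user queues the delayed enable. */
static void model_gfxoff_allow(struct gfxoff_model *m)
{
	if (m->req_count > 0 && --m->req_count == 0)
		m->work_pending = true;
}

/* Body of the delayed work: request GFXOFF if it is still pending. */
static void model_delayed_work_run(struct gfxoff_model *m)
{
	if (!m->work_pending)
		return;
	m->work_pending = false;
	if (m->suspended)
		m->late_request = true;   /* request lands at the wrong time */
	else
		m->gfxoff_enabled = true;
}

/* Suspend entry; flush_first models flushing the work before suspending. */
static void model_suspend(struct gfxoff_model *m, bool flush_first)
{
	if (flush_first)
		model_delayed_work_run(m);
	m->suspended = true;
}
```

With flush_first set, the pending request is forced to run before the
suspend path continues, whatever queued it. Without the flush, the work can
fire after suspend has already started.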

>
>> Here are the function calls with the patched kernel:
>>
>> [   32.720456] amdgpu 0000:c2:00.0: amdgpu: set GFX off state to 
>> enabled, count:1
>> [   32.720457] amdgpu 0000:c2:00.0: amdgpu: broke gfx_off_mutex for 
>> gfx_v11_0_get_gpu_clock_counter+0xa8/0xf0 [amdgpu], 
>> adev->gfx.gfx_off_state is 0
>> [   32.760475] PM: suspend entry (s2idle)
>> [   32.768996] Filesystems sync: 0.008 seconds
>> [   32.769310] Freezing user space processes
>> [   32.776527] Freezing user space processes completed (elapsed 0.007 
>> seconds)
>> [   32.776530] OOM killer disabled.
>> [   32.776531] Freezing remaining freezable tasks
>> [   32.777528] Freezing remaining freezable tasks completed (elapsed 
>> 0.000 seconds)
>> [   32.777531] printk: Suspending console(s) (use no_console_suspend 
>> to debug)
>> [   32.817853] amdgpu 0000:c2:00.0: amdgpu: Delayed work to enable 
>> gfxoff
>> [   32.817857] amdgpu 0000:c2:00.0: amdgpu: 
>> amdgpu_dpm_set_powergating_by_smu by 
>> amdgpu_device_delay_enable_gfx_off.cold+0x29/0x46 [amdgpu]
>> [   32.818142] amdgpu 0000:c2:00.0: amdgpu: broke pm.mutex for 
>> amdgpu_device_delay_enable_gfx_off.cold+0x29/0x46 [amdgpu]
>> [   32.852099] amdgpu 0000:c2:00.0: amdgpu: smu_suspend: suspend called
>> [   32.852101] amdgpu 0000:c2:00.0: amdgpu: smu_disable_dpms: called
>>
>> Without patch 1 the delayed work doesn't get called on entry ever.
>>
>>> Can we remove this code also as there is a flush anyway with patch 1?
>>
>> Sure.  Do you think it should go into patch 1 or on its own?
>>
>
> Preferably in patch 1 itself as it explains why it was removed.
OK.
>
>>> Also, is there a need to call GFXOFF forcefully on S0ix suspend (any 
>>> chance that gfxoff is not scheduled)?
>>>
>> If using "echo mem | sudo tee /sys/power/state" I've confirmed that 
>> it's already in GFXOFF.  I don't think this case should happen.
>>>>> 2) RLC is never stopped on GFX 10 or greater.
>>>>>
>>>> System was hanging before this series.
>>>>
>>>> Patch 3 "alone" matches this behavior as described above to skip 
>>>> RLC suspend but two problems happen:
>>>>
>>>> 1) GFXOFF workqueue doesn't get flushed and so driver's request for 
>>>> GFXOFF can happen at wrong time.
>>>>
>>>> 2) If suspend entry happens before GFXOFF is really asserted, there
>>>> are lots of errors on resume. IE:
>>>>
>>>
>>> Is patch 3 really required?  Does it make any difference?
>>>
>> No; patch 3 isn't really required with patches 1 and 2.
>>
>
> My preference is to drop patch 3 and not to have an additional place 
> of in_s0ix check.
OK.
>
> Thanks,
> Lijo
>

