[PATCH v2 0/3] Fix DCN 3.1.4 hangs on s2idle entry
Limonciello, Mario
mario.limonciello at amd.com
Wed May 17 06:23:56 UTC 2023
>> I think we replaced this with golden timestamp value which doesn't
>> require GFX register access.
>
> Ah yes; through
>
> 5591a051b86b ("drm/amdgpu: refine get gpu clock counter method")
>
> This wasn't part of the kernel this was originally reported on.
>
> I suspect this would significantly decrease the likelihood of it
> occurring. I'll confirm it.
> I do think that patches 1/2 still make sense though because gfxoff can
> be triggered other ways too.
>
Confirmed that by adding:
5591a051b86b ("drm/amdgpu: refine get gpu clock counter method")
and
ea27ee2bea6b ("drm/amdgpu/gfx11: update gpu_clock_counter logic")
the original issue goes away.
I will still refine my patches and send a v3 up though as GFXOFF can be
triggered other ways by userspace and we should avoid this bug.
@Alex:
Can you please queue up ea27ee2bea6b for this week's fixes and include
the tags:
Cc: stable at vger.kernel.org # 6.1.y: 5591a051b86b: drm/amdgpu: refine get gpu clock counter method
Cc: stable at vger.kernel.org # 6.2.y: 5591a051b86b: drm/amdgpu: refine get gpu clock counter method
Cc: stable at vger.kernel.org # 6.3.y: 5591a051b86b: drm/amdgpu: refine get gpu clock counter method
>> Here are the function calls with the patched kernel:
>>>
>>> [ 32.720456] amdgpu 0000:c2:00.0: amdgpu: set GFX off state to
>>> enabled, count:1
>>> [ 32.720457] amdgpu 0000:c2:00.0: amdgpu: broke gfx_off_mutex for
>>> gfx_v11_0_get_gpu_clock_counter+0xa8/0xf0 [amdgpu],
>>> adev->gfx.gfx_off_state is 0
>>> [ 32.760475] PM: suspend entry (s2idle)
>>> [ 32.768996] Filesystems sync: 0.008 seconds
>>> [ 32.769310] Freezing user space processes
>>> [ 32.776527] Freezing user space processes completed (elapsed
>>> 0.007 seconds)
>>> [ 32.776530] OOM killer disabled.
>>> [ 32.776531] Freezing remaining freezable tasks
>>> [ 32.777528] Freezing remaining freezable tasks completed (elapsed
>>> 0.000 seconds)
>>> [ 32.777531] printk: Suspending console(s) (use no_console_suspend
>>> to debug)
>>> [ 32.817853] amdgpu 0000:c2:00.0: amdgpu: Delayed work to enable
>>> gfxoff
>>> [ 32.817857] amdgpu 0000:c2:00.0: amdgpu:
>>> amdgpu_dpm_set_powergating_by_smu by
>>> amdgpu_device_delay_enable_gfx_off.cold+0x29/0x46 [amdgpu]
>>> [ 32.818142] amdgpu 0000:c2:00.0: amdgpu: broke pm.mutex for
>>> amdgpu_device_delay_enable_gfx_off.cold+0x29/0x46 [amdgpu]
>>> [ 32.852099] amdgpu 0000:c2:00.0: amdgpu: smu_suspend: suspend called
>>> [ 32.852101] amdgpu 0000:c2:00.0: amdgpu: smu_disable_dpms: called
>>>
>>> Without patch 1 the delayed work doesn't get called on entry ever.
>>>
>>>> Can we remove this code also as there is a flush anyway with patch 1?
>>>
>>> Sure. Do you think it should go into patch 1 or on its own?
>>>
>>
>> Preferably in patch 1 itself as it explains why it was removed.
> OK.
>>
>>>> Also, is there a need to call GFXOFF forcefully on S0ix suspend
>>>> (any chance that gfxoff is not scheduled)?
>>>>
>>> If using "echo mem | sudo tee /sys/power/state" I've confirmed that
>>> it's already in GFXOFF. I don't think this case should happen.
>>>>>> 2) RLC is never stopped on GFX 10 or greater.
>>>>>>
>>>>> System was hanging before this series.
>>>>>
>>>>> Patch 3 "alone" matches this behavior as described above to skip
>>>>> RLC suspend but two problems happen:
>>>>>
>>>>> 1) GFXOFF workqueue doesn't get flushed and so driver's request
>>>>> for GFXOFF can happen at wrong time.
>>>>>
>>>>> 2) If suspend entry happens before GFXOFF is really asserted, there
>>>>> are lots of errors on resume. IE:
>>>>>
>>>>
>>>> Is patch 3 really required? Does it make any difference?
>>>>
>>> No; patch 3 isn't really required with patches 1 and 2.
>>>
>>
>> My preference is to drop patch 3 and not to have an additional place
>> of in_s0ix check.
> OK.
>>
>> Thanks,
>> Lijo
>>
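
The ordering problem discussed in this thread (a delayed "enable GFXOFF"
request firing after suspend entry unless the work is flushed first) can be
sketched with a small standalone model. This is a toy illustration only, not
amdgpu code: every name below (gfx_model, model_allow_gfxoff,
model_run_delayed_work, model_suspend, model_scenario) is hypothetical, and
the thread does not show how patch 1 actually implements the flush. In kernel
code a flush of this kind would typically go through the workqueue API (for
example flush_delayed_work()), but that is an assumption here.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model, NOT amdgpu code: all names here are hypothetical. */
struct gfx_model {
    bool work_pending;  /* delayed "enable GFXOFF" work is queued */
    bool gfxoff;        /* GFXOFF was asserted before suspend */
    bool suspended;     /* the suspend path has already run */
    bool late_request;  /* GFXOFF request arrived after suspend */
};

/* Driver allows GFXOFF: the actual request is deferred to delayed work. */
void model_allow_gfxoff(struct gfx_model *m)
{
    m->work_pending = true;
}

/* Body of the delayed work: issue the GFXOFF request if one is queued. */
void model_run_delayed_work(struct gfx_model *m)
{
    if (!m->work_pending)
        return;
    m->work_pending = false;
    if (m->suspended)
        m->late_request = true;  /* request lands at the wrong time */
    else
        m->gfxoff = true;        /* request lands before suspend */
}

/* Suspend entry, optionally flushing the pending work first. */
void model_suspend(struct gfx_model *m, bool flush_first)
{
    if (flush_first)
        model_run_delayed_work(m);
    m->suspended = true;
}

/* Returns true if the GFXOFF request landed after suspend entry. */
bool model_scenario(bool flush_first)
{
    struct gfx_model m = { 0 };

    model_allow_gfxoff(&m);
    model_suspend(&m, flush_first);
    model_run_delayed_work(&m);  /* the timer expires some time later */

    /* A single queued request can only land on one side of suspend. */
    assert(!(m.gfxoff && m.late_request));
    return m.late_request;
}
```

In this model, model_scenario(true) returns false because the pending work is
run before suspend proceeds, while model_scenario(false) returns true because
the request is only issued once the device is already suspended.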