[PATCH] drm/radeon: add a force flush to delay work when radeon
Christian König
christian.koenig at amd.com
Mon Aug 15 13:12:18 UTC 2022
On 15.08.22 at 09:34, 李真能 wrote:
>
> On 2022/8/12 18:55, Christian König wrote:
>> On 11.08.22 at 09:25, Zhenneng Li wrote:
>>> Although the radeon driver waits on its fences for the GPU to finish
>>> processing the current batch on each ring, there is still a corner
>>> case in which the radeon lockup work queue has not been fully flushed
>>> by the time radeon_suspend_kms() calls pci_set_power_state() to put
>>> the device into the D3hot state.
>>
>> If I'm not completely mistaken, the reset worker uses the
>> suspend/resume functionality as well to get the hardware into a
>> working state again.
>>
>> If so, this change would lead to a deadlock; please double check that.
>
> We have tested this many times and have not seen a deadlock.
Testing doesn't tell you anything; you need to audit the call paths.
> In which situation would this lead to a deadlock?
GPU resets.
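A minimal user-space sketch of the pattern in question, assuming (unverified here) that the lockup worker can reach the suspend path during a GPU reset. Every selfflush_* name is hypothetical and nothing below is radeon code; the model returns -EDEADLK at the point where a real synchronous flush would wait on itself forever:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a work item; not the kernel's work_struct. */
struct selfflush_work {
    void (*fn)(struct selfflush_work *w);
    bool running;      /* true while fn is executing */
    int flush_result;  /* result of a flush attempted from inside fn */
};

/*
 * Model of a synchronous flush: wait until the work item has finished.
 * If the caller is that work item, the wait can never complete, so the
 * model reports -EDEADLK instead of blocking.
 */
static int selfflush_flush(struct selfflush_work *w)
{
    return w->running ? -EDEADLK : 0;
}

/* Hypothetical suspend path that flushes the lockup work, as the patch does. */
static int selfflush_suspend(struct selfflush_work *lockup_work)
{
    return selfflush_flush(lockup_work);
}

/* Hypothetical worker that recovers the hardware through the suspend path. */
static void selfflush_reset_worker(struct selfflush_work *w)
{
    w->flush_result = selfflush_suspend(w);
}

/* Run a work item the way a worker thread would. */
static void selfflush_run(struct selfflush_work *w)
{
    w->running = true;
    w->fn(w);
    w->running = false;
}

static void selfflush_demo(void)
{
    struct selfflush_work w = { .fn = selfflush_reset_worker };

    /* Suspend from an ordinary context: nothing is running, flush is fine. */
    assert(selfflush_suspend(&w) == 0);

    /* Suspend reached from inside the worker: the flush waits on itself. */
    selfflush_run(&w);
    assert(w.flush_result == -EDEADLK);
}
```

Calling selfflush_demo() from a main() exercises both cases. Whether the second case can occur in the driver depends on the real call paths, which is exactly what the audit has to establish.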
Regards,
Christian.
>
>>
>> Regards,
>> Christian.
>>
>>> Per PCI spec rev 4.0, section 5.3.1.4.1 (D3hot State):
>>>> Configuration and Message requests are the only TLPs accepted by a
>>>> Function in
>>>> the D3hot state. All other received Requests must be handled as
>>>> Unsupported Requests,
>>>> and all received Completions may optionally be handled as
>>>> Unexpected Completions.
>>> The issue shows up in the following log:
>>> Unable to handle kernel paging request at virtual address
>>> 00008800e0008010
>>> CPU 0 kworker/0:3(131): Oops 0
>>> pc = [<ffffffff811bea5c>] ra = [<ffffffff81240844>] ps = 0000
>>> Tainted: G W
>>> pc is at si_gpu_check_soft_reset+0x3c/0x240
>>> ra is at si_dma_is_lockup+0x34/0xd0
>>> v0 = 0000000000000000 t0 = fff08800e0008010 t1 = 0000000000010000
>>> t2 = 0000000000008010 t3 = fff00007e3c00000 t4 = fff00007e3c00258
>>> t5 = 000000000000ffff t6 = 0000000000000001 t7 = fff00007ef078000
>>> s0 = fff00007e3c016e8 s1 = fff00007e3c00000 s2 = fff00007e3c00018
>>> s3 = fff00007e3c00000 s4 = fff00007fff59d80 s5 = 0000000000000000
>>> s6 = fff00007ef07bd98
>>> a0 = fff00007e3c00000 a1 = fff00007e3c016e8 a2 = 0000000000000008
>>> a3 = 0000000000000001 a4 = 8f5c28f5c28f5c29 a5 = ffffffff810f4338
>>> t8 = 0000000000000275 t9 = ffffffff809b66f8 t10 = ff6769c5d964b800
>>> t11= 000000000000b886 pv = ffffffff811bea20 at = 0000000000000000
>>> gp = ffffffff81d89690 sp = 00000000aa814126
>>> Disabling lock debugging due to kernel taint
>>> Trace:
>>> [<ffffffff81240844>] si_dma_is_lockup+0x34/0xd0
>>> [<ffffffff81119610>] radeon_fence_check_lockup+0xd0/0x290
>>> [<ffffffff80977010>] process_one_work+0x280/0x550
>>> [<ffffffff80977350>] worker_thread+0x70/0x7c0
>>> [<ffffffff80977410>] worker_thread+0x130/0x7c0
>>> [<ffffffff80982040>] kthread+0x200/0x210
>>> [<ffffffff809772e0>] worker_thread+0x0/0x7c0
>>> [<ffffffff80981f8c>] kthread+0x14c/0x210
>>> [<ffffffff80911658>] ret_from_kernel_thread+0x18/0x20
>>> [<ffffffff80981e40>] kthread+0x0/0x210
>>> Code: ad3e0008 43f0074a ad7e0018 ad9e0020 8c3001e8 40230101
>>> <88210000> 4821ed21
>>> So force a flush of the lockup work queue to fix this problem.
>>>
>>> Signed-off-by: Zhenneng Li <lizhenneng at kylinos.cn>
>>> ---
>>> drivers/gpu/drm/radeon/radeon_device.c | 3 +++
>>> 1 file changed, 3 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/radeon/radeon_device.c b/drivers/gpu/drm/radeon/radeon_device.c
>>> index 15692cb241fc..e608ca26780a 100644
>>> --- a/drivers/gpu/drm/radeon/radeon_device.c
>>> +++ b/drivers/gpu/drm/radeon/radeon_device.c
>>> @@ -1604,6 +1604,9 @@ int radeon_suspend_kms(struct drm_device *dev, bool suspend,
>>>                  if (r) {
>>>                          /* delay GPU reset to resume */
>>>                          radeon_fence_driver_force_completion(rdev, i);
>>> +                } else {
>>> +                        /* finish executing delayed work */
>>> +                        flush_delayed_work(&rdev->fence_drv[i].lockup_work);
>>>                  }
>>>          }
>>
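The race described in the quoted commit message can be modeled in user space roughly as follows. This is a sketch under stated assumptions, not driver code: all d3_* names are hypothetical, D0 is simplified to accept every request type, and the lockup check is assumed to issue a memory (MMIO) request. It only shows why ordering the flush before the power-state change avoids a rejected request; it says nothing about the deadlock question raised above.

```c
#include <assert.h>
#include <stdbool.h>

enum d3_power { D3_POWER_D0, D3_POWER_D3HOT };
enum d3_tlp { D3_TLP_CONFIG, D3_TLP_MESSAGE, D3_TLP_MEMORY };

/* Hypothetical device model. */
struct d3_dev {
    enum d3_power power;
    bool lockup_work_pending;  /* delayed lockup check still queued */
    int unsupported_requests;  /* requests the function had to reject */
};

/* Quoted rule: in D3hot only Configuration and Message requests are accepted. */
static bool d3_accepts(const struct d3_dev *d, enum d3_tlp t)
{
    if (d->power == D3_POWER_D0)
        return true; /* simplification: D0 accepts every request type */
    return t == D3_TLP_CONFIG || t == D3_TLP_MESSAGE;
}

/* The lockup check reads a device register, modeled as a memory request. */
static void d3_lockup_check(struct d3_dev *d)
{
    if (!d3_accepts(d, D3_TLP_MEMORY))
        d->unsupported_requests++;
    d->lockup_work_pending = false;
}

/* Suspend; with flush_first the pending work runs before the power change. */
static void d3_suspend(struct d3_dev *d, bool flush_first)
{
    if (flush_first && d->lockup_work_pending)
        d3_lockup_check(d);
    d->power = D3_POWER_D3HOT;
}

/* The delayed-work timer expiring some time after suspend. */
static void d3_timer_fires(struct d3_dev *d)
{
    if (d->lockup_work_pending)
        d3_lockup_check(d);
}

static void d3_demo(void)
{
    struct d3_dev racy = { .power = D3_POWER_D0, .lockup_work_pending = true };
    struct d3_dev flushed = racy;

    /* Without the flush, the check runs after the device entered D3hot. */
    d3_suspend(&racy, false);
    d3_timer_fires(&racy);
    assert(racy.unsupported_requests == 1);

    /* With the flush, the check has already run while still in D0. */
    d3_suspend(&flushed, true);
    d3_timer_fires(&flushed);
    assert(flushed.unsupported_requests == 0);
}
```

In the model the rejected request is only counted; on real hardware the consequence depends on the platform, and the quoted log shows a kernel paging fault.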
More information about the amd-gfx
mailing list