[PATCH] drm/radeon: add a force flush to delay work when radeon

Fri Aug 19 10:07:18 UTC 2022

Am 19.08.22 um 11:34 schrieb 李真能:
>
> 在 2022/8/17 19:40, Christian König 写道:
>
>> Am 17.08.22 um 09:31 schrieb 李真能:
>>>
>>> 在 2022/8/15 21:12, Christian König 写道:
>>>> Am 15.08.22 um 09:34 schrieb 李真能:
>>>>>
>>>>> 在 2022/8/12 18:55, Christian König 写道:
>>>>>> Am 11.08.22 um 09:25 schrieb Zhenneng Li:
>>>>>>> Although radeon card fence and wait for gpu to finish processing 
>>>>>>> current batch rings,
>>>>>>> there is still a corner case that radeon lockup work queue may 
>>>>>>> not be fully flushed,
>>>>>>> and meanwhile the radeon_suspend_kms() function has called 
>>>>>>> pci_set_power_state() to
>>>>>>> put device in D3hot state.
>>>>>>
>>>>>> If I'm not completely mistaken the reset worker uses the 
>>>>>> suspend/resume functionality as well to get the hardware into a 
>>>>>> working state again.
>>>>>>
>>>>>> So if I'm not completely mistaken this here would lead to a 
>>>>>> deadlock, please double check that.
>>>>>
>>>>> We have tested many times, there are no deadlock.
>>>>
>>>> Testing doesn't tells you anything, you need to audit the call paths.
>>>>
>>>>> In which situation, there would lead to a deadlock?
>>>>
>>>> GPU resets.
>>>
>>> Although flush_delayed_work(&rdev->fence_drv[i].lockup_work) will 
>>> wait for a lockup_work to finish executing the last queueing,  but 
>>> this kernel func haven't get any lock, and lockup_work will run in 
>>> another kernel thread, so I think flush_delayed_work could not lead 
>>> to a deadlock.
>>>
>>> Therefor if radeon_gpu_reset is called in another thread when 
>>> radeon_suspend_kms is blocked on flush_delayed_work, there could not 
>>> lead to a deadlock.
>>
>> Ok sounds like you didn't go what I wanted to say.
>>
>> The key problem is that radeon_gpu_reset() calls radeon_suspend() 
>> which in turn calls rdev->asic->suspend().
>>
>> And this function in turn could end up in radeon_suspend_kms() again, 
>> but I'm not 100% sure about that.
>>
>> Just double check the order of function called here (e.g. if 
>> radeon_suspend_kms() call radeon_suspend() or the other way around).  
>> Apart from that your patch looks correct to me as well.
>>
> radeon_gpu_reset will call radeon_suspend, but radeon_suspend will not 
> call radeon_suspend_kms, the more detail of call flow, we can see the 
> attachement file: radeon-suspend-reset.png.
>
> Sorry  I may have mistaken your meaning.
>

Ok in this case my memory of the call flow wasn't correct any more.

Feel free to add an Acked-by: Christian König <christian.koenig at amd.com> 
to the patch.

Alex should then pick it up for upstreaming.

Thanks,
Christian.

>
>> Regards,
>> Christian.
>>
>>>
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>>> Per PCI spec rev 4.0 on 5.3.1.4.1 D3hot State.
>>>>>>>> Configuration and Message requests are the only TLPs accepted 
>>>>>>>> by a Function in
>>>>>>>> the D3hot state. All other received Requests must be handled as 
>>>>>>>> Unsupported Requests,
>>>>>>>> and all received Completions may optionally be handled as 
>>>>>>>> Unexpected Completions.
>>>>>>> This issue will happen in following logs:
>>>>>>> Unable to handle kernel paging request at virtual address 
>>>>>>> 00008800e0008010
>>>>>>> CPU 0 kworker/0:3(131): Oops 0
>>>>>>> pc = [<ffffffff811bea5c>]  ra = [<ffffffff81240844>]  ps = 0000 
>>>>>>> Tainted: G        W
>>>>>>> pc is at si_gpu_check_soft_reset+0x3c/0x240
>>>>>>> ra is at si_dma_is_lockup+0x34/0xd0
>>>>>>> v0 = 0000000000000000  t0 = fff08800e0008010  t1 = 0000000000010000
>>>>>>> t2 = 0000000000008010  t3 = fff00007e3c00000  t4 = fff00007e3c00258
>>>>>>> t5 = 000000000000ffff  t6 = 0000000000000001  t7 = fff00007ef078000
>>>>>>> s0 = fff00007e3c016e8  s1 = fff00007e3c00000  s2 = fff00007e3c00018
>>>>>>> s3 = fff00007e3c00000  s4 = fff00007fff59d80  s5 = 0000000000000000
>>>>>>> s6 = fff00007ef07bd98
>>>>>>> a0 = fff00007e3c00000  a1 = fff00007e3c016e8  a2 = 0000000000000008
>>>>>>> a3 = 0000000000000001  a4 = 8f5c28f5c28f5c29  a5 = ffffffff810f4338
>>>>>>> t8 = 0000000000000275  t9 = ffffffff809b66f8  t10 = 
>>>>>>> ff6769c5d964b800
>>>>>>> t11= 000000000000b886  pv = ffffffff811bea20  at = 0000000000000000
>>>>>>> gp = ffffffff81d89690  sp = 00000000aa814126
>>>>>>> Disabling lock debugging due to kernel taint
>>>>>>> Trace:
>>>>>>> [<ffffffff81240844>] si_dma_is_lockup+0x34/0xd0
>>>>>>> [<ffffffff81119610>] radeon_fence_check_lockup+0xd0/0x290
>>>>>>> [<ffffffff80977010>] process_one_work+0x280/0x550
>>>>>>> [<ffffffff80977350>] worker_thread+0x70/0x7c0
>>>>>>> [<ffffffff80977410>] worker_thread+0x130/0x7c0
>>>>>>> [<ffffffff80982040>] kthread+0x200/0x210
>>>>>>> [<ffffffff809772e0>] worker_thread+0x0/0x7c0
>>>>>>> [<ffffffff80981f8c>] kthread+0x14c/0x210
>>>>>>> [<ffffffff80911658>] ret_from_kernel_thread+0x18/0x20
>>>>>>> [<ffffffff80981e40>] kthread+0x0/0x210
>>>>>>>   Code: ad3e0008  43f0074a  ad7e0018  ad9e0020 8c3001e8 40230101
>>>>>>>   <88210000> 4821ed21
>>>>>>> So force lockup work queue flush to fix this problem.
>>>>>>>
>>>>>>> Signed-off-by: Zhenneng Li <lizhenneng at kylinos.cn>
>>>>>>> ---
>>>>>>>   drivers/gpu/drm/radeon/radeon_device.c | 3 +++
>>>>>>>   1 file changed, 3 insertions(+)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/radeon/radeon_device.c 
>>>>>>> b/drivers/gpu/drm/radeon/radeon_device.c
>>>>>>> index 15692cb241fc..e608ca26780a 100644
>>>>>>> --- a/drivers/gpu/drm/radeon/radeon_device.c
>>>>>>> +++ b/drivers/gpu/drm/radeon/radeon_device.c
>>>>>>> @@ -1604,6 +1604,9 @@ int radeon_suspend_kms(struct drm_device 
>>>>>>> *dev, bool suspend,
>>>>>>>           if (r) {
>>>>>>>               /* delay GPU reset to resume */
>>>>>>> radeon_fence_driver_force_completion(rdev, i);
>>>>>>> +        } else {
>>>>>>> +            /* finish executing delayed work */
>>>>>>> + flush_delayed_work(&rdev->fence_drv[i].lockup_work);
>>>>>>>           }
>>>>>>>       }
>>>>>>
>>>>
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/dri-devel/attachments/20220819/57550ba0/attachment-0001.htm>