[PATCH] drm/radeon: add a force flush to delay work when radeon
李真能
lizhenneng at kylinos.cn
Fri Aug 19 09:34:21 UTC 2022
在 2022/8/17 19:40, Christian König 写道:
> Am 17.08.22 um 09:31 schrieb 李真能:
>>
>> 在 2022/8/15 21:12, Christian König 写道:
>>> Am 15.08.22 um 09:34 schrieb 李真能:
>>>>
>>>> 在 2022/8/12 18:55, Christian König 写道:
>>>>> Am 11.08.22 um 09:25 schrieb Zhenneng Li:
>>>>>> Although radeon card fence and wait for gpu to finish processing
>>>>>> current batch rings,
>>>>>> there is still a corner case that radeon lockup work queue may
>>>>>> not be fully flushed,
>>>>>> and meanwhile the radeon_suspend_kms() function has called
>>>>>> pci_set_power_state() to
>>>>>> put device in D3hot state.
>>>>>
>>>>> If I'm not completely mistaken the reset worker uses the
>>>>> suspend/resume functionality as well to get the hardware into a
>>>>> working state again.
>>>>>
>>>>> So if I'm not completely mistaken this here would lead to a
>>>>> deadlock, please double check that.
>>>>
>>>> We have tested many times, there are no deadlock.
>>>
>>> Testing doesn't tells you anything, you need to audit the call paths.
>>>
>>>> In which situation, there would lead to a deadlock?
>>>
>>> GPU resets.
>>
>> Although flush_delayed_work(&rdev->fence_drv[i].lockup_work) will
>> wait for a lockup_work to finish executing the last queueing, but
>> this kernel func haven't get any lock, and lockup_work will run in
>> another kernel thread, so I think flush_delayed_work could not lead
>> to a deadlock.
>>
>> Therefor if radeon_gpu_reset is called in another thread when
>> radeon_suspend_kms is blocked on flush_delayed_work, there could not
>> lead to a deadlock.
>
> Ok sounds like you didn't go what I wanted to say.
>
> The key problem is that radeon_gpu_reset() calls radeon_suspend()
> which in turn calls rdev->asic->suspend().
>
> And this function in turn could end up in radeon_suspend_kms() again,
> but I'm not 100% sure about that.
>
> Just double check the order of function called here (e.g. if
> radeon_suspend_kms() call radeon_suspend() or the other way around).
> Apart from that your patch looks correct to me as well.
>
radeon_gpu_reset will call radeon_suspend, but radeon_suspend will not
call radeon_suspend_kms, the more detail of call flow, we can see the
attachement file: radeon-suspend-reset.png.
Sorry I may have mistaken your meaning.
> Regards,
> Christian.
>
>>
>>>
>>> Regards,
>>> Christian.
>>>
>>>>
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>> Per PCI spec rev 4.0 on 5.3.1.4.1 D3hot State.
>>>>>>> Configuration and Message requests are the only TLPs accepted by
>>>>>>> a Function in
>>>>>>> the D3hot state. All other received Requests must be handled as
>>>>>>> Unsupported Requests,
>>>>>>> and all received Completions may optionally be handled as
>>>>>>> Unexpected Completions.
>>>>>> This issue will happen in following logs:
>>>>>> Unable to handle kernel paging request at virtual address
>>>>>> 00008800e0008010
>>>>>> CPU 0 kworker/0:3(131): Oops 0
>>>>>> pc = [<ffffffff811bea5c>] ra = [<ffffffff81240844>] ps = 0000
>>>>>> Tainted: G W
>>>>>> pc is at si_gpu_check_soft_reset+0x3c/0x240
>>>>>> ra is at si_dma_is_lockup+0x34/0xd0
>>>>>> v0 = 0000000000000000 t0 = fff08800e0008010 t1 = 0000000000010000
>>>>>> t2 = 0000000000008010 t3 = fff00007e3c00000 t4 = fff00007e3c00258
>>>>>> t5 = 000000000000ffff t6 = 0000000000000001 t7 = fff00007ef078000
>>>>>> s0 = fff00007e3c016e8 s1 = fff00007e3c00000 s2 = fff00007e3c00018
>>>>>> s3 = fff00007e3c00000 s4 = fff00007fff59d80 s5 = 0000000000000000
>>>>>> s6 = fff00007ef07bd98
>>>>>> a0 = fff00007e3c00000 a1 = fff00007e3c016e8 a2 = 0000000000000008
>>>>>> a3 = 0000000000000001 a4 = 8f5c28f5c28f5c29 a5 = ffffffff810f4338
>>>>>> t8 = 0000000000000275 t9 = ffffffff809b66f8 t10 = ff6769c5d964b800
>>>>>> t11= 000000000000b886 pv = ffffffff811bea20 at = 0000000000000000
>>>>>> gp = ffffffff81d89690 sp = 00000000aa814126
>>>>>> Disabling lock debugging due to kernel taint
>>>>>> Trace:
>>>>>> [<ffffffff81240844>] si_dma_is_lockup+0x34/0xd0
>>>>>> [<ffffffff81119610>] radeon_fence_check_lockup+0xd0/0x290
>>>>>> [<ffffffff80977010>] process_one_work+0x280/0x550
>>>>>> [<ffffffff80977350>] worker_thread+0x70/0x7c0
>>>>>> [<ffffffff80977410>] worker_thread+0x130/0x7c0
>>>>>> [<ffffffff80982040>] kthread+0x200/0x210
>>>>>> [<ffffffff809772e0>] worker_thread+0x0/0x7c0
>>>>>> [<ffffffff80981f8c>] kthread+0x14c/0x210
>>>>>> [<ffffffff80911658>] ret_from_kernel_thread+0x18/0x20
>>>>>> [<ffffffff80981e40>] kthread+0x0/0x210
>>>>>> Code: ad3e0008 43f0074a ad7e0018 ad9e0020 8c3001e8 40230101
>>>>>> <88210000> 4821ed21
>>>>>> So force lockup work queue flush to fix this problem.
>>>>>>
>>>>>> Signed-off-by: Zhenneng Li <lizhenneng at kylinos.cn>
>>>>>> ---
>>>>>> drivers/gpu/drm/radeon/radeon_device.c | 3 +++
>>>>>> 1 file changed, 3 insertions(+)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/radeon/radeon_device.c
>>>>>> b/drivers/gpu/drm/radeon/radeon_device.c
>>>>>> index 15692cb241fc..e608ca26780a 100644
>>>>>> --- a/drivers/gpu/drm/radeon/radeon_device.c
>>>>>> +++ b/drivers/gpu/drm/radeon/radeon_device.c
>>>>>> @@ -1604,6 +1604,9 @@ int radeon_suspend_kms(struct drm_device
>>>>>> *dev, bool suspend,
>>>>>> if (r) {
>>>>>> /* delay GPU reset to resume */
>>>>>> radeon_fence_driver_force_completion(rdev, i);
>>>>>> + } else {
>>>>>> + /* finish executing delayed work */
>>>>>> + flush_delayed_work(&rdev->fence_drv[i].lockup_work);
>>>>>> }
>>>>>> }
>>>>>
>>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20220819/fa7df537/attachment-0001.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: radeon-suspend-reset.png
Type: image/png
Size: 111773 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20220819/fa7df537/attachment-0001.png>
More information about the amd-gfx
mailing list