[PATCH] drm/radeon: add a force flush to delay work when radeon

Mon Aug 15 07:34:18 UTC 2022

在 2022/8/12 18:55, Christian König 写道:
> Am 11.08.22 um 09:25 schrieb Zhenneng Li:
>> Although radeon card fence and wait for gpu to finish processing 
>> current batch rings,
>> there is still a corner case that radeon lockup work queue may not be 
>> fully flushed,
>> and meanwhile the radeon_suspend_kms() function has called 
>> pci_set_power_state() to
>> put device in D3hot state.
>
> If I'm not completely mistaken the reset worker uses the 
> suspend/resume functionality as well to get the hardware into a 
> working state again.
>
> So if I'm not completely mistaken this here would lead to a deadlock, 
> please double check that.

We have tested many times, there are no deadlock.

In which situation, there would lead to a deadlock?

>
> Regards,
> Christian.
>
>> Per PCI spec rev 4.0 on 5.3.1.4.1 D3hot State.
>>> Configuration and Message requests are the only TLPs accepted by a 
>>> Function in
>>> the D3hot state. All other received Requests must be handled as 
>>> Unsupported Requests,
>>> and all received Completions may optionally be handled as Unexpected 
>>> Completions.
>> This issue will happen in following logs:
>> Unable to handle kernel paging request at virtual address 
>> 00008800e0008010
>> CPU 0 kworker/0:3(131): Oops 0
>> pc = [<ffffffff811bea5c>]  ra = [<ffffffff81240844>]  ps = 0000 
>> Tainted: G        W
>> pc is at si_gpu_check_soft_reset+0x3c/0x240
>> ra is at si_dma_is_lockup+0x34/0xd0
>> v0 = 0000000000000000  t0 = fff08800e0008010  t1 = 0000000000010000
>> t2 = 0000000000008010  t3 = fff00007e3c00000  t4 = fff00007e3c00258
>> t5 = 000000000000ffff  t6 = 0000000000000001  t7 = fff00007ef078000
>> s0 = fff00007e3c016e8  s1 = fff00007e3c00000  s2 = fff00007e3c00018
>> s3 = fff00007e3c00000  s4 = fff00007fff59d80  s5 = 0000000000000000
>> s6 = fff00007ef07bd98
>> a0 = fff00007e3c00000  a1 = fff00007e3c016e8  a2 = 0000000000000008
>> a3 = 0000000000000001  a4 = 8f5c28f5c28f5c29  a5 = ffffffff810f4338
>> t8 = 0000000000000275  t9 = ffffffff809b66f8  t10 = ff6769c5d964b800
>> t11= 000000000000b886  pv = ffffffff811bea20  at = 0000000000000000
>> gp = ffffffff81d89690  sp = 00000000aa814126
>> Disabling lock debugging due to kernel taint
>> Trace:
>> [<ffffffff81240844>] si_dma_is_lockup+0x34/0xd0
>> [<ffffffff81119610>] radeon_fence_check_lockup+0xd0/0x290
>> [<ffffffff80977010>] process_one_work+0x280/0x550
>> [<ffffffff80977350>] worker_thread+0x70/0x7c0
>> [<ffffffff80977410>] worker_thread+0x130/0x7c0
>> [<ffffffff80982040>] kthread+0x200/0x210
>> [<ffffffff809772e0>] worker_thread+0x0/0x7c0
>> [<ffffffff80981f8c>] kthread+0x14c/0x210
>> [<ffffffff80911658>] ret_from_kernel_thread+0x18/0x20
>> [<ffffffff80981e40>] kthread+0x0/0x210
>>   Code: ad3e0008  43f0074a  ad7e0018  ad9e0020  8c3001e8 40230101
>>   <88210000> 4821ed21
>> So force lockup work queue flush to fix this problem.
>>
>> Signed-off-by: Zhenneng Li <lizhenneng at kylinos.cn>
>> ---
>>   drivers/gpu/drm/radeon/radeon_device.c | 3 +++
>>   1 file changed, 3 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/radeon/radeon_device.c 
>> b/drivers/gpu/drm/radeon/radeon_device.c
>> index 15692cb241fc..e608ca26780a 100644
>> --- a/drivers/gpu/drm/radeon/radeon_device.c
>> +++ b/drivers/gpu/drm/radeon/radeon_device.c
>> @@ -1604,6 +1604,9 @@ int radeon_suspend_kms(struct drm_device *dev, 
>> bool suspend,
>>           if (r) {
>>               /* delay GPU reset to resume */
>>               radeon_fence_driver_force_completion(rdev, i);
>> +        } else {
>> +            /* finish executing delayed work */
>> + flush_delayed_work(&rdev->fence_drv[i].lockup_work);
>>           }
>>       }
>