<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<div class="moz-cite-prefix">On 2022/8/17 19:40, Christian König wrote:<br>
</div>
<blockquote type="cite"
cite="mid:cc30a694-c784-f42c-bab6-b45c70202c56@amd.com">On
17.08.22 at 09:31, 李真能 wrote:
<br>
<blockquote type="cite">
<br>
On 2022/8/15 21:12, Christian König wrote:
<br>
<blockquote type="cite">On 15.08.22 at 09:34, 李真能 wrote:
<br>
<blockquote type="cite">
<br>
On 2022/8/12 18:55, Christian König wrote:
<br>
<blockquote type="cite">On 11.08.22 at 09:25,
Zhenneng Li wrote:
<br>
<blockquote type="cite">Although the radeon driver fences and
waits for the GPU to finish processing the current batch of rings,
<br>
there is still a corner case where the radeon lockup work
queue may not be fully flushed,
<br>
while the radeon_suspend_kms() function has already
called pci_set_power_state() to
<br>
put the device into the D3hot state.
<br>
</blockquote>
<br>
If I'm not completely mistaken the reset worker uses the
suspend/resume functionality as well to get the hardware
into a working state again.
<br>
<br>
So if I'm not completely mistaken, this could lead to
a deadlock; please double check that.
<br>
</blockquote>
<br>
We have tested this many times, and there is no deadlock.
<br>
</blockquote>
<br>
Testing doesn't tell you anything; you need to audit the call
paths.
<br>
<br>
<blockquote type="cite">In which situation would this lead
to a deadlock?
<br>
</blockquote>
<br>
GPU resets.
<br>
</blockquote>
<br>
Although
flush_delayed_work(&rdev->fence_drv[i].lockup_work) waits for
the last queued lockup_work to finish executing, this kernel
function does not take any locks, and lockup_work runs in
another kernel thread, so I think flush_delayed_work cannot
lead to a deadlock.
<br>
<br>
Therefore, if radeon_gpu_reset is called from another thread
while radeon_suspend_kms is blocked in flush_delayed_work, this
cannot lead to a deadlock.
<br>
</blockquote>
<br>
Ok, sounds like you didn't get what I wanted to say.
<br>
<br>
The key problem is that radeon_gpu_reset() calls radeon_suspend()
which in turn calls rdev->asic->suspend().
<br>
<br>
And this function in turn could end up in radeon_suspend_kms()
again, but I'm not 100% sure about that.
<br>
<br>
Just double check the order of the functions called here (e.g.
whether radeon_suspend_kms() calls radeon_suspend() or the other
way around). Apart from that, your patch looks correct to me as
well.
<br>
<br>
</blockquote>
<p style=" margin-top:0px; margin-bottom:0px; margin-left:0px;
margin-right:0px; -qt-block-indent:0; text-indent:0px;">radeon_gpu_reset
will call radeon_suspend, but radeon_suspend will not call
radeon_suspend_kms. For more detail on the call flow, see the
attached file radeon-suspend-reset.png.</p>
<p style=" margin-top:0px; margin-bottom:0px; margin-left:0px;
margin-right:0px; -qt-block-indent:0; text-indent:0px;">Sorry, I
may have misunderstood your meaning.<br>
</p>
<blockquote type="cite"
cite="mid:cc30a694-c784-f42c-bab6-b45c70202c56@amd.com">Regards,
<br>
Christian.
<br>
<br>
<blockquote type="cite">
<br>
<blockquote type="cite">
<br>
Regards,
<br>
Christian.
<br>
<br>
<blockquote type="cite">
<br>
<blockquote type="cite">
<br>
Regards,
<br>
Christian.
<br>
<br>
<blockquote type="cite">Per PCI spec rev 4.0, section 5.3.1.4.1,
D3hot State:
<br>
<blockquote type="cite">Configuration and Message
requests are the only TLPs accepted by a Function in
<br>
the D3hot state. All other received Requests must be
handled as Unsupported Requests,
<br>
and all received Completions may optionally be handled
as Unexpected Completions.
<br>
</blockquote>
This issue shows up in the following log:
<br>
Unable to handle kernel paging request at virtual
address 00008800e0008010
<br>
CPU 0 kworker/0:3(131): Oops 0
<br>
pc = [<ffffffff811bea5c>] ra =
[<ffffffff81240844>] ps = 0000 Tainted: G
W
<br>
pc is at si_gpu_check_soft_reset+0x3c/0x240
<br>
ra is at si_dma_is_lockup+0x34/0xd0
<br>
v0 = 0000000000000000 t0 = fff08800e0008010 t1 =
0000000000010000
<br>
t2 = 0000000000008010 t3 = fff00007e3c00000 t4 =
fff00007e3c00258
<br>
t5 = 000000000000ffff t6 = 0000000000000001 t7 =
fff00007ef078000
<br>
s0 = fff00007e3c016e8 s1 = fff00007e3c00000 s2 =
fff00007e3c00018
<br>
s3 = fff00007e3c00000 s4 = fff00007fff59d80 s5 =
0000000000000000
<br>
s6 = fff00007ef07bd98
<br>
a0 = fff00007e3c00000 a1 = fff00007e3c016e8 a2 =
0000000000000008
<br>
a3 = 0000000000000001 a4 = 8f5c28f5c28f5c29 a5 =
ffffffff810f4338
<br>
t8 = 0000000000000275 t9 = ffffffff809b66f8 t10 =
ff6769c5d964b800
<br>
t11= 000000000000b886 pv = ffffffff811bea20 at =
0000000000000000
<br>
gp = ffffffff81d89690 sp = 00000000aa814126
<br>
Disabling lock debugging due to kernel taint
<br>
Trace:
<br>
[<ffffffff81240844>] si_dma_is_lockup+0x34/0xd0
<br>
[<ffffffff81119610>]
radeon_fence_check_lockup+0xd0/0x290
<br>
[<ffffffff80977010>] process_one_work+0x280/0x550
<br>
[<ffffffff80977350>] worker_thread+0x70/0x7c0
<br>
[<ffffffff80977410>] worker_thread+0x130/0x7c0
<br>
[<ffffffff80982040>] kthread+0x200/0x210
<br>
[<ffffffff809772e0>] worker_thread+0x0/0x7c0
<br>
[<ffffffff80981f8c>] kthread+0x14c/0x210
<br>
[<ffffffff80911658>]
ret_from_kernel_thread+0x18/0x20
<br>
[<ffffffff80981e40>] kthread+0x0/0x210
<br>
Code: ad3e0008 43f0074a ad7e0018 ad9e0020 8c3001e8
40230101
<br>
<88210000> 4821ed21
<br>
So force lockup work queue flush to fix this problem.
<br>
<br>
Signed-off-by: Zhenneng Li <a class="moz-txt-link-rfc2396E" href="mailto:lizhenneng@kylinos.cn"><lizhenneng@kylinos.cn></a>
<br>
---
<br>
drivers/gpu/drm/radeon/radeon_device.c | 3 +++
<br>
1 file changed, 3 insertions(+)
<br>
<br>
diff --git a/drivers/gpu/drm/radeon/radeon_device.c
b/drivers/gpu/drm/radeon/radeon_device.c
<br>
index 15692cb241fc..e608ca26780a 100644
<br>
--- a/drivers/gpu/drm/radeon/radeon_device.c
<br>
+++ b/drivers/gpu/drm/radeon/radeon_device.c
<br>
@@ -1604,6 +1604,9 @@ int radeon_suspend_kms(struct
drm_device *dev, bool suspend,
<br>
if (r) {
<br>
/* delay GPU reset to resume */
<br>
radeon_fence_driver_force_completion(rdev,
i);
<br>
+ } else {
<br>
+ /* finish executing delayed work */
<br>
+
flush_delayed_work(&rdev->fence_drv[i].lockup_work);
<br>
}
<br>
}
<br>
</blockquote>
<br>
</blockquote>
</blockquote>
<br>
</blockquote>
</blockquote>
<br>
</blockquote>
</body>
</html>