[PATCH 1/3] drm/amdkfd: Disallow debugfs to hang hws when GPU is resetting
Oak Zeng
Oak.Zeng at amd.com
Wed Jul 14 15:25:41 UTC 2021
If the GPU is in the middle of a reset, writing to it can cause an
unpredictable general protection fault (see the call trace below). Disallow
using the kfd debugfs hang_hws entry to hang the HWS while the GPU is resetting.
[12808.234114] general protection fault: 0000 [#1] SMP NOPTI
[12808.234119] CPU: 13 PID: 6334 Comm: tee Tainted: G OE 5.4.0-77-generic #86-Ubuntu
[12808.234121] Hardware name: ASUS System Product Name/Pro WS WRX80E-SAGE SE WIFI, BIOS 0211 11/27/2020
[12808.234220] RIP: 0010:kq_submit_packet+0xd/0x50 [amdgpu]
[12808.234222] Code: 8b 45 d0 48 c7 00 00 00 00 00 b8 f4 ff ff ff eb df 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f 1f 44 00 00 55 48 8b 17 48 8b 47 48 <48> 8b 52 08 48 89 e5 83 7a 20 08 74 14 8b 77 20 89 30 48 8b 47 10
[12808.234224] RSP: 0018:ffffb0bf4954bdc0 EFLAGS: 00010216
[12808.234226] RAX: ffffb0bf4a1a5a00 RBX: ffff99302895c0c8 RCX: 0000000000000000
[12808.234227] RDX: c3156d43d3a04949 RSI: 0000000000000055 RDI: ffff99302584c300
[12808.234228] RBP: ffffb0bf4954bdf8 R08: 0000000000000543 R09: ffffb0bf4a1a4230
[12808.234229] R10: 000000000000000a R11: f000000000000000 R12: 0000000000000000
[12808.234230] R13: ffff99302895c0d8 R14: 00007ffebb3d18f0 R15: 0000000000000005
[12808.234232] FS: 00007f0d822ef580(0000) GS:ffff99307d340000(0000) knlGS:0000000000000000
[12808.234233] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[12808.234234] CR2: 00007ffebb3d1908 CR3: 0000001efe1ec000 CR4: 0000000000340ee0
[12808.234235] Call Trace:
[12808.234324] ? pm_debugfs_hang_hws+0x71/0xd0 [amdgpu]
[12808.234408] kfd_debugfs_hang_hws+0x2e/0x50 [amdgpu]
[12808.234494] kfd_debugfs_hang_hws_write+0xb6/0xc0 [amdgpu]
[12808.234499] full_proxy_write+0x5c/0x90
[12808.234502] __vfs_write+0x1b/0x40
[12808.234504] vfs_write+0xb9/0x1a0
[12808.234506] ksys_write+0x67/0xe0
[12808.234508] __x64_sys_write+0x1a/0x20
[12808.234511] do_syscall_64+0x57/0x190
[12808.234514] entry_SYSCALL_64_after_hwframe+0x44/0xa9
Signed-off-by: Oak Zeng <Oak.Zeng at amd.com>
---
drivers/gpu/drm/amd/amdkfd/kfd_device.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 9e4a05e..fc77d03 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -1390,6 +1390,11 @@ int kfd_debugfs_hang_hws(struct kfd_dev *dev)
 		return -EINVAL;
 	}
 
+	if (dev->dqm->is_resetting) {
+		pr_err("HWS is already resetting, please wait for the current reset to finish\n");
+		return -EBUSY;
+	}
+
 	r = pm_debugfs_hang_hws(&dev->dqm->packets);
 	if (!r)
 		r = dqm_debugfs_execute_queues(dev->dqm);
--
2.7.4
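
For readers who want to see the guard in isolation, the following is a minimal user-space sketch of the early-return pattern this patch adds. It is not driver code: the mock_* types and functions are hypothetical stand-ins invented for illustration, and only the is_resetting flag and the -EBUSY return value are taken from the patch. The sketch also does not model concurrency or locking.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the device queue manager; only the
 * is_resetting flag mirrors a field named in the patch. */
struct mock_dqm {
	bool is_resetting;
	int packets_sent;	/* counts simulated writes to the GPU */
};

/* Hypothetical stand-in for struct kfd_dev. */
struct mock_dev {
	struct mock_dqm *dqm;
};

/* Simulates the step that would write a packet to the GPU. */
static int mock_submit_hang_packet(struct mock_dqm *dqm)
{
	dqm->packets_sent++;
	return 0;
}

/* Refuse to touch the (simulated) GPU while a reset is in progress. */
static int mock_debugfs_hang_hws(struct mock_dev *dev)
{
	assert(dev && dev->dqm);

	if (dev->dqm->is_resetting) {
		fprintf(stderr, "HWS is already resetting, please wait\n");
		return -EBUSY;
	}

	return mock_submit_hang_packet(dev->dqm);
}
```

With is_resetting set, the mock returns -EBUSY and no packet is submitted; once the flag is cleared, the call goes through.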