[PATCH 1/3] drm/amdkfd: Disallow debugfs to hang hws when GPU is resetting

Oak Zeng Oak.Zeng at amd.com
Wed Jul 14 15:25:41 UTC 2021


If the GPU is in the middle of a reset, writing to it can cause an
unpredictable general protection fault; see the call trace below.
Reject attempts to hang the HWS through the kfd debugfs hang_hws file
while the GPU is resetting.

[12808.234114] general protection fault: 0000 [#1] SMP NOPTI
[12808.234119] CPU: 13 PID: 6334 Comm: tee Tainted: G           OE 5.4.0-77-generic #86-Ubuntu
[12808.234121] Hardware name: ASUS System Product Name/Pro WS WRX80E-SAGE SE WIFI, BIOS 0211 11/27/2020
[12808.234220] RIP: 0010:kq_submit_packet+0xd/0x50 [amdgpu]
[12808.234222] Code: 8b 45 d0 48 c7 00 00 00 00 00 b8 f4 ff ff ff eb df 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f 1f 44 00 00 55 48 8b 17 48 8b 47 48 <48> 8b 52 08 48 89 e5 83 7a 20 08 74 14 8b 77 20 89 30 48 8b 47 10
[12808.234224] RSP: 0018:ffffb0bf4954bdc0 EFLAGS: 00010216
[12808.234226] RAX: ffffb0bf4a1a5a00 RBX: ffff99302895c0c8 RCX: 0000000000000000
[12808.234227] RDX: c3156d43d3a04949 RSI: 0000000000000055 RDI: ffff99302584c300
[12808.234228] RBP: ffffb0bf4954bdf8 R08: 0000000000000543 R09: ffffb0bf4a1a4230
[12808.234229] R10: 000000000000000a R11: f000000000000000 R12: 0000000000000000
[12808.234230] R13: ffff99302895c0d8 R14: 00007ffebb3d18f0 R15: 0000000000000005
[12808.234232] FS:  00007f0d822ef580(0000) GS:ffff99307d340000(0000) knlGS:0000000000000000
[12808.234233] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[12808.234234] CR2: 00007ffebb3d1908 CR3: 0000001efe1ec000 CR4: 0000000000340ee0
[12808.234235] Call Trace:
[12808.234324]  ? pm_debugfs_hang_hws+0x71/0xd0 [amdgpu]
[12808.234408]  kfd_debugfs_hang_hws+0x2e/0x50 [amdgpu]
[12808.234494]  kfd_debugfs_hang_hws_write+0xb6/0xc0 [amdgpu]
[12808.234499]  full_proxy_write+0x5c/0x90
[12808.234502]  __vfs_write+0x1b/0x40
[12808.234504]  vfs_write+0xb9/0x1a0
[12808.234506]  ksys_write+0x67/0xe0
[12808.234508]  __x64_sys_write+0x1a/0x20
[12808.234511]  do_syscall_64+0x57/0x190
[12808.234514]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Signed-off-by: Oak Zeng <Oak.Zeng at amd.com>
---
 drivers/gpu/drm/amd/amdkfd/kfd_device.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 9e4a05e..fc77d03 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -1390,6 +1390,11 @@ int kfd_debugfs_hang_hws(struct kfd_dev *dev)
 		return -EINVAL;
 	}
 
+	if (dev->dqm->is_resetting) {
+		pr_err("HWS is already resetting, please wait for the current reset to finish\n");
+		return -EBUSY;
+	}
+
 	r = pm_debugfs_hang_hws(&dev->dqm->packets);
 	if (!r)
 		r = dqm_debugfs_execute_queues(dev->dqm);
-- 
2.7.4
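
For readers who want to see the guard in isolation, here is a minimal
userspace sketch of the same early-return pattern. It is not the driver
code: struct mock_dqm, struct mock_dev, mock_submit_hang_packet() and
mock_debugfs_hang_hws() are hypothetical stand-ins, and only the
is_resetting flag and the -EBUSY return mirror the patch above. The
sketch also ignores locking, which a real driver would have to
consider.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the device queue manager. Only the
 * is_resetting flag corresponds to a field used in the patch. */
struct mock_dqm {
	bool is_resetting;   /* true while a reset is in progress */
	int packets_sent;    /* counts simulated hardware writes */
};

/* Hypothetical stand-in for struct kfd_dev. */
struct mock_dev {
	struct mock_dqm *dqm;
};

/* Simulates the write to the hardware queue. In the real driver this
 * is the step that faulted when issued during a reset. */
static int mock_submit_hang_packet(struct mock_dqm *dqm)
{
	dqm->packets_sent++;
	return 0;
}

/* Same shape as the patched function: refuse with -EBUSY before
 * touching the (simulated) hardware if a reset is in flight. */
static int mock_debugfs_hang_hws(struct mock_dev *dev)
{
	assert(dev && dev->dqm);

	if (dev->dqm->is_resetting) {
		fprintf(stderr, "reset in progress, try again later\n");
		return -EBUSY;
	}

	return mock_submit_hang_packet(dev->dqm);
}
```

With is_resetting set, mock_debugfs_hang_hws() returns -EBUSY and
packets_sent stays unchanged; once the flag is cleared, the call goes
through and the counter increments.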


