[PATCH] drm/radeon: avoid page fault during gpu reset
Andreas Messer
andi at bastelmap.de
Sat Jan 25 18:47:47 UTC 2020
When backing up a ring, validate pointer to avoid page fault.
When the drivers attempts to handle a gpu lockup, a page fault might occur
during call of radeon_ring_backup() since (*ring->next_rptr_cpu_addr) could
have invalid content:
[ 3790.348267] radeon 0000:01:00.0: ring 0 stalled for more than 10150msec
[ 3790.348276] radeon 0000:01:00.0: GPU lockup (current fence id 0x00000000000699e4 last fence id 0x00000000000699f9 on ring 0)
[ 3791.504484] BUG: unable to handle page fault for address: ffffba5602800ffc
[ 3791.504485] #PF: supervisor read access in kernel mode
[ 3791.504486] #PF: error_code(0x0000) - not-present page
[ 3791.504487] PGD 851d3b067 P4D 851d3b067 PUD 0
[ 3791.504488] Oops: 0000 [#1] SMP PTI
[ 3791.504490] CPU: 5 PID: 268 Comm: kworker/5:1H Tainted: G E 5.4.8-amesser #3
[ 3791.504491] Hardware name: Gigabyte Technology Co., Ltd. X170-WS ECC/X170-WS ECC-CF, BIOS F2 06/20/2016
[ 3791.504507] Workqueue: radeon-crtc radeon_flip_work_func [radeon]
[ 3791.504520] RIP: 0010:radeon_ring_backup+0xb9/0x130 [radeon]
It seems that my HD7750 enters such a state during thermal shutdown. Here
the kernel message with added debug print and fix:
[ 2930.783094] radeon 0000:01:00.0: ring 3 stalled for more than 10280msec
[ 2930.783104] radeon 0000:01:00.0: GPU lockup (current fence id 0x000000000011194b last fence id 0x000000000011196a on ring 3)
[ 2931.936653] radeon 0000:01:00.0: Bad ptr 0xffffffff [ -1] for backup
[ 2931.937704] radeon 0000:01:00.0: GPU softreset: 0x00000BFD
[ 2931.937705] radeon 0000:01:00.0: GRBM_STATUS = 0xFFFFFFFF
[ 2931.937707] radeon 0000:01:00.0: GRBM_STATUS_SE0 = 0xFFFFFFFF
Signed-off-by: Andreas Messer <andi at bastelmap.de>
---
diff --git a/drivers/gpu/drm/radeon/radeon_ring.c b/drivers/gpu/drm/radeon/radeon_ring.c
index 37093cea24c5..bf55a682442a 100644
--- a/drivers/gpu/drm/radeon/radeon_ring.c
+++ b/drivers/gpu/drm/radeon/radeon_ring.c
@@ -309,6 +309,12 @@ unsigned radeon_ring_backup(struct radeon_device *rdev, struct radeon_ring *ring
return 0;
}
+ /* ptr could be invalid after thermal shutdown */
+ if (ptr >= (ring->ring_size / 4)) {
+ mutex_unlock(&rdev->ring_lock);
+ return 0;
+ }
+
size = ring->wptr + (ring->ring_size / 4);
size -= ptr;
size &= ring->ptr_mask;
More information about the amd-gfx
mailing list