[PATCH] drm/radeon: avoid page fault during gpu reset

Sat Jan 25 19:01:36 UTC 2020

On 25.01.2020 19:47, Andreas Messer <andi at bastelmap.de> wrote:
When backing up a ring, validate pointer to avoid page fault.

When the driver attempts to handle a GPU lockup, a page fault might occur
during the call to radeon_ring_backup(), since (*ring->next_rptr_cpu_addr)
could have invalid content:

  [ 3790.348267] radeon 0000:01:00.0: ring 0 stalled for more than 10150msec
  [ 3790.348276] radeon 0000:01:00.0: GPU lockup (current fence id 0x00000000000699e4 last fence id 0x00000000000699f9 on ring 0)
  [ 3791.504484] BUG: unable to handle page fault for address: ffffba5602800ffc
  [ 3791.504485] #PF: supervisor read access in kernel mode
  [ 3791.504486] #PF: error_code(0x0000) - not-present page
  [ 3791.504487] PGD 851d3b067 P4D 851d3b067 PUD 0
  [ 3791.504488] Oops: 0000 [#1] SMP PTI
  [ 3791.504490] CPU: 5 PID: 268 Comm: kworker/5:1H Tainted: G            E     5.4.8-amesser #3
  [ 3791.504491] Hardware name: Gigabyte Technology Co., Ltd. X170-WS ECC/X170-WS ECC-CF, BIOS F2 06/20/2016
  [ 3791.504507] Workqueue: radeon-crtc radeon_flip_work_func [radeon]
  [ 3791.504520] RIP: 0010:radeon_ring_backup+0xb9/0x130 [radeon]

It seems that my HD7750 enters such a state during thermal shutdown. Here
is the kernel log with an added debug print and the fix applied:

  [ 2930.783094] radeon 0000:01:00.0: ring 3 stalled for more than 10280msec
  [ 2930.783104] radeon 0000:01:00.0: GPU lockup (current fence id 0x000000000011194b last fence id 0x000000000011196a on ring 3)
  [ 2931.936653] radeon 0000:01:00.0: Bad ptr 0xffffffff [   -1] for backup
  [ 2931.937704] radeon 0000:01:00.0: GPU softreset: 0x00000BFD
  [ 2931.937705] radeon 0000:01:00.0:   GRBM_STATUS               = 0xFFFFFFFF
  [ 2931.937707] radeon 0000:01:00.0:   GRBM_STATUS_SE0           = 0xFFFFFFFF

NAK, this has been suggested multiple times now and is essentially the wrong approach.

The problem is that the value is invalid because the hardware is not functional any more. Returning here without backing up the ring just papers over the real problem.

This is just the first occurrence of this, and you would need to fix a couple of hundred register accesses (both inside and outside of the driver) to make that really work reliably.

The only advice I can give you is to replace the hardware. From experience those symptoms mean that your GPU will die rather soon.

Regards,
Christian.



Signed-off-by: Andreas Messer <andi at bastelmap.de>
---

diff --git a/drivers/gpu/drm/radeon/radeon_ring.c b/drivers/gpu/drm/radeon/radeon_ring.c
index 37093cea24c5..bf55a682442a 100644
--- a/drivers/gpu/drm/radeon/radeon_ring.c
+++ b/drivers/gpu/drm/radeon/radeon_ring.c
@@ -309,6 +309,12 @@ unsigned radeon_ring_backup(struct radeon_device *rdev, struct radeon_ring *ring
                 return 0;
         }

+       /* ptr could be invalid after thermal shutdown */
+       if (ptr >= (ring->ring_size / 4)) {
+               mutex_unlock(&rdev->ring_lock);
+               return 0;
+       }
+
         size = ring->wptr + (ring->ring_size / 4);
         size -= ptr;
         size &= ring->ptr_mask;
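
For readers following the arithmetic in the hunk above, here is a minimal standalone sketch of the bounds check and the size computation. It is not driver code: `struct toy_ring`, `toy_ptr_is_valid()` and `toy_backup_size()` are hypothetical names, and the struct only mirrors the three fields the hunk touches, assuming a power-of-two ring so that `ptr_mask` equals `(ring_size / 4) - 1`.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the few ring fields used by the hunk above. */
struct toy_ring {
    uint32_t wptr;      /* write pointer, in dwords */
    uint32_t ring_size; /* ring size, in bytes */
    uint32_t ptr_mask;  /* assumed: (ring_size / 4) - 1, power-of-two ring */
};

/* True when ptr is a usable dword index into the ring buffer. */
static bool toy_ptr_is_valid(const struct toy_ring *ring, uint32_t ptr)
{
    return ptr < ring->ring_size / 4;
}

/* Dwords between ptr and wptr, or 0 when ptr is out of range. */
static uint32_t toy_backup_size(const struct toy_ring *ring, uint32_t ptr)
{
    uint32_t size;

    if (!toy_ptr_is_valid(ring, ptr))
        return 0; /* same outcome as the early return added by the patch */

    size = ring->wptr + ring->ring_size / 4;
    size -= ptr;
    size &= ring->ptr_mask;
    return size;
}
```

With a 4096-byte ring (1024 dwords) and wptr = 16, a read pointer of 8 gives 8 dwords to back up, while 0xffffffff is rejected. Without the check, the masked arithmetic would still produce a small, plausible-looking size for 0xffffffff (17 in this example), so the masked size alone does not reveal the bad pointer; the out-of-range value only becomes a problem once it is used as an index.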
