[RFC PATCH 0/1] Protecting BO list corruption
Luben Tuikov
luben.tuikov at amd.com
Tue Jul 12 05:39:23 UTC 2022
After the context lock was removed by commit e68efb27647f21 ("drm/amdgpu:
remove ctx->lock"), we see BO list corruption, as documented in the bug
linked below. While reverting the removal of the context lock does fix the
issue, a more comprehensive approach was suggested, which this patch
implements. I'm currently running this kernel and it works fine.
However, when running IGT's amd_cs_nop test, I see a hang in the 4th
sub-test, "sync-gfx0". Previously I've seen it get stuck in the 6th
sub-test, "fork-gfx0".
The hang is generally as follows:
[<0>] ttm_eu_reserve_buffers+0xe7/0x2c0 [ttm]
[<0>] amdgpu_gem_va_ioctl+0x31c/0x540 [amdgpu]
[<0>] drm_ioctl_kernel+0x8c/0x120 [drm]
[<0>] drm_ioctl+0x220/0x3e0 [drm]
[<0>] amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
[<0>] __x64_sys_ioctl+0x82/0xb0
[<0>] do_syscall_64+0x3b/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae
In general, the task blocks along a path like ttm_eu_reserve_buffers() -->
ttm_bo_reserve() --> ... --> dma_resv_lock() --> ww_mutex_lock().
However, I don't observe such hangs during normal use of the system, only
when running the IGT amd_cs_nop test.
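For readers unfamiliar with the general idea, here is a minimal userspace
sketch of protecting a shared list with a mutex. It is only an analogy and
not the patch itself: the names demo_bo_list, demo_bo_list_init(),
demo_bo_list_add() and demo_bo_list_sum() are hypothetical, and a pthread
mutex stands in for the kernel's struct mutex.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define DEMO_MAX_ENTRIES 4

/* Hypothetical stand-in for a buffer-object list; NOT the amdgpu structure. */
struct demo_bo_list {
	pthread_mutex_t lock;            /* serializes all access to entries[] */
	size_t num_entries;
	int entries[DEMO_MAX_ENTRIES];   /* pretend these are BO handles */
};

void demo_bo_list_init(struct demo_bo_list *list)
{
	assert(list != NULL);
	pthread_mutex_init(&list->lock, NULL);
	list->num_entries = 0;
}

/* Append one handle under the lock; returns 0 on success, -1 if full. */
int demo_bo_list_add(struct demo_bo_list *list, int handle)
{
	int ret = -1;

	pthread_mutex_lock(&list->lock);
	if (list->num_entries < DEMO_MAX_ENTRIES) {
		list->entries[list->num_entries++] = handle;
		ret = 0;
	}
	pthread_mutex_unlock(&list->lock);
	return ret;
}

/* Walk the list under the same lock, so no writer can change it mid-walk. */
long demo_bo_list_sum(struct demo_bo_list *list)
{
	long sum = 0;
	size_t i;

	pthread_mutex_lock(&list->lock);
	for (i = 0; i < list->num_entries; i++)
		sum += list->entries[i];
	pthread_mutex_unlock(&list->lock);
	return sum;
}
```

The point of the sketch is that the reader and the writer take the same
lock, so a walk over the list never sees it half-modified. Which code paths
take the lock in the actual patch is not shown here.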
Luben Tuikov (1):
drm/amdgpu: Protect the amdgpu_bo_list list with a mutex
drivers/gpu/drm/amd/amdgpu/amdgpu_bo_list.c | 3 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_bo_list.h | 4 +++
drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 31 +++++++++++++++++++--
3 files changed, 35 insertions(+), 3 deletions(-)
Suggested-by: Christian König <christian.koenig at amd.com>
Cc: Alex Deucher <Alexander.Deucher at amd.com>
Cc: Andrey Grodzovsky <Andrey.Grodzovsky at amd.com>
Cc: Vitaly Prosyak <Vitaly.Prosyak at amd.com>
Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2048
Signed-off-by: Luben Tuikov <luben.tuikov at amd.com>
base-commit: ab7e60938be74e21c723223e7eb96cac7b441e5e
--
2.36.1.74.g277cf0bc36