[PATCH 0/5] Fix mode1 reset test failures

Felix Kuehling felix.kuehling at amd.com
Wed Feb 26 20:40:11 UTC 2025


The series is

Reviewed-by: Felix Kuehling <felix.kuehling at amd.com>


On 2025-02-26 12:14, Philip Yang wrote:
> Running the mode1 reset test alongside compute applications triggers many
> different failures, such as machine reboots and kernel crashes with general
> protection faults, NULL pointer accesses or CPU page faults, from seemingly
> random call traces.
>
> With a KASAN and slub_debug enabled kernel, we capture slub left-redzone
> overwritten warnings, but no KASAN warning before the crash. This confirms
> that system memory is overwritten by the GPU, not by the CPU.
>
> After changing the hang_hws test to evict user queues first and then do the
> mode1 reset, the crash no longer occurs, which confirms that the system
> memory is overwritten by user queues. Because the user queues keep using the
> GPU while the KFD cleanup worker frees the outstanding BOs, the freed system
> memory is reallocated and reused to create jobs and resources. The user
> queues then corrupt those data structures and cause the crash.
>
> The fix is in the KFD cleanup worker: after evicting all user queues, flush
> reset_domain->wq to ensure that any ongoing mode1 reset is done or the user
> queues are evicted, then free the outstanding BOs.
>
> Philip Yang (5):
>    drm/amdkfd: Remove kfd_process_hw_exception worker
>    drm/amdkfd: KFD release_work possible circular locking
>    drm/amdkfd: Fix mode1 reset crash issue
>    drm/amdkfd: Fix pqm_destroy_queue race with GPU reset
>    drm/amdkfd: debugfs hang_hws skip GPU with MES
>
>   drivers/gpu/drm/amd/amdkfd/kfd_device.c       |  5 +++
>   .../drm/amd/amdkfd/kfd_device_queue_manager.c | 11 +------
>   .../drm/amd/amdkfd/kfd_device_queue_manager.h |  1 -
>   drivers/gpu/drm/amd/amdkfd/kfd_process.c      | 33 ++++++++++++++-----
>   .../amd/amdkfd/kfd_process_queue_manager.c    |  2 +-
>   5 files changed, 32 insertions(+), 20 deletions(-)
>
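
For readers following the ordering described in the cover letter (evict
user queues, flush the reset workqueue, then free BOs), here is a minimal
user-space sketch of that sequence. It is not code from the series: every
name below (model_process, model_evict_queues, model_flush_reset_wq,
model_free_bos, model_cleanup) is a hypothetical stand-in, and the
assumption that eviction only takes effect once pending reset work has run
is a simplification made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical model; none of these are real amdkfd symbols. */
struct model_process {
	bool queues_active;  /* user queues can still write through the GPU */
	bool reset_pending;  /* a mode1 reset work item is still queued */
	int *bo;             /* one outstanding buffer object in system memory */
};

/* Assumption: eviction cannot complete while a reset is still pending. */
static void model_evict_queues(struct model_process *p)
{
	assert(p != NULL);
	if (!p->reset_pending)
		p->queues_active = false;
}

/* Stand-in for flushing the reset workqueue: the pending reset finishes,
 * after which the queues are no longer running. */
static void model_flush_reset_wq(struct model_process *p)
{
	assert(p != NULL);
	if (p->reset_pending) {
		p->reset_pending = false;
		p->queues_active = false;
	}
}

/* Frees the BO; returns true if queues were still active at that moment,
 * i.e. the freed memory could still be overwritten by the GPU. */
static bool model_free_bos(struct model_process *p)
{
	bool unsafe;

	assert(p != NULL);
	unsafe = p->queues_active;
	free(p->bo);
	p->bo = NULL;
	return unsafe;
}

/* Cleanup worker model: flush_first selects the ordering with the flush. */
static bool model_cleanup(struct model_process *p, bool flush_first)
{
	model_evict_queues(p);
	if (flush_first)
		model_flush_reset_wq(p);
	return model_free_bos(p);
}
```

In this model, calling model_cleanup() with flush_first set to false on a
process that has a reset pending reports an unsafe free, while the same
call with flush_first set to true does not. The real worker of course
deals with locking and per-device state that this sketch leaves out.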

