[PATCH] drm/amdgpu: Fix race condition in amdgpu_vm_wait_idle during process kill
Christian König
christian.koenig at amd.com
Thu Aug 7 09:58:38 UTC 2025
On 07.08.25 10:46, Liu01 Tong wrote:
> The earlier commit b8adc31cc0ca ("drm/amdgpu: Avoid extra evict-restore
> process.") changed amdgpu_vm_wait_idle to use drm_sched_entity_flush
> instead of dma_resv_wait_timeout to avoid KFD eviction fence signaling.
> However, this introduces a race condition when processes are killed.
>
> During process kill, drm_sched_entity_flush() kills the VM entities.
> Concurrent job submissions from this process then fail.
Clear NAK to that. This is essentially why we call drm_sched_entity_flush() here in the first place.
Regards,
Christian.
>
> Fix by skipping vm entity flushing when the process is being killed.
>
> Signed-off-by: Liu01 Tong <Tong.Liu01 at amd.com>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> index 283dd44f04b0..ae43a378f866 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> @@ -2415,6 +2415,13 @@ void amdgpu_vm_adjust_size(struct amdgpu_device *adev, uint32_t min_vm_size,
> */
> long amdgpu_vm_wait_idle(struct amdgpu_vm *vm, long timeout)
> {
> + /* If the process is being killed, skip flushing the VM entities,
> + * as the entities used by concurrent job submissions of this
> + * process might be in an inconsistent state.
> + */
> + if (current->flags & PF_EXITING)
> + return timeout;
> +
> timeout = drm_sched_entity_flush(&vm->immediate, timeout);
> if (timeout <= 0)
> return timeout;
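
For readers following the thread, the behaviour under discussion can be sketched as a small user-space model. This is not kernel code: every toy_* name is hypothetical, and the assumption that a flush from a SIGKILLed task marks the entity as stopped is a simplification of what the scheduler does. It only illustrates the commit message's claim that submissions fail after the flush, and what the proposed PF_EXITING early return would change. As the reply above notes, stopping the entities on kill is the reason the flush is called here in the first place.

```c
#include <stdbool.h>

/* Toy user-space model; all names are hypothetical, not kernel APIs. */
struct toy_entity {
	bool stopped;   /* stands in for a scheduler entity that was killed */
};

struct toy_task {
	bool exiting;   /* stands in for current->flags & PF_EXITING */
	bool sigkilled; /* stands in for the task exiting due to SIGKILL */
};

/* Models the flush: a flush issued by a killed task stops the entity. */
long toy_entity_flush(struct toy_entity *e, const struct toy_task *t,
		      long timeout)
{
	if (t->exiting && t->sigkilled)
		e->stopped = true;
	return timeout;
}

/* Models a job submission: it fails once the entity has been stopped. */
int toy_push_job(const struct toy_entity *e)
{
	return e->stopped ? -1 : 0;
}

/*
 * Models amdgpu_vm_wait_idle() with and without the proposed check.
 * With patched == true, an exiting task skips the flush entirely.
 */
long toy_wait_idle(struct toy_entity *e, const struct toy_task *t,
		   long timeout, bool patched)
{
	if (patched && t->exiting)
		return timeout;
	return toy_entity_flush(e, t, timeout);
}
```

In this model, an unpatched toy_wait_idle() called by a killed task leaves the entity stopped, so a later toy_push_job() returns -1. With the early return the entity stays usable, which is exactly the effect the NAK objects to.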