[PATCH] drm/amdgpu: Fix race condition in amdgpu_vm_wait_idle during process kill

Thu Aug 7 09:58:38 UTC 2025

On 07.08.25 10:46, Liu01 Tong wrote:
> The earlier commit b8adc31cc0ca ("drm/amdgpu: Avoid extra evict-restore
> process.") changed amdgpu_vm_wait_idle() to use drm_sched_entity_flush()
> instead of dma_resv_wait_timeout() to avoid KFD eviction fence signaling.
> However, this introduces a race condition when processes are killed.
> 
> During process kill, drm_sched_entity_flush() will kill the vm entities.
> Concurrent job submissions of this process will fail.

Clear NAK to that. Killing the VM entities so that further submissions fail is essentially why we call drm_sched_entity_flush() here in the first place.

Regards,
Christian.

> 
> Fix by skipping vm entity flushing when the process is being killed.
> 
> Signed-off-by: Liu01 Tong <Tong.Liu01 at amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> index 283dd44f04b0..ae43a378f866 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> @@ -2415,6 +2415,13 @@ void amdgpu_vm_adjust_size(struct amdgpu_device *adev, uint32_t min_vm_size,
>   */
>  long amdgpu_vm_wait_idle(struct amdgpu_vm *vm, long timeout)
>  {
> +	/* If the process is being killed, skip flushing the VM entities,
> +	 * as the entities used by concurrent job submissions of this
> +	 * process might be in an inconsistent state.
> +	 */
> +	if (current->flags & PF_EXITING)
> +		return timeout;
> +
>  	timeout = drm_sched_entity_flush(&vm->immediate, timeout);
>  	if (timeout <= 0)
>  		return timeout;