[PATCH v2] drm/amdgpu: fix task hang from failed job submission during process kill

Tue Aug 12 20:05:15 UTC 2025

On 8/12/25 1:31 AM, Liu01 Tong wrote:
> During process kill, drm_sched_entity_flush() kills the VM entities.
> Subsequent job submissions from this process then fail, but the
> resources of those jobs are not released and their fences are never
> signalled, causing tasks to hang and time out.
> 
> Fix this by checking the entity status in amdgpu_vm_ready() and not
> submitting jobs to a stopped entity.
> 
> v2: add an amdgpu_vm_ready() check before amdgpu_vm_clear_freed() in
> amdgpu_cs_vm_handling().
> 
> Signed-off-by: Liu01 Tong <Tong.Liu01 at amd.com>
> Signed-off-by: Lin.Cao <lincao12 at amd.com>

Closes: https://lore.kernel.org/regressions/f2b70e6e-bff6-42f3-82a2-81eed892cc30@linux.dev/
Tested-by: Matthew Schwartz <matthew.schwartz at linux.dev>

Thanks,
Matt
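
For anyone following along without the tree at hand, here is a rough
userspace model of the check the quoted patch describes. It is only a
sketch: the model_* names are hypothetical stand-ins, the kernel locking
(eviction lock, status_lock, the per-entity locks) is left out, and the
initial "not evicting" value is assumed from the function's doc comment,
since that line falls outside the quoted hunks.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a scheduler entity; not the kernel struct. */
struct model_entity {
	bool stopped;		/* set once the entity has been killed */
};

/* Hypothetical stand-in for the parts of the VM the check looks at. */
struct model_vm {
	bool evicting;		/* VM is currently being evicted */
	bool evicted_empty;	/* models list_empty(&vm->evicted) */
	struct model_entity immediate;
	struct model_entity delayed;
};

/* Ready only if not evicting, nothing is evicted, and no entity is stopped. */
static bool model_vm_ready(const struct model_vm *vm)
{
	bool ret = !vm->evicting;	/* assumed starting value */

	ret &= vm->evicted_empty;
	ret &= !vm->immediate.stopped;
	ret &= !vm->delayed.stopped;
	return ret;
}

/* Models the early return the patch adds before amdgpu_vm_clear_freed(). */
static int model_cs_vm_handling(const struct model_vm *vm)
{
	if (!model_vm_ready(vm))
		return -EINVAL;
	return 0;	/* the real function would go on to update the VM */
}

/* Small self-check: a healthy VM passes, a killed entity is rejected. */
static void model_self_check(void)
{
	struct model_vm vm = { .evicting = false, .evicted_empty = true };

	assert(model_cs_vm_handling(&vm) == 0);
	vm.immediate.stopped = true;
	assert(model_cs_vm_handling(&vm) == -EINVAL);
}
```

The point of the model is that a stopped entity now makes the submission
path bail out early with an error instead of queueing work that can never
complete.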

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c |  3 +++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 15 +++++++++++----
>  2 files changed, 14 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> index e1e48e6f1f35..cdc02860011c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> @@ -1138,6 +1138,9 @@ static int amdgpu_cs_vm_handling(struct amdgpu_cs_parser *p)
>  		}
>  	}
>  
> +	if (!amdgpu_vm_ready(vm))
> +		return -EINVAL;
> +
>  	r = amdgpu_vm_clear_freed(adev, vm, NULL);
>  	if (r)
>  		return r;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> index 283dd44f04b0..bf42246a3db2 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> @@ -654,11 +654,10 @@ int amdgpu_vm_validate(struct amdgpu_device *adev, struct amdgpu_vm *vm,
>   * Check if all VM PDs/PTs are ready for updates
>   *
>   * Returns:
> - * True if VM is not evicting.
> + * True if VM is not evicting and none of the VM entities is stopped
>   */
>  bool amdgpu_vm_ready(struct amdgpu_vm *vm)
>  {
> -	bool empty;
>  	bool ret;
>  
>  	amdgpu_vm_eviction_lock(vm);
> @@ -666,10 +665,18 @@ bool amdgpu_vm_ready(struct amdgpu_vm *vm)
>  	amdgpu_vm_eviction_unlock(vm);
>  
>  	spin_lock(&vm->status_lock);
> -	empty = list_empty(&vm->evicted);
> +	ret &= list_empty(&vm->evicted);
>  	spin_unlock(&vm->status_lock);
>  
> -	return ret && empty;
> +	spin_lock(&vm->immediate.lock);
> +	ret &= !vm->immediate.stopped;
> +	spin_unlock(&vm->immediate.lock);
> +
> +	spin_lock(&vm->delayed.lock);
> +	ret &= !vm->delayed.stopped;
> +	spin_unlock(&vm->delayed.lock);
> +
> +	return ret;
>  }
>  
>  /**