[PATCH] drm/amdgpu: fix task hang from failed job submission during process kill
Christian König
christian.koenig at amd.com
Tue Aug 12 08:07:13 UTC 2025
On 12.08.25 10:00, Liu01 Tong wrote:
> During process kill, drm_sched_entity_flush() kills the VM entities.
> Subsequent job submissions from this process then fail, but the
> resources of those jobs are not released and their fences are never
> signalled, causing tasks to hang and time out.
>
> Fix this by checking the entity status in amdgpu_vm_ready() and
> avoiding job submission to a stopped entity.
Looks good to me, but just to be on the safe side please add another call to amdgpu_vm_ready() in amdgpu_cs_vm_handling().
It should go right before we start updating the VM, i.e. after the amdgpu_vmid_uses_reserved() check for the gang submission and before the call to amdgpu_vm_clear_freed().
Regards,
Christian.
>
> Signed-off-by: Liu01 Tong <Tong.Liu01 at amd.com>
> Signed-off-by: Lin.Cao <lincao12 at amd.com>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 15 +++++++++++----
> 1 file changed, 11 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> index 283dd44f04b0..bf42246a3db2 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> @@ -654,11 +654,10 @@ int amdgpu_vm_validate(struct amdgpu_device *adev, struct amdgpu_vm *vm,
>   * Check if all VM PDs/PTs are ready for updates
>   *
>   * Returns:
> - * True if VM is not evicting.
> + * True if VM is not evicting and all VM entities are not stopped
>   */
>  bool amdgpu_vm_ready(struct amdgpu_vm *vm)
>  {
> -	bool empty;
>  	bool ret;
>
>  	amdgpu_vm_eviction_lock(vm);
> @@ -666,10 +665,18 @@ bool amdgpu_vm_ready(struct amdgpu_vm *vm)
>  	amdgpu_vm_eviction_unlock(vm);
>
>  	spin_lock(&vm->status_lock);
> -	empty = list_empty(&vm->evicted);
> +	ret &= list_empty(&vm->evicted);
>  	spin_unlock(&vm->status_lock);
>
> -	return ret && empty;
> +	spin_lock(&vm->immediate.lock);
> +	ret &= !vm->immediate.stopped;
> +	spin_unlock(&vm->immediate.lock);
> +
> +	spin_lock(&vm->delayed.lock);
> +	ret &= !vm->delayed.stopped;
> +	spin_unlock(&vm->delayed.lock);
> +
> +	return ret;
>  }
>
>  /**
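
For readers following the thread, here is a rough user-space model of the idea in the patch and the review comment: a readiness check that also looks at the stopped flags of both VM entities, and a caller that bails out before doing any work when the check fails. This is only a sketch, not driver code. All model_* names are hypothetical, the locking done in the real function is left out, and the -EINVAL return value is an assumption made for illustration.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for a scheduler entity. */
struct model_entity {
	bool stopped;	/* set once the entity has been killed */
};

/* Hypothetical stand-in for the VM state that the check looks at. */
struct model_vm {
	bool evicting;			/* VM is currently being evicted */
	unsigned int num_evicted;	/* stands in for the vm->evicted list */
	struct model_entity immediate;
	struct model_entity delayed;
};

/*
 * Ready only if the VM is not evicting, nothing is on the evicted
 * list and neither entity is stopped. The kernel version takes a
 * lock around each of these reads; the model skips that.
 */
static bool model_vm_ready(const struct model_vm *vm)
{
	bool ret = !vm->evicting;

	ret &= (vm->num_evicted == 0);
	ret &= !vm->immediate.stopped;
	ret &= !vm->delayed.stopped;
	return ret;
}

/* Reject the submission up front instead of queuing work that can never run. */
static int model_submit(const struct model_vm *vm)
{
	if (!model_vm_ready(vm))
		return -EINVAL;	/* error code chosen for illustration only */

	/* ... VM updates and the actual job submission would follow here ... */
	return 0;
}

/* Models what happens on process kill: both VM entities get stopped. */
static void model_kill_entities(struct model_vm *vm)
{
	vm->immediate.stopped = true;
	vm->delayed.stopped = true;
}
```

In this model a zero-initialised struct model_vm is ready and model_submit() returns 0. After model_kill_entities() the same call is rejected, which is the point of checking before any VM update starts.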