[PATCH] drm/amdgpu: fix task hang from failed job submission during process kill

Mon Aug 11 12:16:32 UTC 2025

Hi Esther,

but that is harmless and potentially only gives a warning in the system log.

You could adjust amdgpu_vm_ready() if necessary.
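
For illustration, a minimal userspace sketch of what such a check could look like. This is not kernel code and not the actual amdgpu implementation: the model_* types and functions are hypothetical stand-ins, and the assumption is simply that the readiness check also looks at whether the VM entities were already stopped by the flush.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model, not the kernel structures. */
struct model_entity {
	bool stopped;	/* set once the entity has been flushed/killed */
};

struct model_vm {
	bool evicting;
	struct model_entity immediate;
	struct model_entity delayed;
	int submitted;	/* number of jobs the model accepted */
};

/* Stand-in for the flush on the exit_files() path: both entities get killed. */
static void model_vm_flush(struct model_vm *vm)
{
	assert(vm != NULL);
	vm->immediate.stopped = true;
	vm->delayed.stopped = true;
}

/* Readiness check that also takes the entity state into account. */
static bool model_vm_ready(const struct model_vm *vm)
{
	assert(vm != NULL);
	return !vm->evicting &&
	       !vm->immediate.stopped &&
	       !vm->delayed.stopped;
}

/*
 * Stand-in for a late caller on the exit_task_work() path.
 * Returns 1 if a job was submitted, 0 if the submission was skipped.
 */
static int model_vm_clear_freed(struct model_vm *vm)
{
	if (!model_vm_ready(vm))
		return 0;	/* nothing submitted, so nothing to wait on */

	vm->submitted++;
	return 1;
}
```

With this shape the late caller never creates a job for a stopped entity, so there is no unsignalled fence left behind for a task to hang on.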

Regards,
Christian.

On 11.08.25 11:05, Liu01, Tong (Esther) wrote:
> Hi Christian,
> 
> The real issue is a race condition during process exit after patch https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1f02f2044bda1db1fd995bc35961ab075fa7b5a2. This patch changed amdgpu_vm_wait_idle to use drm_sched_entity_flush instead of dma_resv_wait_timeout. Here is what happens:
> 
> do_exit
>     |
>     exit_files(tsk) ... amdgpu_flush ... amdgpu_vm_wait_idle ... drm_sched_entity_flush (kills entity)
>     ...
>     exit_task_work(tsk) ... amdgpu_gem_object_close ... amdgpu_vm_clear_freed (tries to submit to killed entity)
> 
> The entity gets killed in amdgpu_vm_wait_idle(), but amdgpu_vm_clear_freed() called by exit_task_work() still tries to submit jobs.
> 
> Kind regards,
> Esther
> 
> -----Original Message-----
> From: Koenig, Christian <Christian.Koenig at amd.com>
> Sent: Monday, August 11, 2025 4:25 PM
> To: Liu01, Tong (Esther) <Tong.Liu01 at amd.com>; dri-devel at lists.freedesktop.org
> Cc: phasta at kernel.org; dakr at kernel.org; matthew.brost at intel.com; Ba, Gang <Gang.Ba at amd.com>; matthew.schwartz at linux.dev; cao, lin <lin.cao at amd.com>
> Subject: Re: [PATCH] drm/amdgpu: fix task hang from failed job submission during process kill
> 
> On 11.08.25 09:20, Liu01 Tong wrote:
>> During process kill, drm_sched_entity_flush() will kill the vm
>> entities. The following job submissions of this process will fail
> 
> Well when the process is killed how can it still make job submissions?
> 
> Regards,
> Christian.
> 
>> , and
>> the resources of these jobs have not been released, nor have the
>> fences been signalled, causing tasks to hang.
>>
>> Fix by not doing job init when the entity is stopped. And when the job
>> is already submitted, free the job resource if the entity is stopped.
>>
>> Signed-off-by: Liu01 Tong <Tong.Liu01 at amd.com>
>> Signed-off-by: Lin.Cao <lincao12 at amd.com>
>> ---
>>  drivers/gpu/drm/scheduler/sched_entity.c | 13 +++++++------
>>  drivers/gpu/drm/scheduler/sched_main.c   |  5 +++++
>>  2 files changed, 12 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
>> index ac678de7fe5e..1e744b2eb2db 100644
>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>> @@ -570,6 +570,13 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
>>       bool first;
>>       ktime_t submit_ts;
>>
>> +     if (entity->stopped) {
>> +             DRM_ERROR("Trying to push job to a killed entity\n");
>> +             INIT_WORK(&sched_job->work, drm_sched_entity_kill_jobs_work);
>> +             schedule_work(&sched_job->work);
>> +             return;
>> +     }
>> +
>>       trace_drm_sched_job(sched_job, entity);
>>       atomic_inc(entity->rq->sched->score);
>>       WRITE_ONCE(entity->last_user, current->group_leader);
>> @@ -589,12 +596,6 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
>>
>>               /* Add the entity to the run queue */
>>               spin_lock(&entity->lock);
>> -             if (entity->stopped) {
>> -                     spin_unlock(&entity->lock);
>> -
>> -                     DRM_ERROR("Trying to push to a killed entity\n");
>> -                     return;
>> -             }
>>
>>               rq = entity->rq;
>>               sched = rq->sched;
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>> index bfea608a7106..c15b17d9ffe3 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -795,6 +795,11 @@ int drm_sched_job_init(struct drm_sched_job *job,
>>               return -ENOENT;
>>       }
>>
>> +     if (unlikely(entity->stopped)) {
>> +             pr_err("*ERROR* %s: entity is stopped!\n", __func__);
>> +             return -EINVAL;
>> +     }
>> +
>>       if (unlikely(!credits)) {
>>               pr_err("*ERROR* %s: credits cannot be 0!\n", __func__);
>>               return -EINVAL;
>
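
The two hunks quoted above can be modelled in userspace roughly as follows. This is only a sketch under stated assumptions: the toy_* names are hypothetical, the job is a plain heap allocation, and an immediate free() stands in for the deferred cleanup work the patch schedules.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical toy types, not the DRM scheduler structures. */
struct toy_entity {
	bool stopped;	/* entity was killed by the flush */
	int queued;	/* jobs accepted into the queue */
};

struct toy_job {
	struct toy_entity *entity;
};

/* Refuse to set up a job for a stopped entity (models the sched_main.c hunk). */
static int toy_job_init(struct toy_job *job, struct toy_entity *entity)
{
	assert(job != NULL && entity != NULL);
	if (entity->stopped)
		return -EINVAL;

	job->entity = entity;
	return 0;
}

/*
 * Push a heap-allocated job (models the sched_entity.c hunk).
 * Returns true if the job was queued, false if the entity was stopped
 * in the meantime and the job was released instead.
 */
static bool toy_push_job(struct toy_job *job)
{
	assert(job != NULL && job->entity != NULL);
	if (job->entity->stopped) {
		free(job);	/* stands in for the deferred cleanup work */
		return false;
	}

	job->entity->queued++;
	return true;
}
```

The second function covers the window where the entity is stopped after the job was initialised but before it is pushed.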