[PATCH 2/8] drm/amdgpu: rework queue reset scheduler interaction

Christian König christian.koenig at amd.com
Tue May 20 13:49:06 UTC 2025


On 5/20/25 15:09, Alex Deucher wrote:
> On Mon, May 19, 2025 at 2:30 PM Alex Deucher <alexander.deucher at amd.com> wrote:
>>
>> From: Christian König <ckoenig.leichtzumerken at gmail.com>
>>
>> Stopping the scheduler for queue reset is generally a good idea because
>> it prevents any worker from touching the ring buffer.
>>
>> But using amdgpu_fence_driver_force_completion() before restarting it was
>> a really bad idea because it marked fences as failed while the work was
>> potentially still running.
>>
>> Stop doing that and cleanup the comment a bit.
>>
>> Signed-off-by: Christian König <christian.koenig at amd.com>
>> Signed-off-by: Alex Deucher <alexander.deucher at amd.com>
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 27 ++++++++++++-------------
>>  1 file changed, 13 insertions(+), 14 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> index acb21fc8b3ce5..a0fab947143b5 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> @@ -136,10 +136,12 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
>>         } else if (amdgpu_gpu_recovery && ring->funcs->reset) {
>>                 bool is_guilty;
>>
>> -               dev_err(adev->dev, "Starting %s ring reset\n", s_job->sched->name);
>> -               /* stop the scheduler, but don't mess with the
>> -                * bad job yet because if ring reset fails
>> -                * we'll fall back to full GPU reset.
>> +               dev_err(adev->dev, "Starting %s ring reset\n",
>> +                       s_job->sched->name);
>> +
>> +               /*
>> +                * Stop the scheduler to prevent anybody else from touching the
>> +                * ring buffer.
>>                  */
>>                 drm_sched_wqueue_stop(&ring->sched);
>>
>> @@ -157,19 +159,16 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
>>
>>                 r = amdgpu_ring_reset(ring, job->vmid);
>>                 if (!r) {
>> -                       if (amdgpu_ring_sched_ready(ring))
>> -                               drm_sched_stop(&ring->sched, s_job);
>> -                       if (is_guilty) {
>> +                       if (is_guilty)
>>                                 atomic_inc(&ring->adev->gpu_reset_counter);
>> -                               amdgpu_fence_driver_force_completion(ring);
> 
> Do we still need this in the case of rings where we reset the entire
> queue?  E.g., compute or VCN?  In which case we should move this to
> the ring->reset callback.

No, it shouldn't be necessary in the first place as long as the rings still execute their fence packets.

And that should be the case at least for both graphics and compute.

Forcing completion only makes sense when the whole ASIC was reset and nothing is executing any more.
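
For illustration, the queue-reset branch after this patch boils down to roughly the following (condensed from the hunks quoted above; error handling and the full-GPU-reset fallback are omitted, so this is a sketch rather than a literal copy):

	/* Queue-reset branch of amdgpu_job_timedout(), condensed: */
	drm_sched_wqueue_stop(&ring->sched);      /* keep workers off the ring buffer */

	r = amdgpu_ring_reset(ring, job->vmid);   /* reset only this queue */
	if (!r) {
		if (is_guilty)
			atomic_inc(&ring->adev->gpu_reset_counter);
		/* No amdgpu_fence_driver_force_completion() here: the ring
		 * keeps executing its fence packets, so pending fences
		 * signal on their own once the queue is running again.
		 */
		drm_sched_wqueue_start(&ring->sched);
	}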

Regards,
Christian.


> 
> Alex
> 
>> -                       }
>> -                       if (amdgpu_ring_sched_ready(ring))
>> -                               drm_sched_start(&ring->sched, 0);
>> -                       dev_err(adev->dev, "Ring %s reset succeeded\n", ring->sched.name);
>> -                       drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE);
>> +                       drm_sched_wqueue_start(&ring->sched);
>> +                       dev_err(adev->dev, "Ring %s reset succeeded\n",
>> +                               ring->sched.name);
>> +                       drm_dev_wedged_event(adev_to_drm(adev),
>> +                                            DRM_WEDGE_RECOVERY_NONE);
>>                         goto exit;
>>                 }
>> -               dev_err(adev->dev, "Ring %s reset failure\n", ring->sched.name);
>> +               dev_err(adev->dev, "Ring %s reset failed\n", ring->sched.name);
>>         }
>>         dma_fence_set_error(&s_job->s_fence->finished, -ETIME);
>>
>> --
>> 2.49.0
>>