[PATCH 2/8] drm/amdgpu: rework queue reset scheduler interaction

Wed May 21 09:03:51 UTC 2025

On 5/20/25 18:38, Alex Deucher wrote:
> On Tue, May 20, 2025 at 9:49 AM Christian König
> <christian.koenig at amd.com> wrote:
>>
>> On 5/20/25 15:09, Alex Deucher wrote:
>>> On Mon, May 19, 2025 at 2:30 PM Alex Deucher <alexander.deucher at amd.com> wrote:
>>>>
>>>> From: Christian König <ckoenig.leichtzumerken at gmail.com>
>>>>
>>>> Stopping the scheduler for queue reset is generally a good idea because
>>>> it prevents any worker from touching the ring buffer.
>>>>
>>>> But using amdgpu_fence_driver_force_completion() before restarting it was
>>>> a really bad idea because it marked fences as failed while the work was
>>>> potentially still running.
>>>>
>>>> Stop doing that and cleanup the comment a bit.
>>>>
>>>> Signed-off-by: Christian König <christian.koenig at amd.com>
>>>> Signed-off-by: Alex Deucher <alexander.deucher at amd.com>
>>>> ---
>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 27 ++++++++++++-------------
>>>>  1 file changed, 13 insertions(+), 14 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>> index acb21fc8b3ce5..a0fab947143b5 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>> @@ -136,10 +136,12 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
>>>>         } else if (amdgpu_gpu_recovery && ring->funcs->reset) {
>>>>                 bool is_guilty;
>>>>
>>>> -               dev_err(adev->dev, "Starting %s ring reset\n", s_job->sched->name);
>>>> -               /* stop the scheduler, but don't mess with the
>>>> -                * bad job yet because if ring reset fails
>>>> -                * we'll fall back to full GPU reset.
>>>> +               dev_err(adev->dev, "Starting %s ring reset\n",
>>>> +                       s_job->sched->name);
>>>> +
>>>> +               /*
>>>> +                * Stop the scheduler to prevent anybody else from touching the
>>>> +                * ring buffer.
>>>>                  */
>>>>                 drm_sched_wqueue_stop(&ring->sched);
>>>>
>>>> @@ -157,19 +159,16 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
>>>>
>>>>                 r = amdgpu_ring_reset(ring, job->vmid);
>>>>                 if (!r) {
>>>> -                       if (amdgpu_ring_sched_ready(ring))
>>>> -                               drm_sched_stop(&ring->sched, s_job);
>>>> -                       if (is_guilty) {
>>>> +                       if (is_guilty)
>>>>                                 atomic_inc(&ring->adev->gpu_reset_counter);
>>>> -                               amdgpu_fence_driver_force_completion(ring);
>>>
>>> Do we still need this in the case of rings where we reset the entire
>>> queue?  E.g., compute or VCN?  In which case we should move this to
>>> the ring->reset callback.
>>
>> No, it shouldn't be necessary in the first place as long as the rings still execute their fence packages.
>>
>> And that should be the case at least for both graphics and compute.
>>
>> Forcing completion only makes sense when the whole ASIC was resetted and nothing executed any more.
> 
> This seems to result in a deadlock if you reset the entire queue
> rather than just the vmid.   I.e., if you test with just this patch
> and not any of the following patches.  In that case, the queue is
> reset so none of the fences are signaled.

That is just because of a specific behavior of the GFX/Compute engine.

When fences are issued while a reset is ongoing the CP writes the fence value not to the requested location, but rather to fence_addr + 4. See amdgpu_debugfs_fence_info_show for more details.

That's why I cleared the reset before issuing the fence command in the follow up patches.

Key point is that the stuff still executes and telling the core os that it can release the memory by force completing the fences is a really bad idea in that case.

Regards,
Christian.

> 
> Alex
> 
> 
>>
>> Regards,
>> Christian.
>>
>>
>>>
>>> Alex
>>>
>>>> -                       }
>>>> -                       if (amdgpu_ring_sched_ready(ring))
>>>> -                               drm_sched_start(&ring->sched, 0);
>>>> -                       dev_err(adev->dev, "Ring %s reset succeeded\n", ring->sched.name);
>>>> -                       drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE);
>>>> +                       drm_sched_wqueue_start(&ring->sched);
>>>> +                       dev_err(adev->dev, "Ring %s reset succeeded\n",
>>>> +                               ring->sched.name);
>>>> +                       drm_dev_wedged_event(adev_to_drm(adev),
>>>> +                                            DRM_WEDGE_RECOVERY_NONE);
>>>>                         goto exit;
>>>>                 }
>>>> -               dev_err(adev->dev, "Ring %s reset failure\n", ring->sched.name);
>>>> +               dev_err(adev->dev, "Ring %s reset failed\n", ring->sched.name);
>>>>         }
>>>>         dma_fence_set_error(&s_job->s_fence->finished, -ETIME);
>>>>
>>>> --
>>>> 2.49.0
>>>>
>>