[PATCH 1/7] drm/amd/amdgpu: Simplify SDMA reset mechanism by removing dynamic callbacks

Christian König christian.koenig at amd.com
Thu Mar 13 14:10:05 UTC 2025


Am 12.03.25 um 10:23 schrieb Zhang, Jesse(Jie):
>
> [AMD Official Use Only - AMD Internal Distribution Only]
>
>
>  
>
>  
>
> *From:*Koenig, Christian <Christian.Koenig at amd.com>
> *Sent:* Wednesday, March 12, 2025 4:39 PM
> *To:* Zhang, Jesse(Jie) <Jesse.Zhang at amd.com>; amd-gfx at lists.freedesktop.org
> *Cc:* Deucher, Alexander <Alexander.Deucher at amd.com>; Kim, Jonathan <Jonathan.Kim at amd.com>; Zhu, Jiadong <Jiadong.Zhu at amd.com>
> *Subject:* Re: [PATCH 1/7] drm/amd/amdgpu: Simplify SDMA reset mechanism by removing dynamic callbacks
>
>  
>
> Am 12.03.25 um 09:15 schrieb Zhang, Jesse(Jie):
>
>     [SNIP9
>
>         -
>
>         +     gfx_ring->funcs->stop_queue(adev, instance_id);
>
>      
>
>     Yeah that starts to look good. Question here is who is calling amdgpu_sdma_reset_engine()?
>
>      
>
>     If this call comes from engine specific code we might not need the start/stop_queue callbacks all together.
>
>      
>
>         Kfd and sdma v4/v5/v5_2 will call amdgpu_sdma_reset_engine, and start/stop_queue callbacks are only implemented in sdmav4/sdmav5/sdma5_2.
>
>
> Why would the KFD call this as well? Because it detects an issue with a SDMA user queue  If yes I would rather suggest that the KFD calls the reset function of the paging queue.
>
> Since this reset function is specific to the SDMA HW generation anyway you don't need those extra functions to abstract starting and stopping of the queue for each HW generation.
>
> kfd can't call reset function directly, unless we add a parameter src  to distinguish kfd and kgd in reset function, like this:
>
> int (*reset)(struct amdgpu_ring *ring, unsigned int vmid, */int src/* );
>
> As Alex said in another thread,
>
> We need to distinguish  kfd and kgd  in reset.
>
> If kfd triggers a reset, kgd must save healthy jobs and recover jobs after reset.
>
> If kgd triggers a reset, kgd must abandon bad jobs after reset.(and perhaps kfd needs to save its healthy jobs for recovery).
>

I don't think the source of the reset should be relevant to the reset procedure.

The source is basically just the first one who runs into a timeout, that can be both KFD and KGD.

But the cause of the timeout is not necessary the one who signals that a timeout happens.

So as far as I can see you should not have that as parameter either.

>  
>
> If we can add a parameter, I am ok for that solution too.
>
>  
>
> Additionally:
>
> For sdma6/7, when a queue reset fails, we may need to fall back to an engine reset for a attempt.
>

Yeah, but that should be trivial.

Regards,
Christian.

>  
>
> Thanks
>
> Jesse
>
>
> Regards,
> Christian.
>
>
>      
>
>      
>
>     Thanks
>
>     Jesse
>
>      
>
>     Regards,
>
>     Christian.
>
>      
>
>               /* Perform the SDMA reset for the specified instance */
>
>               ret = amdgpu_dpm_reset_sdma(adev, 1 << instance_id);
>
>               if (ret) {
>
>         @@ -591,18 +573,7 @@ int amdgpu_sdma_reset_engine(struct amdgpu_device *adev, uint32_t instance_id, b
>
>                       goto exit;
>
>               }
>
>          
>
>         -     /* Invoke all registered post_reset callbacks */
>
>         -     list_for_each_entry(funcs, &adev->sdma.reset_callback_list, list) {
>
>         -             if (funcs->post_reset) {
>
>         -                     ret = funcs->post_reset(adev, instance_id);
>
>         -                     if (ret) {
>
>         -                             dev_err(adev->dev,
>
>         -                             "afterReset callback failed for instance %u: %d\n",
>
>         -                                     instance_id, ret);
>
>         -                             goto exit;
>
>         -                     }
>
>         -             }
>
>         -     }
>
>         +     gfx_ring->funcs->start_queue(adev, instance_id);
>
>          
>
>          exit:
>
>               /* Restart the scheduler's work queue for the GFX and page rings
>
>         diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
>
>         b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
>
>         index fd34dc138081..c1f7ccff9c4e 100644
>
>         --- a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
>
>         +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
>
>         @@ -2132,6 +2132,8 @@ static const struct amdgpu_ring_funcs sdma_v4_4_2_ring_funcs = {
>
>               .emit_reg_wait = sdma_v4_4_2_ring_emit_reg_wait,
>
>               .emit_reg_write_reg_wait = amdgpu_ring_emit_reg_write_reg_wait_helper,
>
>               .reset = sdma_v4_4_2_reset_queue,
>
>         +     .stop_queue = sdma_v4_4_2_stop_queue,
>
>         +     .start_queue = sdma_v4_4_2_restore_queue,
>
>               .is_guilty = sdma_v4_4_2_ring_is_guilty,  };
>
>          
>
>      
>
>  
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20250313/6ad39aec/attachment-0001.htm>


More information about the amd-gfx mailing list