[PATCH 1/7] drm/amd/amdgpu: Simplify SDMA reset mechanism by removing dynamic callbacks
Christian König
christian.koenig at amd.com
Thu Mar 13 14:10:05 UTC 2025
Am 12.03.25 um 10:23 schrieb Zhang, Jesse(Jie):
>
> [AMD Official Use Only - AMD Internal Distribution Only]
>
>
>
>
>
>
> *From:*Koenig, Christian <Christian.Koenig at amd.com>
> *Sent:* Wednesday, March 12, 2025 4:39 PM
> *To:* Zhang, Jesse(Jie) <Jesse.Zhang at amd.com>; amd-gfx at lists.freedesktop.org
> *Cc:* Deucher, Alexander <Alexander.Deucher at amd.com>; Kim, Jonathan <Jonathan.Kim at amd.com>; Zhu, Jiadong <Jiadong.Zhu at amd.com>
> *Subject:* Re: [PATCH 1/7] drm/amd/amdgpu: Simplify SDMA reset mechanism by removing dynamic callbacks
>
>
>
> Am 12.03.25 um 09:15 schrieb Zhang, Jesse(Jie):
>
> [SNIP9
>
> -
>
> + gfx_ring->funcs->stop_queue(adev, instance_id);
>
>
>
> Yeah that starts to look good. Question here is who is calling amdgpu_sdma_reset_engine()?
>
>
>
> If this call comes from engine specific code we might not need the start/stop_queue callbacks all together.
>
>
>
> Kfd and sdma v4/v5/v5_2 will call amdgpu_sdma_reset_engine, and start/stop_queue callbacks are only implemented in sdmav4/sdmav5/sdma5_2.
>
>
> Why would the KFD call this as well? Because it detects an issue with a SDMA user queue If yes I would rather suggest that the KFD calls the reset function of the paging queue.
>
> Since this reset function is specific to the SDMA HW generation anyway you don't need those extra functions to abstract starting and stopping of the queue for each HW generation.
>
> kfd can't call reset function directly, unless we add a parameter src to distinguish kfd and kgd in reset function, like this:
>
> int (*reset)(struct amdgpu_ring *ring, unsigned int vmid, */int src/* );
>
> As Alex said in another thread,
>
> We need to distinguish kfd and kgd in reset.
>
> If kfd triggers a reset, kgd must save healthy jobs and recover jobs after reset.
>
> If kgd triggers a reset, kgd must abandon bad jobs after reset.(and perhaps kfd needs to save its healthy jobs for recovery).
>
I don't think the source of the reset should be relevant to the reset procedure.
The source is basically just the first one who runs into a timeout, that can be both KFD and KGD.
But the cause of the timeout is not necessary the one who signals that a timeout happens.
So as far as I can see you should not have that as parameter either.
>
>
> If we can add a parameter, I am ok for that solution too.
>
>
>
> Additionally:
>
> For sdma6/7, when a queue reset fails, we may need to fall back to an engine reset for a attempt.
>
Yeah, but that should be trivial.
Regards,
Christian.
>
>
> Thanks
>
> Jesse
>
>
> Regards,
> Christian.
>
>
>
>
>
>
> Thanks
>
> Jesse
>
>
>
> Regards,
>
> Christian.
>
>
>
> /* Perform the SDMA reset for the specified instance */
>
> ret = amdgpu_dpm_reset_sdma(adev, 1 << instance_id);
>
> if (ret) {
>
> @@ -591,18 +573,7 @@ int amdgpu_sdma_reset_engine(struct amdgpu_device *adev, uint32_t instance_id, b
>
> goto exit;
>
> }
>
>
>
> - /* Invoke all registered post_reset callbacks */
>
> - list_for_each_entry(funcs, &adev->sdma.reset_callback_list, list) {
>
> - if (funcs->post_reset) {
>
> - ret = funcs->post_reset(adev, instance_id);
>
> - if (ret) {
>
> - dev_err(adev->dev,
>
> - "afterReset callback failed for instance %u: %d\n",
>
> - instance_id, ret);
>
> - goto exit;
>
> - }
>
> - }
>
> - }
>
> + gfx_ring->funcs->start_queue(adev, instance_id);
>
>
>
> exit:
>
> /* Restart the scheduler's work queue for the GFX and page rings
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
>
> b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
>
> index fd34dc138081..c1f7ccff9c4e 100644
>
> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
>
> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
>
> @@ -2132,6 +2132,8 @@ static const struct amdgpu_ring_funcs sdma_v4_4_2_ring_funcs = {
>
> .emit_reg_wait = sdma_v4_4_2_ring_emit_reg_wait,
>
> .emit_reg_write_reg_wait = amdgpu_ring_emit_reg_write_reg_wait_helper,
>
> .reset = sdma_v4_4_2_reset_queue,
>
> + .stop_queue = sdma_v4_4_2_stop_queue,
>
> + .start_queue = sdma_v4_4_2_restore_queue,
>
> .is_guilty = sdma_v4_4_2_ring_is_guilty, };
>
>
>
>
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20250313/6ad39aec/attachment-0001.htm>
More information about the amd-gfx
mailing list