[PATCH 1/7] drm/amd/amdgpu: Simplify SDMA reset mechanism by removing dynamic callbacks
Alex Deucher
alexdeucher at gmail.com
Thu Mar 13 14:38:06 UTC 2025
I think as long as the locking is correct, the src shouldn't matter.
You just need to stop the kernel queues (and save state) and evict the
user queues (since HWS is responsible for saving the MQDs of the
non-guilty user queues). If KFD detected the hang (e.g., queue
eviction fails when called for memory pressure, etc.), we just need to
make sure that it's ok for the sdma reset routine to call evict queues
again even if it was already called (presumably it should exit early
since the queues are already evicted). If KGD initiates the reset, it
will call into KFD to evict queues. We just need to make sure the
evict queues function we call just evicts the queues and doesn't also
try and reset.
Alex
On Wed, Mar 12, 2025 at 5:24 AM Zhang, Jesse(Jie) <Jesse.Zhang at amd.com> wrote:
>
> [AMD Official Use Only - AMD Internal Distribution Only]
>
>
>
>
>
>
> From: Koenig, Christian <Christian.Koenig at amd.com>
> Sent: Wednesday, March 12, 2025 4:39 PM
> To: Zhang, Jesse(Jie) <Jesse.Zhang at amd.com>; amd-gfx at lists.freedesktop.org
> Cc: Deucher, Alexander <Alexander.Deucher at amd.com>; Kim, Jonathan <Jonathan.Kim at amd.com>; Zhu, Jiadong <Jiadong.Zhu at amd.com>
> Subject: Re: [PATCH 1/7] drm/amd/amdgpu: Simplify SDMA reset mechanism by removing dynamic callbacks
>
>
>
> Am 12.03.25 um 09:15 schrieb Zhang, Jesse(Jie):
>
> [SNIP9
>
> -
>
> + gfx_ring->funcs->stop_queue(adev, instance_id);
>
>
>
> Yeah that starts to look good. Question here is who is calling amdgpu_sdma_reset_engine()?
>
>
>
> If this call comes from engine specific code we might not need the start/stop_queue callbacks all together.
>
>
>
> Kfd and sdma v4/v5/v5_2 will call amdgpu_sdma_reset_engine, and start/stop_queue callbacks are only implemented in sdmav4/sdmav5/sdma5_2.
>
>
> Why would the KFD call this as well? Because it detects an issue with a SDMA user queue If yes I would rather suggest that the KFD calls the reset function of the paging queue.
>
> Since this reset function is specific to the SDMA HW generation anyway you don't need those extra functions to abstract starting and stopping of the queue for each HW generation.
>
> kfd can't call reset function directly, unless we add a parameter src to distinguish kfd and kgd in reset function, like this:
>
> int (*reset)(struct amdgpu_ring *ring, unsigned int vmid, int src );
>
> As Alex said in another thread,
>
> We need to distinguish kfd and kgd in reset.
>
> If kfd triggers a reset, kgd must save healthy jobs and recover jobs after reset.
>
> If kgd triggers a reset, kgd must abandon bad jobs after reset.(and perhaps kfd needs to save its healthy jobs for recovery).
>
>
>
> If we can add a parameter, I am ok for that solution too.
>
>
>
> Additionally:
>
> For sdma6/7, when a queue reset fails, we may need to fall back to an engine reset for a attempt.
>
>
>
> Thanks
>
> Jesse
>
>
> Regards,
> Christian.
>
>
>
>
>
>
> Thanks
>
> Jesse
>
>
>
> Regards,
>
> Christian.
>
>
>
> /* Perform the SDMA reset for the specified instance */
>
> ret = amdgpu_dpm_reset_sdma(adev, 1 << instance_id);
>
> if (ret) {
>
> @@ -591,18 +573,7 @@ int amdgpu_sdma_reset_engine(struct amdgpu_device *adev, uint32_t instance_id, b
>
> goto exit;
>
> }
>
>
>
> - /* Invoke all registered post_reset callbacks */
>
> - list_for_each_entry(funcs, &adev->sdma.reset_callback_list, list) {
>
> - if (funcs->post_reset) {
>
> - ret = funcs->post_reset(adev, instance_id);
>
> - if (ret) {
>
> - dev_err(adev->dev,
>
> - "afterReset callback failed for instance %u: %d\n",
>
> - instance_id, ret);
>
> - goto exit;
>
> - }
>
> - }
>
> - }
>
> + gfx_ring->funcs->start_queue(adev, instance_id);
>
>
>
> exit:
>
> /* Restart the scheduler's work queue for the GFX and page rings
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
>
> b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
>
> index fd34dc138081..c1f7ccff9c4e 100644
>
> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
>
> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
>
> @@ -2132,6 +2132,8 @@ static const struct amdgpu_ring_funcs sdma_v4_4_2_ring_funcs = {
>
> .emit_reg_wait = sdma_v4_4_2_ring_emit_reg_wait,
>
> .emit_reg_write_reg_wait = amdgpu_ring_emit_reg_write_reg_wait_helper,
>
> .reset = sdma_v4_4_2_reset_queue,
>
> + .stop_queue = sdma_v4_4_2_stop_queue,
>
> + .start_queue = sdma_v4_4_2_restore_queue,
>
> .is_guilty = sdma_v4_4_2_ring_is_guilty, };
>
>
>
>
>
>
More information about the amd-gfx
mailing list