[PATCH 1/1] drm/amd: Add per-ring reset for vcn v4.0.5 use

Tue May 6 19:12:32 UTC 2025

On Tue, May 6, 2025 at 2:19 PM Mario Limonciello
<mario.limonciello at amd.com> wrote:
>
> On VCN 4.0.5, a job can time out in some situations.  The job timeout
> then triggers a GPU reset for recovery.  That has exposed a number of
> issues with GPU reset that have since been fixed.  However, a full GPU
> reset isn't actually needed in this case; restarting the ring is
> enough.
>
> Add a reset callback for the ring which will stop and start VCN if the
> issue happens.
>
> Link: https://gitlab.freedesktop.org/mesa/mesa/-/issues/12528
> Link: https://gitlab.freedesktop.org/drm/amd/-/issues/3909
> Signed-off-by: Mario Limonciello <mario.limonciello at amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c | 19 +++++++++++++++++++
>  1 file changed, 19 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c
> index 558469744f3a..3e6e8127143b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c
> +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c
> @@ -1440,6 +1440,24 @@ static void vcn_v4_0_5_unified_ring_set_wptr(struct amdgpu_ring *ring)
>         }
>  }
>
> +static int vcn_v4_0_5_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
> +{
> +       struct amdgpu_device *adev = ring->adev;
> +       int i;
> +
> +       for (i = 0; i < adev->vcn.num_vcn_inst; ++i) {
> +               struct amdgpu_vcn_inst *vinst = &adev->vcn.inst[i];
> +
> +               if (ring != &vinst->ring_enc[0])
> +                       continue;

You can drop the loop and just look up the instance directly:
struct amdgpu_vcn_inst *vinst = &adev->vcn.inst[ring->me];

Also check whether per-queue reset is supported:
if (!(adev->vcn.supported_reset & AMDGPU_RESET_TYPE_PER_QUEUE))
        return -EOPNOTSUPP;

You'll also need something like:
adev->vcn.supported_reset =
        amdgpu_get_soft_full_reset_mask(&adev->vcn.inst[0].ring_enc[0]);
adev->vcn.supported_reset |= AMDGPU_RESET_TYPE_PER_QUEUE;
in vcn_v4_0_5_sw_init().
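
Putting the three suggestions together, the callback could look roughly
like the sketch below.  This is a user-space model only, not the driver
code: every mock_* type, function and MOCK_* constant is a hypothetical
stand-in for the corresponding amdgpu structure, helper or reset flag,
and the stop/start/ring-test helpers are empty placeholders.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins; NOT the real amdgpu definitions. */
#define MOCK_RESET_TYPE_FULL		(1u << 0)
#define MOCK_RESET_TYPE_PER_QUEUE	(1u << 1)
#define MOCK_MAX_VCN_INST		4

struct mock_device;

struct mock_ring {
	struct mock_device *adev;
	unsigned int me;	/* index of the VCN instance owning this ring */
};

struct mock_vcn_inst {
	struct mock_ring ring_enc[1];
	int restarts;		/* counts stop/start cycles, for illustration */
};

struct mock_device {
	struct {
		unsigned int num_vcn_inst;
		unsigned int supported_reset;
		struct mock_vcn_inst inst[MOCK_MAX_VCN_INST];
	} vcn;
};

/* Placeholders for the real stop/start/ring-test helpers. */
static void mock_vcn_stop(struct mock_vcn_inst *vinst) { (void)vinst; }
static void mock_vcn_start(struct mock_vcn_inst *vinst) { vinst->restarts++; }
static int mock_ring_test(struct mock_ring *ring) { (void)ring; return 0; }

/* Models the sw_init() step: advertise per-queue reset once, up front. */
static void mock_sw_init(struct mock_device *adev, unsigned int num_inst)
{
	unsigned int i;

	assert(num_inst <= MOCK_MAX_VCN_INST);
	adev->vcn.num_vcn_inst = num_inst;
	for (i = 0; i < num_inst; i++) {
		adev->vcn.inst[i].ring_enc[0].adev = adev;
		adev->vcn.inst[i].ring_enc[0].me = i;
		adev->vcn.inst[i].restarts = 0;
	}
	adev->vcn.supported_reset = MOCK_RESET_TYPE_FULL;
	adev->vcn.supported_reset |= MOCK_RESET_TYPE_PER_QUEUE;
}

/* Models the reset callback: no loop, direct lookup via ring->me. */
static int mock_ring_reset(struct mock_ring *ring, unsigned int vmid)
{
	struct mock_device *adev = ring->adev;
	struct mock_vcn_inst *vinst = &adev->vcn.inst[ring->me];

	(void)vmid;	/* unused in this model */

	if (!(adev->vcn.supported_reset & MOCK_RESET_TYPE_PER_QUEUE))
		return -EOPNOTSUPP;

	mock_vcn_stop(vinst);
	mock_vcn_start(vinst);

	return mock_ring_test(ring);
}
```

In this model, only the instance that owns the ring is restarted, and
the callback returns -EOPNOTSUPP when the per-queue flag was never set.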

Also, since each VCN instance is single-threaded, you could in theory
save the other jobs in the ring and fix up the ring pointers after the
reset so that execution continues past the bad job.  That could be left
as a future improvement, however.
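
To illustrate the save-and-replay idea, here is a toy user-space model.
It is not based on any existing amdgpu helper: the toy_ring type, its
functions and the integer "jobs" are all hypothetical, and a real
implementation would also have to deal with fences and the actual
packet layout.  The sketch assumes the hung job is the entry at rptr
and that nothing queued behind it has started executing.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_RING_SIZE 8	/* power of two, so masking handles wrap-around */

/* Hypothetical toy ring; each slot holds one whole "job". */
struct toy_ring {
	int slots[TOY_RING_SIZE];
	unsigned int rptr;	/* next entry the engine would consume */
	unsigned int wptr;	/* next free slot for the driver */
};

static void toy_ring_emit(struct toy_ring *r, int job)
{
	assert(r->wptr - r->rptr < TOY_RING_SIZE);	/* must not be full */
	r->slots[r->wptr & (TOY_RING_SIZE - 1)] = job;
	r->wptr++;
}

/*
 * Drop the hung job at rptr, reset the pointers and re-emit every job
 * that was queued behind it.  Returns the number of replayed jobs.
 */
static unsigned int toy_ring_reset_and_replay(struct toy_ring *r)
{
	int saved[TOY_RING_SIZE];
	unsigned int n = 0;
	unsigned int i;

	/* Save the jobs queued behind the hung one, if any. */
	if (r->rptr != r->wptr) {
		for (i = r->rptr + 1; i != r->wptr; i++)
			saved[n++] = r->slots[i & (TOY_RING_SIZE - 1)];
	}

	/* The engine restart is modelled as both pointers going to zero. */
	r->rptr = 0;
	r->wptr = 0;

	/* Fix up the ring by replaying the saved jobs from the start. */
	for (i = 0; i < n; i++)
		toy_ring_emit(r, saved[i]);

	return n;
}
```

Because only one job runs at a time in this model, everything after
rptr can be copied out and re-submitted unchanged once the instance is
back up.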

Alex

> +               vcn_v4_0_5_stop(vinst);
> +               vcn_v4_0_5_start(vinst);
> +               break;
> +       }
> +
> +       return amdgpu_ring_test_helper(ring);
> +}
> +
>  static struct amdgpu_ring_funcs vcn_v4_0_5_unified_ring_vm_funcs = {
>         .type = AMDGPU_RING_TYPE_VCN_ENC,
>         .align_mask = 0x3f,
> @@ -1467,6 +1485,7 @@ static struct amdgpu_ring_funcs vcn_v4_0_5_unified_ring_vm_funcs = {
>         .emit_wreg = vcn_v2_0_enc_ring_emit_wreg,
>         .emit_reg_wait = vcn_v2_0_enc_ring_emit_reg_wait,
>         .emit_reg_write_reg_wait = amdgpu_ring_emit_reg_write_reg_wait_helper,
> +       .reset = vcn_v4_0_5_ring_reset,
>  };
>
>  /**
> --
> 2.49.0
>