[PATCH 1/1] drm/amd: Add per-ring reset for vcn v4.0.5 use
Alex Deucher
alexdeucher at gmail.com
Tue May 6 19:16:53 UTC 2025
On Tue, May 6, 2025 at 3:12 PM Alex Deucher <alexdeucher at gmail.com> wrote:
>
> On Tue, May 6, 2025 at 2:19 PM Mario Limonciello
> <mario.limonciello at amd.com> wrote:
> >
> > There is a problem occurring on VCN 4.0.5 where in some situations a job
> > is timing out. This triggers a job timeout which then causes a GPU
> > reset for recovery. That has exposed a number of issues with GPU reset
> > that have since been fixed. But also a GPU reset isn't actually needed
> > for this circumstance. Just restarting the ring is enough.
> >
> > Add a reset callback for the ring which will stop and start VCN if the
> > issue happens.
> >
> > Link: https://gitlab.freedesktop.org/mesa/mesa/-/issues/12528
> > Link: https://gitlab.freedesktop.org/drm/amd/-/issues/3909
> > Signed-off-by: Mario Limonciello <mario.limonciello at amd.com>
> > ---
> > drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c | 19 +++++++++++++++++++
> > 1 file changed, 19 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c
> > index 558469744f3a..3e6e8127143b 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c
> > @@ -1440,6 +1440,24 @@ static void vcn_v4_0_5_unified_ring_set_wptr(struct amdgpu_ring *ring)
> > }
> > }
> >
> > +static int vcn_v4_0_5_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
> > +{
> > +	struct amdgpu_device *adev = ring->adev;
> > +	int i;
> > +
> > +	for (i = 0; i < adev->vcn.num_vcn_inst; ++i) {
> > +		struct amdgpu_vcn_inst *vinst = &adev->vcn.inst[i];
> > +
> > +		if (ring != &vinst->ring_enc[0])
> > +			continue;
>
> You can drop the loop and just look up the instance directly:
>     struct amdgpu_vcn_inst *vinst = &adev->vcn.inst[ring->me];
>
> Also check if per queue reset is supported:
>     if (!(adev->vcn.supported_reset & AMDGPU_RESET_TYPE_PER_QUEUE))
>             return -EOPNOTSUPP;
>
> You'll also need something like:
>     adev->vcn.supported_reset =
>             amdgpu_get_soft_full_reset_mask(&adev->vcn.inst[0].ring_enc[0]);
>     adev->vcn.supported_reset |= AMDGPU_RESET_TYPE_PER_QUEUE;
> in vcn_v4_0_5_sw_init().
>
> Also, since each VCN instance is only single threaded, you could
> theoretically save the other jobs in the ring and fix up the ring
> pointers after resetting to continue after the bad job. That could be
> left as a future improvement however.

While you are at it, you could implement support in vcn_v4_0.c and
vcn_v5_0_0.c as well. Older VCNs will be a bit more complex as they
support multiple queues per engine, so if you reset the engine, you
need to properly clean up all the queues.

Alex
>
> Alex
>
> > +		vcn_v4_0_5_stop(vinst);
> > +		vcn_v4_0_5_start(vinst);
> > +		break;
> > +	}
> > +
> > +	return amdgpu_ring_test_helper(ring);
> > +}
> > +
> > static struct amdgpu_ring_funcs vcn_v4_0_5_unified_ring_vm_funcs = {
> >  	.type = AMDGPU_RING_TYPE_VCN_ENC,
> >  	.align_mask = 0x3f,
> > @@ -1467,6 +1485,7 @@ static struct amdgpu_ring_funcs vcn_v4_0_5_unified_ring_vm_funcs = {
> >  	.emit_wreg = vcn_v2_0_enc_ring_emit_wreg,
> >  	.emit_reg_wait = vcn_v2_0_enc_ring_emit_reg_wait,
> >  	.emit_reg_write_reg_wait = amdgpu_ring_emit_reg_write_reg_wait_helper,
> > +	.reset = vcn_v4_0_5_ring_reset,
> > };
> >
> > /**
> > --
> > 2.49.0
> >
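
For reference, the review comments in this thread (direct instance lookup
through ring->me, a check of supported_reset before resetting, and setting
the per-queue bit at sw_init time) could be sketched roughly as below. This
is a standalone userspace mock, not the driver code: the struct layouts are
cut down to the fields discussed here, the value of
AMDGPU_RESET_TYPE_PER_QUEUE is a placeholder, and every sketch_* / mock_*
name is hypothetical, standing in for vcn_v4_0_5_stop(), vcn_v4_0_5_start(),
amdgpu_ring_test_helper() and amdgpu_get_soft_full_reset_mask().

```c
#include <errno.h>

/* Placeholder bit value; the real definition lives in the amdgpu headers. */
#define AMDGPU_RESET_TYPE_PER_QUEUE (1u << 2)
#define SKETCH_MAX_VCN_INST 4 /* hypothetical bound for this mock */

struct amdgpu_device;

/* Cut-down stand-in: only the fields referenced in the thread. */
struct amdgpu_ring {
	struct amdgpu_device *adev;
	unsigned int me; /* index of the VCN instance owning this ring */
};

struct amdgpu_vcn_inst {
	struct amdgpu_ring ring_enc[1];
	int restarts; /* mock-only counter of stop/start cycles */
};

struct amdgpu_device {
	struct {
		unsigned int supported_reset;
		unsigned int num_vcn_inst;
		struct amdgpu_vcn_inst inst[SKETCH_MAX_VCN_INST];
	} vcn;
};

/* Mock for vcn_v4_0_5_stop(); does nothing here. */
static int mock_vcn_stop(struct amdgpu_vcn_inst *vinst)
{
	(void)vinst;
	return 0;
}

/* Mock for vcn_v4_0_5_start(); records that a restart happened. */
static int mock_vcn_start(struct amdgpu_vcn_inst *vinst)
{
	vinst->restarts++;
	return 0;
}

/* Mock for amdgpu_ring_test_helper(); always reports success. */
static int mock_ring_test(struct amdgpu_ring *ring)
{
	(void)ring;
	return 0;
}

/* sw_init side: start from a base mask, then advertise per-queue reset. */
static void sketch_sw_init(struct amdgpu_device *adev, unsigned int base_mask)
{
	adev->vcn.supported_reset = base_mask;
	adev->vcn.supported_reset |= AMDGPU_RESET_TYPE_PER_QUEUE;
}

/* Reset callback without the loop: the instance comes from ring->me. */
static int sketch_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
{
	struct amdgpu_device *adev = ring->adev;
	struct amdgpu_vcn_inst *vinst = &adev->vcn.inst[ring->me];

	(void)vmid; /* unused in this sketch */

	if (!(adev->vcn.supported_reset & AMDGPU_RESET_TYPE_PER_QUEUE))
		return -EOPNOTSUPP;

	mock_vcn_stop(vinst);
	mock_vcn_start(vinst);

	return mock_ring_test(ring);
}
```

In this mock only the instance that owns the ring is restarted, and the
callback returns -EOPNOTSUPP until the init path has set the per-queue bit.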