[PATCH 2/7] drm/amdgpu: Add ring reset callback for JPEG4_0_3

Fri Jan 31 21:12:47 UTC 2025

Hi Alex, I missed to see your other comment below, answered it inline.

On 1/31/2025 11:11 PM, Sundararaju, Sathishkumar wrote:
> Hi Alex,
>
>
> On 1/31/2025 10:42 PM, Alex Deucher wrote:
>> On Fri, Jan 31, 2025 at 11:32 AM Sathishkumar S
>> <sathishkumar.sundararaju at amd.com> wrote:
>>> Add ring reset function callback for JPEG4_0_3 to
>>> recover from job timeouts without a full gpu reset.
>>>
>>> Signed-off-by: Sathishkumar S <sathishkumar.sundararaju at amd.com>
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c | 60 
>>> ++++++++++++++++++++++--
>>>   1 file changed, 57 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c 
>>> b/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c
>>> index be0b3b4c8690..62d8628dccc5 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c
>>> @@ -204,9 +204,7 @@ static int jpeg_v4_0_3_sw_init(struct 
>>> amdgpu_ip_block *ip_block)
>>>          if (r)
>>>                  return r;
>>>
>>> -       /* TODO: Add queue reset mask when FW fully supports it */
>>> -       adev->jpeg.supported_reset =
>>> - amdgpu_get_soft_full_reset_mask(&adev->jpeg.inst[0].ring_dec[0]);
>>> +       adev->jpeg.supported_reset = AMDGPU_RESET_TYPE_PER_PIPE;
>>>          r = amdgpu_jpeg_sysfs_reset_mask_init(adev);
>>>          if (r)
>>>                  return r;
>>> @@ -231,6 +229,7 @@ static int jpeg_v4_0_3_sw_fini(struct 
>>> amdgpu_ip_block *ip_block)
>>>                  return r;
>>>
>>>          amdgpu_jpeg_sysfs_reset_mask_fini(adev);
>>> +
>>>          r = amdgpu_jpeg_sw_fini(adev);
>>>
>>>          return r;
>>> @@ -1099,6 +1098,60 @@ static int 
>>> jpeg_v4_0_3_process_interrupt(struct amdgpu_device *adev,
>>>          return 0;
>>>   }
>>>
>>> +static int jpeg_v4_0_3_wait_for_idle_on_inst(struct amdgpu_ring *ring)
>>> +{
>>> +       struct amdgpu_device *adev = ring->adev;
>>> +       struct amdgpu_ring *r;
>>> +       int ret, j;
>>> +
>>> +       for (j = 0; j < adev->jpeg.num_jpeg_rings; ++j) {
>>> +               r = &adev->jpeg.inst[ring->me].ring_dec[j];
>>> +               r->sched.ready = false;
>> I think you want to call drm_sched_wqueue_stop(&r->sched); here
>> instead to stop further submissions to the other rings.  Note that
>> drm_sched_wqueue_stop() is already called from amdgpu_job_timedout()
>> for the queue that kicked this off.  You'll need to start them again
>> after the reset.
> Thank you Alex for the suggestion, I will use this to stop the 
> submissions and start them after reset. This will help save other good 
> jpeg contexts.
>
> Regards,
> Sathish
>>
>>
>>> +       }
>>> +       /* publish update */
>>> +       smp_rmb();
>>> +       for (j = 0; j < adev->jpeg.num_jpeg_rings; ++j) {
>>> +               r = &adev->jpeg.inst[ring->me].ring_dec[j];
>>> +               if (r->pipe == j)
>>> +                       continue;
>>> +               ret = SOC15_WAIT_ON_RREG_OFFSET(JPEG, GET_INST(JPEG, 
>>> ring->me),
>>> + regUVD_JRBC0_UVD_JRBC_STATUS,
>>> + jpeg_v4_0_3_core_reg_offset(j),
>>> + UVD_JRBC0_UVD_JRBC_STATUS__RB_JOB_DONE_MASK,
>>> + UVD_JRBC0_UVD_JRBC_STATUS__RB_JOB_DONE_MASK);
>>> +               if (ret)
>>> +                       return ret;
>>> +       }
>>> +
>>> +       return 0;
>>> +}
>>> +
>>> +static int jpeg_v4_0_3_ring_reset(struct amdgpu_ring *ring, 
>>> unsigned int vmid)
>>> +{
>>> +       struct amdgpu_device *adev = ring->adev;
>>> +       struct amdgpu_ring *r;
>>> +       int ret, j;
>>> +
>>> +       if (amdgpu_sriov_vf(adev))
>>> +               return -EINVAL;
>>> +
>>> +       jpeg_v4_0_3_wait_for_idle_on_inst(ring);
>>> +       jpeg_v4_0_3_stop_inst(ring->adev, ring->me);
>>> +       jpeg_v4_0_3_start_inst(ring->adev, ring->me);
>> Is JPEG pipelined or can each engine only process one packet from one
>> queue at a time?  If it's the latter, then when you reset the engine,
>> you'll need to save the current rptrs and wptrs from all of the queues
>> on that engine before you reset it so that you can restore those after
>> you reset and kick them off again where they left off.  That way we
>> don't lose any jobs running on other queues.  SDMA has similar
>> limitations for chips where we can only reset the entire engine.
I fall short to understand your question fully. On every instance, each 
core consumes from its corresponding ring.
And , since I first stop scheduling and wait for idle on the instance on 
all unaffected cores which essentially checks
for rptr=wptr confirming no outstanding jobs before the reset begins, so 
didn't think of save/store, in-fact I realize the
possibility of save/restore after you mention it, thanks for pointing it 
out. I tried per-core reset possibility first, wasn't
able to get it to work, and then resorted to this approach, I am also 
rechecking the per core reset sequence to see if I
can get that to work.

Regards,
Sathish
>>
>> Alex
>>
>>> +       for (j = 0; j < adev->jpeg.num_jpeg_rings; ++j) {
>>> +               r = &adev->jpeg.inst[ring->me].ring_dec[j];
>>> +               jpeg_v4_0_3_start_jrbc(r);
>>> +               ret = amdgpu_ring_test_helper(r);
>>> +               if (ret)
>>> +                       return ret;
>>> +               r->sched.ready = true;
>>> +       }
>>> +       /* publish update */
>>> +       smp_rmb();
>>> +       dev_info(adev->dev, "Reset on %s succeeded\n", 
>>> ring->sched.name);
>>> +       return 0;
>>> +}
>>> +
>>>   static const struct amd_ip_funcs jpeg_v4_0_3_ip_funcs = {
>>>          .name = "jpeg_v4_0_3",
>>>          .early_init = jpeg_v4_0_3_early_init,
>>> @@ -1145,6 +1198,7 @@ static const struct amdgpu_ring_funcs 
>>> jpeg_v4_0_3_dec_ring_vm_funcs = {
>>>          .emit_wreg = jpeg_v4_0_3_dec_ring_emit_wreg,
>>>          .emit_reg_wait = jpeg_v4_0_3_dec_ring_emit_reg_wait,
>>>          .emit_reg_write_reg_wait = 
>>> amdgpu_ring_emit_reg_write_reg_wait_helper,
>>> +       .reset = jpeg_v4_0_3_ring_reset,
>>>   };
>>>
>>>   static void jpeg_v4_0_3_set_dec_ring_funcs(struct amdgpu_device 
>>> *adev)
>>> -- 
>>> 2.25.1
>>>
>