[PATCH v2 2/7] drm/amdgpu: Add ring reset callback for JPEG4_0_3

Sundararaju, Sathishkumar sasundar at amd.com
Tue Feb 11 04:18:26 UTC 2025


Hi Christian,


On 2/11/2025 12:54 AM, Christian König wrote:
> On 10.02.25 at 15:25, Sathishkumar S wrote:
>> Add ring reset function callback for JPEG4_0_3 to
>> recover from job timeouts without a full gpu reset.
>>
>> V2:
>>   - sched->ready flag shouldn't be modified by back-ends (Christian)
>>   - use drm_sched_wqueue_stop()/drm_sched_wqueue_start() instead (Alex)
>>
>> Signed-off-by: Sathishkumar S <sathishkumar.sundararaju at amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c | 69 +++++++++++++++++++++---
>>   1 file changed, 63 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c
>> index be0b3b4c8690..b6168f27addd 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c
>> @@ -204,14 +204,10 @@ static int jpeg_v4_0_3_sw_init(struct amdgpu_ip_block *ip_block)
>>      if (r)
>>          return r;
>>  
>> -    /* TODO: Add queue reset mask when FW fully supports it */
>> -    adev->jpeg.supported_reset =
>> -        amdgpu_get_soft_full_reset_mask(&adev->jpeg.inst[0].ring_dec[0]);
>> +    adev->jpeg.supported_reset = AMDGPU_RESET_TYPE_PER_QUEUE;
>>      r = amdgpu_jpeg_sysfs_reset_mask_init(adev);
>> -    if (r)
>> -        return r;
>> -
>> -    return 0;
>> +    return r;
>>  }
>>  
>>  /**
>> @@ -231,6 +227,7 @@ static int jpeg_v4_0_3_sw_fini(struct amdgpu_ip_block *ip_block)
>>          return r;
>>  
>>      amdgpu_jpeg_sysfs_reset_mask_fini(adev);
>> +
>>      r = amdgpu_jpeg_sw_fini(adev);
>>  
>>      return r;
>> @@ -1099,6 +1096,65 @@ static int jpeg_v4_0_3_process_interrupt(struct amdgpu_device *adev,
>>      return 0;
>>  }
>> +
>> +static int jpeg_v4_0_3_wait_for_idle_on_inst(struct amdgpu_ring *ring)
>> +{
>> +    struct amdgpu_device *adev = ring->adev;
>> +    struct amdgpu_ring *r;
>> +    int ret, j;
>> +
>> +    for (j = 0; j < adev->jpeg.num_jpeg_rings; ++j) {
>> +        r = &adev->jpeg.inst[ring->me].ring_dec[j];
>> +
>> +        if (ring->pipe == r->pipe)
>> +            continue;
>> +
>> +        /* pause work submission on this core */
>> +        drm_sched_wqueue_stop(&r->sched);
>
> Again complete NAK for that!
>
> A HW backend should *never* mess with the scheduler front end whatsoever.
>
> What exactly is the requirement here? Do we need to make sure that the 
> JPEG engine isn't touched by the scheduler at all?
Yes, that's the requirement: stop submitting jobs on any ring belonging 
to this instance. Each instance in JPEG4_0_3 has 8 cores, one of which 
is hung at this point, so the complete instance must be idle before the 
reset sequence.

Also, per-core reset isn't able to recover from the hang I generate with 
a wrong command in the IB (skipping sequence), or from a few other known 
ways of simulating hangs on JPEG. The only reliable method I see so far 
is to reinitialize the whole JPEG instance (pgcg gate/ungate and reinit, 
which recovers from any kind of hang), but only after the idle check on 
this instance to make sure no active contexts are decoding on any of the 
rings/cores of this instance; meanwhile contexts can keep decoding on 
other instances and resume on this one when it is back online. I can 
first try a per-core reset and, upon its failure, fall back to the 
bigger per-instance reset, but stopping the jobs on all rings/cores of 
this instance is required for that; without it, the rest of the good 
JPEG contexts on the 7 other cores will fail to function as expected.
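
To make the intended escalation concrete, here is a rough standalone sketch of the flow I mean (stubbed types and helper names are illustrative only, not amdgpu code; the per-core reset helper in particular is hypothetical):

```c
#include <assert.h>

#define JPEG_NUM_CORES 8

/* Stub per-instance state; all names here are illustrative. */
struct jpeg_inst_state {
	int core_hung[JPEG_NUM_CORES];
	int needs_full_reinit;	/* hang type a per-core reset cannot clear */
};

/* Hypothetical lighter per-core reset; fails for "hard" hangs such as
 * the bad-IB case described above. */
static int jpeg_core_reset(struct jpeg_inst_state *inst, int core)
{
	if (inst->needs_full_reinit)
		return -1;
	inst->core_hung[core] = 0;
	return 0;
}

/* Full instance reinit (stands in for stop_inst + start_inst + per-ring
 * restart in the patch): clears every core on the instance. */
static int jpeg_inst_reinit(struct jpeg_inst_state *inst)
{
	for (int j = 0; j < JPEG_NUM_CORES; ++j)
		inst->core_hung[j] = 0;
	inst->needs_full_reinit = 0;
	return 0;
}

/* Proposed escalation: after the sibling cores are idle and their job
 * submission is paused, try the per-core reset first and fall back to
 * reinitializing the whole instance only when that fails. */
static int jpeg_ring_reset_with_fallback(struct jpeg_inst_state *inst,
					 int hung_core)
{
	/* (idle-wait on the seven sibling cores would happen here) */
	if (jpeg_core_reset(inst, hung_core) == 0)
		return 0;
	return jpeg_inst_reinit(inst);
}
```

The point of the two-step shape is that the cheap reset is attempted unconditionally, and the instance-wide reinit only runs for the hang classes where the per-core path is known to be ineffective.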

This requirement is specific to JPEG4_0_3 and a few versions where each 
instance has multiple JPEG cores that can simultaneously run decode jobs 
from a single MJPEG context or from multiple individual JPEG contexts, 
both of which can be referred to interchangeably in the case of JPEG.

Please suggest, thank you.

Regards,
Sathish
>
> Regards,
> Christian.
>
>> +    }
>> +    for (j = 0; j < adev->jpeg.num_jpeg_rings; ++j) {
>> +        r = &adev->jpeg.inst[ring->me].ring_dec[j];
>> +        if (ring->pipe == r->pipe)
>> +            continue;
>> +        /* wait for idle on all cores except on the hung core */
>> +        ret = SOC15_WAIT_ON_RREG_OFFSET(JPEG, GET_INST(JPEG, ring->me),
>> +                        regUVD_JRBC0_UVD_JRBC_STATUS,
>> +                        jpeg_v4_0_3_core_reg_offset(j),
>> +                        UVD_JRBC0_UVD_JRBC_STATUS__RB_JOB_DONE_MASK,
>> +                        UVD_JRBC0_UVD_JRBC_STATUS__RB_JOB_DONE_MASK);
>> +        if (ret)
>> +            return ret;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +static int jpeg_v4_0_3_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
>> +{
>> +    struct amdgpu_device *adev = ring->adev;
>> +    struct amdgpu_ring *r;
>> +    int ret, j;
>> +
>> +    if (amdgpu_sriov_vf(adev))
>> +        return -EINVAL;
>> +
>> +    ret = jpeg_v4_0_3_wait_for_idle_on_inst(ring);
>> +    if (ret)
>> +        return ret;
>> +
>> +    jpeg_v4_0_3_stop_inst(ring->adev, ring->me);
>> +    jpeg_v4_0_3_start_inst(ring->adev, ring->me);
>> +    for (j = 0; j < adev->jpeg.num_jpeg_rings; ++j) {
>> +        r = &adev->jpeg.inst[ring->me].ring_dec[j];
>> +        jpeg_v4_0_3_start_jrbc(r);
>> +        ret = amdgpu_ring_test_helper(r);
>> +        if (ret)
>> +            return ret;
>> +        else if (ring->pipe != r->pipe)
>> +            drm_sched_wqueue_start(&r->sched);
>> +    }
>> +    return 0;
>> +}
>> +
>>   static const struct amd_ip_funcs jpeg_v4_0_3_ip_funcs = {
>>       .name = "jpeg_v4_0_3",
>>       .early_init = jpeg_v4_0_3_early_init,
>> @@ -1145,6 +1201,7 @@ static const struct amdgpu_ring_funcs jpeg_v4_0_3_dec_ring_vm_funcs = {
>>       .emit_wreg = jpeg_v4_0_3_dec_ring_emit_wreg,
>>       .emit_reg_wait = jpeg_v4_0_3_dec_ring_emit_reg_wait,
>>       .emit_reg_write_reg_wait = amdgpu_ring_emit_reg_write_reg_wait_helper,
>> +    .reset = jpeg_v4_0_3_ring_reset,
>>   };
>>  
>>  static void jpeg_v4_0_3_set_dec_ring_funcs(struct amdgpu_device *adev)
>


