[PATCH V2 2/2] drm/amdgpu: read back register after written for VCN v4.0.0 and v5.0.0

Tue May 13 21:13:52 UTC 2025

On 2025-05-13 17:01, Alex Deucher wrote:
> On Tue, May 13, 2025 at 4:33 PM David Wu <davidwu2 at amd.com> wrote:
>>
>> On 2025-05-13 15:29, Alex Deucher wrote:
>>> On Tue, May 13, 2025 at 3:01 PM David Wu <davidwu2 at amd.com> wrote:
>>>> On 2025-05-13 14:40, Alex Deucher wrote:
>>>>
>>>> On Tue, May 13, 2025 at 2:23 PM David (Ming Qiang) Wu <David.Wu3 at amd.com> wrote:
>>>>
>>>> V2: not to add extra read-back in vcn_v4_0_start and vcn_v5_0_0_start as
>>>>       there are read-back calls already. New comments for better understanding.
>>>>
>>>> Similar to the previous changes made for VCN v4.0.5, the addition of
>>>> register read-back support in VCN v4.0.0 and v5.0.0 is intended to
>>>> prevent potential race conditions, even though such issues have not
>>>> been observed yet. This change ensures consistency across different
>>>> VCN variants and helps avoid similar issues on newer or closely
>>>> related GPUs. The overhead introduced by this read-back is negligible.
>>>>
>>>> Signed-off-by: David (Ming Qiang) Wu <David.Wu3 at amd.com>
>>>> Reviewed-by: Mario Limonciello <mario.limonciello at amd.com>
>>>>
>>>> Maybe split this into two patches, one for vcn 4 and one for vcn 5.
>>>> That will make it easier to backport to stable.  What about other
>>>> VCNs?
>>>>
>>>> will split.
>>>>
>>>> This applies to those VCNs where regVCN_RB1_DB_CRTL is used for setting doorbell index, which
>>>>
>>>> means VCN 4 and up - all of them are covered (similar code is already there for those not in this patch).
>>> Sure that prevents the doorbell from getting missed, but what about
>>> other registers setup in the VCN start() functions?  What if some of
>>> those are still pending when the doorbell is rung for other VCNs?
>> I think adding a read-back is needed if there is any concern about race
>> condition.
>> If we only concern about start() it should be easy to add. The question
>> is how we will know there is
>> a race condition. Adding read back everywhere when missing after write
>> is not a solution I think.
>> For any VCN functions we need to check carefully
>> (eg. vcn_v4_0_unified_ring_set_wptr()) in case it adds too much
>> overhead and actually not needed (at least haven't seen the issue).
>> Any suggestion as to where we should add or at the moment for _start()?
>> I can work on it for sure or leave it for
>> future improvement.
> I think _start() makes the most sense because it will only be called
> when we power on the VCN instance.  As long as work keeps coming in,
> it will stay on.  The race is theoretically possible on any VCN
> instance.  E.g., when the first VCN job comes in, VCN gets powered on,
> and then we call _start() to program what we need.  After that, we
> ring the doorbell to kick off the job.  The programming sequence in
> _start() could still be in flight on the PCIe bus when the doorbell
> gets rung.  Which is apparently exactly what happens when we hit this
> issue on VCN 4.x and 5.x.  On VCN the doorbell gets ignored because it
> races with the doorbell id register, but on other VCNs, the doorbell
> getting missed may not happen, but it could be something else that
> races, e.g., ring size.
well - right. For production we enable DPG which is now protected.
the other register write at the end of _start() is for non-DPG case.
I can add read-back there just in case.
I guess for other issues/potential race conditions we need to investigate
case by case. Sounds OK?

David

>
> Alex
>
>> David
>>
>>> Alex
>>>
>>>> David
>>>>
>>>> Alex
>>>>
>>>> ---
>>>>    drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c   | 4 ++++
>>>>    drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c | 4 ++++
>>>>    2 files changed, 8 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c
>>>> index 8fff470bce873..070a2a8cdf6f4 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c
>>>> @@ -1122,6 +1122,10 @@ static int vcn_v4_0_start_dpg_mode(struct amdgpu_vcn_inst *vinst, bool indirect)
>>>>                           ring->doorbell_index << VCN_RB1_DB_CTRL__OFFSET__SHIFT |
>>>>                           VCN_RB1_DB_CTRL__EN_MASK);
>>>>
>>>> +       /* Keeping one read-back to ensure all register writes are done, otherwise
>>>> +        * it may introduce race conditions */
>>>> +       RREG32_SOC15(VCN, inst_idx, regVCN_RB1_DB_CTRL);
>>>> +
>>>>           return 0;
>>>>    }
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c b/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c
>>>> index 27dcc6f37a730..77c27a317e4c8 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c
>>>> @@ -794,6 +794,10 @@ static int vcn_v5_0_0_start_dpg_mode(struct amdgpu_vcn_inst *vinst,
>>>>                   ring->doorbell_index << VCN_RB1_DB_CTRL__OFFSET__SHIFT |
>>>>                   VCN_RB1_DB_CTRL__EN_MASK);
>>>>
>>>> +       /* Keeping one read-back to ensure all register writes are done, otherwise
>>>> +        * it may introduce race conditions */
>>>> +       RREG32_SOC15(VCN, inst_idx, regVCN_RB1_DB_CTRL);
>>>> +
>>>>           return 0;
>>>>    }
>>>>
>>>> --
>>>> 2.34.1
>>>>