[PATCH] drm/amdgpu/vcn: fix video profile race condition (v3)

Sundararaju, Sathishkumar sathishkumar.sundararaju at amd.com
Thu Aug 14 09:11:11 UTC 2025


Hi Lijo,

On 8/14/2025 2:11 PM, Lazar, Lijo wrote:
> [Public]
>
> We already have a per-instance power state that can be tracked. What about something like the attached?

This also has concurrent access to the power state: vcn.inst[i].cur_state is not
protected by workload_profile_mutex everywhere, so it can still change while you
are holding workload_profile_mutex and checking it.
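
For illustration only, a rough sketch (not the attached patch, and assuming
set_pg_state() is what updates cur_state): for the idle worker's check of
vcn.inst[i].cur_state to be safe, the write side in begin_use() would also have
to publish the state under the same mutex, something like:

    mutex_lock(&adev->vcn.workload_profile_mutex);
    if (!adev->vcn.workload_profile_active) {
            r = amdgpu_dpm_switch_power_profile(adev, PP_SMC_POWER_PROFILE_VIDEO, true);
            if (!r)
                    adev->vcn.workload_profile_active = true;
    }
    /* power state published under the same mutex the idle worker
     * takes for its all-instances cur_state check */
    vcn_inst->set_pg_state(vcn_inst, AMD_PG_STATE_UNGATE);
    mutex_unlock(&adev->vcn.workload_profile_mutex);

Without that, the idle worker can read a stale cur_state for another instance.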

Regards,

Sathish

>
> Thanks,
> Lijo
> -----Original Message-----
> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of Sundararaju, Sathishkumar
> Sent: Thursday, August 14, 2025 4:43 AM
> To: Alex Deucher <alexdeucher at gmail.com>
> Cc: Wu, David <David.Wu3 at amd.com>; Deucher, Alexander <Alexander.Deucher at amd.com>; amd-gfx at lists.freedesktop.org
> Subject: Re: [PATCH] drm/amdgpu/vcn: fix video profile race condition (v3)
>
>
> On 8/14/2025 3:38 AM, Alex Deucher wrote:
>> On Wed, Aug 13, 2025 at 5:16 PM Sundararaju, Sathishkumar
>> <sathishkumar.sundararaju at amd.com> wrote:
>>> On 8/14/2025 2:33 AM, Alex Deucher wrote:
>>>> On Wed, Aug 13, 2025 at 4:58 PM Sundararaju, Sathishkumar
>>>> <sathishkumar.sundararaju at amd.com> wrote:
>>>>> On 8/14/2025 1:35 AM, Alex Deucher wrote:
>>>>>> On Wed, Aug 13, 2025 at 2:23 PM Sundararaju, Sathishkumar
>>>>>> <sathishkumar.sundararaju at amd.com> wrote:
>>>>>>> Hi Alex, Hi David,
>>>>>>>
>>>>>>> I see David's concern, but his suggestion still won't solve the
>>>>>>> problem, and neither does the current form. Reason:
>>>>>>>
>>>>>>> The emitted fence count and total submission count are fast
>>>>>>> transients which frequently become 0 in between video decodes
>>>>>>> (between jobs), so even with the atomics and locks there can be a
>>>>>>> switch of the video power profile. In the current form of the patch that
>>>>>>> window is minimized, but it can still happen under stress testing. But
>>>>>>> the power state of any instance becoming zero (gated)
>>>>>> Can you explain how this can happen?  I'm not seeing it.
>>>>> Consider this situation: inst0 and inst1 are actively decoding, inst0's
>>>>> decode completes, and its delayed idle work starts.
>>>>> The inst0 idle handler can then read 0 total fences and 0 total submission
>>>>> count even though inst1 is actively decoding, because the read lands between jobs:
>>>>>      - begin_use increments vcn.total_submission_cnt and end_use
>>>>> decrements it, so between jobs it can be 0.
>>>>>      - if the outstanding fences have signalled and no new fence has been
>>>>> emitted yet, the fence count can also be 0 between jobs.
>>>>>      - neither of the above conditions means video decode is
>>>>> complete on inst1; it is still actively decoding (see the timeline below).
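
A rough timeline of the interleaving being described here (illustrative only):

    inst1: end_use()          -> total_submission_cnt drops back to 0
    inst1: job fence signals  -> amdgpu_fence_count_emitted() == 0 for inst1
    inst0: idle work runs     -> reads 0 total fences and 0 total submissions,
                                 switches the video power profile off
    inst1: next begin_use()   -> decode continues with the profile dropped
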
>>>> How can there be active decoding without an outstanding fence?  In
>>>> that case, total_fences (fences from both instances) would be non-0.
>>> I mean the job scheduled on inst1 is already complete, so there are 0
>>> outstanding fences, and the next job is yet to be scheduled
>>> and committed to the ring (inst1); this doesn't mean decode has stopped
>>> on inst1, right? (I am saying the timing of inst0's idle work coincides
>>> with this.)
>>>
>>> Or am I wrong in assuming this? Can't this ever happen? Please
>>> correct my understanding here.
>> The flow looks like:
>>
>> begin_use(inst)
>> emit_fence(inst)
>> end_use(inst)
>>
>> ...later
>> fence signals
>> ...later
>> work handler
>>
>> In begin_use we increment the global and per-instance submission counts.
>> This protects the power gating and profile until end_use.  In end_use
>> we decrement the submission counts because we don't need to protect anything
>> any more, as we have the fence that was submitted via the ring.  That
>> fence won't signal until the job is complete.
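
Restating that flow with just the accounting (a condensed sketch using the
names from the patch below; the fence itself is emitted as part of the job
submission in between):

    amdgpu_vcn_ring_begin_use(ring);
            /* atomic_inc() of vcn_inst->total_submission_cnt and, with the
             * patch, adev->vcn.total_submission_cnt; cancels pending idle
             * work, enables the video profile and ungates the instance */

    /* the job and its fence are emitted on the ring here, so
     * amdgpu_fence_count_emitted() for this ring becomes non-zero */

    amdgpu_vcn_ring_end_use(ring);
            /* atomic_dec() of both counters and schedule_delayed_work() of
             * the idle handler; from here until the fence signals, only the
             * emitted fence keeps the profile and power state protected */
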
> Is the next begin_use always guaranteed to run before the current job's fence signals?
>
> If not, then both the total submission count and the total fence count are zero; for example, with delayed job/packet submissions
> from user space, or when the next job is scheduled only after the current job's fence signals.
>
> If this is never possible, then (v3) is perfect.
>
> Regards,
>
> Sathish
>
>> For power gating, we
>> only care about the submission count and fences for that instance; for
>> the profile, we care about the submission count and fences across all instances.
>> Once the fences have signalled, the outstanding fences will be 0 and
>> there won't be any active work.
>>
>> Alex
>>
>>> Regards,
>>>
>>> Sathish
>>>
>>>> Alex
>>>>
>>>>> Whereas if an instance is powered off, we are sure its idle time has
>>>>> passed and it is gated, with no possibility of active video decode. When
>>>>> all instances are off we can safely assume there is no active decode, and a
>>>>> global lock protects that against a new begin_use on any instance. The
>>>>> only distant concern is a global common lock w.r.t. perf, but we
>>>>> already have a global workload-profile mutex, so there shouldn't
>>>>> be any drop in perf with just one single global lock for all
>>>>> instances.
>>>>>
>>>>> I am sending out a patch with this fix and will leave it to you to
>>>>> decide the right method. If you think the total outstanding fences can
>>>>> never be 0 during decode, then your previous version (v3) itself is
>>>>> good; there is no real benefit in splitting the handlers as such.
>>>>>
>>>>> Regards,
>>>>> Sathish
>>>>>> If it is possible, maybe it would be easier to just split the
>>>>>> profile and powergating into separate handlers.  The profile one
>>>>>> would be global and the powergating one would be per instance.
>>>>>> See the attached patches.
>>>>>>
>>>>>> Alex
>>>>>>
>>>>>>> can be a sure-shot indication of a break in video decode on that
>>>>>>> instance. The mistake in my patch was using a per-instance mutex; I should have
>>>>>>> used a common global mutex, which would cover the situation David is trying to bring out.
>>>>>>>
>>>>>>> Using one global vcn.pg_lock for idle and begin_use, and using
>>>>>>> flags to track the power state, could help us totally avoid this situation, roughly as sketched below.
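
Roughly what that would look like (hypothetical sketch only; a global
adev->vcn.pg_lock does not exist today and would replace the per-instance
vcn_pg_lock for this purpose, with the per-instance power state acting as the
flag), on the idle-work side:

    mutex_lock(&adev->vcn.pg_lock);        /* same lock taken by begin_use() */
    vcn_inst->set_pg_state(vcn_inst, AMD_PG_STATE_GATE);
    all_gated = true;
    for (i = 0; i < adev->vcn.num_vcn_inst; ++i)
            if (adev->vcn.inst[i].cur_state != AMD_PG_STATE_GATE)
                    all_gated = false;
    if (all_gated && adev->vcn.workload_profile_active) {
            /* no instance can ungate until we drop the lock, so this switch
             * cannot race with a new begin_use() on any instance */
            if (!amdgpu_dpm_switch_power_profile(adev, PP_SMC_POWER_PROFILE_VIDEO, false))
                    adev->vcn.workload_profile_active = false;
    }
    mutex_unlock(&adev->vcn.pg_lock);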
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Sathish
>>>>>>>
>>>>>>> On 8/13/2025 11:46 PM, Wu, David wrote:
>>>>>>>> On 8/13/2025 12:51 PM, Alex Deucher wrote:
>>>>>>>>> On Wed, Aug 13, 2025 at 12:39 PM Wu, David <davidwu2 at amd.com> wrote:
>>>>>>>>>> Hi Alex,
>>>>>>>>>>
>>>>>>>>>> The addition of total_submission_cnt should work, in that it
>>>>>>>>>> is unlikely to have a context switch right after begin_use().
>>>>>>>>>> The suggestion of moving it inside the lock (which I prefer, in
>>>>>>>>>> case someone adds more code before the lock and it is not reviewed
>>>>>>>>>> thoroughly)
>>>>>>>>>>        - up to you to decide.
>>>>>>>>>>
>>>>>>>>>> Reviewed-by: David (Ming Qiang) Wu <David.Wu3 at amd.com>
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> David
>>>>>>>>>> On 8/13/2025 9:45 AM, Alex Deucher wrote:
>>>>>>>>>>> If there are multiple instances of the VCN running, we may
>>>>>>>>>>> end up switching the video profile while another instance is
>>>>>>>>>>> active because we only take into account the current
>>>>>>>>>>> instance's submissions.  Look at all outstanding fences for
>>>>>>>>>>> the video profile.
>>>>>>>>>>>
>>>>>>>>>>> v2: drop early exit in begin_use()
>>>>>>>>>>> v3: handle possible race between begin_use() and the work handler
>>>>>>>>>>>
>>>>>>>>>>> Fixes: 3b669df92c85 ("drm/amdgpu/vcn: adjust workload profile handling")
>>>>>>>>>>> Reviewed-by: Sathishkumar S <sathishkumar.sundararaju at amd.com> (v1)
>>>>>>>>>>> Signed-off-by: Alex Deucher <alexander.deucher at amd.com>
>>>>>>>>>>> ---
>>>>>>>>>>>        drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c | 40 ++++++++++++-------------
>>>>>>>>>>>        drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h |  1 +
>>>>>>>>>>>        2 files changed, 21 insertions(+), 20 deletions(-)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
>>>>>>>>>>> index 9a76e11d1c184..593c1ddf8819b 100644
>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
>>>>>>>>>>> @@ -415,19 +415,25 @@ static void amdgpu_vcn_idle_work_handler(struct work_struct *work)
>>>>>>>>>>>            struct amdgpu_vcn_inst *vcn_inst =
>>>>>>>>>>>                    container_of(work, struct amdgpu_vcn_inst, idle_work.work);
>>>>>>>>>>>            struct amdgpu_device *adev = vcn_inst->adev;
>>>>>>>>>>> -     unsigned int fences = 0, fence[AMDGPU_MAX_VCN_INSTANCES] = {0};
>>>>>>>>>>> -     unsigned int i = vcn_inst->inst, j;
>>>>>>>>>>> +     unsigned int total_fences = 0, fence[AMDGPU_MAX_VCN_INSTANCES] = {0};
>>>>>>>>>>> +     unsigned int i, j;
>>>>>>>>>>>            int r = 0;
>>>>>>>>>>>
>>>>>>>>>>> -     if (adev->vcn.harvest_config & (1 << i))
>>>>>>>>>>> +     if (adev->vcn.harvest_config & (1 << vcn_inst->inst))
>>>>>>>>>>>                    return;
>>>>>>>>>>>
>>>>>>>>>>> -     for (j = 0; j < adev->vcn.inst[i].num_enc_rings; ++j)
>>>>>>>>>>> -             fence[i] += amdgpu_fence_count_emitted(&vcn_inst->ring_enc[j]);
>>>>>>>>>>> +     for (i = 0; i < adev->vcn.num_vcn_inst; ++i) {
>>>>>>>>>>> +             struct amdgpu_vcn_inst *v = &adev->vcn.inst[i];
>>>>>>>>>>> +
>>>>>>>>>>> +             for (j = 0; j < v->num_enc_rings; ++j)
>>>>>>>>>>> +                     fence[i] += amdgpu_fence_count_emitted(&v->ring_enc[j]);
>>>>>>>>>>> +             fence[i] += amdgpu_fence_count_emitted(&v->ring_dec);
>>>>>>>>>>> +             total_fences += fence[i];
>>>>>>>>>>> +     }
>>>>>>>>>>>
>>>>>>>>>>>            /* Only set DPG pause for VCN3 or below, VCN4 and above will be handled by FW */
>>>>>>>>>>>            if (adev->pg_flags & AMD_PG_SUPPORT_VCN_DPG &&
>>>>>>>>>>> -         !adev->vcn.inst[i].using_unified_queue) {
>>>>>>>>>>> +         !vcn_inst->using_unified_queue) {
>>>>>>>>>>>                    struct dpg_pause_state new_state;
>>>>>>>>>>>
>>>>>>>>>>>                    if (fence[i] ||
>>>>>>>>>>> @@ -436,18 +442,18 @@ static void amdgpu_vcn_idle_work_handler(struct work_struct *work)
>>>>>>>>>>>                    else
>>>>>>>>>>>                            new_state.fw_based = VCN_DPG_STATE__UNPAUSE;
>>>>>>>>>>>
>>>>>>>>>>> -             adev->vcn.inst[i].pause_dpg_mode(vcn_inst, &new_state);
>>>>>>>>>>> +             vcn_inst->pause_dpg_mode(vcn_inst, &new_state);
>>>>>>>>>>>            }
>>>>>>>>>>>
>>>>>>>>>>> -     fence[i] += amdgpu_fence_count_emitted(&vcn_inst->ring_dec);
>>>>>>>>>>> -     fences += fence[i];
>>>>>>>>>>> -
>>>>>>>>>>> -     if (!fences && !atomic_read(&vcn_inst->total_submission_cnt)) {
>>>>>>>>>>> +     if (!fence[vcn_inst->inst] && !atomic_read(&vcn_inst->total_submission_cnt)) {
>>>>>>>>>>> +             /* This is specific to this instance */
>>>>>>>>>>>                    mutex_lock(&vcn_inst->vcn_pg_lock);
>>>>>>>>>>>                    vcn_inst->set_pg_state(vcn_inst, AMD_PG_STATE_GATE);
>>>>>>>>>>>                    mutex_unlock(&vcn_inst->vcn_pg_lock);
>>>>>>>>>>>                    mutex_lock(&adev->vcn.workload_profile_mutex);
>>>>>>>>>>> -             if (adev->vcn.workload_profile_active) {
>>>>>>>>>>> +             /* This is global and depends on all VCN instances */
>>>>>>>>>>> +             if (adev->vcn.workload_profile_active && !total_fences &&
>>>>>>>>>>> +                 !atomic_read(&adev->vcn.total_submission_cnt)) {
>>>>>>>>>>>                            r = amdgpu_dpm_switch_power_profile(adev, PP_SMC_POWER_PROFILE_VIDEO,
>>>>>>>>>>>                                                                false);
>>>>>>>>>>>                            if (r)
>>>>>>>>>>> @@ -467,16 +473,10 @@ void amdgpu_vcn_ring_begin_use(struct amdgpu_ring *ring)
>>>>>>>>>>>            int r = 0;
>>>>>>>>>>>
>>>>>>>>>>>            atomic_inc(&vcn_inst->total_submission_cnt);
>>>>>>>>>>> +     atomic_inc(&adev->vcn.total_submission_cnt);
>>>>>>>>>> move this addition down inside the mutex lock
>>>>>>>>>>>            cancel_delayed_work_sync(&vcn_inst->idle_work);
>>>>>>>>>>>
>>>>>>>>>>> -     /* We can safely return early here because we've cancelled the
>>>>>>>>>>> -      * the delayed work so there is no one else to set it to false
>>>>>>>>>>> -      * and we don't care if someone else sets it to true.
>>>>>>>>>>> -      */
>>>>>>>>>>> -     if (adev->vcn.workload_profile_active)
>>>>>>>>>>> -             goto pg_lock;
>>>>>>>>>>> -
>>>>>>>>>>>            mutex_lock(&adev->vcn.workload_profile_mutex);
>>>>>>>>>> move to here:
>>>>>>>>>> atomic_inc(&adev->vcn.total_submission_cnt);
>>>>>>>>>> I think this should work for multiple instances.
>>>>>>>>> Why does this need to be protected by the mutex?
>>>>>>>> hmm.. OK - no need and it is actually better before the mutex.
>>>>>>>> David
>>>>>>>>> Alex
>>>>>>>>>
>>>>>>>>>> David
>>>>>>>>>>>            if (!adev->vcn.workload_profile_active) {
>>>>>>>>>>>                    r = amdgpu_dpm_switch_power_profile(adev, PP_SMC_POWER_PROFILE_VIDEO,
>>>>>>>>>>> @@ -487,7 +487,6 @@ void amdgpu_vcn_ring_begin_use(struct amdgpu_ring *ring)
>>>>>>>>>>>            }
>>>>>>>>>>>            mutex_unlock(&adev->vcn.workload_profile_mutex);
>>>>>>>>>>>
>>>>>>>>>>> -pg_lock:
>>>>>>>>>>>            mutex_lock(&vcn_inst->vcn_pg_lock);
>>>>>>>>>>>            vcn_inst->set_pg_state(vcn_inst, AMD_PG_STATE_UNGATE);
>>>>>>>>>>>
>>>>>>>>>>> @@ -528,6 +527,7 @@ void amdgpu_vcn_ring_end_use(struct amdgpu_ring *ring)
>>>>>>>>>>>                    atomic_dec(&ring->adev->vcn.inst[ring->me].dpg_enc_submission_cnt);
>>>>>>>>>>>
>>>>>>>>>>>            atomic_dec(&ring->adev->vcn.inst[ring->me].total_submission_cnt);
>>>>>>>>>>> +     atomic_dec(&ring->adev->vcn.total_submission_cnt);
>>>>>>>>>>>
>>>>>>>>>>>            schedule_delayed_work(&ring->adev->vcn.inst[ring->me].idle_work,
>>>>>>>>>>>                                  VCN_IDLE_TIMEOUT);
>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
>>>>>>>>>>> index b3fb1d0e43fc9..febc3ce8641ff 100644
>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
>>>>>>>>>>> @@ -352,6 +352,7 @@ struct amdgpu_vcn {
>>>>>>>>>>>
>>>>>>>>>>>            uint16_t inst_mask;
>>>>>>>>>>>            uint8_t num_inst_per_aid;
>>>>>>>>>>> +     atomic_t                total_submission_cnt;
>>>>>>>>>>>
>>>>>>>>>>>            /* IP reg dump */
>>>>>>>>>>>            uint32_t                *ip_dump;

