[PATCH] drm/amdgpu/vcn: fix video profile race condition (v3)

Wed Aug 13 23:13:20 UTC 2025

On 8/14/2025 3:38 AM, Alex Deucher wrote:
> On Wed, Aug 13, 2025 at 5:16 PM Sundararaju, Sathishkumar
> <sathishkumar.sundararaju at amd.com> wrote:
>>
>> On 8/14/2025 2:33 AM, Alex Deucher wrote:
>>> On Wed, Aug 13, 2025 at 4:58 PM Sundararaju, Sathishkumar
>>> <sathishkumar.sundararaju at amd.com> wrote:
>>>> On 8/14/2025 1:35 AM, Alex Deucher wrote:
>>>>> On Wed, Aug 13, 2025 at 2:23 PM Sundararaju, Sathishkumar
>>>>> <sathishkumar.sundararaju at amd.com> wrote:
>>>>>> Hi Alex, Hi David,
>>>>>>
>>>>>> I see David's concern but his suggestion yet wont solve the problem,
>>>>>> neither the current form , reason :-
>>>>>>
>>>>>> The emitted fence count and total submission count are fast transients
>>>>>> which frequently become 0 in between video decodes (between jobs) even
>>>>>> with the atomics and locks there can be a switch of video power profile,
>>>>>> in the current form of patch that window is minimized, but still can
>>>>>> happen if stress tested. But power state of any instance becoming zero
>>>>> Can you explain how this can happen?  I'm not seeing it.
>>>> Consider this situation, inst0 and inst1 actively decoding, inst0 decode
>>>> completes, delayed idle work starts.
>>>> inst0 idle handler can read 0 total fences and 0 total submission count,
>>>> even if inst1 is actively decoding,
>>>> that's between the jobs,
>>>>     - as begin_use increaments vcn.total_submission_cnt and end_use
>>>> decreaments vcn.total_submission_cnt that can be 0.
>>>>     - if outstanding fences are cleared and no new emitted fence, between
>>>> jobs , can be 0.
>>>>     - both of the above conditions do not mean video decode is complete on
>>>> inst1, it is actively decoding.
>>> How can there be active decoding without an outstanding fence?  In
>>> that case, total_fences (fences from both instances) would be non-0.
>> I mean on inst1 the job scheduled is already complete, so 0 outstanding
>> fences, newer job is yet to be scheduled
>>
>> and commited to ring (inst1) , this doesn't mean decode has stopped on
>> inst1 right (I am saying if timing of inst0 idle work coincides with this),
>>
>> Or am I wrong in assuming this ? Can't this ever happen ? Please correct
>> my understanding here.
> The flow looks like:
>
> begin_use(inst)
> emit_fence(inst)
> end_use(inst)
>
> ...later
> fence signals
> ...later
> work handler
>
> In begin_use we increment the global and per instance submission.
> This protects the power gating and profile until end_use.  In end use
> we decrement the submissions because we don't need to protect anything
> any more as we have the fence that was submitted via the ring.  That
> fence won't signal until the job is complete.

Is a next begin_use always guaranteed to be run before current job fence 
signals ?

if not then both total submission and total fence are zero , example 
delayed job/packet submissions

from user space, or next job schedule happens after current job fence 
signals.

if this is never possible then (v3) is perfect.

Regards,

Sathish

> For power gating, we
> only care about the submission count and fences for that instance, for
> the profile, we care about submission count and fences all instances.
> Once the fences have signalled, the outstanding fences will be 0 and
> there won't be any active work.
>
> Alex
>
>> Regards,
>>
>> Sathish
>>
>>> Alex
>>>
>>>> Whereas if instances are powered off we are sure idle time is past and
>>>> it is powered off, no possible way of
>>>> active video decode, when all instances are off we can safely assume no
>>>> active decode and global lock protects
>>>> it against new begin_use on any instance. But the only distant concern
>>>> is global common locks w.r.t perf, but we
>>>> are already having a global workprofile mutex , so there shouldn't be
>>>> any drop in perf, with just one single
>>>> global lock for all instances.
>>>>
>>>> Just sending out a patch with this fix, will leave it to you to decide
>>>> the right method. If you think outstanding total fences
>>>> can never be 0 during decode, then your previous version (v3) itself is
>>>> good, there is no real benefit of splitting the handlers as such.
>>>>
>>>> Regards,
>>>> Sathish
>>>>> If it is possible, maybe it would be easier to just split the profile
>>>>> and powergating into separate handlers.  The profile one would be
>>>>> global and the powergating one would be per instance.  See the
>>>>> attached patches.
>>>>>
>>>>> Alex
>>>>>
>>>>>> can be a sure shot indication of break in a video decode, the mistake in
>>>>>> my patch was using per instance mutex, I should have used a common
>>>>>> global mutex, then that covers the situation David is trying to bring out.
>>>>>>
>>>>>> Using one global vcn.pg_lock for idle and begin_use and using flags to
>>>>>> track power state could help us totally avoid this situation.
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Sathish
>>>>>>
>>>>>> On 8/13/2025 11:46 PM, Wu, David wrote:
>>>>>>> On 8/13/2025 12:51 PM, Alex Deucher wrote:
>>>>>>>> On Wed, Aug 13, 2025 at 12:39 PM Wu, David <davidwu2 at amd.com> wrote:
>>>>>>>>> Hi Alex,
>>>>>>>>>
>>>>>>>>> The addition of  total_submission_cnt should work - in that
>>>>>>>>> it is unlikely to have a context switch right after the begin_use().
>>>>>>>>> The suggestion of moving it inside the lock (which I prefer in case
>>>>>>>>> someone
>>>>>>>>> adds more before the lock and not reviewed thoroughly)
>>>>>>>>>       - up to you to decide.
>>>>>>>>>
>>>>>>>>> Reviewed-by: David (Ming Qiang) Wu <David.Wu3 at amd.com>
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> David
>>>>>>>>> On 8/13/2025 9:45 AM, Alex Deucher wrote:
>>>>>>>>>> If there are multiple instances of the VCN running,
>>>>>>>>>> we may end up switching the video profile while another
>>>>>>>>>> instance is active because we only take into account
>>>>>>>>>> the current instance's submissions.  Look at all
>>>>>>>>>> outstanding fences for the video profile.
>>>>>>>>>>
>>>>>>>>>> v2: drop early exit in begin_use()
>>>>>>>>>> v3: handle possible race between begin_use() work handler
>>>>>>>>>>
>>>>>>>>>> Fixes: 3b669df92c85 ("drm/amdgpu/vcn: adjust workload profile
>>>>>>>>>> handling")
>>>>>>>>>> Reviewed-by: Sathishkumar S <sathishkumar.sundararaju at amd.com> (v1)
>>>>>>>>>> Signed-off-by: Alex Deucher <alexander.deucher at amd.com>
>>>>>>>>>> ---
>>>>>>>>>>       drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c | 40
>>>>>>>>>> ++++++++++++-------------
>>>>>>>>>>       drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h |  1 +
>>>>>>>>>>       2 files changed, 21 insertions(+), 20 deletions(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
>>>>>>>>>> index 9a76e11d1c184..593c1ddf8819b 100644
>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
>>>>>>>>>> @@ -415,19 +415,25 @@ static void
>>>>>>>>>> amdgpu_vcn_idle_work_handler(struct work_struct *work)
>>>>>>>>>>           struct amdgpu_vcn_inst *vcn_inst =
>>>>>>>>>>                   container_of(work, struct amdgpu_vcn_inst,
>>>>>>>>>> idle_work.work);
>>>>>>>>>>           struct amdgpu_device *adev = vcn_inst->adev;
>>>>>>>>>> -     unsigned int fences = 0, fence[AMDGPU_MAX_VCN_INSTANCES] = {0};
>>>>>>>>>> -     unsigned int i = vcn_inst->inst, j;
>>>>>>>>>> +     unsigned int total_fences = 0,
>>>>>>>>>> fence[AMDGPU_MAX_VCN_INSTANCES] = {0};
>>>>>>>>>> +     unsigned int i, j;
>>>>>>>>>>           int r = 0;
>>>>>>>>>>
>>>>>>>>>> -     if (adev->vcn.harvest_config & (1 << i))
>>>>>>>>>> +     if (adev->vcn.harvest_config & (1 << vcn_inst->inst))
>>>>>>>>>>                   return;
>>>>>>>>>>
>>>>>>>>>> -     for (j = 0; j < adev->vcn.inst[i].num_enc_rings; ++j)
>>>>>>>>>> -             fence[i] +=
>>>>>>>>>> amdgpu_fence_count_emitted(&vcn_inst->ring_enc[j]);
>>>>>>>>>> +     for (i = 0; i < adev->vcn.num_vcn_inst; ++i) {
>>>>>>>>>> +             struct amdgpu_vcn_inst *v = &adev->vcn.inst[i];
>>>>>>>>>> +
>>>>>>>>>> +             for (j = 0; j < v->num_enc_rings; ++j)
>>>>>>>>>> +                     fence[i] +=
>>>>>>>>>> amdgpu_fence_count_emitted(&v->ring_enc[j]);
>>>>>>>>>> +             fence[i] += amdgpu_fence_count_emitted(&v->ring_dec);
>>>>>>>>>> +             total_fences += fence[i];
>>>>>>>>>> +     }
>>>>>>>>>>
>>>>>>>>>>           /* Only set DPG pause for VCN3 or below, VCN4 and above will
>>>>>>>>>> be handled by FW */
>>>>>>>>>>           if (adev->pg_flags & AMD_PG_SUPPORT_VCN_DPG &&
>>>>>>>>>> -         !adev->vcn.inst[i].using_unified_queue) {
>>>>>>>>>> +         !vcn_inst->using_unified_queue) {
>>>>>>>>>>                   struct dpg_pause_state new_state;
>>>>>>>>>>
>>>>>>>>>>                   if (fence[i] ||
>>>>>>>>>> @@ -436,18 +442,18 @@ static void
>>>>>>>>>> amdgpu_vcn_idle_work_handler(struct work_struct *work)
>>>>>>>>>>                   else
>>>>>>>>>>                           new_state.fw_based = VCN_DPG_STATE__UNPAUSE;
>>>>>>>>>>
>>>>>>>>>> -             adev->vcn.inst[i].pause_dpg_mode(vcn_inst, &new_state);
>>>>>>>>>> +             vcn_inst->pause_dpg_mode(vcn_inst, &new_state);
>>>>>>>>>>           }
>>>>>>>>>>
>>>>>>>>>> -     fence[i] += amdgpu_fence_count_emitted(&vcn_inst->ring_dec);
>>>>>>>>>> -     fences += fence[i];
>>>>>>>>>> -
>>>>>>>>>> -     if (!fences && !atomic_read(&vcn_inst->total_submission_cnt)) {
>>>>>>>>>> +     if (!fence[vcn_inst->inst] &&
>>>>>>>>>> !atomic_read(&vcn_inst->total_submission_cnt)) {
>>>>>>>>>> +             /* This is specific to this instance */
>>>>>>>>>>                   mutex_lock(&vcn_inst->vcn_pg_lock);
>>>>>>>>>>                   vcn_inst->set_pg_state(vcn_inst, AMD_PG_STATE_GATE);
>>>>>>>>>>                   mutex_unlock(&vcn_inst->vcn_pg_lock);
>>>>>>>>>> mutex_lock(&adev->vcn.workload_profile_mutex);
>>>>>>>>>> -             if (adev->vcn.workload_profile_active) {
>>>>>>>>>> +             /* This is global and depends on all VCN instances */
>>>>>>>>>> +             if (adev->vcn.workload_profile_active &&
>>>>>>>>>> !total_fences &&
>>>>>>>>>> + !atomic_read(&adev->vcn.total_submission_cnt)) {
>>>>>>>>>>                           r = amdgpu_dpm_switch_power_profile(adev,
>>>>>>>>>> PP_SMC_POWER_PROFILE_VIDEO,
>>>>>>>>>> false);
>>>>>>>>>>                           if (r)
>>>>>>>>>> @@ -467,16 +473,10 @@ void amdgpu_vcn_ring_begin_use(struct
>>>>>>>>>> amdgpu_ring *ring)
>>>>>>>>>>           int r = 0;
>>>>>>>>>>
>>>>>>>>>>           atomic_inc(&vcn_inst->total_submission_cnt);
>>>>>>>>>> +     atomic_inc(&adev->vcn.total_submission_cnt);
>>>>>>>>> move this addition down inside the mutex lock
>>>>>>>>>> cancel_delayed_work_sync(&vcn_inst->idle_work);
>>>>>>>>>>
>>>>>>>>>> -     /* We can safely return early here because we've cancelled the
>>>>>>>>>> -      * the delayed work so there is no one else to set it to false
>>>>>>>>>> -      * and we don't care if someone else sets it to true.
>>>>>>>>>> -      */
>>>>>>>>>> -     if (adev->vcn.workload_profile_active)
>>>>>>>>>> -             goto pg_lock;
>>>>>>>>>> -
>>>>>>>>>>           mutex_lock(&adev->vcn.workload_profile_mutex);
>>>>>>>>> move to here:
>>>>>>>>> atomic_inc(&adev->vcn.total_submission_cnt);
>>>>>>>>> I think this should work for multiple instances.
>>>>>>>> Why does this need to be protected by the mutex?
>>>>>>> hmm.. OK - no need and it is actually better before the mutex.
>>>>>>> David
>>>>>>>> Alex
>>>>>>>>
>>>>>>>>> David
>>>>>>>>>>           if (!adev->vcn.workload_profile_active) {
>>>>>>>>>>                   r = amdgpu_dpm_switch_power_profile(adev,
>>>>>>>>>> PP_SMC_POWER_PROFILE_VIDEO,
>>>>>>>>>> @@ -487,7 +487,6 @@ void amdgpu_vcn_ring_begin_use(struct
>>>>>>>>>> amdgpu_ring *ring)
>>>>>>>>>>           }
>>>>>>>>>> mutex_unlock(&adev->vcn.workload_profile_mutex);
>>>>>>>>>>
>>>>>>>>>> -pg_lock:
>>>>>>>>>>           mutex_lock(&vcn_inst->vcn_pg_lock);
>>>>>>>>>>           vcn_inst->set_pg_state(vcn_inst, AMD_PG_STATE_UNGATE);
>>>>>>>>>>
>>>>>>>>>> @@ -528,6 +527,7 @@ void amdgpu_vcn_ring_end_use(struct amdgpu_ring
>>>>>>>>>> *ring)
>>>>>>>>>> atomic_dec(&ring->adev->vcn.inst[ring->me].dpg_enc_submission_cnt);
>>>>>>>>>>
>>>>>>>>>> atomic_dec(&ring->adev->vcn.inst[ring->me].total_submission_cnt);
>>>>>>>>>> + atomic_dec(&ring->adev->vcn.total_submission_cnt);
>>>>>>>>>>
>>>>>>>>>> schedule_delayed_work(&ring->adev->vcn.inst[ring->me].idle_work,
>>>>>>>>>>                                 VCN_IDLE_TIMEOUT);
>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
>>>>>>>>>> index b3fb1d0e43fc9..febc3ce8641ff 100644
>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
>>>>>>>>>> @@ -352,6 +352,7 @@ struct amdgpu_vcn {
>>>>>>>>>>
>>>>>>>>>>           uint16_t inst_mask;
>>>>>>>>>>           uint8_t num_inst_per_aid;
>>>>>>>>>> +     atomic_t                total_submission_cnt;
>>>>>>>>>>
>>>>>>>>>>           /* IP reg dump */
>>>>>>>>>>           uint32_t                *ip_dump;