[PATCH] drm/amdgpu/vcn: fix video profile race condition (v3)

Alex Deucher alexdeucher at gmail.com
Thu Aug 14 12:56:57 UTC 2025


On Wed, Aug 13, 2025 at 7:06 PM Wu, David <davidwu2 at amd.com> wrote:
>
> On 8/13/2025 6:11 PM, Alex Deucher wrote:
> > On Wed, Aug 13, 2025 at 5:47 PM Wu, David <davidwu2 at amd.com> wrote:
> >> On 8/13/2025 5:03 PM, Alex Deucher wrote:
> >>> On Wed, Aug 13, 2025 at 4:58 PM Sundararaju, Sathishkumar
> >>> <sathishkumar.sundararaju at amd.com> wrote:
> >>>> On 8/14/2025 1:35 AM, Alex Deucher wrote:
> >>>>> On Wed, Aug 13, 2025 at 2:23 PM Sundararaju, Sathishkumar
> >>>>> <sathishkumar.sundararaju at amd.com> wrote:
> >>>>>> Hi Alex, Hi David,
> >>>>>>
> >>>>>> I see David's concern, but his suggestion still won't solve the problem,
> >>>>>> and neither does the current form. Reason:
> >>>>>>
> >>>>>> The emitted fence count and total submission count are fast transients
> >>>>>> which frequently drop to 0 in between video decodes (between jobs), so
> >>>>>> even with the atomics and locks there can be a switch of the video power
> >>>>>> profile. The current form of the patch minimizes that window, but it can
> >>>>>> still happen under stress testing. The power state of an instance going
> >>>>>> to zero, on the other hand,
> >>>>> Can you explain how this can happen?  I'm not seeing it.
> >>>> Consider this situation: inst0 and inst1 are actively decoding, inst0's
> >>>> decode completes, and its delayed idle work starts.
> >>>> inst0's idle handler can then read 0 total fences and 0 total submission
> >>>> count even while inst1 is actively decoding, because the read can land
> >>>> between jobs:
> >>>>     - begin_use increments vcn.total_submission_cnt and end_use
> >>>> decrements it, so between jobs it can be 0.
> >>>>     - if the outstanding fences have all signalled and no new fence has
> >>>> been emitted yet, between jobs the fence count can be 0.
> >>>>     - neither of the above means the video decode is complete on
> >>>> inst1; it is still actively decoding.
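> >>>>
> >>>> Roughly, the window I mean is here (a hand-written sketch of the check
> >>>> in amdgpu_vcn_idle_work_handler(), not the exact patch code):
> >>>>
> >>>>     /* sketch: both reads can legitimately return 0 between two jobs
> >>>>      * of a decode that is still in progress on inst1 */
> >>>>     if (!total_fences &&
> >>>>         !atomic_read(&adev->vcn.total_submission_cnt)) {
> >>>>             /* inst1's last fence has signalled and its next
> >>>>              * begin_use() has not run yet, so the video profile
> >>>>              * is switched off in the middle of its session */
> >>>>             amdgpu_dpm_switch_power_profile(adev,
> >>>>                                             PP_SMC_POWER_PROFILE_VIDEO,
> >>>>                                             false);
> >>>>     }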
> >>> How can there be active decoding without an outstanding fence?  In
> >>> that case, total_fences (fences from both instances) would be non-0.
> >> I think it should be non-0.
> >> I do see a possible hiccup though - the power switching from ON to OFF and
> >> then back ON in the
> >> middle of decoding, i.e. inst0's idle handler turns it off and then inst1
> >> turns it back on.
> > How would that happen? As long as the submission cnt is non-0 and
> > there are outstanding fences on any instance, the video profile will
> > stay active.
> there could be no jobs at that moment - the decode session has not ended,
> it just has not submitted its next job yet, and new jobs will come in
> within a few ms - note all fences are done at this time. The idle handler
> sees no fences and no jobs, so it turns off the power - but just a few ms
> later a new job is submitted from the same decode session, which could be
> an mpv player, as it does not need to submit jobs back to back. This will
> turn the power back on.
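>
> A rough timeline of what I mean (illustrative only):
>
>     /*
>      * t0: the last fence of the session signals, end_use() drops the
>      *     submission counts to 0
>      * t1: inst0's idle work fires, sees 0 fences and 0 submissions,
>      *     gates inst0 and switches the video profile off
>      * t2: a few ms later the same decode session (e.g. mpv) submits its
>      *     next frame, begin_use() switches the profile back on
>      *
>      * net effect: an OFF -> ON glitch in the middle of one decode session
>      */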

I'm not following.  Every submission will start with begin_use().

Alex

> David
> > Alex
> >
> >> We should avoid this glitch. That requires the idle handler to set/clear
> >> a "done" flag
> >> for this instance, as in Sathish's original patch. Only when all instances
> >> have set/cleared the
> >> flag can we safely power off.
> >> David
> >>> Alex
> >>>
> >>>> Whereas if the instances are powered off we are sure the idle time has
> >>>> passed and they are gated, so there is no possible way a
> >>>> video decode is still active. When all instances are off we can safely
> >>>> assume there is no active decode, and the global lock protects
> >>>> against a new begin_use on any instance. The only distant concern with a
> >>>> global common lock is perf, but we
> >>>> already have a global workload-profile mutex, so there shouldn't be
> >>>> any drop in perf with just one single
> >>>> global lock for all instances.
> >>>>
> >>>> I am just sending out a patch with this fix and will leave it to you to
> >>>> decide on the right method. If you think the total outstanding fences
> >>>> can never be 0 during decode, then your previous version (v3) itself is
> >>>> good; there is no real benefit in splitting the handlers as such.
> >>>>
> >>>> Regards,
> >>>> Sathish
> >>>>> If it is possible, maybe it would be easier to just split the profile
> >>>>> and powergating into separate handlers.  The profile one would be
> >>>>> global and the powergating one would be per instance.  See the
> >>>>> attached patches.
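> >>>>>
> >>>>> Roughly this shape (only a sketch of the idea here; the handler and
> >>>>> helper names below are made up, the attached patches are what I'd
> >>>>> actually propose):
> >>>>>
> >>>>>     /* per instance: only powergates its own instance */
> >>>>>     static void amdgpu_vcn_idle_work_handler(struct work_struct *work)
> >>>>>     {
> >>>>>             struct amdgpu_vcn_inst *vcn_inst =
> >>>>>                     container_of(work, struct amdgpu_vcn_inst, idle_work.work);
> >>>>>
> >>>>>             if (!vcn_inst_fence_count(vcn_inst) &&      /* made-up helper */
> >>>>>                 !atomic_read(&vcn_inst->total_submission_cnt))
> >>>>>                     vcn_inst->set_pg_state(vcn_inst, AMD_PG_STATE_GATE);
> >>>>>     }
> >>>>>
> >>>>>     /* global: only drops the video profile when every instance is idle */
> >>>>>     static void amdgpu_vcn_profile_work_handler(struct work_struct *work)
> >>>>>     {
> >>>>>             struct amdgpu_device *adev =
> >>>>>                     vcn_profile_work_to_adev(work);     /* made-up helper */
> >>>>>
> >>>>>             mutex_lock(&adev->vcn.workload_profile_mutex);
> >>>>>             if (adev->vcn.workload_profile_active &&
> >>>>>                 !vcn_total_fence_count(adev) &&         /* made-up helper */
> >>>>>                 !atomic_read(&adev->vcn.total_submission_cnt)) {
> >>>>>                     amdgpu_dpm_switch_power_profile(adev,
> >>>>>                                                     PP_SMC_POWER_PROFILE_VIDEO,
> >>>>>                                                     false);
> >>>>>                     adev->vcn.workload_profile_active = false;
> >>>>>             }
> >>>>>             mutex_unlock(&adev->vcn.workload_profile_mutex);
> >>>>>     }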
> >>>>>
> >>>>> Alex
> >>>>>
> >>>>>> can be a sure-shot indication of a break in the video decode on that
> >>>>>> instance. The mistake in my patch was using a per-instance mutex; I
> >>>>>> should have used a common global mutex, and that would cover the
> >>>>>> situation David is trying to bring out.
> >>>>>>
> >>>>>> Using one global vcn.pg_lock for idle and begin_use, and using flags to
> >>>>>> track power state, could help us avoid this situation entirely.
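> >>>>>>
> >>>>>> Something along these lines (rough sketch of the idle path only; the
> >>>>>> global adev->vcn.pg_lock and the per-instance powered_on flag are
> >>>>>> made-up names):
> >>>>>>
> >>>>>>     bool any_on = false;
> >>>>>>     unsigned int i;
> >>>>>>
> >>>>>>     mutex_lock(&adev->vcn.pg_lock);             /* one global lock */
> >>>>>>     vcn_inst->set_pg_state(vcn_inst, AMD_PG_STATE_GATE);
> >>>>>>     vcn_inst->powered_on = false;
> >>>>>>
> >>>>>>     for (i = 0; i < adev->vcn.num_vcn_inst; ++i)
> >>>>>>             any_on |= adev->vcn.inst[i].powered_on;
> >>>>>>
> >>>>>>     /* drop the profile only when every instance is gated; begin_use()
> >>>>>>      * takes the same lock, so no instance can power up concurrently */
> >>>>>>     if (!any_on && adev->vcn.workload_profile_active) {
> >>>>>>             amdgpu_dpm_switch_power_profile(adev,
> >>>>>>                                             PP_SMC_POWER_PROFILE_VIDEO,
> >>>>>>                                             false);
> >>>>>>             adev->vcn.workload_profile_active = false;
> >>>>>>     }
> >>>>>>     mutex_unlock(&adev->vcn.pg_lock);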
> >>>>>>
> >>>>>> Regards,
> >>>>>>
> >>>>>> Sathish
> >>>>>>
> >>>>>> On 8/13/2025 11:46 PM, Wu, David wrote:
> >>>>>>> On 8/13/2025 12:51 PM, Alex Deucher wrote:
> >>>>>>>> On Wed, Aug 13, 2025 at 12:39 PM Wu, David <davidwu2 at amd.com> wrote:
> >>>>>>>>> Hi Alex,
> >>>>>>>>>
> >>>>>>>>> The addition of total_submission_cnt should work - in that
> >>>>>>>>> it is unlikely to have a context switch right after the begin_use().
> >>>>>>>>> As for the suggestion of moving it inside the lock (which I prefer,
> >>>>>>>>> in case someone
> >>>>>>>>> later adds code before the lock and it is not reviewed thoroughly)
> >>>>>>>>>       - up to you to decide.
> >>>>>>>>>
> >>>>>>>>> Reviewed-by: David (Ming Qiang) Wu <David.Wu3 at amd.com>
> >>>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>> David
> >>>>>>>>> On 8/13/2025 9:45 AM, Alex Deucher wrote:
> >>>>>>>>>> If there are multiple instances of the VCN running,
> >>>>>>>>>> we may end up switching the video profile while another
> >>>>>>>>>> instance is active because we only take into account
> >>>>>>>>>> the current instance's submissions.  Look at all
> >>>>>>>>>> outstanding fences for the video profile.
> >>>>>>>>>>
> >>>>>>>>>> v2: drop early exit in begin_use()
> >>>>>>>>>> v3: handle possible race between begin_use() and the work handler
> >>>>>>>>>>
> >>>>>>>>>> Fixes: 3b669df92c85 ("drm/amdgpu/vcn: adjust workload profile handling")
> >>>>>>>>>> Reviewed-by: Sathishkumar S <sathishkumar.sundararaju at amd.com> (v1)
> >>>>>>>>>> Signed-off-by: Alex Deucher <alexander.deucher at amd.com>
> >>>>>>>>>> ---
> >>>>>>>>>>       drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c | 40 ++++++++++++-------------
> >>>>>>>>>>       drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h |  1 +
> >>>>>>>>>>       2 files changed, 21 insertions(+), 20 deletions(-)
> >>>>>>>>>>
> >>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
> >>>>>>>>>> index 9a76e11d1c184..593c1ddf8819b 100644
> >>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
> >>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
> >>>>>>>>>> @@ -415,19 +415,25 @@ static void amdgpu_vcn_idle_work_handler(struct work_struct *work)
> >>>>>>>>>>           struct amdgpu_vcn_inst *vcn_inst =
> >>>>>>>>>>                   container_of(work, struct amdgpu_vcn_inst, idle_work.work);
> >>>>>>>>>>           struct amdgpu_device *adev = vcn_inst->adev;
> >>>>>>>>>> -     unsigned int fences = 0, fence[AMDGPU_MAX_VCN_INSTANCES] = {0};
> >>>>>>>>>> -     unsigned int i = vcn_inst->inst, j;
> >>>>>>>>>> +     unsigned int total_fences = 0, fence[AMDGPU_MAX_VCN_INSTANCES] = {0};
> >>>>>>>>>> +     unsigned int i, j;
> >>>>>>>>>>           int r = 0;
> >>>>>>>>>>
> >>>>>>>>>> -     if (adev->vcn.harvest_config & (1 << i))
> >>>>>>>>>> +     if (adev->vcn.harvest_config & (1 << vcn_inst->inst))
> >>>>>>>>>>                   return;
> >>>>>>>>>>
> >>>>>>>>>> -     for (j = 0; j < adev->vcn.inst[i].num_enc_rings; ++j)
> >>>>>>>>>> -             fence[i] += amdgpu_fence_count_emitted(&vcn_inst->ring_enc[j]);
> >>>>>>>>>> +     for (i = 0; i < adev->vcn.num_vcn_inst; ++i) {
> >>>>>>>>>> +             struct amdgpu_vcn_inst *v = &adev->vcn.inst[i];
> >>>>>>>>>> +
> >>>>>>>>>> +             for (j = 0; j < v->num_enc_rings; ++j)
> >>>>>>>>>> +                     fence[i] += amdgpu_fence_count_emitted(&v->ring_enc[j]);
> >>>>>>>>>> +             fence[i] += amdgpu_fence_count_emitted(&v->ring_dec);
> >>>>>>>>>> +             total_fences += fence[i];
> >>>>>>>>>> +     }
> >>>>>>>>>>
> >>>>>>>>>>           /* Only set DPG pause for VCN3 or below, VCN4 and above will be handled by FW */
> >>>>>>>>>>           if (adev->pg_flags & AMD_PG_SUPPORT_VCN_DPG &&
> >>>>>>>>>> -         !adev->vcn.inst[i].using_unified_queue) {
> >>>>>>>>>> +         !vcn_inst->using_unified_queue) {
> >>>>>>>>>>                   struct dpg_pause_state new_state;
> >>>>>>>>>>
> >>>>>>>>>>                   if (fence[i] ||
> >>>>>>>>>> @@ -436,18 +442,18 @@ static void amdgpu_vcn_idle_work_handler(struct work_struct *work)
> >>>>>>>>>>                   else
> >>>>>>>>>>                           new_state.fw_based = VCN_DPG_STATE__UNPAUSE;
> >>>>>>>>>>
> >>>>>>>>>> -             adev->vcn.inst[i].pause_dpg_mode(vcn_inst, &new_state);
> >>>>>>>>>> +             vcn_inst->pause_dpg_mode(vcn_inst, &new_state);
> >>>>>>>>>>           }
> >>>>>>>>>>
> >>>>>>>>>> -     fence[i] += amdgpu_fence_count_emitted(&vcn_inst->ring_dec);
> >>>>>>>>>> -     fences += fence[i];
> >>>>>>>>>> -
> >>>>>>>>>> -     if (!fences && !atomic_read(&vcn_inst->total_submission_cnt)) {
> >>>>>>>>>> +     if (!fence[vcn_inst->inst] && !atomic_read(&vcn_inst->total_submission_cnt)) {
> >>>>>>>>>> +             /* This is specific to this instance */
> >>>>>>>>>>                   mutex_lock(&vcn_inst->vcn_pg_lock);
> >>>>>>>>>>                   vcn_inst->set_pg_state(vcn_inst, AMD_PG_STATE_GATE);
> >>>>>>>>>>                   mutex_unlock(&vcn_inst->vcn_pg_lock);
> >>>>>>>>>>                   mutex_lock(&adev->vcn.workload_profile_mutex);
> >>>>>>>>>> -             if (adev->vcn.workload_profile_active) {
> >>>>>>>>>> +             /* This is global and depends on all VCN instances */
> >>>>>>>>>> +             if (adev->vcn.workload_profile_active && !total_fences &&
> >>>>>>>>>> +                 !atomic_read(&adev->vcn.total_submission_cnt)) {
> >>>>>>>>>>                           r = amdgpu_dpm_switch_power_profile(adev,
> >>>>>>>>>>                                               PP_SMC_POWER_PROFILE_VIDEO,
> >>>>>>>>>>                                               false);
> >>>>>>>>>>                           if (r)
> >>>>>>>>>> @@ -467,16 +473,10 @@ void amdgpu_vcn_ring_begin_use(struct amdgpu_ring *ring)
> >>>>>>>>>>           int r = 0;
> >>>>>>>>>>
> >>>>>>>>>>           atomic_inc(&vcn_inst->total_submission_cnt);
> >>>>>>>>>> +     atomic_inc(&adev->vcn.total_submission_cnt);
> >>>>>>>>> move this addition down inside the mutex lock
> >>>>>>>>>>           cancel_delayed_work_sync(&vcn_inst->idle_work);
> >>>>>>>>>>
> >>>>>>>>>> -     /* We can safely return early here because we've cancelled the
> >>>>>>>>>> -      * the delayed work so there is no one else to set it to false
> >>>>>>>>>> -      * and we don't care if someone else sets it to true.
> >>>>>>>>>> -      */
> >>>>>>>>>> -     if (adev->vcn.workload_profile_active)
> >>>>>>>>>> -             goto pg_lock;
> >>>>>>>>>> -
> >>>>>>>>>>           mutex_lock(&adev->vcn.workload_profile_mutex);
> >>>>>>>>> move to here:
> >>>>>>>>> atomic_inc(&adev->vcn.total_submission_cnt);
> >>>>>>>>> I think this should work for multiple instances.
> >>>>>>>> Why does this need to be protected by the mutex?
> >>>>>>> hmm.. OK - no need and it is actually better before the mutex.
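> >>>>>>>
> >>>>>>> i.e. the resulting ordering in begin_use() stays (abbreviated from the
> >>>>>>> patch above, error handling dropped):
> >>>>>>>
> >>>>>>>     atomic_inc(&adev->vcn.total_submission_cnt);   /* before the mutex */
> >>>>>>>     cancel_delayed_work_sync(&vcn_inst->idle_work);
> >>>>>>>
> >>>>>>>     mutex_lock(&adev->vcn.workload_profile_mutex);
> >>>>>>>     /* the idle handler re-checks the counter under this same mutex,
> >>>>>>>      * so it either sees it non-zero or, if it already powered the
> >>>>>>>      * profile off, we simply switch it back on right here */
> >>>>>>>     if (!adev->vcn.workload_profile_active)
> >>>>>>>             amdgpu_dpm_switch_power_profile(adev,
> >>>>>>>                                             PP_SMC_POWER_PROFILE_VIDEO,
> >>>>>>>                                             true);
> >>>>>>>     mutex_unlock(&adev->vcn.workload_profile_mutex);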
> >>>>>>> David
> >>>>>>>> Alex
> >>>>>>>>
> >>>>>>>>> David
> >>>>>>>>>>           if (!adev->vcn.workload_profile_active) {
> >>>>>>>>>>                   r = amdgpu_dpm_switch_power_profile(adev,
> >>>>>>>>>>                                       PP_SMC_POWER_PROFILE_VIDEO,
> >>>>>>>>>> @@ -487,7 +487,6 @@ void amdgpu_vcn_ring_begin_use(struct amdgpu_ring *ring)
> >>>>>>>>>>           }
> >>>>>>>>>>           mutex_unlock(&adev->vcn.workload_profile_mutex);
> >>>>>>>>>>
> >>>>>>>>>> -pg_lock:
> >>>>>>>>>>           mutex_lock(&vcn_inst->vcn_pg_lock);
> >>>>>>>>>>           vcn_inst->set_pg_state(vcn_inst, AMD_PG_STATE_UNGATE);
> >>>>>>>>>>
> >>>>>>>>>> @@ -528,6 +527,7 @@ void amdgpu_vcn_ring_end_use(struct amdgpu_ring *ring)
> >>>>>>>>>>                   atomic_dec(&ring->adev->vcn.inst[ring->me].dpg_enc_submission_cnt);
> >>>>>>>>>>
> >>>>>>>>>>           atomic_dec(&ring->adev->vcn.inst[ring->me].total_submission_cnt);
> >>>>>>>>>> +     atomic_dec(&ring->adev->vcn.total_submission_cnt);
> >>>>>>>>>>
> >>>>>>>>>>           schedule_delayed_work(&ring->adev->vcn.inst[ring->me].idle_work,
> >>>>>>>>>>                                 VCN_IDLE_TIMEOUT);
> >>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
> >>>>>>>>>> index b3fb1d0e43fc9..febc3ce8641ff 100644
> >>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
> >>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
> >>>>>>>>>> @@ -352,6 +352,7 @@ struct amdgpu_vcn {
> >>>>>>>>>>
> >>>>>>>>>>           uint16_t inst_mask;
> >>>>>>>>>>           uint8_t num_inst_per_aid;
> >>>>>>>>>> +     atomic_t                total_submission_cnt;
> >>>>>>>>>>
> >>>>>>>>>>           /* IP reg dump */
> >>>>>>>>>>           uint32_t                *ip_dump;
>

