[PATCH] drm/amdgpu/vcn: fix video profile race condition (v3)

Alex Deucher alexdeucher at gmail.com
Wed Aug 13 22:11:39 UTC 2025


On Wed, Aug 13, 2025 at 5:47 PM Wu, David <davidwu2 at amd.com> wrote:
>
> On 8/13/2025 5:03 PM, Alex Deucher wrote:
> > On Wed, Aug 13, 2025 at 4:58 PM Sundararaju, Sathishkumar
> > <sathishkumar.sundararaju at amd.com> wrote:
> >>
> >> On 8/14/2025 1:35 AM, Alex Deucher wrote:
> >>> On Wed, Aug 13, 2025 at 2:23 PM Sundararaju, Sathishkumar
> >>> <sathishkumar.sundararaju at amd.com> wrote:
> >>>> Hi Alex, Hi David,
> >>>>
> >>>> I see David's concern, but his suggestion still won't solve the problem,
> >>>> and neither does the current form. The reason:
> >>>>
> >>>> The emitted fence count and total submission count are fast transients
> >>>> which frequently become 0 in between video decodes (between jobs), so even
> >>>> with the atomics and locks there can be a switch of the video power profile.
> >>>> The current form of the patch minimizes that window, but it can still
> >>>> happen under stress testing. But the power state of any instance becoming zero
> >>> Can you explain how this can happen?  I'm not seeing it.
> >> Consider this situation: inst0 and inst1 are actively decoding, inst0's decode
> >> completes, and its delayed idle work starts.
> >> The inst0 idle handler can then read 0 total fences and 0 total submission
> >> count even though inst1 is actively decoding, because between inst1's jobs:
> >>    - begin_use increments vcn.total_submission_cnt and end_use
> >> decrements it, so it can read 0.
> >>    - if the outstanding fences have all signalled and no new fence has been
> >> emitted yet, the fence count can read 0.
> >>    - neither of the above conditions means video decode is complete on
> >> inst1; it is still actively decoding (see the interleaving sketch below).
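
To make that window concrete, the interleaving spelled out (the timing is
purely illustrative; the identifiers are the driver's real ones):

/*
 *   inst1: amdgpu_vcn_ring_end_use()       -> vcn.total_submission_cnt drops to 0
 *   inst1: last emitted fence signals      -> amdgpu_fence_count_emitted() == 0
 *   inst0: amdgpu_vcn_idle_work_handler()  -> reads 0 fences and 0 submissions,
 *                                             switches the video profile off
 *   inst1: amdgpu_vcn_ring_begin_use()     -> next job arrives and has to
 *                                             re-enable the profile mid-stream
 */
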
> > How can there be active decoding without an outstanding fence?  In
> > that case, total_fences (fences from both instances) would be non-0.
> I think it should be non-0.
> I do see a possible hiccup, though - the power switching from ON to OFF and
> then back ON in the middle of decoding, i.e. the inst0 idle handler turns it
> off and then inst1 turns it back on.

How would that happen? As long as the submission cnt is non-0 or
there are outstanding fences on any instance, the video profile will
stay active.
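
For reference, that is the condition the v3 hunk further down applies in the
idle work handler (pulled out here for clarity):

	/* drop the profile only when, across all instances, nothing is
	 * emitted and nothing is in flight
	 */
	if (adev->vcn.workload_profile_active && !total_fences &&
	    !atomic_read(&adev->vcn.total_submission_cnt))
		r = amdgpu_dpm_switch_power_profile(adev,
						    PP_SMC_POWER_PROFILE_VIDEO,
						    false);
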

Alex

> We should avoid this glitch. This requires the idle handler to set/clear
> a "done" flag for its instance, as in Sathish's original patch. Only when all
> instances have set/cleared their flag can we safely power off (see the
> sketch below).
> David
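
A minimal sketch of that flag-based approach, assuming a hypothetical
per-instance pg_gated flag protected by the existing workload_profile_mutex
(this is only the shape of the idea, not the posted patch):

/* Each instance marks itself gated on idle; the video profile is dropped
 * only once every instance has done so.  Per-instance fence/submission
 * checks, DPG handling and error handling are omitted.
 */
static void vcn_idle_power_off(struct amdgpu_vcn_inst *vcn_inst)
{
	struct amdgpu_device *adev = vcn_inst->adev;
	bool all_gated = true;
	unsigned int i;

	mutex_lock(&adev->vcn.workload_profile_mutex);
	vcn_inst->set_pg_state(vcn_inst, AMD_PG_STATE_GATE);
	vcn_inst->pg_gated = true;		/* hypothetical flag */

	for (i = 0; i < adev->vcn.num_vcn_inst; ++i)
		all_gated &= adev->vcn.inst[i].pg_gated;

	if (all_gated && adev->vcn.workload_profile_active) {
		amdgpu_dpm_switch_power_profile(adev,
						PP_SMC_POWER_PROFILE_VIDEO,
						false);
		adev->vcn.workload_profile_active = false;
	}
	mutex_unlock(&adev->vcn.workload_profile_mutex);
}
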
> > Alex
> >
> >> Whereas if an instance is powered off, we are sure its idle time has passed
> >> and there is no possible way it is actively decoding. When all instances are
> >> off we can safely assume there is no active decode, and a global lock protects
> >> against a new begin_use on any instance. The only distant concern with a
> >> global common lock is perf, but we already have a global workload-profile
> >> mutex, so there shouldn't be any drop in perf with just one single
> >> global lock for all instances.
> >>
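
And the begin_use side under the same single global lock, where the power
state rather than the transient counters decides whether the profile needs
to be (re)enabled (again a hypothetical sketch, reusing the pg_gated flag
from the sketch above):

/* The global mutex serializes against the idle handler, so the profile
 * cannot be dropped between the check and the ungate.  DPG pause handling
 * is omitted.
 */
static void vcn_ring_begin_use_sketch(struct amdgpu_ring *ring)
{
	struct amdgpu_device *adev = ring->adev;
	struct amdgpu_vcn_inst *vcn_inst = &adev->vcn.inst[ring->me];

	atomic_inc(&vcn_inst->total_submission_cnt);
	cancel_delayed_work_sync(&vcn_inst->idle_work);

	mutex_lock(&adev->vcn.workload_profile_mutex);
	if (!adev->vcn.workload_profile_active) {
		amdgpu_dpm_switch_power_profile(adev,
						PP_SMC_POWER_PROFILE_VIDEO,
						true);
		adev->vcn.workload_profile_active = true;
	}
	vcn_inst->set_pg_state(vcn_inst, AMD_PG_STATE_UNGATE);
	vcn_inst->pg_gated = false;		/* hypothetical flag */
	mutex_unlock(&adev->vcn.workload_profile_mutex);
}
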
> >> I am just sending out a patch with this fix and will leave it to you to
> >> decide on the right method. If you think the outstanding total fences can
> >> never be 0 during a decode, then your previous version (v3) itself is
> >> fine and there is no real benefit to splitting the handlers.
> >>
> >> Regards,
> >> Sathish
> >>> If it is possible, maybe it would be easier to just split the profile
> >>> and powergating into separate handlers.  The profile one would be
> >>> global and the powergating one would be per instance.  See the
> >>> attached patches.
> >>>
> >>> Alex
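
For reference, the rough shape of that split (the global work item and the
handler names below are hypothetical and may differ from the attached
patches):

/* Per-instance handler: only power-gates its own instance.  Enc-ring
 * fences, DPG pause handling and rescheduling are omitted for brevity.
 */
static void vcn_inst_idle_work_handler(struct work_struct *work)
{
	struct amdgpu_vcn_inst *vcn_inst =
		container_of(work, struct amdgpu_vcn_inst, idle_work.work);

	if (!amdgpu_fence_count_emitted(&vcn_inst->ring_dec) &&
	    !atomic_read(&vcn_inst->total_submission_cnt))
		vcn_inst->set_pg_state(vcn_inst, AMD_PG_STATE_GATE);
}

/* Global handler: only manages the video workload profile, looking at all
 * instances.  adev->vcn.profile_idle_work is a hypothetical field.
 */
static void vcn_profile_idle_work_handler(struct work_struct *work)
{
	struct amdgpu_device *adev =
		container_of(work, struct amdgpu_device, vcn.profile_idle_work.work);
	unsigned int total_fences = 0, i, j;

	for (i = 0; i < adev->vcn.num_vcn_inst; ++i) {
		struct amdgpu_vcn_inst *v = &adev->vcn.inst[i];

		for (j = 0; j < v->num_enc_rings; ++j)
			total_fences += amdgpu_fence_count_emitted(&v->ring_enc[j]);
		total_fences += amdgpu_fence_count_emitted(&v->ring_dec);
	}

	mutex_lock(&adev->vcn.workload_profile_mutex);
	if (adev->vcn.workload_profile_active && !total_fences &&
	    !atomic_read(&adev->vcn.total_submission_cnt)) {
		amdgpu_dpm_switch_power_profile(adev,
						PP_SMC_POWER_PROFILE_VIDEO,
						false);
		adev->vcn.workload_profile_active = false;
	}
	mutex_unlock(&adev->vcn.workload_profile_mutex);
}
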
> >>>
> >>>> is a sure-shot indication of a break in a video decode. The mistake in
> >>>> my patch was using a per-instance mutex; I should have used a common
> >>>> global mutex, which would cover the situation David is trying to bring out.
> >>>>
> >>>> Using one global vcn.pg_lock for idle and begin_use and using flags to
> >>>> track power state could help us totally avoid this situation.
> >>>>
> >>>> Regards,
> >>>>
> >>>> Sathish
> >>>>
> >>>> On 8/13/2025 11:46 PM, Wu, David wrote:
> >>>>> On 8/13/2025 12:51 PM, Alex Deucher wrote:
> >>>>>> On Wed, Aug 13, 2025 at 12:39 PM Wu, David <davidwu2 at amd.com> wrote:
> >>>>>>> Hi Alex,
> >>>>>>>
> >>>>>>> The addition of total_submission_cnt should work, in that
> >>>>>>> it is unlikely to have a context switch right after begin_use().
> >>>>>>> As for the suggestion of moving it inside the lock (which I prefer,
> >>>>>>> in case someone adds more code before the lock and it is not
> >>>>>>> reviewed thoroughly)
> >>>>>>>      - up to you to decide.
> >>>>>>>
> >>>>>>> Reviewed-by: David (Ming Qiang) Wu <David.Wu3 at amd.com>
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> David
> >>>>>>> On 8/13/2025 9:45 AM, Alex Deucher wrote:
> >>>>>>>> If there are multiple instances of the VCN running,
> >>>>>>>> we may end up switching the video profile while another
> >>>>>>>> instance is active because we only take into account
> >>>>>>>> the current instance's submissions.  Look at all
> >>>>>>>> outstanding fences for the video profile.
> >>>>>>>>
> >>>>>>>> v2: drop early exit in begin_use()
> >>>>>>>> v3: handle possible race between begin_use() and the work handler
> >>>>>>>>
> >>>>>>>> Fixes: 3b669df92c85 ("drm/amdgpu/vcn: adjust workload profile handling")
> >>>>>>>> Reviewed-by: Sathishkumar S <sathishkumar.sundararaju at amd.com> (v1)
> >>>>>>>> Signed-off-by: Alex Deucher <alexander.deucher at amd.com>
> >>>>>>>> ---
> >>>>>>>>      drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c | 40 ++++++++++++-------------
> >>>>>>>>      drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h |  1 +
> >>>>>>>>      2 files changed, 21 insertions(+), 20 deletions(-)
> >>>>>>>>
> >>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
> >>>>>>>> index 9a76e11d1c184..593c1ddf8819b 100644
> >>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
> >>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
> >>>>>>>> @@ -415,19 +415,25 @@ static void amdgpu_vcn_idle_work_handler(struct work_struct *work)
> >>>>>>>>          struct amdgpu_vcn_inst *vcn_inst =
> >>>>>>>>                  container_of(work, struct amdgpu_vcn_inst, idle_work.work);
> >>>>>>>>          struct amdgpu_device *adev = vcn_inst->adev;
> >>>>>>>> -     unsigned int fences = 0, fence[AMDGPU_MAX_VCN_INSTANCES] = {0};
> >>>>>>>> -     unsigned int i = vcn_inst->inst, j;
> >>>>>>>> +     unsigned int total_fences = 0, fence[AMDGPU_MAX_VCN_INSTANCES] = {0};
> >>>>>>>> +     unsigned int i, j;
> >>>>>>>>          int r = 0;
> >>>>>>>>
> >>>>>>>> -     if (adev->vcn.harvest_config & (1 << i))
> >>>>>>>> +     if (adev->vcn.harvest_config & (1 << vcn_inst->inst))
> >>>>>>>>                  return;
> >>>>>>>>
> >>>>>>>> -     for (j = 0; j < adev->vcn.inst[i].num_enc_rings; ++j)
> >>>>>>>> -             fence[i] += amdgpu_fence_count_emitted(&vcn_inst->ring_enc[j]);
> >>>>>>>> +     for (i = 0; i < adev->vcn.num_vcn_inst; ++i) {
> >>>>>>>> +             struct amdgpu_vcn_inst *v = &adev->vcn.inst[i];
> >>>>>>>> +
> >>>>>>>> +             for (j = 0; j < v->num_enc_rings; ++j)
> >>>>>>>> +                     fence[i] += amdgpu_fence_count_emitted(&v->ring_enc[j]);
> >>>>>>>> +             fence[i] += amdgpu_fence_count_emitted(&v->ring_dec);
> >>>>>>>> +             total_fences += fence[i];
> >>>>>>>> +     }
> >>>>>>>>
> >>>>>>>>          /* Only set DPG pause for VCN3 or below, VCN4 and above will be handled by FW */
> >>>>>>>>          if (adev->pg_flags & AMD_PG_SUPPORT_VCN_DPG &&
> >>>>>>>> -         !adev->vcn.inst[i].using_unified_queue) {
> >>>>>>>> +         !vcn_inst->using_unified_queue) {
> >>>>>>>>                  struct dpg_pause_state new_state;
> >>>>>>>>
> >>>>>>>>                  if (fence[i] ||
> >>>>>>>> @@ -436,18 +442,18 @@ static void amdgpu_vcn_idle_work_handler(struct work_struct *work)
> >>>>>>>>                  else
> >>>>>>>>                          new_state.fw_based = VCN_DPG_STATE__UNPAUSE;
> >>>>>>>>
> >>>>>>>> -             adev->vcn.inst[i].pause_dpg_mode(vcn_inst, &new_state);
> >>>>>>>> +             vcn_inst->pause_dpg_mode(vcn_inst, &new_state);
> >>>>>>>>          }
> >>>>>>>>
> >>>>>>>> -     fence[i] += amdgpu_fence_count_emitted(&vcn_inst->ring_dec);
> >>>>>>>> -     fences += fence[i];
> >>>>>>>> -
> >>>>>>>> -     if (!fences && !atomic_read(&vcn_inst->total_submission_cnt)) {
> >>>>>>>> +     if (!fence[vcn_inst->inst] && !atomic_read(&vcn_inst->total_submission_cnt)) {
> >>>>>>>> +             /* This is specific to this instance */
> >>>>>>>>                  mutex_lock(&vcn_inst->vcn_pg_lock);
> >>>>>>>>                  vcn_inst->set_pg_state(vcn_inst, AMD_PG_STATE_GATE);
> >>>>>>>>                  mutex_unlock(&vcn_inst->vcn_pg_lock);
> >>>>>>>>                  mutex_lock(&adev->vcn.workload_profile_mutex);
> >>>>>>>> -             if (adev->vcn.workload_profile_active) {
> >>>>>>>> +             /* This is global and depends on all VCN instances */
> >>>>>>>> +             if (adev->vcn.workload_profile_active && !total_fences &&
> >>>>>>>> +                 !atomic_read(&adev->vcn.total_submission_cnt)) {
> >>>>>>>>                          r = amdgpu_dpm_switch_power_profile(adev,
> >>>>>>>>                                                              PP_SMC_POWER_PROFILE_VIDEO,
> >>>>>>>>                                                              false);
> >>>>>>>>                          if (r)
> >>>>>>>> @@ -467,16 +473,10 @@ void amdgpu_vcn_ring_begin_use(struct amdgpu_ring *ring)
> >>>>>>>>          int r = 0;
> >>>>>>>>
> >>>>>>>>          atomic_inc(&vcn_inst->total_submission_cnt);
> >>>>>>>> +     atomic_inc(&adev->vcn.total_submission_cnt);
> >>>>>>> move this addition down inside the mutex lock
> >>>>>>>>          cancel_delayed_work_sync(&vcn_inst->idle_work);
> >>>>>>>>
> >>>>>>>> -     /* We can safely return early here because we've cancelled the
> >>>>>>>> -      * the delayed work so there is no one else to set it to false
> >>>>>>>> -      * and we don't care if someone else sets it to true.
> >>>>>>>> -      */
> >>>>>>>> -     if (adev->vcn.workload_profile_active)
> >>>>>>>> -             goto pg_lock;
> >>>>>>>> -
> >>>>>>>>          mutex_lock(&adev->vcn.workload_profile_mutex);
> >>>>>>> move to here:
> >>>>>>> atomic_inc(&adev->vcn.total_submission_cnt);
> >>>>>>> I think this should work for multiple instances.
> >>>>>> Why does this need to be protected by the mutex?
> >>>>> hmm.. OK - no need and it is actually better before the mutex.
> >>>>> David
> >>>>>> Alex
> >>>>>>
> >>>>>>> David
> >>>>>>>>          if (!adev->vcn.workload_profile_active) {
> >>>>>>>>                  r = amdgpu_dpm_switch_power_profile(adev,
> >>>>>>>>                                                      PP_SMC_POWER_PROFILE_VIDEO,
> >>>>>>>> @@ -487,7 +487,6 @@ void amdgpu_vcn_ring_begin_use(struct amdgpu_ring *ring)
> >>>>>>>>          }
> >>>>>>>>          mutex_unlock(&adev->vcn.workload_profile_mutex);
> >>>>>>>>
> >>>>>>>> -pg_lock:
> >>>>>>>>          mutex_lock(&vcn_inst->vcn_pg_lock);
> >>>>>>>>          vcn_inst->set_pg_state(vcn_inst, AMD_PG_STATE_UNGATE);
> >>>>>>>>
> >>>>>>>> @@ -528,6 +527,7 @@ void amdgpu_vcn_ring_end_use(struct amdgpu_ring *ring)
> >>>>>>>>                  atomic_dec(&ring->adev->vcn.inst[ring->me].dpg_enc_submission_cnt);
> >>>>>>>>
> >>>>>>>>          atomic_dec(&ring->adev->vcn.inst[ring->me].total_submission_cnt);
> >>>>>>>> +     atomic_dec(&ring->adev->vcn.total_submission_cnt);
> >>>>>>>>
> >>>>>>>>          schedule_delayed_work(&ring->adev->vcn.inst[ring->me].idle_work,
> >>>>>>>>                                VCN_IDLE_TIMEOUT);
> >>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
> >>>>>>>> index b3fb1d0e43fc9..febc3ce8641ff 100644
> >>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
> >>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
> >>>>>>>> @@ -352,6 +352,7 @@ struct amdgpu_vcn {
> >>>>>>>>
> >>>>>>>>          uint16_t inst_mask;
> >>>>>>>>          uint8_t num_inst_per_aid;
> >>>>>>>> +     atomic_t                total_submission_cnt;
> >>>>>>>>
> >>>>>>>>          /* IP reg dump */
> >>>>>>>>          uint32_t                *ip_dump;
>

