[PATCH] drm/amdgpu/vcn: fix video profile race condition (v3)
David Wu
davidwu2 at amd.com
Thu Aug 14 16:44:16 UTC 2025
On 2025-08-14 12:01, Alex Deucher wrote:
> On Thu, Aug 14, 2025 at 11:35 AM David Wu <davidwu2 at amd.com> wrote:
>> On 2025-08-14 08:56, Alex Deucher wrote:
>>> On Wed, Aug 13, 2025 at 7:06 PM Wu, David <davidwu2 at amd.com> wrote:
>>>> On 8/13/2025 6:11 PM, Alex Deucher wrote:
>>>>> On Wed, Aug 13, 2025 at 5:47 PM Wu, David <davidwu2 at amd.com> wrote:
>>>>>> On 8/13/2025 5:03 PM, Alex Deucher wrote:
>>>>>>> On Wed, Aug 13, 2025 at 4:58 PM Sundararaju, Sathishkumar
>>>>>>> <sathishkumar.sundararaju at amd.com> wrote:
>>>>>>>> On 8/14/2025 1:35 AM, Alex Deucher wrote:
>>>>>>>>> On Wed, Aug 13, 2025 at 2:23 PM Sundararaju, Sathishkumar
>>>>>>>>> <sathishkumar.sundararaju at amd.com> wrote:
>>>>>>>>>> Hi Alex, Hi David,
>>>>>>>>>>
>>>>>>>>>> I see David's concern, but his suggestion still won't solve the problem,
>>>>>>>>>> and neither does the current form. Reason:
>>>>>>>>>>
>>>>>>>>>> The emitted fence count and total submission count are fast transients
>>>>>>>>>> that frequently drop to 0 between video decodes (between jobs). Even with
>>>>>>>>>> the atomics and locks there can be a switch of the video power profile;
>>>>>>>>>> the current form of the patch narrows that window, but it can still happen
>>>>>>>>>> under stress testing. But the power state of any instance becoming zero
>>>>>>>>> Can you explain how this can happen? I'm not seeing it.
>>>>>>>> Consider this situation: inst0 and inst1 are actively decoding, inst0's
>>>>>>>> decode completes, and its delayed idle work starts.
>>>>>>>> inst0's idle handler can read 0 total fences and 0 total submission count
>>>>>>>> even while inst1 is actively decoding, because between jobs:
>>>>>>>> - begin_use() increments vcn.total_submission_cnt and end_use()
>>>>>>>> decrements it, so it can be 0.
>>>>>>>> - if the outstanding fences have signalled and no new fence has been
>>>>>>>> emitted between jobs, the fence count can also be 0.
>>>>>>>> - neither of the above conditions means the video decode is complete on
>>>>>>>> inst1; it is still actively decoding.
>>>>>>> How can there be active decoding without an outstanding fence? In
>>>>>>> that case, total_fences (fences from both instances) would be non-0.
>>>>>> I think it should be non-0.
>>>>>> I do see a possible hiccup - i.e. the power switching from ON to OFF and
>>>>>> then back ON in the middle of decoding: inst0's idle handler turns it off,
>>>>>> then inst1 turns it back on.
>>>>> How would that happen? As long as the submission cnt is non-0 and
>>>>> there are outstanding fences on any instance, the video profile will
>>>>> stay active.
>>>> There could be no jobs while the session has not timed out yet, and new
>>>> jobs may arrive within milliseconds - note that all fences are done at this
>>>> time. The idle handler sees no fences and no jobs, so it turns off the
>>>> power - but just milliseconds later a new job is submitted from the same
>>>> decode session (for example an mpv player, which does not submit jobs back
>>>> to back). This turns the power back on.
>>> I'm not following. Every submission will start with begin_use().
>> Yes, it does - it can power VCN back on, but this happens in the middle of a
>> decode session that has a 10s timeout before its own idle handler runs. In
>> fact the other instance's idle handler will power off VCN because it does not
>> know it needs to wait until the decoding session times out.
> I don't follow. If there are no outstanding fences, there is no
> reason to not power down the VCN instance and disable the video
> profile. If there are still outstanding fences, then the VCN instance
> those fences are associated with will stay on and the video profile
> will stay enabled. If the engine hangs and eventually gets reset, the
> fence will be signalled and then there will be no outstanding fences
> so the idle handler will eventually disable the power profile. The
> idle handler will keep getting rescheduled as long as there is still
> outstanding work.
Take inst0 and inst1:
inst0 submits jobA, then finishes jobA, and no further job is submitted for
500ms; the job queue is empty. At this point inst1's idle handler sees no
outstanding fences or jobs and powers off. However, inst0 submits its next
job after 500ms - inst0's own idle handler has not yet been scheduled to run,
but inst1's has already finished, and it has no way of knowing that inst0 has
neither timed out nor run its own idle handler. This violates the idle
handler's timeout logic (i.e. a 10s timeout was designed, but it effectively
timed out in 500ms), which means the power was taken down too early for inst0.
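To make the window concrete, here is a rough timeline of the scenario above
(the 500ms figure is only illustrative; the condition shown is paraphrased
from the v3 idle handler in the patch quoted below):

/*
 * Illustrative timeline only -- not driver code; timings are examples.
 *
 *  t0        inst0: begin_use() -> jobA -> end_use(); jobA's fences signal.
 *  t0+500ms  inst1: its delayed idle work runs and observes
 *                total_fences == 0 &&
 *                atomic_read(&adev->vcn.total_submission_cnt) == 0
 *            so it disables the video profile for the whole device.
 *  t0+500ms+ inst0: the next job of the same decode session arrives;
 *            begin_use() has to re-enable the profile even though inst0's
 *            own idle timeout never expired.
 */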
> Alex
>
>
>>> Alex
>>>
>>>> David
>>>>> Alex
>>>>>
>>>>>> We should avoid this glitch. This requires that the idle handler set/clear
>>>>>> a "done" flag for its own instance, as in Sathish's original patch. Only
>>>>>> when all instances have set/cleared the flag can we safely power off.
>>>>>> David
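A minimal sketch of that per-instance "done" flag idea (the idle_done field
and the reuse of workload_profile_mutex as the global lock are assumptions
for illustration, not Sathish's actual patch):

/*
 * Sketch only: "idle_done" is a hypothetical per-instance flag that
 * begin_use() would clear under the same lock; harvested instances are
 * ignored for brevity, and error handling is omitted.
 */
static void vcn_idle_work_sketch(struct amdgpu_vcn_inst *vcn_inst)
{
	struct amdgpu_device *adev = vcn_inst->adev;
	bool all_idle = true;
	unsigned int i;

	mutex_lock(&adev->vcn.workload_profile_mutex);
	vcn_inst->idle_done = true;	/* this instance really timed out */
	for (i = 0; i < adev->vcn.num_vcn_inst; ++i)
		all_idle &= adev->vcn.inst[i].idle_done;
	/* drop the video profile only once every instance has timed out */
	if (all_idle && adev->vcn.workload_profile_active) {
		amdgpu_dpm_switch_power_profile(adev, PP_SMC_POWER_PROFILE_VIDEO,
						false);
		adev->vcn.workload_profile_active = false;
	}
	mutex_unlock(&adev->vcn.workload_profile_mutex);
}

With this shape, disabling the profile can only race with begin_use() through
the single global mutex, regardless of which instance's timer fires first.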
>>>>>>> Alex
>>>>>>>
>>>>>>>> Whereas if the instances are powered off, we are sure the idle time has
>>>>>>>> passed and there is no possible active video decode; when all instances
>>>>>>>> are off we can safely assume no active decode, and a global lock protects
>>>>>>>> against a new begin_use() on any instance. The only distant concern with
>>>>>>>> global common locks is performance, but we already have a global workload
>>>>>>>> profile mutex, so there shouldn't be any drop in perf with just one single
>>>>>>>> global lock for all instances.
>>>>>>>>
>>>>>>>> I am sending out a patch with this fix and will leave it to you to decide
>>>>>>>> on the right method. If you think the outstanding total fences can never
>>>>>>>> be 0 during decode, then your previous version (v3) itself is good, and
>>>>>>>> there is no real benefit in splitting the handlers.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Sathish
>>>>>>>>> If it is possible, maybe it would be easier to just split the profile
>>>>>>>>> and powergating into separate handlers. The profile one would be
>>>>>>>>> global and the powergating one would be per instance. See the
>>>>>>>>> attached patches.
>>>>>>>>>
>>>>>>>>> Alex
>>>>>>>>>
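For reference, the global half of that split might look roughly like this
(the profile_work field and the function name are assumptions for
illustration; the attached patches are not reproduced here, and a real
version would also account for outstanding fences on all instances):

/*
 * Sketch of a device-wide work item that owns only the video profile,
 * while the existing per-instance idle work keeps handling powergating.
 * Error handling is omitted in this sketch.
 */
static void vcn_profile_idle_work_sketch(struct work_struct *work)
{
	struct amdgpu_device *adev =
		container_of(work, struct amdgpu_device, vcn.profile_work.work);

	mutex_lock(&adev->vcn.workload_profile_mutex);
	if (adev->vcn.workload_profile_active &&
	    !atomic_read(&adev->vcn.total_submission_cnt)) {
		amdgpu_dpm_switch_power_profile(adev, PP_SMC_POWER_PROFILE_VIDEO,
						false);
		adev->vcn.workload_profile_active = false;
	}
	mutex_unlock(&adev->vcn.workload_profile_mutex);
}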
>>>>>>>>>> can be a sure indication of a break in the video decode. The mistake in
>>>>>>>>>> my patch was using a per-instance mutex; I should have used a common
>>>>>>>>>> global mutex, which would cover the situation David is trying to bring out.
>>>>>>>>>>
>>>>>>>>>> Using one global vcn.pg_lock for idle and begin_use(), plus flags to
>>>>>>>>>> track the power state, could help us avoid this situation entirely.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>>
>>>>>>>>>> Sathish
>>>>>>>>>>
>>>>>>>>>> On 8/13/2025 11:46 PM, Wu, David wrote:
>>>>>>>>>>> On 8/13/2025 12:51 PM, Alex Deucher wrote:
>>>>>>>>>>>> On Wed, Aug 13, 2025 at 12:39 PM Wu, David <davidwu2 at amd.com> wrote:
>>>>>>>>>>>>> Hi Alex,
>>>>>>>>>>>>>
>>>>>>>>>>>>> The addition of total_submission_cnt should work, in that a context
>>>>>>>>>>>>> switch right after begin_use() is unlikely. As for the suggestion of
>>>>>>>>>>>>> moving it inside the lock (which I prefer, in case someone later adds
>>>>>>>>>>>>> code before the lock that is not reviewed thoroughly) - up to you to
>>>>>>>>>>>>> decide.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Reviewed-by: David (Ming Qiang) Wu <David.Wu3 at amd.com>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> David
>>>>>>>>>>>>> On 8/13/2025 9:45 AM, Alex Deucher wrote:
>>>>>>>>>>>>>> If there are multiple instances of the VCN running,
>>>>>>>>>>>>>> we may end up switching the video profile while another
>>>>>>>>>>>>>> instance is active because we only take into account
>>>>>>>>>>>>>> the current instance's submissions. Look at all
>>>>>>>>>>>>>> outstanding fences for the video profile.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> v2: drop early exit in begin_use()
>>>>>>>>>>>>>> v3: handle possible race between begin_use() and the work handler
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Fixes: 3b669df92c85 ("drm/amdgpu/vcn: adjust workload profile handling")
>>>>>>>>>>>>>> Reviewed-by: Sathishkumar S <sathishkumar.sundararaju at amd.com> (v1)
>>>>>>>>>>>>>> Signed-off-by: Alex Deucher <alexander.deucher at amd.com>
>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c | 40 ++++++++++++-------------
>>>>>>>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h | 1 +
>>>>>>>>>>>>>> 2 files changed, 21 insertions(+), 20 deletions(-)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
>>>>>>>>>>>>>> index 9a76e11d1c184..593c1ddf8819b 100644
>>>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
>>>>>>>>>>>>>> @@ -415,19 +415,25 @@ static void amdgpu_vcn_idle_work_handler(struct work_struct *work)
>>>>>>>>>>>>>> struct amdgpu_vcn_inst *vcn_inst =
>>>>>>>>>>>>>> container_of(work, struct amdgpu_vcn_inst, idle_work.work);
>>>>>>>>>>>>>> struct amdgpu_device *adev = vcn_inst->adev;
>>>>>>>>>>>>>> - unsigned int fences = 0, fence[AMDGPU_MAX_VCN_INSTANCES] = {0};
>>>>>>>>>>>>>> - unsigned int i = vcn_inst->inst, j;
>>>>>>>>>>>>>> + unsigned int total_fences = 0, fence[AMDGPU_MAX_VCN_INSTANCES] = {0};
>>>>>>>>>>>>>> + unsigned int i, j;
>>>>>>>>>>>>>> int r = 0;
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - if (adev->vcn.harvest_config & (1 << i))
>>>>>>>>>>>>>> + if (adev->vcn.harvest_config & (1 << vcn_inst->inst))
>>>>>>>>>>>>>> return;
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - for (j = 0; j < adev->vcn.inst[i].num_enc_rings; ++j)
>>>>>>>>>>>>>> - fence[i] += amdgpu_fence_count_emitted(&vcn_inst->ring_enc[j]);
>>>>>>>>>>>>>> + for (i = 0; i < adev->vcn.num_vcn_inst; ++i) {
>>>>>>>>>>>>>> + struct amdgpu_vcn_inst *v = &adev->vcn.inst[i];
>>>>>>>>>>>>>> +
>>>>>>>>>>>>>> + for (j = 0; j < v->num_enc_rings; ++j)
>>>>>>>>>>>>>> + fence[i] += amdgpu_fence_count_emitted(&v->ring_enc[j]);
>>>>>>>>>>>>>> + fence[i] += amdgpu_fence_count_emitted(&v->ring_dec);
>>>>>>>>>>>>>> + total_fences += fence[i];
>>>>>>>>>>>>>> + }
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> /* Only set DPG pause for VCN3 or below, VCN4 and above will be handled by FW */
>>>>>>>>>>>>>> if (adev->pg_flags & AMD_PG_SUPPORT_VCN_DPG &&
>>>>>>>>>>>>>> - !adev->vcn.inst[i].using_unified_queue) {
>>>>>>>>>>>>>> + !vcn_inst->using_unified_queue) {
>>>>>>>>>>>>>> struct dpg_pause_state new_state;
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> if (fence[i] ||
>>>>>>>>>>>>>> @@ -436,18 +442,18 @@ static void amdgpu_vcn_idle_work_handler(struct work_struct *work)
>>>>>>>>>>>>>> else
>>>>>>>>>>>>>> new_state.fw_based = VCN_DPG_STATE__UNPAUSE;
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - adev->vcn.inst[i].pause_dpg_mode(vcn_inst, &new_state);
>>>>>>>>>>>>>> + vcn_inst->pause_dpg_mode(vcn_inst, &new_state);
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - fence[i] += amdgpu_fence_count_emitted(&vcn_inst->ring_dec);
>>>>>>>>>>>>>> - fences += fence[i];
>>>>>>>>>>>>>> -
>>>>>>>>>>>>>> - if (!fences && !atomic_read(&vcn_inst->total_submission_cnt)) {
>>>>>>>>>>>>>> + if (!fence[vcn_inst->inst] && !atomic_read(&vcn_inst->total_submission_cnt)) {
>>>>>>>>>>>>>> + /* This is specific to this instance */
>>>>>>>>>>>>>> mutex_lock(&vcn_inst->vcn_pg_lock);
>>>>>>>>>>>>>> vcn_inst->set_pg_state(vcn_inst, AMD_PG_STATE_GATE);
>>>>>>>>>>>>>> mutex_unlock(&vcn_inst->vcn_pg_lock);
>>>>>>>>>>>>>> mutex_lock(&adev->vcn.workload_profile_mutex);
>>>>>>>>>>>>>> - if (adev->vcn.workload_profile_active) {
>>>>>>>>>>>>>> + /* This is global and depends on all VCN instances */
>>>>>>>>>>>>>> + if (adev->vcn.workload_profile_active && !total_fences &&
>>>>>>>>>>>>>> + !atomic_read(&adev->vcn.total_submission_cnt)) {
>>>>>>>>>>>>>> r = amdgpu_dpm_switch_power_profile(adev,
>>>>>>>>>>>>>> PP_SMC_POWER_PROFILE_VIDEO,
>>>>>>>>>>>>>> false);
>>>>>>>>>>>>>> if (r)
>>>>>>>>>>>>>> @@ -467,16 +473,10 @@ void amdgpu_vcn_ring_begin_use(struct amdgpu_ring *ring)
>>>>>>>>>>>>>> int r = 0;
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> atomic_inc(&vcn_inst->total_submission_cnt);
>>>>>>>>>>>>>> + atomic_inc(&adev->vcn.total_submission_cnt);
>>>>>>>>>>>>> move this addition down inside the mutex lock
>>>>>>>>>>>>>> cancel_delayed_work_sync(&vcn_inst->idle_work);
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - /* We can safely return early here because we've cancelled the
>>>>>>>>>>>>>> - * the delayed work so there is no one else to set it to false
>>>>>>>>>>>>>> - * and we don't care if someone else sets it to true.
>>>>>>>>>>>>>> - */
>>>>>>>>>>>>>> - if (adev->vcn.workload_profile_active)
>>>>>>>>>>>>>> - goto pg_lock;
>>>>>>>>>>>>>> -
>>>>>>>>>>>>>> mutex_lock(&adev->vcn.workload_profile_mutex);
>>>>>>>>>>>>> move to here:
>>>>>>>>>>>>> atomic_inc(&adev->vcn.total_submission_cnt);
>>>>>>>>>>>>> I think this should work for multiple instances.
>>>>>>>>>>>> Why does this need to be protected by the mutex?
>>>>>>>>>>> hmm.. OK - no need and it is actually better before the mutex.
>>>>>>>>>>> David
>>>>>>>>>>>> Alex
>>>>>>>>>>>>
>>>>>>>>>>>>> David
>>>>>>>>>>>>>> if (!adev->vcn.workload_profile_active) {
>>>>>>>>>>>>>> r = amdgpu_dpm_switch_power_profile(adev,
>>>>>>>>>>>>>> PP_SMC_POWER_PROFILE_VIDEO,
>>>>>>>>>>>>>> @@ -487,7 +487,6 @@ void amdgpu_vcn_ring_begin_use(struct amdgpu_ring *ring)
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>> mutex_unlock(&adev->vcn.workload_profile_mutex);
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -pg_lock:
>>>>>>>>>>>>>> mutex_lock(&vcn_inst->vcn_pg_lock);
>>>>>>>>>>>>>> vcn_inst->set_pg_state(vcn_inst, AMD_PG_STATE_UNGATE);
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> @@ -528,6 +527,7 @@ void amdgpu_vcn_ring_end_use(struct amdgpu_ring *ring)
>>>>>>>>>>>>>> atomic_dec(&ring->adev->vcn.inst[ring->me].dpg_enc_submission_cnt);
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> atomic_dec(&ring->adev->vcn.inst[ring->me].total_submission_cnt);
>>>>>>>>>>>>>> + atomic_dec(&ring->adev->vcn.total_submission_cnt);
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> schedule_delayed_work(&ring->adev->vcn.inst[ring->me].idle_work,
>>>>>>>>>>>>>> VCN_IDLE_TIMEOUT);
>>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
>>>>>>>>>>>>>> index b3fb1d0e43fc9..febc3ce8641ff 100644
>>>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
>>>>>>>>>>>>>> @@ -352,6 +352,7 @@ struct amdgpu_vcn {
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> uint16_t inst_mask;
>>>>>>>>>>>>>> uint8_t num_inst_per_aid;
>>>>>>>>>>>>>> + atomic_t total_submission_cnt;
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> /* IP reg dump */
>>>>>>>>>>>>>> uint32_t *ip_dump;