[PATCH 1/4] drm/amdgpu/vcn: fix race condition issue for vcn start
Christian König
ckoenig.leichtzumerken at gmail.com
Wed Mar 4 15:03:03 UTC 2020
Am 04.03.20 um 15:57 schrieb James Zhu:
>
> On 2020-03-04 3:53 a.m., Christian König wrote:
>> Am 03.03.20 um 23:48 schrieb James Zhu:
>>>
>>> On 2020-03-03 2:03 p.m., James Zhu wrote:
>>>>
>>>> On 2020-03-03 1:44 p.m., Christian König wrote:
>>>>> Am 03.03.20 um 19:16 schrieb James Zhu:
>>>>>> Fix race condition issue when multiple vcn starts are called.
>>>>>>
>>>>>> Signed-off-by: James Zhu <James.Zhu at amd.com>
>>>>>> ---
>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c | 4 ++++
>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h | 1 +
>>>>>> 2 files changed, 5 insertions(+)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
>>>>>> index f96464e..aa7663f 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
>>>>>> @@ -63,6 +63,7 @@ int amdgpu_vcn_sw_init(struct amdgpu_device *adev)
>>>>>> int i, r;
>>>>>> INIT_DELAYED_WORK(&adev->vcn.idle_work,
>>>>>> amdgpu_vcn_idle_work_handler);
>>>>>> + mutex_init(&adev->vcn.vcn_pg_lock);
>>>>>> switch (adev->asic_type) {
>>>>>> case CHIP_RAVEN:
>>>>>> @@ -210,6 +211,7 @@ int amdgpu_vcn_sw_fini(struct amdgpu_device
>>>>>> *adev)
>>>>>> }
>>>>>> release_firmware(adev->vcn.fw);
>>>>>> + mutex_destroy(&adev->vcn.vcn_pg_lock);
>>>>>> return 0;
>>>>>> }
>>>>>> @@ -321,6 +323,7 @@ void amdgpu_vcn_ring_begin_use(struct
>>>>>> amdgpu_ring *ring)
>>>>>> struct amdgpu_device *adev = ring->adev;
>>>>>> bool set_clocks =
>>>>>> !cancel_delayed_work_sync(&adev->vcn.idle_work);
>>>>>> + mutex_lock(&adev->vcn.vcn_pg_lock);
>>>>>
>>>>> That still won't work correctly here.
>>>>>
>>>>> The whole idea of the cancel_delayed_work_sync() and
>>>>> schedule_delayed_work() dance is that you have exactly one user of
>>>>> that. If you have multiple rings that whole thing won't work
>>>>> correctly.
>>>>>
>>>>> To fix this you need to call mutex_lock() before
>>>>> cancel_delayed_work_sync() and schedule_delayed_work() before
>>>>> mutex_unlock().
>>>>
>>>> Big lock definitely works. I am trying to use as smaller lock as
>>>> possible here. the share resource which needs protect here are
>>>> power gate process and dpg mode switch process.
>>>>
>>>> if we move mutex_unlock() before schedule_delayed_work(. I am
>>>> wondering what are the other necessary resources which need protect.
>>>
>>> By the way, cancel_delayed_work_sync() supports multiple thread
>>> itself, so I didn't put it into protection area.
>>
>> Yeah, but that's correct but it still won't working correctly :)
>>
>> See the problem is that only for the first caller
>> cancel_delayed_work_sync() returns true because it canceled the
>> delayed work.
>
> if the 1st caller gets true. the 2nd caller unfortunately may miss
> this pending status, so it will ungate the power which is unexpected.
>
> But in power gate/ungate function, a power state is maintained, so
> this miss won't be really triggered to ungate the power.
>
> So I think cancel_delayed_work_sync() / schedule_delayed_work() are
> not necessary be protected here.
Ok that could work as well.
But in this case I would remove checking the return value of
cancel_delayed_work_sync() and just always ungate the power.
This way we prevent ugly bugs from leaking in when this really races
sometimes.
Regards,
Christian.
>
> Best Regards!
>
> James
>
>>
>> For all others it returns false and those would then think that they
>> need to ungate the power.
>>
>> The only solution I see is to either put both the
>> cancel_delayed_work_sync() and schedule_delayed_work() under the same
>> mutex protection or start to use an atomic or other counter to note
>> concurrent processing.
>>
>>> power gate is shared by all VCN IP instances and different rings ,
>>> so it needs be put into protection area.
>>>
>>> each ring's job itself is serialized by scheduler. it doesn't need
>>> be put into this protection area.
>>
>> Yes, those should work as expected.
>>
>> Regards,
>> Christian.
>>
>>>
>>>>
>>>> Thanks!
>>>>
>>>> James
>>>>
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>> if (set_clocks) {
>>>>>> amdgpu_gfx_off_ctrl(adev, false);
>>>>>> amdgpu_device_ip_set_powergating_state(adev,
>>>>>> AMD_IP_BLOCK_TYPE_VCN,
>>>>>> @@ -345,6 +348,7 @@ void amdgpu_vcn_ring_begin_use(struct
>>>>>> amdgpu_ring *ring)
>>>>>> adev->vcn.pause_dpg_mode(adev, ring->me, &new_state);
>>>>>> }
>>>>>> + mutex_unlock(&adev->vcn.vcn_pg_lock);
>>>>>> }
>>>>>> void amdgpu_vcn_ring_end_use(struct amdgpu_ring *ring)
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
>>>>>> index 6fe0573..2ae110d 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
>>>>>> @@ -200,6 +200,7 @@ struct amdgpu_vcn {
>>>>>> struct drm_gpu_scheduler
>>>>>> *vcn_dec_sched[AMDGPU_MAX_VCN_INSTANCES];
>>>>>> uint32_t num_vcn_enc_sched;
>>>>>> uint32_t num_vcn_dec_sched;
>>>>>> + struct mutex vcn_pg_lock;
>>>>>> unsigned harvest_config;
>>>>>> int (*pause_dpg_mode)(struct amdgpu_device *adev,
>>>>>
>>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
More information about the amd-gfx
mailing list