[PATCH 1/4] drm/amdgpu/vcn: fix race condition issue for vcn start
James Zhu
jamesz at amd.com
Wed Mar 4 15:09:26 UTC 2020
On 2020-03-04 10:03 a.m., Christian König wrote:
> Am 04.03.20 um 15:57 schrieb James Zhu:
>>
>> On 2020-03-04 3:53 a.m., Christian König wrote:
>>> Am 03.03.20 um 23:48 schrieb James Zhu:
>>>>
>>>> On 2020-03-03 2:03 p.m., James Zhu wrote:
>>>>>
>>>>> On 2020-03-03 1:44 p.m., Christian König wrote:
>>>>>> Am 03.03.20 um 19:16 schrieb James Zhu:
>>>>>>> Fix race condition issue when multiple vcn starts are called.
>>>>>>>
>>>>>>> Signed-off-by: James Zhu <James.Zhu at amd.com>
>>>>>>> ---
>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c | 4 ++++
>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h | 1 +
>>>>>>> 2 files changed, 5 insertions(+)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
>>>>>>> index f96464e..aa7663f 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
>>>>>>> @@ -63,6 +63,7 @@ int amdgpu_vcn_sw_init(struct amdgpu_device *adev)
>>>>>>> int i, r;
>>>>>>> INIT_DELAYED_WORK(&adev->vcn.idle_work, amdgpu_vcn_idle_work_handler);
>>>>>>> + mutex_init(&adev->vcn.vcn_pg_lock);
>>>>>>> switch (adev->asic_type) {
>>>>>>> case CHIP_RAVEN:
>>>>>>> @@ -210,6 +211,7 @@ int amdgpu_vcn_sw_fini(struct amdgpu_device *adev)
>>>>>>> }
>>>>>>> release_firmware(adev->vcn.fw);
>>>>>>> + mutex_destroy(&adev->vcn.vcn_pg_lock);
>>>>>>> return 0;
>>>>>>> }
>>>>>>> @@ -321,6 +323,7 @@ void amdgpu_vcn_ring_begin_use(struct amdgpu_ring *ring)
>>>>>>> struct amdgpu_device *adev = ring->adev;
>>>>>>> bool set_clocks = !cancel_delayed_work_sync(&adev->vcn.idle_work);
>>>>>>> + mutex_lock(&adev->vcn.vcn_pg_lock);
>>>>>>
>>>>>> That still won't work correctly here.
>>>>>>
>>>>>> The whole idea of the cancel_delayed_work_sync() and
>>>>>> schedule_delayed_work() dance is that you have exactly one user
>>>>>> of that. If you have multiple rings, that whole thing won't work
>>>>>> correctly.
>>>>>>
>>>>>> To fix this you need to call mutex_lock() before
>>>>>> cancel_delayed_work_sync() and schedule_delayed_work() before
>>>>>> mutex_unlock().
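
For illustration, the ordering suggested above would look roughly like the
sketch below. The vcn_pg_lock and idle_work fields come from the posted diff
and VCN_IDLE_TIMEOUT is the existing idle delay; the power-up and DPG handling
are elided, so this is only a sketch of the locking shape, not the actual patch.

void amdgpu_vcn_ring_begin_use(struct amdgpu_ring *ring)
{
        struct amdgpu_device *adev = ring->adev;

        mutex_lock(&adev->vcn.vcn_pg_lock);             /* take the lock first ... */
        cancel_delayed_work_sync(&adev->vcn.idle_work); /* ... then cancel the idle work */
        /* power-up and DPG pause handling would go here */
        mutex_unlock(&adev->vcn.vcn_pg_lock);
}

void amdgpu_vcn_ring_end_use(struct amdgpu_ring *ring)
{
        struct amdgpu_device *adev = ring->adev;

        mutex_lock(&adev->vcn.vcn_pg_lock);
        /* re-arm the idle work before dropping the lock */
        schedule_delayed_work(&adev->vcn.idle_work, VCN_IDLE_TIMEOUT);
        mutex_unlock(&adev->vcn.vcn_pg_lock);
}
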
>>>>>
>>>>> A big lock definitely works. I am trying to use as small a lock as
>>>>> possible here. The shared resources which need protection here are
>>>>> the power gate process and the DPG mode switch process.
>>>>>
>>>>> If we move mutex_unlock() to before schedule_delayed_work(), I am
>>>>> wondering what other resources would still need protection.
>>>>
>>>> By the way, cancel_delayed_work_sync() is itself safe for multiple
>>>> threads, so I didn't put it into the protected area.
>>>
>>> Yeah, that's correct, but it still won't work correctly :)
>>>
>>> See, the problem is that cancel_delayed_work_sync() returns true only
>>> for the first caller, because that is the call which actually canceled
>>> the delayed work.
>>
>> If the 1st caller gets true, the 2nd caller unfortunately may miss
>> this pending status, so it will ungate the power, which is unexpected.
>>
>> But the power gate/ungate function maintains a power state internally,
>> so this miss won't really trigger another power ungate.
>>
>> So I think cancel_delayed_work_sync() / schedule_delayed_work() don't
>> necessarily need to be protected here.
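
For reference, the power-state guard referred to above is roughly of this
shape; the function name is generic and the cur_state field is assumed here,
and the real per-ASIC handlers differ in detail.

static int vcn_set_powergating_state(void *handle,
                                     enum amd_powergating_state state)
{
        struct amdgpu_device *adev = (struct amdgpu_device *)handle;

        /* A redundant request from a second begin_use caller is a no-op
         * because the current power state is remembered here.
         */
        if (state == adev->vcn.cur_state)
                return 0;

        /* actual gate/ungate sequence elided */

        adev->vcn.cur_state = state;
        return 0;
}
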
>
> Ok that could work as well.
>
> But in this case I would remove checking the return value of
> cancel_delayed_work_sync() and just always ungate the power.
>
> This way we prevent ugly bugs from leaking in when this really races
> sometimes.
Sure. Thanks!
James
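
For illustration, dropping the return-value check as suggested above would
make begin_use look roughly like this; DPG handling and error paths are
elided, so treat it as a sketch rather than the final patch.

void amdgpu_vcn_ring_begin_use(struct amdgpu_ring *ring)
{
        struct amdgpu_device *adev = ring->adev;

        cancel_delayed_work_sync(&adev->vcn.idle_work);

        mutex_lock(&adev->vcn.vcn_pg_lock);
        /* Always request power-up; the PG state tracking turns a redundant
         * request into a no-op, so the cancel_delayed_work_sync() return
         * value no longer matters.
         */
        amdgpu_gfx_off_ctrl(adev, false);
        amdgpu_device_ip_set_powergating_state(adev, AMD_IP_BLOCK_TYPE_VCN,
                                               AMD_PG_STATE_UNGATE);
        /* DPG pause handling elided */
        mutex_unlock(&adev->vcn.vcn_pg_lock);
}
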
>
> Regards,
> Christian.
>
>>
>> Best Regards!
>>
>> James
>>
>>>
>>> For all others it returns false and those would then think that they
>>> need to ungate the power.
>>>
>>> The only solution I see is to either put both the
>>> cancel_delayed_work_sync() and schedule_delayed_work() under the
>>> same mutex protection or start to use an atomic or other counter to
>>> note concurrent processing.
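
As an aside, the counter alternative mentioned above could be sketched like
this; the vcn_active_cnt field is hypothetical and not part of the posted patch.

void amdgpu_vcn_ring_begin_use(struct amdgpu_ring *ring)
{
        struct amdgpu_device *adev = ring->adev;

        atomic_inc(&adev->vcn.vcn_active_cnt);          /* note one more active user */
        cancel_delayed_work_sync(&adev->vcn.idle_work);
        /* power-up and DPG handling elided */
}

void amdgpu_vcn_ring_end_use(struct amdgpu_ring *ring)
{
        struct amdgpu_device *adev = ring->adev;

        /* Only the last user re-arms the idle work, so concurrent
         * begin/end pairs cannot race the power-down.
         */
        if (atomic_dec_and_test(&adev->vcn.vcn_active_cnt))
                schedule_delayed_work(&adev->vcn.idle_work, VCN_IDLE_TIMEOUT);
}
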
>>>
>>>> Power gating is shared by all VCN IP instances and different rings,
>>>> so it needs to be put into the protected area.
>>>>
>>>> Each ring's own jobs are serialized by the scheduler, so they don't
>>>> need to be put into this protected area.
>>>
>>> Yes, those should work as expected.
>>>
>>> Regards,
>>> Christian.
>>>
>>>>
>>>>>
>>>>> Thanks!
>>>>>
>>>>> James
>>>>>
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>>> if (set_clocks) {
>>>>>>> amdgpu_gfx_off_ctrl(adev, false);
>>>>>>> amdgpu_device_ip_set_powergating_state(adev, AMD_IP_BLOCK_TYPE_VCN,
>>>>>>> @@ -345,6 +348,7 @@ void amdgpu_vcn_ring_begin_use(struct amdgpu_ring *ring)
>>>>>>> adev->vcn.pause_dpg_mode(adev, ring->me, &new_state);
>>>>>>> }
>>>>>>> + mutex_unlock(&adev->vcn.vcn_pg_lock);
>>>>>>> }
>>>>>>> void amdgpu_vcn_ring_end_use(struct amdgpu_ring *ring)
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
>>>>>>> index 6fe0573..2ae110d 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
>>>>>>> @@ -200,6 +200,7 @@ struct amdgpu_vcn {
>>>>>>> struct drm_gpu_scheduler *vcn_dec_sched[AMDGPU_MAX_VCN_INSTANCES];
>>>>>>> uint32_t num_vcn_enc_sched;
>>>>>>> uint32_t num_vcn_dec_sched;
>>>>>>> + struct mutex vcn_pg_lock;
>>>>>>> unsigned harvest_config;
>>>>>>> int (*pause_dpg_mode)(struct amdgpu_device *adev,
>>>>>>
>>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx at lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>
>