[PATCH 1/4] drm/amdgpu/vcn: fix race condition issue for vcn start
James Zhu
jamesz at amd.com
Wed Mar 4 15:09:26 UTC 2020
On 2020-03-04 10:03 a.m., Christian König wrote:
> Am 04.03.20 um 15:57 schrieb James Zhu:
>>
>> On 2020-03-04 3:53 a.m., Christian König wrote:
>>> Am 03.03.20 um 23:48 schrieb James Zhu:
>>>>
>>>> On 2020-03-03 2:03 p.m., James Zhu wrote:
>>>>>
>>>>> On 2020-03-03 1:44 p.m., Christian König wrote:
>>>>>> Am 03.03.20 um 19:16 schrieb James Zhu:
>>>>>>> Fix race condition issue when multiple vcn starts are called.
>>>>>>>
>>>>>>> Signed-off-by: James Zhu <James.Zhu at amd.com>
>>>>>>> ---
>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c | 4 ++++
>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h | 1 +
>>>>>>> 2 files changed, 5 insertions(+)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
>>>>>>> index f96464e..aa7663f 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
>>>>>>> @@ -63,6 +63,7 @@ int amdgpu_vcn_sw_init(struct amdgpu_device *adev)
>>>>>>> int i, r;
>>>>>>> INIT_DELAYED_WORK(&adev->vcn.idle_work, amdgpu_vcn_idle_work_handler);
>>>>>>> + mutex_init(&adev->vcn.vcn_pg_lock);
>>>>>>> switch (adev->asic_type) {
>>>>>>> case CHIP_RAVEN:
>>>>>>> @@ -210,6 +211,7 @@ int amdgpu_vcn_sw_fini(struct amdgpu_device *adev)
>>>>>>> }
>>>>>>> release_firmware(adev->vcn.fw);
>>>>>>> + mutex_destroy(&adev->vcn.vcn_pg_lock);
>>>>>>> return 0;
>>>>>>> }
>>>>>>> @@ -321,6 +323,7 @@ void amdgpu_vcn_ring_begin_use(struct amdgpu_ring *ring)
>>>>>>> struct amdgpu_device *adev = ring->adev;
>>>>>>> bool set_clocks = !cancel_delayed_work_sync(&adev->vcn.idle_work);
>>>>>>> + mutex_lock(&adev->vcn.vcn_pg_lock);
>>>>>>
>>>>>> That still won't work correctly here.
>>>>>>
>>>>>> The whole idea of the cancel_delayed_work_sync() and
>>>>>> schedule_delayed_work() dance is that you have exactly one user
>>>>>> of that. If you have multiple rings, that whole thing won't work
>>>>>> correctly.
>>>>>>
>>>>>> To fix this you need to call mutex_lock() before
>>>>>> cancel_delayed_work_sync() and schedule_delayed_work() before
>>>>>> mutex_unlock().
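
For illustration, the ordering suggested above would look roughly like the
sketch below. The vcn_pg_lock and idle_work fields come from the posted diff
and VCN_IDLE_TIMEOUT is the existing idle delay; the power-up and DPG handling
are elided, so this is only a sketch of the locking shape, not the actual patch.

void amdgpu_vcn_ring_begin_use(struct amdgpu_ring *ring)
{
        struct amdgpu_device *adev = ring->adev;

        mutex_lock(&adev->vcn.vcn_pg_lock);             /* take the lock first ... */
        cancel_delayed_work_sync(&adev->vcn.idle_work); /* ... then cancel the idle work */
        /* power-up and DPG pause handling would go here */
        mutex_unlock(&adev->vcn.vcn_pg_lock);
}

void amdgpu_vcn_ring_end_use(struct amdgpu_ring *ring)
{
        struct amdgpu_device *adev = ring->adev;

        mutex_lock(&adev->vcn.vcn_pg_lock);
        /* re-arm the idle work before dropping the lock */
        schedule_delayed_work(&adev->vcn.idle_work, VCN_IDLE_TIMEOUT);
        mutex_unlock(&adev->vcn.vcn_pg_lock);
}
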
>>>>>
>>>>> A big lock definitely works. I am trying to use as small a lock as
>>>>> possible here. The shared resources which need protection here are
>>>>> the power gate process and the DPG mode switch process.
>>>>>
>>>>> If we move mutex_unlock() to before schedule_delayed_work(), I am
>>>>> wondering what other resources would still need protection.
>>>>
>>>> By the way, cancel_delayed_work_sync() is itself safe for multiple
>>>> threads, so I didn't put it into the protected area.
>>>
>>> Yeah, that's correct, but it still won't work correctly :)
>>>
>>> See, the problem is that cancel_delayed_work_sync() returns true only
>>> for the first caller, because that is the call which actually canceled
>>> the delayed work.
>>
>> If the 1st caller gets true, the 2nd caller unfortunately may miss
>> this pending status, so it will ungate the power, which is unexpected.
>>
>> But the power gate/ungate function maintains a power state internally,
>> so this miss won't really trigger another power ungate.
>>
>> So I think cancel_delayed_work_sync() / schedule_delayed_work() don't
>> necessarily need to be protected here.
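
For reference, the power-state guard referred to above is roughly of this
shape; the function name is generic and the cur_state field is assumed here,
and the real per-ASIC handlers differ in detail.

static int vcn_set_powergating_state(void *handle,
                                     enum amd_powergating_state state)
{
        struct amdgpu_device *adev = (struct amdgpu_device *)handle;

        /* A redundant request from a second begin_use caller is a no-op
         * because the current power state is remembered here.
         */
        if (state == adev->vcn.cur_state)
                return 0;

        /* actual gate/ungate sequence elided */

        adev->vcn.cur_state = state;
        return 0;
}
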
>
> Ok that could work as well.
>
> But in this case I would remove checking the return value of
> cancel_delayed_work_sync() and just always ungate the power.
>
> This way we prevent ugly bugs from leaking in when this really races
> sometimes.
Sure. Thanks!
James
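
For illustration, dropping the return-value check as suggested above would
make begin_use look roughly like this; DPG handling and error paths are
elided, so treat it as a sketch rather than the final patch.

void amdgpu_vcn_ring_begin_use(struct amdgpu_ring *ring)
{
        struct amdgpu_device *adev = ring->adev;

        cancel_delayed_work_sync(&adev->vcn.idle_work);

        mutex_lock(&adev->vcn.vcn_pg_lock);
        /* Always request power-up; the PG state tracking turns a redundant
         * request into a no-op, so the cancel_delayed_work_sync() return
         * value no longer matters.
         */
        amdgpu_gfx_off_ctrl(adev, false);
        amdgpu_device_ip_set_powergating_state(adev, AMD_IP_BLOCK_TYPE_VCN,
                                               AMD_PG_STATE_UNGATE);
        /* DPG pause handling elided */
        mutex_unlock(&adev->vcn.vcn_pg_lock);
}
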
>
> Regards,
> Christian.
>
>>
>> Best Regards!
>>
>> James
>>
>>>
>>> For all others it returns false and those would then think that they
>>> need to ungate the power.
>>>
>>> The only solution I see is to either put both the
>>> cancel_delayed_work_sync() and schedule_delayed_work() under the
>>> same mutex protection or start to use an atomic or other counter to
>>> note concurrent processing.
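
As an aside, the counter alternative mentioned above could be sketched like
this; the vcn_active_cnt field is hypothetical and not part of the posted patch.

void amdgpu_vcn_ring_begin_use(struct amdgpu_ring *ring)
{
        struct amdgpu_device *adev = ring->adev;

        atomic_inc(&adev->vcn.vcn_active_cnt);          /* note one more active user */
        cancel_delayed_work_sync(&adev->vcn.idle_work);
        /* power-up and DPG handling elided */
}

void amdgpu_vcn_ring_end_use(struct amdgpu_ring *ring)
{
        struct amdgpu_device *adev = ring->adev;

        /* Only the last user re-arms the idle work, so concurrent
         * begin/end pairs cannot race the power-down.
         */
        if (atomic_dec_and_test(&adev->vcn.vcn_active_cnt))
                schedule_delayed_work(&adev->vcn.idle_work, VCN_IDLE_TIMEOUT);
}
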
>>>
>>>> Power gating is shared by all VCN IP instances and different rings,
>>>> so it needs to be put into the protected area.
>>>>
>>>> Each ring's own jobs are serialized by the scheduler, so they don't
>>>> need to be put into this protected area.
>>>
>>> Yes, those should work as expected.
>>>
>>> Regards,
>>> Christian.
>>>
>>>>
>>>>>
>>>>> Thanks!
>>>>>
>>>>> James
>>>>>
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>>> if (set_clocks) {
>>>>>>> amdgpu_gfx_off_ctrl(adev, false);
>>>>>>> amdgpu_device_ip_set_powergating_state(adev, AMD_IP_BLOCK_TYPE_VCN,
>>>>>>> @@ -345,6 +348,7 @@ void amdgpu_vcn_ring_begin_use(struct amdgpu_ring *ring)
>>>>>>> adev->vcn.pause_dpg_mode(adev, ring->me, &new_state);
>>>>>>> }
>>>>>>> + mutex_unlock(&adev->vcn.vcn_pg_lock);
>>>>>>> }
>>>>>>> void amdgpu_vcn_ring_end_use(struct amdgpu_ring *ring)
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
>>>>>>> index 6fe0573..2ae110d 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
>>>>>>> @@ -200,6 +200,7 @@ struct amdgpu_vcn {
>>>>>>> struct drm_gpu_scheduler *vcn_dec_sched[AMDGPU_MAX_VCN_INSTANCES];
>>>>>>> uint32_t num_vcn_enc_sched;
>>>>>>> uint32_t num_vcn_dec_sched;
>>>>>>> + struct mutex vcn_pg_lock;
>>>>>>> unsigned harvest_config;
>>>>>>> int (*pause_dpg_mode)(struct amdgpu_device *adev,
>>>>>>
>>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx at lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>
>