[PATCH 1/4] drm/amdgpu/vcn: fix race condition issue for vcn start

Tue Mar 3 22:48:08 UTC 2020

On 2020-03-03 2:03 p.m., James Zhu wrote:
>
> On 2020-03-03 1:44 p.m., Christian König wrote:
>> Am 03.03.20 um 19:16 schrieb James Zhu:
>>> Fix race condition issue when multiple vcn starts are called.
>>>
>>> Signed-off-by: James Zhu <James.Zhu at amd.com>
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c | 4 ++++
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h | 1 +
>>>   2 files changed, 5 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
>>> index f96464e..aa7663f 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
>>> @@ -63,6 +63,7 @@ int amdgpu_vcn_sw_init(struct amdgpu_device *adev)
>>>       int i, r;
>>>         INIT_DELAYED_WORK(&adev->vcn.idle_work, 
>>> amdgpu_vcn_idle_work_handler);
>>> +    mutex_init(&adev->vcn.vcn_pg_lock);
>>>         switch (adev->asic_type) {
>>>       case CHIP_RAVEN:
>>> @@ -210,6 +211,7 @@ int amdgpu_vcn_sw_fini(struct amdgpu_device *adev)
>>>       }
>>>         release_firmware(adev->vcn.fw);
>>> +    mutex_destroy(&adev->vcn.vcn_pg_lock);
>>>         return 0;
>>>   }
>>> @@ -321,6 +323,7 @@ void amdgpu_vcn_ring_begin_use(struct 
>>> amdgpu_ring *ring)
>>>       struct amdgpu_device *adev = ring->adev;
>>>       bool set_clocks = 
>>> !cancel_delayed_work_sync(&adev->vcn.idle_work);
>>>   +    mutex_lock(&adev->vcn.vcn_pg_lock);
>>
>> That still won't work correctly here.
>>
>> The whole idea of the cancel_delayed_work_sync() and 
>> schedule_delayed_work() dance is that you have exactly one user of 
>> that. If you have multiple rings that whole thing won't work correctly.
>>
>> To fix this you need to call mutex_lock() before 
>> cancel_delayed_work_sync() and schedule_delayed_work() before 
>> mutex_unlock().
>
> A big lock definitely works. I am trying to use as small a lock as 
> possible here. The shared resources that need protection here are the 
> power gate process and the DPG mode switch process.
>
> If we move mutex_unlock() before schedule_delayed_work(), I am 
> wondering what other resources would need protection.

By the way, cancel_delayed_work_sync() itself supports multiple threads, 
so I didn't put it inside the protected area. The power gate is shared by 
all VCN IP instances and the different rings, so it needs to be inside 
the protected area.

Each ring's jobs are serialized by the scheduler, so they don't need to 
be put inside the protected area.
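
To illustrate what I mean by the protected area, here is a minimal 
userspace sketch, not driver code. It assumes a pthread mutex in place of 
the kernel mutex, and every name in it (vcn_model, vcn_model_begin_use, 
vcn_model_run) is made up for illustration. It models only the shared 
power-gate transition: several "rings" start concurrently and the lock 
makes sure the ungate step happens once. It does not model the delayed 
work ordering that Christian raised.

```c
/*
 * Userspace model only. A pthread mutex stands in for the kernel mutex,
 * and all names below are hypothetical, not amdgpu code.
 */
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

#define MODEL_MAX_RINGS 16

struct vcn_model {
	pthread_mutex_t pg_lock;	/* stands in for vcn_pg_lock */
	bool powered;			/* shared power-gate state */
	int ungate_count;		/* number of real ungate transitions */
};

static void vcn_model_init(struct vcn_model *vcn)
{
	pthread_mutex_init(&vcn->pg_lock, NULL);
	vcn->powered = false;
	vcn->ungate_count = 0;
}

/*
 * Models the locked region of a begin_use path: only the shared
 * power-gate transition sits inside the lock.
 */
static void vcn_model_begin_use(struct vcn_model *vcn)
{
	pthread_mutex_lock(&vcn->pg_lock);
	if (!vcn->powered) {
		vcn->powered = true;
		vcn->ungate_count++;
	}
	pthread_mutex_unlock(&vcn->pg_lock);
}

static void *ring_thread(void *arg)
{
	vcn_model_begin_use(arg);
	return NULL;
}

/* Starts nthreads "rings" concurrently and returns the ungate count. */
static int vcn_model_run(int nthreads)
{
	struct vcn_model vcn;
	pthread_t tid[MODEL_MAX_RINGS];
	int i;

	assert(nthreads > 0 && nthreads <= MODEL_MAX_RINGS);
	vcn_model_init(&vcn);
	for (i = 0; i < nthreads; i++)
		pthread_create(&tid[i], NULL, ring_thread, &vcn);
	for (i = 0; i < nthreads; i++)
		pthread_join(tid[i], NULL);
	pthread_mutex_destroy(&vcn.pg_lock);
	return vcn.ungate_count;
}
```

Calling vcn_model_run(4) from a small main() returns 1: four rings 
started, one ungate. Without the lock, the check and the update of 
"powered" could interleave between threads.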

>
> Thanks!
>
> James
>
>>
>> Regards,
>> Christian.
>>
>>>       if (set_clocks) {
>>>           amdgpu_gfx_off_ctrl(adev, false);
>>>           amdgpu_device_ip_set_powergating_state(adev, 
>>> AMD_IP_BLOCK_TYPE_VCN,
>>> @@ -345,6 +348,7 @@ void amdgpu_vcn_ring_begin_use(struct 
>>> amdgpu_ring *ring)
>>>             adev->vcn.pause_dpg_mode(adev, ring->me, &new_state);
>>>       }
>>> +    mutex_unlock(&adev->vcn.vcn_pg_lock);
>>>   }
>>>     void amdgpu_vcn_ring_end_use(struct amdgpu_ring *ring)
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
>>> index 6fe0573..2ae110d 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
>>> @@ -200,6 +200,7 @@ struct amdgpu_vcn {
>>>       struct drm_gpu_scheduler 
>>> *vcn_dec_sched[AMDGPU_MAX_VCN_INSTANCES];
>>>       uint32_t         num_vcn_enc_sched;
>>>       uint32_t         num_vcn_dec_sched;
>>> +    struct mutex         vcn_pg_lock;
>>>         unsigned    harvest_config;
>>>       int (*pause_dpg_mode)(struct amdgpu_device *adev,
>>