[PATCH 1/4] drm/amdgpu/vcn: fix race condition issue for vcn start
James Zhu
jamesz at amd.com
Wed Mar 4 14:57:15 UTC 2020
On 2020-03-04 3:53 a.m., Christian König wrote:
> Am 03.03.20 um 23:48 schrieb James Zhu:
>>
>> On 2020-03-03 2:03 p.m., James Zhu wrote:
>>>
>>> On 2020-03-03 1:44 p.m., Christian König wrote:
>>>> Am 03.03.20 um 19:16 schrieb James Zhu:
>>>>> Fix a race condition when multiple vcn starts are called.
>>>>>
>>>>> Signed-off-by: James Zhu <James.Zhu at amd.com>
>>>>> ---
>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c | 4 ++++
>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h | 1 +
>>>>> 2 files changed, 5 insertions(+)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
>>>>> index f96464e..aa7663f 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
>>>>> @@ -63,6 +63,7 @@ int amdgpu_vcn_sw_init(struct amdgpu_device *adev)
>>>>> int i, r;
>>>>> INIT_DELAYED_WORK(&adev->vcn.idle_work, amdgpu_vcn_idle_work_handler);
>>>>> + mutex_init(&adev->vcn.vcn_pg_lock);
>>>>> switch (adev->asic_type) {
>>>>> case CHIP_RAVEN:
>>>>> @@ -210,6 +211,7 @@ int amdgpu_vcn_sw_fini(struct amdgpu_device *adev)
>>>>> }
>>>>> release_firmware(adev->vcn.fw);
>>>>> + mutex_destroy(&adev->vcn.vcn_pg_lock);
>>>>> return 0;
>>>>> }
>>>>> @@ -321,6 +323,7 @@ void amdgpu_vcn_ring_begin_use(struct amdgpu_ring *ring)
>>>>> struct amdgpu_device *adev = ring->adev;
>>>>> bool set_clocks = !cancel_delayed_work_sync(&adev->vcn.idle_work);
>>>>> + mutex_lock(&adev->vcn.vcn_pg_lock);
>>>>
>>>> That still won't work correctly here.
>>>>
>>>> The whole idea of the cancel_delayed_work_sync() and
>>>> schedule_delayed_work() dance is that you have exactly one user of
>>>> that. If you have multiple rings, that whole thing won't work
>>>> correctly.
>>>>
>>>> To fix this you need to call mutex_lock() before
>>>> cancel_delayed_work_sync() and schedule_delayed_work() before
>>>> mutex_unlock().
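(For illustration, a rough sketch of the scheme Christian describes. This
is hypothetical, not the actual patch; VCN_IDLE_TIMEOUT is the driver's
existing idle delay, and the gating bodies are elided:)

    void amdgpu_vcn_ring_begin_use(struct amdgpu_ring *ring)
    {
            struct amdgpu_device *adev = ring->adev;

            /* take the lock first, so only one ring at a time runs the
             * cancel/ungate sequence */
            mutex_lock(&adev->vcn.vcn_pg_lock);
            if (!cancel_delayed_work_sync(&adev->vcn.idle_work)) {
                    /* no idle work was pending: ungate power, set up
                     * DPG mode, ... */
            }
            /* lock stays held until end_use */
    }

    void amdgpu_vcn_ring_end_use(struct amdgpu_ring *ring)
    {
            struct amdgpu_device *adev = ring->adev;

            /* re-arm the idle handler before anyone else can cancel it */
            schedule_delayed_work(&adev->vcn.idle_work, VCN_IDLE_TIMEOUT);
            mutex_unlock(&adev->vcn.vcn_pg_lock);
    }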
>>>
>>> A big lock definitely works. I am trying to use as small a lock as
>>> possible here. The shared resources which need protection here are the
>>> power gate process and the DPG mode switch process.
>>>
>>> If we move mutex_unlock() before schedule_delayed_work(), I am
>>> wondering what other resources would still need protection.
>>
>> By the way, cancel_delayed_work_sync() itself supports being called
>> from multiple threads, so I didn't put it into the protected area.
>
> Yeah, that's correct, but it still won't work correctly :)
>
> See the problem is that only for the first caller
> cancel_delayed_work_sync() returns true because it canceled the
> delayed work.
If the 1st caller gets true, the 2nd caller may unfortunately miss this
pending status, so it will try to ungate the power, which is unexpected.
But the power gate/ungate functions maintain a power state, so this miss
won't really trigger an extra ungate of the power.
So I think cancel_delayed_work_sync() / schedule_delayed_work() don't
need to be protected here.
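Roughly, the state check being relied on here sits in the
set_powergating_state path (a sketch only; the exact names may differ
from the real handlers):

    static int vcn_set_powergating_state(void *handle,
                                         enum amd_powergating_state state)
    {
            struct amdgpu_device *adev = (struct amdgpu_device *)handle;

            /* already in the requested state: the 2nd caller's
             * redundant ungate becomes a no-op */
            if (state == adev->vcn.cur_state)
                    return 0;

            /* ... actually gate or ungate the hardware ... */

            adev->vcn.cur_state = state;
            return 0;
    }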
Best Regards!
James
>
> For all others it returns false and those would then think that they
> need to ungate the power.
>
> The only solution I see is to either put both the
> cancel_delayed_work_sync() and schedule_delayed_work() under the same
> mutex protection or start to use an atomic or other counter to track
> concurrent processing.
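(The counter variant Christian mentions could look roughly like this; a
sketch only, and total_submission_cnt is a hypothetical field name:)

    /* begin_use: note the in-flight submission before canceling */
    atomic_inc(&adev->vcn.total_submission_cnt);
    cancel_delayed_work_sync(&adev->vcn.idle_work);

    /* end_use: drop the count, then re-arm the idle handler */
    atomic_dec(&adev->vcn.total_submission_cnt);
    schedule_delayed_work(&adev->vcn.idle_work, VCN_IDLE_TIMEOUT);

    /* idle work handler: only gate power when nothing is in flight */
    if (!atomic_read(&adev->vcn.total_submission_cnt))
            amdgpu_device_ip_set_powergating_state(adev,
                                                   AMD_IP_BLOCK_TYPE_VCN,
                                                   AMD_PG_STATE_GATE);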
>
>> Power gating is shared by all VCN IP instances and different rings, so
>> it needs to be put into the protected area.
>>
>> Each ring's job itself is serialized by the scheduler, so it doesn't
>> need to be put into this protected area.
>
> Yes, those should work as expected.
>
> Regards,
> Christian.
>
>>
>>>
>>> Thanks!
>>>
>>> James
>>>
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>> if (set_clocks) {
>>>>> amdgpu_gfx_off_ctrl(adev, false);
>>>>> amdgpu_device_ip_set_powergating_state(adev, AMD_IP_BLOCK_TYPE_VCN,
>>>>> @@ -345,6 +348,7 @@ void amdgpu_vcn_ring_begin_use(struct amdgpu_ring *ring)
>>>>> adev->vcn.pause_dpg_mode(adev, ring->me, &new_state);
>>>>> }
>>>>> + mutex_unlock(&adev->vcn.vcn_pg_lock);
>>>>> }
>>>>> void amdgpu_vcn_ring_end_use(struct amdgpu_ring *ring)
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
>>>>> index 6fe0573..2ae110d 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
>>>>> @@ -200,6 +200,7 @@ struct amdgpu_vcn {
>>>>> struct drm_gpu_scheduler *vcn_dec_sched[AMDGPU_MAX_VCN_INSTANCES];
>>>>> uint32_t num_vcn_enc_sched;
>>>>> uint32_t num_vcn_dec_sched;
>>>>> + struct mutex vcn_pg_lock;
>>>>> unsigned harvest_config;
>>>>> int (*pause_dpg_mode)(struct amdgpu_device *adev,
>>>>
>