Re: 回复: 回复: [PATCH v2] drm/amdgpu: Fix a race of IB test

Lazar, Lijo lijo.lazar at amd.com
Mon Sep 13 13:50:18 UTC 2021


Thanks for the clarification Xinhui.

Based on Christian's explanation, what I understood is - this is an 
exceptional case in debugfs calls and the other goal is to avoid 
maintenance of one more lock just to support this API. I no longer have 
any issues with this approach.

Thanks,
Lijo

On 9/13/2021 12:53 PM, Pan, Xinhui wrote:
> [AMD Official Use Only]
> 
> Of source IB test can hang the GPU.
> But it wait fence with one specific timeout. and it not depends on gpu scheduler.
> So IB test must can return.
> 
> ________________________________________
> 发件人: Lazar, Lijo <Lijo.Lazar at amd.com>
> 发送时间: 2021年9月13日 15:15
> 收件人: Christian König; Koenig, Christian; Pan, Xinhui; amd-gfx at lists.freedesktop.org
> 抄送: Deucher, Alexander
> 主题: Re: 回复: [PATCH v2] drm/amdgpu: Fix a race of IB test
> 
> 
> 
> On 9/13/2021 12:21 PM, Christian König wrote:
>> Keep in mind that we don't try to avoid contention here. The goal is
>> rather to have as few locks as possible to avoid the extra overhead in
>> the hot path.
>>
>> Contention is completely irrelevant for the debug and device reset since
>> that are rarely occurring events and performance doesn't matter for them.
>>
>> It is perfectly reasonable to take the write side of the reset lock as
>> necessary when we need to make sure that we don't have concurrent device
>> access.
> 
> The original code has down_read which gave the impression that there is
> some protection to avoid access during reset. Basically would like to
> avoid this as a precedence for this sort of usage for any debugfs call.
> Reset semaphore is supposed to be a 'protect all' thing and provides a
> shortcut.
> 
> BTW, question about a hypothetical case - what happens if the test
> itself causes a hang and need to trigger a reset? Will there be chance
> for the lock to be released (whether a submit call will hang
> indefinitely) for the actual reset to be executed?
> 
> Thanks,
> Lijo
> 
>>
>> Regards,
>> Christian.
>>
>> Am 13.09.21 um 08:43 schrieb Lazar, Lijo:
>>> There are other interfaces to emulate the exact reset process, or
>>> atleast this is not the one we are using for doing any sort of reset
>>> through debugfs.
>>>
>>> In any case, the expectation is reset thread takes the write side of
>>> the lock and it's already done somewhere else.
>>>
>>> Reset semaphore is supposed to protect the device from concurrent
>>> access (any sort of resource usage is thus protected by default). Then
>>> the same logic can be applied for any other call and that is not a
>>> reasonable ask.
>>>
>>> Thanks,
>>> Lijo
>>>
>>> On 9/13/2021 12:07 PM, Christian König wrote:
>>>> That's complete nonsense.
>>>>
>>>> The debugfs interface emulates parts of the reset procedure for
>>>> testing and we absolutely need to take the same locks as the reset to
>>>> avoid corruption of the involved objects.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>> Am 13.09.21 um 08:25 schrieb Lazar, Lijo:
>>>>> This is a debugfs interface and adding another writer contention in
>>>>> debugfs over an actual reset is lazy fix. This shouldn't be executed
>>>>> in the first place and should not take precedence over any reset.
>>>>>
>>>>> Thanks,
>>>>> Lijo
>>>>>
>>>>>
>>>>> On 9/13/2021 11:52 AM, Christian König wrote:
>>>>>> NAK, this is not the lazy way to fix it at all.
>>>>>>
>>>>>> The reset semaphore protects the scheduler and ring objects from
>>>>>> concurrent modification, so taking the write side of it is
>>>>>> perfectly valid here.
>>>>>>
>>>>>> Christian.
>>>>>>
>>>>>> Am 13.09.21 um 06:42 schrieb Pan, Xinhui:
>>>>>>> [AMD Official Use Only]
>>>>>>>
>>>>>>> yep, that is a lazy way to fix it.
>>>>>>>
>>>>>>> I am thinking of adding one amdgpu_ring.direct_access_mutex before
>>>>>>> we issue test_ib on each ring.
>>>>>>> ________________________________________
>>>>>>> 发件人: Lazar, Lijo <Lijo.Lazar at amd.com>
>>>>>>> 发送时间: 2021年9月13日 12:00
>>>>>>> 收件人: Pan, Xinhui; amd-gfx at lists.freedesktop.org
>>>>>>> 抄送: Deucher, Alexander; Koenig, Christian
>>>>>>> 主题: Re: [PATCH v2] drm/amdgpu: Fix a race of IB test
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 9/13/2021 5:18 AM, xinhui pan wrote:
>>>>>>>> Direct IB submission should be exclusive. So use write lock.
>>>>>>>>
>>>>>>>> Signed-off-by: xinhui pan <xinhui.pan at amd.com>
>>>>>>>> ---
>>>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 4 ++--
>>>>>>>>     1 file changed, 2 insertions(+), 2 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>>>>>> index 19323b4cce7b..be5d12ed3db1 100644
>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>>>>>> @@ -1358,7 +1358,7 @@ static int
>>>>>>>> amdgpu_debugfs_test_ib_show(struct seq_file *m, void *unused)
>>>>>>>>         }
>>>>>>>>
>>>>>>>>         /* Avoid accidently unparking the sched thread during GPU
>>>>>>>> reset */
>>>>>>>> -     r = down_read_killable(&adev->reset_sem);
>>>>>>>> +     r = down_write_killable(&adev->reset_sem);
>>>>>>> There are many ioctls and debugfs calls which takes this lock and
>>>>>>> as you
>>>>>>> know the purpose is to avoid them while there is a reset. The
>>>>>>> purpose is
>>>>>>> *not to* fix any concurrency issues those calls themselves have
>>>>>>> otherwise and fixing those concurrency issues this way is just
>>>>>>> lazy and
>>>>>>> not acceptable.
>>>>>>>
>>>>>>> This will take away any fairness given to the writer in this rw
>>>>>>> lock and
>>>>>>> that is supposed to be the reset thread.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Lijo
>>>>>>>
>>>>>>>>         if (r)
>>>>>>>>                 return r;
>>>>>>>>
>>>>>>>> @@ -1387,7 +1387,7 @@ static int
>>>>>>>> amdgpu_debugfs_test_ib_show(struct seq_file *m, void *unused)
>>>>>>>>                 kthread_unpark(ring->sched.thread);
>>>>>>>>         }
>>>>>>>>
>>>>>>>> -     up_read(&adev->reset_sem);
>>>>>>>> +     up_write(&adev->reset_sem);
>>>>>>>>
>>>>>>>>         pm_runtime_mark_last_busy(dev->dev);
>>>>>>>>         pm_runtime_put_autosuspend(dev->dev);
>>>>>>>>
>>>>>>
>>>>
>>


More information about the amd-gfx mailing list