回复: 回复: [PATCH v2] drm/amdgpu: Fix a race of IB test

Pan, Xinhui Xinhui.Pan at amd.com
Mon Sep 13 07:23:28 UTC 2021


[AMD Official Use Only]

Of source IB test can hang the GPU.
But it wait fence with one specific timeout. and it not depends on gpu scheduler.
So IB test must can return.

________________________________________
发件人: Lazar, Lijo <Lijo.Lazar at amd.com>
发送时间: 2021年9月13日 15:15
收件人: Christian König; Koenig, Christian; Pan, Xinhui; amd-gfx at lists.freedesktop.org
抄送: Deucher, Alexander
主题: Re: 回复: [PATCH v2] drm/amdgpu: Fix a race of IB test



On 9/13/2021 12:21 PM, Christian König wrote:
> Keep in mind that we don't try to avoid contention here. The goal is
> rather to have as few locks as possible to avoid the extra overhead in
> the hot path.
>
> Contention is completely irrelevant for the debug and device reset since
> that are rarely occurring events and performance doesn't matter for them.
>
> It is perfectly reasonable to take the write side of the reset lock as
> necessary when we need to make sure that we don't have concurrent device
> access.

The original code has down_read which gave the impression that there is
some protection to avoid access during reset. Basically would like to
avoid this as a precedence for this sort of usage for any debugfs call.
Reset semaphore is supposed to be a 'protect all' thing and provides a
shortcut.

BTW, question about a hypothetical case - what happens if the test
itself causes a hang and need to trigger a reset? Will there be chance
for the lock to be released (whether a submit call will hang
indefinitely) for the actual reset to be executed?

Thanks,
Lijo

>
> Regards,
> Christian.
>
> Am 13.09.21 um 08:43 schrieb Lazar, Lijo:
>> There are other interfaces to emulate the exact reset process, or
>> atleast this is not the one we are using for doing any sort of reset
>> through debugfs.
>>
>> In any case, the expectation is reset thread takes the write side of
>> the lock and it's already done somewhere else.
>>
>> Reset semaphore is supposed to protect the device from concurrent
>> access (any sort of resource usage is thus protected by default). Then
>> the same logic can be applied for any other call and that is not a
>> reasonable ask.
>>
>> Thanks,
>> Lijo
>>
>> On 9/13/2021 12:07 PM, Christian König wrote:
>>> That's complete nonsense.
>>>
>>> The debugfs interface emulates parts of the reset procedure for
>>> testing and we absolutely need to take the same locks as the reset to
>>> avoid corruption of the involved objects.
>>>
>>> Regards,
>>> Christian.
>>>
>>> Am 13.09.21 um 08:25 schrieb Lazar, Lijo:
>>>> This is a debugfs interface and adding another writer contention in
>>>> debugfs over an actual reset is lazy fix. This shouldn't be executed
>>>> in the first place and should not take precedence over any reset.
>>>>
>>>> Thanks,
>>>> Lijo
>>>>
>>>>
>>>> On 9/13/2021 11:52 AM, Christian König wrote:
>>>>> NAK, this is not the lazy way to fix it at all.
>>>>>
>>>>> The reset semaphore protects the scheduler and ring objects from
>>>>> concurrent modification, so taking the write side of it is
>>>>> perfectly valid here.
>>>>>
>>>>> Christian.
>>>>>
>>>>> Am 13.09.21 um 06:42 schrieb Pan, Xinhui:
>>>>>> [AMD Official Use Only]
>>>>>>
>>>>>> yep, that is a lazy way to fix it.
>>>>>>
>>>>>> I am thinking of adding one amdgpu_ring.direct_access_mutex before
>>>>>> we issue test_ib on each ring.
>>>>>> ________________________________________
>>>>>> 发件人: Lazar, Lijo <Lijo.Lazar at amd.com>
>>>>>> 发送时间: 2021年9月13日 12:00
>>>>>> 收件人: Pan, Xinhui; amd-gfx at lists.freedesktop.org
>>>>>> 抄送: Deucher, Alexander; Koenig, Christian
>>>>>> 主题: Re: [PATCH v2] drm/amdgpu: Fix a race of IB test
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 9/13/2021 5:18 AM, xinhui pan wrote:
>>>>>>> Direct IB submission should be exclusive. So use write lock.
>>>>>>>
>>>>>>> Signed-off-by: xinhui pan <xinhui.pan at amd.com>
>>>>>>> ---
>>>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 4 ++--
>>>>>>>    1 file changed, 2 insertions(+), 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>>>>> index 19323b4cce7b..be5d12ed3db1 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>>>>> @@ -1358,7 +1358,7 @@ static int
>>>>>>> amdgpu_debugfs_test_ib_show(struct seq_file *m, void *unused)
>>>>>>>        }
>>>>>>>
>>>>>>>        /* Avoid accidently unparking the sched thread during GPU
>>>>>>> reset */
>>>>>>> -     r = down_read_killable(&adev->reset_sem);
>>>>>>> +     r = down_write_killable(&adev->reset_sem);
>>>>>> There are many ioctls and debugfs calls which takes this lock and
>>>>>> as you
>>>>>> know the purpose is to avoid them while there is a reset. The
>>>>>> purpose is
>>>>>> *not to* fix any concurrency issues those calls themselves have
>>>>>> otherwise and fixing those concurrency issues this way is just
>>>>>> lazy and
>>>>>> not acceptable.
>>>>>>
>>>>>> This will take away any fairness given to the writer in this rw
>>>>>> lock and
>>>>>> that is supposed to be the reset thread.
>>>>>>
>>>>>> Thanks,
>>>>>> Lijo
>>>>>>
>>>>>>>        if (r)
>>>>>>>                return r;
>>>>>>>
>>>>>>> @@ -1387,7 +1387,7 @@ static int
>>>>>>> amdgpu_debugfs_test_ib_show(struct seq_file *m, void *unused)
>>>>>>>                kthread_unpark(ring->sched.thread);
>>>>>>>        }
>>>>>>>
>>>>>>> -     up_read(&adev->reset_sem);
>>>>>>> +     up_write(&adev->reset_sem);
>>>>>>>
>>>>>>>        pm_runtime_mark_last_busy(dev->dev);
>>>>>>>        pm_runtime_put_autosuspend(dev->dev);
>>>>>>>
>>>>>
>>>
>


More information about the amd-gfx mailing list