Re: 回复: [PATCH v2] drm/amdgpu: Fix a race of IB test

Lazar, Lijo lijo.lazar at amd.com
Mon Sep 13 06:43:36 UTC 2021


There are other interfaces to emulate the exact reset process, or 
atleast this is not the one we are using for doing any sort of reset 
through debugfs.

In any case, the expectation is reset thread takes the write side of the 
lock and it's already done somewhere else.

Reset semaphore is supposed to protect the device from concurrent access 
(any sort of resource usage is thus protected by default). Then the same 
logic can be applied for any other call and that is not a reasonable ask.

Thanks,
Lijo

On 9/13/2021 12:07 PM, Christian König wrote:
> That's complete nonsense.
> 
> The debugfs interface emulates parts of the reset procedure for testing 
> and we absolutely need to take the same locks as the reset to avoid 
> corruption of the involved objects.
> 
> Regards,
> Christian.
> 
> Am 13.09.21 um 08:25 schrieb Lazar, Lijo:
>> This is a debugfs interface and adding another writer contention in 
>> debugfs over an actual reset is lazy fix. This shouldn't be executed 
>> in the first place and should not take precedence over any reset.
>>
>> Thanks,
>> Lijo
>>
>>
>> On 9/13/2021 11:52 AM, Christian König wrote:
>>> NAK, this is not the lazy way to fix it at all.
>>>
>>> The reset semaphore protects the scheduler and ring objects from 
>>> concurrent modification, so taking the write side of it is perfectly 
>>> valid here.
>>>
>>> Christian.
>>>
>>> Am 13.09.21 um 06:42 schrieb Pan, Xinhui:
>>>> [AMD Official Use Only]
>>>>
>>>> yep, that is a lazy way to fix it.
>>>>
>>>> I am thinking of adding one amdgpu_ring.direct_access_mutex before 
>>>> we issue test_ib on each ring.
>>>> ________________________________________
>>>> 发件人: Lazar, Lijo <Lijo.Lazar at amd.com>
>>>> 发送时间: 2021年9月13日 12:00
>>>> 收件人: Pan, Xinhui; amd-gfx at lists.freedesktop.org
>>>> 抄送: Deucher, Alexander; Koenig, Christian
>>>> 主题: Re: [PATCH v2] drm/amdgpu: Fix a race of IB test
>>>>
>>>>
>>>>
>>>> On 9/13/2021 5:18 AM, xinhui pan wrote:
>>>>> Direct IB submission should be exclusive. So use write lock.
>>>>>
>>>>> Signed-off-by: xinhui pan <xinhui.pan at amd.com>
>>>>> ---
>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 4 ++--
>>>>>    1 file changed, 2 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c 
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>>> index 19323b4cce7b..be5d12ed3db1 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>>> @@ -1358,7 +1358,7 @@ static int amdgpu_debugfs_test_ib_show(struct 
>>>>> seq_file *m, void *unused)
>>>>>        }
>>>>>
>>>>>        /* Avoid accidently unparking the sched thread during GPU 
>>>>> reset */
>>>>> -     r = down_read_killable(&adev->reset_sem);
>>>>> +     r = down_write_killable(&adev->reset_sem);
>>>> There are many ioctls and debugfs calls which takes this lock and as 
>>>> you
>>>> know the purpose is to avoid them while there is a reset. The 
>>>> purpose is
>>>> *not to* fix any concurrency issues those calls themselves have
>>>> otherwise and fixing those concurrency issues this way is just lazy and
>>>> not acceptable.
>>>>
>>>> This will take away any fairness given to the writer in this rw lock 
>>>> and
>>>> that is supposed to be the reset thread.
>>>>
>>>> Thanks,
>>>> Lijo
>>>>
>>>>>        if (r)
>>>>>                return r;
>>>>>
>>>>> @@ -1387,7 +1387,7 @@ static int amdgpu_debugfs_test_ib_show(struct 
>>>>> seq_file *m, void *unused)
>>>>>                kthread_unpark(ring->sched.thread);
>>>>>        }
>>>>>
>>>>> -     up_read(&adev->reset_sem);
>>>>> +     up_write(&adev->reset_sem);
>>>>>
>>>>>        pm_runtime_mark_last_busy(dev->dev);
>>>>>        pm_runtime_put_autosuspend(dev->dev);
>>>>>
>>>
> 


More information about the amd-gfx mailing list