[PATCH v4 11/14] drm/amdgpu: Guard against write accesses after device removal

Andrey Grodzovsky Andrey.Grodzovsky at amd.com
Fri Feb 5 16:22:46 UTC 2021


Daniel, ping. Also, please see the other thread with Bjorn from pci-dev on the
same topic that I added you to.

Andrey

On 1/29/21 2:25 PM, Christian König wrote:
> Am 29.01.21 um 18:35 schrieb Andrey Grodzovsky:
>>
>> On 1/29/21 10:16 AM, Christian König wrote:
>>> Am 28.01.21 um 18:23 schrieb Andrey Grodzovsky:
>>>>
>>>> On 1/19/21 1:59 PM, Christian König wrote:
>>>>> Am 19.01.21 um 19:22 schrieb Andrey Grodzovsky:
>>>>>>
>>>>>> On 1/19/21 1:05 PM, Daniel Vetter wrote:
>>>>>>> [SNIP]
>>>>>> So, say, writing to some harmless scratch register in a loop many times,
>>>>>> both for the plugged and unplugged case, and measuring the total time delta?
>>>>>
>>>>> I think we should at least measure the following:
>>>>>
>>>>> 1. Writing X times to a scratch reg without your patch.
>>>>> 2. Writing X times to a scratch reg with your patch.
>>>>> 3. Writing X times to a scratch reg with the hardware physically disconnected.
>>>>>
>>>>> I suggest repeating that once for Polaris (or older) and once for Vega or
>>>>> Navi.
>>>>>
>>>>> The SRBM on Polaris is meant to introduce some delay in each access, so it
>>>>> might react differently than the newer hardware.
>>>>>
>>>>> Christian.
>>>>
>>>>
>>>> See the attached results and testing code. Ran on Polaris (gfx8) and
>>>> Vega 10 (gfx9).
>>>>
>>>> In summary, over 1 million WREG32 calls in a loop, with and without this
>>>> patch, you get around 10ms of accumulated overhead (so roughly a 10ns penalty
>>>> for each WREG32) for using the drm_dev_enter check when writing registers.
>>>>
>>>> P.S. Bullet 3 I cannot test, as I would need an eGPU and currently don't have
>>>> one.
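>>>>
>>>> For reference, the timing loop is basically of this shape (a simplified
>>>> sketch; the function name, the mmSCRATCH_REG0 offset and the hard-coded count
>>>> are placeholders, the attached code is what was actually run):
>>>>
>>>> static void amdgpu_time_scratch_writes(struct amdgpu_device *adev)
>>>> {
>>>>     u64 start, end;
>>>>     int i;
>>>>
>>>>     start = ktime_get_ns();
>>>>     for (i = 0; i < 1000000; i++)
>>>>         WREG32(mmSCRATCH_REG0, 0xdeadbeef); /* harmless scratch register */
>>>>     end = ktime_get_ns();
>>>>
>>>>     dev_info(adev->dev, "1M scratch writes took %llu ns\n", end - start);
>>>> }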
>>>
>>> Well, if I'm not completely mistaken, that is 100ms of accumulated overhead,
>>> so around 100ns per write. And the even bigger problem is that this is a ~67%
>>> increase.
>>
>>
>> My bad. And 67% of what? How did you calculate that?
> 
> My bad, (308501-209689)/209689=47% increase.
> 
>>>
>>> I'm not sure how many writes we do during normal operation, but that sounds
>>> like a bit much. Ideas?
>>
>> Well, you suggested moving the drm_dev_enter check way up, but as I see it the
>> problem with this is that it increases the chance of a race where the device
>> is extracted after we check drm_dev_enter (there is such a chance even when
>> the check is placed inside WREG32, but it's lower).
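>>
>> For reference, the guard being measured wraps each MMIO write roughly like
>> this (a simplified sketch of the approach, not the exact hunk from the patch):
>>
>> void amdgpu_device_wreg(struct amdgpu_device *adev,
>>                         uint32_t reg, uint32_t v, uint32_t acc_flags)
>> {
>>     int idx;
>>
>>     /* Skip the MMIO write entirely if the device is already unplugged;
>>      * drm_dev_enter() holds off drm_dev_unplug() for the duration. */
>>     if (!drm_dev_enter(adev_to_drm(adev), &idx))
>>         return;
>>
>>     writel(v, adev->rmmio + (reg * 4));
>>
>>     drm_dev_exit(idx);
>> }
>>
>> Keeping the check this deep in the accessor keeps the unplug race window small,
>> which is also why the per-write cost shows up directly in the numbers above.
>>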
>> Earlier I proposed that instead of scattering all those guards over the code,
>> we simply delay the release of system memory pages and the unreserve of MMIO
>> ranges until after the device itself is gone, once the last drm device
>> reference is dropped. But Daniel opposes delaying the MMIO range unreserve to
>> after the PCI remove code because, according to him, it will upset the PCI
>> subsystem.
> 
> Yeah, that's most likely true as well.
> 
> Maybe Daniel has another idea when he's back from vacation.
> 
> Christian.
> 
>>
>> Andrey
>>
>>>
>>> Christian.
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx at lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>
> 

