[PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

Andrey Grodzovsky andrey.grodzovsky at amd.com
Mon May 16 15:13:42 UTC 2022


On 2022-05-16 11:08, Christian König wrote:
> Am 16.05.22 um 16:12 schrieb Andrey Grodzovsky:
>>
>> Ping
>>
>
> Ah, yes sorry.
>
>> Andrey
>>
>> On 2022-05-13 11:41, Andrey Grodzovsky wrote:
>>>> Yes, exactly that's the idea.
>>>>
>>>> Basically the reset domain knows which amdgpu devices it needs to 
>>>> reset together.
>>>>
>>>> Whether you then represent that so that you always have a hive even 
>>>> when you only have one device in it, or you put an array of devices 
>>>> which need to be reset together into the reset domain, doesn't matter.
>>>>
>>>> Maybe go for the latter approach, that is probably a bit cleaner and 
>>>> less code to change.
>>>>
>>>> Christian.
>>>
>>>
>>> Unfortunately this approach also raises a few difficulties -
>>> First - if we hold an array of devices in the reset_domain, then when 
>>> you come to the GPU reset function you don't really know which adev is 
>>> the one that triggered the reset, and this is actually essential to 
>>> some procedures like emergency restart.
>
> What is "emergency restart"? That's not some requirement I know about.


Emergency restart is something you can see at the beginning of the 
amdgpu_gpu_recover function - it's a historical workaround for some 
types of ASICs that weren't able to do a full reset, I think. We should 
eventually remove it, but for now I think it's still in use.
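
For reference, the idea is roughly the following at the top of the 
recovery path (this is only a paraphrased sketch - the exact condition 
and surrounding code may differ from what is currently in the tree):

    /* RAS triggered a fatal error and this ASIC can't do a full reset -
     * sync filesystems so the logs survive, then hard-reboot the machine.
     */
    if (amdgpu_ras_need_emergency_restart(adev)) {
            DRM_WARN("Emergency reboot.");
            ksys_sync_helper();
            emergency_restart();
    }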


>
>>>
>>> Second - in the XGMI case we must take into account that one of the 
>>> hive members might go away at runtime (I could do echo 1 > 
>>> /sysfs/pci_id/remove on it, for example, at any moment) - so now we 
>>> need to maintain this array and probably mark such an entry with NULL 
>>> on XGMI node removal, and then there might be hot insertion, and all 
>>> this adds more complications.
>>>
>>> I now tend to prefer your initial solution for its simplicity, and 
>>> the result will be what we need -
>>>
>>> "E.g. in the reset code (either before or after the reset, that's 
>>> debatable) you do something like this:
>>>
>>> for (i = 0; i < num_ring; ++i)
>>>     cancel_delayed_work(ring[i]->scheduler....)
>>> cancel_work(adev->ras_work);
>>> cancel_work(adev->iofault_work);
>>> cancel_work(adev->debugfs_work);
>>> "
>
> Works for me. I already expected that switching the reset over to being 
> based on the reset context wouldn't be that easy.
>
> Regards,
> Christian.


Ok - I will resend a patch.
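
In the reset path I have roughly the following in mind (a sketch only - 
the ras_work/iofault_work/debugfs_work fields are placeholders taken from 
your example above rather than actual amdgpu_device members, and the 
per-ring delayed work is the scheduler's TDR work):

    int i;

    /* Stop any TDR handler already queued on each ring's scheduler so it
     * can't schedule another reset while we are handling this one.
     */
    for (i = 0; i < adev->num_rings; ++i) {
            struct amdgpu_ring *ring = adev->rings[i];

            if (!ring || !ring->sched.thread)
                    continue;

            cancel_delayed_work(&ring->sched.work_tdr);
    }

    /* Cancel the per-device work items that could trigger yet another
     * reset (placeholder names, see above).
     */
    cancel_work(&adev->ras_work);
    cancel_work(&adev->iofault_work);
    cancel_work(&adev->debugfs_work);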

Andrey


>
>>>
>>> Let me know what you think.
>>>
>>> Andrey 
>