[PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

Mon May 16 14:12:14 UTC 2022

Ping

Andrey

On 2022-05-13 11:41, Andrey Grodzovsky wrote:
>> Yes, exactly that's the idea.
>>
>> Basically the reset domain knows which amdgpu devices it needs to
>> reset together.
>>
>> Whether you represent that by always having a hive, even when it
>> contains only one device, or by putting an array of the devices that
>> need to be reset together into the reset domain, doesn't matter.
>>
>> Maybe go for the latter approach; that is probably a bit cleaner and
>> less code to change.
>>
>> Christian.
>
>
> Unfortunately, this approach also raises a few difficulties.
> First, if the reset_domain holds an array of devices, then when you get
> to the GPU reset function you don't really know which adev is the one
> that triggered the reset, and this is actually essential to some
> procedures, like emergency restart.
>
> Second, in the XGMI case we must take into account that one of the hive
> members might go away at runtime (for example, I could run
> echo 1 > /sysfs/pci_id/remove on it at any moment), so now we need
> to maintain this array and probably mark such an entry with NULL on
> XGMI node removal. Then there might be hot insertion as well, and all
> this adds more complications.
>
> I now tend to prefer your initial solution for its simplicity, and the
> result will be what we need:
>
> "E.g. in the reset code (either before or after the reset, that's 
> debatable) you do something like this:
>
> for (i = 0; i < num_ring; ++i)
>     cancel_delayed_work(ring[i]->scheduler....)
> cancel_work(adev->ras_work);
> cancel_work(adev->iofault_work);
> cancel_work(adev->debugfs_work);
> "
>
> Let me know what you think.
>
> Andrey 
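
For illustration only, here is a minimal userspace sketch of the quoted idea: before (or after) a reset, walk every source that could have queued a reset and drop whatever is still pending, so that one fault does not lead to several back-to-back resets. Everything named mock_* is hypothetical and stands in for the real driver structures. The return value only imitates the "true if the work was pending" convention of the kernel's cancel_delayed_work(). This is not the actual amdgpu code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MOCK_NUM_RINGS 4 /* assumption: arbitrary ring count for the sketch */

/* Hypothetical stand-in for a queued work item; not the kernel's work_struct. */
struct mock_work {
    bool pending;
};

/* Hypothetical stand-in for one device and the sources that can request a reset. */
struct mock_adev {
    struct mock_work ring_timeout[MOCK_NUM_RINGS]; /* per-ring scheduler timeout */
    struct mock_work ras_work;
    struct mock_work iofault_work;
    struct mock_work debugfs_work;
};

/* Clear a pending request; returns true if it was pending, false otherwise. */
static bool mock_cancel_work(struct mock_work *w)
{
    bool was_pending;

    assert(w != NULL);
    was_pending = w->pending;
    w->pending = false;
    return was_pending;
}

/* Drop every queued reset request on this device; returns how many were dropped. */
static int mock_cancel_pending_resets(struct mock_adev *adev)
{
    int cancelled = 0;
    size_t i;

    assert(adev != NULL);
    for (i = 0; i < MOCK_NUM_RINGS; ++i)
        cancelled += mock_cancel_work(&adev->ring_timeout[i]);
    cancelled += mock_cancel_work(&adev->ras_work);
    cancelled += mock_cancel_work(&adev->iofault_work);
    cancelled += mock_cancel_work(&adev->debugfs_work);
    return cancelled;
}
```

In the XGMI case the same loop would presumably have to run for every device in the hive, since any member could have queued its own reset request for the same fault.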