[PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.
Christian König
christian.koenig at amd.com
Mon May 16 15:08:27 UTC 2022
Am 16.05.22 um 16:12 schrieb Andrey Grodzovsky:
>
> Ping
>
Ah, yes sorry.
> Andrey
>
> On 2022-05-13 11:41, Andrey Grodzovsky wrote:
>>> Yes, exactly that's the idea.
>>>
>>> Basically the reset domain knowns which amdgpu devices it needs to
>>> reset together.
>>>
>>> If you then represent that so that you always have a hive even when
>>> you only have one device in it, or if you put an array of devices
>>> which needs to be reset together into the reset domain doesn't matter.
>>>
>>> Maybe go for the later approach, that is probably a bit cleaner and
>>> less code to change.
>>>
>>> Christian.
>>
>>
>> Unfortunately this approach raises also a few difficulties -
>> First - if holding array of devices in reset_domain then when you
>> come to GPU reset function you don't really know which adev is the
>> one triggered the reset and this is actually essential to some
>> procedures like emergency restart.
What is "emergency restart"? That's not some requirement I know about.
>>
>> Second - in XGMI case we must take into account that one of the hive
>> members might go away in runtime (i could do echo 1 >
>> /sysfs/pci_id/remove on it for example at any moment) - so now we
>> need to maintain this array and mark such entry with NULL probably on
>> XGMI node removal , and then there might be hot insertion and all
>> this adds more complications.
>>
>> I now tend to prefer your initial solution for it's simplicity and
>> the result will be what we need -
>>
>> "E.g. in the reset code (either before or after the reset, that's
>> debatable) you do something like this:
>>
>> for (i = 0; i < num_ring; ++i)
>> cancel_delayed_work(ring[i]->scheduler....)
>> cancel_work(adev->ras_work);
>> cancel_work(adev->iofault_work);
>> cancel_work(adev->debugfs_work);
>> "
Works for me. I already expected that switching over the reset to be
based on the reset context wouldn't be that easy.
Regards,
Christian.
>>
>> Let me know what you think.
>>
>> Andrey
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20220516/0ebfc135/attachment-0001.htm>
More information about the amd-gfx
mailing list