[PATCH v2 0/7] Fix multiple GPU resets in XGMI hive.

Andrey Grodzovsky andrey.grodzovsky at amd.com
Thu May 19 13:41:32 UTC 2022


On 2022-05-19 03:58, Christian König wrote:
>
>
> Am 18.05.22 um 16:24 schrieb Andrey Grodzovsky:
>>
>>
>> On 2022-05-18 02:07, Christian König wrote:
>>> Am 17.05.22 um 21:20 schrieb Andrey Grodzovsky:
>>>> Problem:
>>>> During hive reset caused by command timing out on a ring
>>>> extra resets are generated by triggered by KFD which is
>>>> unable to accesses registers on the resetting ASIC.
>>>>
>>>> Fix: Rework GPU reset to actively stop any pending reset
>>>> works while another in progress.
>>>>
>>>> v2: Switch from generic list as was in v1[1] to eplicit
>>>> stopping of each reset request from each reset source
>>>> per each request submitter.
>>>
>>> Looks mostly good to me.
>>>
>>> Apart from the naming nit pick on patch #1 the only thing I couldn't 
>>> of hand figure out is why you are using a delayed work everywhere 
>>> instead of a just a work item.
>>>
>>> That needs a bit further explanation what's happening here.
>>>
>>> Christian.
>>
>>
>> Check APIs for cancelling work vs. delayed work -
>>
>> For work_struct the only public API is this - 
>> https://elixir.bootlin.com/linux/latest/source/kernel/workqueue.c#L3214 
>> - blocking cancel.
>>
>> For delayed_work we have both blocking and non blocking public APIs -
>>
>> https://elixir.bootlin.com/linux/latest/source/kernel/workqueue.c#L3295
>>
>> https://elixir.bootlin.com/linux/latest/source/kernel/workqueue.c#L3295
>>
>> I prefer not to go now into convincing core kernel people of exposing 
>> another interface for our own sake - from my past experience API 
>> changes in core code has slim chances and a lot of time spent on back 
>> and forth arguments.
>>
>> "If the mountain will not come to Muhammad, then Muhammad must go to 
>> the mountain" ;)*
>> *
>>
>
> Ah, good point. The cancel_work() function was removed a few years ago:
>
> commit 6417250d3f894e66a68ba1cd93676143f2376a6f
> Author: Stephen Hemminger <stephen at networkplumber.org>
> Date:   Tue Mar 6 19:34:42 2018 -0800
>
>     workqueue: remove unused cancel_work()
>
>     Found this by accident.
>     There are no usages of bare cancel_work() in current kernel source.
>
>     Signed-off-by: Stephen Hemminger <stephen at networkplumber.org>
>     Signed-off-by: Tejun Heo <tj at kernel.org>
>
>
> Maybe just revert that patch, export the function and use it. I think 
> there is plenty of justification for this.
>
> Thanks,
> Christian.


Ok - i will send them a patch - let's see what they say.

Andrey


>
>> **
>>
>> Andrey
>>
>>>
>>>>
>>>> [1] - 
>>>> https://lore.kernel.org/all/20220504161841.24669-1-andrey.grodzovsky@amd.com/
>>>>
>>>> Andrey Grodzovsky (7):
>>>>    drm/amdgpu: Cache result of last reset at reset domain level.
>>>>    drm/amdgpu: Switch to delayed work from work_struct.
>>>>    drm/admgpu: Serialize RAS recovery work directly into reset domain
>>>>      queue.
>>>>    drm/amdgpu: Add delayed work for GPU reset from debugfs
>>>>    drm/amdgpu: Add delayed work for GPU reset from kfd.
>>>>    drm/amdgpu: Rename amdgpu_device_gpu_recover_imp back to
>>>>      amdgpu_device_gpu_recover
>>>>    drm/amdgpu: Stop any pending reset if another in progress.
>>>>
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  4 +-
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 15 +++++-
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |  1 +
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 62 
>>>> +++++++++++-----------
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 19 ++++++-
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  2 +-
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c    | 10 ++--
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h    |  2 +-
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c  |  1 +
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h  |  5 +-
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h   |  2 +-
>>>>   drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c      |  6 +--
>>>>   drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c      |  6 +--
>>>>   drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c      |  6 +--
>>>>   14 files changed, 87 insertions(+), 54 deletions(-)
>>>>
>>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20220519/fbd65d38/attachment-0001.htm>


More information about the amd-gfx mailing list