[PATCH v2 0/7] Fix multiple GPU resets in XGMI hive.
Christian König
ckoenig.leichtzumerken at gmail.com
Thu May 19 07:58:05 UTC 2022
Am 18.05.22 um 16:24 schrieb Andrey Grodzovsky:
>
>
> On 2022-05-18 02:07, Christian König wrote:
>> Am 17.05.22 um 21:20 schrieb Andrey Grodzovsky:
>>> Problem:
>>> During hive reset caused by command timing out on a ring
>>> extra resets are generated by triggered by KFD which is
>>> unable to accesses registers on the resetting ASIC.
>>>
>>> Fix: Rework GPU reset to actively stop any pending reset
>>> works while another in progress.
>>>
>>> v2: Switch from generic list as was in v1[1] to eplicit
>>> stopping of each reset request from each reset source
>>> per each request submitter.
>>
>> Looks mostly good to me.
>>
>> Apart from the naming nit pick on patch #1 the only thing I couldn't
>> of hand figure out is why you are using a delayed work everywhere
>> instead of a just a work item.
>>
>> That needs a bit further explanation what's happening here.
>>
>> Christian.
>
>
> Check APIs for cancelling work vs. delayed work -
>
> For work_struct the only public API is this -
> https://elixir.bootlin.com/linux/latest/source/kernel/workqueue.c#L3214
> - blocking cancel.
>
> For delayed_work we have both blocking and non blocking public APIs -
>
> https://elixir.bootlin.com/linux/latest/source/kernel/workqueue.c#L3295
>
> https://elixir.bootlin.com/linux/latest/source/kernel/workqueue.c#L3295
>
> I prefer not to go now into convincing core kernel people of exposing
> another interface for our own sake - from my past experience API
> changes in core code has slim chances and a lot of time spent on back
> and forth arguments.
>
> "If the mountain will not come to Muhammad, then Muhammad must go to
> the mountain" ;)*
> *
>
Ah, good point. The cancel_work() function was removed a few years ago:
commit 6417250d3f894e66a68ba1cd93676143f2376a6f
Author: Stephen Hemminger <stephen at networkplumber.org>
Date: Tue Mar 6 19:34:42 2018 -0800
workqueue: remove unused cancel_work()
Found this by accident.
There are no usages of bare cancel_work() in current kernel source.
Signed-off-by: Stephen Hemminger <stephen at networkplumber.org>
Signed-off-by: Tejun Heo <tj at kernel.org>
Maybe just revert that patch, export the function and use it. I think
there is plenty of justification for this.
Thanks,
Christian.
> **
>
> Andrey
>
>>
>>>
>>> [1] -
>>> https://lore.kernel.org/all/20220504161841.24669-1-andrey.grodzovsky@amd.com/
>>>
>>> Andrey Grodzovsky (7):
>>> drm/amdgpu: Cache result of last reset at reset domain level.
>>> drm/amdgpu: Switch to delayed work from work_struct.
>>> drm/admgpu: Serialize RAS recovery work directly into reset domain
>>> queue.
>>> drm/amdgpu: Add delayed work for GPU reset from debugfs
>>> drm/amdgpu: Add delayed work for GPU reset from kfd.
>>> drm/amdgpu: Rename amdgpu_device_gpu_recover_imp back to
>>> amdgpu_device_gpu_recover
>>> drm/amdgpu: Stop any pending reset if another in progress.
>>>
>>> drivers/gpu/drm/amd/amdgpu/amdgpu.h | 4 +-
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 15 +++++-
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h | 1 +
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 62
>>> +++++++++++-----------
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 19 ++++++-
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 2 +-
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 10 ++--
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 2 +-
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 1 +
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h | 5 +-
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h | 2 +-
>>> drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 6 +--
>>> drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 6 +--
>>> drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c | 6 +--
>>> 14 files changed, 87 insertions(+), 54 deletions(-)
>>>
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20220519/5e8d3233/attachment-0001.htm>
More information about the amd-gfx
mailing list