<html><head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> </head> <body> <div class="moz-cite-prefix">On 2022-05-18 02:07, Christian König wrote: </div> <blockquote type="cite" cite="mid:1a7fd05f-490b-9999-5f0b-e84af26504a9@amd.com">On 17.05.22 at 21:20, Andrey Grodzovsky wrote: <blockquote type="cite">Problem: During a hive reset caused by a command timing out on a ring, extra resets are triggered by KFD, which is unable to access registers on the resetting ASIC.
Fix: Rework GPU reset to actively stop any pending reset work while another reset is in progress.
v2: Switch from the generic list used in v1 [1] to explicit stopping of each reset request from each reset source, per request submitter. </blockquote>
Looks mostly good to me. Apart from the naming nitpick on patch #1, the only thing I couldn't figure out offhand is why you are using a delayed work everywhere instead of just a work item. That needs a bit more explanation of what's happening here. Christian. </blockquote>
Check the APIs for cancelling work vs. delayed work. For work_struct the only public API is a blocking cancel: <a class="moz-txt-link-freetext" href="https://elixir.bootlin.com/linux/latest/source/kernel/workqueue.c#L3214">https://elixir.bootlin.com/linux/latest/source/kernel/workqueue.c#L3214</a>. For delayed_work we have both blocking and non-blocking public APIs: <a class="moz-txt-link-freetext" href="https://elixir.bootlin.com/linux/latest/source/kernel/workqueue.c#L3295">https://elixir.bootlin.com/linux/latest/source/kernel/workqueue.c#L3295</a>.
I would rather not try now to convince the core kernel people to expose another interface for our own sake. From my past experience, API changes in core code have slim chances of being accepted and cost a lot of time in back-and-forth arguments.
"If the mountain will not come to Muhammad, then Muhammad must go to the mountain" ;)
Andrey
<blockquote type="cite" cite="mid:1a7fd05f-490b-9999-5f0b-e84af26504a9@amd.com"> <blockquote type="cite"> [1] - <a class="moz-txt-link-freetext" href="https://lore.kernel.org/all/20220504161841.24669-1-andrey.grodzovsky@amd.com/">https://lore.kernel.org/all/20220504161841.24669-1-andrey.grodzovsky@amd.com/</a>
Andrey Grodzovsky (7):
drm/amdgpu: Cache result of last reset at reset domain level.
drm/amdgpu: Switch to delayed work from work_struct.
drm/admgpu: Serialize RAS recovery work directly into reset domain queue.
drm/amdgpu: Add delayed work for GPU reset from debugfs
drm/amdgpu: Add delayed work for GPU reset from kfd.
drm/amdgpu: Rename amdgpu_device_gpu_recover_imp back to amdgpu_device_gpu_recover
drm/amdgpu: Stop any pending reset if another in progress.
drivers/gpu/drm/amd/amdgpu/amdgpu.h | 4 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 15 +++++-
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h | 1 +
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 62 +++++++++++-----------
drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 19 ++++++-
drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 2 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 10 ++--
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 2 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 1 +
drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h | 5 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h | 2 +-
drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 6 +--
drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 6 +--
drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c | 6 +--
14 files changed, 87 insertions(+), 54 deletions(-)
</blockquote> </blockquote> </body> </html>