<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<br>
<br>
<div class="moz-cite-prefix">Am 18.05.22 um 16:24 schrieb Andrey
Grodzovsky:<br>
</div>
<blockquote type="cite"
cite="mid:ce60a983-9906-e33f-a2cc-6fedb958a124@amd.com">
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<p><br>
</p>
<div class="moz-cite-prefix">On 2022-05-18 02:07, Christian König
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:1a7fd05f-490b-9999-5f0b-e84af26504a9@amd.com">Am
17.05.22 um 21:20 schrieb Andrey Grodzovsky: <br>
<blockquote type="cite">Problem: <br>
During hive reset caused by command timing out on a ring <br>
extra resets are generated by triggered by KFD which is <br>
unable to accesses registers on the resetting ASIC. <br>
<br>
Fix: Rework GPU reset to actively stop any pending reset <br>
works while another in progress. <br>
<br>
v2: Switch from generic list as was in v1[1] to eplicit <br>
stopping of each reset request from each reset source <br>
per each request submitter. <br>
</blockquote>
<br>
Looks mostly good to me. <br>
<br>
Apart from the naming nit pick on patch #1 the only thing I
couldn't of hand figure out is why you are using a delayed work
everywhere instead of a just a work item. <br>
<br>
That needs a bit further explanation what's happening here. <br>
<br>
Christian. <br>
</blockquote>
<p><br>
</p>
<p>Check APIs for cancelling work vs. delayed work -</p>
<p>For work_struct the only public API is this - <a
class="moz-txt-link-freetext"
href="https://elixir.bootlin.com/linux/latest/source/kernel/workqueue.c#L3214"
moz-do-not-send="true">https://elixir.bootlin.com/linux/latest/source/kernel/workqueue.c#L3214</a>
- blocking cancel.</p>
<p>For delayed_work we have both blocking and non blocking public
APIs - <br>
</p>
<p><a class="moz-txt-link-freetext"
href="https://elixir.bootlin.com/linux/latest/source/kernel/workqueue.c#L3295"
moz-do-not-send="true">https://elixir.bootlin.com/linux/latest/source/kernel/workqueue.c#L3295</a></p>
<p><a class="moz-txt-link-freetext"
href="https://elixir.bootlin.com/linux/latest/source/kernel/workqueue.c#L3295"
moz-do-not-send="true">https://elixir.bootlin.com/linux/latest/source/kernel/workqueue.c#L3295</a></p>
<p>I prefer not to go now into convincing core kernel people of
exposing another interface for our own sake - from my past
experience API changes in core code has slim chances and a lot
of time spent on back and forth arguments.</p>
<p>"<span style="color: rgb(32, 33, 36); font-family: arial,
sans-serif; font-size: 16px; font-style: normal;
font-variant-ligatures: normal; font-variant-caps: normal;
letter-spacing: normal; orphans: 2; text-align: left;
text-indent: 0px; text-transform: none; white-space: normal;
widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px;
background-color: rgb(255, 255, 255);
text-decoration-thickness: initial; text-decoration-style:
initial; text-decoration-color: initial;">If the mountain will
not come to Muhammad, then Muhammad must go to the mountain"
;)</span><b style="color: rgb(32, 33, 36); font-family: arial,
sans-serif; font-size: 16px; font-style: normal;
font-variant-ligatures: normal; font-variant-caps: normal;
letter-spacing: normal; orphans: 2; text-align: left;
text-indent: 0px; text-transform: none; white-space: normal;
widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px;
background-color: rgb(255, 255, 255);
text-decoration-thickness: initial; text-decoration-style:
initial; text-decoration-color: initial;"><br>
</b></p>
</blockquote>
<br>
Ah, good point. The cancel_work() function was removed a few years
ago:<br>
<br>
commit 6417250d3f894e66a68ba1cd93676143f2376a6f<br>
Author: Stephen Hemminger <a class="moz-txt-link-rfc2396E" href="mailto:stephen@networkplumber.org"><stephen@networkplumber.org></a><br>
Date: Tue Mar 6 19:34:42 2018 -0800<br>
<br>
workqueue: remove unused cancel_work()<br>
<br>
Found this by accident.<br>
There are no usages of bare cancel_work() in current kernel
source.<br>
<br>
Signed-off-by: Stephen Hemminger
<a class="moz-txt-link-rfc2396E" href="mailto:stephen@networkplumber.org"><stephen@networkplumber.org></a><br>
Signed-off-by: Tejun Heo <a class="moz-txt-link-rfc2396E" href="mailto:tj@kernel.org"><tj@kernel.org></a><br>
<br>
<br>
Maybe just revert that patch, export the function and use it. I
think there is plenty of justification for this.<br>
<br>
Thanks,<br>
Christian.<br>
<br>
<blockquote type="cite"
cite="mid:ce60a983-9906-e33f-a2cc-6fedb958a124@amd.com">
<p><b style="color: rgb(32, 33, 36); font-family: arial,
sans-serif; font-size: 16px; font-style: normal;
font-variant-ligatures: normal; font-variant-caps: normal;
letter-spacing: normal; orphans: 2; text-align: left;
text-indent: 0px; text-transform: none; white-space: normal;
widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px;
background-color: rgb(255, 255, 255);
text-decoration-thickness: initial; text-decoration-style:
initial; text-decoration-color: initial;"> </b></p>
<p>Andrey</p>
<blockquote type="cite"
cite="mid:1a7fd05f-490b-9999-5f0b-e84af26504a9@amd.com"> <br>
<blockquote type="cite"> <br>
[1] -
<a class="moz-txt-link-freetext"
href="https://lore.kernel.org/all/20220504161841.24669-1-andrey.grodzovsky@amd.com/"
moz-do-not-send="true">https://lore.kernel.org/all/20220504161841.24669-1-andrey.grodzovsky@amd.com/</a><br>
<br>
Andrey Grodzovsky (7): <br>
drm/amdgpu: Cache result of last reset at reset domain
level. <br>
drm/amdgpu: Switch to delayed work from work_struct. <br>
drm/admgpu: Serialize RAS recovery work directly into reset
domain <br>
queue. <br>
drm/amdgpu: Add delayed work for GPU reset from debugfs <br>
drm/amdgpu: Add delayed work for GPU reset from kfd. <br>
drm/amdgpu: Rename amdgpu_device_gpu_recover_imp back to <br>
amdgpu_device_gpu_recover <br>
drm/amdgpu: Stop any pending reset if another in progress.
<br>
<br>
drivers/gpu/drm/amd/amdgpu/amdgpu.h | 4 +- <br>
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 15 +++++- <br>
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h | 1 + <br>
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 62
+++++++++++----------- <br>
drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 19 ++++++- <br>
drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 2 +- <br>
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 10 ++-- <br>
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 2 +- <br>
drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 1 + <br>
drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h | 5 +- <br>
drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h | 2 +- <br>
drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 6 +-- <br>
drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 6 +-- <br>
drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c | 6 +-- <br>
14 files changed, 87 insertions(+), 54 deletions(-) <br>
<br>
</blockquote>
<br>
</blockquote>
</blockquote>
<br>
</body>
</html>