<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
<p><br>
</p>
<div class="moz-cite-prefix">On 2022-05-10 23:35, Shuotao Xu wrote:<br>
</div>
<blockquote type="cite" cite="mid:E51808D5-5E34-420C-9CBD-F2BAE26E45F5@microsoft.com">
<br class="">
<div><br class="">
<blockquote type="cite" class="">
<div class="">On May 11, 2022, at 4:31 AM, Felix Kuehling <<a href="mailto:felix.kuehling@amd.com" class="moz-txt-link-freetext" moz-do-not-send="true">felix.kuehling@amd.com</a>>
wrote:</div>
<br class="Apple-interchange-newline">
<div class="">
<br style="caret-color: rgb(0, 0, 0); font-family:
Helvetica; font-size: 12px; font-style: normal;
font-variant-caps: normal; font-weight: 400;
letter-spacing: normal; text-align: start; text-indent:
0px; text-transform: none; white-space: normal;
word-spacing: 0px; -webkit-text-stroke-width: 0px;
text-decoration: none;" class="">
<span style="caret-color: rgb(0, 0, 0); font-family:
Helvetica; font-size: 12px; font-style: normal;
font-variant-caps: normal; font-weight: 400;
letter-spacing: normal; text-align: start; text-indent:
0px; text-transform: none; white-space: normal;
word-spacing: 0px; -webkit-text-stroke-width: 0px;
text-decoration: none; float: none; display: inline
!important;" class="">On 2022-05-10 at 07:03, Shuotao Xu
wrote:</span><br style="caret-color: rgb(0, 0, 0);
font-family: Helvetica; font-size: 12px; font-style:
normal; font-variant-caps: normal; font-weight: 400;
letter-spacing: normal; text-align: start; text-indent:
0px; text-transform: none; white-space: normal;
word-spacing: 0px; -webkit-text-stroke-width: 0px;
text-decoration: none;" class="">
<blockquote type="cite" style="font-family: Helvetica;
font-size: 12px; font-style: normal; font-variant-caps:
normal; font-weight: 400; letter-spacing: normal; orphans:
auto; text-align: start; text-indent: 0px; text-transform:
none; white-space: normal; widows: auto; word-spacing:
0px; -webkit-text-size-adjust: auto;
-webkit-text-stroke-width: 0px; text-decoration: none;" class="">
<br class="">
<br class="">
<blockquote type="cite" class="">On Apr 28, 2022, at 12:04
AM, Andrey Grodzovsky<br class="">
<<a href="mailto:andrey.grodzovsky@amd.com" class="moz-txt-link-freetext" moz-do-not-send="true">andrey.grodzovsky@amd.com</a>>
wrote:<br class="">
<br class="">
On 2022-04-27 05:20, Shuotao Xu wrote:<br class="">
<br class="">
<blockquote type="cite" class="">Hi Andrey,<br class="">
<br class="">
Sorry that I did not have time to work on this for a
few days.<br class="">
<br class="">
I just tried the sysfs crash fix on Radeon VII and it
seems that it<br class="">
worked. It did not pass the last hotplug test, but my
version has 4<br class="">
tests instead of 3 in your case.<br class="">
</blockquote>
<br class="">
<br class="">
That's because the 4th one is only enabled when there are 2
cards in the<br class="">
system - to test DRI_PRIME export. I tested this time
with only one card.<br class="">
<br class="">
</blockquote>
Yes, I only had one Radeon VII in my system, so this 4th
test should<br class="">
have been skipped. I am ignoring this issue.<br class="">
<br class="">
<blockquote type="cite" class="">
<blockquote type="cite" class=""><br class="">
<br class="">
Suite: Hotunplug Tests<br class="">
Test: Unplug card and rescan the bus to plug it back<br class="">
.../usr/local/share/libdrm/amdgpu.ids: No such file or
directory<br class="">
passed<br class="">
Test: Same as first test but with command submission<br class="">
.../usr/local/share/libdrm/amdgpu.ids: No such file or
directory<br class="">
passed<br class="">
Test: Unplug with exported bo<br class="">
.../usr/local/share/libdrm/amdgpu.ids: No such file or
directory<br class="">
passed<br class="">
Test: Unplug with exported fence<br class="">
.../usr/local/share/libdrm/amdgpu.ids: No such file or
directory<br class="">
amdgpu_device_initialize: amdgpu_get_auth (1) failed
(-1)<br class="">
</blockquote>
<br class="">
<br class="">
On the kernel side, the IOCTL returning this is
drm_getclient;<br class="">
maybe take a look at why it can't find the client? I
didn't have such an<br class="">
issue, as far as I remember, when testing.<br class="">
<br class="">
<br class="">
<blockquote type="cite" class="">FAILED<br class="">
1. ../tests/amdgpu/hotunplug_tests.c:368 -
CU_ASSERT_EQUAL(r,0)<br class="">
2. ../tests/amdgpu/hotunplug_tests.c:411 -<br class="">
CU_ASSERT_EQUAL(amdgpu_cs_import_syncobj(device2,
shared_fd,<br class="">
&sync_obj_handle2),0)<br class="">
3. ../tests/amdgpu/hotunplug_tests.c:423 -<br class="">
CU_ASSERT_EQUAL(amdgpu_cs_syncobj_wait(device2,
&sync_obj_handle2,<br class="">
1, 100000000, 0, NULL),0)<br class="">
4. ../tests/amdgpu/hotunplug_tests.c:425 -<br class="">
CU_ASSERT_EQUAL(amdgpu_cs_destroy_syncobj(device2,
sync_obj_handle2),0)<br class="">
<br class="">
Run Summary: Type Total Ran Passed Failed Inactive<br class="">
suites 14 1 n/a 0 0<br class="">
tests 71 4 3 1 0<br class="">
asserts 39 39 35 4 n/a<br class="">
<br class="">
Elapsed time = 17.321 seconds<br class="">
<br class="">
For kfd compute, there is a problem which I did not
see on MI100<br class="">
after I killed the hung application after hot plugout.
I was using<br class="">
the rocm5.0.2 driver for the MI100 card, and I am not sure if it is
a regression<br class="">
in the newer driver.<br class="">
After pkill, one child of the user process would be
stuck in zombie<br class="">
state (Z), understandably because of the bug, and future
ROCm<br class="">
applications after plug-back would be in uninterruptible
sleep (D)<br class="">
because they would not return from the syscall to kfd.<br class="">
<br class="">
However, the drm tests for amdgpu run just fine
without issues<br class="">
after plug-back with dangling kfd state.<br class="">
</blockquote>
<br class="">
<br class="">
It is not clear to me when the crash below happens. Is it
related to what<br class="">
you describe above?<br class="">
<br class="">
<br class="">
<blockquote type="cite" class=""><br class="">
I don’t know if there is a quick fix for it. I was
thinking of adding<br class="">
drm_dev_enter/drm_dev_exit to amdgpu_device_rreg.<br class="">
</blockquote>
<br class="">
<br class="">
Try adding a drm_dev_enter/exit pair at the highest level
that attempts<br class="">
to access HW; in this case it's
amdgpu_amdkfd_set_compute_idle. We<br class="">
always try to avoid accessing any HW functions after the
backing device<br class="">
is gone.<br class="">
<br class="">
<br class="">
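As a rough illustration of the guard pattern suggested above, here is a minimal user-space sketch in C. It is not kernel code: fake_dev, fake_dev_enter(), fake_dev_exit() and set_compute_idle() are hypothetical stand-ins for the DRM device, drm_dev_enter(), drm_dev_exit() and amdgpu_amdkfd_set_compute_idle(), and a plain flag is assumed where the real API relies on SRCU.<br class="">
<br class="">

```c
/* User-space model of an enter/exit guard around a HW access.
 * All names here are hypothetical; this is not the amdgpu code. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct fake_dev {
	bool unplugged;  /* set once the backing device is gone */
	int  hw_writes;  /* counts simulated register writes */
};

/* Returns true and hands out a token if the device is still present. */
static bool fake_dev_enter(const struct fake_dev *dev, int *idx)
{
	assert(dev != NULL && idx != NULL);
	if (dev->unplugged)
		return false;
	*idx = 0; /* the real API returns an SRCU read-side index here */
	return true;
}

static void fake_dev_exit(int idx)
{
	(void)idx; /* the real API would end the read-side section */
}

/* Guard placed at the highest level that would touch hardware. */
static void set_compute_idle(struct fake_dev *dev, bool idle)
{
	int idx;

	if (!fake_dev_enter(dev, &idx))
		return; /* device gone: skip the HW access entirely */

	(void)idle;
	dev->hw_writes++; /* stands in for the power-profile switch */
	fake_dev_exit(idx);
}
```

<br class="">
With the guard at this level, a call made after unplug returns without touching the (simulated) hardware, so lower-level register helpers need no checks of their own.<br class="">
<br class="">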
<blockquote type="cite" class="">Also, I have been
trying to fix the hotplug issue for a long time<br class="">
for kfd applications.<br class="">
I don’t know 1) whether I will be able to get to MI100
(fixing Radeon<br class="">
VII would mean something, but MI100 is more important
for us); 2)<br class="">
in what direction the patch for this issue will move
forward.<br class="">
</blockquote>
<br class="">
<br class="">
I will go to the office tomorrow to pick up an MI100. With
time and<br class="">
priorities permitting, I will then try to test it
and fix any<br class="">
bugs so that it passes all hot plug libdrm
tests at the<br class="">
tip of public amd-staging-drm-next<br class="">
(<a href="https://gitlab.freedesktop.org/agd5f/linux" class="" moz-do-not-send="true">https://gitlab.freedesktop.org/agd5f/linux</a>).
After that you can try<br class="">
to continue working on ROCm enabling on top of that.<br class="">
<br class="">
For now I suggest you move on with Radeon VII as
your development<br class="">
ASIC and use the fix I mentioned above.<br class="">
<br class="">
</blockquote>
I finally got some time to continue on the kfd hotplug patch
attempt.<br class="">
The following patch seems to work for kfd hotplug on
Radeon VII. After<br class="">
hot plugout, the tf process exits because of a vm fault.<br class="">
A new tf process runs without issues after plugback.<br class="">
<br class="">
It has the following fixes.<br class="">
<br class="">
1. ras sysfs regression;<br class="">
2. skip setting compute idle after the dev is unplugged;
otherwise it will<br class="">
try to write to the PCI BAR and cause a driver fault;<br class="">
3. stop the actual work of invalidating the memory map
triggered by<br class="">
userptrs (returning false triggers a warning, so I
returned true;<br class="">
not sure if that is correct);<br class="">
4. send exceptions to all the events/signals that a
“zombie”<br class="">
process is waiting for (not sure if the
hw_exception is<br class="">
worthwhile; it did not do anything in my case since
there is no such<br class="">
event type associated with that process).<br class="">
<br class="">
Please take a look and let me know if it is acceptable.<br class="">
<br class="">
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c<br class="">
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c<br class="">
index 1f8161cd507f..2f7858692067 100644<br class="">
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c<br class="">
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c<br class="">
@@ -33,6 +33,7 @@<br class="">
#include <uapi/linux/kfd_ioctl.h><br class="">
#include "amdgpu_ras.h"<br class="">
#include "amdgpu_umc.h"<br class="">
+#include <drm/drm_drv.h><br class="">
<br class="">
/* Total memory size in system memory and all GPU VRAM.
Used to<br class="">
* estimate worst case amount of memory to reserve for
page tables<br class="">
@@ -681,9 +682,10 @@ int amdgpu_amdkfd_submit_ib(struct
amdgpu_device<br class="">
*adev,<br class="">
<br class="">
void amdgpu_amdkfd_set_compute_idle(struct amdgpu_device
*adev, bool<br class="">
idle)<br class="">
{<br class="">
- amdgpu_dpm_switch_power_profile(adev,<br class="">
- PP_SMC_POWER_PROFILE_COMPUTE,<br class="">
- !idle);<br class="">
+ if (!drm_dev_is_unplugged(adev_to_drm(adev)))<br class="">
+ amdgpu_dpm_switch_power_profile(adev,<br class="">
+ PP_SMC_POWER_PROFILE_COMPUTE,<br class="">
+ !idle);<br class="">
}<br class="">
<br class="">
bool amdgpu_amdkfd_is_kfd_vmid(struct amdgpu_device *adev,
u32 vmid)<br class="">
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c<br class="">
b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c<br class="">
index 4b153daf283d..fb4c9e55eace 100644<br class="">
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c<br class="">
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c<br class="">
@@ -46,6 +46,7 @@<br class="">
#include <linux/firmware.h><br class="">
#include <linux/module.h><br class="">
#include <drm/drm.h><br class="">
+#include <drm/drm_drv.h><br class="">
<br class="">
#include "amdgpu.h"<br class="">
#include "amdgpu_amdkfd.h"<br class="">
@@ -104,6 +105,9 @@ static bool
amdgpu_mn_invalidate_hsa(struct<br class="">
mmu_interval_notifier *mni,<br class="">
struct amdgpu_bo *bo = container_of(mni, struct
amdgpu_bo,<br class="">
notifier);<br class="">
struct amdgpu_device *adev =
amdgpu_ttm_adev(bo->tbo.bdev);<br class="">
<br class="">
+ if (drm_dev_is_unplugged(adev_to_drm(adev)))<br class="">
+ return true;<br class="">
+<br class="">
</blockquote>
</div>
</blockquote>
Label: Fix 3<br class="">
<blockquote type="cite" class="">
<div class="">
<blockquote type="cite" style="font-family: Helvetica;
font-size: 12px; font-style: normal; font-variant-caps:
normal; font-weight: 400; letter-spacing: normal; orphans:
auto; text-align: start; text-indent: 0px; text-transform:
none; white-space: normal; widows: auto; word-spacing:
0px; -webkit-text-size-adjust: auto;
-webkit-text-stroke-width: 0px; text-decoration: none;" class="">
if (!mmu_notifier_range_blockable(range))<br class="">
return false;<br class="">
<br class="">
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c<br class="">
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c<br class="">
index cac56f830aed..fbbaaabf3a67 100644<br class="">
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c<br class="">
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c<br class="">
@@ -1509,7 +1509,6 @@ static int amdgpu_ras_fs_fini(struct<br class="">
amdgpu_device *adev)<br class="">
}<br class="">
}<br class="">
<br class="">
- amdgpu_ras_sysfs_remove_all(adev);<br class="">
return 0;<br class="">
}<br class="">
/* ras fs end */<br class="">
@@ -2557,8 +2556,6 @@ void
amdgpu_ras_block_late_fini(struct<br class="">
amdgpu_device *adev,<br class="">
if (!ras_block)<br class="">
return;<br class="">
<br class="">
- amdgpu_ras_sysfs_remove(adev, ras_block);<br class="">
-<br class="">
ras_obj = container_of(ras_block, struct<br class="">
amdgpu_ras_block_object, ras_comm);<br class="">
if (ras_obj->ras_cb)<br class="">
amdgpu_ras_interrupt_remove_handler(adev,
ras_block);<br class="">
@@ -2659,6 +2656,7 @@ int amdgpu_ras_pre_fini(struct
amdgpu_device *adev)<br class="">
/* Need disable ras on all IPs here before ip
[hw/sw]fini */<br class="">
amdgpu_ras_disable_all_features(adev, 0);<br class="">
amdgpu_ras_recovery_fini(adev);<br class="">
+ amdgpu_ras_sysfs_remove_all(adev);<br class="">
return 0;<br class="">
}<br class="">
<br class="">
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c<br class="">
b/drivers/gpu/drm/amd/amdkfd/kfd_device.c<br class="">
index f1a225a20719..4b789bec9670 100644<br class="">
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c<br class="">
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c<br class="">
@@ -714,16 +714,37 @@ bool kfd_is_locked(void)<br class="">
<br class="">
void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)<br class="">
{<br class="">
+ struct kfd_process *p;<br class="">
+ struct amdkfd_process_info *p_info;<br class="">
+ unsigned int temp;<br class="">
+<br class="">
if (!kfd->init_complete)<br class="">
return;<br class="">
<br class="">
/* for runtime suspend, skip locking kfd */<br class="">
- if (!run_pm) {<br class="">
+ if (!run_pm &&
!drm_dev_is_unplugged(kfd->ddev)) {<br class="">
/* For first KFD device suspend all the KFD
processes */<br class="">
if (atomic_inc_return(&kfd_locked) ==
1)<br class="">
kfd_suspend_all_processes();<br class="">
}<br class="">
<br class="">
+ if (drm_dev_is_unplugged(kfd->ddev)){<br class="">
+ int idx =
srcu_read_lock(&kfd_processes_srcu);<br class="">
+ pr_debug("cancel restore_userptr_work\n");<br class="">
+ hash_for_each_rcu(kfd_processes_table,
temp, p,<br class="">
kfd_processes) {<br class="">
+ if
(kfd_process_gpuidx_from_gpuid(p, kfd->id)<br class="">
>= 0) {<br class="">
+ p_info =
p->kgd_process_info;<br class="">
+ pr_debug("cancel
processes, pid = %d<br class="">
for gpu_id = %d", pid_nr(p_info->pid), kfd->id);<br class="">
+
cancel_delayed_work_sync(&p_info->restore_userptr_work);<br class="">
</blockquote>
<br style="caret-color: rgb(0, 0, 0); font-family:
Helvetica; font-size: 12px; font-style: normal;
font-variant-caps: normal; font-weight: 400;
letter-spacing: normal; text-align: start; text-indent:
0px; text-transform: none; white-space: normal;
word-spacing: 0px; -webkit-text-stroke-width: 0px;
text-decoration: none;" class="">
<span style="caret-color: rgb(0, 0, 0); font-family:
Helvetica; font-size: 12px; font-style: normal;
font-variant-caps: normal; font-weight: 400;
letter-spacing: normal; text-align: start; text-indent:
0px; text-transform: none; white-space: normal;
word-spacing: 0px; -webkit-text-stroke-width: 0px;
text-decoration: none; float: none; display: inline
!important;" class="">Is this really necessary? If it is,
there are probably other workers,</span><br style="caret-color: rgb(0, 0, 0); font-family: Helvetica;
font-size: 12px; font-style: normal; font-variant-caps:
normal; font-weight: 400; letter-spacing: normal;
text-align: start; text-indent: 0px; text-transform: none;
white-space: normal; word-spacing: 0px;
-webkit-text-stroke-width: 0px; text-decoration: none;" class="">
<span style="caret-color: rgb(0, 0, 0); font-family:
Helvetica; font-size: 12px; font-style: normal;
font-variant-caps: normal; font-weight: 400;
letter-spacing: normal; text-align: start; text-indent:
0px; text-transform: none; white-space: normal;
word-spacing: 0px; -webkit-text-stroke-width: 0px;
text-decoration: none; float: none; display: inline
!important;" class="">e.g. related to our SVM code, that
would need to be canceled as well.</span><br style="caret-color: rgb(0, 0, 0); font-family: Helvetica;
font-size: 12px; font-style: normal; font-variant-caps:
normal; font-weight: 400; letter-spacing: normal;
text-align: start; text-indent: 0px; text-transform: none;
white-space: normal; word-spacing: 0px;
-webkit-text-stroke-width: 0px; text-decoration: none;" class="">
<br style="caret-color: rgb(0, 0, 0); font-family:
Helvetica; font-size: 12px; font-style: normal;
font-variant-caps: normal; font-weight: 400;
letter-spacing: normal; text-align: start; text-indent:
0px; text-transform: none; white-space: normal;
word-spacing: 0px; -webkit-text-stroke-width: 0px;
text-decoration: none;" class="">
</div>
</blockquote>
<div><br class="">
</div>
<div>I deleted this and it seems to be OK. It was previously
added to suppress restore_userptr_work, which kept updating
PTEs.</div>
<div>That is now taken care of by Fix 3. Please let us know if it is OK :)
@Felix</div>
<div><br class="">
</div>
<blockquote type="cite" class="">
<div class=""><br style="caret-color: rgb(0, 0, 0);
font-family: Helvetica; font-size: 12px; font-style:
normal; font-variant-caps: normal; font-weight: 400;
letter-spacing: normal; text-align: start; text-indent:
0px; text-transform: none; white-space: normal;
word-spacing: 0px; -webkit-text-stroke-width: 0px;
text-decoration: none;" class="">
<blockquote type="cite" style="font-family: Helvetica;
font-size: 12px; font-style: normal; font-variant-caps:
normal; font-weight: 400; letter-spacing: normal; orphans:
auto; text-align: start; text-indent: 0px; text-transform:
none; white-space: normal; widows: auto; word-spacing:
0px; -webkit-text-size-adjust: auto;
-webkit-text-stroke-width: 0px; text-decoration: none;" class="">
+<br class="">
+ /* send exception signals to the kfd<br class="">
events waiting in user space */<br class="">
+ kfd_signal_hw_exception_event(p->pasid);<br class="">
</blockquote>
<br style="caret-color: rgb(0, 0, 0); font-family:
Helvetica; font-size: 12px; font-style: normal;
font-variant-caps: normal; font-weight: 400;
letter-spacing: normal; text-align: start; text-indent:
0px; text-transform: none; white-space: normal;
word-spacing: 0px; -webkit-text-stroke-width: 0px;
text-decoration: none;" class="">
<span style="caret-color: rgb(0, 0, 0); font-family:
Helvetica; font-size: 12px; font-style: normal;
font-variant-caps: normal; font-weight: 400;
letter-spacing: normal; text-align: start; text-indent:
0px; text-transform: none; white-space: normal;
word-spacing: 0px; -webkit-text-stroke-width: 0px;
text-decoration: none; float: none; display: inline
!important;" class="">This makes sense. It basically tells
user mode that the application's</span><br style="caret-color: rgb(0, 0, 0); font-family: Helvetica;
font-size: 12px; font-style: normal; font-variant-caps:
normal; font-weight: 400; letter-spacing: normal;
text-align: start; text-indent: 0px; text-transform: none;
white-space: normal; word-spacing: 0px;
-webkit-text-stroke-width: 0px; text-decoration: none;" class="">
<span style="caret-color: rgb(0, 0, 0); font-family:
Helvetica; font-size: 12px; font-style: normal;
font-variant-caps: normal; font-weight: 400;
letter-spacing: normal; text-align: start; text-indent:
0px; text-transform: none; white-space: normal;
word-spacing: 0px; -webkit-text-stroke-width: 0px;
text-decoration: none; float: none; display: inline
!important;" class="">GPU state is lost due to a RAS error
or a GPU reset, or now a GPU</span><br style="caret-color:
rgb(0, 0, 0); font-family: Helvetica; font-size: 12px;
font-style: normal; font-variant-caps: normal;
font-weight: 400; letter-spacing: normal; text-align:
start; text-indent: 0px; text-transform: none;
white-space: normal; word-spacing: 0px;
-webkit-text-stroke-width: 0px; text-decoration: none;" class="">
<span style="caret-color: rgb(0, 0, 0); font-family:
Helvetica; font-size: 12px; font-style: normal;
font-variant-caps: normal; font-weight: 400;
letter-spacing: normal; text-align: start; text-indent:
0px; text-transform: none; white-space: normal;
word-spacing: 0px; -webkit-text-stroke-width: 0px;
text-decoration: none; float: none; display: inline
!important;" class="">hot-unplug.</span><br style="caret-color: rgb(0, 0, 0); font-family: Helvetica;
font-size: 12px; font-style: normal; font-variant-caps:
normal; font-weight: 400; letter-spacing: normal;
text-align: start; text-indent: 0px; text-transform: none;
white-space: normal; word-spacing: 0px;
-webkit-text-stroke-width: 0px; text-decoration: none;" class="">
</div>
</blockquote>
<div><br class="">
</div>
<div>The problem is that it cannot find an event with a type
that matches HW_EXCEPTION_TYPE, so the driver does **nothing**
with the default parameter value of send_sigterm =
false.</div>
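<div>The behaviour described above can be modelled roughly as follows. This is a simplified user-space sketch in C, not the kfd code: the enum values, struct event and signal_by_type() are hypothetical names, and the sketch assumes that when no event of the requested type is registered, the only fallback is an optional SIGTERM controlled by send_sigterm.</div>

```c
/* User-space model of "signal all events of a given type".
 * All names are hypothetical; this is not the kfd implementation. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum ev_type { EV_SIGNAL, EV_HW_EXCEPTION, EV_MEMORY };

struct event {
	enum ev_type type;
	bool signaled;
};

enum outcome { SIGNALED_EVENT, SENT_SIGTERM, UNHANDLED };

static enum outcome signal_by_type(struct event *evs, size_t n,
				   enum ev_type type, bool send_sigterm)
{
	bool found = false;

	assert(evs != NULL || n == 0);

	/* Wake every waiter registered for this event type. */
	for (size_t i = 0; i < n; i++) {
		if (evs[i].type == type) {
			evs[i].signaled = true;
			found = true;
		}
	}

	if (found)
		return SIGNALED_EVENT;

	/* No matching event: terminate only if explicitly requested;
	 * otherwise the exception is merely reported as unhandled. */
	return send_sigterm ? SENT_SIGTERM : UNHANDLED;
}
```

<div>In this model, a process that never registered a hardware-exception event is left untouched when send_sigterm is false, which matches the “got unhandled exception” outcome reported in this thread.</div>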
<div>After all, if a “zombie” process (zombie in the sense that it
no longer has a GPU dev) does not exit, kfd resources do not seem
to be released properly, and a new kfd process cannot run
after plug-back.</div>
<div>(I still need to look hard into the rocr/hsakmt/kfd driver code
to understand the reason. At least I am seeing that the kfd
topology won’t be cleaned up without the process exiting, so
there would be a “zombie” kfd node in the topology, which may
or may not cause issues in hsakmt.)</div>
<div>@Felix Do you have any suggestion/insight on this “zombie”
process issue? @Andrey suggests it should be OK to have a
“zombie” kfd process and a “zombie” kfd dev, and that the new kfd
process should be able to run on the new kfd dev after plug-back.</div>
</div>
</blockquote>
<p><br>
</p>
<p>At least, that is what my experience with the graphics stack showed.
In a setup with 2 GPUs, if I removed a secondary GPU which
had a rendering process on it, I could plug back the secondary GPU
and start a new rendering process while the old zombie process was
still present. It could be that in the KFD case there are some
obstacles to this that need to be resolved.<br>
</p>
<p>Andrey</p>
<p><br>
</p>
<blockquote type="cite" cite="mid:E51808D5-5E34-420C-9CBD-F2BAE26E45F5@microsoft.com">
<div>
<div><br class="">
</div>
<div>
<div>May 11 09:52:07 NETSYS26 kernel: [52604.845400] amdgpu:
cancel restore_userptr_work</div>
<div>May 11 09:52:07 NETSYS26 kernel: [52604.845405] amdgpu:
sending hw exception to pasid = 0x800</div>
<div>May 11 09:52:07 NETSYS26 kernel: [52604.845414] kfd kfd:
amdgpu: Process 25894 (pasid 0x8001) got unhandled exception</div>
<div><br class="">
</div>
</div>
<blockquote type="cite" class="">
<div class=""><br style="caret-color: rgb(0, 0, 0);
font-family: Helvetica; font-size: 12px; font-style:
normal; font-variant-caps: normal; font-weight: 400;
letter-spacing: normal; text-align: start; text-indent:
0px; text-transform: none; white-space: normal;
word-spacing: 0px; -webkit-text-stroke-width: 0px;
text-decoration: none;" class="">
<br style="caret-color: rgb(0, 0, 0); font-family:
Helvetica; font-size: 12px; font-style: normal;
font-variant-caps: normal; font-weight: 400;
letter-spacing: normal; text-align: start; text-indent:
0px; text-transform: none; white-space: normal;
word-spacing: 0px; -webkit-text-stroke-width: 0px;
text-decoration: none;" class="">
<blockquote type="cite" style="font-family: Helvetica;
font-size: 12px; font-style: normal; font-variant-caps:
normal; font-weight: 400; letter-spacing: normal; orphans:
auto; text-align: start; text-indent: 0px; text-transform:
none; white-space: normal; widows: auto; word-spacing:
0px; -webkit-text-size-adjust: auto;
-webkit-text-stroke-width: 0px; text-decoration: none;" class="">
+ kfd_signal_vm_fault_event(kfd, p->pasid, NULL);<br class="">
</blockquote>
<br style="caret-color: rgb(0, 0, 0); font-family:
Helvetica; font-size: 12px; font-style: normal;
font-variant-caps: normal; font-weight: 400;
letter-spacing: normal; text-align: start; text-indent:
0px; text-transform: none; white-space: normal;
word-spacing: 0px; -webkit-text-stroke-width: 0px;
text-decoration: none;" class="">
<span style="caret-color: rgb(0, 0, 0); font-family:
Helvetica; font-size: 12px; font-style: normal;
font-variant-caps: normal; font-weight: 400;
letter-spacing: normal; text-align: start; text-indent:
0px; text-transform: none; white-space: normal;
word-spacing: 0px; -webkit-text-stroke-width: 0px;
text-decoration: none; float: none; display: inline
!important;" class="">This does not make sense. A VM fault
indicates an access to a bad</span><br style="caret-color:
rgb(0, 0, 0); font-family: Helvetica; font-size: 12px;
font-style: normal; font-variant-caps: normal;
font-weight: 400; letter-spacing: normal; text-align:
start; text-indent: 0px; text-transform: none;
white-space: normal; word-spacing: 0px;
-webkit-text-stroke-width: 0px; text-decoration: none;" class="">
<span style="caret-color: rgb(0, 0, 0); font-family:
Helvetica; font-size: 12px; font-style: normal;
font-variant-caps: normal; font-weight: 400;
letter-spacing: normal; text-align: start; text-indent:
0px; text-transform: none; white-space: normal;
word-spacing: 0px; -webkit-text-stroke-width: 0px;
text-decoration: none; float: none; display: inline
!important;" class="">virtual address by the GPU. If a
debugger is attached to the process, it</span><br style="caret-color: rgb(0, 0, 0); font-family: Helvetica;
font-size: 12px; font-style: normal; font-variant-caps:
normal; font-weight: 400; letter-spacing: normal;
text-align: start; text-indent: 0px; text-transform: none;
white-space: normal; word-spacing: 0px;
-webkit-text-stroke-width: 0px; text-decoration: none;" class="">
<span style="caret-color: rgb(0, 0, 0); font-family:
Helvetica; font-size: 12px; font-style: normal;
font-variant-caps: normal; font-weight: 400;
letter-spacing: normal; text-align: start; text-indent:
0px; text-transform: none; white-space: normal;
word-spacing: 0px; -webkit-text-stroke-width: 0px;
text-decoration: none; float: none; display: inline
!important;" class="">notifies the debugger to investigate
what went wrong. If the GPU is</span><br style="caret-color: rgb(0, 0, 0); font-family: Helvetica;
font-size: 12px; font-style: normal; font-variant-caps:
normal; font-weight: 400; letter-spacing: normal;
text-align: start; text-indent: 0px; text-transform: none;
white-space: normal; word-spacing: 0px;
-webkit-text-stroke-width: 0px; text-decoration: none;" class="">
<span style="caret-color: rgb(0, 0, 0); font-family:
Helvetica; font-size: 12px; font-style: normal;
font-variant-caps: normal; font-weight: 400;
letter-spacing: normal; text-align: start; text-indent:
0px; text-transform: none; white-space: normal;
word-spacing: 0px; -webkit-text-stroke-width: 0px;
text-decoration: none; float: none; display: inline
!important;" class="">gone, that doesn't make any sense.
There is no GPU that could have</span><br style="caret-color: rgb(0, 0, 0); font-family: Helvetica;
font-size: 12px; font-style: normal; font-variant-caps:
normal; font-weight: 400; letter-spacing: normal;
text-align: start; text-indent: 0px; text-transform: none;
white-space: normal; word-spacing: 0px;
-webkit-text-stroke-width: 0px; text-decoration: none;" class="">
<span style="caret-color: rgb(0, 0, 0); font-family:
Helvetica; font-size: 12px; font-style: normal;
font-variant-caps: normal; font-weight: 400;
letter-spacing: normal; text-align: start; text-indent:
0px; text-transform: none; white-space: normal;
word-spacing: 0px; -webkit-text-stroke-width: 0px;
text-decoration: none; float: none; display: inline
!important;" class="">issued a bad memory request. And the
debugger won't be happy either to</span><br style="caret-color: rgb(0, 0, 0); font-family: Helvetica;
font-size: 12px; font-style: normal; font-variant-caps:
normal; font-weight: 400; letter-spacing: normal;
text-align: start; text-indent: 0px; text-transform: none;
white-space: normal; word-spacing: 0px;
-webkit-text-stroke-width: 0px; text-decoration: none;" class="">
<span style="caret-color: rgb(0, 0, 0); font-family:
Helvetica; font-size: 12px; font-style: normal;
font-variant-caps: normal; font-weight: 400;
letter-spacing: normal; text-align: start; text-indent:
0px; text-transform: none; white-space: normal;
word-spacing: 0px; -webkit-text-stroke-width: 0px;
text-decoration: none; float: none; display: inline
!important;" class="">find a VM fault from a GPU that
doesn't exist any more.</span><br style="caret-color:
rgb(0, 0, 0); font-family: Helvetica; font-size: 12px;
font-style: normal; font-variant-caps: normal;
font-weight: 400; letter-spacing: normal; text-align:
start; text-indent: 0px; text-transform: none;
white-space: normal; word-spacing: 0px;
-webkit-text-stroke-width: 0px; text-decoration: none;" class="">
</div>
</blockquote>
<div><br class="">
</div>
<div>OK understood.</div>
<br class="">
<blockquote type="cite" class="">
<div class=""><br style="caret-color: rgb(0, 0, 0);
font-family: Helvetica; font-size: 12px; font-style:
normal; font-variant-caps: normal; font-weight: 400;
letter-spacing: normal; text-align: start; text-indent:
0px; text-transform: none; white-space: normal;
word-spacing: 0px; -webkit-text-stroke-width: 0px;
text-decoration: none;" class="">
<span style="caret-color: rgb(0, 0, 0); font-family:
Helvetica; font-size: 12px; font-style: normal;
font-variant-caps: normal; font-weight: 400;
letter-spacing: normal; text-align: start; text-indent:
0px; text-transform: none; white-space: normal;
word-spacing: 0px; -webkit-text-stroke-width: 0px;
text-decoration: none; float: none; display: inline
!important;" class="">If the HW-exception event doesn't
terminate your process, we may need to</span><br style="caret-color: rgb(0, 0, 0); font-family: Helvetica;
font-size: 12px; font-style: normal; font-variant-caps:
normal; font-weight: 400; letter-spacing: normal;
text-align: start; text-indent: 0px; text-transform: none;
white-space: normal; word-spacing: 0px;
-webkit-text-stroke-width: 0px; text-decoration: none;" class="">
<span style="caret-color: rgb(0, 0, 0); font-family:
Helvetica; font-size: 12px; font-style: normal;
font-variant-caps: normal; font-weight: 400;
letter-spacing: normal; text-align: start; text-indent:
0px; text-transform: none; white-space: normal;
word-spacing: 0px; -webkit-text-stroke-width: 0px;
text-decoration: none; float: none; display: inline
!important;" class="">look into how ROCr handles the
HW-exception events.</span><br style="caret-color: rgb(0,
0, 0); font-family: Helvetica; font-size: 12px;
font-style: normal; font-variant-caps: normal;
font-weight: 400; letter-spacing: normal; text-align:
start; text-indent: 0px; text-transform: none;
white-space: normal; word-spacing: 0px;
-webkit-text-stroke-width: 0px; text-decoration: none;" class="">
<br style="caret-color: rgb(0, 0, 0); font-family:
Helvetica; font-size: 12px; font-style: normal;
font-variant-caps: normal; font-weight: 400;
letter-spacing: normal; text-align: start; text-indent:
0px; text-transform: none; white-space: normal;
word-spacing: 0px; -webkit-text-stroke-width: 0px;
text-decoration: none;" class="">
<br style="caret-color: rgb(0, 0, 0); font-family:
Helvetica; font-size: 12px; font-style: normal;
font-variant-caps: normal; font-weight: 400;
letter-spacing: normal; text-align: start; text-indent:
0px; text-transform: none; white-space: normal;
word-spacing: 0px; -webkit-text-stroke-width: 0px;
text-decoration: none;" class="">
<blockquote type="cite" style="font-family: Helvetica;
font-size: 12px; font-style: normal; font-variant-caps:
normal; font-weight: 400; letter-spacing: normal; orphans:
auto; text-align: start; text-indent: 0px; text-transform:
none; white-space: normal; widows: auto; word-spacing:
0px; -webkit-text-size-adjust: auto;
-webkit-text-stroke-width: 0px; text-decoration: none;" class="">
+ }<br class="">
+ }<br class="">
+ srcu_read_unlock(&kfd_processes_srcu, idx);<br class="">
+ }<br class="">
+<br class="">
kfd->dqm->ops.stop(kfd->dqm);<br class="">
kfd_iommu_suspend(kfd);<br class="">
</blockquote>
<br style="caret-color: rgb(0, 0, 0); font-family:
Helvetica; font-size: 12px; font-style: normal;
font-variant-caps: normal; font-weight: 400;
letter-spacing: normal; text-align: start; text-indent:
0px; text-transform: none; white-space: normal;
word-spacing: 0px; -webkit-text-stroke-width: 0px;
text-decoration: none;" class="">
<span style="caret-color: rgb(0, 0, 0); font-family:
Helvetica; font-size: 12px; font-style: normal;
font-variant-caps: normal; font-weight: 400;
letter-spacing: normal; text-align: start; text-indent:
0px; text-transform: none; white-space: normal;
word-spacing: 0px; -webkit-text-stroke-width: 0px;
text-decoration: none; float: none; display: inline
!important;" class="">Should DQM stop and IOMMU suspend
still be executed? Or should the</span><br style="caret-color: rgb(0, 0, 0); font-family: Helvetica;
font-size: 12px; font-style: normal; font-variant-caps:
normal; font-weight: 400; letter-spacing: normal;
text-align: start; text-indent: 0px; text-transform: none;
white-space: normal; word-spacing: 0px;
-webkit-text-stroke-width: 0px; text-decoration: none;" class="">
<span style="caret-color: rgb(0, 0, 0); font-family:
Helvetica; font-size: 12px; font-style: normal;
font-variant-caps: normal; font-weight: 400;
letter-spacing: normal; text-align: start; text-indent:
0px; text-transform: none; white-space: normal;
word-spacing: 0px; -webkit-text-stroke-width: 0px;
text-decoration: none; float: none; display: inline
!important;" class="">hot-unplug case short-circuit them?</span><br style="caret-color: rgb(0, 0, 0); font-family: Helvetica;
font-size: 12px; font-style: normal; font-variant-caps:
normal; font-weight: 400; letter-spacing: normal;
text-align: start; text-indent: 0px; text-transform: none;
white-space: normal; word-spacing: 0px;
-webkit-text-stroke-width: 0px; text-decoration: none;" class="">
</div>
</blockquote>
<div><br class="">
</div>
<div>I tried short-circuiting them, but that later caused a BUG
related to GPU reset. I added the following, which solves the
issue on plugout.</div>
<div>
<div><br class="">
</div>
<div>diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c</div>
<div>index b583026dc893..d78a06d74759 100644</div>
<div>--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c</div>
<div>+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c</div>
<div>@@ -5317,7 +5317,8 @@ static void
amdgpu_device_queue_gpu_recover_work(struct work_struct
*work)</div>
<div> {</div>
<div> struct amdgpu_recover_work_struct *recover_work =
container_of(work, struct amdgpu_recover_work_struct, base);</div>
<div><br class="">
</div>
<div>- recover_work->ret =
amdgpu_device_gpu_recover_imp(recover_work->adev,
recover_work->job);</div>
<div>+ if
(!drm_dev_is_unplugged(adev_to_drm(recover_work->adev)))</div>
<div>+ recover_work->ret =
amdgpu_device_gpu_recover_imp(recover_work->adev,
recover_work->job);</div>
<div> }</div>
<div> /*</div>
<div> * Serialize gpu recover into reset domain single
threaded wq</div>
<div><br class="">
</div>
</div>
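<div>To make the intent of the guard easier to see outside the kernel tree, here is a minimal user-space sketch of the same pattern. It is only an illustration: the mock_* types and functions and recoveries_attempted() are hypothetical stand-ins, not the real amdgpu structures, and the unplugged flag merely plays the role of drm_dev_is_unplugged().</div>

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the device; not the real struct amdgpu_device. */
struct mock_device {
    bool unplugged;     /* plays the role of drm_dev_is_unplugged() */
    int recover_calls;  /* how many times recovery actually ran */
};

/* Hypothetical stand-in for struct amdgpu_recover_work_struct. */
struct mock_recover_work {
    struct mock_device *dev;
    int ret;
};

/* Stand-in for amdgpu_device_gpu_recover_imp(); only counts invocations. */
static int mock_gpu_recover(struct mock_device *dev)
{
    dev->recover_calls++;
    return 0;
}

/* Same shape as the patched handler: skip recovery once the device is gone. */
static void mock_recover_work_handler(struct mock_recover_work *work)
{
    if (!work->dev->unplugged)
        work->ret = mock_gpu_recover(work->dev);
}

/* Runs the handler once and reports how many recoveries were attempted. */
static int recoveries_attempted(bool unplugged)
{
    struct mock_device dev = { .unplugged = unplugged, .recover_calls = 0 };
    struct mock_recover_work work = { .dev = &dev, .ret = -1 };

    mock_recover_work_handler(&work);
    /* When recovery is skipped, ret keeps its initial value. */
    assert(unplugged ? work.ret == -1 : work.ret == 0);
    return dev.recover_calls;
}
```

<div>In this sketch, ret keeps its initial value when recovery is skipped. What the real handler should report to its caller in that case is a design choice the patch would still need to settle.</div>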
<div>However, after killing the zombie process, the driver failed to
evict the queues of the process.</div>
<div><br class="">
</div>
<div>
<div>[ +0.000002] amdgpu: writing 263 to doorbell address
00000000c86e63f2</div>
<div>[ +9.002503] amdgpu: qcm fence wait loop timeout expired</div>
<div>[ +0.001364] amdgpu: The cp might be in an unrecoverable
state due to an unsuccessful queues preemption</div>
<div>[ +0.001343] amdgpu: Failed to evict process queues</div>
<div>[ +0.001355] amdgpu: Failed to evict queues of pasid
0x8001</div>
<div class=""><br class="">
</div>
</div>
<div><br class="">
</div>
<div>This then causes a driver BUG, triggered by a new KFD process
after plugback. I am pasting the dmesg errors from after
plugback below.</div>
<div class=""><br class="">
</div>
<div><br class="">
</div>
<div><br class="">
</div>
<div>
<div>May 11 10:25:16 NETSYS26 kernel: [ 688.445332] amdgpu:
Evicting PASID 0x8001 queues</div>
<div>May 11 10:25:16 NETSYS26 kernel: [ 688.445359] BUG:
unable to handle page fault for address: 000000020000006e</div>
<div>May 11 10:25:16 NETSYS26 kernel: [ 688.447516] #PF:
supervisor read access in kernel mode</div>
<div>May 11 10:25:16 NETSYS26 kernel: [ 688.449627] #PF:
error_code(0x0000) - not-present page</div>
<div>May 11 10:25:16 NETSYS26 kernel: [ 688.451661] PGD
80000020892a8067 P4D 80000020892a8067 PUD 0</div>
<div>May 11 10:25:16 NETSYS26 kernel: [ 688.453741] Oops:
0000 [#1] PREEMPT SMP PTI</div>
<div>May 11 10:25:16 NETSYS26 kernel: [ 688.455904] CPU: 25
PID: 9236 Comm: tf_cnn_benchmar Tainted: G W OE
5.16.0+ #3</div>
<div>May 11 10:25:16 NETSYS26 kernel: [ 688.457406] amdgpu
0000:05:00.0: amdgpu: GPU reset begin!</div>
<div>May 11 10:25:16 NETSYS26 kernel: [ 688.457798] Hardware
name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 [FPGA Test
BIOS] 10/002/2015</div>
<div>May 11 10:25:16 NETSYS26 kernel: [ 688.461458] RIP:
0010:evict_process_queues_cpsch+0x99/0x1b0 [amdgpu]</div>
<div>May 11 10:25:16 NETSYS26 kernel: [ 688.465238] Code: bd
13 8a dd 85 c0 0f 85 13 01 00 00 49 8b 5f 10 4d 8d 77 10 49
39 de 75 11 e9 8d 00 00 00 48 8b 1b 4c 39 f3 0f 84 81 00 00
00 <80> 7b 6e 00 c6 43 6d 01 74 ea c6 43 6e 00 41 83
ac 24 70 01 00 00</div>
<div>May 11 10:25:16 NETSYS26 kernel: [ 688.470516] RSP:
0018:ffffb2674c8afbf0 EFLAGS: 00010203</div>
<div>May 11 10:25:16 NETSYS26 kernel: [ 688.473255] RAX:
ffff91c65cca3800 RBX: 0000000200000000 RCX: 0000000000000001</div>
<div>May 11 10:25:16 NETSYS26 kernel: [ 688.475691] RDX:
0000000000000000 RSI: ffffffff9fb712d9 RDI: 00000000ffffffff</div>
<div>May 11 10:25:16 NETSYS26 kernel: [ 688.478564] RBP:
ffffb2674c8afc20 R08: 0000000000000000 R09: 000000000006ba18</div>
<div>May 11 10:25:16 NETSYS26 kernel: [ 688.481409] R10:
00007fe5a0000000 R11: ffffb2674c8af918 R12: ffff91c66d6f5800</div>
<div>May 11 10:25:16 NETSYS26 kernel: [ 688.484254] R13:
ffff91c66d6f5938 R14: ffff91e5c71ac820 R15: ffff91e5c71ac810</div>
<div>May 11 10:25:16 NETSYS26 kernel: [ 688.487184] FS:
00007fe62124a700(0000) GS:ffff92053fd00000(0000)
knlGS:0000000000000000</div>
<div>May 11 10:25:16 NETSYS26 kernel: [ 688.490308] CS: 0010
DS: 0000 ES: 0000 CR0: 0000000080050033</div>
<div>May 11 10:25:16 NETSYS26 kernel: [ 688.493122] CR2:
000000020000006e CR3: 0000002095284004 CR4: 00000000001706e0</div>
<div>May 11 10:25:16 NETSYS26 kernel: [ 688.496142] Call
Trace:</div>
<div>May 11 10:25:16 NETSYS26 kernel: [ 688.499199]
<TASK></div>
<div>May 11 10:25:16 NETSYS26 kernel: [ 688.502261]
kfd_process_evict_queues+0x43/0xf0 [amdgpu]</div>
<div>May 11 10:25:16 NETSYS26 kernel: [ 688.506378]
kgd2kfd_quiesce_mm+0x2a/0x60 [amdgpu]</div>
<div>May 11 10:25:16 NETSYS26 kernel: [ 688.510539]
amdgpu_amdkfd_evict_userptr+0x46/0x80 [amdgpu]</div>
<div>May 11 10:25:16 NETSYS26 kernel: [ 688.514110]
amdgpu_mn_invalidate_hsa+0x9c/0xb0 [amdgpu]</div>
<div>May 11 10:25:16 NETSYS26 kernel: [ 688.518247]
__mmu_notifier_invalidate_range_start+0x136/0x1e0</div>
<div>May 11 10:25:16 NETSYS26 kernel: [ 688.521252]
change_protection+0x41d/0xcd0</div>
<div>May 11 10:25:16 NETSYS26 kernel: [ 688.524310]
change_prot_numa+0x19/0x30</div>
<div>May 11 10:25:16 NETSYS26 kernel: [ 688.527366]
task_numa_work+0x1ca/0x330</div>
<div>May 11 10:25:16 NETSYS26 kernel: [ 688.530157]
task_work_run+0x6c/0xa0</div>
<div>May 11 10:25:16 NETSYS26 kernel: [ 688.533124]
exit_to_user_mode_prepare+0x1af/0x1c0</div>
<div>May 11 10:25:16 NETSYS26 kernel: [ 688.536058]
syscall_exit_to_user_mode+0x2a/0x40</div>
<div>May 11 10:25:16 NETSYS26 kernel: [ 688.538989]
do_syscall_64+0x46/0xb0</div>
<div>May 11 10:25:16 NETSYS26 kernel: [ 688.541830]
entry_SYSCALL_64_after_hwframe+0x44/0xae</div>
<div>May 11 10:25:16 NETSYS26 kernel: [ 688.544701] RIP:
0033:0x7fe6585ec317</div>
<div>May 11 10:25:16 NETSYS26 kernel: [ 688.547297] Code: b3
66 90 48 8b 05 71 4b 2d 00 64 c7 00 26 00 00 00 48 c7 c0 ff
ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f
05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 41 4b 2d 00
f7 d8 64 89 01 48</div>
<div>May 11 10:25:16 NETSYS26 kernel: [ 688.553183] RSP:
002b:00007fe6212494c8 EFLAGS: 00000246 ORIG_RAX:
0000000000000010</div>
<div>May 11 10:25:16 NETSYS26 kernel: [ 688.556105] RAX:
ffffffffffffffc2 RBX: 0000000000000000 RCX: 00007fe6585ec317</div>
<div>May 11 10:25:16 NETSYS26 kernel: [ 688.558970] RDX:
00007fe621249540 RSI: 00000000c0584b02 RDI: 0000000000000003</div>
<div>May 11 10:25:16 NETSYS26 kernel: [ 688.561950] RBP:
00007fe621249540 R08: 0000000000000000 R09: 0000000000040000</div>
<div>May 11 10:25:16 NETSYS26 kernel: [ 688.564563] R10:
00007fe617480000 R11: 0000000000000246 R12: 00000000c0584b02</div>
<div>May 11 10:25:16 NETSYS26 kernel: [ 688.567494] R13:
0000000000000003 R14: 0000000000000064 R15: 00007fe621249920</div>
<div>May 11 10:25:16 NETSYS26 kernel: [ 688.570470]
</TASK></div>
<div>May 11 10:25:16 NETSYS26 kernel: [ 688.573380] Modules
linked in: amdgpu(OE) veth nf_conntrack_netlink nfnetlink
xfrm_user xt_addrtype br_netfilter xt_CHECKSUM
iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack
nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT
nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter
ebtables ip6table_filter ip6_tables iptable_filter overlay
esp6_offload esp6 esp4_offload esp4 xfrm_algo intel_rapl_msr
intel_rapl_common sb_edac snd_hda_codec_hdmi
x86_pkg_temp_thermal snd_hda_intel intel_powerclamp
snd_intel_dspcfg ipmi_ssif coretemp snd_hda_codec kvm_intel
snd_hda_core snd_hwdep kvm snd_pcm snd_timer snd soundcore
ftdi_sio irqbypass rapl intel_cstate usbserial joydev mei_me
input_leds mei iTCO_wdt iTCO_vendor_support lpc_ich ipmi_si
ipmi_devintf mac_hid acpi_power_meter ipmi_msghandler
sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp
libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables
x_tables autofs4 btrfs blake2b_generic zstd_compress raid10
raid456</div>
<div>May 11 10:25:16 NETSYS26 kernel: [ 688.573543]
async_raid6_recov async_memcpy async_pq async_xor async_tx
xor raid6_pq libcrc32c raid1 raid0 multipath linear iommu_v2
gpu_sched drm_ttm_helper mgag200 ttm drm_shmem_helper
drm_kms_helper syscopyarea hid_generic crct10dif_pclmul
crc32_pclmul sysfillrect ghash_clmulni_intel sysimgblt
fb_sys_fops uas usbhid aesni_intel crypto_simd igb ahci hid
drm usb_storage cryptd libahci dca megaraid_sas i2c_algo_bit
wmi [last unloaded: amdgpu]</div>
<div>May 11 10:25:16 NETSYS26 kernel: [ 688.611083] CR2:
000000020000006e</div>
<div>May 11 10:25:16 NETSYS26 kernel: [ 688.614454] ---[ end
trace 349cf28efb6268bc ]---</div>
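<div>One possible reading of the oops above, offered as a sketch and not as a confirmed diagnosis: on x86-64 the faulting bytes 80 7b 6e 00 decode as a byte compare against the memory at 0x6e(%rbx). RBX holds 0000000200000000, so the access lands exactly on the CR2 value 000000020000006e. That would be consistent with the queue list walk in evict_process_queues_cpsch following a stale or corrupted entry pointer, although the trace alone does not prove it. The small check below only redoes that arithmetic; effective_address() and oops_address_matches() are made-up helper names.</div>

```c
#include <assert.h>
#include <stdint.h>

/* Register values copied from the oops above. */
#define OOPS_RBX 0x0000000200000000ULL
#define OOPS_CR2 0x000000020000006eULL

/* Displacement byte taken from the faulting bytes "80 7b 6e 00". */
#define FAULT_DISP8 0x6e

/* Base register plus a sign-extended 8-bit displacement, as in disp8 addressing. */
static uint64_t effective_address(uint64_t base, int8_t disp8)
{
    return base + (uint64_t)(int64_t)disp8;
}

/* Returns 1 if RBX + 0x6e reproduces the faulting address reported in CR2. */
static int oops_address_matches(void)
{
    return effective_address(OOPS_RBX, FAULT_DISP8) == OOPS_CR2;
}
```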
<div><br class="">
</div>
<div>Looking forward to the comments.</div>
<div><br class="">
</div>
<div>Regards,</div>
<div>Shuotao</div>
<div><br class="">
</div>
</div>
<blockquote type="cite" class="">
<div class=""><br style="caret-color: rgb(0, 0, 0);
font-family: Helvetica; font-size: 12px; font-style:
normal; font-variant-caps: normal; font-weight: 400;
letter-spacing: normal; text-align: start; text-indent:
0px; text-transform: none; white-space: normal;
word-spacing: 0px; -webkit-text-stroke-width: 0px;
text-decoration: none;" class="">
<span style="caret-color: rgb(0, 0, 0); font-family:
Helvetica; font-size: 12px; font-style: normal;
font-variant-caps: normal; font-weight: 400;
letter-spacing: normal; text-align: start; text-indent:
0px; text-transform: none; white-space: normal;
word-spacing: 0px; -webkit-text-stroke-width: 0px;
text-decoration: none; float: none; display: inline
!important;" class="">Regards,</span><br style="caret-color: rgb(0, 0, 0); font-family: Helvetica;
font-size: 12px; font-style: normal; font-variant-caps:
normal; font-weight: 400; letter-spacing: normal;
text-align: start; text-indent: 0px; text-transform: none;
white-space: normal; word-spacing: 0px;
-webkit-text-stroke-width: 0px; text-decoration: none;" class="">
<span style="caret-color: rgb(0, 0, 0); font-family:
Helvetica; font-size: 12px; font-style: normal;
font-variant-caps: normal; font-weight: 400;
letter-spacing: normal; text-align: start; text-indent:
0px; text-transform: none; white-space: normal;
word-spacing: 0px; -webkit-text-stroke-width: 0px;
text-decoration: none; float: none; display: inline
!important;" class="">Felix</span><br style="caret-color:
rgb(0, 0, 0); font-family: Helvetica; font-size: 12px;
font-style: normal; font-variant-caps: normal;
font-weight: 400; letter-spacing: normal; text-align:
start; text-indent: 0px; text-transform: none;
white-space: normal; word-spacing: 0px;
-webkit-text-stroke-width: 0px; text-decoration: none;" class="">
<br style="caret-color: rgb(0, 0, 0); font-family:
Helvetica; font-size: 12px; font-style: normal;
font-variant-caps: normal; font-weight: 400;
letter-spacing: normal; text-align: start; text-indent:
0px; text-transform: none; white-space: normal;
word-spacing: 0px; -webkit-text-stroke-width: 0px;
text-decoration: none;" class="">
<br style="caret-color: rgb(0, 0, 0); font-family:
Helvetica; font-size: 12px; font-style: normal;
font-variant-caps: normal; font-weight: 400;
letter-spacing: normal; text-align: start; text-indent:
0px; text-transform: none; white-space: normal;
word-spacing: 0px; -webkit-text-stroke-width: 0px;
text-decoration: none;" class="">
<blockquote type="cite" style="font-family: Helvetica;
font-size: 12px; font-style: normal; font-variant-caps:
normal; font-weight: 400; letter-spacing: normal; orphans:
auto; text-align: start; text-indent: 0px; text-transform:
none; white-space: normal; widows: auto; word-spacing:
0px; -webkit-text-size-adjust: auto;
-webkit-text-stroke-width: 0px; text-decoration: none;" class="">
}<br class="">
<br class="">
Regards,<br class="">
Shuotao<br class="">
<blockquote type="cite" class=""><br class="">
Andrey<br class="">
<br class="">
<br class="">
<blockquote type="cite" class=""><br class="">
Regards,<br class="">
Shuotao<br class="">
<br class="">
[ +0.001645] BUG: unable to handle page fault for
address:<br class="">
0000000000058a68<br class="">
[ +0.001298] #PF: supervisor read access in kernel
mode<br class="">
[ +0.001252] #PF: error_code(0x0000) - not-present
page<br class="">
[ +0.001248] PGD 8000000115806067 P4D
8000000115806067 PUD<br class="">
109b2d067 PMD 0<br class="">
[ +0.001270] Oops: 0000 [#1] PREEMPT SMP PTI<br class="">
[ +0.001256] CPU: 5 PID: 13818 Comm: tf_cnn_benchmar
Tainted: G<br class="">
W E 5.16.0+ #3<br class="">
[ +0.001290] Hardware name: Dell Inc. PowerEdge
R730/0H21J3, BIOS<br class="">
1.5.4 [FPGA Test BIOS] 10/002/2015<br class="">
[ +0.001309] RIP:
0010:amdgpu_device_rreg.part.24+0xa9/0xe0 [amdgpu]<br class="">
[ +0.001562] Code: e8 8c 7d 02 00 65 ff 0d 65 e0 7f
3f 75 ae 0f 1f<br class="">
44 00 00 eb a7 83 e2 02 75 09 f6 87 10 69 01 00 10 75
0d 4c 03 a3 a0<br class="">
09 00 00 <45> 8b 24 24 eb 8a 4c 8d b7 b0 6b 01
00 4c 89 f7 e8 a2 4c<br class="">
2e ca 85<br class="">
[ +0.002751] RSP: 0018:ffffb58fac313928 EFLAGS:
00010202<br class="">
[ +0.001388] RAX: ffffffffc09a4270 RBX:
ffff8b0c9c840000 RCX:<br class="">
00000000ffffffff<br class="">
[ +0.001402] RDX: 0000000000000000 RSI:
000000000001629a RDI:<br class="">
ffff8b0c9c840000<br class="">
[ +0.001418] RBP: ffffb58fac313948 R08:
0000000000000021 R09:<br class="">
0000000000000001<br class="">
[ +0.001421] R10: ffffb58fac313b30 R11:
ffffffff8c065b00 R12:<br class="">
0000000000058a68<br class="">
[ +0.001400] R13: 000000000001629a R14:
0000000000000000 R15:<br class="">
000000000001629a<br class="">
[ +0.001397] FS: 0000000000000000(0000)
GS:ffff8b4b7fa80000(0000)<br class="">
knlGS:0000000000000000<br class="">
[ +0.001411] CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033<br class="">
[ +0.001405] CR2: 0000000000058a68 CR3:
000000010a2c8001 CR4:<br class="">
00000000001706e0<br class="">
[ +0.001422] Call Trace:<br class="">
[ +0.001407] <TASK><br class="">
[ +0.001391] amdgpu_device_rreg+0x17/0x20 [amdgpu]<br class="">
[ +0.001614] amdgpu_cgs_read_register+0x14/0x20
[amdgpu]<br class="">
[ +0.001735]
phm_wait_for_register_unequal.part.1+0x58/0x90
[amdgpu]<br class="">
[ +0.001790] phm_wait_for_register_unequal+0x1a/0x30
[amdgpu]<br class="">
[ +0.001800] vega20_wait_for_response+0x28/0x80
[amdgpu]<br class="">
[ +0.001757]
vega20_send_msg_to_smc_with_parameter+0x21/0x110
[amdgpu]<br class="">
[ +0.001838]
smum_send_msg_to_smc_with_parameter+0xcd/0x100
[amdgpu]<br class="">
[ +0.001829] ? kvfree+0x1e/0x30<br class="">
[ +0.001462]
vega20_set_power_profile_mode+0x58/0x330 [amdgpu]<br class="">
[ +0.001868] ? kvfree+0x1e/0x30<br class="">
[ +0.001462] ? ttm_bo_release+0x261/0x370 [ttm]<br class="">
[ +0.001467] pp_dpm_switch_power_profile+0xc2/0x170
[amdgpu]<br class="">
[ +0.001863]
amdgpu_dpm_switch_power_profile+0x6b/0x90 [amdgpu]<br class="">
[ +0.001866]
amdgpu_amdkfd_set_compute_idle+0x1a/0x20 [amdgpu]<br class="">
[ +0.001784] kfd_dec_compute_active+0x2c/0x50
[amdgpu]<br class="">
[ +0.001744] process_termination_cpsch+0x2f9/0x3a0
[amdgpu]<br class="">
[ +0.001728]
kfd_process_dequeue_from_all_devices+0x49/0x70
[amdgpu]<br class="">
[ +0.001730] kfd_process_notifier_release+0x91/0xe0
[amdgpu]<br class="">
[ +0.001718] __mmu_notifier_release+0x77/0x1f0<br class="">
[ +0.001411] exit_mmap+0x1b5/0x200<br class="">
[ +0.001396] ? __switch_to+0x12d/0x3e0<br class="">
[ +0.001388] ? __switch_to_asm+0x36/0x70<br class="">
[ +0.001372] ? preempt_count_add+0x74/0xc0<br class="">
[ +0.001364] mmput+0x57/0x110<br class="">
[ +0.001349] do_exit+0x33d/0xc20<br class="">
[ +0.001337] ? _raw_spin_unlock+0x1a/0x30<br class="">
[ +0.001346] do_group_exit+0x43/0xa0<br class="">
[ +0.001341] get_signal+0x131/0x920<br class="">
[ +0.001295] arch_do_signal_or_restart+0xb1/0x870<br class="">
[ +0.001303] ? do_futex+0x125/0x190<br class="">
[ +0.001285] exit_to_user_mode_prepare+0xb1/0x1c0<br class="">
[ +0.001282] syscall_exit_to_user_mode+0x2a/0x40<br class="">
[ +0.001264] do_syscall_64+0x46/0xb0<br class="">
[ +0.001236]
entry_SYSCALL_64_after_hwframe+0x44/0xae<br class="">
[ +0.001219] RIP: 0033:0x7f6aff1d2ad3<br class="">
[ +0.001177] Code: Unable to access opcode bytes at
RIP 0x7f6aff1d2aa9.<br class="">
[ +0.001166] RSP: 002b:00007f6ab2029d20 EFLAGS:
00000246 ORIG_RAX:<br class="">
00000000000000ca<br class="">
[ +0.001170] RAX: fffffffffffffe00 RBX:
0000000004f542b0 RCX:<br class="">
00007f6aff1d2ad3<br class="">
[ +0.001168] RDX: 0000000000000000 RSI:
0000000000000080 RDI:<br class="">
0000000004f542d8<br class="">
[ +0.001162] RBP: 0000000004f542d4 R08:
0000000000000000 R09:<br class="">
0000000000000000<br class="">
[ +0.001152] R10: 0000000000000000 R11:
0000000000000246 R12:<br class="">
0000000004f542d8<br class="">
[ +0.001176] R13: 0000000000000000 R14:
0000000004f54288 R15:<br class="">
0000000000000000<br class="">
[ +0.001152] </TASK><br class="">
[ +0.001113] Modules linked in: veth amdgpu(E)
nf_conntrack_netlink<br class="">
nfnetlink xfrm_user xt_addrtype br_netfilter
xt_CHECKSUM<br class="">
iptable_mangle xt_MASQUERADE iptable_nat nf_nat
xt_conntrack<br class="">
nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT
nf_reject_ipv4<br class="">
xt_tcpudp bridge stp llc ebtable_filter ebtables
ip6table_filter<br class="">
ip6_tables iptable_filter overlay esp6_offload esp6
esp4_offload<br class="">
esp4 xfrm_algo intel_rapl_msr intel_rapl_common
sb_edac<br class="">
x86_pkg_temp_thermal intel_powerclamp
snd_hda_codec_hdmi<br class="">
snd_hda_intel ipmi_ssif snd_intel_dspcfg coretemp
snd_hda_codec<br class="">
kvm_intel snd_hda_core snd_hwdep snd_pcm snd_timer snd
kvm soundcore<br class="">
irqbypass ftdi_sio usbserial input_leds iTCO_wdt
iTCO_vendor_support<br class="">
joydev mei_me rapl lpc_ich intel_cstate mei ipmi_si
ipmi_devintf<br class="">
ipmi_msghandler mac_hid acpi_power_meter sch_fq_codel
ib_iser<br class="">
rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp
libiscsi<br class="">
scsi_transport_iscsi ip_tables x_tables autofs4 btrfs<br class="">
blake2b_generic zstd_compress raid10 raid456<br class="">
[ +0.000102] async_raid6_recov async_memcpy async_pq
async_xor<br class="">
async_tx xor raid6_pq libcrc32c raid1 raid0 multipath
linear<br class="">
iommu_v2 gpu_sched drm_ttm_helper mgag200 ttm
drm_shmem_helper<br class="">
drm_kms_helper syscopyarea sysfillrect sysimgblt
fb_sys_fops<br class="">
crct10dif_pclmul hid_generic crc32_pclmul
ghash_clmulni_intel usbhid<br class="">
uas aesni_intel crypto_simd igb ahci hid drm
usb_storage cryptd<br class="">
libahci dca megaraid_sas i2c_algo_bit wmi [last
unloaded: amdgpu]<br class="">
[ +0.016626] CR2: 0000000000058a68<br class="">
[ +0.001550] ---[ end trace ff90849fe0a8b3b4 ]---<br class="">
[ +0.024953] RIP:
0010:amdgpu_device_rreg.part.24+0xa9/0xe0 [amdgpu]<br class="">
[ +0.001814] Code: e8 8c 7d 02 00 65 ff 0d 65 e0 7f
3f 75 ae 0f 1f<br class="">
44 00 00 eb a7 83 e2 02 75 09 f6 87 10 69 01 00 10 75
0d 4c 03 a3 a0<br class="">
09 00 00 <45> 8b 24 24 eb 8a 4c 8d b7 b0 6b 01
00 4c 89 f7 e8 a2 4c<br class="">
2e ca 85<br class="">
[ +0.003255] RSP: 0018:ffffb58fac313928 EFLAGS:
00010202<br class="">
[ +0.001641] RAX: ffffffffc09a4270 RBX:
ffff8b0c9c840000 RCX:<br class="">
00000000ffffffff<br class="">
[ +0.001656] RDX: 0000000000000000 RSI:
000000000001629a RDI:<br class="">
ffff8b0c9c840000<br class="">
[ +0.001681] RBP: ffffb58fac313948 R08:
0000000000000021 R09:<br class="">
0000000000000001<br class="">
[ +0.001662] R10: ffffb58fac313b30 R11:
ffffffff8c065b00 R12:<br class="">
0000000000058a68<br class="">
[ +0.001650] R13: 000000000001629a R14:
0000000000000000 R15:<br class="">
000000000001629a<br class="">
[ +0.001648] FS: 0000000000000000(0000)
GS:ffff8b4b7fa80000(0000)<br class="">
knlGS:0000000000000000<br class="">
[ +0.001668] CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033<br class="">
[ +0.001673] CR2: 0000000000058a68 CR3:
000000010a2c8001 CR4:<br class="">
00000000001706e0<br class="">
[ +0.001740] Fixing recursive fault but reboot is
needed!<br class="">
<br class="">
<br class="">
<blockquote type="cite" class="">On Apr 21, 2022, at
2:41 AM, Andrey Grodzovsky<br class="">
<<a href="mailto:andrey.grodzovsky@amd.com" class="moz-txt-link-freetext" moz-do-not-send="true">andrey.grodzovsky@amd.com</a>>
wrote:<br class="">
<br class="">
I retested the hot plug tests at the commit I mentioned
below - looks<br class="">
ok. My ASIC is Navi 10; I also tested using Vega 10
and older<br class="">
Polaris ASICs (whatever I had at home at the time).
It's possible<br class="">
there are extra issues in ASICs like yours which I
didn't cover during<br class="">
tests.<br class="">
<br class="">
andrey@andrey-test:~/drm$ sudo
./build/tests/amdgpu/amdgpu_test -s 13<br class="">
/usr/local/share/libdrm/amdgpu.ids: No such file or
directory<br class="">
/usr/local/share/libdrm/amdgpu.ids: No such file or
directory<br class="">
/usr/local/share/libdrm/amdgpu.ids: No such file or
directory<br class="">
<br class="">
<br class="">
The ASIC NOT support UVD, suite disabled<br class="">
/usr/local/share/libdrm/amdgpu.ids: No such file or
directory<br class="">
<br class="">
<br class="">
The ASIC NOT support VCE, suite disabled<br class="">
/usr/local/share/libdrm/amdgpu.ids: No such file or
directory<br class="">
/usr/local/share/libdrm/amdgpu.ids: No such file or
directory<br class="">
/usr/local/share/libdrm/amdgpu.ids: No such file or
directory<br class="">
<br class="">
<br class="">
The ASIC NOT support UVD ENC, suite disabled.<br class="">
/usr/local/share/libdrm/amdgpu.ids: No such file or
directory<br class="">
/usr/local/share/libdrm/amdgpu.ids: No such file or
directory<br class="">
/usr/local/share/libdrm/amdgpu.ids: No such file or
directory<br class="">
/usr/local/share/libdrm/amdgpu.ids: No such file or
directory<br class="">
<br class="">
<br class="">
Don't support TMZ (trust memory zone), security
suite disabled<br class="">
/usr/local/share/libdrm/amdgpu.ids: No such file or
directory<br class="">
/usr/local/share/libdrm/amdgpu.ids: No such file or
directory<br class="">
Peer device is not opened or has ASIC not supported
by the suite,<br class="">
skip all Peer to Peer tests.<br class="">
<br class="">
<br class="">
CUnit - A unit testing framework for C - Version
2.1-3<br class="">
<a href="https://nam11.safelinks.protection.outlook.com/?url=http%3A%2F%2Fcunit.sourceforge.net%2F&data=05%7C01%7Candrey.grodzovsky%40amd.com%7C23750571b50a4c2e434508da32ff5720%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637878369526441445%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=kzNRa9d46sBwZCVhu9%2BEkK%2F3f7fyjAo%2BAADtgeoz2l8%3D&reserved=0" originalsrc="http://cunit.sourceforge.net/" shash="mz6Kzjf7NojqeE9BGVLrvEm3IyJe7NwKrHZoxg1rRxeFOTkcFC28UF09ES/2elRxC+ERNKHkdboZ4W5DbH9EHgOogBx8slEYJBRuLvkvHgddsx1Dp6ZmWcjLh8Wnq/56zpfAo1K0ihxSqsuFZ6G6ZtfXiggyJfwGpMRMoWAhcyo=" class="" moz-do-not-send="true">https://nam06.safelinks.protection.outlook.com/?url=http%3A%2F%2Fcunit.sourceforge.net%2F&data=05%7C01%7Cshuotaoxu%40microsoft.com%7C97faa63fd9a743a2982308da32c41ec4%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637878115188634502%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=Ae2GEM2LDQVGndNPKmUFvus5Z1frSIezgo%2BzQGF0Mbs%3D&reserved=0</a><br class="">
<br class="">
<br class="">
Suite: Hotunplug Tests<br class="">
Test: Unplug card and rescan the bus to plug it
back<br class="">
.../usr/local/share/libdrm/amdgpu.ids: No such file
or directory<br class="">
passed<br class="">
Test: Same as first test but with command
submission<br class="">
.../usr/local/share/libdrm/amdgpu.ids: No such file
or directory<br class="">
passed<br class="">
Test: Unplug with exported bo<br class="">
.../usr/local/share/libdrm/amdgpu.ids: No such file
or directory<br class="">
passed<br class="">
<br class="">
Run Summary: Type Total Ran Passed Failed Inactive<br class="">
suites 14 1 n/a 0 0<br class="">
tests 71 3 3 0 1<br class="">
asserts 21 21 21 0 n/a<br class="">
<br class="">
Elapsed time = 9.195 seconds<br class="">
<br class="">
<br class="">
Andrey<br class="">
<br class="">
On 2022-04-20 11:44, Andrey Grodzovsky wrote:<br class="">
<blockquote type="cite" class=""><br class="">
The only one on Radeon 7 I see is the same sysfs
crash we already<br class="">
fixed, so you can use the same fix. The MI200
issue I haven't seen<br class="">
yet, but I also haven't tested MI200, so I never saw
it before. I need<br class="">
to test when I get the time.<br class="">
<br class="">
So try that fix with Radeon 7 again to see if you
pass the tests<br class="">
(the warnings should all be minor issues).<br class="">
<br class="">
Andrey<br class="">
<br class="">
<br class="">
On 2022-04-20 05:24, Shuotao Xu wrote:<br class="">
<blockquote type="cite" class="">
<blockquote type="cite" class=""><br class="">
That's a problem. The latest working baseline I
tested and confirmed<br class="">
passing hotplug tests is this branch and<br class="">
commit <a href="https://gitlab.freedesktop.org/agd5f/linux/-/commit/86e12a53b73135806e101142e72f3f1c0e6fa8e6" class="" moz-do-not-send="true">https://gitlab.freedesktop.org/agd5f/linux/-/commit/86e12a53b73135806e101142e72f3f1c0e6fa8e6</a>, which<br class="">
is amd-staging-drm-next. 5.14 was the branch
where we upstreamed the<br class="">
hotplug code, but it had a lot of regressions
over time due to<br class="">
new changes (that's why I added the hotplug test
to try and catch<br class="">
them early). It would be best to run this
branch on MI100 so we<br class="">
have a clean baseline, and only after
confirming that this particular<br class="">
branch at this commit passes the libdrm tests
should you start<br class="">
adding the KFD-specific add-ons. Another option,
if you can't work<br class="">
with MI100 and this branch, is to try a
different ASIC that does<br class="">
work with this branch (if possible).<br class="">
<br class="">
Andrey<br class="">
<br class="">
</blockquote>
OK, I tried both this commit and the HEAD of
amd-staging-drm-next<br class="">
on two GPUs (MI100 and Radeon VII); neither
passed the hot-unplug<br class="">
libdrm test. I might be able to gain access to
MI200, but I<br class="">
suspect it would work.<br class="">
<br class="">
I copied the complete dmesgs as follows. I
highlighted the oopses<br class="">
for you.<br class="">
<br class="">
Radeon VII:</blockquote>
</blockquote>
</blockquote>
</blockquote>
</blockquote>
</blockquote>
</div>
</blockquote>
</div>
<br class="">
</blockquote>
</body>
</html>