<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>
  <body>
    <p><br>
    </p>
    <div class="moz-cite-prefix">On 2022-05-10 23:35, Shuotao Xu wrote:<br>
    </div>
    <blockquote type="cite" cite="mid:E51808D5-5E34-420C-9CBD-F2BAE26E45F5@microsoft.com">
      
      <br class="">
      <div><br class="">
        <blockquote type="cite" class="">
          <div class="">On May 11, 2022, at 4:31 AM, Felix Kuehling <<a href="mailto:felix.kuehling@amd.com" class="moz-txt-link-freetext" moz-do-not-send="true">felix.kuehling@amd.com</a>>
            wrote:</div>
          <br class="Apple-interchange-newline">
          <div class="">
            <br style="caret-color: rgb(0, 0, 0); font-family:
              Helvetica; font-size: 12px; font-style: normal;
              font-variant-caps: normal; font-weight: 400;
              letter-spacing: normal; text-align: start; text-indent:
              0px; text-transform: none; white-space: normal;
              word-spacing: 0px; -webkit-text-stroke-width: 0px;
              text-decoration: none;" class="">
            <span style="caret-color: rgb(0, 0, 0); font-family:
              Helvetica; font-size: 12px; font-style: normal;
              font-variant-caps: normal; font-weight: 400;
              letter-spacing: normal; text-align: start; text-indent:
              0px; text-transform: none; white-space: normal;
              word-spacing: 0px; -webkit-text-stroke-width: 0px;
              text-decoration: none; float: none; display: inline
              !important;" class="">On 2022-05-10 at 07:03,
              Shuotao Xu wrote:</span><br style="caret-color: rgb(0, 0, 0);
              font-family: Helvetica; font-size: 12px; font-style:
              normal; font-variant-caps: normal; font-weight: 400;
              letter-spacing: normal; text-align: start; text-indent:
              0px; text-transform: none; white-space: normal;
              word-spacing: 0px; -webkit-text-stroke-width: 0px;
              text-decoration: none;" class="">
            <blockquote type="cite" style="font-family: Helvetica;
              font-size: 12px; font-style: normal; font-variant-caps:
              normal; font-weight: 400; letter-spacing: normal; orphans:
              auto; text-align: start; text-indent: 0px; text-transform:
              none; white-space: normal; widows: auto; word-spacing:
              0px; -webkit-text-size-adjust: auto;
              -webkit-text-stroke-width: 0px; text-decoration: none;" class="">
              <br class="">
              <br class="">
              <blockquote type="cite" class="">On Apr 28, 2022, at 12:04
                AM, Andrey Grodzovsky<br class="">
                <<a href="mailto:andrey.grodzovsky@amd.com" class="moz-txt-link-freetext" moz-do-not-send="true">andrey.grodzovsky@amd.com</a>>
                wrote:<br class="">
                <br class="">
                On 2022-04-27 05:20, Shuotao Xu wrote:<br class="">
                <br class="">
                <blockquote type="cite" class="">Hi Andrey,<br class="">
                  <br class="">
                  Sorry that I did not have time to work on this for a
                  few days.<br class="">
                  <br class="">
                  I just tried the sysfs crash fix on Radeon VII and it
                  seems that it<br class="">
                  worked. It did not pass the last hotplug test, but my
                  version has 4<br class="">
                  tests instead of 3 in your case.<br class="">
                </blockquote>
                <br class="">
                <br class="">
                That's because the 4th one is only enabled when there are 2
                cards in the<br class="">
                system - to test DRI_PRIME export. I tested this time
                with only one card.<br class="">
                <br class="">
              </blockquote>
              Yes, I only had one Radeon VII in my system, so this 4th
              test should<br class="">
              have been skipped. I am ignoring this issue.<br class="">
              <br class="">
              <blockquote type="cite" class="">
                <blockquote type="cite" class=""><br class="">
                  <br class="">
                  Suite: Hotunplug Tests<br class="">
                  Test: Unplug card and rescan the bus to plug it back<br class="">
                  .../usr/local/share/libdrm/amdgpu.ids: No such file or
                  directory<br class="">
                  passed<br class="">
                  Test: Same as first test but with command submission<br class="">
                  .../usr/local/share/libdrm/amdgpu.ids: No such file or
                  directory<br class="">
                  passed<br class="">
                  Test: Unplug with exported bo<br class="">
                  .../usr/local/share/libdrm/amdgpu.ids: No such file or
                  directory<br class="">
                  passed<br class="">
                  Test: Unplug with exported fence<br class="">
                  .../usr/local/share/libdrm/amdgpu.ids: No such file or
                  directory<br class="">
                  amdgpu_device_initialize: amdgpu_get_auth (1) failed
                  (-1)<br class="">
                </blockquote>
                <br class="">
                <br class="">
                on the kernel side - the IOCTL returning this is
                drm_getclient -<br class="">
                maybe take a look at why it can't find the client? I
                didn't have such<br class="">
                an issue as far as I remember when testing.<br class="">
                <br class="">
                <br class="">
                <blockquote type="cite" class="">FAILED<br class="">
                  1. ../tests/amdgpu/hotunplug_tests.c:368 -
                  CU_ASSERT_EQUAL(r,0)<br class="">
                  2. ../tests/amdgpu/hotunplug_tests.c:411 -<br class="">
                  CU_ASSERT_EQUAL(amdgpu_cs_import_syncobj(device2,
                  shared_fd,<br class="">
                  &sync_obj_handle2),0)<br class="">
                  3. ../tests/amdgpu/hotunplug_tests.c:423 -<br class="">
                  CU_ASSERT_EQUAL(amdgpu_cs_syncobj_wait(device2,
                  &sync_obj_handle2,<br class="">
                  1, 100000000, 0, NULL),0)<br class="">
                  4. ../tests/amdgpu/hotunplug_tests.c:425 -<br class="">
                  CU_ASSERT_EQUAL(amdgpu_cs_destroy_syncobj(device2,
                  sync_obj_handle2),0)<br class="">
                  <br class="">
                  Run Summary: Type Total Ran Passed Failed Inactive<br class="">
                  suites 14 1 n/a 0 0<br class="">
                  tests 71 4 3 1 0<br class="">
                  asserts 39 39 35 4 n/a<br class="">
                  <br class="">
                  Elapsed time = 17.321 seconds<br class="">
                  <br class="">
                  For kfd compute, there is some problem which I did not
                  see in MI100<br class="">
                  after I killed the hung application after hot plugout.
                  I was using<br class="">
                  rocm5.0.2 driver for MI100 card, and not sure if it is
                  a regression<br class="">
                  from the newer driver.<br class="">
                  After pkill, one of the children of the user process would be
                  stuck in Zombie<br class="">
                  mode (Z), understandably because of the bug, and future
                  rocm<br class="">
                  applications after plug-back would be in uninterruptible
                  sleep mode (D)<br class="">
                  because they would not return from the syscall to kfd.<br class="">
                  <br class="">
                  However, the drm tests for amdgpu would run just fine
                  without issues<br class="">
                  after plug-back with dangling kfd state.<br class="">
                </blockquote>
                <br class="">
                <br class="">
                I am not clear on when the crash below happens. Is it
                related to what<br class="">
                you describe above?<br class="">
                <br class="">
                <br class="">
                <blockquote type="cite" class=""><br class="">
                  I don’t know if there is a quick fix for it. I was
                  thinking of adding<br class="">
                  drm_enter/drm_exit to amdgpu_device_rreg.<br class="">
                </blockquote>
                <br class="">
                <br class="">
                Try adding a drm_dev_enter/exit pair at the highest level
                of attempting<br class="">
                to access HW - in this case it's
                amdgpu_amdkfd_set_compute_idle. We<br class="">
                always try to avoid accessing any HW functions after
                the backing device<br class="">
                is gone.<br class="">
                <br class="">
                <br class="">
                <blockquote type="cite" class="">Also, my attempt to fix
                  the hotplug issue for kfd applications has been going on<br class="">
                  for a long time.<br class="">
                  I don’t know 1) if I would be able to get to MI100
                  (fixing Radeon<br class="">
                  VII would mean something but MI100 is more important
                  for us); 2)<br class="">
                  in what direction the patch for this issue will move
                  forward.<br class="">
                </blockquote>
                <br class="">
                <br class="">
                I will go to the office tomorrow to pick up the MI-100. With
                time and<br class="">
                priorities permitting, I will then try to test it
                and fix any<br class="">
                bugs such that it will pass all hot plug libdrm
                tests at the<br class="">
                tip of public amd-staging-drm-next<br class="">
                - <a href="https://gitlab.freedesktop.org/agd5f/linux" class="" moz-do-not-send="true">https://gitlab.freedesktop.org/agd5f/linux</a>.
                After that you can try<br class="">
                to continue working on ROCm enabling on top of that.<br class="">
                <br class="">
                For now I suggest you move on with Radeon VII as
                your development<br class="">
                ASIC and use the fix I mentioned above.<br class="">
                <br class="">
              </blockquote>
              I finally got some time to continue on kfd hotplug patch
              attempt.<br class="">
              The following patch seems to work for kfd hotplug on
              Radeon VII. After<br class="">
              hot plugout, the tf process exits because of a vm fault.<br class="">
              A new tf process runs without issues after plugback.<br class="">
              <br class="">
              It has the following fixes.<br class="">
              <br class="">
              1. ras sysfs regression;<br class="">
              2. skip setting compute idle after dev is unplugged,
              otherwise it will<br class="">
                 try to write the pci bar thus driver fault<br class="">
              3. stops the actual work of invalidating the memory map
              triggered by<br class="">
                 userptrs; (returning false will trigger a warning, so I
              returned true.<br class="">
                 Not sure if it is correct)<br class="">
              4. It sends exceptions to all the events/signals that a
              “zombie”<br class="">
                 process is waiting for. (Not sure if the
              hw_exception is<br class="">
                 worthwhile; it did not do anything in my case since
              there is no such<br class="">
                 event type associated with that process)<br class="">
              <br class="">
              Please take a look and let me know if it is acceptable.<br class="">
              <br class="">
              diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c<br class="">
              b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c<br class="">
              index 1f8161cd507f..2f7858692067 100644<br class="">
              --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c<br class="">
              +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c<br class="">
              @@ -33,6 +33,7 @@<br class="">
              #include <uapi/linux/kfd_ioctl.h><br class="">
              #include "amdgpu_ras.h"<br class="">
              #include "amdgpu_umc.h"<br class="">
              +#include <drm/drm_drv.h><br class="">
              <br class="">
              /* Total memory size in system memory and all GPU VRAM.
              Used to<br class="">
               * estimate worst case amount of memory to reserve for
              page tables<br class="">
              @@ -681,9 +682,10 @@ int amdgpu_amdkfd_submit_ib(struct
              amdgpu_device<br class="">
              *adev,<br class="">
              <br class="">
              void amdgpu_amdkfd_set_compute_idle(struct amdgpu_device
              *adev, bool<br class="">
              idle)<br class="">
              {<br class="">
              -       amdgpu_dpm_switch_power_profile(adev,<br class="">
              - PP_SMC_POWER_PROFILE_COMPUTE,<br class="">
              -                                       !idle);<br class="">
              +       if (!drm_dev_is_unplugged(adev_to_drm(adev)))<br class="">
              +               amdgpu_dpm_switch_power_profile(adev,<br class="">
              + PP_SMC_POWER_PROFILE_COMPUTE,<br class="">
              +                                               !idle);<br class="">
              }<br class="">
              <br class="">
              bool amdgpu_amdkfd_is_kfd_vmid(struct amdgpu_device *adev,
              u32 vmid)<br class="">
              diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c<br class="">
              b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c<br class="">
              index 4b153daf283d..fb4c9e55eace 100644<br class="">
              --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c<br class="">
              +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c<br class="">
              @@ -46,6 +46,7 @@<br class="">
              #include <linux/firmware.h><br class="">
              #include <linux/module.h><br class="">
              #include <drm/drm.h><br class="">
              +#include <drm/drm_drv.h><br class="">
              <br class="">
              #include "amdgpu.h"<br class="">
              #include "amdgpu_amdkfd.h"<br class="">
              @@ -104,6 +105,9 @@ static bool
              amdgpu_mn_invalidate_hsa(struct<br class="">
              mmu_interval_notifier *mni,<br class="">
                     struct amdgpu_bo *bo = container_of(mni, struct
              amdgpu_bo,<br class="">
              notifier);<br class="">
                     struct amdgpu_device *adev =
              amdgpu_ttm_adev(bo->tbo.bdev);<br class="">
              <br class="">
              +       if (drm_dev_is_unplugged(adev_to_drm(adev)))<br class="">
              +               return true;<br class="">
              +<br class="">
            </blockquote>
          </div>
        </blockquote>
        Label: Fix 3<br class="">
        <blockquote type="cite" class="">
          <div class="">
            <blockquote type="cite" style="font-family: Helvetica;
              font-size: 12px; font-style: normal; font-variant-caps:
              normal; font-weight: 400; letter-spacing: normal; orphans:
              auto; text-align: start; text-indent: 0px; text-transform:
              none; white-space: normal; widows: auto; word-spacing:
              0px; -webkit-text-size-adjust: auto;
              -webkit-text-stroke-width: 0px; text-decoration: none;" class="">
                     if (!mmu_notifier_range_blockable(range))<br class="">
                             return false;<br class="">
              <br class="">
              diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c<br class="">
              b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c<br class="">
              index cac56f830aed..fbbaaabf3a67 100644<br class="">
              --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c<br class="">
              +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c<br class="">
              @@ -1509,7 +1509,6 @@ static int amdgpu_ras_fs_fini(struct<br class="">
              amdgpu_device *adev)<br class="">
                             }<br class="">
                     }<br class="">
              <br class="">
              -       amdgpu_ras_sysfs_remove_all(adev);<br class="">
                     return 0;<br class="">
              }<br class="">
              /* ras fs end */<br class="">
              @@ -2557,8 +2556,6 @@ void
              amdgpu_ras_block_late_fini(struct<br class="">
              amdgpu_device *adev,<br class="">
                     if (!ras_block)<br class="">
                             return;<br class="">
              <br class="">
              -       amdgpu_ras_sysfs_remove(adev, ras_block);<br class="">
              -<br class="">
                     ras_obj = container_of(ras_block, struct<br class="">
              amdgpu_ras_block_object, ras_comm);<br class="">
                     if (ras_obj->ras_cb)<br class="">
                             amdgpu_ras_interrupt_remove_handler(adev,
              ras_block);<br class="">
              @@ -2659,6 +2656,7 @@ int amdgpu_ras_pre_fini(struct
              amdgpu_device *adev)<br class="">
                     /* Need disable ras on all IPs here before ip
              [hw/sw]fini */<br class="">
                     amdgpu_ras_disable_all_features(adev, 0);<br class="">
                     amdgpu_ras_recovery_fini(adev);<br class="">
              +       amdgpu_ras_sysfs_remove_all(adev);<br class="">
                     return 0;<br class="">
              }<br class="">
              <br class="">
              diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c<br class="">
              b/drivers/gpu/drm/amd/amdkfd/kfd_device.c<br class="">
              index f1a225a20719..4b789bec9670 100644<br class="">
              --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c<br class="">
              +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c<br class="">
              @@ -714,16 +714,37 @@ bool kfd_is_locked(void)<br class="">
              <br class="">
              void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)<br class="">
              {<br class="">
              +       struct kfd_process *p;<br class="">
              +       struct amdkfd_process_info *p_info;<br class="">
              +       unsigned int temp;<br class="">
              +<br class="">
                     if (!kfd->init_complete)<br class="">
                             return;<br class="">
              <br class="">
                     /* for runtime suspend, skip locking kfd */<br class="">
              -       if (!run_pm) {<br class="">
              +       if (!run_pm &&
              !drm_dev_is_unplugged(kfd->ddev)) {<br class="">
                             /* For first KFD device suspend all the KFD
              processes */<br class="">
                             if (atomic_inc_return(&kfd_locked) ==
              1)<br class="">
                                     kfd_suspend_all_processes();<br class="">
                     }<br class="">
              <br class="">
              +       if (drm_dev_is_unplugged(kfd->ddev)){<br class="">
              +               int idx =
              srcu_read_lock(&kfd_processes_srcu);<br class="">
              +               pr_debug("cancel restore_userptr_work\n");<br class="">
              +               hash_for_each_rcu(kfd_processes_table,
              temp, p,<br class="">
              kfd_processes) {<br class="">
              +                       if
              (kfd_process_gpuidx_from_gpuid(p, kfd->id)<br class="">
              >= 0) {<br class="">
              +                               p_info =
              p->kgd_process_info;<br class="">
              +                               pr_debug("cancel
              processes, pid = %d<br class="">
              for gpu_id = %d", pid_nr(p_info->pid), kfd->id);<br class="">
              +
              cancel_delayed_work_sync(&p_info->restore_userptr_work);<br class="">
            </blockquote>
            <br style="caret-color: rgb(0, 0, 0); font-family:
              Helvetica; font-size: 12px; font-style: normal;
              font-variant-caps: normal; font-weight: 400;
              letter-spacing: normal; text-align: start; text-indent:
              0px; text-transform: none; white-space: normal;
              word-spacing: 0px; -webkit-text-stroke-width: 0px;
              text-decoration: none;" class="">
            <span style="caret-color: rgb(0, 0, 0); font-family:
              Helvetica; font-size: 12px; font-style: normal;
              font-variant-caps: normal; font-weight: 400;
              letter-spacing: normal; text-align: start; text-indent:
              0px; text-transform: none; white-space: normal;
              word-spacing: 0px; -webkit-text-stroke-width: 0px;
              text-decoration: none; float: none; display: inline
              !important;" class="">Is this really necessary? If it is,
              there are probably other workers,</span><br style="caret-color: rgb(0, 0, 0); font-family: Helvetica;
              font-size: 12px; font-style: normal; font-variant-caps:
              normal; font-weight: 400; letter-spacing: normal;
              text-align: start; text-indent: 0px; text-transform: none;
              white-space: normal; word-spacing: 0px;
              -webkit-text-stroke-width: 0px; text-decoration: none;" class="">
            <span style="caret-color: rgb(0, 0, 0); font-family:
              Helvetica; font-size: 12px; font-style: normal;
              font-variant-caps: normal; font-weight: 400;
              letter-spacing: normal; text-align: start; text-indent:
              0px; text-transform: none; white-space: normal;
              word-spacing: 0px; -webkit-text-stroke-width: 0px;
              text-decoration: none; float: none; display: inline
              !important;" class="">e.g. related to our SVM code, that
              would need to be canceled as well.</span><br style="caret-color: rgb(0, 0, 0); font-family: Helvetica;
              font-size: 12px; font-style: normal; font-variant-caps:
              normal; font-weight: 400; letter-spacing: normal;
              text-align: start; text-indent: 0px; text-transform: none;
              white-space: normal; word-spacing: 0px;
              -webkit-text-stroke-width: 0px; text-decoration: none;" class="">
            <br style="caret-color: rgb(0, 0, 0); font-family:
              Helvetica; font-size: 12px; font-style: normal;
              font-variant-caps: normal; font-weight: 400;
              letter-spacing: normal; text-align: start; text-indent:
              0px; text-transform: none; white-space: normal;
              word-spacing: 0px; -webkit-text-stroke-width: 0px;
              text-decoration: none;" class="">
          </div>
        </blockquote>
        <div><br class="">
        </div>
        <div>I deleted this and it seems to be OK. It was previously
          added to suppress restore_userptr_work, which keeps updating
          PTEs.</div>
        <div>This is now made unnecessary by Fix 3. Please let us know if it is OK :)
          @Felix</div>
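        <div>For illustration only, here is a minimal user-space sketch of
          the guard shape used by fix 2 and Fix 3 above: check an
          "unplugged" flag first and skip the hardware-touching work when it
          is set. Everything named mock_* is hypothetical and is not a
          kernel API; the flag merely stands in for what
          drm_dev_is_unplugged() reports in the patch.</div>
<pre>
```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a device; not a kernel structure. */
struct mock_dev {
	bool unplugged;    /* stands in for what drm_dev_is_unplugged() reports */
	int hw_writes;     /* counts simulated hardware accesses */
	bool compute_busy; /* last power-profile state written to "hardware" */
};

/* Simulated hardware access; on a really unplugged device this would fault. */
void mock_switch_power_profile(struct mock_dev *dev, bool enable)
{
	dev->compute_busy = enable;
	dev->hw_writes++;
}

/* Same shape as fix 2: skip the hardware access once the device is gone. */
void mock_set_compute_idle(struct mock_dev *dev, bool idle)
{
	assert(dev != NULL);
	if (!dev->unplugged)
		mock_switch_power_profile(dev, !idle);
}

/* Same shape as Fix 3: report success without doing the invalidation work. */
bool mock_invalidate(struct mock_dev *dev, int *invalidations)
{
	assert(dev != NULL && invalidations != NULL);
	if (dev->unplugged)
		return true; /* early exit, nothing touched */
	(*invalidations)++;  /* the "real" work in this sketch */
	return true;
}
```
</pre>
        <div>With unplugged set to true, mock_set_compute_idle() leaves
          hw_writes unchanged and mock_invalidate() returns true without
          incrementing the counter, which is the behaviour the two hunks
          aim for.</div>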
        <div><br class="">
        </div>
        <blockquote type="cite" class="">
          <div class=""><br style="caret-color: rgb(0, 0, 0);
              font-family: Helvetica; font-size: 12px; font-style:
              normal; font-variant-caps: normal; font-weight: 400;
              letter-spacing: normal; text-align: start; text-indent:
              0px; text-transform: none; white-space: normal;
              word-spacing: 0px; -webkit-text-stroke-width: 0px;
              text-decoration: none;" class="">
            <blockquote type="cite" style="font-family: Helvetica;
              font-size: 12px; font-style: normal; font-variant-caps:
              normal; font-weight: 400; letter-spacing: normal; orphans:
              auto; text-align: start; text-indent: 0px; text-transform:
              none; white-space: normal; widows: auto; word-spacing:
              0px; -webkit-text-size-adjust: auto;
              -webkit-text-stroke-width: 0px; text-decoration: none;" class="">
              +<br class="">
              + /* send exception signals to the kfd<br class="">
              events waiting in user space */<br class="">
              + kfd_signal_hw_exception_event(p->pasid);<br class="">
            </blockquote>
            <br style="caret-color: rgb(0, 0, 0); font-family:
              Helvetica; font-size: 12px; font-style: normal;
              font-variant-caps: normal; font-weight: 400;
              letter-spacing: normal; text-align: start; text-indent:
              0px; text-transform: none; white-space: normal;
              word-spacing: 0px; -webkit-text-stroke-width: 0px;
              text-decoration: none;" class="">
            <span style="caret-color: rgb(0, 0, 0); font-family:
              Helvetica; font-size: 12px; font-style: normal;
              font-variant-caps: normal; font-weight: 400;
              letter-spacing: normal; text-align: start; text-indent:
              0px; text-transform: none; white-space: normal;
              word-spacing: 0px; -webkit-text-stroke-width: 0px;
              text-decoration: none; float: none; display: inline
              !important;" class="">This makes sense. It basically tells
              user mode that the application's</span><br style="caret-color: rgb(0, 0, 0); font-family: Helvetica;
              font-size: 12px; font-style: normal; font-variant-caps:
              normal; font-weight: 400; letter-spacing: normal;
              text-align: start; text-indent: 0px; text-transform: none;
              white-space: normal; word-spacing: 0px;
              -webkit-text-stroke-width: 0px; text-decoration: none;" class="">
            <span style="caret-color: rgb(0, 0, 0); font-family:
              Helvetica; font-size: 12px; font-style: normal;
              font-variant-caps: normal; font-weight: 400;
              letter-spacing: normal; text-align: start; text-indent:
              0px; text-transform: none; white-space: normal;
              word-spacing: 0px; -webkit-text-stroke-width: 0px;
              text-decoration: none; float: none; display: inline
              !important;" class="">GPU state is lost due to a RAS error
              or a GPU reset, or now a GPU</span><br style="caret-color:
              rgb(0, 0, 0); font-family: Helvetica; font-size: 12px;
              font-style: normal; font-variant-caps: normal;
              font-weight: 400; letter-spacing: normal; text-align:
              start; text-indent: 0px; text-transform: none;
              white-space: normal; word-spacing: 0px;
              -webkit-text-stroke-width: 0px; text-decoration: none;" class="">
            <span style="caret-color: rgb(0, 0, 0); font-family:
              Helvetica; font-size: 12px; font-style: normal;
              font-variant-caps: normal; font-weight: 400;
              letter-spacing: normal; text-align: start; text-indent:
              0px; text-transform: none; white-space: normal;
              word-spacing: 0px; -webkit-text-stroke-width: 0px;
              text-decoration: none; float: none; display: inline
              !important;" class="">hot-unplug.</span><br style="caret-color: rgb(0, 0, 0); font-family: Helvetica;
              font-size: 12px; font-style: normal; font-variant-caps:
              normal; font-weight: 400; letter-spacing: normal;
              text-align: start; text-indent: 0px; text-transform: none;
              white-space: normal; word-spacing: 0px;
              -webkit-text-stroke-width: 0px; text-decoration: none;" class="">
          </div>
        </blockquote>
        <div><br class="">
        </div>
        <div>The problem is that the driver cannot find an event whose
          type matches HW_EXCEPTION_TYPE, so with the default parameter
          value of send_sigterm = false it does **nothing**.</div>
        <div>If a “zombie” process (zombie in the sense that it no
          longer has a GPU dev) does not exit, kfd resources do not seem
          to be released properly, and a new kfd process cannot run
          after plug-back.</div>
        <div>(I still need to look closely at the rocr/hsakmt/kfd driver
          code to understand the reason. At the least, I am seeing that
          the kfd topology is not cleaned up until the process exits, so
          a “zombie” kfd node remains in the topology, which may or may
          not cause issues in hsakmt.)</div>
        <div>@Felix Do you have any suggestions or insight on this
          “zombie” process issue? @Andrey suggests it should be OK to
          have a “zombie” kfd process and a “zombie” kfd dev, and that a
          new kfd process should be able to run on the new kfd dev after
          plug-back.</div>
      </div>
    </blockquote>
    <p><br>
    </p>
    <p>My experience with the graphics stack, at least, showed that. In
      a setup with 2 GPUs, if I removed a secondary GPU that had a
      rendering process on it, I could plug the secondary GPU back in
      and start a new rendering process while the old zombie process was
      still present. It could be that in the KFD case there are some
      obstacles to this that need to be resolved.<br>
    </p>
    <p>Andrey</p>
    <p><br>
    </p>
    <blockquote type="cite" cite="mid:E51808D5-5E34-420C-9CBD-F2BAE26E45F5@microsoft.com">
      <div>
        <div><br class="">
        </div>
        <div>
          <div>May 11 09:52:07 NETSYS26 kernel: [52604.845400] amdgpu:
            cancel restore_userptr_work</div>
          <div>May 11 09:52:07 NETSYS26 kernel: [52604.845405] amdgpu:
            sending hw exception to pasid = 0x800</div>
          <div>May 11 09:52:07 NETSYS26 kernel: [52604.845414] kfd kfd:
            amdgpu: Process 25894 (pasid 0x8001) got unhandled exception</div>
          <div><br class="">
          </div>
        </div>
        <blockquote type="cite" class="">
          <div class=""><br style="caret-color: rgb(0, 0, 0);
              font-family: Helvetica; font-size: 12px; font-style:
              normal; font-variant-caps: normal; font-weight: 400;
              letter-spacing: normal; text-align: start; text-indent:
              0px; text-transform: none; white-space: normal;
              word-spacing: 0px; -webkit-text-stroke-width: 0px;
              text-decoration: none;" class="">
            <br style="caret-color: rgb(0, 0, 0); font-family:
              Helvetica; font-size: 12px; font-style: normal;
              font-variant-caps: normal; font-weight: 400;
              letter-spacing: normal; text-align: start; text-indent:
              0px; text-transform: none; white-space: normal;
              word-spacing: 0px; -webkit-text-stroke-width: 0px;
              text-decoration: none;" class="">
            <blockquote type="cite" style="font-family: Helvetica;
              font-size: 12px; font-style: normal; font-variant-caps:
              normal; font-weight: 400; letter-spacing: normal; orphans:
              auto; text-align: start; text-indent: 0px; text-transform:
              none; white-space: normal; widows: auto; word-spacing:
              0px; -webkit-text-size-adjust: auto;
              -webkit-text-stroke-width: 0px; text-decoration: none;" class="">
              + kfd_signal_vm_fault_event(kfd, p->pasid, NULL);<br class="">
            </blockquote>
            <br style="caret-color: rgb(0, 0, 0); font-family:
              Helvetica; font-size: 12px; font-style: normal;
              font-variant-caps: normal; font-weight: 400;
              letter-spacing: normal; text-align: start; text-indent:
              0px; text-transform: none; white-space: normal;
              word-spacing: 0px; -webkit-text-stroke-width: 0px;
              text-decoration: none;" class="">
            <span style="caret-color: rgb(0, 0, 0); font-family:
              Helvetica; font-size: 12px; font-style: normal;
              font-variant-caps: normal; font-weight: 400;
              letter-spacing: normal; text-align: start; text-indent:
              0px; text-transform: none; white-space: normal;
              word-spacing: 0px; -webkit-text-stroke-width: 0px;
              text-decoration: none; float: none; display: inline
              !important;" class="">This does not make sense. A VM fault
              indicates an access to a bad</span><br style="caret-color:
              rgb(0, 0, 0); font-family: Helvetica; font-size: 12px;
              font-style: normal; font-variant-caps: normal;
              font-weight: 400; letter-spacing: normal; text-align:
              start; text-indent: 0px; text-transform: none;
              white-space: normal; word-spacing: 0px;
              -webkit-text-stroke-width: 0px; text-decoration: none;" class="">
            <span style="caret-color: rgb(0, 0, 0); font-family:
              Helvetica; font-size: 12px; font-style: normal;
              font-variant-caps: normal; font-weight: 400;
              letter-spacing: normal; text-align: start; text-indent:
              0px; text-transform: none; white-space: normal;
              word-spacing: 0px; -webkit-text-stroke-width: 0px;
              text-decoration: none; float: none; display: inline
              !important;" class="">virtual address by the GPU. If a
              debugger is attached to the process, it</span><br style="caret-color: rgb(0, 0, 0); font-family: Helvetica;
              font-size: 12px; font-style: normal; font-variant-caps:
              normal; font-weight: 400; letter-spacing: normal;
              text-align: start; text-indent: 0px; text-transform: none;
              white-space: normal; word-spacing: 0px;
              -webkit-text-stroke-width: 0px; text-decoration: none;" class="">
            <span style="caret-color: rgb(0, 0, 0); font-family:
              Helvetica; font-size: 12px; font-style: normal;
              font-variant-caps: normal; font-weight: 400;
              letter-spacing: normal; text-align: start; text-indent:
              0px; text-transform: none; white-space: normal;
              word-spacing: 0px; -webkit-text-stroke-width: 0px;
              text-decoration: none; float: none; display: inline
              !important;" class="">notifies the debugger to investigate
              what went wrong. If the GPU is</span><br style="caret-color: rgb(0, 0, 0); font-family: Helvetica;
              font-size: 12px; font-style: normal; font-variant-caps:
              normal; font-weight: 400; letter-spacing: normal;
              text-align: start; text-indent: 0px; text-transform: none;
              white-space: normal; word-spacing: 0px;
              -webkit-text-stroke-width: 0px; text-decoration: none;" class="">
            <span style="caret-color: rgb(0, 0, 0); font-family:
              Helvetica; font-size: 12px; font-style: normal;
              font-variant-caps: normal; font-weight: 400;
              letter-spacing: normal; text-align: start; text-indent:
              0px; text-transform: none; white-space: normal;
              word-spacing: 0px; -webkit-text-stroke-width: 0px;
              text-decoration: none; float: none; display: inline
              !important;" class="">gone, that doesn't make any sense.
              There is no GPU that could have</span><br style="caret-color: rgb(0, 0, 0); font-family: Helvetica;
              font-size: 12px; font-style: normal; font-variant-caps:
              normal; font-weight: 400; letter-spacing: normal;
              text-align: start; text-indent: 0px; text-transform: none;
              white-space: normal; word-spacing: 0px;
              -webkit-text-stroke-width: 0px; text-decoration: none;" class="">
            <span style="caret-color: rgb(0, 0, 0); font-family:
              Helvetica; font-size: 12px; font-style: normal;
              font-variant-caps: normal; font-weight: 400;
              letter-spacing: normal; text-align: start; text-indent:
              0px; text-transform: none; white-space: normal;
              word-spacing: 0px; -webkit-text-stroke-width: 0px;
              text-decoration: none; float: none; display: inline
              !important;" class="">issued a bad memory request. And the
              debugger won't be happy either to</span><br style="caret-color: rgb(0, 0, 0); font-family: Helvetica;
              font-size: 12px; font-style: normal; font-variant-caps:
              normal; font-weight: 400; letter-spacing: normal;
              text-align: start; text-indent: 0px; text-transform: none;
              white-space: normal; word-spacing: 0px;
              -webkit-text-stroke-width: 0px; text-decoration: none;" class="">
            <span style="caret-color: rgb(0, 0, 0); font-family:
              Helvetica; font-size: 12px; font-style: normal;
              font-variant-caps: normal; font-weight: 400;
              letter-spacing: normal; text-align: start; text-indent:
              0px; text-transform: none; white-space: normal;
              word-spacing: 0px; -webkit-text-stroke-width: 0px;
              text-decoration: none; float: none; display: inline
              !important;" class="">find a VM fault from a GPU that
              doesn't exist any more.</span><br style="caret-color:
              rgb(0, 0, 0); font-family: Helvetica; font-size: 12px;
              font-style: normal; font-variant-caps: normal;
              font-weight: 400; letter-spacing: normal; text-align:
              start; text-indent: 0px; text-transform: none;
              white-space: normal; word-spacing: 0px;
              -webkit-text-stroke-width: 0px; text-decoration: none;" class="">
          </div>
        </blockquote>
        <div><br class="">
        </div>
        <div>OK understood.</div>
        <br class="">
        <blockquote type="cite" class="">
          <div class=""><br style="caret-color: rgb(0, 0, 0);
              font-family: Helvetica; font-size: 12px; font-style:
              normal; font-variant-caps: normal; font-weight: 400;
              letter-spacing: normal; text-align: start; text-indent:
              0px; text-transform: none; white-space: normal;
              word-spacing: 0px; -webkit-text-stroke-width: 0px;
              text-decoration: none;" class="">
            <span style="caret-color: rgb(0, 0, 0); font-family:
              Helvetica; font-size: 12px; font-style: normal;
              font-variant-caps: normal; font-weight: 400;
              letter-spacing: normal; text-align: start; text-indent:
              0px; text-transform: none; white-space: normal;
              word-spacing: 0px; -webkit-text-stroke-width: 0px;
              text-decoration: none; float: none; display: inline
              !important;" class="">If the HW-exception event doesn't
              terminate your process, we may need to</span><br style="caret-color: rgb(0, 0, 0); font-family: Helvetica;
              font-size: 12px; font-style: normal; font-variant-caps:
              normal; font-weight: 400; letter-spacing: normal;
              text-align: start; text-indent: 0px; text-transform: none;
              white-space: normal; word-spacing: 0px;
              -webkit-text-stroke-width: 0px; text-decoration: none;" class="">
            <span style="caret-color: rgb(0, 0, 0); font-family:
              Helvetica; font-size: 12px; font-style: normal;
              font-variant-caps: normal; font-weight: 400;
              letter-spacing: normal; text-align: start; text-indent:
              0px; text-transform: none; white-space: normal;
              word-spacing: 0px; -webkit-text-stroke-width: 0px;
              text-decoration: none; float: none; display: inline
              !important;" class="">look into how ROCr handles the
              HW-exception events.</span><br style="caret-color: rgb(0,
              0, 0); font-family: Helvetica; font-size: 12px;
              font-style: normal; font-variant-caps: normal;
              font-weight: 400; letter-spacing: normal; text-align:
              start; text-indent: 0px; text-transform: none;
              white-space: normal; word-spacing: 0px;
              -webkit-text-stroke-width: 0px; text-decoration: none;" class="">
            <br style="caret-color: rgb(0, 0, 0); font-family:
              Helvetica; font-size: 12px; font-style: normal;
              font-variant-caps: normal; font-weight: 400;
              letter-spacing: normal; text-align: start; text-indent:
              0px; text-transform: none; white-space: normal;
              word-spacing: 0px; -webkit-text-stroke-width: 0px;
              text-decoration: none;" class="">
            <br style="caret-color: rgb(0, 0, 0); font-family:
              Helvetica; font-size: 12px; font-style: normal;
              font-variant-caps: normal; font-weight: 400;
              letter-spacing: normal; text-align: start; text-indent:
              0px; text-transform: none; white-space: normal;
              word-spacing: 0px; -webkit-text-stroke-width: 0px;
              text-decoration: none;" class="">
            <blockquote type="cite" style="font-family: Helvetica;
              font-size: 12px; font-style: normal; font-variant-caps:
              normal; font-weight: 400; letter-spacing: normal; orphans:
              auto; text-align: start; text-indent: 0px; text-transform:
              none; white-space: normal; widows: auto; word-spacing:
              0px; -webkit-text-size-adjust: auto;
              -webkit-text-stroke-width: 0px; text-decoration: none;" class="">
              + }<br class="">
              + }<br class="">
              + srcu_read_unlock(&kfd_processes_srcu, idx);<br class="">
              + }<br class="">
              +<br class="">
              kfd->dqm->ops.stop(kfd->dqm);<br class="">
              kfd_iommu_suspend(kfd);<br class="">
            </blockquote>
            <br style="caret-color: rgb(0, 0, 0); font-family:
              Helvetica; font-size: 12px; font-style: normal;
              font-variant-caps: normal; font-weight: 400;
              letter-spacing: normal; text-align: start; text-indent:
              0px; text-transform: none; white-space: normal;
              word-spacing: 0px; -webkit-text-stroke-width: 0px;
              text-decoration: none;" class="">
            <span style="caret-color: rgb(0, 0, 0); font-family:
              Helvetica; font-size: 12px; font-style: normal;
              font-variant-caps: normal; font-weight: 400;
              letter-spacing: normal; text-align: start; text-indent:
              0px; text-transform: none; white-space: normal;
              word-spacing: 0px; -webkit-text-stroke-width: 0px;
              text-decoration: none; float: none; display: inline
              !important;" class="">Should DQM stop and IOMMU suspend
              still be executed? Or should the</span><br style="caret-color: rgb(0, 0, 0); font-family: Helvetica;
              font-size: 12px; font-style: normal; font-variant-caps:
              normal; font-weight: 400; letter-spacing: normal;
              text-align: start; text-indent: 0px; text-transform: none;
              white-space: normal; word-spacing: 0px;
              -webkit-text-stroke-width: 0px; text-decoration: none;" class="">
            <span style="caret-color: rgb(0, 0, 0); font-family:
              Helvetica; font-size: 12px; font-style: normal;
              font-variant-caps: normal; font-weight: 400;
              letter-spacing: normal; text-align: start; text-indent:
              0px; text-transform: none; white-space: normal;
              word-spacing: 0px; -webkit-text-stroke-width: 0px;
              text-decoration: none; float: none; display: inline
              !important;" class="">hot-unplug case short-circuit them?</span><br style="caret-color: rgb(0, 0, 0); font-family: Helvetica;
              font-size: 12px; font-style: normal; font-variant-caps:
              normal; font-weight: 400; letter-spacing: normal;
              text-align: start; text-indent: 0px; text-transform: none;
              white-space: normal; word-spacing: 0px;
              -webkit-text-stroke-width: 0px; text-decoration: none;" class="">
          </div>
        </blockquote>
        <div><br class="">
        </div>
        <div>I tried short-circuiting them, but that later caused a BUG
          related to GPU reset. I added the following, which solves the
          issue on plug-out.</div>
        <div>
          <div><br class="">
          </div>
          <div>diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
            b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c</div>
          <div>index b583026dc893..d78a06d74759 100644</div>
          <div>--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c</div>
          <div>+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c</div>
          <div>@@ -5317,7 +5317,8 @@ static void
            amdgpu_device_queue_gpu_recover_work(struct work_struct
            *work)</div>
          <div> {</div>
          <div>        struct amdgpu_recover_work_struct *recover_work =
            container_of(work, struct amdgpu_recover_work_struct, base);</div>
          <div><br class="">
          </div>
          <div>-       recover_work->ret =
            amdgpu_device_gpu_recover_imp(recover_work->adev,
            recover_work->job);</div>
          <div>+       if
            (!drm_dev_is_unplugged(adev_to_drm(recover_work->adev)))</div>
          <div>+               recover_work->ret =
            amdgpu_device_gpu_recover_imp(recover_work->adev,
            recover_work->job);</div>
          <div> }</div>
          <div> /*</div>
          <div>  * Serialize gpu recover into reset domain single
            threaded wq</div>
          <div><br class="">
          </div>
        </div>
        <div>However, after I killed the zombie process, the driver
          failed to evict the queues of the process.</div>
        <div><br class="">
        </div>
        <div>
          <div>[  +0.000002] amdgpu: writing 263 to doorbell address
            00000000c86e63f2</div>
          <div>[  +9.002503] amdgpu: qcm fence wait loop timeout expired</div>
          <div>[  +0.001364] amdgpu: The cp might be in an unrecoverable
            state due to an unsuccessful queues preemption</div>
          <div>[  +0.001343] amdgpu: Failed to evict process queues</div>
          <div>[  +0.001355] amdgpu: Failed to evict queues of pasid
            0x8001</div>
          <div class=""><br class="">
          </div>
        </div>
        <div><br class="">
        </div>
        <div>This causes a driver BUG, triggered by a new kfd process
          after plug-back. I am pasting the errors from dmesg after
          plug-back below.</div>
        <div class=""><br class="">
        </div>
        <div><br class="">
        </div>
        <div><br class="">
        </div>
        <div>
          <div>May 11 10:25:16 NETSYS26 kernel: [  688.445332] amdgpu:
            Evicting PASID 0x8001 queues</div>
          <div>May 11 10:25:16 NETSYS26 kernel: [  688.445359] BUG:
            unable to handle page fault for address: 000000020000006e</div>
          <div>May 11 10:25:16 NETSYS26 kernel: [  688.447516] #PF:
            supervisor read access in kernel mode</div>
          <div>May 11 10:25:16 NETSYS26 kernel: [  688.449627] #PF:
            error_code(0x0000) - not-present page</div>
          <div>May 11 10:25:16 NETSYS26 kernel: [  688.451661] PGD
            80000020892a8067 P4D 80000020892a8067 PUD 0</div>
          <div>May 11 10:25:16 NETSYS26 kernel: [  688.453741] Oops:
            0000 [#1] PREEMPT SMP PTI</div>
          <div>May 11 10:25:16 NETSYS26 kernel: [  688.455904] CPU: 25
            PID: 9236 Comm: tf_cnn_benchmar Tainted: G        W  OE    
            5.16.0+ #3</div>
          <div>May 11 10:25:16 NETSYS26 kernel: [  688.457406] amdgpu
            0000:05:00.0: amdgpu: GPU reset begin!</div>
          <div>May 11 10:25:16 NETSYS26 kernel: [  688.457798] Hardware
            name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 [FPGA Test
            BIOS] 10/002/2015</div>
          <div>May 11 10:25:16 NETSYS26 kernel: [  688.461458] RIP:
            0010:evict_process_queues_cpsch+0x99/0x1b0 [amdgpu]</div>
          <div>May 11 10:25:16 NETSYS26 kernel: [  688.465238] Code: bd
            13 8a dd 85 c0 0f 85 13 01 00 00 49 8b 5f 10 4d 8d 77 10 49
            39 de 75 11 e9 8d 00 00 00 48 8b 1b 4c 39 f3 0f 84 81 00 00
            00 <80> 7b 6e 00 c6 43 6d 01 74 ea c6 43 6e 00 41 83
            ac 24 70 01 00 00</div>
          <div>May 11 10:25:16 NETSYS26 kernel: [  688.470516] RSP:
            0018:ffffb2674c8afbf0 EFLAGS: 00010203</div>
          <div>May 11 10:25:16 NETSYS26 kernel: [  688.473255] RAX:
            ffff91c65cca3800 RBX: 0000000200000000 RCX: 0000000000000001</div>
          <div>May 11 10:25:16 NETSYS26 kernel: [  688.475691] RDX:
            0000000000000000 RSI: ffffffff9fb712d9 RDI: 00000000ffffffff</div>
          <div>May 11 10:25:16 NETSYS26 kernel: [  688.478564] RBP:
            ffffb2674c8afc20 R08: 0000000000000000 R09: 000000000006ba18</div>
          <div>May 11 10:25:16 NETSYS26 kernel: [  688.481409] R10:
            00007fe5a0000000 R11: ffffb2674c8af918 R12: ffff91c66d6f5800</div>
          <div>May 11 10:25:16 NETSYS26 kernel: [  688.484254] R13:
            ffff91c66d6f5938 R14: ffff91e5c71ac820 R15: ffff91e5c71ac810</div>
          <div>May 11 10:25:16 NETSYS26 kernel: [  688.487184] FS:
             00007fe62124a700(0000) GS:ffff92053fd00000(0000)
            knlGS:0000000000000000</div>
          <div>May 11 10:25:16 NETSYS26 kernel: [  688.490308] CS:  0010
            DS: 0000 ES: 0000 CR0: 0000000080050033</div>
          <div>May 11 10:25:16 NETSYS26 kernel: [  688.493122] CR2:
            000000020000006e CR3: 0000002095284004 CR4: 00000000001706e0</div>
          <div>May 11 10:25:16 NETSYS26 kernel: [  688.496142] Call
            Trace:</div>
          <div>May 11 10:25:16 NETSYS26 kernel: [  688.499199]
             <TASK></div>
          <div>May 11 10:25:16 NETSYS26 kernel: [  688.502261]
             kfd_process_evict_queues+0x43/0xf0 [amdgpu]</div>
          <div>May 11 10:25:16 NETSYS26 kernel: [  688.506378]
             kgd2kfd_quiesce_mm+0x2a/0x60 [amdgpu]</div>
          <div>May 11 10:25:16 NETSYS26 kernel: [  688.510539]
             amdgpu_amdkfd_evict_userptr+0x46/0x80 [amdgpu]</div>
          <div>May 11 10:25:16 NETSYS26 kernel: [  688.514110]
             amdgpu_mn_invalidate_hsa+0x9c/0xb0 [amdgpu]</div>
          <div>May 11 10:25:16 NETSYS26 kernel: [  688.518247]
             __mmu_notifier_invalidate_range_start+0x136/0x1e0</div>
          <div>May 11 10:25:16 NETSYS26 kernel: [  688.521252]
             change_protection+0x41d/0xcd0</div>
          <div>May 11 10:25:16 NETSYS26 kernel: [  688.524310]
             change_prot_numa+0x19/0x30</div>
          <div>May 11 10:25:16 NETSYS26 kernel: [  688.527366]
             task_numa_work+0x1ca/0x330</div>
          <div>May 11 10:25:16 NETSYS26 kernel: [  688.530157]
             task_work_run+0x6c/0xa0</div>
          <div>May 11 10:25:16 NETSYS26 kernel: [  688.533124]
             exit_to_user_mode_prepare+0x1af/0x1c0</div>
          <div>May 11 10:25:16 NETSYS26 kernel: [  688.536058]
             syscall_exit_to_user_mode+0x2a/0x40</div>
          <div>May 11 10:25:16 NETSYS26 kernel: [  688.538989]
             do_syscall_64+0x46/0xb0</div>
          <div>May 11 10:25:16 NETSYS26 kernel: [  688.541830]
             entry_SYSCALL_64_after_hwframe+0x44/0xae</div>
          <div>May 11 10:25:16 NETSYS26 kernel: [  688.544701] RIP:
            0033:0x7fe6585ec317</div>
          <div>May 11 10:25:16 NETSYS26 kernel: [  688.547297] Code: b3
            66 90 48 8b 05 71 4b 2d 00 64 c7 00 26 00 00 00 48 c7 c0 ff
            ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f
            05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 41 4b 2d 00
            f7 d8 64 89 01 48</div>
          <div>May 11 10:25:16 NETSYS26 kernel: [  688.553183] RSP:
            002b:00007fe6212494c8 EFLAGS: 00000246 ORIG_RAX:
            0000000000000010</div>
          <div>May 11 10:25:16 NETSYS26 kernel: [  688.556105] RAX:
            ffffffffffffffc2 RBX: 0000000000000000 RCX: 00007fe6585ec317</div>
          <div>May 11 10:25:16 NETSYS26 kernel: [  688.558970] RDX:
            00007fe621249540 RSI: 00000000c0584b02 RDI: 0000000000000003</div>
          <div>May 11 10:25:16 NETSYS26 kernel: [  688.561950] RBP:
            00007fe621249540 R08: 0000000000000000 R09: 0000000000040000</div>
          <div>May 11 10:25:16 NETSYS26 kernel: [  688.564563] R10:
            00007fe617480000 R11: 0000000000000246 R12: 00000000c0584b02</div>
          <div>May 11 10:25:16 NETSYS26 kernel: [  688.567494] R13:
            0000000000000003 R14: 0000000000000064 R15: 00007fe621249920</div>
          <div>May 11 10:25:16 NETSYS26 kernel: [  688.570470]
             </TASK></div>
          <div>May 11 10:25:16 NETSYS26 kernel: [  688.573380] Modules
            linked in: amdgpu(OE) veth nf_conntrack_netlink nfnetlink
            xfrm_user xt_addrtype br_netfilter xt_CHECKSUM
            iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack
            nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT
            nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter
            ebtables ip6table_filter ip6_tables iptable_filter overlay
            esp6_offload esp6 esp4_offload esp4 xfrm_algo intel_rapl_msr
            intel_rapl_common sb_edac snd_hda_codec_hdmi
            x86_pkg_temp_thermal snd_hda_intel intel_powerclamp
            snd_intel_dspcfg ipmi_ssif coretemp snd_hda_codec kvm_intel
            snd_hda_core snd_hwdep kvm snd_pcm snd_timer snd soundcore
            ftdi_sio irqbypass rapl intel_cstate usbserial joydev mei_me
            input_leds mei iTCO_wdt iTCO_vendor_support lpc_ich ipmi_si
            ipmi_devintf mac_hid acpi_power_meter ipmi_msghandler
            sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp
            libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables
            x_tables autofs4 btrfs blake2b_generic zstd_compress raid10
            raid456</div>
          <div>May 11 10:25:16 NETSYS26 kernel: [  688.573543]
             async_raid6_recov async_memcpy async_pq async_xor async_tx
            xor raid6_pq libcrc32c raid1 raid0 multipath linear iommu_v2
            gpu_sched drm_ttm_helper mgag200 ttm drm_shmem_helper
            drm_kms_helper syscopyarea hid_generic crct10dif_pclmul
            crc32_pclmul sysfillrect ghash_clmulni_intel sysimgblt
            fb_sys_fops uas usbhid aesni_intel crypto_simd igb ahci hid
            drm usb_storage cryptd libahci dca megaraid_sas i2c_algo_bit
            wmi [last unloaded: amdgpu]</div>
          <div>May 11 10:25:16 NETSYS26 kernel: [  688.611083] CR2:
            000000020000006e</div>
          <div>May 11 10:25:16 NETSYS26 kernel: [  688.614454] ---[ end
            trace 349cf28efb6268bc ]---</div>
          <div><br class="">
          </div>
          <div>Looking forward to the comments.</div>
          <div><br class="">
          </div>
          <div>Regards,</div>
          <div>Shuotao</div>
          <div><br class="">
          </div>
        </div>
        <blockquote type="cite" class="">
          <div class=""><br style="caret-color: rgb(0, 0, 0);
              font-family: Helvetica; font-size: 12px; font-style:
              normal; font-variant-caps: normal; font-weight: 400;
              letter-spacing: normal; text-align: start; text-indent:
              0px; text-transform: none; white-space: normal;
              word-spacing: 0px; -webkit-text-stroke-width: 0px;
              text-decoration: none;" class="">
            <span style="caret-color: rgb(0, 0, 0); font-family:
              Helvetica; font-size: 12px; font-style: normal;
              font-variant-caps: normal; font-weight: 400;
              letter-spacing: normal; text-align: start; text-indent:
              0px; text-transform: none; white-space: normal;
              word-spacing: 0px; -webkit-text-stroke-width: 0px;
              text-decoration: none; float: none; display: inline
              !important;" class="">Regards,</span><br style="caret-color: rgb(0, 0, 0); font-family: Helvetica;
              font-size: 12px; font-style: normal; font-variant-caps:
              normal; font-weight: 400; letter-spacing: normal;
              text-align: start; text-indent: 0px; text-transform: none;
              white-space: normal; word-spacing: 0px;
              -webkit-text-stroke-width: 0px; text-decoration: none;" class="">
            <span style="caret-color: rgb(0, 0, 0); font-family:
              Helvetica; font-size: 12px; font-style: normal;
              font-variant-caps: normal; font-weight: 400;
              letter-spacing: normal; text-align: start; text-indent:
              0px; text-transform: none; white-space: normal;
              word-spacing: 0px; -webkit-text-stroke-width: 0px;
              text-decoration: none; float: none; display: inline
              !important;" class="">Felix</span><br style="caret-color:
              rgb(0, 0, 0); font-family: Helvetica; font-size: 12px;
              font-style: normal; font-variant-caps: normal;
              font-weight: 400; letter-spacing: normal; text-align:
              start; text-indent: 0px; text-transform: none;
              white-space: normal; word-spacing: 0px;
              -webkit-text-stroke-width: 0px; text-decoration: none;" class="">
            <br style="caret-color: rgb(0, 0, 0); font-family:
              Helvetica; font-size: 12px; font-style: normal;
              font-variant-caps: normal; font-weight: 400;
              letter-spacing: normal; text-align: start; text-indent:
              0px; text-transform: none; white-space: normal;
              word-spacing: 0px; -webkit-text-stroke-width: 0px;
              text-decoration: none;" class="">
            <br style="caret-color: rgb(0, 0, 0); font-family:
              Helvetica; font-size: 12px; font-style: normal;
              font-variant-caps: normal; font-weight: 400;
              letter-spacing: normal; text-align: start; text-indent:
              0px; text-transform: none; white-space: normal;
              word-spacing: 0px; -webkit-text-stroke-width: 0px;
              text-decoration: none;" class="">
            <blockquote type="cite" style="font-family: Helvetica;
              font-size: 12px; font-style: normal; font-variant-caps:
              normal; font-weight: 400; letter-spacing: normal; orphans:
              auto; text-align: start; text-indent: 0px; text-transform:
              none; white-space: normal; widows: auto; word-spacing:
              0px; -webkit-text-size-adjust: auto;
              -webkit-text-stroke-width: 0px; text-decoration: none;" class="">
              }<br class="">
              <br class="">
              Regards,<br class="">
              Shuotao<br class="">
              <blockquote type="cite" class=""><br class="">
                Andrey<br class="">
                <br class="">
                <br class="">
                <blockquote type="cite" class=""><br class="">
                  Regards,<br class="">
                  Shuotao<br class="">
                  <br class="">
                  [  +0.001645] BUG: unable to handle page fault for
                  address:<br class="">
                  0000000000058a68<br class="">
                  [  +0.001298] #PF: supervisor read access in kernel
                  mode<br class="">
                  [  +0.001252] #PF: error_code(0x0000) - not-present
                  page<br class="">
                  [  +0.001248] PGD 8000000115806067 P4D
                  8000000115806067 PUD<br class="">
                  109b2d067 PMD 0<br class="">
                  [  +0.001270] Oops: 0000 [#1] PREEMPT SMP PTI<br class="">
                  [  +0.001256] CPU: 5 PID: 13818 Comm: tf_cnn_benchmar
                  Tainted: G<br class="">
                    W   E     5.16.0+ #3<br class="">
                  [  +0.001290] Hardware name: Dell Inc. PowerEdge
                  R730/0H21J3, BIOS<br class="">
                  1.5.4 [FPGA Test BIOS] 10/002/2015<br class="">
                  [  +0.001309] RIP:
                  0010:amdgpu_device_rreg.part.24+0xa9/0xe0 [amdgpu]<br class="">
                  [  +0.001562] Code: e8 8c 7d 02 00 65 ff 0d 65 e0 7f
                  3f 75 ae 0f 1f<br class="">
                  44 00 00 eb a7 83 e2 02 75 09 f6 87 10 69 01 00 10 75
                  0d 4c 03 a3 a0<br class="">
                  09 00 00 &lt;45&gt; 8b 24 24 eb 8a 4c 8d b7 b0 6b 01
                  00 4c 89 f7 e8 a2 4c<br class="">
                  2e ca 85<br class="">
                  [  +0.002751] RSP: 0018:ffffb58fac313928 EFLAGS:
                  00010202<br class="">
                  [  +0.001388] RAX: ffffffffc09a4270 RBX:
                  ffff8b0c9c840000 RCX:<br class="">
                  00000000ffffffff<br class="">
                  [  +0.001402] RDX: 0000000000000000 RSI:
                  000000000001629a RDI:<br class="">
                  ffff8b0c9c840000<br class="">
                  [  +0.001418] RBP: ffffb58fac313948 R08:
                  0000000000000021 R09:<br class="">
                  0000000000000001<br class="">
                  [  +0.001421] R10: ffffb58fac313b30 R11:
                  ffffffff8c065b00 R12:<br class="">
                  0000000000058a68<br class="">
                  [  +0.001400] R13: 000000000001629a R14:
                  0000000000000000 R15:<br class="">
                  000000000001629a<br class="">
                  [  +0.001397] FS:  0000000000000000(0000)
                  GS:ffff8b4b7fa80000(0000)<br class="">
                  knlGS:0000000000000000<br class="">
                  [  +0.001411] CS:  0010 DS: 0000 ES: 0000 CR0:
                  0000000080050033<br class="">
                  [  +0.001405] CR2: 0000000000058a68 CR3:
                  000000010a2c8001 CR4:<br class="">
                  00000000001706e0<br class="">
                  [  +0.001422] Call Trace:<br class="">
                  [  +0.001407]  &lt;TASK&gt;<br class="">
                  [  +0.001391]  amdgpu_device_rreg+0x17/0x20 [amdgpu]<br class="">
                  [  +0.001614]  amdgpu_cgs_read_register+0x14/0x20
                  [amdgpu]<br class="">
                  [  +0.001735]
                   phm_wait_for_register_unequal.part.1+0x58/0x90
                  [amdgpu]<br class="">
                  [  +0.001790]  phm_wait_for_register_unequal+0x1a/0x30
                  [amdgpu]<br class="">
                  [  +0.001800]  vega20_wait_for_response+0x28/0x80
                  [amdgpu]<br class="">
                  [  +0.001757]
                   vega20_send_msg_to_smc_with_parameter+0x21/0x110
                  [amdgpu]<br class="">
                  [  +0.001838]
                   smum_send_msg_to_smc_with_parameter+0xcd/0x100
                  [amdgpu]<br class="">
                  [  +0.001829]  ? kvfree+0x1e/0x30<br class="">
                  [  +0.001462]
                   vega20_set_power_profile_mode+0x58/0x330 [amdgpu]<br class="">
                  [  +0.001868]  ? kvfree+0x1e/0x30<br class="">
                  [  +0.001462]  ? ttm_bo_release+0x261/0x370 [ttm]<br class="">
                  [  +0.001467]  pp_dpm_switch_power_profile+0xc2/0x170
                  [amdgpu]<br class="">
                  [  +0.001863]
                   amdgpu_dpm_switch_power_profile+0x6b/0x90 [amdgpu]<br class="">
                  [  +0.001866]
                   amdgpu_amdkfd_set_compute_idle+0x1a/0x20 [amdgpu]<br class="">
                  [  +0.001784]  kfd_dec_compute_active+0x2c/0x50
                  [amdgpu]<br class="">
                  [  +0.001744]  process_termination_cpsch+0x2f9/0x3a0
                  [amdgpu]<br class="">
                  [  +0.001728]
                   kfd_process_dequeue_from_all_devices+0x49/0x70
                  [amdgpu]<br class="">
                  [  +0.001730]  kfd_process_notifier_release+0x91/0xe0
                  [amdgpu]<br class="">
                  [  +0.001718]  __mmu_notifier_release+0x77/0x1f0<br class="">
                  [  +0.001411]  exit_mmap+0x1b5/0x200<br class="">
                  [  +0.001396]  ? __switch_to+0x12d/0x3e0<br class="">
                  [  +0.001388]  ? __switch_to_asm+0x36/0x70<br class="">
                  [  +0.001372]  ? preempt_count_add+0x74/0xc0<br class="">
                  [  +0.001364]  mmput+0x57/0x110<br class="">
                  [  +0.001349]  do_exit+0x33d/0xc20<br class="">
                  [  +0.001337]  ? _raw_spin_unlock+0x1a/0x30<br class="">
                  [  +0.001346]  do_group_exit+0x43/0xa0<br class="">
                  [  +0.001341]  get_signal+0x131/0x920<br class="">
                  [  +0.001295]  arch_do_signal_or_restart+0xb1/0x870<br class="">
                  [  +0.001303]  ? do_futex+0x125/0x190<br class="">
                  [  +0.001285]  exit_to_user_mode_prepare+0xb1/0x1c0<br class="">
                  [  +0.001282]  syscall_exit_to_user_mode+0x2a/0x40<br class="">
                  [  +0.001264]  do_syscall_64+0x46/0xb0<br class="">
                  [  +0.001236]
                   entry_SYSCALL_64_after_hwframe+0x44/0xae<br class="">
                  [  +0.001219] RIP: 0033:0x7f6aff1d2ad3<br class="">
                  [  +0.001177] Code: Unable to access opcode bytes at
                  RIP 0x7f6aff1d2aa9.<br class="">
                  [  +0.001166] RSP: 002b:00007f6ab2029d20 EFLAGS:
                  00000246 ORIG_RAX:<br class="">
                  00000000000000ca<br class="">
                  [  +0.001170] RAX: fffffffffffffe00 RBX:
                  0000000004f542b0 RCX:<br class="">
                  00007f6aff1d2ad3<br class="">
                  [  +0.001168] RDX: 0000000000000000 RSI:
                  0000000000000080 RDI:<br class="">
                  0000000004f542d8<br class="">
                  [  +0.001162] RBP: 0000000004f542d4 R08:
                  0000000000000000 R09:<br class="">
                  0000000000000000<br class="">
                  [  +0.001152] R10: 0000000000000000 R11:
                  0000000000000246 R12:<br class="">
                  0000000004f542d8<br class="">
                  [  +0.001176] R13: 0000000000000000 R14:
                  0000000004f54288 R15:<br class="">
                  0000000000000000<br class="">
                  [  +0.001152]  &lt;/TASK&gt;<br class="">
                  [  +0.001113] Modules linked in: veth amdgpu(E)
                  nf_conntrack_netlink<br class="">
                  nfnetlink xfrm_user xt_addrtype br_netfilter
                  xt_CHECKSUM<br class="">
                  iptable_mangle xt_MASQUERADE iptable_nat nf_nat
                  xt_conntrack<br class="">
                  nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT
                  nf_reject_ipv4<br class="">
                  xt_tcpudp bridge stp llc ebtable_filter ebtables
                  ip6table_filter<br class="">
                  ip6_tables iptable_filter overlay esp6_offload esp6
                  esp4_offload<br class="">
                  esp4 xfrm_algo intel_rapl_msr intel_rapl_common
                  sb_edac<br class="">
                  x86_pkg_temp_thermal intel_powerclamp
                  snd_hda_codec_hdmi<br class="">
                  snd_hda_intel ipmi_ssif snd_intel_dspcfg coretemp
                  snd_hda_codec<br class="">
                  kvm_intel snd_hda_core snd_hwdep snd_pcm snd_timer snd
                  kvm soundcore<br class="">
                  irqbypass ftdi_sio usbserial input_leds iTCO_wdt
                  iTCO_vendor_support<br class="">
                  joydev mei_me rapl lpc_ich intel_cstate mei ipmi_si
                  ipmi_devintf<br class="">
                  ipmi_msghandler mac_hid acpi_power_meter sch_fq_codel
                  ib_iser<br class="">
                  rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp
                  libiscsi<br class="">
                  scsi_transport_iscsi ip_tables x_tables autofs4 btrfs<br class="">
                  blake2b_generic zstd_compress raid10 raid456<br class="">
                  [  +0.000102]  async_raid6_recov async_memcpy async_pq
                  async_xor<br class="">
                  async_tx xor raid6_pq libcrc32c raid1 raid0 multipath
                  linear<br class="">
                  iommu_v2 gpu_sched drm_ttm_helper mgag200 ttm
                  drm_shmem_helper<br class="">
                  drm_kms_helper syscopyarea sysfillrect sysimgblt
                  fb_sys_fops<br class="">
                  crct10dif_pclmul hid_generic crc32_pclmul
                  ghash_clmulni_intel usbhid<br class="">
                  uas aesni_intel crypto_simd igb ahci hid drm
                  usb_storage cryptd<br class="">
                  libahci dca megaraid_sas i2c_algo_bit wmi [last
                  unloaded: amdgpu]<br class="">
                  [  +0.016626] CR2: 0000000000058a68<br class="">
                  [  +0.001550] ---[ end trace ff90849fe0a8b3b4 ]---<br class="">
                  [  +0.024953] RIP:
                  0010:amdgpu_device_rreg.part.24+0xa9/0xe0 [amdgpu]<br class="">
                  [  +0.001814] Code: e8 8c 7d 02 00 65 ff 0d 65 e0 7f
                  3f 75 ae 0f 1f<br class="">
                  44 00 00 eb a7 83 e2 02 75 09 f6 87 10 69 01 00 10 75
                  0d 4c 03 a3 a0<br class="">
                  09 00 00 &lt;45&gt; 8b 24 24 eb 8a 4c 8d b7 b0 6b 01
                  00 4c 89 f7 e8 a2 4c<br class="">
                  2e ca 85<br class="">
                  [  +0.003255] RSP: 0018:ffffb58fac313928 EFLAGS:
                  00010202<br class="">
                  [  +0.001641] RAX: ffffffffc09a4270 RBX:
                  ffff8b0c9c840000 RCX:<br class="">
                  00000000ffffffff<br class="">
                  [  +0.001656] RDX: 0000000000000000 RSI:
                  000000000001629a RDI:<br class="">
                  ffff8b0c9c840000<br class="">
                  [  +0.001681] RBP: ffffb58fac313948 R08:
                  0000000000000021 R09:<br class="">
                  0000000000000001<br class="">
                  [  +0.001662] R10: ffffb58fac313b30 R11:
                  ffffffff8c065b00 R12:<br class="">
                  0000000000058a68<br class="">
                  [  +0.001650] R13: 000000000001629a R14:
                  0000000000000000 R15:<br class="">
                  000000000001629a<br class="">
                  [  +0.001648] FS:  0000000000000000(0000)
                  GS:ffff8b4b7fa80000(0000)<br class="">
                  knlGS:0000000000000000<br class="">
                  [  +0.001668] CS:  0010 DS: 0000 ES: 0000 CR0:
                  0000000080050033<br class="">
                  [  +0.001673] CR2: 0000000000058a68 CR3:
                  000000010a2c8001 CR4:<br class="">
                  00000000001706e0<br class="">
                  [  +0.001740] Fixing recursive fault but reboot is
                  needed!<br class="">
                  <br class="">
                  <br class="">
                  <blockquote type="cite" class="">On Apr 21, 2022, at
                    2:41 AM, Andrey Grodzovsky<br class="">
                    &lt;<a href="mailto:andrey.grodzovsky@amd.com" class="moz-txt-link-freetext" moz-do-not-send="true">andrey.grodzovsky@amd.com</a>&gt;
                    wrote:<br class="">
                    <br class="">
                    I retested the hot plug tests at the commit I mentioned
                    below and it looks<br class="">
                    OK. My ASIC is Navi 10; I also tested using Vega 10
                    and older<br class="">
                    Polaris ASICs (whatever I had at home at the time).
                    It's possible<br class="">
                    there are extra issues in ASICs like yours which I
                    didn't cover during<br class="">
                    tests.<br class="">
                    <br class="">
                    andrey@andrey-test:~/drm$ sudo
                    ./build/tests/amdgpu/amdgpu_test -s 13<br class="">
                    /usr/local/share/libdrm/amdgpu.ids: No such file or
                    directory<br class="">
                    /usr/local/share/libdrm/amdgpu.ids: No such file or
                    directory<br class="">
                    /usr/local/share/libdrm/amdgpu.ids: No such file or
                    directory<br class="">
                    <br class="">
                    <br class="">
                    The ASIC NOT support UVD, suite disabled<br class="">
                    /usr/local/share/libdrm/amdgpu.ids: No such file or
                    directory<br class="">
                    <br class="">
                    <br class="">
                    The ASIC NOT support VCE, suite disabled<br class="">
                    /usr/local/share/libdrm/amdgpu.ids: No such file or
                    directory<br class="">
                    /usr/local/share/libdrm/amdgpu.ids: No such file or
                    directory<br class="">
                    /usr/local/share/libdrm/amdgpu.ids: No such file or
                    directory<br class="">
                    <br class="">
                    <br class="">
                    The ASIC NOT support UVD ENC, suite disabled.<br class="">
                    /usr/local/share/libdrm/amdgpu.ids: No such file or
                    directory<br class="">
                    /usr/local/share/libdrm/amdgpu.ids: No such file or
                    directory<br class="">
                    /usr/local/share/libdrm/amdgpu.ids: No such file or
                    directory<br class="">
                    /usr/local/share/libdrm/amdgpu.ids: No such file or
                    directory<br class="">
                    <br class="">
                    <br class="">
                    Don't support TMZ (trust memory zone), security
                    suite disabled<br class="">
                    /usr/local/share/libdrm/amdgpu.ids: No such file or
                    directory<br class="">
                    /usr/local/share/libdrm/amdgpu.ids: No such file or
                    directory<br class="">
                    Peer device is not opened or has ASIC not supported
                    by the suite,<br class="">
                    skip all Peer to Peer tests.<br class="">
                    <br class="">
                    <br class="">
                    CUnit - A unit testing framework for C - Version
                    2.1-3<br class="">
                    <a href="http://cunit.sourceforge.net/" class="" moz-do-not-send="true">http://cunit.sourceforge.net/</a><br class="">
                    <br class="">
                    <br class="">
                    Suite: Hotunplug Tests<br class="">
                    Test: Unplug card and rescan the bus to plug it
                    back<br class="">
                    .../usr/local/share/libdrm/amdgpu.ids: No such file
                    or directory<br class="">
                    passed<br class="">
                    Test: Same as first test but with command
                    submission<br class="">
                    .../usr/local/share/libdrm/amdgpu.ids: No such file
                    or directory<br class="">
                    passed<br class="">
                    Test: Unplug with exported bo<br class="">
                    .../usr/local/share/libdrm/amdgpu.ids: No such file
                    or directory<br class="">
                    passed<br class="">
                    <br class="">
                    Run Summary: Type Total Ran Passed Failed Inactive<br class="">
                    suites 14 1 n/a 0 0<br class="">
                    tests 71 3 3 0 1<br class="">
                    asserts 21 21 21 0 n/a<br class="">
                    <br class="">
                    Elapsed time = 9.195 seconds<br class="">
                    <br class="">
                    <br class="">
                    Andrey<br class="">
                    <br class="">
                    On 2022-04-20 11:44, Andrey Grodzovsky wrote:<br class="">
                    <blockquote type="cite" class=""><br class="">
                      The only one in Radeon 7 I see is the same sysfs
                      crash we already<br class="">
                      fixed, so you can use the same fix. The MI 200
                      issue I haven't seen<br class="">
                      yet, but I also haven't tested MI200 so never saw
                      it before. Need<br class="">
                      to test when I get the time.<br class="">
                      <br class="">
                      So try that fix with Radeon 7 again to see if you
                      pass the tests<br class="">
                      (the warnings should all be minor issues).<br class="">
                      <br class="">
                      Andrey<br class="">
                      <br class="">
                      <br class="">
                      On 2022-04-20 05:24, Shuotao Xu wrote:<br class="">
                      <blockquote type="cite" class="">
                        <blockquote type="cite" class=""><br class="">
                          That's a problem. The latest working baseline I
                          tested and confirmed<br class="">
                          passing hotplug tests is this branch and commit<br class="">
                          <a href="https://gitlab.freedesktop.org/agd5f/linux/-/commit/86e12a53b73135806e101142e72f3f1c0e6fa8e6" class="" moz-do-not-send="true">https://gitlab.freedesktop.org/agd5f/linux/-/commit/86e12a53b73135806e101142e72f3f1c0e6fa8e6</a>,<br class="">
                          which is amd-staging-drm-next. 5.14 was the branch
                          where we upstreamed the<br class="">
                          hotplug code, but it had a lot of regressions
                          over time due to<br class="">
                          new changes (that's why I added the hotplug test,
                          to try and catch<br class="">
                          them early). It would be best to run this
                          branch on MI-100 so we<br class="">
                          have a clean baseline, and only after
                          confirming that this particular<br class="">
                          branch at this commit passes the libdrm tests
                          start<br class="">
                          adding the KFD-specific add-ons. Another option,
                          if you can't work<br class="">
                          with MI-100 and this branch, is to try a
                          different ASIC that does<br class="">
                          work with this branch (if possible).<br class="">
                          <br class="">
                          Andrey<br class="">
                          <br class="">
                        </blockquote>
                        OK, I tried both this commit and the HEAD of
                        amd-staging-drm-next<br class="">
                        on two GPUs (MI100 and Radeon VII); both did not
                        pass the hotplugout<br class="">
                        libdrm test. I might be able to gain access to
                        MI200, but I<br class="">
                        suspect it would work.<br class="">
                        <br class="">
                        I copied the complete dmesg logs as follows. I
                        highlighted the oopses<br class="">
                        for you.<br class="">
                        <br class="">
                        Radeon VII:</blockquote>
                    </blockquote>
                  </blockquote>
                </blockquote>
              </blockquote>
            </blockquote>
          </div>
        </blockquote>
      </div>
      <br class="">
    </blockquote>
  </body>
</html>