[EXTERNAL] [PATCH 2/2] drm/amdkfd: Add PCIe Hotplug Support for AMDKFD
Andrey Grodzovsky
andrey.grodzovsky at amd.com
Wed May 11 13:49:35 UTC 2022
On 2022-05-10 23:35, Shuotao Xu wrote:
>
>
>> On May 11, 2022, at 4:31 AM, Felix Kuehling <felix.kuehling at amd.com>
>> wrote:
>>
>>
>> Am 2022-05-10 um 07:03 schrieb Shuotao Xu:
>>>
>>>
>>>> On Apr 28, 2022, at 12:04 AM, Andrey Grodzovsky
>>>> <andrey.grodzovsky at amd.com> wrote:
>>>>
>>>> On 2022-04-27 05:20, Shuotao Xu wrote:
>>>>
>>>>> Hi Andrey,
>>>>>
>>>>> Sorry that I did not have time to work on this for a few days.
>>>>>
>>>>> I just tried the sysfs crash fix on Radeon VII and it seems that it
>>>>> worked. It did not pass the last hotplug test, but my version has 4
>>>>> tests instead of 3 in your case.
>>>>
>>>>
>>>> That's because the 4th one is only enabled when there are 2 cards in
>>>> the system - to test DRI_PRIME export. I tested this time with only
>>>> one card.
>>>>
>>> Yes, I only had one Radeon VII in my system, so this 4th test should
>>> have been skipped. I am ignoring this issue.
>>>
>>>>>
>>>>>
>>>>> Suite: Hotunplug Tests
>>>>> Test: Unplug card and rescan the bus to plug it back
>>>>> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>> passed
>>>>> Test: Same as first test but with command submission
>>>>> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>> passed
>>>>> Test: Unplug with exported bo
>>>>> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>> passed
>>>>> Test: Unplug with exported fence
>>>>> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>> amdgpu_device_initialize: amdgpu_get_auth (1) failed (-1)
>>>>
>>>>
>>>> On the kernel side, the IOCTL returning this is drm_getclient -
>>>> maybe take a look at why it can't find the client? As far as I
>>>> remember I didn't have such an issue when testing.
>>>>
>>>>
>>>>> FAILED
>>>>> 1. ../tests/amdgpu/hotunplug_tests.c:368 - CU_ASSERT_EQUAL(r,0)
>>>>> 2. ../tests/amdgpu/hotunplug_tests.c:411 -
>>>>> CU_ASSERT_EQUAL(amdgpu_cs_import_syncobj(device2, shared_fd,
>>>>> &sync_obj_handle2),0)
>>>>> 3. ../tests/amdgpu/hotunplug_tests.c:423 -
>>>>> CU_ASSERT_EQUAL(amdgpu_cs_syncobj_wait(device2, &sync_obj_handle2,
>>>>> 1, 100000000, 0, NULL),0)
>>>>> 4. ../tests/amdgpu/hotunplug_tests.c:425 -
>>>>> CU_ASSERT_EQUAL(amdgpu_cs_destroy_syncobj(device2,
>>>>> sync_obj_handle2),0)
>>>>>
>>>>> Run Summary: Type Total Ran Passed Failed Inactive
>>>>> suites 14 1 n/a 0 0
>>>>> tests 71 4 3 1 0
>>>>> asserts 39 39 35 4 n/a
>>>>>
>>>>> Elapsed time = 17.321 seconds
>>>>>
>>>>> For kfd compute, there is some problem which I did not see on MI100
>>>>> after I killed the hung application after hot plugout. I was using
>>>>> the rocm5.0.2 driver for the MI100 card, and I am not sure if it is
>>>>> a regression in the newer driver.
>>>>> After pkill, one of the children of the user process would be stuck
>>>>> in zombie mode (Z), understandably because of the bug, and future
>>>>> rocm applications after plug-back would be in uninterruptible sleep
>>>>> mode (D) because they would not return from a syscall to kfd.
>>>>>
>>>>> The drm test for amdgpu would run just fine without issues after
>>>>> plug-back with the dangling kfd state, though.
>>>>
>>>>
>>>> I am not clear on when the crash below happens. Is it related to what
>>>> you describe above?
>>>>
>>>>
>>>>>
>>>>> I don’t know if there is a quick fix to it. I was thinking of adding
>>>>> drm_enter/drm_exit to amdgpu_device_rreg.
>>>>
>>>>
>>>> Try adding a drm_dev_enter/exit pair at the highest level of
>>>> attempting to access HW - in this case it's
>>>> amdgpu_amdkfd_set_compute_idle. We always try to avoid accessing any
>>>> HW functions after the backing device is gone.
>>>>
>>>>
>>>>> Also, this has been a long-standing attempt of mine to fix the
>>>>> hotplug issue for kfd applications.
>>>>> I don’t know 1) if I would be able to get to MI100 (fixing Radeon
>>>>> VII would mean something, but MI100 is more important for us); 2)
>>>>> what direction the patch for this issue will take.
>>>>
>>>>
>>>> I will go to the office tomorrow to pick up an MI-100. With time and
>>>> priorities permitting I will then try to test it and fix any
>>>> bugs such that it will be passing all hot plug libdrm tests at the
>>>> tip of public amd-staging-drm-next -
>>>> https://gitlab.freedesktop.org/agd5f/linux - after that you can try
>>>> to continue working with ROCm enabling on top of that.
>>>>
>>>> For now I suggest you move on with Radeon VII as your development
>>>> ASIC and use the fix I mentioned above.
>>>>
>>> I finally got some time to continue on the kfd hotplug patch attempt.
>>> The following patch seems to work for kfd hotplug on Radeon VII. After
>>> hot plugout, the tf process exits because of a VM fault.
>>> A new tf process runs without issues after plugback.
>>>
>>> It has the following fixes.
>>>
>>> 1. ras sysfs regression;
>>> 2. skip setting compute idle after the dev is unplugged, otherwise it
>>> will try to write the PCI BAR and thus fault in the driver;
>>> 3. stop the actual work of invalidating the memory map triggered by
>>> userptrs; (returning false would trigger a warning, so I returned
>>> true. Not sure if that is correct)
>>> 4. send exceptions to all the events/signals that a “zombie”
>>> process is waiting for. (Not sure if the hw_exception is
>>> worthwhile; it did not do anything in my case since there is no such
>>> event type associated with that process)
>>>
>>> Please take a look and let me know if it is acceptable.
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>> index 1f8161cd507f..2f7858692067 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>> @@ -33,6 +33,7 @@
>>> #include <uapi/linux/kfd_ioctl.h>
>>> #include "amdgpu_ras.h"
>>> #include "amdgpu_umc.h"
>>> +#include <drm/drm_drv.h>
>>>
>>> /* Total memory size in system memory and all GPU VRAM. Used to
>>> * estimate worst case amount of memory to reserve for page tables
>>> @@ -681,9 +682,10 @@ int amdgpu_amdkfd_submit_ib(struct amdgpu_device *adev,
>>>
>>> void amdgpu_amdkfd_set_compute_idle(struct amdgpu_device *adev, bool idle)
>>> {
>>> - amdgpu_dpm_switch_power_profile(adev,
>>> - PP_SMC_POWER_PROFILE_COMPUTE,
>>> - !idle);
>>> + if (!drm_dev_is_unplugged(adev_to_drm(adev)))
>>> + amdgpu_dpm_switch_power_profile(adev,
>>> + PP_SMC_POWER_PROFILE_COMPUTE,
>>> + !idle);
>>> }
>>>
>>> bool amdgpu_amdkfd_is_kfd_vmid(struct amdgpu_device *adev, u32 vmid)
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
>>> index 4b153daf283d..fb4c9e55eace 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
>>> @@ -46,6 +46,7 @@
>>> #include <linux/firmware.h>
>>> #include <linux/module.h>
>>> #include <drm/drm.h>
>>> +#include <drm/drm_drv.h>
>>>
>>> #include "amdgpu.h"
>>> #include "amdgpu_amdkfd.h"
>>> @@ -104,6 +105,9 @@ static bool amdgpu_mn_invalidate_hsa(struct mmu_interval_notifier *mni,
>>> struct amdgpu_bo *bo = container_of(mni, struct amdgpu_bo, notifier);
>>> struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
>>>
>>> + if (drm_dev_is_unplugged(adev_to_drm(adev)))
>>> + return true;
>>> +
> Label: Fix 3
>>> if (!mmu_notifier_range_blockable(range))
>>> return false;
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>>> index cac56f830aed..fbbaaabf3a67 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>>> @@ -1509,7 +1509,6 @@ static int amdgpu_ras_fs_fini(struct amdgpu_device *adev)
>>> }
>>> }
>>>
>>> - amdgpu_ras_sysfs_remove_all(adev);
>>> return 0;
>>> }
>>> /* ras fs end */
>>> @@ -2557,8 +2556,6 @@ void amdgpu_ras_block_late_fini(struct amdgpu_device *adev,
>>> if (!ras_block)
>>> return;
>>>
>>> - amdgpu_ras_sysfs_remove(adev, ras_block);
>>> -
>>> ras_obj = container_of(ras_block, struct amdgpu_ras_block_object, ras_comm);
>>> if (ras_obj->ras_cb)
>>> amdgpu_ras_interrupt_remove_handler(adev, ras_block);
>>> @@ -2659,6 +2656,7 @@ int amdgpu_ras_pre_fini(struct amdgpu_device *adev)
>>> /* Need disable ras on all IPs here before ip [hw/sw]fini */
>>> amdgpu_ras_disable_all_features(adev, 0);
>>> amdgpu_ras_recovery_fini(adev);
>>> + amdgpu_ras_sysfs_remove_all(adev);
>>> return 0;
>>> }
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>> index f1a225a20719..4b789bec9670 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>> @@ -714,16 +714,37 @@ bool kfd_is_locked(void)
>>>
>>> void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
>>> {
>>> + struct kfd_process *p;
>>> + struct amdkfd_process_info *p_info;
>>> + unsigned int temp;
>>> +
>>> if (!kfd->init_complete)
>>> return;
>>>
>>> /* for runtime suspend, skip locking kfd */
>>> - if (!run_pm) {
>>> + if (!run_pm && !drm_dev_is_unplugged(kfd->ddev)) {
>>> /* For first KFD device suspend all the KFD processes */
>>> if (atomic_inc_return(&kfd_locked) == 1)
>>> kfd_suspend_all_processes();
>>> }
>>>
>>> + if (drm_dev_is_unplugged(kfd->ddev)) {
>>> + int idx = srcu_read_lock(&kfd_processes_srcu);
>>> + pr_debug("cancel restore_userptr_work\n");
>>> + hash_for_each_rcu(kfd_processes_table, temp, p, kfd_processes) {
>>> + if (kfd_process_gpuidx_from_gpuid(p, kfd->id) >= 0) {
>>> + p_info = p->kgd_process_info;
>>> + pr_debug("cancel processes, pid = %d for gpu_id = %d", pid_nr(p_info->pid), kfd->id);
>>> + cancel_delayed_work_sync(&p_info->restore_userptr_work);
>>
>> Is this really necessary? If it is, there are probably other workers,
>> e.g. related to our SVM code, that would need to be canceled as well.
>>
>
> I deleted this and it seems to be OK. It was previously added to
> suppress restore_userptr_work, which keeps updating PTEs.
> Now this is handled by Fix 3. Please let us know if it is OK :) @Felix
>
>>
>>> +
>>> + /* send exception signals to the kfd events waiting in user space */
>>> + kfd_signal_hw_exception_event(p->pasid);
>>
>> This makes sense. It basically tells user mode that the application's
>> GPU state is lost due to a RAS error or a GPU reset, or now a GPU
>> hot-unplug.
>
> The problem is that it cannot find an event with a type that matches
> HW_EXCEPTION_TYPE, so it does **nothing** in the driver with the
> default parameter value of send_sigterm = false.
> After all, if a “zombie” process (zombie in the sense that it does not
> have a GPU dev) does not exit, kfd resources do not seem to be released
> properly, and a new kfd process cannot run after plug-back.
> (I still need to look hard into the rocr/hsakmt/kfd driver code to
> understand the reason. At least I am seeing that the kfd topology
> won’t be cleaned up without the process exiting, so there would be a
> “zombie” kfd node in the topology, which may or may not cause issues
> in hsakmt.)
> @Felix Do you have suggestions/insight on this “zombie” process issue?
> @Andrey suggests it should be OK to have a “zombie” kfd process and a
> “zombie” kfd dev, and the new kfd process should be OK to run on the
> new kfd dev after plugback.
My experience with the graphics stack at least showed that. In a
setup with 2 GPUs, if I removed a secondary GPU which had a rendering
process on it, I could plug the secondary GPU back and start a new
rendering process while the old zombie process was still present. It
could be that in the KFD case there are some obstacles to this that
need to be resolved.
Andrey
>
> May 11 09:52:07 NETSYS26 kernel: [52604.845400] amdgpu: cancel
> restore_userptr_work
> May 11 09:52:07 NETSYS26 kernel: [52604.845405] amdgpu: sending hw
> exception to pasid = 0x800
> May 11 09:52:07 NETSYS26 kernel: [52604.845414] kfd kfd: amdgpu:
> Process 25894 (pasid 0x8001) got unhandled exception
>
>>
>>
>>> + kfd_signal_vm_fault_event(kfd, p->pasid, NULL);
>>
>> This does not make sense. A VM fault indicates an access to a bad
>> virtual address by the GPU. If a debugger is attached to the process, it
>> notifies the debugger to investigate what went wrong. If the GPU is
>> gone, that doesn't make any sense. There is no GPU that could have
>> issued a bad memory request. And the debugger won't be happy either to
>> find a VM fault from a GPU that doesn't exist any more.
>
> OK understood.
>
>>
>> If the HW-exception event doesn't terminate your process, we may need to
>> look into how ROCr handles the HW-exception events.
>>
>>
>>> + }
>>> + }
>>> + srcu_read_unlock(&kfd_processes_srcu, idx);
>>> + }
>>> +
>>> kfd->dqm->ops.stop(kfd->dqm);
>>> kfd_iommu_suspend(kfd);
>>
>> Should DQM stop and IOMMU suspend still be executed? Or should the
>> hot-unplug case short-circuit them?
>
> I tried short-circuiting them, but that later caused a BUG related to
> GPU reset. I added the following, which solves the issue on plugout.
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index b583026dc893..d78a06d74759 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -5317,7 +5317,8 @@ static void amdgpu_device_queue_gpu_recover_work(struct work_struct *work)
> {
> struct amdgpu_recover_work_struct *recover_work = container_of(work, struct amdgpu_recover_work_struct, base);
>
> - recover_work->ret = amdgpu_device_gpu_recover_imp(recover_work->adev, recover_work->job);
> + if (!drm_dev_is_unplugged(adev_to_drm(recover_work->adev)))
> + recover_work->ret = amdgpu_device_gpu_recover_imp(recover_work->adev, recover_work->job);
> }
> /*
> * Serialize gpu recover into reset domain single threaded wq
>
> However, after killing the zombie process, it failed to evict the
> queues of the process.
>
> [ +0.000002] amdgpu: writing 263 to doorbell address 00000000c86e63f2
> [ +9.002503] amdgpu: qcm fence wait loop timeout expired
> [ +0.001364] amdgpu: The cp might be in an unrecoverable state due to
> an unsuccessful queues preemption
> [ +0.001343] amdgpu: Failed to evict process queues
> [ +0.001355] amdgpu: Failed to evict queues of pasid 0x8001
>
>
> This would cause a driver BUG triggered by a new kfd process after
> plugback. I am pasting the errors from dmesg after plugback below.
>
>
>
> May 11 10:25:16 NETSYS26 kernel: [ 688.445332] amdgpu: Evicting PASID
> 0x8001 queues
> May 11 10:25:16 NETSYS26 kernel: [ 688.445359] BUG: unable to handle
> page fault for address: 000000020000006e
> May 11 10:25:16 NETSYS26 kernel: [ 688.447516] #PF: supervisor read
> access in kernel mode
> May 11 10:25:16 NETSYS26 kernel: [ 688.449627] #PF:
> error_code(0x0000) - not-present page
> May 11 10:25:16 NETSYS26 kernel: [ 688.451661] PGD 80000020892a8067
> P4D 80000020892a8067 PUD 0
> May 11 10:25:16 NETSYS26 kernel: [ 688.453741] Oops: 0000 [#1]
> PREEMPT SMP PTI
> May 11 10:25:16 NETSYS26 kernel: [ 688.455904] CPU: 25 PID: 9236
> Comm: tf_cnn_benchmar Tainted: G W OE 5.16.0+ #3
> May 11 10:25:16 NETSYS26 kernel: [ 688.457406] amdgpu 0000:05:00.0:
> amdgpu: GPU reset begin!
> May 11 10:25:16 NETSYS26 kernel: [ 688.457798] Hardware name: Dell
> Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 [FPGA Test BIOS] 10/002/2015
> May 11 10:25:16 NETSYS26 kernel: [ 688.461458] RIP:
> 0010:evict_process_queues_cpsch+0x99/0x1b0 [amdgpu]
> May 11 10:25:16 NETSYS26 kernel: [ 688.465238] Code: bd 13 8a dd 85
> c0 0f 85 13 01 00 00 49 8b 5f 10 4d 8d 77 10 49 39 de 75 11 e9 8d 00
> 00 00 48 8b 1b 4c 39 f3 0f 84 81 00 00 00 <80> 7b 6e 00 c6 43 6d 01 74
> ea c6 43 6e 00 41 83 ac 24 70 01 00 00
> May 11 10:25:16 NETSYS26 kernel: [ 688.470516] RSP:
> 0018:ffffb2674c8afbf0 EFLAGS: 00010203
> May 11 10:25:16 NETSYS26 kernel: [ 688.473255] RAX: ffff91c65cca3800
> RBX: 0000000200000000 RCX: 0000000000000001
> May 11 10:25:16 NETSYS26 kernel: [ 688.475691] RDX: 0000000000000000
> RSI: ffffffff9fb712d9 RDI: 00000000ffffffff
> May 11 10:25:16 NETSYS26 kernel: [ 688.478564] RBP: ffffb2674c8afc20
> R08: 0000000000000000 R09: 000000000006ba18
> May 11 10:25:16 NETSYS26 kernel: [ 688.481409] R10: 00007fe5a0000000
> R11: ffffb2674c8af918 R12: ffff91c66d6f5800
> May 11 10:25:16 NETSYS26 kernel: [ 688.484254] R13: ffff91c66d6f5938
> R14: ffff91e5c71ac820 R15: ffff91e5c71ac810
> May 11 10:25:16 NETSYS26 kernel: [ 688.487184] FS:
> 00007fe62124a700(0000) GS:ffff92053fd00000(0000) knlGS:0000000000000000
> May 11 10:25:16 NETSYS26 kernel: [ 688.490308] CS: 0010 DS: 0000 ES:
> 0000 CR0: 0000000080050033
> May 11 10:25:16 NETSYS26 kernel: [ 688.493122] CR2: 000000020000006e
> CR3: 0000002095284004 CR4: 00000000001706e0
> May 11 10:25:16 NETSYS26 kernel: [ 688.496142] Call Trace:
> May 11 10:25:16 NETSYS26 kernel: [ 688.499199] <TASK>
> May 11 10:25:16 NETSYS26 kernel: [ 688.502261]
> kfd_process_evict_queues+0x43/0xf0 [amdgpu]
> May 11 10:25:16 NETSYS26 kernel: [ 688.506378]
> kgd2kfd_quiesce_mm+0x2a/0x60 [amdgpu]
> May 11 10:25:16 NETSYS26 kernel: [ 688.510539]
> amdgpu_amdkfd_evict_userptr+0x46/0x80 [amdgpu]
> May 11 10:25:16 NETSYS26 kernel: [ 688.514110]
> amdgpu_mn_invalidate_hsa+0x9c/0xb0 [amdgpu]
> May 11 10:25:16 NETSYS26 kernel: [ 688.518247]
> __mmu_notifier_invalidate_range_start+0x136/0x1e0
> May 11 10:25:16 NETSYS26 kernel: [ 688.521252]
> change_protection+0x41d/0xcd0
> May 11 10:25:16 NETSYS26 kernel: [ 688.524310]
> change_prot_numa+0x19/0x30
> May 11 10:25:16 NETSYS26 kernel: [ 688.527366]
> task_numa_work+0x1ca/0x330
> May 11 10:25:16 NETSYS26 kernel: [ 688.530157] task_work_run+0x6c/0xa0
> May 11 10:25:16 NETSYS26 kernel: [ 688.533124]
> exit_to_user_mode_prepare+0x1af/0x1c0
> May 11 10:25:16 NETSYS26 kernel: [ 688.536058]
> syscall_exit_to_user_mode+0x2a/0x40
> May 11 10:25:16 NETSYS26 kernel: [ 688.538989] do_syscall_64+0x46/0xb0
> May 11 10:25:16 NETSYS26 kernel: [ 688.541830]
> entry_SYSCALL_64_after_hwframe+0x44/0xae
> May 11 10:25:16 NETSYS26 kernel: [ 688.544701] RIP: 0033:0x7fe6585ec317
> May 11 10:25:16 NETSYS26 kernel: [ 688.547297] Code: b3 66 90 48 8b
> 05 71 4b 2d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f
> 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3
> 48 8b 0d 41 4b 2d 00 f7 d8 64 89 01 48
> May 11 10:25:16 NETSYS26 kernel: [ 688.553183] RSP:
> 002b:00007fe6212494c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> May 11 10:25:16 NETSYS26 kernel: [ 688.556105] RAX: ffffffffffffffc2
> RBX: 0000000000000000 RCX: 00007fe6585ec317
> May 11 10:25:16 NETSYS26 kernel: [ 688.558970] RDX: 00007fe621249540
> RSI: 00000000c0584b02 RDI: 0000000000000003
> May 11 10:25:16 NETSYS26 kernel: [ 688.561950] RBP: 00007fe621249540
> R08: 0000000000000000 R09: 0000000000040000
> May 11 10:25:16 NETSYS26 kernel: [ 688.564563] R10: 00007fe617480000
> R11: 0000000000000246 R12: 00000000c0584b02
> May 11 10:25:16 NETSYS26 kernel: [ 688.567494] R13: 0000000000000003
> R14: 0000000000000064 R15: 00007fe621249920
> May 11 10:25:16 NETSYS26 kernel: [ 688.570470] </TASK>
> May 11 10:25:16 NETSYS26 kernel: [ 688.573380] Modules linked in:
> amdgpu(OE) veth nf_conntrack_netlink nfnetlink xfrm_user xt_addrtype
> br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat
> nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4
> ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter
> ebtables ip6table_filter ip6_tables iptable_filter overlay
> esp6_offload esp6 esp4_offload esp4 xfrm_algo intel_rapl_msr
> intel_rapl_common sb_edac snd_hda_codec_hdmi x86_pkg_temp_thermal
> snd_hda_intel intel_powerclamp snd_intel_dspcfg ipmi_ssif coretemp
> snd_hda_codec kvm_intel snd_hda_core snd_hwdep kvm snd_pcm snd_timer
> snd soundcore ftdi_sio irqbypass rapl intel_cstate usbserial joydev
> mei_me input_leds mei iTCO_wdt iTCO_vendor_support lpc_ich ipmi_si
> ipmi_devintf mac_hid acpi_power_meter ipmi_msghandler sch_fq_codel
> ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi
> scsi_transport_iscsi ip_tables x_tables autofs4 btrfs blake2b_generic
> zstd_compress raid10 raid456
> May 11 10:25:16 NETSYS26 kernel: [ 688.573543] async_raid6_recov
> async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1
> raid0 multipath linear iommu_v2 gpu_sched drm_ttm_helper mgag200 ttm
> drm_shmem_helper drm_kms_helper syscopyarea hid_generic
> crct10dif_pclmul crc32_pclmul sysfillrect ghash_clmulni_intel
> sysimgblt fb_sys_fops uas usbhid aesni_intel crypto_simd igb ahci hid
> drm usb_storage cryptd libahci dca megaraid_sas i2c_algo_bit wmi [last
> unloaded: amdgpu]
> May 11 10:25:16 NETSYS26 kernel: [ 688.611083] CR2: 000000020000006e
> May 11 10:25:16 NETSYS26 kernel: [ 688.614454] ---[ end trace
> 349cf28efb6268bc ]---
>
> Looking forward to the comments.
>
> Regards,
> Shuotao
>
>>
>> Regards,
>> Felix
>>
>>
>>> }
>>>
>>> Regards,
>>> Shuotao
>>>>
>>>> Andrey
>>>>
>>>>
>>>>>
>>>>> Regards,
>>>>> Shuotao
>>>>>
>>>>> [ +0.001645] BUG: unable to handle page fault for address:
>>>>> 0000000000058a68
>>>>> [ +0.001298] #PF: supervisor read access in kernel mode
>>>>> [ +0.001252] #PF: error_code(0x0000) - not-present page
>>>>> [ +0.001248] PGD 8000000115806067 P4D 8000000115806067 PUD
>>>>> 109b2d067 PMD 0
>>>>> [ +0.001270] Oops: 0000 [#1] PREEMPT SMP PTI
>>>>> [ +0.001256] CPU: 5 PID: 13818 Comm: tf_cnn_benchmar Tainted: G
>>>>> W E 5.16.0+ #3
>>>>> [ +0.001290] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS
>>>>> 1.5.4 [FPGA Test BIOS] 10/002/2015
>>>>> [ +0.001309] RIP: 0010:amdgpu_device_rreg.part.24+0xa9/0xe0 [amdgpu]
>>>>> [ +0.001562] Code: e8 8c 7d 02 00 65 ff 0d 65 e0 7f 3f 75 ae 0f 1f
>>>>> 44 00 00 eb a7 83 e2 02 75 09 f6 87 10 69 01 00 10 75 0d 4c 03 a3 a0
>>>>> 09 00 00 <45> 8b 24 24 eb 8a 4c 8d b7 b0 6b 01 00 4c 89 f7 e8 a2 4c
>>>>> 2e ca 85
>>>>> [ +0.002751] RSP: 0018:ffffb58fac313928 EFLAGS: 00010202
>>>>> [ +0.001388] RAX: ffffffffc09a4270 RBX: ffff8b0c9c840000 RCX:
>>>>> 00000000ffffffff
>>>>> [ +0.001402] RDX: 0000000000000000 RSI: 000000000001629a RDI:
>>>>> ffff8b0c9c840000
>>>>> [ +0.001418] RBP: ffffb58fac313948 R08: 0000000000000021 R09:
>>>>> 0000000000000001
>>>>> [ +0.001421] R10: ffffb58fac313b30 R11: ffffffff8c065b00 R12:
>>>>> 0000000000058a68
>>>>> [ +0.001400] R13: 000000000001629a R14: 0000000000000000 R15:
>>>>> 000000000001629a
>>>>> [ +0.001397] FS: 0000000000000000(0000) GS:ffff8b4b7fa80000(0000)
>>>>> knlGS:0000000000000000
>>>>> [ +0.001411] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>> [ +0.001405] CR2: 0000000000058a68 CR3: 000000010a2c8001 CR4:
>>>>> 00000000001706e0
>>>>> [ +0.001422] Call Trace:
>>>>> [ +0.001407] <TASK>
>>>>> [ +0.001391] amdgpu_device_rreg+0x17/0x20 [amdgpu]
>>>>> [ +0.001614] amdgpu_cgs_read_register+0x14/0x20 [amdgpu]
>>>>> [ +0.001735] phm_wait_for_register_unequal.part.1+0x58/0x90 [amdgpu]
>>>>> [ +0.001790] phm_wait_for_register_unequal+0x1a/0x30 [amdgpu]
>>>>> [ +0.001800] vega20_wait_for_response+0x28/0x80 [amdgpu]
>>>>> [ +0.001757] vega20_send_msg_to_smc_with_parameter+0x21/0x110
>>>>> [amdgpu]
>>>>> [ +0.001838] smum_send_msg_to_smc_with_parameter+0xcd/0x100 [amdgpu]
>>>>> [ +0.001829] ? kvfree+0x1e/0x30
>>>>> [ +0.001462] vega20_set_power_profile_mode+0x58/0x330 [amdgpu]
>>>>> [ +0.001868] ? kvfree+0x1e/0x30
>>>>> [ +0.001462] ? ttm_bo_release+0x261/0x370 [ttm]
>>>>> [ +0.001467] pp_dpm_switch_power_profile+0xc2/0x170 [amdgpu]
>>>>> [ +0.001863] amdgpu_dpm_switch_power_profile+0x6b/0x90 [amdgpu]
>>>>> [ +0.001866] amdgpu_amdkfd_set_compute_idle+0x1a/0x20 [amdgpu]
>>>>> [ +0.001784] kfd_dec_compute_active+0x2c/0x50 [amdgpu]
>>>>> [ +0.001744] process_termination_cpsch+0x2f9/0x3a0 [amdgpu]
>>>>> [ +0.001728] kfd_process_dequeue_from_all_devices+0x49/0x70 [amdgpu]
>>>>> [ +0.001730] kfd_process_notifier_release+0x91/0xe0 [amdgpu]
>>>>> [ +0.001718] __mmu_notifier_release+0x77/0x1f0
>>>>> [ +0.001411] exit_mmap+0x1b5/0x200
>>>>> [ +0.001396] ? __switch_to+0x12d/0x3e0
>>>>> [ +0.001388] ? __switch_to_asm+0x36/0x70
>>>>> [ +0.001372] ? preempt_count_add+0x74/0xc0
>>>>> [ +0.001364] mmput+0x57/0x110
>>>>> [ +0.001349] do_exit+0x33d/0xc20
>>>>> [ +0.001337] ? _raw_spin_unlock+0x1a/0x30
>>>>> [ +0.001346] do_group_exit+0x43/0xa0
>>>>> [ +0.001341] get_signal+0x131/0x920
>>>>> [ +0.001295] arch_do_signal_or_restart+0xb1/0x870
>>>>> [ +0.001303] ? do_futex+0x125/0x190
>>>>> [ +0.001285] exit_to_user_mode_prepare+0xb1/0x1c0
>>>>> [ +0.001282] syscall_exit_to_user_mode+0x2a/0x40
>>>>> [ +0.001264] do_syscall_64+0x46/0xb0
>>>>> [ +0.001236] entry_SYSCALL_64_after_hwframe+0x44/0xae
>>>>> [ +0.001219] RIP: 0033:0x7f6aff1d2ad3
>>>>> [ +0.001177] Code: Unable to access opcode bytes at RIP
>>>>> 0x7f6aff1d2aa9.
>>>>> [ +0.001166] RSP: 002b:00007f6ab2029d20 EFLAGS: 00000246 ORIG_RAX:
>>>>> 00000000000000ca
>>>>> [ +0.001170] RAX: fffffffffffffe00 RBX: 0000000004f542b0 RCX:
>>>>> 00007f6aff1d2ad3
>>>>> [ +0.001168] RDX: 0000000000000000 RSI: 0000000000000080 RDI:
>>>>> 0000000004f542d8
>>>>> [ +0.001162] RBP: 0000000004f542d4 R08: 0000000000000000 R09:
>>>>> 0000000000000000
>>>>> [ +0.001152] R10: 0000000000000000 R11: 0000000000000246 R12:
>>>>> 0000000004f542d8
>>>>> [ +0.001176] R13: 0000000000000000 R14: 0000000004f54288 R15:
>>>>> 0000000000000000
>>>>> [ +0.001152] </TASK>
>>>>> [ +0.001113] Modules linked in: veth amdgpu(E) nf_conntrack_netlink
>>>>> nfnetlink xfrm_user xt_addrtype br_netfilter xt_CHECKSUM
>>>>> iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack
>>>>> nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4
>>>>> xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter
>>>>> ip6_tables iptable_filter overlay esp6_offload esp6 esp4_offload
>>>>> esp4 xfrm_algo intel_rapl_msr intel_rapl_common sb_edac
>>>>> x86_pkg_temp_thermal intel_powerclamp snd_hda_codec_hdmi
>>>>> snd_hda_intel ipmi_ssif snd_intel_dspcfg coretemp snd_hda_codec
>>>>> kvm_intel snd_hda_core snd_hwdep snd_pcm snd_timer snd kvm soundcore
>>>>> irqbypass ftdi_sio usbserial input_leds iTCO_wdt iTCO_vendor_support
>>>>> joydev mei_me rapl lpc_ich intel_cstate mei ipmi_si ipmi_devintf
>>>>> ipmi_msghandler mac_hid acpi_power_meter sch_fq_codel ib_iser
>>>>> rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi
>>>>> scsi_transport_iscsi ip_tables x_tables autofs4 btrfs
>>>>> blake2b_generic zstd_compress raid10 raid456
>>>>> [ +0.000102] async_raid6_recov async_memcpy async_pq async_xor
>>>>> async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear
>>>>> iommu_v2 gpu_sched drm_ttm_helper mgag200 ttm drm_shmem_helper
>>>>> drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops
>>>>> crct10dif_pclmul hid_generic crc32_pclmul ghash_clmulni_intel usbhid
>>>>> uas aesni_intel crypto_simd igb ahci hid drm usb_storage cryptd
>>>>> libahci dca megaraid_sas i2c_algo_bit wmi [last unloaded: amdgpu]
>>>>> [ +0.016626] CR2: 0000000000058a68
>>>>> [ +0.001550] ---[ end trace ff90849fe0a8b3b4 ]---
>>>>> [ +0.024953] RIP: 0010:amdgpu_device_rreg.part.24+0xa9/0xe0 [amdgpu]
>>>>> [ +0.001814] Code: e8 8c 7d 02 00 65 ff 0d 65 e0 7f 3f 75 ae 0f 1f
>>>>> 44 00 00 eb a7 83 e2 02 75 09 f6 87 10 69 01 00 10 75 0d 4c 03 a3 a0
>>>>> 09 00 00 <45> 8b 24 24 eb 8a 4c 8d b7 b0 6b 01 00 4c 89 f7 e8 a2 4c
>>>>> 2e ca 85
>>>>> [ +0.003255] RSP: 0018:ffffb58fac313928 EFLAGS: 00010202
>>>>> [ +0.001641] RAX: ffffffffc09a4270 RBX: ffff8b0c9c840000 RCX:
>>>>> 00000000ffffffff
>>>>> [ +0.001656] RDX: 0000000000000000 RSI: 000000000001629a RDI:
>>>>> ffff8b0c9c840000
>>>>> [ +0.001681] RBP: ffffb58fac313948 R08: 0000000000000021 R09:
>>>>> 0000000000000001
>>>>> [ +0.001662] R10: ffffb58fac313b30 R11: ffffffff8c065b00 R12:
>>>>> 0000000000058a68
>>>>> [ +0.001650] R13: 000000000001629a R14: 0000000000000000 R15:
>>>>> 000000000001629a
>>>>> [ +0.001648] FS: 0000000000000000(0000) GS:ffff8b4b7fa80000(0000)
>>>>> knlGS:0000000000000000
>>>>> [ +0.001668] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>> [ +0.001673] CR2: 0000000000058a68 CR3: 000000010a2c8001 CR4:
>>>>> 00000000001706e0
>>>>> [ +0.001740] Fixing recursive fault but reboot is needed!
>>>>>
>>>>>
>>>>>> On Apr 21, 2022, at 2:41 AM, Andrey Grodzovsky
>>>>>> <andrey.grodzovsky at amd.com> wrote:
>>>>>>
>>>>>> I retested the hot plug tests at the commit I mentioned below -
>>>>>> looks OK; my ASIC is Navi 10. I also tested using Vega 10 and older
>>>>>> Polaris ASICs (whatever I had at home at the time). It's possible
>>>>>> there are extra issues in ASICs like yours which I didn't cover
>>>>>> during tests.
>>>>>>
>>>>>> andrey at andrey-test:~/drm$ sudo ./build/tests/amdgpu/amdgpu_test -s 13
>>>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>>
>>>>>>
>>>>>> The ASIC NOT support UVD, suite disabled
>>>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>>
>>>>>>
>>>>>> The ASIC NOT support VCE, suite disabled
>>>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>>
>>>>>>
>>>>>> The ASIC NOT support UVD ENC, suite disabled.
>>>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>>
>>>>>>
>>>>>> Don't support TMZ (trust memory zone), security suite disabled
>>>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>> Peer device is not opened or has ASIC not supported by the suite,
>>>>>> skip all Peer to Peer tests.
>>>>>>
>>>>>>
>>>>>> CUnit - A unit testing framework for C - Version 2.1-3
>>>>>> http://cunit.sourceforge.net/
>>>>>>
>>>>>>
>>>>>> Suite: Hotunplug Tests
>>>>>> Test: Unplug card and rescan the bus to plug it back
>>>>>> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>> passed
>>>>>> Test: Same as first test but with command submission
>>>>>> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>> passed
>>>>>> Test: Unplug with exported bo
>>>>>> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>> passed
>>>>>>
>>>>>> Run Summary: Type Total Ran Passed Failed Inactive
>>>>>> suites 14 1 n/a 0 0
>>>>>> tests 71 3 3 0 1
>>>>>> asserts 21 21 21 0 n/a
>>>>>>
>>>>>> Elapsed time = 9.195 seconds
>>>>>>
>>>>>>
>>>>>> Andrey
>>>>>>
>>>>>> On 2022-04-20 11:44, Andrey Grodzovsky wrote:
>>>>>>>
>>>>>>> The only one in Radeon 7 I see is the same sysfs crash we already
>>>>>>> fixed so you can use the same fix. The MI 200 issue i haven't seen
>>>>>>> yet but I also haven't tested MI200 so never saw it before. Need
>>>>>>> to test when i get the time.
>>>>>>>
>>>>>>> So try that fix with Radeon 7 again to see if you pass the tests
>>>>>>> (the warnings should all be minor issues).
>>>>>>>
>>>>>>> Andrey
>>>>>>>
>>>>>>>
>>>>>>> On 2022-04-20 05:24, Shuotao Xu wrote:
>>>>>>>>>
>>>>>>>>> That's a problem. The latest working baseline I tested and
>>>>>>>>> confirmed passing hotplug tests is this branch and commit
>>>>>>>>> https://gitlab.freedesktop.org/agd5f/linux/-/commit/86e12a53b73135806e101142e72f3f1c0e6fa8e6
>>>>>>>>> which is amd-staging-drm-next. 5.14 was the branch we upstreamed
>>>>>>>>> the hotplug code on, but it had a lot of regressions over time
>>>>>>>>> due to new changes (that's why I added the hotplug test, to try
>>>>>>>>> and catch them early). It would be best to run this branch on
>>>>>>>>> MI-100 so we have a clean baseline, and only after confirming
>>>>>>>>> this particular branch at this commit passes the libdrm tests,
>>>>>>>>> then start adding the KFD-specific addons. Another option, if you
>>>>>>>>> can't work with MI-100 and this branch, is to try a different
>>>>>>>>> ASIC that does work with this branch (if possible).
>>>>>>>>>
>>>>>>>>> Andrey
>>>>>>>>>
>>>>>>>> OK, I tried both this commit and the HEAD of amd-staging-drm-next
>>>>>>>> on two GPUs (MI100 and Radeon VII); both did not pass the hot
>>>>>>>> plugout libdrm test. I might be able to gain access to MI200, but
>>>>>>>> I suspect it would work.
>>>>>>>>
>>>>>>>> I copied the complete dmesgs as follows. I highlighted the OOPSES
>>>>>>>> for you.
>>>>>>>>
>>>>>>>> Radeon VII:
>