[EXTERNAL] [PATCH 2/2] drm/amdkfd: Add PCIe Hotplug Support for AMDKFD
Andrey Grodzovsky
andrey.grodzovsky at amd.com
Wed May 11 13:49:35 UTC 2022
On 2022-05-10 23:35, Shuotao Xu wrote:
>
>
>> On May 11, 2022, at 4:31 AM, Felix Kuehling <felix.kuehling at amd.com>
>> wrote:
>>
>>
>> Am 2022-05-10 um 07:03 schrieb Shuotao Xu:
>>>
>>>
>>>> On Apr 28, 2022, at 12:04 AM, Andrey Grodzovsky
>>>> <andrey.grodzovsky at amd.com> wrote:
>>>>
>>>> On 2022-04-27 05:20, Shuotao Xu wrote:
>>>>
>>>>> Hi Andrey,
>>>>>
>>>>> Sorry that I did not have time to work on this for a few days.
>>>>>
>>>>> I just tried the sysfs crash fix on Radeon VII and it seems that it
>>>>> worked. It did not pass the last hotplug test, but my version has 4
>>>>> tests instead of 3 in your case.
>>>>
>>>>
>>>> That's because the 4th one is only enabled when there are 2 cards in
>>>> the system - to test DRI_PRIME export. I tested this time with only
>>>> one card.
>>>>
>>> Yes, I only had one Radeon VII in my system, so this 4th test should
>>> have been skipped. I am ignoring this issue.
>>>
>>>>>
>>>>>
>>>>> Suite: Hotunplug Tests
>>>>> Test: Unplug card and rescan the bus to plug it back
>>>>> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>> passed
>>>>> Test: Same as first test but with command submission
>>>>> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>> passed
>>>>> Test: Unplug with exported bo
>>>>> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>> passed
>>>>> Test: Unplug with exported fence
>>>>> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>> amdgpu_device_initialize: amdgpu_get_auth (1) failed (-1)
>>>>
>>>>
>>>> On the kernel side, the IOCTL returning this is drm_getclient -
>>>> maybe take a look at why it can't find the client? As far as I
>>>> remember I didn't have such an issue when testing.
>>>>
>>>>
>>>>> FAILED
>>>>> 1. ../tests/amdgpu/hotunplug_tests.c:368 - CU_ASSERT_EQUAL(r,0)
>>>>> 2. ../tests/amdgpu/hotunplug_tests.c:411 -
>>>>> CU_ASSERT_EQUAL(amdgpu_cs_import_syncobj(device2, shared_fd,
>>>>> &sync_obj_handle2),0)
>>>>> 3. ../tests/amdgpu/hotunplug_tests.c:423 -
>>>>> CU_ASSERT_EQUAL(amdgpu_cs_syncobj_wait(device2, &sync_obj_handle2,
>>>>> 1, 100000000, 0, NULL),0)
>>>>> 4. ../tests/amdgpu/hotunplug_tests.c:425 -
>>>>> CU_ASSERT_EQUAL(amdgpu_cs_destroy_syncobj(device2,
>>>>> sync_obj_handle2),0)
>>>>>
>>>>> Run Summary: Type Total Ran Passed Failed Inactive
>>>>> suites 14 1 n/a 0 0
>>>>> tests 71 4 3 1 0
>>>>> asserts 39 39 35 4 n/a
>>>>>
>>>>> Elapsed time = 17.321 seconds
>>>>>
>>>>> For kfd compute, there is some problem which I did not see on MI100
>>>>> after I killed the hung application after hot plugout. I was using
>>>>> the rocm5.0.2 driver for the MI100 card, and I am not sure if it is
>>>>> a regression in the newer driver.
>>>>> After pkill, one of the children of the user process would be stuck
>>>>> in zombie mode (Z), understandably because of the bug, and future
>>>>> rocm applications after plug-back would be in uninterruptible sleep
>>>>> mode (D) because they would not return from a syscall to kfd.
>>>>>
>>>>> The drm test for amdgpu would run just fine without issues after
>>>>> plug-back with the dangling kfd state, though.
>>>>
>>>>
>>>> I am not clear on when the crash below happens. Is it related to what
>>>> you describe above?
>>>>
>>>>
>>>>>
>>>>> I don’t know if there is a quick fix to it. I was thinking of adding
>>>>> drm_enter/drm_exit to amdgpu_device_rreg.
>>>>
>>>>
>>>> Try adding a drm_dev_enter/exit pair at the highest level of
>>>> attempting to access HW - in this case it's
>>>> amdgpu_amdkfd_set_compute_idle. We always try to avoid accessing any
>>>> HW functions after the backing device is gone.
>>>>
>>>>
>>>>> Also, this has been a long-standing attempt of mine to fix the
>>>>> hotplug issue for kfd applications.
>>>>> I don’t know 1) if I would be able to get to MI100 (fixing Radeon
>>>>> VII would mean something, but MI100 is more important for us); 2)
>>>>> what direction the patch for this issue will take.
>>>>
>>>>
>>>> I will go to the office tomorrow to pick up an MI-100. With time and
>>>> priorities permitting I will then try to test it and fix any
>>>> bugs such that it will be passing all hot plug libdrm tests at the
>>>> tip of public amd-staging-drm-next -
>>>> https://gitlab.freedesktop.org/agd5f/linux - after that you can try
>>>> to continue working with ROCm enabling on top of that.
>>>>
>>>> For now I suggest you move on with Radeon VII as your development
>>>> ASIC and use the fix I mentioned above.
>>>>
>>> I finally got some time to continue on the kfd hotplug patch attempt.
>>> The following patch seems to work for kfd hotplug on Radeon VII. After
>>> hot plugout, the tf process exits because of a VM fault.
>>> A new tf process runs without issues after plugback.
>>>
>>> It has the following fixes.
>>>
>>> 1. ras sysfs regression;
>>> 2. skip setting compute idle after the dev is unplugged, otherwise it
>>> will try to write the PCI BAR and thus fault in the driver;
>>> 3. stop the actual work of invalidating the memory map triggered by
>>> userptrs; (returning false would trigger a warning, so I returned
>>> true. Not sure if that is correct)
>>> 4. send exceptions to all the events/signals that a “zombie”
>>> process is waiting for. (Not sure if the hw_exception is
>>> worthwhile; it did not do anything in my case since there is no such
>>> event type associated with that process)
>>>
>>> Please take a look and let me know if it is acceptable.
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>> index 1f8161cd507f..2f7858692067 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>> @@ -33,6 +33,7 @@
>>> #include <uapi/linux/kfd_ioctl.h>
>>> #include "amdgpu_ras.h"
>>> #include "amdgpu_umc.h"
>>> +#include <drm/drm_drv.h>
>>>
>>> /* Total memory size in system memory and all GPU VRAM. Used to
>>> * estimate worst case amount of memory to reserve for page tables
>>> @@ -681,9 +682,10 @@ int amdgpu_amdkfd_submit_ib(struct amdgpu_device *adev,
>>>
>>> void amdgpu_amdkfd_set_compute_idle(struct amdgpu_device *adev, bool idle)
>>> {
>>> - amdgpu_dpm_switch_power_profile(adev,
>>> - PP_SMC_POWER_PROFILE_COMPUTE,
>>> - !idle);
>>> + if (!drm_dev_is_unplugged(adev_to_drm(adev)))
>>> + amdgpu_dpm_switch_power_profile(adev,
>>> + PP_SMC_POWER_PROFILE_COMPUTE,
>>> + !idle);
>>> }
>>>
>>> bool amdgpu_amdkfd_is_kfd_vmid(struct amdgpu_device *adev, u32 vmid)
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
>>> index 4b153daf283d..fb4c9e55eace 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
>>> @@ -46,6 +46,7 @@
>>> #include <linux/firmware.h>
>>> #include <linux/module.h>
>>> #include <drm/drm.h>
>>> +#include <drm/drm_drv.h>
>>>
>>> #include "amdgpu.h"
>>> #include "amdgpu_amdkfd.h"
>>> @@ -104,6 +105,9 @@ static bool amdgpu_mn_invalidate_hsa(struct mmu_interval_notifier *mni,
>>> struct amdgpu_bo *bo = container_of(mni, struct amdgpu_bo, notifier);
>>> struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
>>>
>>> + if (drm_dev_is_unplugged(adev_to_drm(adev)))
>>> + return true;
>>> +
> Label: Fix 3
>>> if (!mmu_notifier_range_blockable(range))
>>> return false;
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>>> index cac56f830aed..fbbaaabf3a67 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>>> @@ -1509,7 +1509,6 @@ static int amdgpu_ras_fs_fini(struct amdgpu_device *adev)
>>> }
>>> }
>>>
>>> - amdgpu_ras_sysfs_remove_all(adev);
>>> return 0;
>>> }
>>> /* ras fs end */
>>> @@ -2557,8 +2556,6 @@ void amdgpu_ras_block_late_fini(struct amdgpu_device *adev,
>>> if (!ras_block)
>>> return;
>>>
>>> - amdgpu_ras_sysfs_remove(adev, ras_block);
>>> -
>>> ras_obj = container_of(ras_block, struct amdgpu_ras_block_object, ras_comm);
>>> if (ras_obj->ras_cb)
>>> amdgpu_ras_interrupt_remove_handler(adev, ras_block);
>>> @@ -2659,6 +2656,7 @@ int amdgpu_ras_pre_fini(struct amdgpu_device *adev)
>>> /* Need disable ras on all IPs here before ip [hw/sw]fini */
>>> amdgpu_ras_disable_all_features(adev, 0);
>>> amdgpu_ras_recovery_fini(adev);
>>> + amdgpu_ras_sysfs_remove_all(adev);
>>> return 0;
>>> }
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>> index f1a225a20719..4b789bec9670 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>> @@ -714,16 +714,37 @@ bool kfd_is_locked(void)
>>>
>>> void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
>>> {
>>> + struct kfd_process *p;
>>> + struct amdkfd_process_info *p_info;
>>> + unsigned int temp;
>>> +
>>> if (!kfd->init_complete)
>>> return;
>>>
>>> /* for runtime suspend, skip locking kfd */
>>> - if (!run_pm) {
>>> + if (!run_pm && !drm_dev_is_unplugged(kfd->ddev)) {
>>> /* For first KFD device suspend all the KFD processes */
>>> if (atomic_inc_return(&kfd_locked) == 1)
>>> kfd_suspend_all_processes();
>>> }
>>>
>>> + if (drm_dev_is_unplugged(kfd->ddev)) {
>>> + int idx = srcu_read_lock(&kfd_processes_srcu);
>>> + pr_debug("cancel restore_userptr_work\n");
>>> + hash_for_each_rcu(kfd_processes_table, temp, p, kfd_processes) {
>>> + if (kfd_process_gpuidx_from_gpuid(p, kfd->id) >= 0) {
>>> + p_info = p->kgd_process_info;
>>> + pr_debug("cancel processes, pid = %d for gpu_id = %d", pid_nr(p_info->pid), kfd->id);
>>> + cancel_delayed_work_sync(&p_info->restore_userptr_work);
>>
>> Is this really necessary? If it is, there are probably other workers,
>> e.g. related to our SVM code, that would need to be canceled as well.
>>
>
> I deleted this and it seems to be OK. It was previously added to
> suppress restore_userptr_work, which keeps updating PTEs.
> Now this is handled by Fix 3. Please let us know if it is OK :) @Felix
>
>>
>>> +
>>> + /* send exception signals to the kfd events waiting in user space */
>>> + kfd_signal_hw_exception_event(p->pasid);
>>
>> This makes sense. It basically tells user mode that the application's
>> GPU state is lost due to a RAS error or a GPU reset, or now a GPU
>> hot-unplug.
>
> The problem is that it cannot find an event with a type that matches
> HW_EXCEPTION_TYPE, so it does **nothing** in the driver with the
> default parameter value of send_sigterm = false.
> After all, if a “zombie” process (zombie in the sense that it does not
> have a GPU dev) does not exit, kfd resources do not seem to be released
> properly, and a new kfd process cannot run after plug-back.
> (I still need to look hard into the rocr/hsakmt/kfd driver code to
> understand the reason. At least I am seeing that the kfd topology
> won’t be cleaned up without the process exiting, so there would be a
> “zombie” kfd node in the topology, which may or may not cause issues
> in hsakmt.)
> @Felix Do you have suggestions/insight on this “zombie” process issue?
> @Andrey suggests it should be OK to have a “zombie” kfd process and a
> “zombie” kfd dev, and the new kfd process should be OK to run on the
> new kfd dev after plugback.
My experience with the graphics stack at least showed that. In a
setup with 2 GPUs, if I removed a secondary GPU which had a rendering
process on it, I could plug the secondary GPU back and start a new
rendering process while the old zombie process was still present. It
could be that in the KFD case there are some obstacles to this that
need to be resolved.
Andrey
>
> May 11 09:52:07 NETSYS26 kernel: [52604.845400] amdgpu: cancel
> restore_userptr_work
> May 11 09:52:07 NETSYS26 kernel: [52604.845405] amdgpu: sending hw
> exception to pasid = 0x800
> May 11 09:52:07 NETSYS26 kernel: [52604.845414] kfd kfd: amdgpu:
> Process 25894 (pasid 0x8001) got unhandled exception
>
>>
>>
>>> + kfd_signal_vm_fault_event(kfd, p->pasid, NULL);
>>
>> This does not make sense. A VM fault indicates an access to a bad
>> virtual address by the GPU. If a debugger is attached to the process, it
>> notifies the debugger to investigate what went wrong. If the GPU is
>> gone, that doesn't make any sense. There is no GPU that could have
>> issued a bad memory request. And the debugger won't be happy either to
>> find a VM fault from a GPU that doesn't exist any more.
>
> OK understood.
>
>>
>> If the HW-exception event doesn't terminate your process, we may need to
>> look into how ROCr handles the HW-exception events.
>>
>>
>>> + }
>>> + }
>>> + srcu_read_unlock(&kfd_processes_srcu, idx);
>>> + }
>>> +
>>> kfd->dqm->ops.stop(kfd->dqm);
>>> kfd_iommu_suspend(kfd);
>>
>> Should DQM stop and IOMMU suspend still be executed? Or should the
>> hot-unplug case short-circuit them?
>
> I tried short-circuiting them, but that later caused a BUG related to
> GPU reset. I added the following, which solves the issue on plugout.
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index b583026dc893..d78a06d74759 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -5317,7 +5317,8 @@ static void amdgpu_device_queue_gpu_recover_work(struct work_struct *work)
> {
> struct amdgpu_recover_work_struct *recover_work = container_of(work, struct amdgpu_recover_work_struct, base);
>
> - recover_work->ret = amdgpu_device_gpu_recover_imp(recover_work->adev, recover_work->job);
> + if (!drm_dev_is_unplugged(adev_to_drm(recover_work->adev)))
> + recover_work->ret = amdgpu_device_gpu_recover_imp(recover_work->adev, recover_work->job);
> }
> /*
> * Serialize gpu recover into reset domain single threaded wq
>
> However, after killing the zombie process, it failed to evict the
> queues of the process.
>
> [ +0.000002] amdgpu: writing 263 to doorbell address 00000000c86e63f2
> [ +9.002503] amdgpu: qcm fence wait loop timeout expired
> [ +0.001364] amdgpu: The cp might be in an unrecoverable state due to
> an unsuccessful queues preemption
> [ +0.001343] amdgpu: Failed to evict process queues
> [ +0.001355] amdgpu: Failed to evict queues of pasid 0x8001
>
>
> This would cause a driver BUG triggered by a new kfd process after
> plugback. I am pasting the errors from dmesg after plugback below.
>
>
>
> May 11 10:25:16 NETSYS26 kernel: [ 688.445332] amdgpu: Evicting PASID
> 0x8001 queues
> May 11 10:25:16 NETSYS26 kernel: [ 688.445359] BUG: unable to handle
> page fault for address: 000000020000006e
> May 11 10:25:16 NETSYS26 kernel: [ 688.447516] #PF: supervisor read
> access in kernel mode
> May 11 10:25:16 NETSYS26 kernel: [ 688.449627] #PF:
> error_code(0x0000) - not-present page
> May 11 10:25:16 NETSYS26 kernel: [ 688.451661] PGD 80000020892a8067
> P4D 80000020892a8067 PUD 0
> May 11 10:25:16 NETSYS26 kernel: [ 688.453741] Oops: 0000 [#1]
> PREEMPT SMP PTI
> May 11 10:25:16 NETSYS26 kernel: [ 688.455904] CPU: 25 PID: 9236
> Comm: tf_cnn_benchmar Tainted: G W OE 5.16.0+ #3
> May 11 10:25:16 NETSYS26 kernel: [ 688.457406] amdgpu 0000:05:00.0:
> amdgpu: GPU reset begin!
> May 11 10:25:16 NETSYS26 kernel: [ 688.457798] Hardware name: Dell
> Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 [FPGA Test BIOS] 10/002/2015
> May 11 10:25:16 NETSYS26 kernel: [ 688.461458] RIP:
> 0010:evict_process_queues_cpsch+0x99/0x1b0 [amdgpu]
> May 11 10:25:16 NETSYS26 kernel: [ 688.465238] Code: bd 13 8a dd 85
> c0 0f 85 13 01 00 00 49 8b 5f 10 4d 8d 77 10 49 39 de 75 11 e9 8d 00
> 00 00 48 8b 1b 4c 39 f3 0f 84 81 00 00 00 <80> 7b 6e 00 c6 43 6d 01 74
> ea c6 43 6e 00 41 83 ac 24 70 01 00 00
> May 11 10:25:16 NETSYS26 kernel: [ 688.470516] RSP:
> 0018:ffffb2674c8afbf0 EFLAGS: 00010203
> May 11 10:25:16 NETSYS26 kernel: [ 688.473255] RAX: ffff91c65cca3800
> RBX: 0000000200000000 RCX: 0000000000000001
> May 11 10:25:16 NETSYS26 kernel: [ 688.475691] RDX: 0000000000000000
> RSI: ffffffff9fb712d9 RDI: 00000000ffffffff
> May 11 10:25:16 NETSYS26 kernel: [ 688.478564] RBP: ffffb2674c8afc20
> R08: 0000000000000000 R09: 000000000006ba18
> May 11 10:25:16 NETSYS26 kernel: [ 688.481409] R10: 00007fe5a0000000
> R11: ffffb2674c8af918 R12: ffff91c66d6f5800
> May 11 10:25:16 NETSYS26 kernel: [ 688.484254] R13: ffff91c66d6f5938
> R14: ffff91e5c71ac820 R15: ffff91e5c71ac810
> May 11 10:25:16 NETSYS26 kernel: [ 688.487184] FS:
> 00007fe62124a700(0000) GS:ffff92053fd00000(0000) knlGS:0000000000000000
> May 11 10:25:16 NETSYS26 kernel: [ 688.490308] CS: 0010 DS: 0000 ES:
> 0000 CR0: 0000000080050033
> May 11 10:25:16 NETSYS26 kernel: [ 688.493122] CR2: 000000020000006e
> CR3: 0000002095284004 CR4: 00000000001706e0
> May 11 10:25:16 NETSYS26 kernel: [ 688.496142] Call Trace:
> May 11 10:25:16 NETSYS26 kernel: [ 688.499199] <TASK>
> May 11 10:25:16 NETSYS26 kernel: [ 688.502261]
> kfd_process_evict_queues+0x43/0xf0 [amdgpu]
> May 11 10:25:16 NETSYS26 kernel: [ 688.506378]
> kgd2kfd_quiesce_mm+0x2a/0x60 [amdgpu]
> May 11 10:25:16 NETSYS26 kernel: [ 688.510539]
> amdgpu_amdkfd_evict_userptr+0x46/0x80 [amdgpu]
> May 11 10:25:16 NETSYS26 kernel: [ 688.514110]
> amdgpu_mn_invalidate_hsa+0x9c/0xb0 [amdgpu]
> May 11 10:25:16 NETSYS26 kernel: [ 688.518247]
> __mmu_notifier_invalidate_range_start+0x136/0x1e0
> May 11 10:25:16 NETSYS26 kernel: [ 688.521252]
> change_protection+0x41d/0xcd0
> May 11 10:25:16 NETSYS26 kernel: [ 688.524310]
> change_prot_numa+0x19/0x30
> May 11 10:25:16 NETSYS26 kernel: [ 688.527366]
> task_numa_work+0x1ca/0x330
> May 11 10:25:16 NETSYS26 kernel: [ 688.530157] task_work_run+0x6c/0xa0
> May 11 10:25:16 NETSYS26 kernel: [ 688.533124]
> exit_to_user_mode_prepare+0x1af/0x1c0
> May 11 10:25:16 NETSYS26 kernel: [ 688.536058]
> syscall_exit_to_user_mode+0x2a/0x40
> May 11 10:25:16 NETSYS26 kernel: [ 688.538989] do_syscall_64+0x46/0xb0
> May 11 10:25:16 NETSYS26 kernel: [ 688.541830]
> entry_SYSCALL_64_after_hwframe+0x44/0xae
> May 11 10:25:16 NETSYS26 kernel: [ 688.544701] RIP: 0033:0x7fe6585ec317
> May 11 10:25:16 NETSYS26 kernel: [ 688.547297] Code: b3 66 90 48 8b
> 05 71 4b 2d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f
> 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3
> 48 8b 0d 41 4b 2d 00 f7 d8 64 89 01 48
> May 11 10:25:16 NETSYS26 kernel: [ 688.553183] RSP:
> 002b:00007fe6212494c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> May 11 10:25:16 NETSYS26 kernel: [ 688.556105] RAX: ffffffffffffffc2
> RBX: 0000000000000000 RCX: 00007fe6585ec317
> May 11 10:25:16 NETSYS26 kernel: [ 688.558970] RDX: 00007fe621249540
> RSI: 00000000c0584b02 RDI: 0000000000000003
> May 11 10:25:16 NETSYS26 kernel: [ 688.561950] RBP: 00007fe621249540
> R08: 0000000000000000 R09: 0000000000040000
> May 11 10:25:16 NETSYS26 kernel: [ 688.564563] R10: 00007fe617480000
> R11: 0000000000000246 R12: 00000000c0584b02
> May 11 10:25:16 NETSYS26 kernel: [ 688.567494] R13: 0000000000000003
> R14: 0000000000000064 R15: 00007fe621249920
> May 11 10:25:16 NETSYS26 kernel: [ 688.570470] </TASK>
> May 11 10:25:16 NETSYS26 kernel: [ 688.573380] Modules linked in:
> amdgpu(OE) veth nf_conntrack_netlink nfnetlink xfrm_user xt_addrtype
> br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat
> nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4
> ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter
> ebtables ip6table_filter ip6_tables iptable_filter overlay
> esp6_offload esp6 esp4_offload esp4 xfrm_algo intel_rapl_msr
> intel_rapl_common sb_edac snd_hda_codec_hdmi x86_pkg_temp_thermal
> snd_hda_intel intel_powerclamp snd_intel_dspcfg ipmi_ssif coretemp
> snd_hda_codec kvm_intel snd_hda_core snd_hwdep kvm snd_pcm snd_timer
> snd soundcore ftdi_sio irqbypass rapl intel_cstate usbserial joydev
> mei_me input_leds mei iTCO_wdt iTCO_vendor_support lpc_ich ipmi_si
> ipmi_devintf mac_hid acpi_power_meter ipmi_msghandler sch_fq_codel
> ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi
> scsi_transport_iscsi ip_tables x_tables autofs4 btrfs blake2b_generic
> zstd_compress raid10 raid456
> May 11 10:25:16 NETSYS26 kernel: [ 688.573543] async_raid6_recov
> async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1
> raid0 multipath linear iommu_v2 gpu_sched drm_ttm_helper mgag200 ttm
> drm_shmem_helper drm_kms_helper syscopyarea hid_generic
> crct10dif_pclmul crc32_pclmul sysfillrect ghash_clmulni_intel
> sysimgblt fb_sys_fops uas usbhid aesni_intel crypto_simd igb ahci hid
> drm usb_storage cryptd libahci dca megaraid_sas i2c_algo_bit wmi [last
> unloaded: amdgpu]
> May 11 10:25:16 NETSYS26 kernel: [ 688.611083] CR2: 000000020000006e
> May 11 10:25:16 NETSYS26 kernel: [ 688.614454] ---[ end trace
> 349cf28efb6268bc ]---
>
> Looking forward to the comments.
>
> Regards,
> Shuotao
>
>>
>> Regards,
>> Felix
>>
>>
>>> }
>>>
>>> Regards,
>>> Shuotao
>>>>
>>>> Andrey
>>>>
>>>>
>>>>>
>>>>> Regards,
>>>>> Shuotao
>>>>>
>>>>> [ +0.001645] BUG: unable to handle page fault for address:
>>>>> 0000000000058a68
>>>>> [ +0.001298] #PF: supervisor read access in kernel mode
>>>>> [ +0.001252] #PF: error_code(0x0000) - not-present page
>>>>> [ +0.001248] PGD 8000000115806067 P4D 8000000115806067 PUD
>>>>> 109b2d067 PMD 0
>>>>> [ +0.001270] Oops: 0000 [#1] PREEMPT SMP PTI
>>>>> [ +0.001256] CPU: 5 PID: 13818 Comm: tf_cnn_benchmar Tainted: G
>>>>> W E 5.16.0+ #3
>>>>> [ +0.001290] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS
>>>>> 1.5.4 [FPGA Test BIOS] 10/002/2015
>>>>> [ +0.001309] RIP: 0010:amdgpu_device_rreg.part.24+0xa9/0xe0 [amdgpu]
>>>>> [ +0.001562] Code: e8 8c 7d 02 00 65 ff 0d 65 e0 7f 3f 75 ae 0f 1f
>>>>> 44 00 00 eb a7 83 e2 02 75 09 f6 87 10 69 01 00 10 75 0d 4c 03 a3 a0
>>>>> 09 00 00 <45> 8b 24 24 eb 8a 4c 8d b7 b0 6b 01 00 4c 89 f7 e8 a2 4c
>>>>> 2e ca 85
>>>>> [ +0.002751] RSP: 0018:ffffb58fac313928 EFLAGS: 00010202
>>>>> [ +0.001388] RAX: ffffffffc09a4270 RBX: ffff8b0c9c840000 RCX:
>>>>> 00000000ffffffff
>>>>> [ +0.001402] RDX: 0000000000000000 RSI: 000000000001629a RDI:
>>>>> ffff8b0c9c840000
>>>>> [ +0.001418] RBP: ffffb58fac313948 R08: 0000000000000021 R09:
>>>>> 0000000000000001
>>>>> [ +0.001421] R10: ffffb58fac313b30 R11: ffffffff8c065b00 R12:
>>>>> 0000000000058a68
>>>>> [ +0.001400] R13: 000000000001629a R14: 0000000000000000 R15:
>>>>> 000000000001629a
>>>>> [ +0.001397] FS: 0000000000000000(0000) GS:ffff8b4b7fa80000(0000)
>>>>> knlGS:0000000000000000
>>>>> [ +0.001411] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>> [ +0.001405] CR2: 0000000000058a68 CR3: 000000010a2c8001 CR4:
>>>>> 00000000001706e0
>>>>> [ +0.001422] Call Trace:
>>>>> [ +0.001407] <TASK>
>>>>> [ +0.001391] amdgpu_device_rreg+0x17/0x20 [amdgpu]
>>>>> [ +0.001614] amdgpu_cgs_read_register+0x14/0x20 [amdgpu]
>>>>> [ +0.001735] phm_wait_for_register_unequal.part.1+0x58/0x90 [amdgpu]
>>>>> [ +0.001790] phm_wait_for_register_unequal+0x1a/0x30 [amdgpu]
>>>>> [ +0.001800] vega20_wait_for_response+0x28/0x80 [amdgpu]
>>>>> [ +0.001757] vega20_send_msg_to_smc_with_parameter+0x21/0x110
>>>>> [amdgpu]
>>>>> [ +0.001838] smum_send_msg_to_smc_with_parameter+0xcd/0x100 [amdgpu]
>>>>> [ +0.001829] ? kvfree+0x1e/0x30
>>>>> [ +0.001462] vega20_set_power_profile_mode+0x58/0x330 [amdgpu]
>>>>> [ +0.001868] ? kvfree+0x1e/0x30
>>>>> [ +0.001462] ? ttm_bo_release+0x261/0x370 [ttm]
>>>>> [ +0.001467] pp_dpm_switch_power_profile+0xc2/0x170 [amdgpu]
>>>>> [ +0.001863] amdgpu_dpm_switch_power_profile+0x6b/0x90 [amdgpu]
>>>>> [ +0.001866] amdgpu_amdkfd_set_compute_idle+0x1a/0x20 [amdgpu]
>>>>> [ +0.001784] kfd_dec_compute_active+0x2c/0x50 [amdgpu]
>>>>> [ +0.001744] process_termination_cpsch+0x2f9/0x3a0 [amdgpu]
>>>>> [ +0.001728] kfd_process_dequeue_from_all_devices+0x49/0x70 [amdgpu]
>>>>> [ +0.001730] kfd_process_notifier_release+0x91/0xe0 [amdgpu]
>>>>> [ +0.001718] __mmu_notifier_release+0x77/0x1f0
>>>>> [ +0.001411] exit_mmap+0x1b5/0x200
>>>>> [ +0.001396] ? __switch_to+0x12d/0x3e0
>>>>> [ +0.001388] ? __switch_to_asm+0x36/0x70
>>>>> [ +0.001372] ? preempt_count_add+0x74/0xc0
>>>>> [ +0.001364] mmput+0x57/0x110
>>>>> [ +0.001349] do_exit+0x33d/0xc20
>>>>> [ +0.001337] ? _raw_spin_unlock+0x1a/0x30
>>>>> [ +0.001346] do_group_exit+0x43/0xa0
>>>>> [ +0.001341] get_signal+0x131/0x920
>>>>> [ +0.001295] arch_do_signal_or_restart+0xb1/0x870
>>>>> [ +0.001303] ? do_futex+0x125/0x190
>>>>> [ +0.001285] exit_to_user_mode_prepare+0xb1/0x1c0
>>>>> [ +0.001282] syscall_exit_to_user_mode+0x2a/0x40
>>>>> [ +0.001264] do_syscall_64+0x46/0xb0
>>>>> [ +0.001236] entry_SYSCALL_64_after_hwframe+0x44/0xae
>>>>> [ +0.001219] RIP: 0033:0x7f6aff1d2ad3
>>>>> [ +0.001177] Code: Unable to access opcode bytes at RIP
>>>>> 0x7f6aff1d2aa9.
>>>>> [ +0.001166] RSP: 002b:00007f6ab2029d20 EFLAGS: 00000246 ORIG_RAX:
>>>>> 00000000000000ca
>>>>> [ +0.001170] RAX: fffffffffffffe00 RBX: 0000000004f542b0 RCX:
>>>>> 00007f6aff1d2ad3
>>>>> [ +0.001168] RDX: 0000000000000000 RSI: 0000000000000080 RDI:
>>>>> 0000000004f542d8
>>>>> [ +0.001162] RBP: 0000000004f542d4 R08: 0000000000000000 R09:
>>>>> 0000000000000000
>>>>> [ +0.001152] R10: 0000000000000000 R11: 0000000000000246 R12:
>>>>> 0000000004f542d8
>>>>> [ +0.001176] R13: 0000000000000000 R14: 0000000004f54288 R15:
>>>>> 0000000000000000
>>>>> [ +0.001152] </TASK>
>>>>> [ +0.001113] Modules linked in: veth amdgpu(E) nf_conntrack_netlink
>>>>> nfnetlink xfrm_user xt_addrtype br_netfilter xt_CHECKSUM
>>>>> iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack
>>>>> nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4
>>>>> xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter
>>>>> ip6_tables iptable_filter overlay esp6_offload esp6 esp4_offload
>>>>> esp4 xfrm_algo intel_rapl_msr intel_rapl_common sb_edac
>>>>> x86_pkg_temp_thermal intel_powerclamp snd_hda_codec_hdmi
>>>>> snd_hda_intel ipmi_ssif snd_intel_dspcfg coretemp snd_hda_codec
>>>>> kvm_intel snd_hda_core snd_hwdep snd_pcm snd_timer snd kvm soundcore
>>>>> irqbypass ftdi_sio usbserial input_leds iTCO_wdt iTCO_vendor_support
>>>>> joydev mei_me rapl lpc_ich intel_cstate mei ipmi_si ipmi_devintf
>>>>> ipmi_msghandler mac_hid acpi_power_meter sch_fq_codel ib_iser
>>>>> rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi
>>>>> scsi_transport_iscsi ip_tables x_tables autofs4 btrfs
>>>>> blake2b_generic zstd_compress raid10 raid456
>>>>> [ +0.000102] async_raid6_recov async_memcpy async_pq async_xor
>>>>> async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear
>>>>> iommu_v2 gpu_sched drm_ttm_helper mgag200 ttm drm_shmem_helper
>>>>> drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops
>>>>> crct10dif_pclmul hid_generic crc32_pclmul ghash_clmulni_intel usbhid
>>>>> uas aesni_intel crypto_simd igb ahci hid drm usb_storage cryptd
>>>>> libahci dca megaraid_sas i2c_algo_bit wmi [last unloaded: amdgpu]
>>>>> [ +0.016626] CR2: 0000000000058a68
>>>>> [ +0.001550] ---[ end trace ff90849fe0a8b3b4 ]---
>>>>> [ +0.024953] RIP: 0010:amdgpu_device_rreg.part.24+0xa9/0xe0 [amdgpu]
>>>>> [ +0.001814] Code: e8 8c 7d 02 00 65 ff 0d 65 e0 7f 3f 75 ae 0f 1f
>>>>> 44 00 00 eb a7 83 e2 02 75 09 f6 87 10 69 01 00 10 75 0d 4c 03 a3 a0
>>>>> 09 00 00 <45> 8b 24 24 eb 8a 4c 8d b7 b0 6b 01 00 4c 89 f7 e8 a2 4c
>>>>> 2e ca 85
>>>>> [ +0.003255] RSP: 0018:ffffb58fac313928 EFLAGS: 00010202
>>>>> [ +0.001641] RAX: ffffffffc09a4270 RBX: ffff8b0c9c840000 RCX:
>>>>> 00000000ffffffff
>>>>> [ +0.001656] RDX: 0000000000000000 RSI: 000000000001629a RDI:
>>>>> ffff8b0c9c840000
>>>>> [ +0.001681] RBP: ffffb58fac313948 R08: 0000000000000021 R09:
>>>>> 0000000000000001
>>>>> [ +0.001662] R10: ffffb58fac313b30 R11: ffffffff8c065b00 R12:
>>>>> 0000000000058a68
>>>>> [ +0.001650] R13: 000000000001629a R14: 0000000000000000 R15:
>>>>> 000000000001629a
>>>>> [ +0.001648] FS: 0000000000000000(0000) GS:ffff8b4b7fa80000(0000)
>>>>> knlGS:0000000000000000
>>>>> [ +0.001668] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>> [ +0.001673] CR2: 0000000000058a68 CR3: 000000010a2c8001 CR4:
>>>>> 00000000001706e0
>>>>> [ +0.001740] Fixing recursive fault but reboot is needed!
>>>>>
>>>>>
>>>>>> On Apr 21, 2022, at 2:41 AM, Andrey Grodzovsky
>>>>>> <andrey.grodzovsky at amd.com> wrote:
>>>>>>
>>>>>> I retested the hot plug tests at the commit I mentioned below -
>>>>>> looks OK; my ASIC is Navi 10. I also tested using Vega 10 and older
>>>>>> Polaris ASICs (whatever I had at home at the time). It's possible
>>>>>> there are extra issues in ASICs like yours which I didn't cover
>>>>>> during tests.
>>>>>>
>>>>>> andrey at andrey-test:~/drm$ sudo ./build/tests/amdgpu/amdgpu_test -s 13
>>>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>>
>>>>>>
>>>>>> The ASIC NOT support UVD, suite disabled
>>>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>>
>>>>>>
>>>>>> The ASIC NOT support VCE, suite disabled
>>>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>>
>>>>>>
>>>>>> The ASIC NOT support UVD ENC, suite disabled.
>>>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>>
>>>>>>
>>>>>> Don't support TMZ (trust memory zone), security suite disabled
>>>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>> Peer device is not opened or has ASIC not supported by the suite,
>>>>>> skip all Peer to Peer tests.
>>>>>>
>>>>>>
>>>>>> CUnit - A unit testing framework for C - Version 2.1-3
>>>>>> http://cunit.sourceforge.net/
>>>>>>
>>>>>>
>>>>>> Suite: Hotunplug Tests
>>>>>> Test: Unplug card and rescan the bus to plug it back
>>>>>> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>> passed
>>>>>> Test: Same as first test but with command submission
>>>>>> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>> passed
>>>>>> Test: Unplug with exported bo
>>>>>> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>> passed
>>>>>>
>>>>>> Run Summary: Type Total Ran Passed Failed Inactive
>>>>>> suites 14 1 n/a 0 0
>>>>>> tests 71 3 3 0 1
>>>>>> asserts 21 21 21 0 n/a
>>>>>>
>>>>>> Elapsed time = 9.195 seconds
>>>>>>
>>>>>>
>>>>>> Andrey
>>>>>>
>>>>>> On 2022-04-20 11:44, Andrey Grodzovsky wrote:
>>>>>>>
>>>>>>> The only one in Radeon 7 I see is the same sysfs crash we already
>>>>>>> fixed so you can use the same fix. The MI 200 issue i haven't seen
>>>>>>> yet but I also haven't tested MI200 so never saw it before. Need
>>>>>>> to test when i get the time.
>>>>>>>
>>>>>>> So try that fix with Radeon 7 again to see if you pass the tests
>>>>>>> (the warnings should all be minor issues).
>>>>>>>
>>>>>>> Andrey
>>>>>>>
>>>>>>>
>>>>>>> On 2022-04-20 05:24, Shuotao Xu wrote:
>>>>>>>>>
>>>>>>>>> That's a problem. The latest working baseline I tested and
>>>>>>>>> confirmed passing hotplug tests is this branch and commit
>>>>>>>>> https://gitlab.freedesktop.org/agd5f/linux/-/commit/86e12a53b73135806e101142e72f3f1c0e6fa8e6
>>>>>>>>> which is amd-staging-drm-next. 5.14 was the branch we upstreamed
>>>>>>>>> the hotplug code on, but it had a lot of regressions over time
>>>>>>>>> due to new changes (that's why I added the hotplug test, to try
>>>>>>>>> and catch them early). It would be best to run this branch on
>>>>>>>>> MI-100 so we have a clean baseline, and only after confirming
>>>>>>>>> this particular branch at this commit passes the libdrm tests,
>>>>>>>>> then start adding the KFD-specific addons. Another option, if you
>>>>>>>>> can't work with MI-100 and this branch, is to try a different
>>>>>>>>> ASIC that does work with this branch (if possible).
>>>>>>>>>
>>>>>>>>> Andrey
>>>>>>>>>
>>>>>>>> OK, I tried both this commit and the HEAD of amd-staging-drm-next
>>>>>>>> on two GPUs (MI100 and Radeon VII); both did not pass the hot
>>>>>>>> plugout libdrm test. I might be able to gain access to MI200, but
>>>>>>>> I suspect it would work.
>>>>>>>>
>>>>>>>> I copied the complete dmesgs as follows. I highlighted the OOPSES
>>>>>>>> for you.
>>>>>>>>
>>>>>>>> Radeon VII:
>