[EXTERNAL] [PATCH 2/2] drm/amdkfd: Add PCIe Hotplug Support for AMDKFD
Andrey Grodzovsky
andrey.grodzovsky at amd.com
Wed Apr 13 17:31:31 UTC 2022
On 2022-04-13 12:03, Shuotao Xu wrote:
>
>
>> On Apr 11, 2022, at 11:52 PM, Andrey Grodzovsky
>> <andrey.grodzovsky at amd.com> wrote:
>>
>> On 2022-04-08 21:28, Shuotao Xu wrote:
>>>
>>>> On Apr 8, 2022, at 11:28 PM, Andrey Grodzovsky
>>>> <andrey.grodzovsky at amd.com> wrote:
>>>>
>>>> On 2022-04-08 04:45, Shuotao Xu wrote:
>>>>> Adding PCIe hotplug support for AMDKFD: support for hot-plug of GPU
>>>>> devices can open doors for many advanced applications in data centers
>>>>> in the next few years, such as GPU resource disaggregation. The current
>>>>> AMDKFD does not support hotplug-out because of the following reasons:
>>>>>
>>>>> 1. During PCIe removal, decrement KFD lock which was incremented at
>>>>> the beginning of hw fini; otherwise kfd_open later is going to
>>>>> fail.
>>>> I assumed you read my comment last time, but you still take the same
>>>> approach. More details below.
>>> Aha, I like your fix :) I was not familiar with the drm APIs, so I only
>>> half understood your comment last time.
>>>
>>> BTW, I tried hot-plugging out a GPU while a ROCm application was still
>>> running. From dmesg, the application was still trying to access the
>>> removed kfd device, and was met with some errors.
>>
>>
>> The application is supposed to keep running; it holds the drm_device
>> reference as long as it has an open FD to the device, and final cleanup
>> will come only after the app dies, thus releasing the FD and the last
>> drm_device reference.
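Concretely, the lifetime pattern is roughly the following (a simplified
sketch of the refcounting idea, not the exact drm_file.c code paths):

	/* on open() of the device node, the file takes a reference on
	 * the drm_device, keeping the SW state alive even after the
	 * physical device is removed */
	drm_dev_get(dev);

	/* on the final close(), the last reference is dropped and the
	 * drm_device can actually be freed */
	drm_dev_put(dev);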
>>
>>> The application would hang and not exit in this case.
>>
>
> Actually I tried kill -7 $pid, and the process exits. The dmesg has
> some warnings though.
>
> [ 711.769977] WARNING: CPU: 23 PID: 344 at
> .../amdgpu-rocm5.0.2/src/amd/amdgpu/amdgpu_object.c:1336
> amdgpu_bo_release_notify+0x150/0x160 [amdgpu]
> [ 711.770528] Modules linked in: amdgpu(OE) amdttm(OE) amd_sched(OE)
> amdkcl(OE) iommu_v2 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo
> xt_addrtype br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE
> iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6
> nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc
> ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter
> overlay binfmt_misc intel_rapl_msr i40iw intel_rapl_common skx_edac
> nfit x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel rpcrdma
> kvm sunrpc ipmi_ssif ib_umad ib_ipoib rdma_ucm irqbypass rapl joydev
> acpi_ipmi input_leds intel_cstate ipmi_si ipmi_devintf mei_me mei
> intel_pch_thermal ipmi_msghandler ioatdma mac_hid lpc_ich dca
> acpi_power_meter acpi_pad sch_fq_codel ib_iser rdma_cm iw_cm ib_cm
> iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi pci_stub
> ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10
> raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor
> [ 711.779359] raid6_pq libcrc32c raid1 raid0 multipath linear
> mlx5_ib ib_uverbs ib_core ast drm_vram_helper i2c_algo_bit
> drm_ttm_helper ttm drm_kms_helper syscopyarea crct10dif_pclmul
> crc32_pclmul ghash_clmulni_intel sysfillrect uas hid_generic sysimgblt
> aesni_intel mlx5_core fb_sys_fops crypto_simd usbhid cryptd drm i40e
> pci_hyperv_intf usb_storage glue_helper mlxfw hid ahci libahci wmi
> [ 711.779752] CPU: 23 PID: 344 Comm: kworker/23:1 Tainted: G W
> OE 5.11.0+ #1
> [ 711.779755] Hardware name: Supermicro
> SYS-4029GP-TRT2/X11DPG-OT-CPU, BIOS 2.1 08/14/2018
> [ 711.779756] Workqueue: kfd_process_wq kfd_process_wq_release [amdgpu]
> [ 711.779955] RIP: 0010:amdgpu_bo_release_notify+0x150/0x160 [amdgpu]
> [ 711.780141] Code: e8 b5 af 34 f4 e9 1f ff ff ff 48 39 c2 74 07 0f
> 0b e9 69 ff ff ff 4c 89 e7 e8 3c b4 16 00 e9 5c ff ff ff e8 a2 ce fd
> f3 eb cf <0f> 0b eb cb e8 d7 02 34 f4 0f 1f 80 00 00 00 00 0f 1f 44 00
> 00 55
> [ 711.780143] RSP: 0018:ffffa8100dd67c30 EFLAGS: 00010282
> [ 711.780145] RAX: 00000000ffffffea RBX: ffff89980e792058 RCX:
> 0000000000000000
> [ 711.780147] RDX: 0000000000000000 RSI: ffff89a8f9ad8870 RDI:
> ffff89a8f9ad8870
> [ 711.780148] RBP: ffffa8100dd67c50 R08: 0000000000000000 R09:
> fffffffffff99b18
> [ 711.780149] R10: ffffa8100dd67bd0 R11: ffffa8100dd67908 R12:
> ffff89980e792000
> [ 711.780151] R13: ffff89980e792058 R14: ffff89980e7921bc R15:
> dead000000000100
> [ 711.780152] FS: 0000000000000000(0000) GS:ffff89a8f9ac0000(0000)
> knlGS:0000000000000000
> [ 711.780154] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 711.780156] CR2: 00007ffddac6f71f CR3: 00000030bb80a003 CR4:
> 00000000007706e0
> [ 711.780157] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [ 711.780159] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> 0000000000000400
> [ 711.780160] PKRU: 55555554
> [ 711.780161] Call Trace:
> [ 711.780163] ttm_bo_release+0x2ae/0x320 [amdttm]
> [ 711.780169] amdttm_bo_put+0x30/0x40 [amdttm]
> [ 711.780357] amdgpu_bo_unref+0x1e/0x30 [amdgpu]
> [ 711.780543] amdgpu_gem_object_free+0x8b/0x160 [amdgpu]
> [ 711.781119] drm_gem_object_free+0x1d/0x30 [drm]
> [ 711.781489] amdgpu_amdkfd_gpuvm_free_memory_of_gpu+0x34a/0x380
> [amdgpu]
> [ 711.782044] kfd_process_device_free_bos+0xe0/0x130 [amdgpu]
> [ 711.782611] kfd_process_wq_release+0x286/0x380 [amdgpu]
> [ 711.783172] process_one_work+0x236/0x420
> [ 711.783543] worker_thread+0x34/0x400
> [ 711.783911] ? process_one_work+0x420/0x420
> [ 711.784279] kthread+0x126/0x140
> [ 711.784653] ? kthread_park+0x90/0x90
> [ 711.785018] ret_from_fork+0x22/0x30
> [ 711.785387] ---[ end trace d8f50f6594817c84 ]---
> [ 711.798716] [drm] amdgpu: ttm finalized
So it means the process was stuck in some wait_event_killable (maybe
here in drm_sched_entity_flush) - you can try 'cat /proc/$process_pid/stack'
before you kill it to see where it was stuck, so we can go from there.
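For reference, that kind of wait looks roughly like this (a generic
sketch of the wait_event_killable() pattern, not the exact scheduler
code):

	/* Sleeps until cond becomes true; only a fatal signal (e.g. the
	 * SIGBUS from kill -7) can interrupt it, making it return
	 * -ERESTARTSYS. If the unplugged device never makes cond true,
	 * the task sits here until killed - matching the hang above. */
	if (wait_event_killable(wq, cond))
		return -ERESTARTSYS;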
>
>>
>> For graphics apps what I usually see is a crash because of a SIGSEGV when
>> the app tries to access an unmapped MMIO region on the device. I haven't
>> tested the compute stack, so there might be something I haven't covered.
>> A hang could mean, for example, waiting on a fence which is not being
>> signaled - please provide the full dmesg from this case.
>>
>>>
>>> Do you have any good suggestions on how to fix it down the line?
>>> (HIP runtime/libhsakmt or driver)
>>>
>>> [64036.631333] amdgpu: amdgpu_vm_bo_update failed
>>> [64036.631702] amdgpu: validate_invalid_user_pages: update PTE failed
>>> [64036.640754] amdgpu: amdgpu_vm_bo_update failed
>>> [64036.641120] amdgpu: validate_invalid_user_pages: update PTE failed
>>> [64036.650394] amdgpu: amdgpu_vm_bo_update failed
>>> [64036.650765] amdgpu: validate_invalid_user_pages: update PTE failed
>>
>
> The full dmesg is just a repetition of those two messages:
> [186885.764079] amdgpu 0000:43:00.0: amdgpu: amdgpu: finishing device.
> [186885.766916] [drm] free PSP TMR buffer
> [186893.868173] amdgpu: amdgpu_vm_bo_update failed
> [186893.868235] amdgpu: validate_invalid_user_pages: update PTE failed
> [186893.876154] amdgpu: amdgpu_vm_bo_update failed
> [186893.876190] amdgpu: validate_invalid_user_pages: update PTE failed
> [186893.884150] amdgpu: amdgpu_vm_bo_update failed
> [186893.884185] amdgpu: validate_invalid_user_pages: update PTE failed
>
>>
>> This just probably means trying to update PTEs after the physical device
>> is gone - we usually avoid this by
>> first trying to do all HW shutdowns early before PCI remove completion
>> but when it's really tricky by
>> protecting HW access sections with drm_dev_enter/exit scope.
>>
>> For this particular error it would be the best to flush
>> info->restore_userptr_work before the end of
>> amdgpu_pci_remove (rejecting new process creation and calling
>> cancel_delayed_work_sync(&process_info->restore_userptr_work) for all
>> running processes)
>> somewhere in amdgpu_pci_remove.
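The drm_dev_enter/exit pattern looks roughly like this (a minimal
sketch - amdgpu_do_hw_access() here stands in for any MMIO/register
access, it is not a real function):

	int idx;

	if (drm_dev_enter(adev_to_drm(adev), &idx)) {
		/* device still present - safe to touch the HW */
		amdgpu_do_hw_access(adev);
		drm_dev_exit(idx);
	} else {
		/* device unplugged - skip the HW access entirely */
	}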
>>
> I tried something like *kfd_process_ref_release* which I think did
> what you described, but it did not work.
I don't see how kfd_process_ref_release is the same as what I mentioned
above; what I meant is calling the code above within kgd2kfd_suspend
(where you tried to call kfd_kill_all_user_processes below).
>
> Instead I tried to kill the process from the kernel, but amdgpu could
> **only** be hot-plugged back in successfully if there was no ROCm
> kernel running when it was plugged out. If not, amdgpu_probe will just
> hang later. (Maybe because amdgpu was plugged out while in a running
> state, it leaves bad HW state behind which causes probe to hang.)
We usually do an asic_reset during probe to reset all HW state (check
whether amdgpu_device_init->amdgpu_asic_reset runs when you plug back in).
>
> I don’t know if this is a viable solution worth pursuing, but I
> attached the diff anyway.
>
> Another solution could be to let the compute stack's user mode detect a
> topology change via a generation_count change, and abort gracefully there.
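That user-mode detection could look roughly like this sketch (the sysfs
path is an assumption based on KFD's topology layout, and
read_generation_id() is a hypothetical helper, not an existing libhsakmt
API) - snapshot the generation at init and abort if it ever changes:

	#include <stdio.h>

	static long read_generation_id(void)
	{
		FILE *f = fopen("/sys/devices/virtual/kfd/kfd/topology/generation_id", "r");
		long gen = -1;

		if (!f)
			return -1;	/* kfd not present */
		if (fscanf(f, "%ld", &gen) != 1)
			gen = -1;
		fclose(f);
		return gen;
	}

	/* at init: saved_gen = read_generation_id();
	 * before submissions: if (read_generation_id() != saved_gen),
	 * tear down gracefully instead of using the stale topology */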
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> index 4e7d9cb09a69..79b4c9b84cd0 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> @@ -697,12 +697,15 @@ void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm, bool force)
>  		return;
>
>  	/* for runtime suspend, skip locking kfd */
> -	if (!run_pm) {
> +	if (!run_pm && !drm_dev_is_unplugged(kfd->ddev)) {
>  		/* For first KFD device suspend all the KFD processes */
>  		if (atomic_inc_return(&kfd_locked) == 1)
>  			kfd_suspend_all_processes(force);
>  	}
>
> +	if (drm_dev_is_unplugged(kfd->ddev))
> +		kfd_kill_all_user_processes();
> +
>  	kfd->dqm->ops.stop(kfd->dqm);
>  	kfd_iommu_suspend(kfd);
>  }
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> index 55c9e1922714..84cbcd857856 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> @@ -1065,6 +1065,7 @@ void kfd_unref_process(struct kfd_process *p);
>  int kfd_process_evict_queues(struct kfd_process *p, bool force);
>  int kfd_process_restore_queues(struct kfd_process *p);
>  void kfd_suspend_all_processes(bool force);
> +void kfd_kill_all_user_processes(void);
>  /*
>   * kfd_resume_all_processes:
>   *	bool sync: If kfd_resume_all_processes() should wait for the
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> index 6cdc855abb6d..fb0c753b682c 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> @@ -2206,6 +2206,24 @@ void kfd_suspend_all_processes(bool force)
>  	srcu_read_unlock(&kfd_processes_srcu, idx);
>  }
>
> +
> +void kfd_kill_all_user_processes(void)
> +{
> +	struct kfd_process *p;
> +	struct amdkfd_process_info *p_info;
> +	unsigned int temp;
> +	int idx = srcu_read_lock(&kfd_processes_srcu);
> +
> +	pr_info("Killing all processes\n");
> +	hash_for_each_rcu(kfd_processes_table, temp, p, kfd_processes) {
> +		p_info = p->kgd_process_info;
> +		pr_info("Killing processes, pid = %d", pid_nr(p_info->pid));
> +		kill_pid(p_info->pid, SIGBUS, 1);
From looking into kill_pid I see it only sends a signal but doesn't
wait for completion, so it would make sense to wait for completion here.
In any case, I would actually try to put

	hash_for_each_rcu(p_info)
		cancel_delayed_work_sync(&p_info->restore_userptr_work)

here instead - at least that's what I meant in the previous mail.
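Spelled out, a minimal sketch of what I mean (reusing the SRCU iteration
from kfd_suspend_all_processes; the function name is made up, the point
is only the per-process cancel_delayed_work_sync() call):

	void kfd_cancel_restore_userptr_work(void)
	{
		struct kfd_process *p;
		struct amdkfd_process_info *p_info;
		unsigned int temp;
		int idx = srcu_read_lock(&kfd_processes_srcu);

		hash_for_each_rcu(kfd_processes_table, temp, p, kfd_processes) {
			p_info = p->kgd_process_info;
			/* waits for any queued/running restore work, so
			 * nothing touches the HW after removal */
			cancel_delayed_work_sync(&p_info->restore_userptr_work);
		}
		srcu_read_unlock(&kfd_processes_srcu, idx);
	}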
Andrey
> +	}
> +	srcu_read_unlock(&kfd_processes_srcu, idx);
> +}
> +
> +
>  int kfd_resume_all_processes(bool sync)
>  {
>  	struct kfd_process *p;
>
>
>> Andrey
>>
>>
>>>
>>> Really appreciate your help!
>>>
>>> Best,
>>> Shuotao
>>>
>>>>> 2. Remove redundant p2p/io links in sysfs when a device is hotplugged
>>>>>    out.
>>>>>
>>>>> 3. A new kfd node_id is not properly assigned after a new device is
>>>>>    added following a GPU hotplug-out in a system. libhsakmt will find
>>>>>    this anomaly (i.e. node_from != <dev node id> in iolinks) when
>>>>>    taking a topology_snapshot, and thus returns a fault to the ROCm
>>>>>    stack.
>>>>>
>>>>> -- This patch fixes issue 1; another patch by Mukul fixes issues 2 & 3.
>>>>> -- Tested on a 4-GPU MI100 node with kernel 5.13.0-kfd; kernel
>>>>>    5.16.0-kfd is unstable out of the box for MI100.
>>>>> ---
>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c |  5 +++++
>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |  7 +++++++
>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  1 +
>>>>>   drivers/gpu/drm/amd/amdkfd/kfd_device.c    | 13 +++++++++++++
>>>>>   4 files changed, 26 insertions(+)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>>> index c18c4be1e4ac..d50011bdb5c4 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>>> @@ -213,6 +213,11 @@ int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm)
>>>>>  	return r;
>>>>>  }
>>>>>
>>>>> +int amdgpu_amdkfd_resume_processes(void)
>>>>> +{
>>>>> +	return kgd2kfd_resume_processes();
>>>>> +}
>>>>> +
>>>>>  int amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev)
>>>>>  {
>>>>>  	int r = 0;
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>>>> index f8b9f27adcf5..803306e011c3 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>>>> @@ -140,6 +140,7 @@ void amdgpu_amdkfd_fini(void);
>>>>>  void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm);
>>>>>  int amdgpu_amdkfd_resume_iommu(struct amdgpu_device *adev);
>>>>>  int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm);
>>>>> +int amdgpu_amdkfd_resume_processes(void);
>>>>>  void amdgpu_amdkfd_interrupt(struct amdgpu_device *adev,
>>>>>  		const void *ih_ring_entry);
>>>>>  void amdgpu_amdkfd_device_probe(struct amdgpu_device *adev);
>>>>> @@ -347,6 +348,7 @@ void kgd2kfd_device_exit(struct kfd_dev *kfd);
>>>>>  void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm);
>>>>>  int kgd2kfd_resume_iommu(struct kfd_dev *kfd);
>>>>>  int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm);
>>>>> +int kgd2kfd_resume_processes(void);
>>>>>  int kgd2kfd_pre_reset(struct kfd_dev *kfd);
>>>>>  int kgd2kfd_post_reset(struct kfd_dev *kfd);
>>>>>  void kgd2kfd_interrupt(struct kfd_dev *kfd, const void *ih_ring_entry);
>>>>> @@ -393,6 +395,11 @@ static inline int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
>>>>>  	return 0;
>>>>>  }
>>>>>
>>>>> +static inline int kgd2kfd_resume_processes(void)
>>>>> +{
>>>>> +	return 0;
>>>>> +}
>>>>> +
>>>>>  static inline int kgd2kfd_pre_reset(struct kfd_dev *kfd)
>>>>>  {
>>>>>  	return 0;
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> index fa4a9f13c922..5827b65b7489 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> @@ -4004,6 +4004,7 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
>>>>>  	if (drm_dev_is_unplugged(adev_to_drm(adev)))
>>>>>  		amdgpu_device_unmap_mmio(adev);
>>>>>
>>>>> +	amdgpu_amdkfd_resume_processes();
>>>>>  }
>>>>>
>>>>>  void amdgpu_device_fini_sw(struct amdgpu_device *adev)
>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>> index 62aa6c9d5123..ef05aae9255e 100644
>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>> @@ -714,6 +714,19 @@ int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
>>>>>  	return ret;
>>>>>  }
>>>>>
>>>>> +/* for non-runtime resume only */
>>>>> +int kgd2kfd_resume_processes(void)
>>>>> +{
>>>>> +	int count;
>>>>> +
>>>>> +	count = atomic_dec_return(&kfd_locked);
>>>>> +	WARN_ONCE(count < 0, "KFD suspend / resume ref. error");
>>>>> +	if (count == 0)
>>>>> +		return kfd_resume_all_processes();
>>>>> +
>>>>> +	return 0;
>>>>> +}
>>>>
>>>> It doesn't make sense to me to just increment kfd_locked in
>>>> kgd2kfd_suspend only to decrement it again a few functions down the
>>>> road.
>>>>
>>>> I suggest this instead - you only increment if not during PCI remove:
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>> index 1c2cf3a33c1f..7754f77248a4 100644
>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>> @@ -952,11 +952,12 @@ bool kfd_is_locked(void)
>>>>
>>>>  void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
>>>>  {
>>>> +
>>>>  	if (!kfd->init_complete)
>>>>  		return;
>>>>
>>>>  	/* for runtime suspend, skip locking kfd */
>>>> -	if (!run_pm) {
>>>> +	if (!run_pm && !drm_dev_is_unplugged(kfd->ddev)) {
>>>>  		/* For first KFD device suspend all the KFD processes */
>>>>  		if (atomic_inc_return(&kfd_locked) == 1)
>>>>  			kfd_suspend_all_processes();
>>>>
>>>>
>>>> Andrey
>>>>
>>>>
>>>>
>>>>> +
>>>>>  int kgd2kfd_resume_iommu(struct kfd_dev *kfd)
>>>>>  {
>>>>>  	int err = 0;
>