[EXTERNAL] [PATCH 2/2] drm/amdkfd: Add PCIe Hotplug Support for AMDKFD
Andrey Grodzovsky
andrey.grodzovsky at amd.com
Mon Apr 18 15:23:21 UTC 2022
On 2022-04-18 09:22, Shuotao Xu wrote:
>
>
>> On Apr 16, 2022, at 12:43 AM, Andrey Grodzovsky
>> <andrey.grodzovsky at amd.com> wrote:
>>
>>
>> On 2022-04-15 06:12, Shuotao Xu wrote:
>>> Hi Andrey,
>>>
>>> First I really appreciate the discussion! It helped me understand
>>> the driver code greatly. Thank you so much:)
>>> Please see my inline comments.
>>>
>>>> On Apr 14, 2022, at 11:13 PM, Andrey Grodzovsky
>>>> <andrey.grodzovsky at amd.com> wrote:
>>>>
>>>>
>>>> On 2022-04-14 10:00, Shuotao Xu wrote:
>>>>>
>>>>>
>>>>>> On Apr 14, 2022, at 1:31 AM, Andrey Grodzovsky
>>>>>> <andrey.grodzovsky at amd.com> wrote:
>>>>>>
>>>>>>
>>>>>> On 2022-04-13 12:03, Shuotao Xu wrote:
>>>>>>>
>>>>>>>
>>>>>>>> On Apr 11, 2022, at 11:52 PM, Andrey Grodzovsky
>>>>>>>> <andrey.grodzovsky at amd.com> wrote:
>>>>>>>>
>>>>>>>> On 2022-04-08 21:28, Shuotao Xu wrote:
>>>>>>>>>
>>>>>>>>>> On Apr 8, 2022, at 11:28 PM, Andrey Grodzovsky
>>>>>>>>>> <andrey.grodzovsky at amd.com> wrote:
>>>>>>>>>>
>>>>>>>>>> On 2022-04-08 04:45, Shuotao Xu wrote:
>>>>>>>>>>> Adding PCIe Hotplug Support for AMDKFD: support for
>>>>>>>>>>> hot-plug of GPU devices can open doors for many advanced
>>>>>>>>>>> applications in the data center in the next few years, such
>>>>>>>>>>> as GPU resource disaggregation. The current AMDKFD does not
>>>>>>>>>>> support hotplug-out because of the following reasons:
>>>>>>>>>>>
>>>>>>>>>>> 1. During PCIe removal, decrement the KFD lock which was
>>>>>>>>>>> incremented at the beginning of hw fini; otherwise kfd_open
>>>>>>>>>>> later is going to fail.
>>>>>>>>>> I assumed you read my comment last time; still you take the
>>>>>>>>>> same approach. More details below.
>>>>>>>>> Aha, I like your fix:) I was not familiar with the drm APIs
>>>>>>>>> so I only half understood your comment last time.
>>>>>>>>>
>>>>>>>>> BTW, I tried hot-plugging out a GPU while a rocm application
>>>>>>>>> was still running.
>>>>>>>>> From dmesg, the application is still trying to access the
>>>>>>>>> removed kfd device, and is met with some errors.
>>>>>>>>
>>>>>>>>
>>>>>>>> The application is supposed to keep running; it holds the
>>>>>>>> drm_device reference as long as it has an open FD to the
>>>>>>>> device, and the final cleanup will come only after the app
>>>>>>>> dies, thus releasing the FD and the last drm_device reference.
>>>>>>>>
>>>>>>>>> The application would hang and not exit in this case.
>>>>>>>>
>>>>>>>
>>>>>>> Actually I tried kill -7 $pid (SIGBUS), and the process exits.
>>>>>>> The dmesg has some warnings though.
>>>>>>>
>>>>>>> [ 711.769977] WARNING: CPU: 23 PID: 344 at
>>>>>>> .../amdgpu-rocm5.0.2/src/amd/amdgpu/amdgpu_object.c:1336
>>>>>>> amdgpu_bo_release_notify+0x150/0x160 [amdgpu]
>>>>>>> [ 711.770528] Modules linked in: amdgpu(OE) amdttm(OE)
>>>>>>> amd_sched(OE) amdkcl(OE) iommu_v2 nf_conntrack_netlink nfnetlink
>>>>>>> xfrm_user xfrm_algo xt_addrtype br_netfilter xt_CHECKSUM
>>>>>>> iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack
>>>>>>> nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT
>>>>>>> nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables
>>>>>>> ip6table_filter ip6_tables iptable_filter overlay binfmt_misc
>>>>>>> intel_rapl_msr i40iw intel_rapl_common skx_edac nfit
>>>>>>> x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel rpcrdma
>>>>>>> kvm sunrpc ipmi_ssif ib_umad ib_ipoib rdma_ucm irqbypass rapl
>>>>>>> joydev acpi_ipmi input_leds intel_cstate ipmi_si ipmi_devintf
>>>>>>> mei_me mei intel_pch_thermal ipmi_msghandler ioatdma mac_hid
>>>>>>> lpc_ich dca acpi_power_meter acpi_pad sch_fq_codel ib_iser
>>>>>>> rdma_cm iw_cm ib_cm iscsi_tcp libiscsi_tcp libiscsi
>>>>>>> scsi_transport_iscsi pci_stub ip_tables x_tables autofs4 btrfs
>>>>>>> blake2b_generic zstd_compress raid10 raid456 async_raid6_recov
>>>>>>> async_memcpy async_pq async_xor async_tx xor
>>>>>>> [ 711.779359] raid6_pq libcrc32c raid1 raid0 multipath linear
>>>>>>> mlx5_ib ib_uverbs ib_core ast drm_vram_helper i2c_algo_bit
>>>>>>> drm_ttm_helper ttm drm_kms_helper syscopyarea crct10dif_pclmul
>>>>>>> crc32_pclmul ghash_clmulni_intel sysfillrect uas hid_generic
>>>>>>> sysimgblt aesni_intel mlx5_core fb_sys_fops crypto_simd usbhid
>>>>>>> cryptd drm i40e pci_hyperv_intf usb_storage glue_helper mlxfw
>>>>>>> hid ahci libahci wmi
>>>>>>> [ 711.779752] CPU: 23 PID: 344 Comm: kworker/23:1 Tainted: G W
>>>>>>> OE 5.11.0+ #1
>>>>>>> [ 711.779755] Hardware name: Supermicro
>>>>>>> SYS-4029GP-TRT2/X11DPG-OT-CPU, BIOS 2.1 08/14/2018
>>>>>>> [ 711.779756] Workqueue: kfd_process_wq kfd_process_wq_release
>>>>>>> [amdgpu]
>>>>>>> [ 711.779955] RIP: 0010:amdgpu_bo_release_notify+0x150/0x160
>>>>>>> [amdgpu]
>>>>>>> [ 711.780141] Code: e8 b5 af 34 f4 e9 1f ff ff ff 48 39 c2 74
>>>>>>> 07 0f 0b e9 69 ff ff ff 4c 89 e7 e8 3c b4 16 00 e9 5c ff ff ff
>>>>>>> e8 a2 ce fd f3 eb cf <0f> 0b eb cb e8 d7 02 34 f4 0f 1f 80 00 00
>>>>>>> 00 00 0f 1f 44 00 00 55
>>>>>>> [ 711.780143] RSP: 0018:ffffa8100dd67c30 EFLAGS: 00010282
>>>>>>> [ 711.780145] RAX: 00000000ffffffea RBX: ffff89980e792058 RCX:
>>>>>>> 0000000000000000
>>>>>>> [ 711.780147] RDX: 0000000000000000 RSI: ffff89a8f9ad8870 RDI:
>>>>>>> ffff89a8f9ad8870
>>>>>>> [ 711.780148] RBP: ffffa8100dd67c50 R08: 0000000000000000 R09:
>>>>>>> fffffffffff99b18
>>>>>>> [ 711.780149] R10: ffffa8100dd67bd0 R11: ffffa8100dd67908 R12:
>>>>>>> ffff89980e792000
>>>>>>> [ 711.780151] R13: ffff89980e792058 R14: ffff89980e7921bc R15:
>>>>>>> dead000000000100
>>>>>>> [ 711.780152] FS: 0000000000000000(0000)
>>>>>>> GS:ffff89a8f9ac0000(0000) knlGS:0000000000000000
>>>>>>> [ 711.780154] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>>> [ 711.780156] CR2: 00007ffddac6f71f CR3: 00000030bb80a003 CR4:
>>>>>>> 00000000007706e0
>>>>>>> [ 711.780157] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
>>>>>>> 0000000000000000
>>>>>>> [ 711.780159] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
>>>>>>> 0000000000000400
>>>>>>> [ 711.780160] PKRU: 55555554
>>>>>>> [ 711.780161] Call Trace:
>>>>>>> [ 711.780163] ttm_bo_release+0x2ae/0x320 [amdttm]
>>>>>>> [ 711.780169] amdttm_bo_put+0x30/0x40 [amdttm]
>>>>>>> [ 711.780357] amdgpu_bo_unref+0x1e/0x30 [amdgpu]
>>>>>>> [ 711.780543] amdgpu_gem_object_free+0x8b/0x160 [amdgpu]
>>>>>>> [ 711.781119] drm_gem_object_free+0x1d/0x30 [drm]
>>>>>>> [ 711.781489]
>>>>>>> amdgpu_amdkfd_gpuvm_free_memory_of_gpu+0x34a/0x380 [amdgpu]
>>>>>>> [ 711.782044] kfd_process_device_free_bos+0xe0/0x130 [amdgpu]
>>>>>>> [ 711.782611] kfd_process_wq_release+0x286/0x380 [amdgpu]
>>>>>>> [ 711.783172] process_one_work+0x236/0x420
>>>>>>> [ 711.783543] worker_thread+0x34/0x400
>>>>>>> [ 711.783911] ? process_one_work+0x420/0x420
>>>>>>> [ 711.784279] kthread+0x126/0x140
>>>>>>> [ 711.784653] ? kthread_park+0x90/0x90
>>>>>>> [ 711.785018] ret_from_fork+0x22/0x30
>>>>>>> [ 711.785387] ---[ end trace d8f50f6594817c84 ]---
>>>>>>> [ 711.798716] [drm] amdgpu: ttm finalized
>>>>>>
>>>>>>
>>>>>> So it means the process was stuck in some wait_event_killable
>>>>>> (maybe here drm_sched_entity_flush) - you can try
>>>>>> 'cat /proc/$process_pid/stack' maybe before
>>>>>> you kill it to see where it was stuck so we can go from there.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> For graphics apps what I usually see is a crash because of a
>>>>>>>> sigsegv when the app tries to access
>>>>>>>> an unmapped MMIO region on the device. I haven't tested the
>>>>>>>> compute stack and so there might
>>>>>>>> be something I haven't covered. A hang could mean for example
>>>>>>>> waiting on a
>>>>>>>> fence which is not being
>>>>>>>> signaled - please provide the full dmesg from this case.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Do you have any good suggestions on how to fix it down the
>>>>>>>>> line? (HIP runtime/libhsakmt or driver)
>>>>>>>>>
>>>>>>>>> [64036.631333] amdgpu: amdgpu_vm_bo_update failed
>>>>>>>>> [64036.631702] amdgpu: validate_invalid_user_pages: update PTE
>>>>>>>>> failed
>>>>>>>>> [64036.640754] amdgpu: amdgpu_vm_bo_update failed
>>>>>>>>> [64036.641120] amdgpu: validate_invalid_user_pages: update PTE
>>>>>>>>> failed
>>>>>>>>> [64036.650394] amdgpu: amdgpu_vm_bo_update failed
>>>>>>>>> [64036.650765] amdgpu: validate_invalid_user_pages: update PTE
>>>>>>>>> failed
>>>>>>>>
>>>>>>>
>>>>>>> The full dmesg is just the repetition of those two messages:
>>>>>>> [186885.764079] amdgpu 0000:43:00.0: amdgpu: amdgpu: finishing
>>>>>>> device.
>>>>>>> [186885.766916] [drm] free PSP TMR buffer
>>>>>>> [186893.868173] amdgpu: amdgpu_vm_bo_update failed
>>>>>>> [186893.868235] amdgpu: validate_invalid_user_pages: update PTE
>>>>>>> failed
>>>>>>> [186893.876154] amdgpu: amdgpu_vm_bo_update failed
>>>>>>> [186893.876190] amdgpu: validate_invalid_user_pages: update PTE
>>>>>>> failed
>>>>>>> [186893.884150] amdgpu: amdgpu_vm_bo_update failed
>>>>>>> [186893.884185] amdgpu: validate_invalid_user_pages: update PTE
>>>>>>> failed
>>>>>>>
>>>>>>>>
>>>>>>>> This probably just means trying to update PTEs after the
>>>>>>>> physical device
>>>>>>>> is gone - we usually avoid this by
>>>>>>>> first trying to do all HW shutdowns early, before PCI remove
>>>>>>>> completion,
>>>>>>>> but when that's really tricky, by
>>>>>>>> protecting HW access sections with a drm_dev_enter/exit scope.
>>>>>>>>
>>>>>>>> For this particular error it would be best to flush
>>>>>>>> info->restore_userptr_work before the end of
>>>>>>>> amdgpu_pci_remove (rejecting new process creation and calling
>>>>>>>> cancel_delayed_work_sync(&process_info->restore_userptr_work)
>>>>>>>> for all
>>>>>>>> running processes)
>>>>>>>> somewhere in amdgpu_pci_remove.
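>>>>>>>>
>>>>>>>> As a rough sketch (not the exact code; assuming adev is in
>>>>>>>> scope), a protected HW access section would look like:
>>>>>>>>
>>>>>>>> 	int idx;
>>>>>>>>
>>>>>>>> 	if (!drm_dev_enter(adev_to_drm(adev), &idx))
>>>>>>>> 		return; /* device gone - skip the HW access */
>>>>>>>>
>>>>>>>> 	/* ... program registers / update PTEs ... */
>>>>>>>>
>>>>>>>> 	drm_dev_exit(idx);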
>>>>>>>>
>>>>>>> I tried something like *kfd_process_ref_release* which I think
>>>>>>> did what you described, but it did not work.
>>>>>>
>>>>>>
>>>>>> I don't see how kfd_process_ref_release is the same as what I
>>>>>> mentioned above; what I meant is calling the code above within
>>>>>> kgd2kfd_suspend (where you tried to call
>>>>>> kfd_kill_all_user_processes below)
>>>>>>
>>>>> Yes, you are right. It was not called by it.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Instead I tried to kill the process from the kernel, but
>>>>>>> amdgpu could **only** be hot-plugged back in successfully
>>>>>>> if there was no rocm kernel running when it was plugged out. If
>>>>>>> not, amdgpu_probe will just hang later. (Maybe because amdgpu
>>>>>>> was plugged out in a running state, it leaves a bad HW state
>>>>>>> which causes probe to hang.)
>>>>>>
>>>>>>
>>>>>> We usually do asic_reset during probe to reset all HW state
>>>>>> (check if amdgpu_device_init->amdgpu_asic_reset is running when
>>>>>> you plug back).
>>>>>>
>>>>> OK
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> I don't know if this is a viable solution worth pursuing, but I
>>>>>>> attached the diff anyway.
>>>>>>>
>>>>>>> Another solution could be to let the compute stack user mode
>>>>>>> detect a topology change via a generation_count change, and
>>>>>>> abort gracefully there.
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>>>> index 4e7d9cb09a69..79b4c9b84cd0 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>>>> @@ -697,12 +697,15 @@ void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm, bool force)
>>>>>>>  		return;
>>>>>>>
>>>>>>>  	/* for runtime suspend, skip locking kfd */
>>>>>>> -	if (!run_pm) {
>>>>>>> +	if (!run_pm && !drm_dev_is_unplugged(kfd->ddev)) {
>>>>>>>  		/* For first KFD device suspend all the KFD processes */
>>>>>>>  		if (atomic_inc_return(&kfd_locked) == 1)
>>>>>>>  			kfd_suspend_all_processes(force);
>>>>>>>  	}
>>>>>>>
>>>>>>> +	if (drm_dev_is_unplugged(kfd->ddev))
>>>>>>> +		kfd_kill_all_user_processes();
>>>>>>> +
>>>>>>>  	kfd->dqm->ops.stop(kfd->dqm);
>>>>>>>  	kfd_iommu_suspend(kfd);
>>>>>>>  }
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>>>>>>> index 55c9e1922714..84cbcd857856 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>>>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>>>>>>> @@ -1065,6 +1065,7 @@ void kfd_unref_process(struct kfd_process *p);
>>>>>>>  int kfd_process_evict_queues(struct kfd_process *p, bool force);
>>>>>>>  int kfd_process_restore_queues(struct kfd_process *p);
>>>>>>>  void kfd_suspend_all_processes(bool force);
>>>>>>> +void kfd_kill_all_user_processes(void);
>>>>>>>  /*
>>>>>>>   * kfd_resume_all_processes:
>>>>>>>   * bool sync: If kfd_resume_all_processes() should wait for the
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>>>>>> index 6cdc855abb6d..fb0c753b682c 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>>>>>> @@ -2206,6 +2206,24 @@ void kfd_suspend_all_processes(bool force)
>>>>>>>  	srcu_read_unlock(&kfd_processes_srcu, idx);
>>>>>>>  }
>>>>>>>
>>>>>>> +void kfd_kill_all_user_processes(void)
>>>>>>> +{
>>>>>>> +	struct kfd_process *p;
>>>>>>> +	struct amdkfd_process_info *p_info;
>>>>>>> +	unsigned int temp;
>>>>>>> +	int idx = srcu_read_lock(&kfd_processes_srcu);
>>>>>>> +
>>>>>>> +	pr_info("Killing all processes\n");
>>>>>>> +	hash_for_each_rcu(kfd_processes_table, temp, p, kfd_processes) {
>>>>>>> +		p_info = p->kgd_process_info;
>>>>>>> +		pr_info("Killing processes, pid = %d", pid_nr(p_info->pid));
>>>>>>> +		kill_pid(p_info->pid, SIGBUS, 1);
>>>>>>
>>>>>>
>>>>>> From looking into kill_pid I see it only sends a signal but
>>>>>> doesn't wait for completion; it would make sense to wait for
>>>>>> completion here. In any case I would actually try to put here
>>>>> I have made a version which does that with some atomic counters.
>>>>> Please read later in the diff.
>>>>>>
>>>>>>
>>>>>> hash_for_each_rcu(p_info)
>>>>>> cancel_delayed_work_sync(&p_info->restore_userptr_work)
>>>>>>
>>>>>> instead - at least that's what I meant in the previous mail.
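>>>>>>
>>>>>> Spelled out (a sketch only, reusing the same SRCU iteration that
>>>>>> kfd_suspend_all_processes already uses and your
>>>>>> p->kgd_process_info field):
>>>>>>
>>>>>> 	struct kfd_process *p;
>>>>>> 	unsigned int temp;
>>>>>> 	int idx = srcu_read_lock(&kfd_processes_srcu);
>>>>>>
>>>>>> 	hash_for_each_rcu(kfd_processes_table, temp, p, kfd_processes)
>>>>>> 		cancel_delayed_work_sync(&p->kgd_process_info->restore_userptr_work);
>>>>>>
>>>>>> 	srcu_read_unlock(&kfd_processes_srcu, idx);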
>>>>>>
>>>>> I actually tried that earlier, and it did not work. Application
>>>>> still keeps running, and you have to send a kill to the user process.
>>>>>
>>>>> I have made the following version. It waits for processes to
>>>>> terminate synchronously after sending SIGBUS. After that it does
>>>>> the real work of amdgpu_pci_remove.
>>>>> However, it hangs at amdgpu_device_ip_fini_early when it is trying
>>>>> to deinit ip_block 6 <sdma_v4_0>
>>>>> (https://gitlab.freedesktop.org/agd5f/linux/-/blob/amd-staging-drm-next/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c#L2818).
>>>>> I assume that there is still some in-flight DMA, and the fini of
>>>>> this IP block therefore hangs?
>>>>>
>>>>> The following is an excerpt of the dmesg; please excuse my own
>>>>> pr_info calls, but I hope you get my point of where it hangs.
>>>>>
>>>>> [ 392.344735] amdgpu: all processes has been fully released
>>>>> [ 392.346557] amdgpu: amdgpu_acpi_fini done
>>>>> [ 392.346568] amdgpu 0000:b3:00.0: amdgpu: amdgpu: finishing device.
>>>>> [ 392.349238] amdgpu: amdgpu_device_ip_fini_early enter ip_blocks = 9
>>>>> [ 392.349248] amdgpu: Free mem_obj = 000000007bf54275,
>>>>> range_start = 14, range_end = 14
>>>>> [ 392.350299] amdgpu: Free mem_obj = 00000000a85bc878,
>>>>> range_start = 12, range_end = 12
>>>>> [ 392.350304] amdgpu: Free mem_obj = 00000000b8019e32,
>>>>> range_start = 13, range_end = 13
>>>>> [ 392.350308] amdgpu: Free mem_obj = 000000002d296168,
>>>>> range_start = 4, range_end = 11
>>>>> [ 392.350313] amdgpu: Free mem_obj = 000000001fc4f934,
>>>>> range_start = 0, range_end = 3
>>>>> [ 392.350322] amdgpu: amdgpu_amdkfd_suspend(adev, false) done
>>>>> [ 392.350672] amdgpu: hw_fini of IP block[8] <jpeg_v2_5> done 0
>>>>> [ 392.350679] amdgpu: hw_fini of IP block[7] <vcn_v2_5> done 0
>>>>
>>>>
>>>> I just remembered that the idea to actively kill and wait for
>>>> running user processes during unplug was rejected
>>>> as a bad idea in the first iteration of unplug work I did (don't
>>>> remember why now, need to look), so this is a no-go.
>>>>
>>> Maybe an application has kfd open but was not accessing the dev, so
>>> killing it at unplug could kill the process unnecessarily.
>>> However, the latest version I had with the sleep function got rid of
>>> the IP block fini hang.
>>>>
>>>> Our policy is to let zombie processes (zombie in the sense that the
>>>> underlying device is gone) live as long as they want
>>>> (as long as you are able to terminate them - which you can, so
>>>> that's ok) and the system should finish PCI remove gracefully and
>>>> be able to hot plug back the device. Hence, I suggest dropping
>>>> this direction of forcing all user processes to be killed,
>>>> confirming you have graceful shutdown and removal of the device
>>>> from the PCI topology, and then concentrating on why it hangs when
>>>> you plug the device back.
>>>>
>>> So I basically reverted back to the original solution which you suggested.
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>> index 4e7d9cb09a69..5504a18b5a45 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>> @@ -697,7 +697,7 @@ void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm, bool force)
>>>  		return;
>>>
>>>  	/* for runtime suspend, skip locking kfd */
>>> -	if (!run_pm) {
>>> +	if (!run_pm && !drm_dev_is_unplugged(kfd->ddev)) {
>>>  		/* For first KFD device suspend all the KFD processes */
>>>  		if (atomic_inc_return(&kfd_locked) == 1)
>>>  			kfd_suspend_all_processes(force);
>>>
>>>> First confirm if ASIC reset happens on
>>>> next init.
>>>>
>>> This patch works great at *planned* plugout, where all the rocm
>>> processes are killed before plugout, and the device can be added
>>> back without a problem.
>>> However, *unplanned* plugout while rocm processes are running just
>>> doesn't work.
>>
>>
>> Still I am not clear whether ASIC reset happens on plug back or not -
>> can you trace this please?
>>
>>
>
> I tried adding pr_info into the asic_reset functions, but cannot
> trace any of them being called upon plug-back.
This could possibly explain the hang on plug back. Can you see why we
don't get there?
>>
>>>> Second, please confirm whether the timing of manually killing the
>>>> user process has an impact on whether you get a hang
>>>> on the next plug back (if you kill before
>>>>
>>> *Scenario 0: Kill before plug back*
>>>
>>> 1. echo 1 > /sys/bus/pci/…/remove would finish,
>>> but the application won’t exit until there is a kill signal.
>>
>>
>> Why do you think it must exit?
>>
> Because rocm needs to release the drm descriptor to
> get amdgpu_amdkfd_device_fini_sw called, which would eventually get
> kgd2kfd_device_exit called. This would clean up kfd_topology at least.
> Otherwise I don’t see how it could be added back without messing up
> the kfd topology, to say the least.
>
> However, those are all based on my own observations. Please explain
> why it does not need to exit if you believe so?
Note that when you add back a new device, a pci device and a drm device
are created. I am not an expert on the KFD code but I believe a new KFD
device is also created, independent of the old one, so the topology
should see just 2 device instances (one old zombie and one real new
one). I know at least this wasn't an issue for the graphics stack in
the exact same scenario, and the libdrm tests I pointed to test exactly
this scenario.
Also note that even with a running graphics stack there is always a KFD
device and KFD topology present, but of course probably not the same as
when you run a KFD-facing process, so there could be some issues there.
Also note that because of this patch
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=267d51d77fdae8708b94e1a24b8e5d961297edb7
all MMIO accesses from such zombie/orphan user processes will be
remapped to the zero page, so they will not necessarily experience a
segfault when device removal happens, but rather maybe some crash due
to NULL data read from MMIO by the process and used in some manner.
>>>
>>> 2. kill the process. The application does several things and
>>> seems to trigger drm_release in the kernel, which is met with a
>>> kernel NULL pointer dereference related to sysfs_remove. Then the
>>> whole fs just freezes.
>>>
>>> [ +0.002498] BUG: kernel NULL pointer dereference, address:
>>> 0000000000000098
>>> [ +0.000486] #PF: supervisor read access in kernel mode
>>> [ +0.000545] #PF: error_code(0x0000) - not-present page
>>> [ +0.000551] PGD 0 P4D 0
>>> [ +0.000553] Oops: 0000 [#1] SMP NOPTI
>>> [ +0.000540] CPU: 75 PID: 4911 Comm: kworker/75:2 Tainted: G
>>> W E 5.13.0-kfd #1
>>> [ +0.000559] Hardware name: INGRASYS TURING /MB ,
>>> BIOS K71FQ28A 10/05/2021
>>> [ +0.000567] Workqueue: events delayed_fput
>>> [ +0.000563] RIP: 0010:kernfs_find_ns+0x1b/0x100
>>> [ +0.000569] Code: ff ff e8 88 59 9f 00 0f 1f 84 00 00 00 00 00 0f
>>> 1f 44 00 00 41 57 8b 05 df f0 7b 01 41 56 41 55 49 89 f5 41 54 55 48
>>> 89 fd 53 <44> 0f b7 b7 98 00 00 00 48 89 d3 4c 8b 67 70 66 41 83 e6
>>> 20 41 0f
>>> [ +0.001193] RSP: 0018:ffffb9875db5fc98 EFLAGS: 00010246
>>> [ +0.000602] RAX: 0000000000000000 RBX: ffffa101f79507d8 RCX:
>>> 0000000000000000
>>> [ +0.000612] RDX: 0000000000000000 RSI: ffffffffc09a9417 RDI:
>>> 0000000000000000
>>> [ +0.000604] RBP: 0000000000000000 R08: 0000000000000001 R09:
>>> 0000000000000000
>>> [ +0.000760] R10: ffffb9875db5fcd0 R11: 0000000000000000 R12:
>>> 0000000000000000
>>> [ +0.000597] R13: ffffffffc09a9417 R14: ffffa08363fb2d18 R15:
>>> 0000000000000000
>>> [ +0.000702] FS: 0000000000000000(0000) GS:ffffa0ffbfcc0000(0000)
>>> knlGS:0000000000000000
>>> [ +0.000666] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [ +0.000658] CR2: 0000000000000098 CR3: 0000005747812005 CR4:
>>> 0000000000770ee0
>>> [ +0.000715] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
>>> 0000000000000000
>>> [ +0.000655] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
>>> 0000000000000400
>>> [ +0.000592] PKRU: 55555554
>>> [ +0.000580] Call Trace:
>>> [ +0.000591] kernfs_find_and_get_ns+0x2f/0x50
>>> [ +0.000584] sysfs_remove_file_from_group+0x20/0x50
>>> [ +0.000580] amdgpu_ras_sysfs_remove+0x3d/0xd0 [amdgpu]
>>> [ +0.000737] amdgpu_ras_late_fini+0x1d/0x40 [amdgpu]
>>> [ +0.000750] amdgpu_sdma_ras_fini+0x96/0xb0 [amdgpu]
>>> [ +0.000742] ? gfx_v10_0_resume+0x10/0x10 [amdgpu]
>>> [ +0.000738] sdma_v4_0_sw_fini+0x23/0x90 [amdgpu]
>>> [ +0.000717] amdgpu_device_fini_sw+0xae/0x260 [amdgpu]
>>> [ +0.000704] amdgpu_driver_release_kms+0x12/0x30 [amdgpu]
>>> [ +0.000687] drm_dev_release+0x20/0x40 [drm]
>>> [ +0.000583] drm_release+0xa8/0xf0 [drm]
>>> [ +0.000584] __fput+0xa5/0x250
>>> [ +0.000621] delayed_fput+0x1f/0x30
>>> [ +0.000726] process_one_work+0x26e/0x580
>>> [ +0.000581] ? process_one_work+0x580/0x580
>>> [ +0.000611] worker_thread+0x4d/0x3d0
>>> [ +0.000611] ? process_one_work+0x580/0x580
>>> [ +0.000605] kthread+0x117/0x150
>>> [ +0.000611] ? kthread_park+0x90/0x90
>>> [ +0.000619] ret_from_fork+0x1f/0x30
>>> [ +0.000625] Modules linked in: amdgpu(E) xt_conntrack
>>> xt_MASQUERADE nfnetlink xt_addrtype iptable_filter iptable_nat
>>> nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter
>>> x86_pkg_temp_thermal cdc_ether usbnet acpi_pad msr ip_tables
>>> x_tables ast drm_vram_helper iommu_v2 drm_ttm_helper gpu_sched ttm
>>> drm_kms_helper cfbfillrect syscopyarea cfbimgblt sysfillrect
>>> sysimgblt fb_sys_fops cfbcopyarea drm drm_panel_orientati
>>> on_quirks [last unloaded: amdgpu]
>>
>>
>> This is a known regression: all SYSFS components must be removed
>> before the pci_remove code runs, otherwise you get either warnings
>> for single file removals or
>> OOPSes for sysfs group removals like here. Please try to move
>> amdgpu_ras_sysfs_remove from amdgpu_ras_late_fini to the end of
>> amdgpu_ras_pre_fini (which happens before pci remove)
>>
>>
>
> I fixed it in the newer patch, please see it below.
> I first plug out the device, then kill the rocm user process. Then it
> hits other OOPSes related to ttm_bo_cleanup_refs.
>
> [ +0.000006] BUG: kernel NULL pointer dereference, address:
> 0000000000000010
> [ +0.000349] #PF: supervisor read access in kernel mode
> [ +0.000340] #PF: error_code(0x0000) - not-present page
> [ +0.000341] PGD 0 P4D 0
> [ +0.000336] Oops: 0000 [#1] SMP NOPTI
> [ +0.000345] CPU: 9 PID: 95 Comm: kworker/9:1 Tainted: G W E
> 5.13.0-kfd #1
> [ +0.000367] Hardware name: INGRASYS TURING /MB , BIOS
> K71FQ28A 10/05/2021
> [ +0.000376] Workqueue: events delayed_fput
> [ +0.000422] RIP: 0010:ttm_resource_free+0x24/0x40 [ttm]
> [ +0.000464] Code: 00 00 0f 1f 40 00 0f 1f 44 00 00 53 48 89 f3 48 8b
> 36 48 85 f6 74 21 48 8b 87 28 02 00 00 48 63 56 10 48 8b bc d0 b8 00
> 00 00 <48> 8b 47 10 ff 50 08 48 c7 03 00 00 00 00 5b c3 66 66 2e 0f 1f 84
> [ +0.001009] RSP: 0018:ffffb21c59413c98 EFLAGS: 00010282
> [ +0.000515] RAX: ffff8b1aa4285f68 RBX: ffff8b1a823b5ea0 RCX:
> 00000000002a000c
> [ +0.000536] RDX: 0000000000000000 RSI: ffff8b1acb84db80 RDI:
> 0000000000000000
> [ +0.000539] RBP: 0000000000000001 R08: 0000000000000000 R09:
> ffffffffc03c3e00
> [ +0.000543] R10: 0000000000000000 R11: 0000000000000000 R12:
> ffff8b1a823b5ec8
> [ +0.000542] R13: 0000000000000000 R14: ffff8b1a823b5d90 R15:
> ffff8b1a823b5ec8
> [ +0.000544] FS: 0000000000000000(0000) GS:ffff8b187f440000(0000)
> knlGS:0000000000000000
> [ +0.000559] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ +0.000575] CR2: 0000000000000010 CR3: 00000076e6812004 CR4:
> 0000000000770ee0
> [ +0.000575] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [ +0.000579] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> 0000000000000400
> [ +0.000575] PKRU: 55555554
> [ +0.000568] Call Trace:
> [ +0.000567] ttm_bo_cleanup_refs+0xe4/0x290 [ttm]
> [ +0.000588] ttm_bo_delayed_delete+0x147/0x250 [ttm]
> [ +0.000589] ttm_device_fini+0xad/0x1b0 [ttm]
> [ +0.000590] amdgpu_ttm_fini+0x2a7/0x310 [amdgpu]
> [ +0.000730] gmc_v9_0_sw_fini+0x3a/0x40 [amdgpu]
> [ +0.000753] amdgpu_device_fini_sw+0xae/0x260 [amdgpu]
> [ +0.000734] amdgpu_driver_release_kms+0x12/0x30 [amdgpu]
> [ +0.000737] drm_dev_release+0x20/0x40 [drm]
> [ +0.000626] drm_release+0xa8/0xf0 [drm]
> [ +0.000625] __fput+0xa5/0x250
> [ +0.000606] delayed_fput+0x1f/0x30
> [ +0.000607] process_one_work+0x26e/0x580
> [ +0.000608] ? process_one_work+0x580/0x580
> [ +0.000616] worker_thread+0x4d/0x3d0
> [ +0.000614] ? process_one_work+0x580/0x580
> [ +0.000617] kthread+0x117/0x150
> [ +0.000615] ? kthread_park+0x90/0x90
> [ +0.000621] ret_from_fork+0x1f/0x30
> [ +0.000603] Modules linked in: amdgpu(E) xt_conntrack xt_MASQUERADE
> nfnetlink xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack
> nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter x86_pkg_temp_thermal
> cdc_ether usbnet acpi_pad msr ip_tables x_tables ast drm_vram_helper
> drm_ttm_helper iommu_v2 ttm gpu_sched drm_kms_helper cfbfillrect
> syscopyarea cfbimgblt sysfillrect sysimgblt fb_sys_fops cfbcopyarea
> drm drm_panel_orientation_quirks [last unloaded: amdgpu]
> [ +0.002840] CR2: 0000000000000010
> [ +0.000755] ---[ end trace 9737737402551e39 ]--
This looks like another regression - try seeing where the NULL
reference is and then we can see how to avoid this.
>
>>>
>>> 3. echo 1 > /sys/bus/pci/rescan. This would just hang. I assume the
>>> sysfs is broken.
>>>
>>> Based on 1 & 2, it seems that 1 won’t let amdgpu exit
>>> gracefully, because 2 does some cleanup that maybe should have
>>> happened before 1.
>>>>
>>>> or after plug back - does it make a difference).
>>>>
>>> *Scenario 2: Kill after plug back*
>>>
>>> If I perform the rescan before the kill, then the driver seemed to
>>> probe fine. But the kill will hit the same issue which messed up the
>>> sysfs, the same way as in Scenario 0.
>>>
>>>
>>> *Final Comments:*
>>>
>>> 0. cancel_delayed_work_sync(&p_info->restore_userptr_work)
>>> would make the repetition of the amdgpu_vm_bo_update failures go
>>> away, but it does not solve the issues in those scenarios.
>>
>>
>> Still - it's better to do it this way, even just for those failures to go away
>>
>>
> cancel_delayed_work is insufficient - you will need to make sure the
> work won’t be processed after plugout. Please see my patch
Saw, see my comment.
>>
>>>
>>> 1. For planned hotplug, this patch should work as long as you follow
>>> some protocol, i.e. kill before plugout. Is this patch acceptable
>>> since it provides an added feature over what existed before?
>>
>>
>> Let's try to fix more as I advised above.
>>
>>
>>>
>>> 2. For unplanned hotplug while a rocm app is running, the patch
>>> that kills all processes and waits for 5 sec works consistently.
>>> But it seems that it is an unacceptable solution for an official
>>> release. I can hold it for our own internal usage. It seems that
>>> killing after removal causes problems, and I don’t know if there
>>> is a quick fix by me because of my limited understanding of the
>>> amdgpu driver. Maybe AMD could have a quick fix; or it is really a
>>> difficult one. This feature may or may not be a blocking issue for
>>> our GPU disaggregation research down the road. Please let us know in
>>> either case, and we would like to learn and help as much as we can!
>>
>>
>> I am currently not sure why it helps. I will need to set up my own
>> ROCm environment and retest hot plug to check this in more depth, but
>> currently I have higher priorities. Please try to confirm that ASIC
>> reset always takes place on plug back,
>> and fix the sysfs OOPS as I advised above, to clear up at least some
>> of the issues. Also please describe to me exactly what your steps are
>> to reproduce this scenario so later I might be able to do it myself.
>>
>>
> I can still try to help fix the bug in my spare time. My setup is
> as follows:
>
> 1. I have a server with 4 AMD MI100 GPUs. I think 1 GPU would also work.
> 2. I used the
> https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/tree/roc-5.0.x
> as the starting point, and applied Mukul’s patch and my patch.
> 3. Then I run a tensorflow benchmark from a docker.
> * docker run -it --device=/dev/kfd --device=/dev/dri --group-add
> video rocm/tensorflow:rocm4.5.2-tf1.15-dev
> * And run the following benchmark in the docker: python
> benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
> --num_gpus=4 --batch_size=32 --model=resnet50
> --variable_update=parameter_server
> o Might to need to adjust num_gpus parameter based on your setup
> 4. Remove a GPU at random time.
> 5. Do whatever is needed to before plugback and reverify the
> benchmark can still run.
>
>> Also, we have a hotplug test suite in libdrm (graphics stack), so
>> maybe you can install libdrm and run that test suite to see if it
>> exposes more issues.
>>
> OK I could try it some time.
>
> The following is the new diff.
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 182b7eae598a..48c3cd4054de 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -1327,7 +1327,7 @@ int emu_soc_asic_init(struct amdgpu_device *adev);
>   * ASICs macro.
>   */
>  #define amdgpu_asic_set_vga_state(adev, state) (adev)->asic_funcs->set_vga_state((adev), (state))
> -#define amdgpu_asic_reset(adev) (adev)->asic_funcs->reset((adev))
> +#define amdgpu_asic_reset(adev) ({int r; pr_info("performing amdgpu_asic_reset\n"); r = (adev)->asic_funcs->reset((adev)); r;})
>  #define amdgpu_asic_reset_method(adev) (adev)->asic_funcs->reset_method((adev))
>  #define amdgpu_asic_get_xclk(adev) (adev)->asic_funcs->get_xclk((adev))
>  #define amdgpu_asic_set_uvd_clocks(adev, v, d) (adev)->asic_funcs->set_uvd_clocks((adev), (v), (d))
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> index 27c74fcec455..842abd7150a6 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> @@ -134,6 +134,7 @@ struct amdkfd_process_info {
>  	/* MMU-notifier related fields */
>  	atomic_t evicted_bos;
> +	atomic_t invalid;
>  	struct delayed_work restore_userptr_work;
>  	struct pid *pid;
>  };
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> index 99d2b15bcbf3..2a588eb9f456 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> @@ -1325,6 +1325,7 @@ static int init_kfd_vm(struct amdgpu_vm *vm, void **process_info,
>  	info->pid = get_task_pid(current->group_leader, PIDTYPE_PID);
>  	atomic_set(&info->evicted_bos, 0);
> +	atomic_set(&info->invalid, 0);
>  	INIT_DELAYED_WORK(&info->restore_userptr_work,
>  			  amdgpu_amdkfd_restore_userptr_worker);
> @@ -2693,6 +2694,9 @@ static void amdgpu_amdkfd_restore_userptr_worker(struct work_struct *work)
>  	struct mm_struct *mm;
>  	int evicted_bos;
> +	if (atomic_read(&process_info->invalid))
> +		return;
> +
Probably better to again use drm_dev_enter/exit guard pair instead of
this flag.
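Roughly (a sketch only, assuming the worker can reach the adev it
serves - not the exact code), the guard would replace the flag check
like this:

	int idx;

	if (!drm_dev_enter(adev_to_drm(adev), &idx))
		return; /* device unplugged - nothing to restore */

	/* ... existing restore/validate work ... */

	drm_dev_exit(idx);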
>  	evicted_bos = atomic_read(&process_info->evicted_bos);
>  	if (!evicted_bos)
>  		return;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index ec38517ab33f..e7d85d8d282d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -1054,6 +1054,7 @@ void amdgpu_device_program_register_sequence(struct amdgpu_device *adev,
>   */
>  void amdgpu_device_pci_config_reset(struct amdgpu_device *adev)
>  {
> +	pr_debug("%s called\n", __func__);
>  	pci_write_config_dword(adev->pdev, 0x7c, AMDGPU_ASIC_RESET_DATA);
>  }
> @@ -1066,6 +1067,7 @@ void amdgpu_device_pci_config_reset(struct amdgpu_device *adev)
>   */
>  int amdgpu_device_pci_reset(struct amdgpu_device *adev)
>  {
> +	pr_debug("%s called\n", __func__);
>  	return pci_reset_function(adev->pdev);
>  }
> @@ -4702,6 +4704,8 @@ int amdgpu_do_asic_reset(struct list_head *device_list_handle,
>  	bool need_full_reset, skip_hw_reset, vram_lost = false;
>  	int r = 0;
> +	pr_debug("%s called\n", __func__);
> +
>  	/* Try reset handler method first */
>  	tmp_adev = list_first_entry(device_list_handle, struct amdgpu_device,
>  				    reset_list);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 49bdf9ff7350..b469acb65c1e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -2518,7 +2518,6 @@ void amdgpu_ras_late_fini(struct amdgpu_device *adev,
>  	if (!ras_block || !ih_info)
>  		return;
> -	amdgpu_ras_sysfs_remove(adev, ras_block);
>  	if (ih_info->cb)
>  		amdgpu_ras_interrupt_remove_handler(adev, ih_info);
>  }
> @@ -2577,6 +2576,7 @@ void amdgpu_ras_suspend(struct amdgpu_device *adev)
>  int amdgpu_ras_pre_fini(struct amdgpu_device *adev)
>  {
>  	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
> +	struct ras_manager *obj, *tmp;
>  	if (!adev->ras_enabled || !con)
>  		return 0;
> @@ -2585,6 +2585,13 @@ int amdgpu_ras_pre_fini(struct amdgpu_device *adev)
>  	/* Need disable ras on all IPs here before ip [hw/sw]fini */
>  	amdgpu_ras_disable_all_features(adev, 0);
>  	amdgpu_ras_recovery_fini(adev);
> +
> +	/* remove sysfs before pci_remove to avoid OOPSes from sysfs_remove_groups */
> +	list_for_each_entry_safe(obj, tmp, &con->head, node) {
> +		amdgpu_ras_sysfs_remove(adev, &obj->head);
> +		put_obj(obj);
> +	}
> +
>  	return 0;
>  }
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> index 4e7d9cb09a69..0fa806a78e39 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> @@ -693,16 +693,35 @@ bool kfd_is_locked(void)
>  void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm, bool force)
>  {
> +	struct kfd_process *p;
> +	struct amdkfd_process_info *p_info;
> +	unsigned int temp;
> +
>  	if (!kfd->init_complete)
>  		return;
>  	/* for runtime suspend, skip locking kfd */
> -	if (!run_pm) {
> +	if (!run_pm && !drm_dev_is_unplugged(kfd->ddev)) {
>  		/* For first KFD device suspend all the KFD processes */
>  		if (atomic_inc_return(&kfd_locked) == 1)
>  			kfd_suspend_all_processes(force);
>  	}
> +	if (drm_dev_is_unplugged(kfd->ddev)) {
> +		int idx = srcu_read_lock(&kfd_processes_srcu);
> +
> +		pr_debug("cancel restore_userptr_work\n");
> +		hash_for_each_rcu(kfd_processes_table, temp, p, kfd_processes) {
> +			if (kfd_process_gpuidx_from_gpuid(p, kfd->id) >= 0) {
> +				p_info = p->kgd_process_info;
> +				pr_debug("cancel processes, pid = %d for gpu_id = %d",
> +					 pid_nr(p_info->pid), kfd->id);
> +				cancel_delayed_work_sync(&p_info->restore_userptr_work);
> +				/* block all future restore_userptr_work */
> +				atomic_inc(&p_info->invalid);
Same as I mentioned above - use drm_dev_enter/exit here instead.
Andrey
> +			}
> +		}
> +		srcu_read_unlock(&kfd_processes_srcu, idx);
> +	}
> +
>  	kfd->dqm->ops.stop(kfd->dqm);
>  	kfd_iommu_suspend(kfd);
>  }
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
> index 600ba2a728ea..7e3d1848eccc 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
> @@ -669,11 +669,12 @@ static void kfd_remove_sysfs_node_entry(struct kfd_topology_device *dev)
>  #ifdef HAVE_AMD_IOMMU_PC_SUPPORTED
>  	if (dev->kobj_perf) {
>  		list_for_each_entry(perf, &dev->perf_props, list) {
> +			sysfs_remove_group(dev->kobj_perf, perf->attr_group);
>  			kfree(perf->attr_group);
>  			perf->attr_group = NULL;
>  		}
>  		kobject_del(dev->kobj_perf);
> -		kobject_put(dev->kobj_perf);
> +		/* kobject_put(dev->kobj_perf); */
>  		dev->kobj_perf = NULL;
>  	}
>  #endif
>
> Thank you so much! Looking forward to your comments!
>
> Regards,
> Shuotao
>>
>> Andrey
>>
>>
>>>
>>> Thank you so much!
>>>
>>> Best regards,
>>> Shuotao
>>>>
>>>> Andrey
>>>>
>>>>
>>>>>
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>>> index 8fa9b86ac9d2..c0b27f722281 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>>> @@ -188,6 +188,12 @@ void amdgpu_amdkfd_interrupt(struct amdgpu_device *adev,
>>>>>  	kgd2kfd_interrupt(adev->kfd.dev, ih_ring_entry);
>>>>>  }
>>>>> +void amdgpu_amdkfd_kill_all_processes(struct amdgpu_device *adev)
>>>>> +{
>>>>> +	if (adev->kfd.dev)
>>>>> +		kgd2kfd_kill_all_user_processes(adev->kfd.dev);
>>>>> +}
>>>>> +
>>>>>  void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm)
>>>>>  {
>>>>>  	if (adev->kfd.dev)
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>>>> index 27c74fcec455..f4e485d60442 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>>>> @@ -141,6 +141,7 @@ struct amdkfd_process_info {
>>>>>  int amdgpu_amdkfd_init(void);
>>>>>  void amdgpu_amdkfd_fini(void);
>>>>> +void amdgpu_amdkfd_kill_all_processes(struct amdgpu_device *adev);
>>>>>  void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm);
>>>>>  int amdgpu_amdkfd_resume_iommu(struct amdgpu_device *adev);
>>>>>  int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm, bool sync);
>>>>> @@ -405,6 +406,7 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
>>>>>  		const struct kgd2kfd_shared_resources *gpu_resources);
>>>>>  void kgd2kfd_device_exit(struct kfd_dev *kfd);
>>>>>  void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm, bool force);
>>>>> +void kgd2kfd_kill_all_user_processes(struct kfd_dev *kfd);
>>>>>  int kgd2kfd_resume_iommu(struct kfd_dev *kfd);
>>>>>  int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm, bool sync);
>>>>>  int kgd2kfd_pre_reset(struct kfd_dev *kfd);
>>>>> @@ -443,6 +445,9 @@ static inline void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm, bool force)
>>>>>  {
>>>>>  }
>>>>> +void kgd2kfd_kill_all_user_processes(struct kfd_dev *kfd){
>>>>> +}
>>>>> +
>>>>>  static int __maybe_unused kgd2kfd_resume_iommu(struct kfd_dev *kfd)
>>>>>  {
>>>>>  	return 0;
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>> index 3d5fc0751829..af6fe5080cfa 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>> @@ -2101,6 +2101,9 @@ amdgpu_pci_remove(struct pci_dev *pdev)
>>>>>  {
>>>>>  	struct drm_device *dev = pci_get_drvdata(pdev);
>>>>> +	/* kill all kfd processes before drm_dev_unplug */
>>>>> +	amdgpu_amdkfd_kill_all_processes(drm_to_adev(dev));
>>>>> +
>>>>>  #ifdef HAVE_DRM_DEV_UNPLUG
>>>>>  	drm_dev_unplug(dev);
>>>>>  #else
>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>> index 5504a18b5a45..480c23bef5e2 100644
>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>> @@ -691,6 +691,12 @@ bool kfd_is_locked(void)
>>>>>  	return (atomic_read(&kfd_locked) > 0);
>>>>>  }
>>>>> +inline void kgd2kfd_kill_all_user_processes(struct kfd_dev *dev)
>>>>> +{
>>>>> +	kfd_kill_all_user_processes();
>>>>> +}
>>>>> +
>>>>> +
>>>>>  void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm, bool force)
>>>>>  {
>>>>>  	if (!kfd->init_complete)
>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>>>>> index 55c9e1922714..a35a2cb5bb9f 100644
>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>>>>> @@ -1064,6 +1064,7 @@ static inline struct kfd_process_device *kfd_process_device_from_gpuidx(
>>>>>  void kfd_unref_process(struct kfd_process *p);
>>>>>  int kfd_process_evict_queues(struct kfd_process *p, bool force);
>>>>>  int kfd_process_restore_queues(struct kfd_process *p);
>>>>> +void kfd_kill_all_user_processes(void);
>>>>>  void kfd_suspend_all_processes(bool force);
>>>>>  /*
>>>>>   * kfd_resume_all_processes:
>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>>>> index 6cdc855abb6d..17e769e6951d 100644
>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>>>> @@ -46,6 +46,9 @@ struct mm_struct;
>>>>>  #include "kfd_trace.h"
>>>>>  #include "kfd_debug.h"
>>>>> +static atomic_t kfd_process_locked = ATOMIC_INIT(0);
>>>>> +static atomic_t kfd_inflight_kills = ATOMIC_INIT(0);
>>>>> +
>>>>>  /*
>>>>>   * List of struct kfd_process (field kfd_process).
>>>>>   * Unique/indexed by mm_struct*
>>>>> @@ -802,6 +805,9 @@ struct kfd_process *kfd_create_process(struct task_struct *thread)
>>>>>  	struct kfd_process *process;
>>>>>  	int ret;
>>>>> +	if (atomic_read(&kfd_process_locked) > 0)
>>>>> +		return ERR_PTR(-EINVAL);
>>>>> +
>>>>>  	if (!(thread->mm && mmget_not_zero(thread->mm)))
>>>>>  		return ERR_PTR(-EINVAL);
>>>>> @@ -1126,6 +1132,10 @@ static void kfd_process_wq_release(struct work_struct *work)
>>>>>  	put_task_struct(p->lead_thread);
>>>>>  	kfree(p);
>>>>> +
>>>>> +	if (atomic_read(&kfd_process_locked) > 0) {
>>>>> +		atomic_dec(&kfd_inflight_kills);
>>>>> +	}
>>>>>  }
>>>>>  static void kfd_process_ref_release(struct kref *ref)
>>>>> @@ -2186,6 +2196,35 @@ static void restore_process_worker(struct work_struct *work)
>>>>>  		pr_err("Failed to restore queues of pasid 0x%x\n", p->pasid);
>>>>>  }
>>>>> +void kfd_kill_all_user_processes(void)
>>>>> +{
>>>>> +	struct kfd_process *p;
>>>>> +	/* struct amdkfd_process_info *p_info; */
>>>>> +	unsigned int temp;
>>>>> +	int idx;
>>>>> +
>>>>> +	atomic_inc(&kfd_process_locked);
>>>>> +
>>>>> +	idx = srcu_read_lock(&kfd_processes_srcu);
>>>>> +	pr_info("Killing all processes\n");
>>>>> +	hash_for_each_rcu(kfd_processes_table, temp, p, kfd_processes) {
>>>>> +		dev_warn(kfd_device,
>>>>> +			 "Sending SIGBUS to process %d (pasid 0x%x)",
>>>>> +			 p->lead_thread->pid, p->pasid);
>>>>> +		send_sig(SIGBUS, p->lead_thread, 0);
>>>>> +		atomic_inc(&kfd_inflight_kills);
>>>>> +	}
>>>>> +	srcu_read_unlock(&kfd_processes_srcu, idx);
>>>>> +
>>>>> +	while (atomic_read(&kfd_inflight_kills) > 0) {
>>>>> +		dev_warn(kfd_device,
>>>>> +			 "kfd_processes_table is not empty, going to sleep for 10ms\n");
>>>>> +		msleep(10);
>>>>> +	}
>>>>> +
>>>>> +	atomic_dec(&kfd_process_locked);
>>>>> +	pr_info("all processes has been fully released");
>>>>> +}
>>>>> +
>>>>>  void kfd_suspend_all_processes(bool force)
>>>>>  {
>>>>>  	struct kfd_process *p;
>>>>>
>>>>>
>>>>>
>>>>> Regards,
>>>>> Shuotao
>>>>>
>>>>>>
>>>>>> Andrey
>>>>>>
>>>>>>> +	}
>>>>>>> +	srcu_read_unlock(&kfd_processes_srcu, idx);
>>>>>>> +}
>>>>>>> +
>>>>>>> +
>>>>>>>  int kfd_resume_all_processes(bool sync)
>>>>>>>  {
>>>>>>>  	struct kfd_process *p;
>>>>>>>
>>>>>>>
>>>>>>>> Andrey
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Really appreciate your help!
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Shuotao
>>>>>>>>>
>>>>>>>>>>> 2. Remove redundant p2p/io links in sysfs when a device is
>>>>>>>>>>> hotplugged out.
>>>>>>>>>>>
>>>>>>>>>>> 3. A new kfd node_id is not properly assigned after a new
>>>>>>>>>>> device is added following a gpu hotplug-out in a system.
>>>>>>>>>>> libhsakmt will find this anomaly (i.e. node_from != <dev node
>>>>>>>>>>> id> in iolinks) when taking a topology_snapshot, and thus
>>>>>>>>>>> returns a fault to the rocm stack.
>>>>>>>>>>>
>>>>>>>>>>> -- This patch fixes issue 1; another patch by Mukul fixes
>>>>>>>>>>> issues 2&3.
>>>>>>>>>>> -- Tested on a 4-GPU MI100 node with kernel 5.13.0-kfd;
>>>>>>>>>>> kernel 5.16.0-kfd is unstable out of the box for MI100.
>>>>>>>>>>> ---
>>>>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 5 +++++
>>>>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h | 7 +++++++
>>>>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 1 +
>>>>>>>>>>> drivers/gpu/drm/amd/amdkfd/kfd_device.c | 13 +++++++++++++
>>>>>>>>>>> 4 files changed, 26 insertions(+)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>>>>>>>>> index c18c4be1e4ac..d50011bdb5c4 100644
>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>>>>>>>>> @@ -213,6 +213,11 @@ int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm)
>>>>>>>>>>>  	return r;
>>>>>>>>>>>  }
>>>>>>>>>>>
>>>>>>>>>>> +int amdgpu_amdkfd_resume_processes(void)
>>>>>>>>>>> +{
>>>>>>>>>>> +	return kgd2kfd_resume_processes();
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>>  int amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev)
>>>>>>>>>>>  {
>>>>>>>>>>>  	int r = 0;
>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>>>>>>>>>> index f8b9f27adcf5..803306e011c3 100644
>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>>>>>>>>>> @@ -140,6 +140,7 @@ void amdgpu_amdkfd_fini(void);
>>>>>>>>>>>  void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm);
>>>>>>>>>>>  int amdgpu_amdkfd_resume_iommu(struct amdgpu_device *adev);
>>>>>>>>>>>  int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm);
>>>>>>>>>>> +int amdgpu_amdkfd_resume_processes(void);
>>>>>>>>>>>  void amdgpu_amdkfd_interrupt(struct amdgpu_device *adev,
>>>>>>>>>>>  		const void *ih_ring_entry);
>>>>>>>>>>>  void amdgpu_amdkfd_device_probe(struct amdgpu_device *adev);
>>>>>>>>>>> @@ -347,6 +348,7 @@ void kgd2kfd_device_exit(struct kfd_dev *kfd);
>>>>>>>>>>>  void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm);
>>>>>>>>>>>  int kgd2kfd_resume_iommu(struct kfd_dev *kfd);
>>>>>>>>>>>  int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm);
>>>>>>>>>>> +int kgd2kfd_resume_processes(void);
>>>>>>>>>>>  int kgd2kfd_pre_reset(struct kfd_dev *kfd);
>>>>>>>>>>>  int kgd2kfd_post_reset(struct kfd_dev *kfd);
>>>>>>>>>>>  void kgd2kfd_interrupt(struct kfd_dev *kfd, const void *ih_ring_entry);
>>>>>>>>>>> @@ -393,6 +395,11 @@ static inline int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
>>>>>>>>>>>  	return 0;
>>>>>>>>>>>  }
>>>>>>>>>>>
>>>>>>>>>>> +static inline int kgd2kfd_resume_processes(void)
>>>>>>>>>>> +{
>>>>>>>>>>> +	return 0;
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>>  static inline int kgd2kfd_pre_reset(struct kfd_dev *kfd)
>>>>>>>>>>>  {
>>>>>>>>>>>  	return 0;
>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>> index fa4a9f13c922..5827b65b7489 100644
>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>> @@ -4004,6 +4004,7 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
>>>>>>>>>>>  	if (drm_dev_is_unplugged(adev_to_drm(adev)))
>>>>>>>>>>>  		amdgpu_device_unmap_mmio(adev);
>>>>>>>>>>>
>>>>>>>>>>> +	amdgpu_amdkfd_resume_processes();
>>>>>>>>>>>  }
>>>>>>>>>>>
>>>>>>>>>>>  void amdgpu_device_fini_sw(struct amdgpu_device *adev)
>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>>>>>>>> index 62aa6c9d5123..ef05aae9255e 100644
>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>>>>>>>> @@ -714,6 +714,19 @@ int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
>>>>>>>>>>>  	return ret;
>>>>>>>>>>>  }
>>>>>>>>>>>
>>>>>>>>>>> +/* for non-runtime resume only */
>>>>>>>>>>> +int kgd2kfd_resume_processes(void)
>>>>>>>>>>> +{
>>>>>>>>>>> +	int count;
>>>>>>>>>>> +
>>>>>>>>>>> +	count = atomic_dec_return(&kfd_locked);
>>>>>>>>>>> +	WARN_ONCE(count < 0, "KFD suspend / resume ref. error");
>>>>>>>>>>> +	if (count == 0)
>>>>>>>>>>> +		return kfd_resume_all_processes();
>>>>>>>>>>> +
>>>>>>>>>>> +	return 0;
>>>>>>>>>>> +}
>>>>>>>>>>
>>>>>>>>>> It doesn't make sense to me to just increment kfd_locked in
>>>>>>>>>> kgd2kfd_suspend only to decrement it again a few functions
>>>>>>>>>> down the road.
>>>>>>>>>>
>>>>>>>>>> I suggest this instead - you only increment if not during PCI
>>>>>>>>>> remove:
>>>>>>>>>>
>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>>>>>>> index 1c2cf3a33c1f..7754f77248a4 100644
>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>>>>>>> @@ -952,11 +952,12 @@ bool kfd_is_locked(void)
>>>>>>>>>>
>>>>>>>>>>  void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
>>>>>>>>>>  {
>>>>>>>>>> +
>>>>>>>>>>  	if (!kfd->init_complete)
>>>>>>>>>>  		return;
>>>>>>>>>>
>>>>>>>>>>  	/* for runtime suspend, skip locking kfd */
>>>>>>>>>> -	if (!run_pm) {
>>>>>>>>>> +	if (!run_pm && !drm_dev_is_unplugged(kfd->ddev)) {
>>>>>>>>>>  		/* For first KFD device suspend all the KFD processes */
>>>>>>>>>>  		if (atomic_inc_return(&kfd_locked) == 1)
>>>>>>>>>>  			kfd_suspend_all_processes();
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Andrey
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> +
>>>>>>>>>>>  int kgd2kfd_resume_iommu(struct kfd_dev *kfd)
>>>>>>>>>>>  {
>>>>>>>>>>>  	int err = 0;
>>>>>>>
>>>>>
>>>
>