[EXTERNAL] [PATCH 2/2] drm/amdkfd: Add PCIe Hotplug Support for AMDKFD
Andrey Grodzovsky
andrey.grodzovsky at amd.com
Wed Apr 27 16:04:03 UTC 2022
On 2022-04-27 05:20, Shuotao Xu wrote:
> Hi Andrey,
>
> Sorry that I did not have time to work on this for a few days.
>
> I just tried the sysfs crash fix on Radeon VII and it seems that it
> worked. It did not pass the last hotplug test, but my version has 4
> tests instead of 3 in your case.
That's because the 4th one is only enabled when there are 2 cards in the
system - to test DRI_PRIME export. This time I tested with only one card.
>
>
> Suite: Hotunplug Tests
> Test: Unplug card and rescan the bus to plug it back
> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
> passed
> Test: Same as first test but with command submission
> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
> passed
> Test: Unplug with exported bo
> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
> passed
> Test: Unplug with exported fence
> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
> amdgpu_device_initialize: amdgpu_get_auth (1) failed (-1)
On the kernel side, the IOCTL returning this is drm_getclient - maybe
take a look at why it can't find the client? I didn't have such an issue
when testing, as far as I remember.
> FAILED
> 1. ../tests/amdgpu/hotunplug_tests.c:368 - CU_ASSERT_EQUAL(r,0)
> 2. ../tests/amdgpu/hotunplug_tests.c:411 -
> CU_ASSERT_EQUAL(amdgpu_cs_import_syncobj(device2, shared_fd,
> &sync_obj_handle2),0)
> 3. ../tests/amdgpu/hotunplug_tests.c:423 -
> CU_ASSERT_EQUAL(amdgpu_cs_syncobj_wait(device2, &sync_obj_handle2, 1,
> 100000000, 0, NULL),0)
> 4. ../tests/amdgpu/hotunplug_tests.c:425 -
> CU_ASSERT_EQUAL(amdgpu_cs_destroy_syncobj(device2, sync_obj_handle2),0)
>
> Run Summary:    Type  Total    Ran Passed Failed Inactive
>               suites     14      1    n/a      0        0
>                tests     71      4      3      1        0
>              asserts     39     39     35      4      n/a
>
> Elapsed time = 17.321 seconds
>
> For kfd compute, there is a problem, which I did not see on MI100,
> after I kill the hung application following hot plug-out. I was using
> the rocm5.0.2 driver for the MI100 card, and I am not sure whether it
> is a regression from the newer driver.
> After pkill, one of the children of the user process would be stuck in
> zombie state (Z), understandably because of the bug, and any future
> ROCm application after plug-back would be in uninterruptible sleep (D)
> because it would not return from the syscall to kfd.
>
> The drm tests for amdgpu, however, would run just fine after
> plug-back, even with the dangling kfd state.
It is not clear to me when the crash below happens. Is it related to
what you describe above?
>
> I don’t know if there is a quick fix for it. I was thinking of adding
> drm_dev_enter/drm_dev_exit to amdgpu_device_rreg.
Try adding a drm_dev_enter/drm_dev_exit pair at the highest level that
attempts to access HW - in this case it's amdgpu_amdkfd_set_compute_idle.
We always try to avoid calling any HW access functions after the backing
device is gone.
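To illustrate the guard placement, here is a minimal userspace sketch,
not driver code. All model_* names are made up for illustration; in the
kernel the real calls are drm_dev_enter(dev, &idx) and drm_dev_exit(idx).
The point is only that the check sits at the top-level entry point, so
the low-level register read is never reached once the device is gone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a device; NOT the real struct drm_device. */
struct model_dev {
    bool unplugged;         /* set once the PCIe device has been removed */
    bool compute_idle;      /* last state programmed into the "hardware" */
    unsigned bad_accesses;  /* register reads attempted after unplug */
};

/* Models drm_dev_enter(): returns false once the device is unplugged. */
static bool model_dev_enter(const struct model_dev *dev)
{
    assert(dev != NULL);
    return !dev->unplugged;
}

/* Models drm_dev_exit(): nothing to release in this simplified model. */
static void model_dev_exit(const struct model_dev *dev)
{
    (void)dev;
}

/* Lowest level: in the real driver, a read here after unplug is what
 * ends in the page fault. The model just counts such accesses. */
static uint32_t model_rreg(struct model_dev *dev, uint32_t reg)
{
    if (dev->unplugged) {
        dev->bad_accesses++;
        return 0xffffffffu;
    }
    return reg;
}

/* Middle layer: touches hardware unconditionally, like the power code
 * in the trace. The register offset is arbitrary. */
static void model_switch_power_profile(struct model_dev *dev, bool idle)
{
    (void)model_rreg(dev, 0x10);
    dev->compute_idle = idle;
}

/* Top-level entry point: the guard goes here, not in model_rreg(). */
bool model_set_compute_idle(struct model_dev *dev, bool idle)
{
    if (!model_dev_enter(dev))
        return false;       /* device gone: skip all HW access */
    model_switch_power_profile(dev, idle);
    model_dev_exit(dev);
    return true;
}
```

With the guard at this level, a call made after unplug returns early and
the bad_accesses counter in the model stays at zero.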
> Also, I have been trying to fix the hotplug issue for kfd applications
> for a long time.
> I don’t know 1) if I would be able to get to MI100 (fixing Radeon VII
> would mean something but MI100 is more important for us); 2) what
> direction the patch for this issue will take going forward.
I will go to the office tomorrow to pick up an MI-100. Time and priorities
permitting, I will then try to test it and fix any bugs so that it passes
all hot plug libdrm tests at the tip of public amd-staging-drm-next -
https://gitlab.freedesktop.org/agd5f/linux. After that you can try to
continue working on ROCm enabling on top of that.
For now I suggest you move on with Radeon 7 as your development ASIC and
use the fix I mentioned above.
Andrey
>
> Regards,
> Shuotao
>
> [ +0.001645] BUG: unable to handle page fault for address:
> 0000000000058a68
> [ +0.001298] #PF: supervisor read access in kernel mode
> [ +0.001252] #PF: error_code(0x0000) - not-present page
> [ +0.001248] PGD 8000000115806067 P4D 8000000115806067 PUD 109b2d067
> PMD 0
> [ +0.001270] Oops: 0000 [#1] PREEMPT SMP PTI
> [ +0.001256] CPU: 5 PID: 13818 Comm: tf_cnn_benchmar Tainted: G
> W E 5.16.0+ #3
> [ +0.001290] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS
> 1.5.4 [FPGA Test BIOS] 10/002/2015
> [ +0.001309] RIP: 0010:amdgpu_device_rreg.part.24+0xa9/0xe0 [amdgpu]
> [ +0.001562] Code: e8 8c 7d 02 00 65 ff 0d 65 e0 7f 3f 75 ae 0f 1f 44
> 00 00 eb a7 83 e2 02 75 09 f6 87 10 69 01 00 10 75 0d 4c 03 a3 a0 09
> 00 00 <45> 8b 24 24 eb 8a 4c 8d b7 b0 6b 01 00 4c 89 f7 e8 a2 4c 2e ca 85
> [ +0.002751] RSP: 0018:ffffb58fac313928 EFLAGS: 00010202
> [ +0.001388] RAX: ffffffffc09a4270 RBX: ffff8b0c9c840000 RCX:
> 00000000ffffffff
> [ +0.001402] RDX: 0000000000000000 RSI: 000000000001629a RDI:
> ffff8b0c9c840000
> [ +0.001418] RBP: ffffb58fac313948 R08: 0000000000000021 R09:
> 0000000000000001
> [ +0.001421] R10: ffffb58fac313b30 R11: ffffffff8c065b00 R12:
> 0000000000058a68
> [ +0.001400] R13: 000000000001629a R14: 0000000000000000 R15:
> 000000000001629a
> [ +0.001397] FS: 0000000000000000(0000) GS:ffff8b4b7fa80000(0000)
> knlGS:0000000000000000
> [ +0.001411] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ +0.001405] CR2: 0000000000058a68 CR3: 000000010a2c8001 CR4:
> 00000000001706e0
> [ +0.001422] Call Trace:
> [ +0.001407] <TASK>
> [ +0.001391] amdgpu_device_rreg+0x17/0x20 [amdgpu]
> [ +0.001614] amdgpu_cgs_read_register+0x14/0x20 [amdgpu]
> [ +0.001735] phm_wait_for_register_unequal.part.1+0x58/0x90 [amdgpu]
> [ +0.001790] phm_wait_for_register_unequal+0x1a/0x30 [amdgpu]
> [ +0.001800] vega20_wait_for_response+0x28/0x80 [amdgpu]
> [ +0.001757] vega20_send_msg_to_smc_with_parameter+0x21/0x110 [amdgpu]
> [ +0.001838] smum_send_msg_to_smc_with_parameter+0xcd/0x100 [amdgpu]
> [ +0.001829] ? kvfree+0x1e/0x30
> [ +0.001462] vega20_set_power_profile_mode+0x58/0x330 [amdgpu]
> [ +0.001868] ? kvfree+0x1e/0x30
> [ +0.001462] ? ttm_bo_release+0x261/0x370 [ttm]
> [ +0.001467] pp_dpm_switch_power_profile+0xc2/0x170 [amdgpu]
> [ +0.001863] amdgpu_dpm_switch_power_profile+0x6b/0x90 [amdgpu]
> [ +0.001866] amdgpu_amdkfd_set_compute_idle+0x1a/0x20 [amdgpu]
> [ +0.001784] kfd_dec_compute_active+0x2c/0x50 [amdgpu]
> [ +0.001744] process_termination_cpsch+0x2f9/0x3a0 [amdgpu]
> [ +0.001728] kfd_process_dequeue_from_all_devices+0x49/0x70 [amdgpu]
> [ +0.001730] kfd_process_notifier_release+0x91/0xe0 [amdgpu]
> [ +0.001718] __mmu_notifier_release+0x77/0x1f0
> [ +0.001411] exit_mmap+0x1b5/0x200
> [ +0.001396] ? __switch_to+0x12d/0x3e0
> [ +0.001388] ? __switch_to_asm+0x36/0x70
> [ +0.001372] ? preempt_count_add+0x74/0xc0
> [ +0.001364] mmput+0x57/0x110
> [ +0.001349] do_exit+0x33d/0xc20
> [ +0.001337] ? _raw_spin_unlock+0x1a/0x30
> [ +0.001346] do_group_exit+0x43/0xa0
> [ +0.001341] get_signal+0x131/0x920
> [ +0.001295] arch_do_signal_or_restart+0xb1/0x870
> [ +0.001303] ? do_futex+0x125/0x190
> [ +0.001285] exit_to_user_mode_prepare+0xb1/0x1c0
> [ +0.001282] syscall_exit_to_user_mode+0x2a/0x40
> [ +0.001264] do_syscall_64+0x46/0xb0
> [ +0.001236] entry_SYSCALL_64_after_hwframe+0x44/0xae
> [ +0.001219] RIP: 0033:0x7f6aff1d2ad3
> [ +0.001177] Code: Unable to access opcode bytes at RIP 0x7f6aff1d2aa9.
> [ +0.001166] RSP: 002b:00007f6ab2029d20 EFLAGS: 00000246 ORIG_RAX:
> 00000000000000ca
> [ +0.001170] RAX: fffffffffffffe00 RBX: 0000000004f542b0 RCX:
> 00007f6aff1d2ad3
> [ +0.001168] RDX: 0000000000000000 RSI: 0000000000000080 RDI:
> 0000000004f542d8
> [ +0.001162] RBP: 0000000004f542d4 R08: 0000000000000000 R09:
> 0000000000000000
> [ +0.001152] R10: 0000000000000000 R11: 0000000000000246 R12:
> 0000000004f542d8
> [ +0.001176] R13: 0000000000000000 R14: 0000000004f54288 R15:
> 0000000000000000
> [ +0.001152] </TASK>
> [ +0.001113] Modules linked in: veth amdgpu(E) nf_conntrack_netlink
> nfnetlink xfrm_user xt_addrtype br_netfilter xt_CHECKSUM
> iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack
> nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4
> xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter
> ip6_tables iptable_filter overlay esp6_offload esp6 esp4_offload esp4
> xfrm_algo intel_rapl_msr intel_rapl_common sb_edac
> x86_pkg_temp_thermal intel_powerclamp snd_hda_codec_hdmi snd_hda_intel
> ipmi_ssif snd_intel_dspcfg coretemp snd_hda_codec kvm_intel
> snd_hda_core snd_hwdep snd_pcm snd_timer snd kvm soundcore irqbypass
> ftdi_sio usbserial input_leds iTCO_wdt iTCO_vendor_support joydev
> mei_me rapl lpc_ich intel_cstate mei ipmi_si ipmi_devintf
> ipmi_msghandler mac_hid acpi_power_meter sch_fq_codel ib_iser rdma_cm
> iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi
> scsi_transport_iscsi ip_tables x_tables autofs4 btrfs blake2b_generic
> zstd_compress raid10 raid456
> [ +0.000102] async_raid6_recov async_memcpy async_pq async_xor
> async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear iommu_v2
> gpu_sched drm_ttm_helper mgag200 ttm drm_shmem_helper drm_kms_helper
> syscopyarea sysfillrect sysimgblt fb_sys_fops crct10dif_pclmul
> hid_generic crc32_pclmul ghash_clmulni_intel usbhid uas aesni_intel
> crypto_simd igb ahci hid drm usb_storage cryptd libahci dca
> megaraid_sas i2c_algo_bit wmi [last unloaded: amdgpu]
> [ +0.016626] CR2: 0000000000058a68
> [ +0.001550] ---[ end trace ff90849fe0a8b3b4 ]---
> [ +0.024953] RIP: 0010:amdgpu_device_rreg.part.24+0xa9/0xe0 [amdgpu]
> [ +0.001814] Code: e8 8c 7d 02 00 65 ff 0d 65 e0 7f 3f 75 ae 0f 1f 44
> 00 00 eb a7 83 e2 02 75 09 f6 87 10 69 01 00 10 75 0d 4c 03 a3 a0 09
> 00 00 <45> 8b 24 24 eb 8a 4c 8d b7 b0 6b 01 00 4c 89 f7 e8 a2 4c 2e ca 85
> [ +0.003255] RSP: 0018:ffffb58fac313928 EFLAGS: 00010202
> [ +0.001641] RAX: ffffffffc09a4270 RBX: ffff8b0c9c840000 RCX:
> 00000000ffffffff
> [ +0.001656] RDX: 0000000000000000 RSI: 000000000001629a RDI:
> ffff8b0c9c840000
> [ +0.001681] RBP: ffffb58fac313948 R08: 0000000000000021 R09:
> 0000000000000001
> [ +0.001662] R10: ffffb58fac313b30 R11: ffffffff8c065b00 R12:
> 0000000000058a68
> [ +0.001650] R13: 000000000001629a R14: 0000000000000000 R15:
> 000000000001629a
> [ +0.001648] FS: 0000000000000000(0000) GS:ffff8b4b7fa80000(0000)
> knlGS:0000000000000000
> [ +0.001668] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ +0.001673] CR2: 0000000000058a68 CR3: 000000010a2c8001 CR4:
> 00000000001706e0
> [ +0.001740] Fixing recursive fault but reboot is needed!
>
>
>> On Apr 21, 2022, at 2:41 AM, Andrey Grodzovsky
>> <andrey.grodzovsky at amd.com> wrote:
>>
>> I retested the hot plug tests at the commit I mentioned below - looks
>> ok. My ASIC is Navi 10; I also tested using Vega 10 and older Polaris
>> ASICs (whatever I had at home at the time). It's possible there are
>> extra issues in ASICs like yours which I didn't cover during tests.
>>
>> andrey at andrey-test:~/drm$ sudo ./build/tests/amdgpu/amdgpu_test -s 13
>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>
>>
>> The ASIC NOT support UVD, suite disabled
>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>
>>
>> The ASIC NOT support VCE, suite disabled
>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>
>>
>> The ASIC NOT support UVD ENC, suite disabled.
>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>
>>
>> Don't support TMZ (trust memory zone), security suite disabled
>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>> Peer device is not opened or has ASIC not supported by the suite,
>> skip all Peer to Peer tests.
>>
>>
>> CUnit - A unit testing framework for C - Version 2.1-3
>> http://cunit.sourceforge.net/
>>
>>
>> Suite: Hotunplug Tests
>> Test: Unplug card and rescan the bus to plug it back
>> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
>> passed
>> Test: Same as first test but with command submission
>> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
>> passed
>> Test: Unplug with exported bo
>> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
>> passed
>>
>> Run Summary:    Type  Total    Ran Passed Failed Inactive
>>               suites     14      1    n/a      0        0
>>                tests     71      3      3      0        1
>>              asserts     21     21     21      0      n/a
>>
>> Elapsed time = 9.195 seconds
>>
>>
>> Andrey
>>
>> On 2022-04-20 11:44, Andrey Grodzovsky wrote:
>>>
>>> The only one I see in Radeon 7 is the same sysfs crash we already
>>> fixed, so you can use the same fix. The MI 200 issue I haven't seen
>>> yet, but I also haven't tested MI200, so I never saw it before. I
>>> need to test when I get the time.
>>>
>>> So try that fix with Radeon 7 again to see if you pass the tests
>>> (the warnings should all be minor issues).
>>>
>>> Andrey
>>>
>>>
>>> On 2022-04-20 05:24, Shuotao Xu wrote:
>>>>>
>>>>> That's a problem. The latest working baseline I tested and
>>>>> confirmed passing hotplug tests is this branch and commit
>>>>> https://gitlab.freedesktop.org/agd5f/linux/-/commit/86e12a53b73135806e101142e72f3f1c0e6fa8e6
>>>>> which is amd-staging-drm-next. 5.14 was the branch where we
>>>>> upstreamed the hotplug code, but it had a lot of regressions over
>>>>> time due to new changes (that's why I added the hotplug test, to
>>>>> try and catch them early). It would be best to run this branch on
>>>>> MI-100 so we have a clean baseline, and only after confirming
>>>>> that this particular branch at this commit passes the libdrm
>>>>> tests start adding the KFD specific addons. Another option, if
>>>>> you can't work with MI-100 and this branch, is to try a different
>>>>> ASIC that does work with this branch (if possible).
>>>>>
>>>>> Andrey
>>>>>
>>>> OK, I tried both this commit and the HEAD of amd-staging-drm-next
>>>> on two GPUs (MI100 and Radeon VII); neither passed the hotplugout
>>>> libdrm test. I might be able to gain access to MI200, but I suspect
>>>> it would work.
>>>>
>>>> I copied the complete dmesgs as follows. I highlighted the OOPSES
>>>> for you.
>>>>
>>>> Radeon VII:
>