[EXTERNAL] Re: Code Review Request for AMDGPU Hotplug Support
Andrey Grodzovsky
andrey.grodzovsky at amd.com
Wed Apr 6 15:39:30 UTC 2022
+ Felix
On 2022-04-06 11:11, Shuotao Xu wrote:
> Hi Andrey,
>
> Thanks for your kind comment on the Linux patch submission protocol; please
> let me know if there is any way to rectify it.
Just resend your patch to the amd-gfx mailing list using
git-send-email (see here for how to use it:
https://burzalodowa.wordpress.com/2013/10/05/how-to-send-patches-with-git-send-email/).
I suggest adding --cover-letter so you will be able to explain the
story behind the patch.
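For example, something along these lines (just a sketch; adjust the commit
range, output directory, and your git send-email SMTP setup as needed):

    git format-patch -1 --cover-letter -o outgoing/
    # edit outgoing/0000-cover-letter.patch to tell the story behind the patch
    git send-email --to=amd-gfx@lists.freedesktop.org outgoing/*.patch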
>
> The dmesg is fine except for some warnings during PCI rescan after the
> PCI removal of an AMD MI100.
>
> The issue is that after this, a ROCm application will segfault with the
> amdgpu driver unless the entire amdgpu kernel module is unloaded and
> reloaded, which does not meet our hotplug requirement. The issues found
> upon investigation are:
>
> 1) kfd_lock is locked after hotplug, and kfd_open will return a fault
> right away to libhsakmt.
I see now: kfd_lock is static, so there is a single instance across all
devices, and it does not go away after device removal, only after driver
unload. In that case I am not sure it is the best idea to just decrement
kfd_lock on device init, since in a multi-GPU system it might be locked on
purpose because, for example, another device is going through a reset at
that very moment.
Felix, kgd2kfd_suspend is also called during device PCI remove, which leaves
the kfd_locked count unbalanced. Maybe we should not touch it in that path;
I am thinking of adding a drm_dev_enter guard in kgd2kfd_suspend so that we
skip the kfd_locked update while we are in the middle of a PCI remove.
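Roughly what I have in mind is the sketch below. This is only an illustration
of the idea, not a patch: the exact kgd2kfd_suspend() body, the kfd_locked
type, and the kfd->adev / adev_to_drm() plumbing are assumptions on my side
that need to be checked against amd-staging-drm-next.

#include <drm/drm_drv.h>        /* drm_dev_enter()/drm_dev_exit() */

void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
{
        int idx;

        if (!kfd->init_complete)
                return;

        /* For runtime suspend, skip locking KFD. */
        if (!run_pm) {
                /*
                 * Only bump kfd_locked while the DRM device is still alive.
                 * If drm_dev_unplug() has already run (PCI remove in
                 * progress) we skip it, so the suspend/resume count stays
                 * balanced and kfd_open() is not blocked forever afterwards.
                 */
                if (drm_dev_enter(adev_to_drm(kfd->adev), &idx)) {
                        if (atomic_inc_return(&kfd_locked) == 1)
                                kfd_suspend_all_processes();
                        drm_dev_exit(idx);
                }
        }

        kfd->dqm->ops.stop(kfd->dqm);
        kfd_iommu_suspend(kfd);
}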
>
> 2) iolink/p2plink has anomalies after hotplug, and libhsakmt will find
> these anomalies and return an error.
Can you point to the specific abnormalities? As part of PCI hot unplug we
clean up all the sysfs files, and this looks like part of that. Do you see a
"sysfs file already exists" error on the next pci_rescan?
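To make it concrete, a before/after dump of the KFD topology that libhsakmt
parses would help; something along these lines (paths quoted from memory, so
please double-check them on your system):

    # dump every io_link that the thunk (libhsakmt) will parse
    for n in /sys/devices/virtual/kfd/kfd/topology/nodes/*; do
            echo "== $n =="
            cat "$n"/io_links/*/properties 2>/dev/null
    done

Run it once before the unplug and once after the pci_rescan, and diff the two.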
>
> Our patch has been tested with a single-instance AMD MI100 GPU and it
> worked.
Exactly. In a multi-GPU system, arbitrarily decrementing kfd_lock on device
load can be problematic.
Andrey
>
> I am attaching the dmesg after rescan anyway, which will show the
> warning and the segfault.
>
> [ 132.054822] pci 0000:43:00.0: [1002:738c] type 00 class 0x038000
>
> [ 132.054856] pci 0000:43:00.0: reg 0x10: [mem
> 0x38b000000000-0x38b7ffffffff 64bit pref]
>
> [ 132.054877] pci 0000:43:00.0: reg 0x18: [mem
> 0x38b800000000-0x38b8001fffff 64bit pref]
>
> [ 132.054890] pci 0000:43:00.0: reg 0x20: [io 0xa000-0xa0ff]
>
> [ 132.054904] pci 0000:43:00.0: reg 0x24: [mem 0xb8400000-0xb847ffff]
>
> [ 132.054918] pci 0000:43:00.0: reg 0x30: [mem 0xb8480000-0xb849ffff pref]
>
> [ 132.055134] pci 0000:43:00.0: PME# supported from D1 D2 D3hot D3cold
>
> [ 132.055217] pci 0000:43:00.0: 63.008 Gb/s available PCIe bandwidth,
> limited by 8.0 GT/s PCIe x8 link at 0000:3c:14.0 (capable of 252.048
> Gb/s with 16.0 GT/s PCIe x16 link)
>
> [ 132.056001] pci 0000:43:00.0: Adding to iommu group 73
>
> [ 132.057943] pci 0000:43:00.0: BAR 0: assigned [mem
> 0x38b000000000-0x38b7ffffffff 64bit pref]
>
> [ 132.057960] pci 0000:43:00.0: BAR 2: assigned [mem
> 0x38b800000000-0x38b8001fffff 64bit pref]
>
> [ 132.057974] pci 0000:43:00.0: BAR 5: assigned [mem 0xb8400000-0xb847ffff]
>
> [ 132.057981] pci 0000:43:00.0: BAR 6: assigned [mem
> 0xb8480000-0xb849ffff pref]
>
> [ 132.057984] pci 0000:43:00.0: BAR 4: assigned [io 0xa000-0xa0ff]
>
> [ 132.058429] ======================================================
>
> [ 132.058453] WARNING: possible circular locking dependency detected
>
> [ 132.058477] 5.16.0-kfd+ #1 Not tainted
>
> [ 132.058492] ------------------------------------------------------
>
> [ 132.058515] bash/3632 is trying to acquire lock:
>
> [ 132.058534] ffffadee20adfb50
> ((work_completion)(&wfc.work)){+.+.}-{0:0}, at: __flush_work+0x2f5/0x470
>
> [ 132.058554] [drm] initializing kernel modesetting (ARCTURUS
> 0x1002:0x738C 0x1002:0x0C34 0x01).
>
> [ 132.058577]
>
> but task is already holding lock:
>
> [ 132.058580] ffffffffa3c62308
>
> [ 132.058630] amdgpu 0000:43:00.0: amdgpu: Trusted Memory Zone (TMZ)
> feature not supported
>
> [ 132.058638] (
>
> [ 132.058678] [drm] register mmio base: 0xB8400000
>
> [ 132.058683] pci_rescan_remove_lock
>
> [ 132.058694] [drm] register mmio size: 524288
>
> [ 132.058713] ){+.+.}-{3:3}, at: rescan_store+0x55/0x90
>
> [ 132.058773]
>
> which lock already depends on the new lock.
>
> [ 132.058804]
>
> the existing dependency chain (in reverse order) is:
>
> [ 132.058819] [drm] add ip block number 0 <soc15_common>
>
> [ 132.058831]
>
> -> #1 (
>
> [ 132.058854] [drm] add ip block number 1 <gmc_v9_0>
>
> [ 132.058858] [drm] add ip block number 2 <vega20_ih>
>
> [ 132.058874] pci_rescan_remove_lock
>
> [ 132.058894] [drm] add ip block number 3 <psp>
>
> [ 132.058915] ){+.+.}-{3:3}
>
> [ 132.058931] [drm] add ip block number 4 <smu>
>
> [ 132.058951] :
>
> [ 132.058965] [drm] add ip block number 5 <gfx_v9_0>
>
> [ 132.058986] __mutex_lock+0xa4/0x990
>
> [ 132.058996] [drm] add ip block number 6 <sdma_v4_0>
>
> [ 132.059016] i801_add_tco_spt.isra.20+0x2a/0x1a0
>
> [ 132.059033] [drm] add ip block number 7 <vcn_v2_5>
>
> [ 132.059054] i801_add_tco+0xf6/0x110
>
> [ 132.059075] [drm] add ip block number 8 <jpeg_v2_5>
>
> [ 132.059096] i801_probe+0x402/0x860
>
> [ 132.059151] local_pci_probe+0x40/0x90
>
> [ 132.059170] work_for_cpu_fn+0x10/0x20
>
> [ 132.059189] process_one_work+0x2a4/0x640
>
> [ 132.059208] worker_thread+0x228/0x3f0
>
> [ 132.059227] kthread+0x16d/0x1a0
>
> [ 132.059795] ret_from_fork+0x1f/0x30
>
> [ 132.060337]
>
> -> #0 ((work_completion)(&wfc.work)){+.+.}-{0:0}:
>
> [ 132.061405] __lock_acquire+0x1552/0x1ac0
>
> [ 132.061950] lock_acquire+0x26c/0x300
>
> [ 132.062484] __flush_work+0x315/0x470
>
> [ 132.063009] work_on_cpu+0x98/0xc0
>
> [ 132.063526] pci_device_probe+0x1bc/0x1d0
>
> [ 132.064036] really_probe+0x102/0x450
>
> [ 132.064532] __driver_probe_device+0x100/0x170
>
> [ 132.065020] driver_probe_device+0x1f/0xa0
>
> [ 132.065497] __device_attach_driver+0x6b/0xe0
>
> [ 132.065975] bus_for_each_drv+0x6a/0xb0
>
> [ 132.066449] __device_attach+0xe2/0x160
>
> [ 132.066912] pci_bus_add_device+0x4a/0x80
>
> [ 132.067365] pci_bus_add_devices+0x2c/0x70
>
> [ 132.067812] pci_bus_add_devices+0x65/0x70
>
> [ 132.068253] pci_bus_add_devices+0x65/0x70
>
> [ 132.068688] pci_bus_add_devices+0x65/0x70
>
> [ 132.068936] amdgpu 0000:43:00.0: amdgpu: Fetched VBIOS from ROM BAR
>
> [ 132.069109] pci_bus_add_devices+0x65/0x70
>
> [ 132.069602] amdgpu: ATOM BIOS: 113-D3431401-X00
>
> [ 132.070058] pci_bus_add_devices+0x65/0x70
>
> [ 132.070572] [drm] VCN(0) decode is enabled in VM mode
>
> [ 132.070997] pci_rescan_bus+0x23/0x30
>
> [ 132.071000] rescan_store+0x61/0x90
>
> [ 132.071003] kernfs_fop_write_iter+0x132/0x1b0
>
> [ 132.071501] [drm] VCN(1) decode is enabled in VM mode
>
> [ 132.071964] new_sync_write+0x11f/0x1b0
>
> [ 132.072432] [drm] VCN(0) encode is enabled in VM mode
>
> [ 132.072900] vfs_write+0x35b/0x3b0
>
> [ 132.073376] [drm] VCN(1) encode is enabled in VM mode
>
> [ 132.073847] ksys_write+0xa7/0xe0
>
> [ 132.074335] [drm] JPEG(0) JPEG decode is enabled in VM mode
>
> [ 132.074803] do_syscall_64+0x34/0x80
>
> [ 132.074808] entry_SYSCALL_64_after_hwframe+0x44/0xae
>
> [ 132.074811]
>
> other info that might help us debug this:
>
> [ 132.074813] Possible unsafe locking scenario:
>
> [ 132.075302] [drm] JPEG(1) JPEG decode is enabled in VM mode
>
> [ 132.075779] CPU0 CPU1
>
> [ 132.076361] amdgpu 0000:43:00.0: amdgpu: MEM ECC is active.
>
> [ 132.076765] ---- ----
>
> [ 132.077265] amdgpu 0000:43:00.0: amdgpu: SRAM ECC is active.
>
> [ 132.078649] lock(pci_rescan_remove_lock);
>
> [ 132.078652]
> lock((work_completion)(&wfc.work));
>
> [ 132.078653] lock(pci_rescan_remove_lock);
>
> [ 132.078655] lock((work_completion)(&wfc.work));
>
> [ 132.078656]
>
> *** DEADLOCK ***
>
> [ 132.078656] 5 locks held by bash/3632:
>
> [ 132.078658] #0: ffff9c39c7b89438
>
> [ 132.079612] amdgpu 0000:43:00.0: amdgpu: RAS INFO: ras initialized
> successfully, hardware ability[7fff] ras_mask[7fff]
>
> [ 132.080089] (
>
> [ 132.080602] [drm] vm size is 262144 GB, 4 levels, block size is
> 9-bit, fragment size is 9-bit
>
> [ 132.081082] sb_writers
>
> [ 132.081601] amdgpu 0000:43:00.0: amdgpu: VRAM: 32752M
> 0x0000008000000000 - 0x00000087FEFFFFFF (32752M used)
>
> [ 132.082102] #6
>
> [ 132.082630] amdgpu 0000:43:00.0: amdgpu: GART: 512M
> 0x0000000000000000 - 0x000000001FFFFFFF
>
> [ 132.083152] ){.+.+}-{0:0}
>
> [ 132.083687] amdgpu 0000:43:00.0: amdgpu: AGP: 267878400M
> 0x0000008800000000 - 0x0000FFFFFFFFFFFF
>
> [ 132.084210] , at: ksys_write+0xa7/0xe0
>
> [ 132.085708] [drm] Detected VRAM RAM=32752M, BAR=32768M
>
> [ 132.086205] #1:
>
> [ 132.086733] [drm] RAM width 4096bits HBM
>
> [ 132.087269] ffff9c5959011088
>
> [ 132.087890] [drm] amdgpu: 32752M of VRAM memory ready
>
> [ 132.088389] (
>
> [ 132.088972] [drm] amdgpu: 32752M of GTT memory ready.
>
> [ 132.089572] &of->mutex
>
> [ 132.090206] [drm] GART: num cpu pages 131072, num gpu pages 131072
>
> [ 132.090804] ){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x103/0x1b0
>
> [ 132.090808] #2: ffff9c39c882a9e0 (kn->active#423){.+.+}-{0:0}, at:
> kernfs_fop_write_iter+0x10c/0x1b0
>
> [ 132.091639] [drm] PCIE GART of 512M enabled.
>
> [ 132.092117] #3:
>
> [ 132.092801] [drm] PTB located at 0x0000008000000000
>
> [ 132.093480] ffffffffa3c62308
>
> [ 132.094566] amdgpu 0000:43:00.0: amdgpu: PSP runtime database doesn't
> exist
>
> [ 132.094822] (pci_rescan_remove_lock){+.+.}-{3:3}, at:
> rescan_store+0x55/0x90
>
> [ 132.094827] #4: ffff9c597392b248 (&dev->mutex){....}-{3:3}, at:
> __device_attach+0x39/0x160
>
> [ 132.094835]
>
> stack backtrace:
>
> [ 132.097098] [drm] Found VCN firmware Version ENC: 1.1 DEC: 1 VEP: 0
> Revision: 21
>
> [ 132.097467] CPU: 47 PID: 3632 Comm: bash Not tainted 5.16.0-kfd+ #1
>
> [ 132.098169] amdgpu 0000:43:00.0: amdgpu: Will use PSP to load VCN
> firmware
>
> [ 132.098839] Hardware name: Supermicro SYS-4029GP-TRT2/X11DPG-OT-CPU,
> BIOS 2.1 08/14/2018
>
> [ 132.098841] Call Trace:
>
> [ 132.098842] <TASK>
>
> [ 132.098843] dump_stack_lvl+0x44/0x57
>
> [ 132.098848] check_noncircular+0x105/0x120
>
> [ 132.098853] ? unwind_get_return_address+0x1b/0x30
>
> [ 132.112924] ? register_lock_class+0x46/0x780
>
> [ 132.113630] ? __lock_acquire+0x1552/0x1ac0
>
> [ 132.114342] __lock_acquire+0x1552/0x1ac0
>
> [ 132.115050] lock_acquire+0x26c/0x300
>
> [ 132.115755] ? __flush_work+0x2f5/0x470
>
> [ 132.116460] ? lock_is_held_type+0xdf/0x130
>
> [ 132.117177] __flush_work+0x315/0x470
>
> [ 132.117890] ? __flush_work+0x2f5/0x470
>
> [ 132.118604] ? lock_is_held_type+0xdf/0x130
>
> [ 132.119305] ? mark_held_locks+0x49/0x70
>
> [ 132.119981] ? queue_work_on+0x2f/0x70
>
> [ 132.120645] ? lockdep_hardirqs_on+0x79/0x100
>
> [ 132.121300] work_on_cpu+0x98/0xc0
>
> [ 132.121702] [drm] reserve 0x400000 from 0x87fec00000 for PSP TMR
>
> [ 132.121947] ? __traceiter_workqueue_execute_end+0x40/0x40
>
> [ 132.123270] ? pci_device_shutdown+0x60/0x60
>
> [ 132.123880] pci_device_probe+0x1bc/0x1d0
>
> [ 132.124475] really_probe+0x102/0x450
>
> [ 132.125060] __driver_probe_device+0x100/0x170
>
> [ 132.125641] driver_probe_device+0x1f/0xa0
>
> [ 132.126215] __device_attach_driver+0x6b/0xe0
>
> [ 132.126797] ? driver_allows_async_probing+0x50/0x50
>
> [ 132.127383] ? driver_allows_async_probing+0x50/0x50
>
> [ 132.127960] bus_for_each_drv+0x6a/0xb0
>
> [ 132.128528] __device_attach+0xe2/0x160
>
> [ 132.129095] pci_bus_add_device+0x4a/0x80
>
> [ 132.129659] pci_bus_add_devices+0x2c/0x70
>
> [ 132.130213] pci_bus_add_devices+0x65/0x70
>
> [ 132.130753] pci_bus_add_devices+0x65/0x70
>
> [ 132.131283] pci_bus_add_devices+0x65/0x70
>
> [ 132.131780] pci_bus_add_devices+0x65/0x70
>
> [ 132.132270] pci_bus_add_devices+0x65/0x70
>
> [ 132.132757] pci_rescan_bus+0x23/0x30
>
> [ 132.133233] rescan_store+0x61/0x90
>
> [ 132.133701] kernfs_fop_write_iter+0x132/0x1b0
>
> [ 132.134167] new_sync_write+0x11f/0x1b0
>
> [ 132.134627] vfs_write+0x35b/0x3b0
>
> [ 132.135062] ksys_write+0xa7/0xe0
>
> [ 132.135503] do_syscall_64+0x34/0x80
>
> [ 132.135933] entry_SYSCALL_64_after_hwframe+0x44/0xae
>
> [ 132.136358] RIP: 0033:0x7f0058a73224
>
> [ 132.136775] Code: 89 02 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00
> 00 00 00 66 90 48 8d 05 c1 07 2e 00 8b 00 85 c0 75 13 b8 01 00 00 00 0f
> 05 <48> 3d 00 f0 ff ff 77 54 f3 c3 66 90 41 54 55 49 89 d4 53 48 89 f5
>
> [ 132.137663] RSP: 002b:00007ffc4f4c71a8 EFLAGS: 00000246 ORIG_RAX:
> 0000000000000001
>
> [ 132.138121] RAX: ffffffffffffffda RBX: 0000000000000002 RCX:
> 00007f0058a73224
>
> [ 132.138590] RDX: 0000000000000002 RSI: 000055d466c24450 RDI:
> 0000000000000001
>
> [ 132.139064] RBP: 000055d466c24450 R08: 000000000000000a R09:
> 0000000000000001
>
> [ 132.139532] R10: 000000000000000a R11: 0000000000000246 R12:
> 00007f0058d4f760
>
> [ 132.140003] R13: 0000000000000002 R14: 00007f0058d4b2a0 R15:
> 00007f0058d4a760
>
> [ 132.140485] </TASK>
>
> [ 132.183669] amdgpu 0000:43:00.0: amdgpu: HDCP: optional hdcp ta ucode
> is not available
>
> [ 132.184214] amdgpu 0000:43:00.0: amdgpu: DTM: optional dtm ta ucode
> is not available
>
> [ 132.184735] amdgpu 0000:43:00.0: amdgpu: RAP: optional rap ta ucode
> is not available
>
> [ 132.185234] amdgpu 0000:43:00.0: amdgpu: SECUREDISPLAY: securedisplay
> ta ucode is not available
>
> [ 132.185823] amdgpu 0000:43:00.0: amdgpu: use vbios provided pptable
>
> [ 132.186327] amdgpu 0000:43:00.0: amdgpu: smc_dpm_info table
> revision(format.content): 4.6
>
> [ 132.188783] amdgpu 0000:43:00.0: amdgpu: SMU is initialized successfully!
>
> [ 132.190039] [drm] kiq ring mec 2 pipe 1 q 0
>
> [ 132.203608] [drm] VCN decode and encode initialized
> successfully(under DPG Mode).
>
> [ 132.204178] [drm] JPEG decode initialized successfully.
>
> [ 132.246079] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
>
> [ 132.327589] memmap_init_zone_device initialised 8388608 pages in 64ms
>
> [ 132.328139] amdgpu: HMM registered 32752MB device memory
>
> [ 132.328784] amdgpu: Virtual CRAT table created for GPU
>
> [ 132.329844] amdgpu: Topology: Add dGPU node [0x738c:0x1002]
>
> [ 132.330387] kfd kfd: amdgpu: added device 1002:738c
>
> [ 132.330965] amdgpu 0000:43:00.0: amdgpu: SE 8, SH per SE 1, CU per SH
> 16, active_cu_number 72
>
> [ 132.331725] amdgpu 0000:43:00.0: amdgpu: ring comp_1.0.0 uses VM inv
> eng 0 on hub 0
>
> [ 132.332296] amdgpu 0000:43:00.0: amdgpu: ring comp_1.1.0 uses VM inv
> eng 1 on hub 0
>
> [ 132.332856] amdgpu 0000:43:00.0: amdgpu: ring comp_1.2.0 uses VM inv
> eng 4 on hub 0
>
> [ 132.333414] amdgpu 0000:43:00.0: amdgpu: ring comp_1.3.0 uses VM inv
> eng 5 on hub 0
>
> [ 132.333965] amdgpu 0000:43:00.0: amdgpu: ring comp_1.0.1 uses VM inv
> eng 6 on hub 0
>
> [ 132.334507] amdgpu 0000:43:00.0: amdgpu: ring comp_1.1.1 uses VM inv
> eng 7 on hub 0
>
> [ 132.335057] amdgpu 0000:43:00.0: amdgpu: ring comp_1.2.1 uses VM inv
> eng 8 on hub 0
>
> [ 132.335594] amdgpu 0000:43:00.0: amdgpu: ring comp_1.3.1 uses VM inv
> eng 9 on hub 0
>
> [ 132.336137] amdgpu 0000:43:00.0: amdgpu: ring kiq_2.1.0 uses VM inv
> eng 10 on hub 0
>
> [ 132.336679] amdgpu 0000:43:00.0: amdgpu: ring sdma0 uses VM inv eng 0
> on hub 1
>
> [ 132.337234] amdgpu 0000:43:00.0: amdgpu: ring sdma1 uses VM inv eng 1
> on hub 1
>
> [ 132.337790] amdgpu 0000:43:00.0: amdgpu: ring sdma2 uses VM inv eng 4
> on hub 1
>
> [ 132.338343] amdgpu 0000:43:00.0: amdgpu: ring sdma3 uses VM inv eng 5
> on hub 1
>
> [ 132.338906] amdgpu 0000:43:00.0: amdgpu: ring sdma4 uses VM inv eng 6
> on hub 1
>
> [ 132.339448] amdgpu 0000:43:00.0: amdgpu: ring sdma5 uses VM inv eng 0
> on hub 2
>
> [ 132.339987] amdgpu 0000:43:00.0: amdgpu: ring sdma6 uses VM inv eng 1
> on hub 2
>
> [ 132.340519] amdgpu 0000:43:00.0: amdgpu: ring sdma7 uses VM inv eng 4
> on hub 2
>
> [ 132.341041] amdgpu 0000:43:00.0: amdgpu: ring vcn_dec_0 uses VM inv
> eng 5 on hub 2
>
> [ 132.341570] amdgpu 0000:43:00.0: amdgpu: ring vcn_enc_0.0 uses VM inv
> eng 6 on hub 2
>
> [ 132.342101] amdgpu 0000:43:00.0: amdgpu: ring vcn_enc_0.1 uses VM inv
> eng 7 on hub 2
>
> [ 132.342630] amdgpu 0000:43:00.0: amdgpu: ring vcn_dec_1 uses VM inv
> eng 8 on hub 2
>
> [ 132.343152] amdgpu 0000:43:00.0: amdgpu: ring vcn_enc_1.0 uses VM inv
> eng 9 on hub 2
>
> [ 132.343657] amdgpu 0000:43:00.0: amdgpu: ring vcn_enc_1.1 uses VM inv
> eng 10 on hub 2
>
> [ 132.344136] amdgpu 0000:43:00.0: amdgpu: ring jpeg_dec_0 uses VM inv
> eng 11 on hub 2
>
> [ 132.344610] amdgpu 0000:43:00.0: amdgpu: ring jpeg_dec_1 uses VM inv
> eng 12 on hub 2
>
> [ 132.378213] amdgpu: Detected AMDGPU 6 Perf Events.
>
> [ 132.387349] [drm] Initialized amdgpu 3.46.0 20150101 for 0000:43:00.0
> on minor 1
>
> [ 132.388530] pcieport 0000:d7:00.0: bridge window [io 0x1000-0x0fff]
> to [bus d8] add_size 1000
>
> [ 132.389078] pcieport 0000:d7:00.0: BAR 13: no space for [io size 0x1000]
>
> [ 132.389600] pcieport 0000:d7:00.0: BAR 13: failed to assign [io size
> 0x1000]
>
> [ 132.390091] pcieport 0000:d7:00.0: BAR 13: no space for [io size 0x1000]
>
> [ 132.390568] pcieport 0000:d7:00.0: BAR 13: failed to assign [io size
> 0x1000]
>
> [ 155.359200] HelloWorld[3824]: segfault at 68 ip 00007f4c979f764e sp
> 00007ffc9b3bb610 error 4 in
> libamdhip64.so.4.4.21432-f9dccde4[7f4c979b3000+2eb000]
>
> [ 155.360268] Code: 48 8b 45 e8 64 48 33 04 25 28 00 00 00 74 05 e8 b8
> c7 fb ff 48 8b 5d f8 c9 c3 f3 0f 1e fa 55 48 89 e5 48 89 7d f8 48 8b 45
> f8 <48> 8b 40 68 5d c3 f3 0f 1e fa 55 48 89 e5 48 89 7d f8 48 8b 45 f8
>
> Best regards,
>
> Shuotao
>
> *From: *Andrey Grodzovsky <andrey.grodzovsky at amd.com>
> *Date: *Wednesday, April 6, 2022 at 10:36 PM
> *To: *Shuotao Xu <shuotaoxu at microsoft.com>,
> amd-gfx at lists.freedesktop.org <amd-gfx at lists.freedesktop.org>
> *Cc: *Ziyue Yang <Ziyue.Yang at microsoft.com>, Lei Qu
> <Lei.Qu at microsoft.com>, Peng Cheng <pengc at microsoft.com>, Ran Shu
> <Ran.Shu at microsoft.com>
> *Subject: *Re: [EXTERNAL] Re: Code Review Request for AMDGPU Hotplug Support
>
>
> Can you attach the dmesg for the failure without your patch against
> amd-staging-drm-next?
>
> Also, in general, patches for amdgpu upstream branches should be
> submitted to the amd-gfx mailing list inline using git-send-email, which
> makes it easy to comment on and review them inline.
>
> Andrey
>
> On 2022-04-06 10:25, Shuotao Xu wrote:
>> Hi Andrey,
>>
>> We just tried kernel 5.16 based on the amd-staging-drm-next branch of
>> https://gitlab.freedesktop.org/agd5f/linux.git, and found out that
>> hotplug did not work out of the box for the ROCm compute stack.
>>
>> We did not try the rendering stack since we are currently more focused
>> on AI workloads.
>>
>> We have also created a patch against the amd-staging-drm-next branch to
>> enable hotplug for the ROCm stack, which was sent in a later email with
>> the same subject. I am attaching the patch to this email as well, in
>> case you want to delete that later email.
>>
>> Best regards,
>>
>> Shuotao
>>
>> *From: *Andrey Grodzovsky <andrey.grodzovsky at amd.com>
>> *Date: *Wednesday, April 6, 2022 at 10:13 PM
>> *To: *Shuotao Xu <shuotaoxu at microsoft.com>,
>> amd-gfx at lists.freedesktop.org <amd-gfx at lists.freedesktop.org>
>> *Cc: *Ziyue Yang <Ziyue.Yang at microsoft.com>, Lei Qu
>> <Lei.Qu at microsoft.com>, Peng Cheng <pengc at microsoft.com>, Ran Shu
>> <Ran.Shu at microsoft.com>
>> *Subject: *[EXTERNAL] Re: Code Review Request for AMDGPU Hotplug Support
>>
>>
>> Looks like you are using the 5.13 kernel for this work. FYI, we added
>> hot-plug support for the graphics stack in the 5.14 kernel (see
>> https://www.phoronix.com/scan.php?page=news_item&px=Linux-5.14-AMDGPU-Hot-Unplug).
>>
>>
>> I am not sure about the code part since it all touches the KFD driver
>> (the KFD team can comment on that), but I was just wondering: if you try
>> the 5.14 kernel, would things just work for you out of the box?
>>
>> Andrey
>>
>> On 2022-04-05 22:45, Shuotao Xu wrote:
>>> Dear AMD Colleagues,
>>>
>>> We are from Microsoft Research, and are working on GPU disaggregation
>>> technology.
>>>
>>> We have created a new pull request, "Add PCIe hotplug support for amdgpu
>>> by xushuotao · Pull Request #131 · RadeonOpenCompute/ROCK-Kernel-Driver"
>>> (https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/pull/131), in
>>> ROCK-Kernel-Driver, which will enable PCIe hot-plug support for amdgpu.
>>>
>>> We believe that hot-plug support for GPU devices can open the door to
>>> many advanced data-center applications in the next few years, and we
>>> would like to have some reviewers on this PR so we can continue further
>>> technical discussions around this feature.
>>>
>>> Would you please help review this PR?
>>>
>>> Thank you very much!
>>>
>>> Best regards,
>>>
>>> Shuotao Xu
>>>
>>
>