[EXTERNAL] Re: Code Review Request for AMDGPU Hotplug Support
Andrey Grodzovsky
andrey.grodzovsky at amd.com
Wed Apr 6 15:39:30 UTC 2022
+ Felix
On 2022-04-06 11:11, Shuotao Xu wrote:
> Hi Andrey,
>
> Thanks for your kind comment on the Linux patch submission protocol; please
> let me know if there is any way to rectify it.
Just resend your patch to the amd-gfx mailing list using
git-send-email (see here for how to use it:
https://burzalodowa.wordpress.com/2013/10/05/how-to-send-patches-with-git-send-email/).
I suggest adding --cover-letter so you will be able to explain the
story behind the patch.
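For example, something along these lines (just a sketch; adjust the commit
range, output directory, and your git send-email SMTP setup as needed):

    git format-patch -1 --cover-letter -o outgoing/
    # edit outgoing/0000-cover-letter.patch to tell the story behind the patch
    git send-email --to=amd-gfx@lists.freedesktop.org outgoing/*.patch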
>
> The dmesg is fine except for some warnings during PCI rescan after the
> PCI removal of an AMD MI100.
>
> The issue is that after this, a ROCm application will segfault with the
> amdgpu driver unless the entire amdgpu kernel module is unloaded and
> reloaded, which does not meet our hotplug requirement. The issues found
> upon investigation are:
>
> 1) kfd_lock is locked after hotplug, and kfd_open will return a fault
> right away to libhsakmt.
I see now: kfd_lock is static, so there is a single instance across all
devices, and it does not go away after device removal, only after driver
unload. In that case I am not sure it is the best idea to just decrement
kfd_lock on device init, since in a multi-GPU system it might be locked on
purpose because, for example, another device is going through a reset at
that very moment.
Felix, kgd2kfd_suspend is also called during device PCI remove, which leaves
the kfd_locked count unbalanced. Maybe we should not touch it in that path;
I am thinking of adding a drm_dev_enter guard in kgd2kfd_suspend so that we
skip the kfd_locked update while we are in the middle of a PCI remove.
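Roughly what I have in mind is the sketch below. This is only an illustration
of the idea, not a patch: the exact kgd2kfd_suspend() body, the kfd_locked
type, and the kfd->adev / adev_to_drm() plumbing are assumptions on my side
that need to be checked against amd-staging-drm-next.

#include <drm/drm_drv.h>        /* drm_dev_enter()/drm_dev_exit() */

void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
{
        int idx;

        if (!kfd->init_complete)
                return;

        /* For runtime suspend, skip locking KFD. */
        if (!run_pm) {
                /*
                 * Only bump kfd_locked while the DRM device is still alive.
                 * If drm_dev_unplug() has already run (PCI remove in
                 * progress) we skip it, so the suspend/resume count stays
                 * balanced and kfd_open() is not blocked forever afterwards.
                 */
                if (drm_dev_enter(adev_to_drm(kfd->adev), &idx)) {
                        if (atomic_inc_return(&kfd_locked) == 1)
                                kfd_suspend_all_processes();
                        drm_dev_exit(idx);
                }
        }

        kfd->dqm->ops.stop(kfd->dqm);
        kfd_iommu_suspend(kfd);
}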
>
> 2) iolink/p2plink has anomalies after hotplug, and libhsakmt will find
> these anomalies and return an error.
Can you point to the specific abnormalities? As part of PCI hot unplug we
clean up all the sysfs files, and this looks like part of that. Do you see a
"sysfs file already exists" error on the next pci_rescan?
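To make it concrete, a before/after dump of the KFD topology that libhsakmt
parses would help; something along these lines (paths quoted from memory, so
please double-check them on your system):

    # dump every io_link that the thunk (libhsakmt) will parse
    for n in /sys/devices/virtual/kfd/kfd/topology/nodes/*; do
            echo "== $n =="
            cat "$n"/io_links/*/properties 2>/dev/null
    done

Run it once before the unplug and once after the pci_rescan, and diff the two.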
>
> Our patch has been tested with a single-instance AMD MI100 GPU and it
> worked.
Exactly. In a multi-GPU system, arbitrarily decrementing kfd_lock on device
load can be problematic.
Andrey
>
> I am attaching the dmesg after rescan anyway, which will show the
> warning and the segfault.
>
> [ 132.054822] pci 0000:43:00.0: [1002:738c] type 00 class 0x038000
>
> [ 132.054856] pci 0000:43:00.0: reg 0x10: [mem
> 0x38b000000000-0x38b7ffffffff 64bit pref]
>
> [ 132.054877] pci 0000:43:00.0: reg 0x18: [mem
> 0x38b800000000-0x38b8001fffff 64bit pref]
>
> [ 132.054890] pci 0000:43:00.0: reg 0x20: [io 0xa000-0xa0ff]
>
> [ 132.054904] pci 0000:43:00.0: reg 0x24: [mem 0xb8400000-0xb847ffff]
>
> [ 132.054918] pci 0000:43:00.0: reg 0x30: [mem 0xb8480000-0xb849ffff pref]
>
> [ 132.055134] pci 0000:43:00.0: PME# supported from D1 D2 D3hot D3cold
>
> [ 132.055217] pci 0000:43:00.0: 63.008 Gb/s available PCIe bandwidth,
> limited by 8.0 GT/s PCIe x8 link at 0000:3c:14.0 (capable of 252.048
> Gb/s with 16.0 GT/s PCIe x16 link)
>
> [ 132.056001] pci 0000:43:00.0: Adding to iommu group 73
>
> [ 132.057943] pci 0000:43:00.0: BAR 0: assigned [mem
> 0x38b000000000-0x38b7ffffffff 64bit pref]
>
> [ 132.057960] pci 0000:43:00.0: BAR 2: assigned [mem
> 0x38b800000000-0x38b8001fffff 64bit pref]
>
> [ 132.057974] pci 0000:43:00.0: BAR 5: assigned [mem 0xb8400000-0xb847ffff]
>
> [ 132.057981] pci 0000:43:00.0: BAR 6: assigned [mem
> 0xb8480000-0xb849ffff pref]
>
> [ 132.057984] pci 0000:43:00.0: BAR 4: assigned [io 0xa000-0xa0ff]
>
> [ 132.058429] ======================================================
>
> [ 132.058453] WARNING: possible circular locking dependency detected
>
> [ 132.058477] 5.16.0-kfd+ #1 Not tainted
>
> [ 132.058492] ------------------------------------------------------
>
> [ 132.058515] bash/3632 is trying to acquire lock:
>
> [ 132.058534] ffffadee20adfb50
> ((work_completion)(&wfc.work)){+.+.}-{0:0}, at: __flush_work+0x2f5/0x470
>
> [ 132.058554] [drm] initializing kernel modesetting (ARCTURUS
> 0x1002:0x738C 0x1002:0x0C34 0x01).
>
> [ 132.058577]
>
> but task is already holding lock:
>
> [ 132.058580] ffffffffa3c62308
>
> [ 132.058630] amdgpu 0000:43:00.0: amdgpu: Trusted Memory Zone (TMZ)
> feature not supported
>
> [ 132.058638] (
>
> [ 132.058678] [drm] register mmio base: 0xB8400000
>
> [ 132.058683] pci_rescan_remove_lock
>
> [ 132.058694] [drm] register mmio size: 524288
>
> [ 132.058713] ){+.+.}-{3:3}, at: rescan_store+0x55/0x90
>
> [ 132.058773]
>
> which lock already depends on the new lock.
>
> [ 132.058804]
>
> the existing dependency chain (in reverse order) is:
>
> [ 132.058819] [drm] add ip block number 0 <soc15_common>
>
> [ 132.058831]
>
> -> #1 (
>
> [ 132.058854] [drm] add ip block number 1 <gmc_v9_0>
>
> [ 132.058858] [drm] add ip block number 2 <vega20_ih>
>
> [ 132.058874] pci_rescan_remove_lock
>
> [ 132.058894] [drm] add ip block number 3 <psp>
>
> [ 132.058915] ){+.+.}-{3:3}
>
> [ 132.058931] [drm] add ip block number 4 <smu>
>
> [ 132.058951] :
>
> [ 132.058965] [drm] add ip block number 5 <gfx_v9_0>
>
> [ 132.058986] __mutex_lock+0xa4/0x990
>
> [ 132.058996] [drm] add ip block number 6 <sdma_v4_0>
>
> [ 132.059016] i801_add_tco_spt.isra.20+0x2a/0x1a0
>
> [ 132.059033] [drm] add ip block number 7 <vcn_v2_5>
>
> [ 132.059054] i801_add_tco+0xf6/0x110
>
> [ 132.059075] [drm] add ip block number 8 <jpeg_v2_5>
>
> [ 132.059096] i801_probe+0x402/0x860
>
> [ 132.059151] local_pci_probe+0x40/0x90
>
> [ 132.059170] work_for_cpu_fn+0x10/0x20
>
> [ 132.059189] process_one_work+0x2a4/0x640
>
> [ 132.059208] worker_thread+0x228/0x3f0
>
> [ 132.059227] kthread+0x16d/0x1a0
>
> [ 132.059795] ret_from_fork+0x1f/0x30
>
> [ 132.060337]
>
> -> #0 ((work_completion)(&wfc.work)){+.+.}-{0:0}:
>
> [ 132.061405] __lock_acquire+0x1552/0x1ac0
>
> [ 132.061950] lock_acquire+0x26c/0x300
>
> [ 132.062484] __flush_work+0x315/0x470
>
> [ 132.063009] work_on_cpu+0x98/0xc0
>
> [ 132.063526] pci_device_probe+0x1bc/0x1d0
>
> [ 132.064036] really_probe+0x102/0x450
>
> [ 132.064532] __driver_probe_device+0x100/0x170
>
> [ 132.065020] driver_probe_device+0x1f/0xa0
>
> [ 132.065497] __device_attach_driver+0x6b/0xe0
>
> [ 132.065975] bus_for_each_drv+0x6a/0xb0
>
> [ 132.066449] __device_attach+0xe2/0x160
>
> [ 132.066912] pci_bus_add_device+0x4a/0x80
>
> [ 132.067365] pci_bus_add_devices+0x2c/0x70
>
> [ 132.067812] pci_bus_add_devices+0x65/0x70
>
> [ 132.068253] pci_bus_add_devices+0x65/0x70
>
> [ 132.068688] pci_bus_add_devices+0x65/0x70
>
> [ 132.068936] amdgpu 0000:43:00.0: amdgpu: Fetched VBIOS from ROM BAR
>
> [ 132.069109] pci_bus_add_devices+0x65/0x70
>
> [ 132.069602] amdgpu: ATOM BIOS: 113-D3431401-X00
>
> [ 132.070058] pci_bus_add_devices+0x65/0x70
>
> [ 132.070572] [drm] VCN(0) decode is enabled in VM mode
>
> [ 132.070997] pci_rescan_bus+0x23/0x30
>
> [ 132.071000] rescan_store+0x61/0x90
>
> [ 132.071003] kernfs_fop_write_iter+0x132/0x1b0
>
> [ 132.071501] [drm] VCN(1) decode is enabled in VM mode
>
> [ 132.071964] new_sync_write+0x11f/0x1b0
>
> [ 132.072432] [drm] VCN(0) encode is enabled in VM mode
>
> [ 132.072900] vfs_write+0x35b/0x3b0
>
> [ 132.073376] [drm] VCN(1) encode is enabled in VM mode
>
> [ 132.073847] ksys_write+0xa7/0xe0
>
> [ 132.074335] [drm] JPEG(0) JPEG decode is enabled in VM mode
>
> [ 132.074803] do_syscall_64+0x34/0x80
>
> [ 132.074808] entry_SYSCALL_64_after_hwframe+0x44/0xae
>
> [ 132.074811]
>
> other info that might help us debug this:
>
> [ 132.074813] Possible unsafe locking scenario:
>
> [ 132.075302] [drm] JPEG(1) JPEG decode is enabled in VM mode
>
> [ 132.075779] CPU0 CPU1
>
> [ 132.076361] amdgpu 0000:43:00.0: amdgpu: MEM ECC is active.
>
> [ 132.076765] ---- ----
>
> [ 132.077265] amdgpu 0000:43:00.0: amdgpu: SRAM ECC is active.
>
> [ 132.078649] lock(pci_rescan_remove_lock);
>
> [ 132.078652]
> lock((work_completion)(&wfc.work));
>
> [ 132.078653] lock(pci_rescan_remove_lock);
>
> [ 132.078655] lock((work_completion)(&wfc.work));
>
> [ 132.078656]
>
> *** DEADLOCK ***
>
> [ 132.078656] 5 locks held by bash/3632:
>
> [ 132.078658] #0: ffff9c39c7b89438
>
> [ 132.079612] amdgpu 0000:43:00.0: amdgpu: RAS INFO: ras initialized
> successfully, hardware ability[7fff] ras_mask[7fff]
>
> [ 132.080089] (
>
> [ 132.080602] [drm] vm size is 262144 GB, 4 levels, block size is
> 9-bit, fragment size is 9-bit
>
> [ 132.081082] sb_writers
>
> [ 132.081601] amdgpu 0000:43:00.0: amdgpu: VRAM: 32752M
> 0x0000008000000000 - 0x00000087FEFFFFFF (32752M used)
>
> [ 132.082102] #6
>
> [ 132.082630] amdgpu 0000:43:00.0: amdgpu: GART: 512M
> 0x0000000000000000 - 0x000000001FFFFFFF
>
> [ 132.083152] ){.+.+}-{0:0}
>
> [ 132.083687] amdgpu 0000:43:00.0: amdgpu: AGP: 267878400M
> 0x0000008800000000 - 0x0000FFFFFFFFFFFF
>
> [ 132.084210] , at: ksys_write+0xa7/0xe0
>
> [ 132.085708] [drm] Detected VRAM RAM=32752M, BAR=32768M
>
> [ 132.086205] #1:
>
> [ 132.086733] [drm] RAM width 4096bits HBM
>
> [ 132.087269] ffff9c5959011088
>
> [ 132.087890] [drm] amdgpu: 32752M of VRAM memory ready
>
> [ 132.088389] (
>
> [ 132.088972] [drm] amdgpu: 32752M of GTT memory ready.
>
> [ 132.089572] &of->mutex
>
> [ 132.090206] [drm] GART: num cpu pages 131072, num gpu pages 131072
>
> [ 132.090804] ){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x103/0x1b0
>
> [ 132.090808] #2: ffff9c39c882a9e0 (kn->active#423){.+.+}-{0:0}, at:
> kernfs_fop_write_iter+0x10c/0x1b0
>
> [ 132.091639] [drm] PCIE GART of 512M enabled.
>
> [ 132.092117] #3:
>
> [ 132.092801] [drm] PTB located at 0x0000008000000000
>
> [ 132.093480] ffffffffa3c62308
>
> [ 132.094566] amdgpu 0000:43:00.0: amdgpu: PSP runtime database doesn't
> exist
>
> [ 132.094822] (pci_rescan_remove_lock){+.+.}-{3:3}, at:
> rescan_store+0x55/0x90
>
> [ 132.094827] #4: ffff9c597392b248 (&dev->mutex){....}-{3:3}, at:
> __device_attach+0x39/0x160
>
> [ 132.094835]
>
> stack backtrace:
>
> [ 132.097098] [drm] Found VCN firmware Version ENC: 1.1 DEC: 1 VEP: 0
> Revision: 21
>
> [ 132.097467] CPU: 47 PID: 3632 Comm: bash Not tainted 5.16.0-kfd+ #1
>
> [ 132.098169] amdgpu 0000:43:00.0: amdgpu: Will use PSP to load VCN
> firmware
>
> [ 132.098839] Hardware name: Supermicro SYS-4029GP-TRT2/X11DPG-OT-CPU,
> BIOS 2.1 08/14/2018
>
> [ 132.098841] Call Trace:
>
> [ 132.098842] <TASK>
>
> [ 132.098843] dump_stack_lvl+0x44/0x57
>
> [ 132.098848] check_noncircular+0x105/0x120
>
> [ 132.098853] ? unwind_get_return_address+0x1b/0x30
>
> [ 132.112924] ? register_lock_class+0x46/0x780
>
> [ 132.113630] ? __lock_acquire+0x1552/0x1ac0
>
> [ 132.114342] __lock_acquire+0x1552/0x1ac0
>
> [ 132.115050] lock_acquire+0x26c/0x300
>
> [ 132.115755] ? __flush_work+0x2f5/0x470
>
> [ 132.116460] ? lock_is_held_type+0xdf/0x130
>
> [ 132.117177] __flush_work+0x315/0x470
>
> [ 132.117890] ? __flush_work+0x2f5/0x470
>
> [ 132.118604] ? lock_is_held_type+0xdf/0x130
>
> [ 132.119305] ? mark_held_locks+0x49/0x70
>
> [ 132.119981] ? queue_work_on+0x2f/0x70
>
> [ 132.120645] ? lockdep_hardirqs_on+0x79/0x100
>
> [ 132.121300] work_on_cpu+0x98/0xc0
>
> [ 132.121702] [drm] reserve 0x400000 from 0x87fec00000 for PSP TMR
>
> [ 132.121947] ? __traceiter_workqueue_execute_end+0x40/0x40
>
> [ 132.123270] ? pci_device_shutdown+0x60/0x60
>
> [ 132.123880] pci_device_probe+0x1bc/0x1d0
>
> [ 132.124475] really_probe+0x102/0x450
>
> [ 132.125060] __driver_probe_device+0x100/0x170
>
> [ 132.125641] driver_probe_device+0x1f/0xa0
>
> [ 132.126215] __device_attach_driver+0x6b/0xe0
>
> [ 132.126797] ? driver_allows_async_probing+0x50/0x50
>
> [ 132.127383] ? driver_allows_async_probing+0x50/0x50
>
> [ 132.127960] bus_for_each_drv+0x6a/0xb0
>
> [ 132.128528] __device_attach+0xe2/0x160
>
> [ 132.129095] pci_bus_add_device+0x4a/0x80
>
> [ 132.129659] pci_bus_add_devices+0x2c/0x70
>
> [ 132.130213] pci_bus_add_devices+0x65/0x70
>
> [ 132.130753] pci_bus_add_devices+0x65/0x70
>
> [ 132.131283] pci_bus_add_devices+0x65/0x70
>
> [ 132.131780] pci_bus_add_devices+0x65/0x70
>
> [ 132.132270] pci_bus_add_devices+0x65/0x70
>
> [ 132.132757] pci_rescan_bus+0x23/0x30
>
> [ 132.133233] rescan_store+0x61/0x90
>
> [ 132.133701] kernfs_fop_write_iter+0x132/0x1b0
>
> [ 132.134167] new_sync_write+0x11f/0x1b0
>
> [ 132.134627] vfs_write+0x35b/0x3b0
>
> [ 132.135062] ksys_write+0xa7/0xe0
>
> [ 132.135503] do_syscall_64+0x34/0x80
>
> [ 132.135933] entry_SYSCALL_64_after_hwframe+0x44/0xae
>
> [ 132.136358] RIP: 0033:0x7f0058a73224
>
> [ 132.136775] Code: 89 02 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00
> 00 00 00 66 90 48 8d 05 c1 07 2e 00 8b 00 85 c0 75 13 b8 01 00 00 00 0f
> 05 <48> 3d 00 f0 ff ff 77 54 f3 c3 66 90 41 54 55 49 89 d4 53 48 89 f5
>
> [ 132.137663] RSP: 002b:00007ffc4f4c71a8 EFLAGS: 00000246 ORIG_RAX:
> 0000000000000001
>
> [ 132.138121] RAX: ffffffffffffffda RBX: 0000000000000002 RCX:
> 00007f0058a73224
>
> [ 132.138590] RDX: 0000000000000002 RSI: 000055d466c24450 RDI:
> 0000000000000001
>
> [ 132.139064] RBP: 000055d466c24450 R08: 000000000000000a R09:
> 0000000000000001
>
> [ 132.139532] R10: 000000000000000a R11: 0000000000000246 R12:
> 00007f0058d4f760
>
> [ 132.140003] R13: 0000000000000002 R14: 00007f0058d4b2a0 R15:
> 00007f0058d4a760
>
> [ 132.140485] </TASK>
>
> [ 132.183669] amdgpu 0000:43:00.0: amdgpu: HDCP: optional hdcp ta ucode
> is not available
>
> [ 132.184214] amdgpu 0000:43:00.0: amdgpu: DTM: optional dtm ta ucode
> is not available
>
> [ 132.184735] amdgpu 0000:43:00.0: amdgpu: RAP: optional rap ta ucode
> is not available
>
> [ 132.185234] amdgpu 0000:43:00.0: amdgpu: SECUREDISPLAY: securedisplay
> ta ucode is not available
>
> [ 132.185823] amdgpu 0000:43:00.0: amdgpu: use vbios provided pptable
>
> [ 132.186327] amdgpu 0000:43:00.0: amdgpu: smc_dpm_info table
> revision(format.content): 4.6
>
> [ 132.188783] amdgpu 0000:43:00.0: amdgpu: SMU is initialized successfully!
>
> [ 132.190039] [drm] kiq ring mec 2 pipe 1 q 0
>
> [ 132.203608] [drm] VCN decode and encode initialized
> successfully(under DPG Mode).
>
> [ 132.204178] [drm] JPEG decode initialized successfully.
>
> [ 132.246079] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
>
> [ 132.327589] memmap_init_zone_device initialised 8388608 pages in 64ms
>
> [ 132.328139] amdgpu: HMM registered 32752MB device memory
>
> [ 132.328784] amdgpu: Virtual CRAT table created for GPU
>
> [ 132.329844] amdgpu: Topology: Add dGPU node [0x738c:0x1002]
>
> [ 132.330387] kfd kfd: amdgpu: added device 1002:738c
>
> [ 132.330965] amdgpu 0000:43:00.0: amdgpu: SE 8, SH per SE 1, CU per SH
> 16, active_cu_number 72
>
> [ 132.331725] amdgpu 0000:43:00.0: amdgpu: ring comp_1.0.0 uses VM inv
> eng 0 on hub 0
>
> [ 132.332296] amdgpu 0000:43:00.0: amdgpu: ring comp_1.1.0 uses VM inv
> eng 1 on hub 0
>
> [ 132.332856] amdgpu 0000:43:00.0: amdgpu: ring comp_1.2.0 uses VM inv
> eng 4 on hub 0
>
> [ 132.333414] amdgpu 0000:43:00.0: amdgpu: ring comp_1.3.0 uses VM inv
> eng 5 on hub 0
>
> [ 132.333965] amdgpu 0000:43:00.0: amdgpu: ring comp_1.0.1 uses VM inv
> eng 6 on hub 0
>
> [ 132.334507] amdgpu 0000:43:00.0: amdgpu: ring comp_1.1.1 uses VM inv
> eng 7 on hub 0
>
> [ 132.335057] amdgpu 0000:43:00.0: amdgpu: ring comp_1.2.1 uses VM inv
> eng 8 on hub 0
>
> [ 132.335594] amdgpu 0000:43:00.0: amdgpu: ring comp_1.3.1 uses VM inv
> eng 9 on hub 0
>
> [ 132.336137] amdgpu 0000:43:00.0: amdgpu: ring kiq_2.1.0 uses VM inv
> eng 10 on hub 0
>
> [ 132.336679] amdgpu 0000:43:00.0: amdgpu: ring sdma0 uses VM inv eng 0
> on hub 1
>
> [ 132.337234] amdgpu 0000:43:00.0: amdgpu: ring sdma1 uses VM inv eng 1
> on hub 1
>
> [ 132.337790] amdgpu 0000:43:00.0: amdgpu: ring sdma2 uses VM inv eng 4
> on hub 1
>
> [ 132.338343] amdgpu 0000:43:00.0: amdgpu: ring sdma3 uses VM inv eng 5
> on hub 1
>
> [ 132.338906] amdgpu 0000:43:00.0: amdgpu: ring sdma4 uses VM inv eng 6
> on hub 1
>
> [ 132.339448] amdgpu 0000:43:00.0: amdgpu: ring sdma5 uses VM inv eng 0
> on hub 2
>
> [ 132.339987] amdgpu 0000:43:00.0: amdgpu: ring sdma6 uses VM inv eng 1
> on hub 2
>
> [ 132.340519] amdgpu 0000:43:00.0: amdgpu: ring sdma7 uses VM inv eng 4
> on hub 2
>
> [ 132.341041] amdgpu 0000:43:00.0: amdgpu: ring vcn_dec_0 uses VM inv
> eng 5 on hub 2
>
> [ 132.341570] amdgpu 0000:43:00.0: amdgpu: ring vcn_enc_0.0 uses VM inv
> eng 6 on hub 2
>
> [ 132.342101] amdgpu 0000:43:00.0: amdgpu: ring vcn_enc_0.1 uses VM inv
> eng 7 on hub 2
>
> [ 132.342630] amdgpu 0000:43:00.0: amdgpu: ring vcn_dec_1 uses VM inv
> eng 8 on hub 2
>
> [ 132.343152] amdgpu 0000:43:00.0: amdgpu: ring vcn_enc_1.0 uses VM inv
> eng 9 on hub 2
>
> [ 132.343657] amdgpu 0000:43:00.0: amdgpu: ring vcn_enc_1.1 uses VM inv
> eng 10 on hub 2
>
> [ 132.344136] amdgpu 0000:43:00.0: amdgpu: ring jpeg_dec_0 uses VM inv
> eng 11 on hub 2
>
> [ 132.344610] amdgpu 0000:43:00.0: amdgpu: ring jpeg_dec_1 uses VM inv
> eng 12 on hub 2
>
> [ 132.378213] amdgpu: Detected AMDGPU 6 Perf Events.
>
> [ 132.387349] [drm] Initialized amdgpu 3.46.0 20150101 for 0000:43:00.0
> on minor 1
>
> [ 132.388530] pcieport 0000:d7:00.0: bridge window [io 0x1000-0x0fff]
> to [bus d8] add_size 1000
>
> [ 132.389078] pcieport 0000:d7:00.0: BAR 13: no space for [io size 0x1000]
>
> [ 132.389600] pcieport 0000:d7:00.0: BAR 13: failed to assign [io size
> 0x1000]
>
> [ 132.390091] pcieport 0000:d7:00.0: BAR 13: no space for [io size 0x1000]
>
> [ 132.390568] pcieport 0000:d7:00.0: BAR 13: failed to assign [io size
> 0x1000]
>
> [ 155.359200] HelloWorld[3824]: segfault at 68 ip 00007f4c979f764e sp
> 00007ffc9b3bb610 error 4 in
> libamdhip64.so.4.4.21432-f9dccde4[7f4c979b3000+2eb000]
>
> [ 155.360268] Code: 48 8b 45 e8 64 48 33 04 25 28 00 00 00 74 05 e8 b8
> c7 fb ff 48 8b 5d f8 c9 c3 f3 0f 1e fa 55 48 89 e5 48 89 7d f8 48 8b 45
> f8 <48> 8b 40 68 5d c3 f3 0f 1e fa 55 48 89 e5 48 89 7d f8 48 8b 45 f8
>
> Best regards,
>
> Shuotao
>
> *From: *Andrey Grodzovsky <andrey.grodzovsky at amd.com>
> *Date: *Wednesday, April 6, 2022 at 10:36 PM
> *To: *Shuotao Xu <shuotaoxu at microsoft.com>,
> amd-gfx at lists.freedesktop.org <amd-gfx at lists.freedesktop.org>
> *Cc: *Ziyue Yang <Ziyue.Yang at microsoft.com>, Lei Qu
> <Lei.Qu at microsoft.com>, Peng Cheng <pengc at microsoft.com>, Ran Shu
> <Ran.Shu at microsoft.com>
> *Subject: *Re: [EXTERNAL] Re: Code Review Request for AMDGPU Hotplug Support
>
>
> Can you attach the dmesg for the failure without your patch against
> amd-staging-drm-next?
>
> Also, in general, patches for amdgpu upstream branches should be
> submitted to the amd-gfx mailing list inline using git-send-email, which
> makes it easy to comment on and review them inline.
>
> Andrey
>
> On 2022-04-06 10:25, Shuotao Xu wrote:
>> Hi Andrey,
>>
>> We just tried kernel 5.16 based on the amd-staging-drm-next branch of
>> https://gitlab.freedesktop.org/agd5f/linux.git, and found out that
>> hotplug did not work out of the box for the ROCm compute stack.
>>
>> We did not try the rendering stack since we are currently more focused
>> on AI workloads.
>>
>> We have also created a patch against the amd-staging-drm-next branch to
>> enable hotplug for the ROCm stack, which was sent in a later email with
>> the same subject. I am attaching the patch to this email as well, in
>> case you want to delete that later email.
>>
>> Best regards,
>>
>> Shuotao
>>
>> *From: *Andrey Grodzovsky <andrey.grodzovsky at amd.com>
>> *Date: *Wednesday, April 6, 2022 at 10:13 PM
>> *To: *Shuotao Xu <shuotaoxu at microsoft.com>,
>> amd-gfx at lists.freedesktop.org <amd-gfx at lists.freedesktop.org>
>> *Cc: *Ziyue Yang <Ziyue.Yang at microsoft.com>, Lei Qu
>> <Lei.Qu at microsoft.com>, Peng Cheng <pengc at microsoft.com>, Ran Shu
>> <Ran.Shu at microsoft.com>
>> *Subject: *[EXTERNAL] Re: Code Review Request for AMDGPU Hotplug Support
>>
>>
>> Looks like you are using the 5.13 kernel for this work. FYI, we added
>> hot-plug support for the graphics stack in the 5.14 kernel (see
>> https://www.phoronix.com/scan.php?page=news_item&px=Linux-5.14-AMDGPU-Hot-Unplug).
>>
>>
>> I am not sure about the code part since it all touches the KFD driver
>> (the KFD team can comment on that), but I was just wondering: if you try
>> the 5.14 kernel, would things just work for you out of the box?
>>
>> Andrey
>>
>> On 2022-04-05 22:45, Shuotao Xu wrote:
>>> Dear AMD Colleagues,
>>>
>>> We are from Microsoft Research, and are working on GPU disaggregation
>>> technology.
>>>
>>> We have created a new pull request, "Add PCIe hotplug support for amdgpu
>>> by xushuotao · Pull Request #131 · RadeonOpenCompute/ROCK-Kernel-Driver"
>>> (https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/pull/131), in
>>> ROCK-Kernel-Driver, which will enable PCIe hot-plug support for amdgpu.
>>>
>>> We believe that hot-plug support for GPU devices can open the door to
>>> many advanced data-center applications in the next few years, and we
>>> would like to have some reviewers on this PR so we can continue further
>>> technical discussions around this feature.
>>>
>>> Would you please help review this PR?
>>>
>>> Thank you very much!
>>>
>>> Best regards,
>>>
>>> Shuotao Xu
>>>
>>
>