Kernel crash at reloading amdgpu

Lin, Amber Amber.Lin at amd.com
Wed May 8 20:56:42 UTC 2019


Hi,

When I run "rmmod amdgpu; modprobe amdgpu", the kernel crashes. This is 
on Vega20. The problem is in amdgpu_device_init():


         /* check if we need to reset the asic
          *  E.g., driver was not cleanly unloaded previously, etc.
          */
         if (!amdgpu_sriov_vf(adev) && amdgpu_asic_need_reset_on_init(adev)) {
                 r = amdgpu_asic_reset(adev);
                 if (r) {
                         dev_err(adev->dev, "asic reset on init failed\n");
                         goto failed;
                 }
         }

amdgpu_asic_need_reset_on_init()/soc15_need_reset_on_init() returns true 
and it goes to amdgpu_asic_reset()/soc15_asic_mode1_reset(), where it 
calls psp_gpu_reset():

         int psp_gpu_reset(struct amdgpu_device *adev)
         {
             if (adev->firmware.load_type != AMDGPU_FW_LOAD_PSP)
                     return 0;

             return psp_mode1_reset(&adev->psp);
         }

However, psp_mode1_reset is not hooked up to psp_v11_0_mode1_reset() 
until amdgpu_device_ip_init(), which runs after amdgpu_asic_reset(). 
The resulting NULL pointer dereference crashes the kernel, and I have 
to reboot my system.
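
To make the ordering problem concrete, here is a minimal user-space sketch of one possible guard. It is not a proposed patch: the struct layouts, device_model, model_funcs and guarded_gpu_reset() are hypothetical stand-ins that only model the "reset called before the PSP callbacks are assigned" case, and returning -EINVAL in that case is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins; NOT the real amdgpu structure layouts. */
struct psp_context;

struct psp_funcs {
    int (*mode1_reset)(struct psp_context *psp);
};

struct psp_context {
    /* Stays NULL until the (modelled) IP init step assigns it. */
    const struct psp_funcs *funcs;
};

struct device_model {
    int fw_load_via_psp;        /* models load_type == AMDGPU_FW_LOAD_PSP */
    struct psp_context psp;
};

/* Models the version-specific reset callback; always succeeds here. */
int model_mode1_reset(struct psp_context *psp)
{
    (void)psp;
    return 0;
}

const struct psp_funcs model_funcs = { .mode1_reset = model_mode1_reset };

/* Refuse to reset instead of dereferencing an unassigned callback table. */
int guarded_gpu_reset(struct device_model *dev)
{
    if (!dev->fw_load_via_psp)
        return 0;
    if (!dev->psp.funcs || !dev->psp.funcs->mode1_reset)
        return -EINVAL;
    return dev->psp.funcs->mode1_reset(&dev->psp);
}

/* Walks through the three cases: too early, after init, non-PSP load. */
void model_selfcheck(void)
{
    struct device_model dev = { .fw_load_via_psp = 1, .psp = { .funcs = NULL } };

    assert(guarded_gpu_reset(&dev) == -EINVAL);  /* called before IP init */
    dev.psp.funcs = &model_funcs;                /* models IP init */
    assert(guarded_gpu_reset(&dev) == 0);
    dev.fw_load_via_psp = 0;
    assert(guarded_gpu_reset(&dev) == 0);        /* non-PSP load: no-op */
}
```

A guard like this would only turn the crash into a failed reset on init. It would not perform the reset itself, so it may not be the proper fix.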

Does anyone have an idea how to fix this properly?

BTW this is the log:

[  157.686303] PGD 0 P4D 0
[  157.688837] Oops: 0000 [#1] SMP PTI
[  157.696331] CPU: 0 PID: 1902 Comm: kworker/0:2 Tainted: G W         5.0.0-rc1-kfd+ #6
[  157.700760] Hardware name: ASUS All Series/X99-E WS, BIOS 1302 01/05/2016
[  157.707543] Workqueue: events work_for_cpu_fn
[  157.711976] RIP: 0010:psp_gpu_reset+0x18/0x30 [amdgpu]
[  157.717106] Code: ff ff ff 5b c3 b8 ea ff ff ff c3 0f 1f 80 00 00 00 00 0f 1f 44 00 00 83 bf c8 22 01 00 02 74 03 31 c0 c3 48 8b 87 c0 23 01 00 <48> 8b 40 50 48 85 c0 74 ed 48 81 c7 88 23 01 00 e9 03 3b 8d d6 0f
[  157.735852] RSP: 0018:ffffaa2544243ce0 EFLAGS: 00010246
[  157.741077] RAX: 0000000000000000 RBX: ffff97e946f60000 RCX: 0000000000000000
[  157.748202] RDX: 0000000000000027 RSI: ffffffff976655a0 RDI: ffff97e946f60000
[  157.755326] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000002
[  157.762459] R10: ffffaa2544243ba0 R11: 38a79ac3ec19edd5 R12: ffff97e946f75088
[  157.769608] R13: 000000000000000a R14: ffff97e946f75128 R15: 0000000000000001
[  157.776741] FS:  0000000000000000(0000) GS:ffff97e94f800000(0000) knlGS:0000000000000000
[  157.784827] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  157.790564] CR2: 0000000000000050 CR3: 00000008083e6003 CR4: 00000000001606f0
[  157.797696] Call Trace:
[  157.800184]  soc15_asic_reset+0x81/0x1f0 [amdgpu]
[  157.804936]  amdgpu_device_init+0xcf1/0x1800 [amdgpu]
[  157.809993]  ? rcu_read_lock_sched_held+0x74/0x80
[  157.814734]  amdgpu_driver_load_kms+0x65/0x270 [amdgpu]
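
In the dump, RAX is 0 and CR2 (the faulting address) is 0x50, which is what reading a member at offset 0x50 through a NULL base pointer would look like. The small sketch below only illustrates that arithmetic. The struct is hypothetical (ten placeholder slots ahead of mode1_reset) and is not the real amdgpu definition.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*model_fn)(void *);

/* Hypothetical callback table: ten placeholder slots, then mode1_reset. */
struct model_psp_funcs {
    model_fn earlier[10];
    model_fn mode1_reset;
};

/*
 * Address that a load of ->mode1_reset would touch if the table pointer
 * were NULL: simply the member's offset inside the struct.
 */
size_t model_fault_address(void)
{
    return offsetof(struct model_psp_funcs, mode1_reset);
}
```

With 8-byte function pointers, model_fault_address() returns 0x50, the same value as CR2 above.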

Thanks.

Regards,
Amber

