[PATCH] drm/amdkfd: Fix the svm_bo refcount warning
Chen, Xiaogang
xiaogang.chen at amd.com
Mon Sep 25 15:26:04 UTC 2023
On 9/25/2023 8:19 AM, Philip Yang wrote:
> On 2023-09-25 05:32, Jesse Zhang wrote:
>> Fix the svm_bo refcount warning by checking the refcount before release.
>>
>> [ 462.649530] ------------[ cut here ]------------
>> [ 462.649532] refcount_t: underflow; use-after-free.
>> [ 462.649536] WARNING: CPU: 7 PID: 1936 at lib/refcount.c:28 refcount_warn_saturate+0xf8/0x150
>> [ 462.649542] Modules linked in: amdgpu(E) amdxcp drm_buddy gpu_sched drm_suballoc_helper drm_ttm_helper ttm(E) drm_display_helper cec rc_core drm_kms_helper i2c_algo_bit syscopyarea sysfillrect sysimgblt rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache netfs tls r8153_ecm cdc_ether usbnet r8152 mii joydev input_leds snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi hid_generic intel_rapl_msr snd_hda_intel intel_rapl_common snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec edac_mce_amd snd_hda_core usbhid snd_hwdep kvm_amd hid kvm snd_pcm sunrpc crct10dif_pclmul ghash_clmulni_intel snd_seq_midi sha512_ssse3 snd_seq_midi_event aesni_intel snd_rawmidi crypto_simd cryptd snd_seq rapl snd_seq_device snd_pci_acp5x snd_timer snd_rn_pci_acp3x wmi_bmof snd_acp_config snd snd_soc_acpi soundcore ccp snd_pci_acp3x k10temp mac_hid amd_pmc sch_fq_codel binfmt_misc msr parport_pc ppdev lp drm parport efi_pstore ip_tables x_tables autofs4 thunderbolt crc32_pclmul nvme i2c_piix4 nvme_core xhci_pci
>> [ 462.649576] xhci_pci_renesas video wmi
>> [ 462.649577] CPU: 7 PID: 1936 Comm: kworker/7:3 Tainted: G E 6.3.7+ #25
>> [ 462.649579] Hardware name: AMD Splinter/Splinter-PHX, BIOS WS43906N_857 09/04/2023
>> [ 462.649580] Workqueue: events svm_range_deferred_list_work [amdgpu]
>> [ 462.649771] RIP: 0010:refcount_warn_saturate+0xf8/0x150
>> [ 462.649773] Code: eb a1 0f b6 1d 7c 58 c7 01 80 fb 01 0f 87 11 64 83 00 83 e3 01 75 8c 48 c7 c7 a0 00 bc b6 c6 05 60 58 c7 01 01 e8 48 1f 9a ff <0f> 0b e9 72 ff ff ff 0f b6 1d 4b 58 c7 01 80 fb 01 0f 87 ce 63 83
>> [ 462.649773] RSP: 0018:ffffb6660603bd88 EFLAGS: 00010286
>> [ 462.649774] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000027
>> [ 462.649775] RDX: 0000000000000027 RSI: ffffb6660603bc48 RDI: ffff91f77e7e1548
>> [ 462.649776] RBP: ffffb6660603bd90 R08: 0000000000000003 R09: 0000000000000001
>> [ 462.649776] R10: 0000000000000001 R11: 0000000000000028 R12: ffff91f453fb2000
>> [ 462.649777] R13: 00000007f7cfc4c4 R14: ffff91f451f2f480 R15: 00000007f7cfc4c1
>> [ 462.649777] FS: 0000000000000000(0000) GS:ffff91f77e7c0000(0000) knlGS:0000000000000000
>> [ 462.649778] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 462.649778] CR2: 00007f7cfc4c9000 CR3: 0000000113c52000 CR4: 0000000000750ee0
>> [ 462.649779] PKRU: 55555554
>> [ 462.649779] Call Trace:
>> [ 462.649780] <TASK>
>> [ 462.649782] ? show_regs+0x6c/0x80
>> [ 462.649784] ? refcount_warn_saturate+0xf8/0x150
>> [ 462.649786] ? __warn+0x93/0x150
>> [ 462.649788] ? refcount_warn_saturate+0xf8/0x150
>> [ 462.649789] ? report_bug+0x1c6/0x1e0
>> [ 462.649791] ? irq_work_queue+0x14/0x50
>> [ 462.649794] ? handle_bug+0x46/0x80
>> [ 462.649796] ? exc_invalid_op+0x1d/0x80
>> [ 462.649797] ? asm_exc_invalid_op+0x1f/0x30
>> [ 462.649799] ? refcount_warn_saturate+0xf8/0x150
>> [ 462.649800] ? refcount_warn_saturate+0xf8/0x150
>> [ 462.649801] svm_range_free+0xeb/0xf0 [amdgpu]
>> [ 462.649907] svm_range_handle_list_op+0x1ae/0x1e0 [amdgpu]
>> [ 462.650000] svm_range_deferred_list_work+0x149/0x2c0 [amdgpu]
>> [ 462.650091] process_one_work+0x21c/0x430
>> [ 462.650094] worker_thread+0x4e/0x3c0
>> [ 462.650095] ? __pfx_worker_thread+0x10/0x10
>> [ 462.650096] kthread+0xf2/0x120
>> [ 462.650098] ? __pfx_kthread+0x10/0x10
>> [ 462.650099] ret_from_fork+0x29/0x50
>> [ 462.650101] </TASK>
>> [ 462.650102] ---[ end trace 0000000000000000 ]---
>>
>> Signed-off-by: Jesse Zhang<Jesse.Zhang at amd.com>
>> ---
>> drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>> index fcdde9f451bb..44c3f22cb4a1 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>> @@ -436,7 +436,7 @@ void svm_range_bo_unref_async(struct svm_range_bo *svm_bo)
>>
>> static void svm_range_bo_unref(struct svm_range_bo *svm_bo)
>> {
>> - if (svm_bo)
>> + if (svm_bo && kref_read(&svm_bo->kref))
>
> This just works around the issue. It looks like a use-after-free bug
> or an svm_bo refcount leak; we should fix the root cause.
>
> The kernel seems to be from the tip of amd-staging-drm-next. Is this a regression?
>
> Regards,
>
> Philip
>
I also noticed this issue with kfdtest on amd-staging-drm-next. It is
random, and as far as I could tell it happens when the kfdsvmrange tests
are run together, not when each test is run individually. So it is timing
related.
Putting kref_read before the kref call is equivalent to what kref does
before calling the release function. It does not reveal why kfd unrefs
svm_bo when the svm_bo refcount is already zero; something was already
wrong by that point.
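To illustrate the point, here is a minimal userspace sketch. It is not the kernel's kref code: toy_kref, toy_kref_put and toy_unref_guarded are hypothetical names made up for this example, and the counter is a plain int rather than refcount_t. It only shows that a guarded put silences the underflow report while the unbalanced unref still happens.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Userspace stand-in for a refcounted object; NOT the kernel's struct kref. */
struct toy_kref {
    int refcount;    /* references currently held */
    int underflows;  /* puts that arrived when the count was already zero */
    int releases;    /* times the count dropped to zero */
};

static void toy_kref_init(struct toy_kref *k)
{
    k->refcount = 1;
    k->underflows = 0;
    k->releases = 0;
}

/* Drop one reference. Returns true if this put released the object.
 * A put on a zero count is reported, loosely mimicking the warning above. */
static bool toy_kref_put(struct toy_kref *k)
{
    assert(k != NULL);
    if (k->refcount == 0) {
        k->underflows++;
        fprintf(stderr, "toy refcount: underflow; use-after-free.\n");
        return false;
    }
    if (--k->refcount == 0) {
        k->releases++;
        return true;
    }
    return false;
}

/* Same shape as the proposed change: skip the put when the count reads zero.
 * In the real driver the object may already be freed at this point, so even
 * reading the count would be a use-after-free; the toy keeps it alive. */
static bool toy_unref_guarded(struct toy_kref *k)
{
    if (k && k->refcount)
        return toy_kref_put(k);
    return false;
}
```

With one reference held, calling toy_kref_put twice reports an underflow on the second call. Calling toy_unref_guarded twice reports nothing, yet the second call is still an unref that no matching get ever paid for, which is the part that needs to be root-caused.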
Regards
Xiaogang
>> kref_put(&svm_bo->kref, svm_range_bo_release);
>> }
>>