[PATCH] drm/amdkfd: Fix the svm_bo refcount waring

Mon Sep 25 15:26:04 UTC 2023

On 9/25/2023 8:19 AM, Philip Yang wrote:
>
> On 2023-09-25 05:32, Jesse Zhang wrote:
>> Fix the svm_bo refcount warning by checking the refcount before release.
>>
>> [  462.649530] ------------[ cut here ]------------
>> [  462.649532] refcount_t: underflow; use-after-free.
>> [  462.649536] WARNING: CPU: 7 PID: 1936 at lib/refcount.c:28 refcount_warn_saturate+0xf8/0x150
>> [  462.649542] Modules linked in: amdgpu(E) amdxcp drm_buddy gpu_sched drm_suballoc_helper drm_ttm_helper ttm(E) drm_display_helper cec rc_core drm_kms_helper i2c_algo_bit syscopyarea sysfillrect sysimgblt rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache netfs tls r8153_ecm cdc_ether usbnet r8152 mii joydev input_leds snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi hid_generic intel_rapl_msr snd_hda_intel intel_rapl_common snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec edac_mce_amd snd_hda_core usbhid snd_hwdep kvm_amd hid kvm snd_pcm sunrpc crct10dif_pclmul ghash_clmulni_intel snd_seq_midi sha512_ssse3 snd_seq_midi_event aesni_intel snd_rawmidi crypto_simd cryptd snd_seq rapl snd_seq_device snd_pci_acp5x snd_timer snd_rn_pci_acp3x wmi_bmof snd_acp_config snd snd_soc_acpi soundcore ccp snd_pci_acp3x k10temp mac_hid amd_pmc sch_fq_codel binfmt_misc msr parport_pc ppdev lp drm parport efi_pstore ip_tables x_tables autofs4 thunderbolt crc32_pclmul nvme i2c_piix4 nvme_core xhci_pci
>> [  462.649576]  xhci_pci_renesas video wmi
>> [  462.649577] CPU: 7 PID: 1936 Comm: kworker/7:3 Tainted: G            E      6.3.7+ #25
>> [  462.649579] Hardware name: AMD Splinter/Splinter-PHX, BIOS WS43906N_857 09/04/2023
>> [  462.649580] Workqueue: events svm_range_deferred_list_work [amdgpu]
>> [  462.649771] RIP: 0010:refcount_warn_saturate+0xf8/0x150
>> [  462.649773] Code: eb a1 0f b6 1d 7c 58 c7 01 80 fb 01 0f 87 11 64 83 00 83 e3 01 75 8c 48 c7 c7 a0 00 bc b6 c6 05 60 58 c7 01 01 e8 48 1f 9a ff <0f> 0b e9 72 ff ff ff 0f b6 1d 4b 58 c7 01 80 fb 01 0f 87 ce 63 83
>> [  462.649773] RSP: 0018:ffffb6660603bd88 EFLAGS: 00010286
>> [  462.649774] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000027
>> [  462.649775] RDX: 0000000000000027 RSI: ffffb6660603bc48 RDI: ffff91f77e7e1548
>> [  462.649776] RBP: ffffb6660603bd90 R08: 0000000000000003 R09: 0000000000000001
>> [  462.649776] R10: 0000000000000001 R11: 0000000000000028 R12: ffff91f453fb2000
>> [  462.649777] R13: 00000007f7cfc4c4 R14: ffff91f451f2f480 R15: 00000007f7cfc4c1
>> [  462.649777] FS:  0000000000000000(0000) GS:ffff91f77e7c0000(0000) knlGS:0000000000000000
>> [  462.649778] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [  462.649778] CR2: 00007f7cfc4c9000 CR3: 0000000113c52000 CR4: 0000000000750ee0
>> [  462.649779] PKRU: 55555554
>> [  462.649779] Call Trace:
>> [  462.649780]  <TASK>
>> [  462.649782]  ? show_regs+0x6c/0x80
>> [  462.649784]  ? refcount_warn_saturate+0xf8/0x150
>> [  462.649786]  ? __warn+0x93/0x150
>> [  462.649788]  ? refcount_warn_saturate+0xf8/0x150
>> [  462.649789]  ? report_bug+0x1c6/0x1e0
>> [  462.649791]  ? irq_work_queue+0x14/0x50
>> [  462.649794]  ? handle_bug+0x46/0x80
>> [  462.649796]  ? exc_invalid_op+0x1d/0x80
>> [  462.649797]  ? asm_exc_invalid_op+0x1f/0x30
>> [  462.649799]  ? refcount_warn_saturate+0xf8/0x150
>> [  462.649800]  ? refcount_warn_saturate+0xf8/0x150
>> [  462.649801]  svm_range_free+0xeb/0xf0 [amdgpu]
>> [  462.649907]  svm_range_handle_list_op+0x1ae/0x1e0 [amdgpu]
>> [  462.650000]  svm_range_deferred_list_work+0x149/0x2c0 [amdgpu]
>> [  462.650091]  process_one_work+0x21c/0x430
>> [  462.650094]  worker_thread+0x4e/0x3c0
>> [  462.650095]  ? __pfx_worker_thread+0x10/0x10
>> [  462.650096]  kthread+0xf2/0x120
>> [  462.650098]  ? __pfx_kthread+0x10/0x10
>> [  462.650099]  ret_from_fork+0x29/0x50
>> [  462.650101]  </TASK>
>> [  462.650102] ---[ end trace 0000000000000000 ]---
>>
>> Signed-off-by: Jesse Zhang<Jesse.Zhang at amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>> index fcdde9f451bb..44c3f22cb4a1 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>> @@ -436,7 +436,7 @@ void svm_range_bo_unref_async(struct svm_range_bo *svm_bo)
>>   
>>   static void svm_range_bo_unref(struct svm_range_bo *svm_bo)
>>   {
>> -	if (svm_bo)
>> +	if (svm_bo && kref_read(&svm_bo->kref))
>
> This just works around the issue. It looks like a use-after-free bug
> or an svm_bo refcount leak; we should fix the root cause.
>
> The kernel seems to be from the tip of amd-staging-drm-next. Is this a
> regression?
>
> Regards,
>
> Philip
>
I also noticed this issue on kfdtest with amd-staging-drm-next. It is
random, and I saw it happen when running the kfdsvmrange tests together,
not when running each test individually. So it is timing related.

Putting kref_read before the kref call is equivalent to what kref does
before calling the release function. It does not reveal why kfd unrefs
svm_bo when the svm_bo refcount was already zero; something was already
wrong there.
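
To illustrate the point, here is a minimal userspace sketch in C11. It
is a model only, not the kernel's struct kref or the kfd code: the names
model_ref, model_put, model_put_guarded and run_unbalanced are
hypothetical, and the object lives on the stack so that reading the
counter after "release" stays well defined. In the kernel the object
would already be freed at that point, so the extra read would itself
touch freed memory.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace model only: NOT the kernel's struct kref. */
struct model_ref {
	atomic_int count;
	bool released;		/* set on the 1 -> 0 transition */
	int underflows;		/* puts that found the count already at 0 */
};

void model_init(struct model_ref *r)
{
	atomic_init(&r->count, 1);
	r->released = false;
	r->underflows = 0;
}

/* Plain put: drop one reference and "release" on the last one. */
void model_put(struct model_ref *r)
{
	int old = atomic_fetch_sub(&r->count, 1);

	if (old == 1) {
		r->released = true;
	} else if (old <= 0) {
		/* Roughly where the kernel would warn about an underflow. */
		r->underflows++;
		atomic_fetch_add(&r->count, 1);	/* undo, keep the model at 0 */
	}
}

/* Guarded put, modelled on the proposed check before the put. */
void model_put_guarded(struct model_ref *r)
{
	if (atomic_load(&r->count) != 0)
		model_put(r);
}

/* One legitimate put followed by one unbalanced put. */
int run_unbalanced(void (*put)(struct model_ref *))
{
	struct model_ref r;

	model_init(&r);
	put(&r);	/* legitimate: 1 -> 0, object released */
	put(&r);	/* the bug: no reference is held any more */
	return r.underflows;
}
```

In this model, run_unbalanced(model_put) reports one underflow, while
run_unbalanced(model_put_guarded) reports none. The unbalanced second
put still happens in both cases; the guard only hides the report, which
is why it does not address the root cause.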

Regards

Xiaogang

>>   		kref_put(&svm_bo->kref, svm_range_bo_release);
>>   }
>>