[RFC PATCH] drm/amdkfd: disable HSA_AMD_SVM on LoongArch and AArch64

Thu Aug 14 15:35:27 UTC 2025

On AArch64 we also noticed problems with HSA_SVM due to virtual address
limitations on our GPUs. Basically we can only use 47-bit virtual
addresses for user mode pointers. AArch64 uses 48-bit pointers with 4KB
pages and even more with 64KB pages.

It should be possible to work around that with "ulimit -v" to limit the
virtual address space used by the application. Therefore I'd prefer not
to disable HSA_SVM outright, but instead maybe add address bounds checks
in svm_range_set_attr.
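
To make the idea concrete, here is a minimal sketch of such a bounds check,
written as plain user-space C rather than kernel code. The macro and helper
names are hypothetical, the 47-bit limit is simply the figure mentioned
above, and how (or whether) a check like this would fit into
svm_range_set_attr is not worked out here.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Assumption: user-mode pointers must fit in 47 bits, as described above.
 * Both names below are hypothetical and do not exist in amdkfd.
 */
#define GPUVM_USER_VA_BITS  47
#define GPUVM_USER_VA_LIMIT (UINT64_C(1) << GPUVM_USER_VA_BITS)

/*
 * Return true if the byte range [start, start + size) lies entirely
 * below the 47-bit limit. Empty ranges are rejected.
 */
static bool svm_range_fits_gpuvm(uint64_t start, uint64_t size)
{
	if (size == 0)
		return false;
	if (start >= GPUVM_USER_VA_LIMIT)
		return false;
	/* Compare against the remaining room so start + size cannot overflow. */
	return size <= GPUVM_USER_VA_LIMIT - start;
}
```

A caller would presumably reject the request with an error when the helper
returns false, instead of the GPU faulting on the range later.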

LoongArch seems to have a different issue. I'd be OK to disable HSA_SVM
on that arch until more information is available.

Regards,
   Felix

On 2025-08-13 23:21, Mingcong Bai wrote:
> While testing my ROCm port for LoongArch and AArch64 (patches pending) on
> the following platforms:
>
> - LoongArch ...
>    - Loongson AC612A0_V1.1 (Loongson 3C6000/S) + AMD Radeon RX 6800
> - AArch64 ...
>    - FD30M51 (Phytium FT-D3000) + AMD Radeon RX 7600
>    - Huawei D920S10 (Huawei Kunpeng 920) + AMD Radeon RX 7600
>
> When HSA_AMD_SVM is enabled, amdgpu would fail to initialise at all on
> LoongArch (no output):
>
>    amdgpu 0000:0d:00.0: amdgpu: kiq ring mec 2 pipe 1 q 0
>    CPU 0 Unable to handle kernel paging request at virtual address ffffffffff800034, era == 9000000001058044, ra == 9000000001058660
>    Oops[#1]:
>    CPU: 0 UID: 0 PID: 202 Comm: kworker/0:3 Not tainted 6.16.0+ #103 PREEMPT(full)
>    Hardware name: To be filled by O.E.M.To be fill To be filled by O.E.M.To be fill/To be filled by O.E.M.To be fill, BIOS Loongson-UDK2018-V4.0.
>    Workqueue: events work_for_cpu_fn
>    pc 9000000001058044 ra 9000000001058660 tp 9000000101500000 sp 9000000101503aa0
>    a0 ffffffffff800000 a1 0000000ffffe0000 a2 0000000000000000 a3 90000001207c58e0
>    a4 9000000001a4c310 a5 0000000000000001 a6 0000000000000000 a7 0000000000000001
>    t0 000003ffff800000 t1 0000000000000001 t2 0000040000000000 t3 03ffff0000002000
>    t4 0000000000000000 t5 0001010101010101 t6 ffff800000000000 t7 0001000000000000
>    t8 000000000000002f u0 0000000000800000 s9 9000000002026000 s0 90000001207c58e0
>    s1 0000000000000001 s2 9000000001935c40 s3 0000001000000000 s4 0000000000000001
>    s5 0000000ffffe0000 s6 0000000000000040 s7 0001000000000001 s8 0001000000000000
>       ra: 9000000001058660 memmap_init_zone_device+0x120/0x1b0
>      ERA: 9000000001058044 __init_zone_device_page.constprop.0+0x4/0x1a0
>     CRMD: 000000b0 (PLV0 -IE -DA +PG DACF=CC DACM=CC -WE)
>     PRMD: 00000004 (PPLV0 +PIE -PWE)
>     EUEN: 00000000 (-FPE -SXE -ASXE -BTE)
>     ECFG: 00071c1d (LIE=0,2-4,10-12 VS=7)
>    ESTAT: 00020000 [PIS] (IS= ECode=2 EsubCode=0)
>     BADV: ffffffffff800034
>     PRID: 0014d010 (Loongson-64bit, Loongson-3C6000/S)
>    Modules linked in: amdgpu(+) vfat fat cfg80211 rfkill 8021q garp stp mrp llc snd_hda_codec_atihdmi snd_hda_codec_hdmi snd_hda_codec_conexant snd_hda_codec_generic drm_client_lib drm_ttm_helper syscopyarea ttm sysfillrect sysimgblt fb_sys_fops drm_panel_backlight_quirks video drm_exec drm_suballoc_helper amdxcp mfd_core drm_buddy gpu_sched drm_display_helper drm_kms_helper cec snd_hda_intel ipmi_ssif snd_intel_dspcfg snd_hda_codec snd_hda_core acpi_ipmi snd_hwdep snd_pcm fb loongson3_cpufreq lcd igc snd_timer ipmi_si spi_loongson_pci spi_loongson_core snd ipmi_devintf soundcore ipmi_msghandler binfmt_misc fuse drm drm_panel_orientation_quirks backlight dm_mod dax nfnetlink
>    Process kworker/0:3 (pid: 202, threadinfo=00000000eb7cd5d6, task=000000004ca22b1b)
>    Stack : 0000000000001440 0000000000000000 ffffffffff800000 0000000000000001
>            90000000020b5978 9000000101503b38 0000000000000001 0000000000000001
>            0000000000000000 90000000020b5978 90000000020b3f48 0000000000001440
>            0000000000000000 90000001207c58e0 90000001207c5970 9000000000575e20
>            90000000010e2e00 90000000020b3f48 900000000205c238 0000000000000000
>            00000000000001d3 90000001207c58e0 9000000001958f28 9000000120790848
>            90000001207b3510 0000000000000000 9000000120780000 9000000120780010
>            90000001207d6000 90000001207c58e0 90000001015660c8 9000000120780000
>            0000000000000000 90000000005763a8 90000001207c58e0 00000003ff000000
>            9000000120780000 ffff80000296b820 900000012078f968 90000001207c6000
>            ...
>    Call Trace:
>    [<9000000001058044>] __init_zone_device_page.constprop.0+0x4/0x1a0
>    [<900000000105865c>] memmap_init_zone_device+0x11c/0x1b0
>    [<9000000000575e1c>] memremap_pages+0x24c/0x7b0
>    [<90000000005763a4>] devm_memremap_pages+0x24/0x80
>    [<ffff80000296b81c>] kgd2kfd_init_zone_device+0x11c/0x220 [amdgpu]
>    [<ffff80000265d09c>] amdgpu_device_init+0x27dc/0x2bf0 [amdgpu]
>    [<ffff80000265ece8>] amdgpu_driver_load_kms+0x18/0x90 [amdgpu]
>    [<ffff800002651fbc>] amdgpu_pci_probe+0x22c/0x890 [amdgpu]
>    [<9000000000916adc>] local_pci_probe+0x3c/0xb0
>    [<90000000002976c8>] work_for_cpu_fn+0x18/0x30
>    [<900000000029aeb4>] process_one_work+0x164/0x320
>    [<900000000029b96c>] worker_thread+0x37c/0x4a0
>    [<90000000002a695c>] kthread+0x12c/0x220
>    [<9000000001055b64>] ret_from_kernel_thread+0x24/0xc0
>    [<9000000000237524>] ret_from_kernel_thread_asm+0xc/0x88
>
>    Code: 00000000  00000000  0280040d <2980d08d> 02bffc0e  2980c08e  02c0208d  29c0208d  1400004f
>
>    ---[ end trace 0000000000000000 ]---
>
> Or lock up and/or driver reset during compute tasks, such as when
> running llama.cpp over ROCm, at which point the compute process must be
> killed before the reset could complete:
>
>    amdgpu 0000:0a:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
>    amdgpu 0000:0a:00.0: amdgpu: failed to remove hardware queue from MES, doorbell=0x1202
>    amdgpu 0000:0a:00.0: amdgpu: MES might be in unrecoverable state, issue a GPU reset
>    amdgpu 0000:0a:00.0: amdgpu: Failed to evict queue 3
>    amdgpu 0000:0a:00.0: amdgpu: GPU reset begin!
>    amdgpu 0000:0a:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
>    amdgpu 0000:0a:00.0: amdgpu: failed to remove hardware queue from MES, doorbell=0x1004
>    amdgpu 0000:0a:00.0: amdgpu: MES might be in unrecoverable state, issue a GPU reset
>    amdgpu 0000:0a:00.0: amdgpu: Failed to evict queue 2
>    amdgpu 0000:0a:00.0: amdgpu: Failed to evict queue 1
>    amdgpu 0000:0a:00.0: amdgpu: Failed to evict queue 0
>    amdgpu: Failed to quiesce KFD
>    amdgpu 0000:0a:00.0: amdgpu: Dumping IP State
>    amdgpu 0000:0a:00.0: amdgpu: Dumping IP State Completed
>    amdgpu 0000:0a:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
>    [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
>    amdgpu 0000:0a:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
>    [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
>    amdgpu 0000:0a:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
>    [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
>    amdgpu 0000:0a:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
>    [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
>    amdgpu 0000:0a:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
>    [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
>    amdgpu 0000:0a:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
>    [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
>    amdgpu 0000:0a:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
>    [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
>    amdgpu 0000:0a:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
>    [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
>    amdgpu 0000:0a:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
>    [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
>    amdgpu 0000:0a:00.0: amdgpu: MODE1 reset
>    amdgpu 0000:0a:00.0: amdgpu: GPU mode1 reset
>    amdgpu 0000:0a:00.0: amdgpu: GPU smu mode1 reset
>    amdgpu 0000:0a:00.0: amdgpu: GPU reset succeeded, trying to resume
>
> Disabling the aforementioned option makes the issue go away, though it is
> unclear whether this is a platform-specific issue or one that lies within
> the amdkfd code.
>
> This patch has been tested on all the aforementioned platform
> combinations, and sent as an RFC to encourage discussion.
>
> Signed-off-by: Zhang Yuhao <xinmu at xinmu.moe>
> Signed-off-by: Mingcong Bai <jeffbai at aosc.io>
> Tested-by: Mingcong Bai <jeffbai at aosc.io>
> ---
>   drivers/gpu/drm/amd/amdkfd/Kconfig | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/Kconfig b/drivers/gpu/drm/amd/amdkfd/Kconfig
> index 16e12c9913f94..5d2fa86f60bf8 100644
> --- a/drivers/gpu/drm/amd/amdkfd/Kconfig
> +++ b/drivers/gpu/drm/amd/amdkfd/Kconfig
> @@ -14,7 +14,7 @@ config HSA_AMD
>   
>   config HSA_AMD_SVM
>   	bool "Enable HMM-based shared virtual memory manager"
> -	depends on HSA_AMD && DEVICE_PRIVATE
> +	depends on HSA_AMD && DEVICE_PRIVATE && !LOONGARCH && !ARM64
>   	default y
>   	select HMM_MIRROR
>   	select MMU_NOTIFIER