[PATCH 0/3] drm/nouveau: Fix & improve nouveau_fence_done()

Christian König christian.koenig at amd.com
Thu Apr 10 12:18:50 UTC 2025


On 10.04.25 at 11:51, Philipp Stanner wrote:
> On Thu, 2025-04-10 at 11:24 +0200, Philipp Stanner wrote:
>> Contains two patches improving nouveau_fence_done(), and one
>> addressing
>> an actual bug (race):
> Oops, that's the wrong calltrace. Here we go:
>
> [ 85.791794] Call Trace: [ 85.791796] <TASK> [ 85.791797] ? nouveau_fence_context_kill (/home/imperator/linux/./include/linux/dma-fence.h:587 (discriminator 9) /home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_fence.c:94 (discriminator 9)) nouveau [ 85.791874] ? __warn.cold (/home/imperator/linux/kernel/panic.c:748) [ 85.791878] ? nouveau_fence_context_kill (/home/imperator/linux/./include/linux/dma-fence.h:587 (discriminator 9) /home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_fence.c:94 (discriminator 9)) nouveau [ 85.791950] ? report_bug (/home/imperator/linux/lib/bug.c:180 /home/imperator/linux/lib/bug.c:219) [ 85.791953] ? handle_bug (/home/imperator/linux/arch/x86/kernel/traps.c:260) [ 85.791956] ? exc_invalid_op (/home/imperator/linux/arch/x86/kernel/traps.c:309 (discriminator 1)) [ 85.791957] ? asm_exc_invalid_op (/home/imperator/linux/./arch/x86/include/asm/idtentry.h:621) [ 85.791960] ? nouveau_fence_context_kill (/home/imperator/linux/./include/linux/dma-fence.h:587 (discriminator 9) /home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_fence.c:94 (discriminator 9)) nouveau [ 85.792028] drm_sched_fini.cold (/home/imperator/linux/./include/trace/../../drivers/gpu/drm/scheduler/gpu_scheduler_trace.h:72 (discriminator 1)) gpu_sched [ 85.792033] ? 
drm_sched_entity_kill.part.0 (/home/imperator/linux/drivers/gpu/drm/scheduler/sched_entity.c:243 (discriminator 2)) gpu_sched [ 85.792037] nouveau_sched_destroy (/home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_sched.c:509 /home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_sched.c:518) nouveau [ 85.792122] nouveau_abi16_chan_fini.isra.0 (/home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_abi16.c:188) nouveau [ 85.792191] nouveau_abi16_fini (/home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_abi16.c:224 (discriminator 3)) nouveau [ 85.792263] nouveau_drm_postclose (/home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_drm.c:1240) nouveau [ 85.792349] drm_file_free (/home/imperator/linux/drivers/gpu/drm/drm_file.c:255) [ 85.792353] drm_release (/home/imperator/linux/./arch/x86/include/asm/atomic.h:67 (discriminator 1) /home/imperator/linux/./include/linux/atomic/atomic-arch-fallback.h:2278 (discriminator 1) /home/imperator/linux/./include/linux/atomic/atomic-instrumented.h:1384 (discriminator 1) /home/imperator/linux/drivers/gpu/drm/drm_file.c:428 (discriminator 1)) [ 85.792355] __fput (/home/imperator/linux/fs/file_table.c:464) [ 85.792357] task_work_run (/home/imperator/linux/kernel/task_work.c:227) [ 85.792360] do_exit (/home/imperator/linux/kernel/exit.c:939) [ 85.792362] do_group_exit (/home/imperator/linux/kernel/exit.c:1069) [ 85.792364] get_signal (/home/imperator/linux/kernel/signal.c:3036) [ 85.792366] arch_do_signal_or_restart (/home/imperator/linux/./arch/x86/include/asm/syscall.h:38 /home/imperator/linux/arch/x86/kernel/signal.c:264 /home/imperator/linux/arch/x86/kernel/signal.c:339) [ 85.792369] syscall_exit_to_user_mode (/home/imperator/linux/kernel/entry/common.c:113 /home/imperator/linux/./include/linux/entry-common.h:329 /home/imperator/linux/kernel/entry/common.c:207 /home/imperator/linux/kernel/entry/common.c:218) [ 85.792372] do_syscall_64 (/home/imperator/linux/./arch/x86/include/asm/cpufeature.h:172 
/home/imperator/linux/arch/x86/entry/common.c:98) [ 85.792373] ? syscall_exit_to_user_mode_prepare (/home/imperator/linux/./include/linux/audit.h:357 /home/imperator/linux/kernel/entry/common.c:166 /home/imperator/linux/kernel/entry/common.c:200) [ 85.792376] ? syscall_exit_to_user_mode (/home/imperator/linux/./arch/x86/include/asm/paravirt.h:686 /home/imperator/linux/./include/linux/entry-common.h:232 /home/imperator/linux/kernel/entry/common.c:206 /home/imperator/linux/kernel/entry/common.c:218) [ 85.792377] ? do_syscall_64 (/home/imperator/linux/./arch/x86/include/asm/cpufeature.h:172 /home/imperator/linux/arch/x86/entry/common.c:98) [ 85.792378] entry_SYSCALL_64_after_hwframe (/home/imperator/linux/arch/x86/entry/entry_64.S:130) [ 85.792381] RIP: 0033:0x7ff950b6af70 [ 85.792383] Code: Unable to access opcode bytes at 0x7ff950b6af46. objdump: '/tmp/tmp.sfPRl5k2te.o': No such file Code starting with the faulting instruction =========================================== [ 85.792383] RSP: 002b:00007ff93cdfb6f0 EFLAGS: 00000293 ORIG_RAX: 000000000000010f [ 85.792385] RAX: fffffffffffffdfe RBX: 000055d386d61870 RCX: 00007ff950b6af70 [ 85.792386] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 00007ff928000b90 [ 85.792387] RBP: 00007ff93cdfb740 R08: 0000000000000008 R09: 0000000000000000 [ 85.792388] R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000001 [ 85.792388] R13: 0000000000000000 R14: 0000000000000000 R15: 00007ff951b10b40 [ 85.792390] </TASK> [ 85.792391] ---[ end trace 0000000000000000 ]---

I think I understand the problem now as well, but that backtrace is completely mangled in the mail.

It would be nice if you could send that out again.

Thanks,
Christian.

>
> By the way, for reference:
> I did check whether nouveau_fence_signal() could be incorporated
> into nouveau_fence_update() and nouveau_fence_done(). This, however,
> would cause a race with the list_del() in
> nouveau_fence_no_signaling(), which WARNs because of the list poison.
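For readers not familiar with the list poison mentioned above, here is a minimal userspace sketch of the idea. It is not kernel code: the poison values, list_del_checked() and the other helpers are stand-ins made up for illustration, loosely modelled on what the kernel's list debugging does when an entry is deleted twice.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct list_head {
	struct list_head *next, *prev;
};

/* Stand-in poison values (assumption); the kernel has its own constants. */
#define POISON_NEXT ((struct list_head *)(uintptr_t)0x100)
#define POISON_PREV ((struct list_head *)(uintptr_t)0x122)

static void list_init(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

static void list_add(struct list_head *entry, struct list_head *head)
{
	assert(entry != head);
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

/*
 * Unlink an entry and poison its pointers. Returns false, where the
 * kernel would WARN, if the entry has already been unlinked.
 */
static bool list_del_checked(struct list_head *entry)
{
	if (entry->next == POISON_NEXT || entry->prev == POISON_PREV)
		return false;

	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	entry->next = POISON_NEXT;
	entry->prev = POISON_PREV;
	return true;
}
```

If two code paths both believe they own the removal of the same fence from the pending list, the second call hits the poison check. That is the shape of the race described above.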
>
> So the "solution" space is:
>  * Adding a cleanup callback to the dma_fence.
>  * Keeping the current race, or replacing it with another race in
>    another function.
>  * Just preventing nouveau_fence_done() from signaling fences other
>    than through nouveau_fence_update()/nouveau_fence_signal().
>
> The latter clearly seems like the cleanest solution to me. The
> alternative would be a work-intensive rework of all the broken design
> in nouveau_fence.c
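A toy model of the last option, for illustration only. This is not the patch and not the nouveau code; every name here (toy_fence, toy_chan, toy_signal() and so on) is hypothetical. The point it tries to show is the invariant: the "done" check never marks a fence signaled on its own, it only goes through the update path, which always unlinks the fence from the pending list at the same time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PENDING 8

struct toy_fence {
	unsigned int seqno;
	bool signaled;
	bool pending;
};

struct toy_chan {
	unsigned int hw_seqno;			/* last seqno the "hardware" completed */
	struct toy_fence *pending[MAX_PENDING];
	size_t npending;
};

static void toy_emit(struct toy_chan *c, struct toy_fence *f, unsigned int seqno)
{
	assert(c->npending < MAX_PENDING);
	f->seqno = seqno;
	f->signaled = false;
	f->pending = true;
	c->pending[c->npending++] = f;
}

/* The only place that sets ->signaled; it always unlinks as well. */
static void toy_signal(struct toy_chan *c, size_t idx)
{
	struct toy_fence *f = c->pending[idx];

	c->pending[idx] = c->pending[--c->npending];
	f->pending = false;
	f->signaled = true;
}

/* Signal every pending fence the hardware has already passed. */
static void toy_update(struct toy_chan *c)
{
	size_t i = 0;

	while (i < c->npending) {
		if (c->pending[i]->seqno <= c->hw_seqno)
			toy_signal(c, i);	/* slot was refilled, re-check it */
		else
			i++;
	}
}

/* Never signals directly; it only reports what toy_update() decided. */
static bool toy_done(struct toy_chan *c, struct toy_fence *f)
{
	if (!f->signaled)
		toy_update(c);
	return f->signaled;
}
```

With this structure a fence can never be both signaled and still on the pending list, because there is exactly one function that changes either state.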
>
>
> P.
>
>> [   39.848463] WARNING: CPU: 21 PID: 1734 at
>> drivers/gpu/drm/nouveau/nouveau_fence.c:509
>> nouveau_fence_no_signaling+0xac/0xd0 [nouveau]
>> [   39.848551] Modules linked in: snd_seq_dummy snd_hrtimer
>> nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet
>> nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4
>> nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack
>> nf_defrag_ipv6 nf_defrag_ipv4 rfkill ip_set nf_tables qrtr sunrpc
>> snd_sof_pci_intel_tgl snd_sof_pci_intel_cnl snd_sof_intel_hda_generic
>> snd_sof_pci snd_sof_xtensa_dsp snd_sof_intel_hda_common
>> snd_soc_hdac_hda snd_sof_intel_hda snd_sof snd_sof_utils
>> snd_soc_acpi_intel_match snd_soc_acpi snd_soc_acpi_intel_sdca_quirks
>> snd_sof_intel_hda_mlink snd_soc_sdca snd_soc_avs snd_ctl_led
>> snd_soc_hda_codec intel_rapl_msr snd_hda_codec_realtek
>> snd_hda_ext_core intel_rapl_common snd_hda_codec_generic snd_soc_core
>> snd_hda_scodec_component intel_uncore_frequency
>> intel_uncore_frequency_common snd_hda_codec_hdmi intel_ifs
>> snd_compress i10nm_edac skx_edac_common nfit snd_hda_intel
>> snd_intel_dspcfg libnvdimm snd_hda_codec binfmt_misc snd_hwdep
>> snd_hda_core snd_seq snd_seq_device dell_wmi
>> [   39.848575]  dell_pc x86_pkg_temp_thermal spi_nor platform_profile
>> sparse_keymap intel_powerclamp dax_hmem snd_pcm cxl_acpi coretemp
>> cxl_port iTCO_wdt mtd rapl intel_pmc_bxt pmt_telemetry cxl_core
>> dell_wmi_sysman pmt_class iTCO_vendor_support snd_timer isst_if_mmio
>> vfat intel_cstate dell_smbios dcdbas fat dell_wmi_ddv dell_smm_hwmon
>> dell_wmi_descriptor firmware_attributes_class wmi_bmof intel_uncore
>> einj pcspkr isst_if_mbox_pci atlantic snd isst_if_common intel_vsec
>> e1000e macsec mei_me i2c_i801 spi_intel_pci soundcore i2c_smbus
>> spi_intel mei joydev loop nfnetlink zram nouveau drm_ttm_helper ttm
>> polyval_clmulni iaa_crypto gpu_sched polyval_generic rtsx_pci_sdmmc
>> ghash_clmulni_intel i2c_algo_bit mmc_core drm_gpuvm sha512_ssse3
>> nvme drm_exec drm_display_helper sha256_ssse3 idxd sha1_ssse3 cec
>> nvme_core idxd_bus rtsx_pci nvme_auth pinctrl_alderlake ip6_tables
>> ip_tables fuse
>> [   39.848603] CPU: 21 UID: 42 PID: 1734 Comm: gnome-shell Tainted:
>> G        W          6.14.0-rc4+ #11
>> [   39.848605] Tainted: [W]=WARN
>> [   39.848606] Hardware name: Dell Inc. Precision 7960 Tower/01G0M6,
>> BIOS 2.7.0 12/17/2024
>> [   39.848607] RIP: 0010:nouveau_fence_no_signaling+0xac/0xd0
>> [nouveau]
>> [   39.848688] Code: db 74 17 48 8d 7b 38 b8 ff ff ff ff f0 0f c1 43
>> 38 83 f8 01 74 29 85 c0 7e 17 31 c0 5b 5d c3 cc cc cc cc e8 76 b2 c5
>> f0 eb 96 <0f> 0b e9 67 ff ff ff be 03 00 00 00 e8 83 76 33 f1 31 c0
>> eb dd e8
>> [   39.848690] RSP: 0018:ff1cc1ffc5c039f0 EFLAGS: 00010046
>> [   39.848691] RAX: 0000000000000001 RBX: ff175a3b504da980 RCX:
>> ff175a3b4801e008
>> [   39.848692] RDX: ff175a3b43e7bad0 RSI: ffffffffc09d3fda RDI:
>> ff175a3b504da980
>> [   39.848693] RBP: ff175a3b504da9c0 R08: ffffffffc09e39df R09:
>> 0000000000000001
>> [   39.848694] R10: 0000000000000001 R11: 0000000000000000 R12:
>> ff175a3b6d97de00
>> [   39.848695] R13: 0000000000000246 R14: ff1cc1ffc5c03c60 R15:
>> 0000000000000001
>> [   39.848696] FS:  00007fc5477846c0(0000) GS:ff175a5a50280000(0000)
>> knlGS:0000000000000000
>> [   39.848698] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [   39.848699] CR2: 000055cb7613d1a8 CR3: 000000012e5ce004 CR4:
>> 0000000000f71ef0
>> [   39.848700] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
>> 0000000000000000
>> [   39.848701] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7:
>> 0000000000000400
>> [   39.848702] PKRU: 55555554
>> [   39.848703] Call Trace:
>> [   39.848704]  <TASK>
>> [   39.848705]  ? nouveau_fence_no_signaling+0xac/0xd0 [nouveau]
>> [   39.848782]  ? __warn.cold+0x93/0xfa
>> [   39.848785]  ? nouveau_fence_no_signaling+0xac/0xd0 [nouveau]
>> [   39.848861]  ? report_bug+0xff/0x140
>> [   39.848863]  ? handle_bug+0x58/0x90
>> [   39.848865]  ? exc_invalid_op+0x17/0x70
>> [   39.848866]  ? asm_exc_invalid_op+0x1a/0x20
>> [   39.848870]  ? nouveau_fence_no_signaling+0xac/0xd0 [nouveau]
>> [   39.848943]  nouveau_fence_enable_signaling+0x32/0x80 [nouveau]
>> [   39.849016]  ? __pfx_nouveau_fence_cleanup_cb+0x10/0x10 [nouveau]
>> [   39.849088]  __dma_fence_enable_signaling+0x33/0xc0
>> [   39.849090]  dma_fence_add_callback+0x4b/0xd0
>> [   39.849093]  nouveau_fence_emit+0xa3/0x260 [nouveau]
>> [   39.849166]  nouveau_fence_new+0x7d/0xf0 [nouveau]
>> [   39.849242]  nouveau_gem_ioctl_pushbuf+0xe8f/0x1300 [nouveau]
>> [   39.849338]  ? __pfx_nouveau_gem_ioctl_pushbuf+0x10/0x10 [nouveau]
>> [   39.849431]  drm_ioctl_kernel+0xad/0x100
>> [   39.849433]  drm_ioctl+0x288/0x550
>> [   39.849435]  ? __pfx_nouveau_gem_ioctl_pushbuf+0x10/0x10 [nouveau]
>> [   39.849526]  nouveau_drm_ioctl+0x57/0xb0 [nouveau]
>> [   39.849620]  __x64_sys_ioctl+0x94/0xc0
>> [   39.849621]  do_syscall_64+0x82/0x160
>> [   39.849623]  ? drm_ioctl+0x2b7/0x550
>> [   39.849625]  ? __pfx_nouveau_gem_ioctl_pushbuf+0x10/0x10 [nouveau]
>> [   39.849719]  ? ktime_get_mono_fast_ns+0x38/0xd0
>> [   39.849721]  ? __pm_runtime_suspend+0x69/0xc0
>> [   39.849724]  ? syscall_exit_to_user_mode_prepare+0x15e/0x1a0
>> [   39.849726]  ? syscall_exit_to_user_mode+0x10/0x200
>> [   39.849729]  ? do_syscall_64+0x8e/0x160
>> [   39.849730]  ? exc_page_fault+0x7e/0x1a0
>> [   39.849733]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
>> [   39.849735] RIP: 0033:0x7fc5576fe0ad
>> [   39.849736] Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10
>> c7 45 b0 10 00 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00
>> 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28
>> 00 00 00
>> [   39.849737] RSP: 002b:00007ffc002688a0 EFLAGS: 00000246 ORIG_RAX:
>> 0000000000000010
>> [   39.849739] RAX: ffffffffffffffda RBX: 000055cb74e316c0 RCX:
>> 00007fc5576fe0ad
>> [   39.849740] RDX: 00007ffc00268960 RSI: 00000000c0406481 RDI:
>> 000000000000000e
>> [   39.849741] RBP: 00007ffc002688f0 R08: 0000000000000000 R09:
>> 000055cb74e35560
>> [   39.849742] R10: 0000000000000014 R11: 0000000000000246 R12:
>> 00007ffc00268960
>> [   39.849744] R13: 00000000c0406481 R14: 000000000000000e R15:
>> 000055cb74e3cd10
>> [   39.849746]  </TASK>
>> [   39.849746] ---[ end trace 0000000000000000 ]---
>> [   39.849776] ------------[ cut here ]------------
>>
>>
>> This is the first WARN_ON() in dma_fence_set_error(), called by
>> nouveau_fence_context_kill().
>>
>> It's rare, but it is a bug, or rather: the archetype of a race, since
>> (as Christian pointed out) nouveau_fence_update() later at some point
>> will remove the signaled fence (by signaling it again).
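To make the WARN described above concrete, here is a small userspace model. It is a sketch under assumptions, not the kernel implementation: err_fence, toy_fence_set_error() and toy_context_kill() are invented names, and the warning is replaced by a counter. It shows what happens when a "kill" loop walks a pending list that still contains an already-signaled fence.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct err_fence {
	bool signaled;
	int error;
};

/* Counts how often the "already signaled" check trips (stand-in for WARN_ON). */
static unsigned int toy_warn_count;

static void toy_fence_set_error(struct err_fence *f, int error)
{
	assert(error < 0);	/* errors are negative errno values */

	/* An error may only be attached before the fence is signaled. */
	if (f->signaled)
		toy_warn_count++;

	f->error = error;
}

/* Mark everything on the pending list as failed, then signal it. */
static void toy_context_kill(struct err_fence **pending, size_t n, int error)
{
	for (size_t i = 0; i < n; i++) {
		toy_fence_set_error(pending[i], error);
		pending[i]->signaled = true;
	}
}
```

A fence that was signaled elsewhere but never removed from the pending list trips the check exactly once when the list is killed, which matches the "rare, but it is a bug" behaviour described above.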
>>
>>
>> P.
>>
>>
>> Philipp Stanner (3):
>>   drm/nouveau: Prevent signaled fences in pending list
>>   drm/nouveau: Remove surplus if-branch
>>   drm/nouveau: Add helper to check base fence
>>
>>  drivers/gpu/drm/nouveau/nouveau_fence.c | 32 ++++++++++++++-----------
>>  1 file changed, 18 insertions(+), 14 deletions(-)
>>


