[PATCH 0/3] drm/nouveau: Fix & improve nouveau_fence_done()
Philipp Stanner
phasta at mailbox.org
Thu Apr 10 13:18:43 UTC 2025
On Thu, 2025-04-10 at 14:18 +0200, Christian König wrote:
> Am 10.04.25 um 11:51 schrieb Philipp Stanner:
> > On Thu, 2025-04-10 at 11:24 +0200, Philipp Stanner wrote:
> > > Contains two patches improving nouveau_fence_done(), and one
> > > addressing an actual bug (race):
> > Oops, that's the wrong calltrace. Here we go:
> >
> > [ 85.791794] Call Trace: [ 85.791796] <TASK> [ 85.791797] ?
> > nouveau_fence_context_kill
> > (/home/imperator/linux/./include/linux/dma-fence.h:587
> > (discriminator 9)
> > /home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_fence.c:94
> > (discriminator 9)) nouveau [ 85.791874] ? __warn.cold
> > (/home/imperator/linux/kernel/panic.c:748) [ 85.791878] ?
> > nouveau_fence_context_kill
> > (/home/imperator/linux/./include/linux/dma-fence.h:587
> > (discriminator 9)
> > /home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_fence.c:94
> > (discriminator 9)) nouveau [ 85.791950] ? report_bug
> > (/home/imperator/linux/lib/bug.c:180
> > /home/imperator/linux/lib/bug.c:219) [ 85.791953] ? handle_bug
> > (/home/imperator/linux/arch/x86/kernel/traps.c:260) [ 85.791956] ?
> > exc_invalid_op (/home/imperator/linux/arch/x86/kernel/traps.c:309
> > (discriminator 1)) [ 85.791957] ? asm_exc_invalid_op
> > (/home/imperator/linux/./arch/x86/include/asm/idtentry.h:621) [
> > 85.791960] ? nouveau_fence_context_kill
> > (/home/imperator/linux/./include/linux/dma-fence.h:587
> > (discriminator 9)
> > /home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_fence.c:94
> > (discriminator 9)) nouveau [ 85.792028] drm_sched_fini.cold
> > (/home/imperator/linux/./include/trace/../../drivers/gpu/drm/schedu
> > ler/gpu_scheduler_trace.h:72 (discriminator 1)) gpu_sched [
> > 85.792033] ? drm_sched_entity_kill.part.0
> > (/home/imperator/linux/drivers/gpu/drm/scheduler/sched_entity.c:243
> > (discriminator 2)) gpu_sched [ 85.792037] nouveau_sched_destroy
> > (/home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_sched.c:509
> > /home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_sched.c:518)
> > nouveau [ 85.792122] nouveau_abi16_chan_fini.isra.0
> > (/home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_abi16.c:188)
> > nouveau [ 85.792191] nouveau_abi16_fini
> > (/home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_abi16.c:224
> > (discriminator 3)) nouveau [ 85.792263] nouveau_drm_postclose
> > (/home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_drm.c:1240)
> > nouveau [ 85.792349] drm_file_free
> > (/home/imperator/linux/drivers/gpu/drm/drm_file.c:255) [ 85.792353]
> > drm_release
> > (/home/imperator/linux/./arch/x86/include/asm/atomic.h:67
> > (discriminator 1)
> > /home/imperator/linux/./include/linux/atomic/atomic-arch-
> > fallback.h:2278 (discriminator 1)
> > /home/imperator/linux/./include/linux/atomic/atomic-
> > instrumented.h:1384 (discriminator 1)
> > /home/imperator/linux/drivers/gpu/drm/drm_file.c:428 (discriminator
> > 1)) [ 85.792355] __fput (/home/imperator/linux/fs/file_table.c:464)
> > [ 85.792357] task_work_run
> > (/home/imperator/linux/kernel/task_work.c:227) [ 85.792360] do_exit
> > (/home/imperator/linux/kernel/exit.c:939) [ 85.792362]
> > do_group_exit (/home/imperator/linux/kernel/exit.c:1069) [
> > 85.792364] get_signal (/home/imperator/linux/kernel/signal.c:3036)
> > [ 85.792366] arch_do_signal_or_restart
> > (/home/imperator/linux/./arch/x86/include/asm/syscall.h:38
> > /home/imperator/linux/arch/x86/kernel/signal.c:264
> > /home/imperator/linux/arch/x86/kernel/signal.c:339) [ 85.792369]
> > syscall_exit_to_user_mode
> > (/home/imperator/linux/kernel/entry/common.c:113
> > /home/imperator/linux/./include/linux/entry-common.h:329
> > /home/imperator/linux/kernel/entry/common.c:207
> > /home/imperator/linux/kernel/entry/common.c:218) [ 85.792372]
> > do_syscall_64
> > (/home/imperator/linux/./arch/x86/include/asm/cpufeature.h:172
> > /home/imperator/linux/arch/x86/entry/common.c:98) [ 85.792373] ?
> > syscall_exit_to_user_mode_prepare
> > (/home/imperator/linux/./include/linux/audit.h:357
> > /home/imperator/linux/kernel/entry/common.c:166
> > /home/imperator/linux/kernel/entry/common.c:200) [ 85.792376] ?
> > syscall_exit_to_user_mode
> > (/home/imperator/linux/./arch/x86/include/asm/paravirt.h:686
> > /home/imperator/linux/./include/linux/entry-common.h:232
> > /home/imperator/linux/kernel/entry/common.c:206
> > /home/imperator/linux/kernel/entry/common.c:218) [ 85.792377] ?
> > do_syscall_64
> > (/home/imperator/linux/./arch/x86/include/asm/cpufeature.h:172
> > /home/imperator/linux/arch/x86/entry/common.c:98) [ 85.792378]
> > entry_SYSCALL_64_after_hwframe
> > (/home/imperator/linux/arch/x86/entry/entry_64.S:130) [ 85.792381]
> > RIP: 0033:0x7ff950b6af70 [ 85.792383] Code: Unable to access opcode
> > bytes at 0x7ff950b6af46. objdump: '/tmp/tmp.sfPRl5k2te.o': No such
> > file Code starting with the faulting instruction
> > =========================================== [ 85.792383] RSP:
> > 002b:00007ff93cdfb6f0 EFLAGS: 00000293 ORIG_RAX: 000000000000010f [
> > 85.792385] RAX: fffffffffffffdfe RBX: 000055d386d61870 RCX:
> > 00007ff950b6af70 [ 85.792386] RDX: 0000000000000000 RSI:
> > 0000000000000001 RDI: 00007ff928000b90 [ 85.792387] RBP:
> > 00007ff93cdfb740 R08: 0000000000000008 R09: 0000000000000000 [
> > 85.792388] R10: 0000000000000000 R11: 0000000000000293 R12:
> > 0000000000000001 [ 85.792388] R13: 0000000000000000 R14:
> > 0000000000000000 R15: 00007ff951b10b40 [ 85.792390] </TASK> [
> > 85.792391] ---[ end trace 0000000000000000 ]---
>
> I think I understand the problem now as well, but that backtrace is
> completely mangled in the mail.
>
> It would be nice if you could send that out again.
I really need to install Mutt soon..
Let's try it this way:
https://paste.debian.net/1368679/
P.
>
> Thanks,
> Christian.
>
> >
> > By the way, for reference:
> > I did check whether nouveau_fence_signal() could be incorporated
> > into nouveau_fence_update() and nouveau_fence_done(). This, however,
> > would then cause a race with the list_del() in
> > nouveau_fence_no_signaling(), WARNing because of the list poison.
> >
> > So the "solution" space is:
> >  * A cleanup callback on the dma_fence.
> >  * Keeping the current race, or replacing it with another race in
> >    another function.
> >  * Just preventing nouveau_fence_done() from signaling fences other
> >    than through nouveau_fence_update/signal
> >
> > The latter clearly seems like the cleanest solution to me. The
> > alternative would be a work-intensive rework of all the misdesigns
> > in nouveau_fence.c
> >
> >
> > P.
> >
> > > [ 39.848463] WARNING: CPU: 21 PID: 1734 at
> > > drivers/gpu/drm/nouveau/nouveau_fence.c:509
> > > nouveau_fence_no_signaling+0xac/0xd0 [nouveau]
> > > [ 39.848551] Modules linked in: snd_seq_dummy snd_hrtimer
> > > nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet
> > > nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4
> > > nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack
> > > nf_defrag_ipv6 nf_defrag_ipv4 rfkill ip_set nf_tables qrtr sunrpc
> > > snd_sof_pci_intel_tgl snd_sof_pci_intel_cnl
> > > snd_sof_intel_hda_generic snd_sof_pci snd_sof_xtensa_dsp
> > > snd_sof_intel_hda_common snd_soc_hdac_hda snd_sof_intel_hda snd_sof
> > > snd_sof_utils snd_soc_acpi_intel_match snd_soc_acpi
> > > snd_soc_acpi_intel_sdca_quirks snd_sof_intel_hda_mlink snd_soc_sdca
> > > snd_soc_avs snd_ctl_led snd_soc_hda_codec intel_rapl_msr
> > > snd_hda_codec_realtek snd_hda_ext_core intel_rapl_common
> > > snd_hda_codec_generic snd_soc_core snd_hda_scodec_component
> > > intel_uncore_frequency intel_uncore_frequency_common
> > > snd_hda_codec_hdmi intel_ifs snd_compress i10nm_edac
> > > skx_edac_common nfit snd_hda_intel snd_intel_dspcfg libnvdimm
> > > snd_hda_codec binfmt_misc snd_hwdep snd_hda_core snd_seq
> > > snd_seq_device dell_wmi
> > > [ 39.848575] dell_pc x86_pkg_temp_thermal spi_nor platform_profile
> > > sparse_keymap intel_powerclamp dax_hmem snd_pcm cxl_acpi coretemp
> > > cxl_port iTCO_wdt mtd rapl intel_pmc_bxt pmt_telemetry cxl_core
> > > dell_wmi_sysman pmt_class iTCO_vendor_support snd_timer
> > > isst_if_mmio vfat intel_cstate dell_smbios dcdbas fat dell_wmi_ddv
> > > dell_smm_hwmon dell_wmi_descriptor firmware_attributes_class
> > > wmi_bmof intel_uncore einj pcspkr isst_if_mbox_pci atlantic snd
> > > isst_if_common intel_vsec e1000e macsec mei_me i2c_i801
> > > spi_intel_pci soundcore i2c_smbus spi_intel mei joydev loop
> > > nfnetlink zram nouveau drm_ttm_helper ttm polyval_clmulni
> > > iaa_crypto gpu_sched polyval_generic rtsx_pci_sdmmc
> > > ghash_clmulni_intel i2c_algo_bit mmc_core drm_gpuvm sha512_ssse3
> > > nvme drm_exec drm_display_helper sha256_ssse3 idxd sha1_ssse3 cec
> > > nvme_core idxd_bus rtsx_pci nvme_auth pinctrl_alderlake ip6_tables
> > > ip_tables fuse
> > > [ 39.848603] CPU: 21 UID: 42 PID: 1734 Comm: gnome-shell
> > > Tainted:
> > > G W 6.14.0-rc4+ #11
> > > [ 39.848605] Tainted: [W]=WARN
> > > [ 39.848606] Hardware name: Dell Inc. Precision 7960
> > > Tower/01G0M6,
> > > BIOS 2.7.0 12/17/2024
> > > [ 39.848607] RIP: 0010:nouveau_fence_no_signaling+0xac/0xd0
> > > [nouveau]
> > > [ 39.848688] Code: db 74 17 48 8d 7b 38 b8 ff ff ff ff f0 0f c1
> > > 43
> > > 38 83 f8 01 74 29 85 c0 7e 17 31 c0 5b 5d c3 cc cc cc cc e8 76 b2
> > > c5
> > > f0 eb 96 <0f> 0b e9 67 ff ff f
> > > f be 03 00 00 00 e8 83 76 33 f1 31 c0 eb dd e8
> > > [ 39.848690] RSP: 0018:ff1cc1ffc5c039f0 EFLAGS: 00010046
> > > [ 39.848691] RAX: 0000000000000001 RBX: ff175a3b504da980 RCX:
> > > ff175a3b4801e008
> > > [ 39.848692] RDX: ff175a3b43e7bad0 RSI: ffffffffc09d3fda RDI:
> > > ff175a3b504da980
> > > [ 39.848693] RBP: ff175a3b504da9c0 R08: ffffffffc09e39df R09:
> > > 0000000000000001
> > > [ 39.848694] R10: 0000000000000001 R11: 0000000000000000 R12:
> > > ff175a3b6d97de00
> > > [ 39.848695] R13: 0000000000000246 R14: ff1cc1ffc5c03c60 R15:
> > > 0000000000000001
> > > [ 39.848696] FS: 00007fc5477846c0(0000)
> > > GS:ff175a5a50280000(0000)
> > > knlGS:0000000000000000
> > > [ 39.848698] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [ 39.848699] CR2: 000055cb7613d1a8 CR3: 000000012e5ce004 CR4:
> > > 0000000000f71ef0
> > > [ 39.848700] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> > > 0000000000000000
> > > [ 39.848701] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7:
> > > 0000000000000400
> > > [ 39.848702] PKRU: 55555554
> > > [ 39.848703] Call Trace:
> > > [ 39.848704] <TASK>
> > > [ 39.848705] ? nouveau_fence_no_signaling+0xac/0xd0 [nouveau]
> > > [ 39.848782] ? __warn.cold+0x93/0xfa
> > > [ 39.848785] ? nouveau_fence_no_signaling+0xac/0xd0 [nouveau]
> > > [ 39.848861] ? report_bug+0xff/0x140
> > > [ 39.848863] ? handle_bug+0x58/0x90
> > > [ 39.848865] ? exc_invalid_op+0x17/0x70
> > > [ 39.848866] ? asm_exc_invalid_op+0x1a/0x20
> > > [ 39.848870] ? nouveau_fence_no_signaling+0xac/0xd0 [nouveau]
> > > [ 39.848943] nouveau_fence_enable_signaling+0x32/0x80
> > > [nouveau]
> > > [ 39.849016] ? __pfx_nouveau_fence_cleanup_cb+0x10/0x10
> > > [nouveau]
> > > [ 39.849088] __dma_fence_enable_signaling+0x33/0xc0
> > > [ 39.849090] dma_fence_add_callback+0x4b/0xd0
> > > [ 39.849093] nouveau_fence_emit+0xa3/0x260 [nouveau]
> > > [ 39.849166] nouveau_fence_new+0x7d/0xf0 [nouveau]
> > > [ 39.849242] nouveau_gem_ioctl_pushbuf+0xe8f/0x1300 [nouveau]
> > > [ 39.849338] ? __pfx_nouveau_gem_ioctl_pushbuf+0x10/0x10
> > > [nouveau]
> > > [ 39.849431] drm_ioctl_kernel+0xad/0x100
> > > [ 39.849433] drm_ioctl+0x288/0x550
> > > [ 39.849435] ? __pfx_nouveau_gem_ioctl_pushbuf+0x10/0x10
> > > [nouveau]
> > > [ 39.849526] nouveau_drm_ioctl+0x57/0xb0 [nouveau]
> > > [ 39.849620] __x64_sys_ioctl+0x94/0xc0
> > > [ 39.849621] do_syscall_64+0x82/0x160
> > > [ 39.849623] ? drm_ioctl+0x2b7/0x550
> > > [ 39.849625] ? __pfx_nouveau_gem_ioctl_pushbuf+0x10/0x10
> > > [nouveau]
> > > [ 39.849719] ? ktime_get_mono_fast_ns+0x38/0xd0
> > > [ 39.849721] ? __pm_runtime_suspend+0x69/0xc0
> > > [ 39.849724] ? syscall_exit_to_user_mode_prepare+0x15e/0x1a0
> > > [ 39.849726] ? syscall_exit_to_user_mode+0x10/0x200
> > > [ 39.849729] ? do_syscall_64+0x8e/0x160
> > > [ 39.849730] ? exc_page_fault+0x7e/0x1a0
> > > [ 39.849733] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > > [ 39.849735] RIP: 0033:0x7fc5576fe0ad
> > > [ 39.849736] Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45
> > > 10
> > > c7 45 b0 10 00 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00
> > > 00
> > > 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25
> > > 28
> > > 00 00 00
> > > [ 39.849737] RSP: 002b:00007ffc002688a0 EFLAGS: 00000246
> > > ORIG_RAX:
> > > 0000000000000010
> > > [ 39.849739] RAX: ffffffffffffffda RBX: 000055cb74e316c0 RCX:
> > > 00007fc5576fe0ad
> > > [ 39.849740] RDX: 00007ffc00268960 RSI: 00000000c0406481 RDI:
> > > 000000000000000e
> > > [ 39.849741] RBP: 00007ffc002688f0 R08: 0000000000000000 R09:
> > > 000055cb74e35560
> > > [ 39.849742] R10: 0000000000000014 R11: 0000000000000246 R12:
> > > 00007ffc00268960
> > > [ 39.849744] R13: 00000000c0406481 R14: 000000000000000e R15:
> > > 000055cb74e3cd10
> > > [ 39.849746] </TASK>
> > > [ 39.849746] ---[ end trace 0000000000000000 ]---
> > > [ 39.849776] ------------[ cut here ]------------
> > >
> > >
> > > This is the first WARN_ON() in dma_fence_set_error(), called by
> > > nouveau_fence_context_kill().
> > >
> > > It's rare, but it is a bug, or rather: the archetype of a race,
> > > since (as Christian pointed out) nouveau_fence_update() later at
> > > some point will remove the signaled fence (by signaling it again).
> > >
> > >
> > > P.
> > >
> > >
> > > Philipp Stanner (3):
> > > drm/nouveau: Prevent signaled fences in pending list
> > > drm/nouveau: Remove surplus if-branch
> > > drm/nouveau: Add helper to check base fence
> > >
> > >  drivers/gpu/drm/nouveau/nouveau_fence.c | 32 ++++++++++++++-----------
> > > 1 file changed, 18 insertions(+), 14 deletions(-)
> > >
>