[bugreport] [5.10] DEBUG_LOCKS_WARN_ON(ww_ctx->contending_lock != ww) We 'forgot' to unlock everything else first?

Mikhail Gavrilov mikhail.v.gavrilov at gmail.com
Sat Oct 17 22:33:17 UTC 2020


Hi folks.
I observed this issue since 5.3 and it still happens with 5.10 git.
This warning has reproductivity 100% reliable when I launch
"Wolfenstein: Youngblood" version of Mesa doesn't matter.

[73690.883948] ------------[ cut here ]------------
[73690.883953] DEBUG_LOCKS_WARN_ON(ww_ctx->contending_lock != ww)
[73690.883963] WARNING: CPU: 30 PID: 194867 at
kernel/locking/mutex.c:327 __ww_mutex_lock.constprop.0+0xe96/0xef0
[73690.883966] Modules linked in: tun snd_seq_dummy snd_hrtimer uinput
rfcomm xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_nat_tftp
nf_conntrack_tftp bridge stp llc nft_objref nf_conntrack_netbios_ns
nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib
nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct
nft_chain_nat ip6table_nat ip6table_mangle ip6table_raw
ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6
nf_defrag_ipv4 iptable_mangle iptable_raw iptable_security ip_set
nf_tables nfnetlink ip6table_filter ip6_tables iptable_filter cmac
bnep sunrpc vfat fat snd_hda_codec_realtek mt76x2u mt76x2_common
snd_hda_codec_generic mt76x02_usb ledtrig_audio snd_hda_codec_hdmi
mt76_usb mt76x02_lib snd_hda_intel uvcvideo iwlmvm snd_intel_dspcfg
mt76 gspca_zc3xx snd_hda_codec gspca_main joydev videobuf2_vmalloc
snd_usb_audio btusb edac_mce_amd videobuf2_memops snd_hda_core
videobuf2_v4l2 snd_usbmidi_lib kvm_amd btrtl videobuf2_common btbcm
snd_hwdep
[73690.884036]  snd_rawmidi mac80211 btintel snd_seq videodev
snd_seq_device eeepc_wmi libarc4 bluetooth kvm xpad ff_memless snd_pcm
mc iwlwifi asus_wmi irqbypass sparse_keymap ecdh_generic rapl ecc
sp5100_tco video wmi_bmof snd_timer pcspkr snd k10temp i2c_piix4
soundcore cfg80211 rfkill acpi_cpufreq binfmt_misc zram ip_tables
hid_logitech_hidpp hid_logitech_dj amdgpu iommu_v2 gpu_sched ttm
drm_kms_helper crct10dif_pclmul crc32_pclmul crc32c_intel cec drm
ghash_clmulni_intel ccp igb nvme dca i2c_algo_bit nvme_core wmi
pinctrl_amd fuse
[73690.884094] CPU: 30 PID: 194867 Comm: Youngblood_x64v Not tainted
5.10.0-0.rc0.20201014gitb5fc7a89e58b.42.fc34.x86_64 #1
[73690.884097] Hardware name: System manufacturer System Product
Name/ROG STRIX X570-I GAMING, BIOS 2606 08/13/2020
[73690.884100] RIP: 0010:__ww_mutex_lock.constprop.0+0xe96/0xef0
[73690.884103] Code: f2 89 5b 9e 48 c7 c7 1d bb 59 9e e8 ef f6 f8 ff
0f 0b e9 2a fc ff ff 48 c7 c6 d4 89 5b 9e 48 c7 c7 1d bb 59 9e e8 d5
f6 f8 ff <0f> 0b e9 e9 fe ff ff 83 3d 44 3d 81 02 00 75 07 48 83 7d 28
00 75
[73690.884106] RSP: 0018:ffffa1c5d079f8f0 EFLAGS: 00010286
[73690.884108] RAX: 0000000000000032 RBX: 0000000000000001 RCX: ffff8c650a7db178
[73690.884111] RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff8c650a7db170
[73690.884112] RBP: ffffa1c5d079fc38 R08: 0000000000000000 R09: 0000000000000000
[73690.884114] R10: ffffa1c5d079f720 R11: ffff8c652e2fffe8 R12: ffff8c600cd42990
[73690.884116] R13: ffff8c5f055f0000 R14: ffff8c600cd42a00 R15: 0000000000000000
[73690.884119] FS:  00000000060e3640(0000) GS:ffff8c650a600000(0000)
knlGS:000000013ffc0000
[73690.884121] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[73690.884122] CR2: 00007fe25431d010 CR3: 000000011916e000 CR4: 0000000000350ee0
[73690.884124] Call Trace:
[73690.884136]  ? ttm_mem_evict_first+0x212/0x4f0 [ttm]
[73690.884139]  ? __schedule+0x345/0xa80
[73690.884144]  ww_mutex_lock_interruptible+0x43/0xb0
[73690.884149]  ttm_mem_evict_first+0x212/0x4f0 [ttm]
[73690.884157]  ttm_bo_mem_space+0x30f/0x340 [ttm]
[73690.884164]  ttm_bo_validate+0x12b/0x1d0 [ttm]
[73690.884169]  ? sched_clock+0x5/0x10
[73690.884261]  amdgpu_cs_bo_validate+0x8b/0x190 [amdgpu]
[73690.884350]  amdgpu_cs_list_validate+0x10e/0x150 [amdgpu]
[73690.884435]  amdgpu_cs_ioctl+0x7f4/0x1ed0 [amdgpu]
[73690.884531]  ? amdgpu_cs_find_mapping+0xf0/0xf0 [amdgpu]
[73690.884550]  drm_ioctl_kernel+0x8c/0xe0 [drm]
[73690.884563]  drm_ioctl+0x20f/0x3a0 [drm]
[73690.884623]  ? amdgpu_cs_find_mapping+0xf0/0xf0 [amdgpu]
[73690.884625]  ? sched_clock+0x5/0x10
[73690.884628]  ? sched_clock_cpu+0xc/0xb0
[73690.884631]  ? lockdep_hardirqs_on_prepare+0xff/0x180
[73690.884632]  ? _raw_spin_unlock_irqrestore+0x41/0x50
[73690.884684]  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
[73690.884688]  __x64_sys_ioctl+0x83/0xb0
[73690.884691]  do_syscall_64+0x33/0x40
[73690.884693]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[73690.884695] RIP: 0033:0x7fe3209e64cb
[73690.884697] Code: 89 d8 49 8d 3c 1c 48 f7 d8 49 39 c4 72 b5 e8 1c
ff ff ff 85 c0 78 ba 4c 89 e0 5b 5d 41 5c c3 f3 0f 1e fa b8 10 00 00
00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 7d b9 0c 00 f7 d8 64 89
01 48
[73690.884699] RSP: 002b:00000000060db248 EFLAGS: 00000246 ORIG_RAX:
0000000000000010
[73690.884701] RAX: ffffffffffffffda RBX: 00000000060db2d0 RCX: 00007fe3209e64cb
[73690.884702] RDX: 00000000060db2d0 RSI: 00000000c0186444 RDI: 00000000000000d4
[73690.884703] RBP: 00000000c0186444 R08: 00007fe1bd653780 R09: 00000000060db290
[73690.884705] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fe17d82a120
[73690.884706] R13: 00000000000000d4 R14: 00007fe3083ff6f0 R15: 00007fe1bd653720
[73690.884710] irq event stamp: 2315027
[73690.884712] hardirqs last  enabled at (2315027):
[<ffffffff9dd01ac4>] _raw_spin_unlock_irq+0x24/0x40
[73690.884714] hardirqs last disabled at (2315026):
[<ffffffff9dcfb815>] __schedule+0x685/0xa80
[73690.884715] softirqs last  enabled at (2309462):
[<ffffffff9de010d2>] asm_call_irq_on_stack+0x12/0x20
[73690.884717] softirqs last disabled at (2309387):
[<ffffffff9de010d2>] asm_call_irq_on_stack+0x12/0x20
[73690.884718] ---[ end trace 29c797a32d5422cf ]---

$ git blame -L 322,332 kernel/locking/mutex.c
76916515d9d84 (Davidlohr Bueso 2014-07-30 13:41:53 -0700 322)   if
(ww_ctx->contending_lock) {
76916515d9d84 (Davidlohr Bueso 2014-07-30 13:41:53 -0700 323)           /*
76916515d9d84 (Davidlohr Bueso 2014-07-30 13:41:53 -0700 324)
  * After -EDEADLK you tried to
76916515d9d84 (Davidlohr Bueso 2014-07-30 13:41:53 -0700 325)
  * acquire a different ww_mutex? Bad!
76916515d9d84 (Davidlohr Bueso 2014-07-30 13:41:53 -0700 326)            */
76916515d9d84 (Davidlohr Bueso 2014-07-30 13:41:53 -0700 327)
 DEBUG_LOCKS_WARN_ON(ww_ctx->contending_lock != ww);
76916515d9d84 (Davidlohr Bueso 2014-07-30 13:41:53 -0700 328)
76916515d9d84 (Davidlohr Bueso 2014-07-30 13:41:53 -0700 329)           /*
76916515d9d84 (Davidlohr Bueso 2014-07-30 13:41:53 -0700 330)
  * You called ww_mutex_lock after receiving -EDEADLK,
76916515d9d84 (Davidlohr Bueso 2014-07-30 13:41:53 -0700 331)
  * but 'forgot' to unlock everything else first?
76916515d9d84 (Davidlohr Bueso 2014-07-30 13:41:53 -0700 332)            */

So I think it really not good if we 'forgot' to unlock everything else first.

So looking next...

$ /usr/src/kernels/`uname -r`/scripts/faddr2line
/lib/debug/lib/modules/`uname
-r`/kernel/drivers/gpu/drm/ttm/ttm.ko.debug ttm_mem_evict_first+0x212
ttm_mem_evict_first+0x212/0x4f0:
dma_resv_lock_interruptible at
/usr/src/debug/kernel-20201014gitb5fc7a89e58b/linux-5.10.0-0.rc0.20201014gitb5fc7a89e58b.42.fc34.x86_64/./include/linux/dma-resv.h:134
(inlined by) ttm_mem_evict_wait_busy at
/usr/src/debug/kernel-20201014gitb5fc7a89e58b/linux-5.10.0-0.rc0.20201014gitb5fc7a89e58b.42.fc34.x86_64/drivers/gpu/drm/ttm/ttm_bo.c:756
(inlined by) ttm_mem_evict_first at
/usr/src/debug/kernel-20201014gitb5fc7a89e58b/linux-5.10.0-0.rc0.20201014gitb5fc7a89e58b.42.fc34.x86_64/drivers/gpu/drm/ttm/ttm_bo.c:822

$ git blame -L 125,140 include/linux/dma-resv.h
Blaming lines:   5% (16/293), done.
5d276a1acaa80 include/linux/reservation.h (Christian König 2017-11-09
09:59:05 +0100 125)  *
5d276a1acaa80 include/linux/reservation.h (Christian König 2017-11-09
09:59:05 +0100 126)  * As the reservation object may be locked by
multiple parties in an
5d276a1acaa80 include/linux/reservation.h (Christian König 2017-11-09
09:59:05 +0100 127)  * undefined order, a #ww_acquire_ctx is passed to
unwind if a cycle
5d276a1acaa80 include/linux/reservation.h (Christian König 2017-11-09
09:59:05 +0100 128)  * is detected. See ww_mutex_lock() and
ww_acquire_init(). A reservation
5d276a1acaa80 include/linux/reservation.h (Christian König 2017-11-09
09:59:05 +0100 129)  * object may be locked by itself by passing NULL
as @ctx.
5d276a1acaa80 include/linux/reservation.h (Christian König 2017-11-09
09:59:05 +0100 130)  */
52791eeec1d9f include/linux/dma-resv.h    (Christian König 2019-08-11
10:06:32 +0200 131) static inline int
dma_resv_lock_interruptible(struct dma_resv *obj,
52791eeec1d9f include/linux/dma-resv.h    (Christian König 2019-08-11
10:06:32 +0200 132)                                             struct
ww_acquire_ctx *ctx)
5d276a1acaa80 include/linux/reservation.h (Christian König 2017-11-09
09:59:05 +0100 133) {
5d276a1acaa80 include/linux/reservation.h (Christian König 2017-11-09
09:59:05 +0100 134)       return
ww_mutex_lock_interruptible(&obj->lock, ctx);
5d276a1acaa80 include/linux/reservation.h (Christian König 2017-11-09
09:59:05 +0100 135) }
5d276a1acaa80 include/linux/reservation.h (Christian König 2017-11-09
09:59:05 +0100 136)
0dbd555a011c2 include/linux/reservation.h (Christian König 2019-07-31
09:41:50 +0200 137) /**
52791eeec1d9f include/linux/dma-resv.h    (Christian König 2019-08-11
10:06:32 +0200 138)  * dma_resv_lock_slow - slowpath lock the
reservation object
0dbd555a011c2 include/linux/reservation.h (Christian König 2019-07-31
09:41:50 +0200 139)  * @obj: the reservation object
0dbd555a011c2 include/linux/reservation.h (Christian König 2019-07-31
09:41:50 +0200 140)  * @ctx: the locking context


$ git blame -L 750,760 drivers/gpu/drm/ttm/ttm_bo.c
Blaming lines:   0% (11/1869), done.
d367bd2a5e2b1 (Christian König 2019-05-22 09:51:47 +0200 750)   int r;
d367bd2a5e2b1 (Christian König 2019-05-22 09:51:47 +0200 751)
d367bd2a5e2b1 (Christian König 2019-05-22 09:51:47 +0200 752)   if
(!busy_bo || !ticket)
d367bd2a5e2b1 (Christian König 2019-05-22 09:51:47 +0200 753)
 return -EBUSY;
d367bd2a5e2b1 (Christian König 2019-05-22 09:51:47 +0200 754)
d367bd2a5e2b1 (Christian König 2019-05-22 09:51:47 +0200 755)   if
(ctx->interruptible)
52791eeec1d9f (Christian König 2019-08-11 10:06:32 +0200 756)
 r = dma_resv_lock_interruptible(busy_bo->base.resv,
d367bd2a5e2b1 (Christian König 2019-05-22 09:51:47 +0200 757)
                                           ticket);
d367bd2a5e2b1 (Christian König 2019-05-22 09:51:47 +0200 758)   else
52791eeec1d9f (Christian König 2019-08-11 10:06:32 +0200 759)
 r = dma_resv_lock(busy_bo->base.resv, ticket);
d367bd2a5e2b1 (Christian König 2019-05-22 09:51:47 +0200 760)

Christian, you are the author of their lines of code.
Alex, Michel you were made a review and tested this code.
I invited you here, because you, like no one in this world, know what
is happening here, can you clarify the situation.

Full dmesg here: https://pastebin.com/ZK9F4SbR

--
Best Regards,
Mike Gavrilov.


More information about the amd-gfx mailing list