[BUG 6.3-rc1] Bad lock in ttm_bo_delayed_delete()

Wed Mar 15 15:09:49 UTC 2023

On Wed, 8 Mar 2023 07:17:38 +0100
Christian König <christian.koenig at amd.com> wrote:

> Am 08.03.23 um 03:26 schrieb Steven Rostedt:
> > On Tue, 7 Mar 2023 21:22:23 -0500
> > Steven Rostedt <rostedt at goodmis.org> wrote:
> >  
> >> Looks like there was a lock possibly used after free. But as commit
> >> 9bff18d13473a9fdf81d5158248472a9d8ecf2bd ("drm/ttm: use per BO cleanup
> >> workers") changed a lot of this code, I figured it may be the culprit.  
> > If I bothered to look at the second warning after this one (I usually stop
> > after the first), it appears to state there was a use after free issue.  
> 
> Yeah, that looks like the reference count was somehow messed up.
> 
> What test case/environment do you run to trigger this?
> 
> Thanks for the notice,

I'm still getting this on Linus's latest tree.

[  230.530222] ------------[ cut here ]------------
[  230.569795] DEBUG_LOCKS_WARN_ON(lock->magic != lock)
[  230.569957] WARNING: CPU: 0 PID: 212 at kernel/locking/mutex.c:582 __ww_mutex_lock.constprop.0+0x62a/0x1300
[  230.612599] Modules linked in:
[  230.632144] CPU: 0 PID: 212 Comm: kworker/0:8H Not tainted 6.3.0-rc2-test-00047-g6015b1aca1a2-dirty #992
[  230.654939] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
[  230.678866] Workqueue: ttm ttm_bo_delayed_delete
[  230.699452] EIP: __ww_mutex_lock.constprop.0+0x62a/0x1300
[  230.720582] Code: e8 3b 9a 95 ff 85 c0 0f 84 61 fa ff ff 8b 0d 58 bc 3a c4 85 c9 0f 85 53 fa ff ff 68 54 98 06 c4 68 b7 b6 04 c4 e8 46 af 40 ff <0f> 0b 58 5a e9 3b fa ff ff 8d 74 26 00 90 a1 ec 47 b0 c4 85 c0 75
[  230.768336] EAX: 00000028 EBX: 00000000 ECX: c51abdd8 EDX: 00000002
[  230.792001] ESI: 00000000 EDI: c53856bc EBP: c51abf00 ESP: c51abeac
[  230.815944] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00010246
[  230.840033] CR0: 80050033 CR2: ff9ff000 CR3: 04506000 CR4: 00150ef0
[  230.864059] Call Trace:
[  230.886369]  ? ttm_bo_delayed_delete+0x30/0x94
[  230.909902]  ww_mutex_lock+0x32/0x94
[  230.932550]  ttm_bo_delayed_delete+0x30/0x94
[  230.955798]  process_one_work+0x21a/0x484
[  230.979335]  worker_thread+0x14a/0x39c
[  231.002258]  kthread+0xea/0x10c
[  231.024769]  ? process_one_work+0x484/0x484
[  231.047870]  ? kthread_complete_and_exit+0x1c/0x1c
[  231.071498]  ret_from_fork+0x1c/0x28
[  231.094701] irq event stamp: 4023
[  231.117272] hardirqs last  enabled at (4023): [<c3d1df99>] _raw_spin_unlock_irqrestore+0x2d/0x58
[  231.143217] hardirqs last disabled at (4022): [<c31d5a55>] kvfree_call_rcu+0x155/0x2ec
[  231.166058] softirqs last  enabled at (3460): [<c3d1f403>] __do_softirq+0x2c3/0x3bb
[  231.183104] softirqs last disabled at (3455): [<c30c96a9>] call_on_stack+0x45/0x4c
[  231.200336] ---[ end trace 0000000000000000 ]---
[  231.216572] ------------[ cut here ]------------

This is preventing me from adding any of my own patches on v6.3-rcX due to
this bug failing my tests. Which means I can't add anything to linux-next
until this is fixed!

-- Steve
-------------- next part --------------
A non-text attachment was scrubbed...
Name: config-fail
Type: application/octet-stream
Size: 155555 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20230315/e9688e2c/attachment-0001.obj>