[Intel-gfx] [BUG 6.3-rc1] Bad lock in ttm_bo_delayed_delete()

Matthew Auld matthew.william.auld at gmail.com
Wed Mar 15 19:15:21 UTC 2023


On Wed, 15 Mar 2023 at 18:41, Christian König
<ckoenig.leichtzumerken at gmail.com> wrote:
>
> Am 08.03.23 um 13:43 schrieb Steven Rostedt:
> > On Wed, 8 Mar 2023 07:17:38 +0100
> > Christian König <christian.koenig at amd.com> wrote:
> >
> >> What test case/environment do you run to trigger this?
> > I'm running a 32bit x86 qemu instance. Attached is the config.
> >
> > The libvirt xml file is here: https://rostedt.org/vm-images/tracetest-32.xml
> > and the VM image itself is here: https://rostedt.org/vm-images/tracetest-32.qcow2.bz2
>
> I've started to download that, but it will take about an hour. So I
> tried to avoid that for now.
>
> But looks like there isn't any other way to reproduce this, the code
> seems to work with both amdgpu and radeon.
>
> My suspicion is that we just have a reference count issue in qxl or ttm
> which was never noticed because it didn't caused any problems (except
> for a minor memory corruption).

Why does ttm_bo_cleanup_refs() do a bo_put() at the end? It doesn't
make sense to me. Say if the BO is in the process of being delay freed
(bo->deleted = true), and we just did the kref_init() in
ttm_bo_release(), it might drop that ref hitting ttm_bo_release() yet
again, this time doing the actual bo->destroy(), which frees the
object. The worker then fires at some later point calling
ttm_bo_delayed_delete(), but the BO has already been freed.

>
> Now you get a rain of warnings because we try to grab the lock in the
> delete worker.
>
> Christian.
>
> >
> > It happened again in another test (it's not 100% reproducible).
> >
> > [   23.234838] ------------[ cut here ]------------
> > [   23.236391] DEBUG_LOCKS_WARN_ON(lock->magic != lock)
> > [   23.236429] WARNING: CPU: 0 PID: 61 at kernel/locking/mutex.c:582 __ww_mutex_lock.constprop.0+0x566/0xfec
> > [   23.240990] Modules linked in:
> > [   23.242368] CPU: 0 PID: 61 Comm: kworker/0:1H Not tainted 6.3.0-rc1-test-00001-ga98bd42762ed-dirty #972
> > [   23.245106] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
> > [   23.247900] Workqueue: ttm ttm_bo_delayed_delete
> > [   23.249642] EIP: __ww_mutex_lock.constprop.0+0x566/0xfec
> > [   23.251563] Code: e8 2b 5a 95 ff 85 c0 0f 84 25 fb ff ff 8b 0d 18 71 3b c8 85 c9 0f 85 17 fb ff ff 68 c0 58 07 c8 68 07 77 05 c8 e8 e6 0a 40 ff <0f> 0b 58 5a e9 ff fa ff ff e8 f8 59 95 ff 85 c0 74 0e 8b 0d 18 71
> > [   23.256901] EAX: 00000028 EBX: 00000000 ECX: c1847dd8 EDX: 00000002
> > [   23.258849] ESI: 00000000 EDI: c12958bc EBP: c1847f00 ESP: c1847eac
> > [   23.260786] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00010286
> > [   23.262840] CR0: 80050033 CR2: ffbff000 CR3: 0850e000 CR4: 00150ef0
> > [   23.264781] Call Trace:
> > [   23.265899]  ? lock_is_held_type+0xbe/0x10c
> > [   23.267434]  ? ttm_bo_delayed_delete+0x30/0x94
> > [   23.268971]  ww_mutex_lock+0x32/0x94
> > [   23.270327]  ttm_bo_delayed_delete+0x30/0x94
> > [   23.271818]  process_one_work+0x21a/0x538
> > [   23.273242]  worker_thread+0x146/0x398
> > [   23.274616]  kthread+0xea/0x10c
> > [   23.275859]  ? process_one_work+0x538/0x538
> > [   23.277312]  ? kthread_complete_and_exit+0x1c/0x1c
> > [   23.278899]  ret_from_fork+0x1c/0x28
> > [   23.280223] irq event stamp: 33
> > [   23.281440] hardirqs last  enabled at (33): [<c7d28089>] _raw_spin_unlock_irqrestore+0x2d/0x58
> > [   23.283860] hardirqs last disabled at (32): [<c71d4ba5>] kvfree_call_rcu+0x155/0x2ec
> > [   23.286066] softirqs last  enabled at (0): [<c71217fd>] copy_process+0x989/0x2368
> > [   23.288220] softirqs last disabled at (0): [<00000000>] 0x0
> > [   23.289952] ---[ end trace 0000000000000000 ]---
> > [   23.291501] ------------[ cut here ]------------
> > [   23.293027] refcount_t: underflow; use-after-free.
> > [   23.294644] WARNING: CPU: 0 PID: 61 at lib/refcount.c:28 refcount_warn_saturate+0xb6/0xfc
> > [   23.296959] Modules linked in:
> > [   23.298168] CPU: 0 PID: 61 Comm: kworker/0:1H Tainted: G        W          6.3.0-rc1-test-00001-ga98bd42762ed-dirty #972
> > [   23.301073] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
> > [   23.303642] Workqueue: ttm ttm_bo_delayed_delete
> > [   23.305190] EIP: refcount_warn_saturate+0xb6/0xfc
> > [   23.306767] Code: 68 70 e1 0c c8 e8 f6 d6 a9 ff 0f 0b 58 c9 c3 90 80 3d 8a 78 38 c8 00 75 8a c6 05 8a 78 38 c8 01 68 9c e1 0c c8 e8 d6 d6 a9 ff <0f> 0b 59 c9 c3 80 3d 88 78 38 c8 00 0f 85 67 ff ff ff c6 05 88 78
> > [   23.311935] EAX: 00000026 EBX: c1295950 ECX: c1847e40 EDX: 00000002
> > [   23.313884] ESI: c12958bc EDI: f7591100 EBP: c1847f18 ESP: c1847f14
> > [   23.315840] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00010246
> > [   23.317887] CR0: 80050033 CR2: ffbff000 CR3: 0850e000 CR4: 00150ef0
> > [   23.319859] Call Trace:
> > [   23.320978]  ttm_bo_delayed_delete+0x8c/0x94
> > [   23.322492]  process_one_work+0x21a/0x538
> > [   23.323959]  worker_thread+0x146/0x398
> > [   23.325353]  kthread+0xea/0x10c
> > [   23.326609]  ? process_one_work+0x538/0x538
> > [   23.328081]  ? kthread_complete_and_exit+0x1c/0x1c
> > [   23.329683]  ret_from_fork+0x1c/0x28
> > [   23.331011] irq event stamp: 33
> > [   23.332251] hardirqs last  enabled at (33): [<c7d28089>] _raw_spin_unlock_irqrestore+0x2d/0x58
> > [   23.334334] hardirqs last disabled at (32): [<c71d4ba5>] kvfree_call_rcu+0x155/0x2ec
> > [   23.336176] softirqs last  enabled at (0): [<c71217fd>] copy_process+0x989/0x2368
> > [   23.337904] softirqs last disabled at (0): [<00000000>] 0x0
> > [   23.339313] ---[ end trace 0000000000000000 ]---
> >
> > -- Steve
>


More information about the amd-gfx mailing list