[BUG 6.3-rc1] Bad lock in ttm_bo_delayed_delete()

Christian König christian.koenig at amd.com
Wed Mar 15 18:25:10 UTC 2023


Am 15.03.23 um 18:54 schrieb Steven Rostedt:
> On Wed, 15 Mar 2023 11:57:12 -0400
> Steven Rostedt <rostedt at goodmis.org> wrote:
>
>> The WARN_ON triggered:
>>
>> [   21.481449] mpls_gso: MPLS GSO support
>> [   21.488795] IPI shorthand broadcast: enabled
>> [   21.488873] ------------[ cut here ]------------
>> [   21.490101] ------------[ cut here ]------------
>>
>> [   21.491693] WARNING: CPU: 1 PID: 38 at drivers/gpu/drm/ttm/ttm_bo.c:332 ttm_bo_release+0x2ac/0x2fc  <<<---- Line of the added WARN_ON()
>>
>> [   21.492940] refcount_t: underflow; use-after-free.
>> [   21.492965] WARNING: CPU: 0 PID: 84 at lib/refcount.c:28 refcount_warn_saturate+0xb6/0xfc
>> [   21.496116] Modules linked in:
>> [   21.497197] Modules linked in:

The problem here is that two backtraces mix together. So it's pretty 
hard to figure out what's going on.


>> [   21.500105] CPU: 1 PID: 38 Comm: kworker/1:1 Not tainted 6.3.0-rc2-test-00047-g6015b1aca1a2-dirty #993
>> [   21.500789] CPU: 0 PID: 84 Comm: kworker/0:1H Not tainted 6.3.0-rc2-test-00047-g6015b1aca1a2-dirty #993
>> [   21.501882] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
>> [   21.503533] sched_clock: Marking stable (20788024762, 714243692)->(22140778105, -638509651)
>> [   21.504080] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
>> [   21.504089] Workqueue: ttm ttm_bo_delayed_delete
>> [   21.507196] Workqueue: events drm_fb_helper_damage_work
>> [   21.509235]
>> [   21.510291] registered taskstats version 1
>> [   21.510302] Running ring buffer tests...
>> [   21.511792]
>> [   21.513870] EIP: refcount_warn_saturate+0xb6/0xfc
>> [   21.515261] EIP: ttm_bo_release+0x2ac/0x2fc
>> [   21.516566] Code: 68 00 27 0c d8 e8 36 3b aa ff 0f 0b 58 c9 c3 90 80 3d 41 c2 37 d8 00 75 8a c6 05 41 c2 37 d8 01 68 2c 27 0c d8 e8 16 3b aa ff <0f> 0b 59 c9 c3 80 3d 3f c2 37 d8 00 0f 85 67 ff ff ff c6 05 3f c2
>> [   21.516998] Code: ff 8d b4 26 00 00 00 00 66 90 0f 0b 8b 43 10 85 c0 0f 84 a1 fd ff ff 8d 76 00 0f 0b 8b 43 28 85 c0 0f 84 9c fd ff ff 8d 76 00 <0f> 0b e9 92 fd ff ff 8d b4 26 00 00 00 00 66 90 c7 43 18 00 00 00
>> [   21.517905] EAX: 00000026 EBX: c129d150 ECX: 00000040 EDX: 00000002
>> [   21.518987] EAX: d78c8550 EBX: c129d134 ECX: c129d134 EDX: 00000001
>> [   21.519337] ESI: c129d0bc EDI: f6f91200 EBP: c2b8bf18 ESP: c2b8bf14
>> [   21.520617] ESI: c129d000 EDI: c126a7a0 EBP: c1839c24 ESP: c1839bec
>> [   21.521546] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00010286
>> [   21.526154] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00010286
>> [   21.526162] CR0: 80050033 CR2: 00000000 CR3: 18506000 CR4: 00150ef0
>> [   21.526166] Call Trace:
>> [   21.526189]  ? ww_mutex_unlock+0x3a/0x94
>> [   21.530300] CR0: 80050033 CR2: ff9ff000 CR3: 18506000 CR4: 00150ef0
>> [   21.531722]  ? ttm_bo_cleanup_refs+0xc4/0x1e0
>> [   21.533114] Call Trace:
>> [   21.534516]  ttm_mem_evict_first+0x3d3/0x568
>> [   21.535901]  ttm_bo_delayed_delete+0x9c/0xa4
>> [   21.537391]  ? kfree+0x6b/0xdc
>> [   21.538901]  process_one_work+0x21a/0x484
>> [   21.540279]  ? ttm_range_man_alloc+0xe0/0xec
>> [   21.540854]  worker_thread+0x14a/0x39c
>> [   21.541714]  ? ttm_range_man_fini_nocheck+0xe8/0xe8
>> [   21.543332]  kthread+0xea/0x10c
> So I triggered it again, and the same backtrace is there.
>
>> [   21.544301]  ttm_bo_mem_space+0x1d0/0x1e4
> It looks like the object is being reserved before it's fully removed. And
> it's somewhere in this tty_bo_mem_space() (which comes from the
> qxl_bo_create()).
>
> I don't know this code at all, nor do I have any idea of what it's trying
> to do. All I know is that this is triggering often (not always), and it has
> to do with some race.
>
> Now my config has lots of debugging enabled, which slows down the system
> quite a bit. This also happens to open up race windows. Just because your
> testing doesn't trigger it, doesn't mean that the race doesn't exist. It's
> just likely to be very hard to hit.
>
>> [   21.544942]  ? process_one_work+0x484/0x484
>> [   21.545887]  ttm_bo_validate+0xc5/0x19c
>> [   21.546986]  ? kthread_complete_and_exit+0x1c/0x1c
>> [   21.547680]  ttm_bo_init_reserved+0x15e/0x1fc
>> [   21.548716]  ret_from_fork+0x1c/0x28
>> [   21.549650]  qxl_bo_create+0x145/0x20c
> Here's the latest backtrace:
>
> [  170.817449] ------------[ cut here ]------------
> [  170.817455] ------------[ cut here ]------------
> [  170.818210] refcount_t: underflow; use-after-free.
> [  170.818228] WARNING: CPU: 0 PID: 267 at lib/refcount.c:28 refcount_warn_saturate+0xb6/0xfc
> [  170.819352] WARNING: CPU: 3 PID: 2382 at drivers/gpu/drm/ttm/ttm_bo.c:332 ttm_bo_release+0x278/0x2c8
> [  170.820124] Modules linked in:
> [  170.822127] Modules linked in:
> [  170.823829]
> [  170.823832] CPU: 0 PID: 267 Comm: kworker/0:10H Not tainted 6.3.0-rc2-test-00047-g6015b1aca1a2-dirty #998
> [  170.824610]
> [  170.825121] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
> [  170.825124] Workqueue: ttm ttm_bo_delayed_delete
> [  170.825498] CPU: 3 PID: 2382 Comm: kworker/3:3 Not tainted 6.3.0-rc2-test-00047-g6015b1aca1a2-dirty #998
> [  170.826996]
> [  170.827367] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
> [  170.828839] EIP: refcount_warn_saturate+0xb6/0xfc
> [  170.829842] Workqueue: events drm_fb_helper_damage_work
> [  170.831325] Code: 68 50 ab ef d3 e8 76 ce b2 ff 0f 0b 58 c9 c3 90 80 3d 4e e0 18 d4 00 75 8a c6 05 4e e0 18 d4 01 68 7c ab ef d3 e8 56 ce b2 ff <0f> 0b 59 c9 c3 80 3d 4c e0 18 d4 00 0f 85 67 ff ff ff c6 05 4c e0
> [  170.831670]
> [  170.833138] EAX: 00000026 EBX: c2aa6ef4 ECX: 00000002 EDX: 80000002
> [  170.834080] EIP: ttm_bo_release+0x278/0x2c8
> [  170.834910] ESI: c27744e0 EDI: f758eaa8 EBP: c3083f34 ESP: c3083f30
> [  170.838453] Code: 00 90 89 f8 e9 91 fe ff ff 90 0f 0b 8b 43 10 85 c0 0f 84 d5 fd ff ff 8d 76 00 0f 0b 8b 43 28 85 c0 0f 84 d0 fd ff ff 8d 76 00 <0f> 0b e9 c6 fd ff ff 8d b4 26 00 00 00 00 66 90 c7 43 18 00 00 00
> [  170.838714] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00010286
> [  170.839872] EAX: d3835cf0 EBX: c2aa6ed8 ECX: c2aa6ed8 EDX: 00000001
> [  170.840546] CR0: 80050033 CR2: 005ce01c CR3: 03afa000 CR4: 00150ef0
> [  170.841669] ESI: c2aa6e00 EDI: c11333d8 EBP: c3b15ca0 ESP: c3b15c68
> [  170.844583] Call Trace:
> [  170.845738] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00010286
> [  170.846732]  ttm_bo_delayed_delete+0x7c/0x84
> [  170.847766] CR0: 80050033 CR2: 005b59b1 CR3: 03b12000 CR4: 00150ef0
> [  170.848761]  process_one_work+0x18a/0x350
> [  170.849177] Call Trace:
> [  170.850250]  worker_thread+0x136/0x378
> [  170.850942]  ? ttm_resource_free+0x57/0x6c
> [  170.851928]  ? _raw_spin_unlock_irqrestore+0x25/0x40
> [  170.852583]  ? ttm_bo_cleanup_memtype_use+0x5f/0x64
> [  170.853002]  kthread+0xda/0xfc
> [  170.853608]  ? ttm_bo_cleanup_refs+0xbb/0x1c8
> [  170.854266]  ? rescuer_thread+0x340/0x340
> [  170.855058]  ttm_mem_evict_first+0x324/0x3a0
> [  170.855840]  ? kthread_complete_and_exit+0x1c/0x1c
> [  170.856341]  ttm_bo_mem_space+0x1d8/0x1ec
> [  170.857056]  ret_from_fork+0x1c/0x28
> [  170.857702]  ttm_bo_validate+0xb3/0x13c
> [  170.858393] ---[ end trace 0000000000000000 ]---
> [  170.859167]  ttm_bo_init_reserved+0x12e/0x1cc
> [  170.863261] systemd[1]: Starting Create System Users...
> [  170.863964]  qxl_bo_create+0x14a/0x1f8
> [  170.892037]  ? qxl_ttm_debugfs_init+0x48/0x48
> [  170.892043]  qxl_alloc_bo_reserved+0x39/0x98
>           Starting ESC[0;1;39mCreat[  170.892047]  ? kmalloc_trace+0x22/0x84
> e System UsersESC[0m...
> [  170.895826]  qxl_image_alloc_objects+0x91/0xfc
> [  170.897122]  qxl_draw_dirty_fb+0x159/0x440
> [  170.898207] systemd[1]: Finished Set the console keyboard layout.
> [  170.898329]  qxl_framebuffer_surface_dirty+0xfb/0x1cc
> [  170.901528]  ? qxl_create_plane+0xc4/0xc4
> [  170.901533]  drm_fbdev_fb_dirty+0x129/0x1ac
> [ESC[0;32m  OK  ESC[0m] Finished ESC[0[  170.901537]  drm_fb_helper_damage_work+0x8f/0x158
> ;1;39mSet the console keyboard layoutESC[0m.
> [  170.905435]  process_one_work+0x18a/0x350
> [  170.906578]  worker_thread+0x136/0x378
> [  170.906832] systemd[1]: Finished Load/Save Random Seed.
> [  170.907195]  ? _raw_spin_unlock_irqrestore+0x25/0x40
> [  170.910267]  kthread+0xda/0xfc
> [  170.910271]  ? rescuer_thread+0x340/0x340
> [  170.910275]  ? kthread_complete_and_exit+0x1c/0x1c
> [  170.910278]  ret_from_fork+0x1c/0x28
> [  170.910282] ---[ end trace 0000000000000000 ]---

That looks like an eviction is racing with finishing this BO. Let me 
take a closer look.

It would help if you replace the WARN_ON(bo->delayed_delete.func) with 
an "if (WARN_ON(bo->delayed_delete.func)) return;" and disable the 
refcount warning.

This way we should not get other warnings except for the one we are 
looking for.

Regards,
Christian.


>
> -- Steve



More information about the amd-gfx mailing list