[Intel-gfx] [BUG?] INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 0-.... } 3 jiffies s: 309 root: 0x1/.

Thu Apr 13 14:43:58 UTC 2023

On Thu, Apr 13, 2023 at 08:30:02AM +0100, Rui Salvaterra wrote:
> Hi again, everyone.
> 
> So, while preparing to file the bug report with the requested
> information, I got a trace completely unrelated to DRM (on a swapon
> call, it seems).
> 
> [    4.868340] rcu: INFO: rcu_sched detected expedited stalls on
> CPUs/tasks: { 4-.... } 3 jiffies s: 265 root: 0x10/.
> [    4.868349] rcu: blocking rcu_node structures (internal RCU debug):
> [    4.868351] Sending NMI from CPU 3 to CPUs 4:
> [    4.868355] NMI backtrace for cpu 4
> [    4.868357] CPU: 4 PID: 462 Comm: swapon Not tainted 6.3.0-rc6-debug+ #57
> [    4.868359] Hardware name: Apple Inc.
> Macmini6,2/Mac-F65AE981FFA204ED, BIOS 429.0.0.0.0 03/18/2022
> [    4.868360] RIP: 0010:zram_submit_bio+0x57c/0x940
> [    4.868365] Code: 04 4c 01 f0 48 8d 48 08 f0 48 0f ba 68 08 0d 0f
> 82 80 00 00 00 4c 89 ef e8 01 eb ff ff 49 8b 45 00 4a 8d 44 30 09 f0
> 80 20 df <f0> 48 ff 45 00 48 81 eb 00 10 00 00 41 83 c4 01 48 81 fb ff
> 0f 00
> [    4.868366] RSP: 0018:ffff8881057dbcd8 EFLAGS: 00000246
> [    4.868368] RAX: ffffc90001c186d9 RBX: 000000003e893000 RCX: ffffc90001c186d8
> [    4.868369] RDX: ffffc90001c186d0 RSI: 0000000000000000 RDI: ffff88810083b400
> [    4.868369] RBP: ffff88810083b470 R08: 0000000000027e40 R09: 0000000000025850
> [    4.868370] R10: 000000000014b212 R11: ffff88810ba03180 R12: 00000000000c176d
> [    4.868371] R13: ffff88810083b400 R14: 0000000000c176d0 R15: 0000000000000000
> [    4.868372] FS:  00007fbd8f8ce800(0000) GS:ffff888266100000(0000)
> knlGS:0000000000000000
> [    4.868373] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    4.868374] CR2: 0000563005371000 CR3: 000000010355c003 CR4: 00000000001706e0
> [    4.868375] Call Trace:
> [    4.868377]  <TASK>
> [    4.868378]  ? block_read_full_folio+0x23e/0x2e0
> [    4.868383]  ? kmem_cache_alloc+0x1b/0x110
> [    4.868385]  ? mempool_alloc+0x37/0x140
> [    4.868388]  ? pcpu_block_update_hint_alloc+0xce/0x2f0
> [    4.868390]  __submit_bio+0x41/0xd0
> [    4.868394]  submit_bio_noacct_nocheck+0xc4/0x2b0
> [    4.868396]  blk_next_bio+0x55/0x70
> [    4.868398]  __blkdev_issue_discard+0xc8/0x180
> [    4.868401]  blkdev_issue_discard+0x3c/0x80
> [    4.868403]  __x64_sys_swapon+0xb71/0x1120
> [    4.868407]  do_syscall_64+0x2b/0x50
> [    4.868410]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
> [    4.868414] RIP: 0033:0x7fbd8f712d5b
> [    4.868416] Code: 73 01 c3 48 8b 0d bd 30 0e 00 f7 d8 64 89 01 48
> 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 a7 00 00
> 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 8d 30 0e 00 f7 d8 64 89
> 01 48
> [    4.868417] RSP: 002b:00007ffcaf9a3448 EFLAGS: 00000246 ORIG_RAX:
> 00000000000000a7
> [    4.868418] RAX: ffffffffffffffda RBX: 0000000000018064 RCX: 00007fbd8f712d5b
> [    4.868419] RDX: 0000000000018064 RSI: 0000000000018064 RDI: 000056300535fb10
> [    4.868420] RBP: 00007ffcaf9a3530 R08: 000000014b213000 R09: 00007fbd8f7f70f0
> [    4.868420] R10: 0000000000001000 R11: 0000000000000246 R12: 000056300535fb10
> [    4.868421] R13: 0000000000000064 R14: 00007ffcaf9a3530 R15: 0000000000000000
> [    4.868423]  </TASK>
> 
> Could it be that RCU is reporting expedited stalls too eagerly? And,
> if so, why only on this machine?

My guess would be that you have CONFIG_RCU_EXP_CPU_STALL_TIMEOUT set to
some small non-zero number, for example, you might have set up a recent
Android .config or some such.  The default of zero would give you about
21 seconds rather than the three jiffies that you are seeing.

Could you please check your .config?

							Thanx, Paul